How to Deploy an LLM Free of Cost

A step-by-step guide to deploying a large language model in production at zero cost.

Gathnex
5 min read · Dec 8, 2023

In the previous article, we built a production-ready RAG system at zero cost, using the Hugging Face Spaces free tier for deployment.

We are going to use the same Hugging Face Spaces free tier to deploy our LLM. The free CPU Basic hardware gives you 2 vCPUs, 16 GB of RAM, and 50 GB of storage!

Wait! What? Can we deploy an LLM on these resources? Is that even possible?

The answer is yes! It is possible, with a catch: we run the LLM on the CPU with the help of CTransformers. The CTransformers library is a Python package that provides bindings for Transformer models implemented in C/C++ on top of the GGML library.
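As a minimal sketch of what CTransformers usage looks like (assuming the package is installed with pip install ctransformers; the repository and file names below are the same ones we use later in this article), loading a GGUF model from the Hugging Face Hub and generating text on the CPU is only a few lines:

from ctransformers import AutoModelForCausalLM

# Downloads the GGUF file from the Hub on first use and runs it entirely on the CPU.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-beta-GGUF",
    model_file="zephyr-7b-beta.Q4_K_S.gguf",
    model_type="mistral",
)

print(llm("AI is going to"))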

GGML

GGML is a C library for machine learning that makes it possible to run large language models (LLMs) on ordinary CPUs. It defines a binary file format for distributing these models, and to make them fit on common hardware it relies on quantization, available at different levels such as 4-bit, 5-bit, and 8-bit. Each level strikes a different balance between memory footprint and output quality.
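As a rough back-of-the-envelope illustration (weights only, ignoring overheads such as the KV cache), this is why 4-bit quantization lets a 7B-parameter model fit comfortably in 16 GB of RAM:

# Approximate weight-memory footprint of a ~7B-parameter model at different bit widths.
params = 7_000_000_000
for bits in (16, 8, 5, 4):
    gib = params * bits / 8 / 1024**3
    print(f"{bits}-bit: ~{gib:.1f} GiB")
# Roughly 13 GiB at 16-bit versus about 3.3 GiB at 4-bit.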

Drawbacks

The main drawback here is latency. Since the model runs on CPUs, we won’t receive responses as quickly as from a model deployed on a GPU, but the latency is not excessive. On average, it takes around 1 minute to generate 140–150 tokens on a Hugging Face Space. It actually performed quite well on a local system with a 16-core CPU, returning responses in less than 15 seconds.

Model

For demo purposes, we are going to deploy the SOTA Zephyr-7B-Beta model. You can find Zephyr in various GGUF quantized sizes on Hugging Face.

Don’t get confused: GGUF is an updated version of GGML, offering more flexibility, extensibility, and compatibility. It aims to simplify the user experience and accommodate a wider range of models. GGML, while a valuable early effort, had limitations that GGUF seeks to overcome.

Zephyr GGUF : https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/tree/main

We have selected zephyr-7b-beta.Q4_K_S.gguf, a small 4-bit quantized variant that is fast with minimal loss of quality.

Note : Response quality, latency, and accuracy will vary with the size of the model you select.

Coding Time

Let’s explore deployment in detail. We’ve deployed the Zephyr model in two environments:

  1. One designed for creating APIs.
  2. Another serving as a playground, enabling you to interact with the model.

1. Deploying the LLM as an API

Deployment structure

LLM_Deployment_at_zerocost
├── Dockerfile
├── main.py
├── requirements.txt
└── zephyr-7b-beta.Q4_K_S.gguf

Refer to our Hugging Face space files structure to get an idea : https://huggingface.co/spaces/gathnex/LLM-deployment-zerocost-api/tree/main

requirements.txt

python-multipart
fastapi
pydantic
uvicorn
requests
python-dotenv
ctransformers

zephyr-7b-beta.Q4_K_S.gguf

Download the zephyr-7b-beta.Q4_K_S.gguf model file from Hugging Face.
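One way to fetch the file programmatically (a sketch assuming a recent version of the huggingface_hub package is installed; you can also download it manually from the link above):

from huggingface_hub import hf_hub_download

# Download the quantized weights into the current directory.
hf_hub_download(
    repo_id="TheBloke/zephyr-7B-beta-GGUF",
    filename="zephyr-7b-beta.Q4_K_S.gguf",
    local_dir=".",
)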

main.py

The main.py file contains a FastAPI function that returns a Zephyr-7B completion.

from ctransformers import AutoModelForCausalLM
from fastapi import FastAPI
from pydantic import BaseModel

# Load the quantized Zephyr GGUF model on the CPU via CTransformers.
llm = AutoModelForCausalLM.from_pretrained(
    "zephyr-7b-beta.Q4_K_S.gguf",
    model_type="mistral",
    max_new_tokens=1096,
    threads=3,
)

# Pydantic model for request validation
class validation(BaseModel):
    prompt: str

# FastAPI app
app = FastAPI()

@app.post("/llm_on_cpu")
async def stream(item: validation):
    system_prompt = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    E_INST = "</s>"
    user, assistant = "<|user|>", "<|assistant|>"
    # Build the Zephyr-style chat prompt (system instruction, user turn, assistant turn)
    # and return the raw completion.
    prompt = f"{system_prompt}{E_INST}\n{user}\n{item.prompt}{E_INST}\n{assistant}\n"
    return llm(prompt)
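If you want to try the API on your own machine before containerizing it, one option (a small, optional sketch; uvicorn is already in requirements.txt) is to append an entry point to the bottom of main.py and run it with python main.py:

# Optional: run the API locally on port 7860 before building the Docker image.
import uvicorn

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=7860)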

Dockerfile

Finally, the Dockerfile is used to containerize our application for deployment.

FROM python:3.9

WORKDIR /code

COPY ./requirements.txt /code/requirements.txt

RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt

COPY ./zephyr-7b-beta.Q4_K_S.gguf /code/zephyr-7b-beta.Q4_K_S.gguf
COPY ./main.py /code/main.py

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860"]

Now, our files are ready for deployment.

Deployment

Hugging Face Space

Step 1: Create a new Hugging Face space.

Step 2: Enter the space name and choose the license type.

Step 3: Under Space SDK, select Docker and choose the Blank template.

Step 4: Choose ‘Public’ initially so we can copy the deployment endpoint later, then hit ‘Create Space’.

Step 5: Upload the files we have created above.

LLM_Deployment_at_zerocost
├── Dockerfile
├── main.py
├── requirements.txt
└── zephyr-7b-beta.Q4_K_S.gguf

Congratulations! You have deployed your LLM.

If there are no errors in the code, you’ll see that it’s running, and you can check the logs if an error occurs.

Step 6: Once the Space is created, click on “Embed this space” and copy the space link (this serves as the base URL of your API in production).

Copy the direct URL.

Step 7: Now set the Space visibility to private ⚠️ in Settings (note: your data and credentials would otherwise be exposed in a public Space).

Everything is now complete, and we have deployed our LLM successfully. Let’s proceed to test our API.

This is our Space link : https://gathnex-llm-deployment-zerocost-api.hf.space

Add /docs to the URL to open the Swagger UI and test the API.

FastAPI Swagger UI : https://gathnex-llm-deployment-zerocost-api.hf.space/docs

Endpoint : https://gathnex-llm-deployment-zerocost-api.hf.space/llm_on_cpu
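For example, here is a quick way to call the deployed endpoint from Python with the requests library (the JSON body matches the validation model in main.py; if you made your Space private, you may also need to send your Hugging Face token as a Bearer header):

import requests

url = "https://gathnex-llm-deployment-zerocost-api.hf.space/llm_on_cpu"
payload = {"prompt": "Write a short poem about open-source AI."}

# headers = {"Authorization": f"Bearer {HF_TOKEN}"}  # only needed if the Space is private
response = requests.post(url, json=payload)
print(response.json())  # the raw completion string returned by the model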

We also deployed Zephyr in a Gradio Space as a chat playground.

2. Chat UI Playground on HF CPU : https://huggingface.co/spaces/gathnex/LLM_deployment_zerocost_playground

Conclusion

We have successfully deployed an LLM to production with an API. You can use it for small automation tasks, text generation, agent tasks, and more. It’s especially useful for anyone who wants to deploy their own customized model without the high GPU costs that put such deployments out of reach for many people.

Opportunity is everywhere; stop thinking and start doing with what you have.

Follow us on Medium for more updates.

LinkedIn : https://www.linkedin.com/company/gathnex/

If you have any questions, feel free to reach out to us on LinkedIn.
