Fine-Tuning Llama-2 LLM on Google Colab: A Step-by-Step Guide.

Gathnex
12 min read · Sep 18, 2023


Llama generated by Gen AI

Llama, Llama, Llama: 🦙 one of the most talked-about models in recent times. 🗣️ Llama 2: 🌟 It’s like the rockstar of language models, developed by the brilliant minds over at Meta. But what makes it so special? Well, it’s not just one model; it’s a family of models, ranging from 7 billion to 70 billion parameters. Think of it as the Swiss Army knife of AI language models. While it’s built on the trusted transformer architecture originally introduced by Google researchers, Meta added their own secret sauce to make it even more powerful. Whether you’re looking to chat with an AI assistant or need a language model for a specific task, Llama 2 has you covered.

Now, let’s talk about Llama 2’s journey to greatness. It started with some serious training — we’re talking a massive dataset of text and code, including books, articles, code repositories, and more. But what truly sets it apart is the fine-tuning process, where it learned from over 1 million human annotations. This is where it honed its skills, becoming incredibly accurate and fluent. And guess what? Llama 2 doesn’t just shine in the lab; it outperforms other open-source language models in various real-world tests. The best part is that you can use it for research and commercial applications, making it a versatile tool with boundless potential. So, buckle up, because Llama 2 is on a mission to redefine the AI landscape.

Let’s understand the LLM training process.

There are mainly two steps:

Pre-training: It’s like teaching a language model the ABCs of language by exposing it to a massive amount of text from the 🌐 internet. Think of it as giving the model a broad understanding of grammar 📝, vocabulary, and common patterns in language. During this phase, the model learns to predict what comes next in a sentence 🤖, helping it grasp the structure of language 🧠. It’s like teaching a student the alphabet before moving on to reading books 📖.

Fine-tuning: Fine-tuning, on the other hand, is where the magic happens. After the model has a general understanding of language from pre-training, fine-tuning narrows its focus. It’s like taking that well-rounded student and giving them specific lessons for a particular task. For example, you might fine-tune the model to be an expert in answering questions or generating code. It’s like guiding that student to excel in a specific subject in school. Fine-tuning adapts the general language knowledge gained during pre-training to perform specific tasks accurately and effectively.

Even after fine-tuning, the model still has problems. These include occasional generation of incorrect or nonsensical information, sensitivity to input phrasing, susceptibility to bias present in the fine-tuning data, and difficulty handling nuanced context in complex conversations. Additionally, models can struggle with generating coherent long-form content, which can affect their suitability for certain applications like content generation and chatbots. These limitations highlight the need for ongoing research and development efforts to refine fine-tuned models and address these issues for more reliable and ethical AI applications.

Responsible AI is our goal 🎯, not just a fine-tuned model.

Reinforcement Learning from Human Feedback: RLHF is like giving your language model a tutor 🎓. After pre-training and fine-tuning, RLHF steps in to provide additional training. It’s a bit like having a teacher review and grade the model’s responses to improve them further. Human feedback, in the form of evaluations and corrections ✅, helps the model learn from its mistakes and refine its language skills. Just as students benefit from feedback to excel in their studies, RLHF helps language models become even better at specific tasks by learning from human guidance.

So it looks like RLHF needs a lot of hard work. That’s why a new player is entering the game 🎮.

DPO (Direct Preference Optimization) is a new technique 🤝 designed to address the limitations of Reinforcement Learning from Human Feedback (RLHF) in fine-tuning large language models (LLMs). Unlike RLHF, which relies on learning a complex reward function, DPO simplifies the process by treating it as a classification problem over human preference data.
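To make the idea concrete, here is a minimal sketch of the DPO objective in PyTorch (our own illustration, not code used later in this guide). Given the log-probabilities of a preferred and a rejected response under the policy and under a frozen reference model, the loss simply pushes the policy to rank the preferred response higher:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Each argument is a tensor of summed log-probabilities of a response
    # under the policy model or the frozen reference model.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Treat preference learning as classification: the chosen response
    # should score higher than the rejected one, scaled by beta.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()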

Limitations of Google Colab

The Biggest Misconception: Fine-Tuning LLM on Google Colab 🧐.

Let’s debunk a common myth 🌟: Yes, you can fine-tune a Large Language Model (LLM) on the free version of Google Colab, but with a catch! 🙅‍♂️

Here’s the scoop: Google Colab offers a free environment, but there are time limits ⏳. You get a generous 12-hour window to run your code continuously, after which the session is automatically disconnected. But here’s the twist: if there’s no activity, it disconnects after just 15–30 minutes of inactivity ⏱️. Colab also has a GPU limitation; you can only use GPUs for around 12 hours a day.

Fine-tuning a large LLM on Google Colab’s free version? Not the easiest feat! 🤯 Due to these constraints, you’ll likely be limited to fine-tuning smaller LLMs on smaller datasets; even around 2 epochs ⚙️ over 10k samples can be difficult. So, while it’s possible, it can be quite challenging to fine-tune a substantial LLM using Google Colab’s free tier. 🚀

Step-by-Step Guide to Fine-Tuning Llama 2

We are going to use 🦙 Llama-2-7B-hf, the smallest pre-trained model in the Llama-2 family, and fine-tune it with the QLoRA technique.

QLoRA (Quantized Low-Rank Adaptation) is an extension of LoRA (Low-Rank Adapters) that uses quantization to improve parameter efficiency during fine-tuning. QLoRA is more memory efficient than LoRA because it loads the pretrained model to GPU memory as 4-bit weights, compared to 8-bit in LoRA. This reduces memory demands and speeds up calculations.
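A quick back-of-the-envelope check on why this matters: a 7-billion-parameter model stored in 16-bit precision needs about 14 GB just for the weights, while 4-bit quantization brings that down to roughly 3.5 GB plus some overhead, which is what makes a single Colab-class GPU workable.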

In simple terms, we’re not going to train the entire model. 🚂 Instead, we’ll add an adapter in between the model and train only that adapter. 🧩 This way, we can fine-tune the LLM on the consumer GPU, 🎮 and it’s also a faster training process. ⏩

System setup

The system setup we used to fine-tune the model included a Tesla V100 32GB GPU running on an Ubuntu VM. If you want to set up a similar VM for training LLMs, feel free to reach out to us by email: gathnexorg@gmail.com 📧.

Let’s get into the code.

Install required packages

!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q trl xformers wandb datasets einops gradio sentencepiece bitsandbytes
  • transformers: This library provides APIs for downloading pre-trained models.
  • bitsandbytes: It’s a library for quantizing a large language model to reduce the memory footprint of the model, especially on GPUs.
  • peft: This is used to add a LoRA adapter to the LLM.
  • trl: This library contains an SFT (Supervised Fine-Tuning) class to fine-tune a model.
  • accelerate and xformers: These libraries are used to increase the inference speed of the model.
  • wandb: It’s used for monitoring the training process.
  • datasets: This library is used to load datasets from Hugging Face.
  • gradio: It’s used for designing simple user interfaces.

Import libraries

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser, TrainingArguments, pipeline, logging, TextStreamer
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
import os,torch, wandb, platform, gradio, warnings
from datasets import load_dataset
from trl import SFTTrainer
from huggingface_hub import notebook_login

Check system spec

def print_system_specs():
    # Check if CUDA is available
    is_cuda_available = torch.cuda.is_available()
    print("CUDA Available:", is_cuda_available)
    # Get the number of available CUDA devices
    num_cuda_devices = torch.cuda.device_count()
    print("Number of CUDA devices:", num_cuda_devices)
    if is_cuda_available:
        for i in range(num_cuda_devices):
            # Get CUDA device properties
            device = torch.device('cuda', i)
            print(f"--- CUDA Device {i} ---")
            print("Name:", torch.cuda.get_device_name(i))
            print("Compute Capability:", torch.cuda.get_device_capability(i))
            print("Total Memory:", torch.cuda.get_device_properties(i).total_memory, "bytes")
    # Get CPU information
    print("--- CPU Information ---")
    print("Processor:", platform.processor())
    print("System:", platform.system(), platform.release())
    print("Python Version:", platform.python_version())

print_system_specs()

Output:

CUDA Available: True
Number of CUDA devices: 1
--- CUDA Device 0 ---
Name: Tesla T4
Compute Capability: (7, 5)
Total Memory: 15835398144 bytes
--- CPU Information ---
Processor: x86_64
System: Linux 5.15.109+
Python Version: 3.10.12

Setting the model variable

# Pre trained model
model_name = "meta-llama/Llama-2-7b-hf"

# Dataset name
dataset_name = "vicgalle/alpaca-gpt4"

# Hugging Face repository to save the fine-tuned model (create a new repository on Hugging Face and paste its name here)
new_model = "Repository link here"

Log into the Hugging Face Hub

notebook_login()

Note: You need to enter your Hugging Face access token here. Before that, you must request access to the Llama-2 model on Meta’s website.

Load dataset

We are utilizing the pre-processed dataset vicgalle/alpaca-gpt4 from Hugging Face.

# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train[0:10000]")
dataset["text"][0]

Loading the model and tokenizer

We are going to load the Llama-2-7B-hf pre-trained model with 4-bit quantization, and the compute data type will be float16.

# Load base model (llama-2-7b-hf) and tokenizer
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map={"": 0}
)
model = prepare_model_for_kbit_training(model)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.add_bos_token, tokenizer.add_eos_token
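Two small but important details here: the Llama tokenizer ships without a padding token, so we reuse the end-of-sequence token for padding, and add_eos_token = True appends that end-of-sequence token to every training example so the model gets a chance to learn when to stop generating. The last line simply displays the two flags in the notebook to confirm the settings.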

Monitoring

Apart from training, monitoring is a crucial part we need to consider in LLM training🚧.

To get started, create a WandB account and log in 🔗. After creating your account, paste the authorization token 🔑 into the code below.

# monitoring login
wandb.login(key="Enter the Authorization code here")
run = wandb.init(project='Fine tuning llama-2-7B', job_type="training", anonymous="allow")

LoRA config

peft_config = LoraConfig(
    lora_alpha=8,
    lora_dropout=0.1,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj"]
)
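The SFTTrainer below applies this peft_config for us, so nothing else is required. If you are curious how few weights the adapter actually adds, you can wrap the model yourself as a quick sanity check (optional; if you run this, re-run the model-loading cell before training so the trainer receives the plain base model):

# Optional check: attach the LoRA adapter and count trainable parameters
peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()
# Expect tens of millions of trainable parameters against ~7 billion frozen weights.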

Training arguments

training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    optim="paged_adamw_8bit",
    save_steps=1000,
    logging_steps=30,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.3,
    group_by_length=True,
    lr_scheduler_type="linear",
    report_to="wandb",
)
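A quick sanity check on these numbers: with per_device_train_batch_size=8 and gradient_accumulation_steps=2, the effective batch size is 16, so one epoch over our 10,000-sample slice is roughly 625 optimizer steps, and logging_steps=30 gives about 20 loss points on WandB.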

SFTTrainer arguments

# Setting SFT parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=None,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)

We’re all set to begin the training process.

# Train model
trainer.train()

You can now monitor various training metrics, including loss, GPU usage, RAM usage, and more, directly on the WandB website. The link is printed when you run the code above.

The UI looks like this:

Now, at this crucial phase, it’s imperative to closely monitor the training loss. If the loss starts to exhibit unusual behaviour or anomalies🚨, it’s a signal to consider stopping the training. Overfitting is a common concern in such cases, and it may be necessary to fine-tune hyperparameters and retry to achieve the best results📉.

Good training loss

This image depicts our training loss, showing a favourable trend. 📈 While there are occasional spikes, we applied exponential moving average smoothing to mitigate them. The total number of epochs used for this run was 5. The primary goal is a gradual reduction of the loss curve over time 🎯.
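For reference, the smoothing we mean is the standard exponential moving average that WandB also exposes as a slider in its chart settings. A minimal sketch of the idea (our own illustration, not part of the training code):

def ema_smooth(values, beta=0.9):
    # Each smoothed point is a weighted blend of the previous smoothed
    # value and the new raw loss, which damps out isolated spikes.
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(beta * smoothed[-1] + (1 - beta) * v)
    return smoothed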

Bad Training loss

This training graph indicates a problem: the training loss oscillates between 2 and 3 instead of decreasing, a sign the model isn’t really learning. This could be due to issues such as inadequately pre-processed data or suboptimal hyperparameter settings 🛠️.

In many articles 📝 we come across training loss curves that look unusual or clearly problematic, yet they are published anyway 🌐. This highlights the importance of exercising caution when reusing code from the internet.

What happens after training?

After training completes, we need to save the model for testing. Note that this only saves the trained adapter, not the full model.

# Save the fine-tuned model
trainer.model.save_pretrained(new_model)
wandb.finish()
model.config.use_cache = True
model.eval()

Let’s test the model

def stream(user_prompt):
    runtimeFlag = "cuda:0"
    system_prompt = 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n'
    B_INST, E_INST = "### Instruction:\n", "### Response:\n"

    prompt = f"{system_prompt}{B_INST}{user_prompt.strip()}\n\n{E_INST}"

    inputs = tokenizer([prompt], return_tensors="pt").to(runtimeFlag)

    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Despite returning the usual output, the streamer will also print the generated text to stdout.
    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=500)

stream("what is newtons 2nd law and its formula")

Output:

Newton's second law is an equation that describes the relationship between an object's mass, acceleration, and the force acting on it. The equation is F = ma, where F is the force, m is the mass, and a is the acceleration. This means that the force acting on an object is equal to its mass times its acceleration. In other words, the more massive an object is, the more force is required to accelerate it, and the more force is applied to an object, the more it will accelerate. This law is fundamental to understanding how objects move and interact with one another. Its use is essential in many areas of physics, such as Newtonian mechanics, celestial mechanics, and fluid mechanics. Newton's second law is also known as the law of acceleration or the law of motion. It is named after Sir Isaac Newton, the English physicist and mathematician who first formulated the law in the 17th century. Newton's second law is a cornerstone of classical mechanics and is still used today in modern physics and engineering. The law has been proven to be accurate over a wide range of conditions and has been used to explain and predict many phenomena. However, it is important to note that Newton's second law only applies to objects that are not subject to any external forces, such as gravity or friction. When external forces are present, the force acting on an object will be equal to the net force, which is the sum of the external forces acting on the object. This is known as Newton's third law, and it states that for every action, there is an equal and opposite reaction. Together, these laws form the basis of classical mechanics and have been used to explain and predict the motion of objects for centuries. They remain a fundamental part of our understanding of the physical world.

### Formula:
F = ma

Where:
F = Force
m = Mass
a = Acceleration

This equation states that the force acting on an object is equal to its mass times its acceleration. The units of force and mass are newtons (N) and kilograms (kg), respectively. The units of acceleration are meters per second squared (m/s²) or newtons per kilogram (N/kg). The formula can be used to calculate the acceleration of an object if you know its mass and the

The results we’ve obtained 📊 reflect the model’s performance during testing. However, one of the primary challenges we’re facing with this model pertains to the stopping criteria used during its training. We’re committed to addressing this issue through ongoing research and development efforts, with the aim of identifying and implementing more optimal parameters ⚙️. While the model has demonstrated promise, it’s important to acknowledge that there is room for improvement in its performance 📈.

Upload the model to a Hugging Face repository

Step 1: Once you have finished training your model, use the code below to free GPU memory. This is important because it helps prevent your machine from running out of memory and can also improve the performance of other programs you are running.

# Clear the memory footprint
del model, trainer
torch.cuda.empty_cache()

Step 2: Merging the adapter with the base model.

base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0}
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Step 3: Pushing the merged model to the Hugging Face Hub.

model.push_to_hub(new_model)
tokenizer.push_to_hub(new_model)
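Once the push completes, the merged model can be loaded straight back from the Hub for inference, roughly like this (a sketch, assuming new_model holds your repository id and that your account has access to it):

# Sketch: load the merged, fine-tuned model back from the Hub
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

merged_model = AutoModelForCausalLM.from_pretrained(
    new_model, torch_dtype=torch.float16, device_map={"": 0}
)
merged_tokenizer = AutoTokenizer.from_pretrained(new_model)
pipe = pipeline("text-generation", model=merged_model, tokenizer=merged_tokenizer)
print(pipe("### Instruction:\nWhat is QLoRA?\n\n### Response:\n",
           max_new_tokens=100)[0]["generated_text"])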

Conclusion

In conclusion, our assessment indicates that the model’s performance is promising but falls short of being outstanding. Recognizing this, our team remains dedicated to continuous Research and Development (R&D) efforts to craft a superior model. 🌟 We are committed to providing more effective solutions for Large Language Models (LLMs) that cater to the needs of AI enthusiasts and practitioners.

It’s essential to highlight that fine-tuning a model on platforms like Google Colab comes with its set of challenges🤯. The time🕒 limitations and resource constraints can make this task a formidable one. However, our team is actively exploring ways to navigate these difficulties, aiming to make fine-tuning on such platforms more accessible and efficient for everyone.

In essence, our journey in the world of LLMs continues, fueled by the desire to deliver superior models and streamline the fine-tuning process. 💡 Stay tuned for more exciting updates! 📢

Thanks & Regards to Gathnex team🎉.

Additionally, we’d like to clarify that we’ve utilized certain images from the internet to enhance our explanations for the audience’s better understanding. We want to extend our credits and appreciation🙏 to the original owners of these images🖼️.

Resources: Google Colab and Hugging Face repo

Reference: Hugging Face


Gathnex

🤖 Exploring Generative AI & LLM. Join the Gathnex community for cutting-edge discussions and updates! LinkedIn : https://www.linkedin.com/company/gathnex/ 🌟