Fine-Tuning Llama-2 LLM on Google Colab: A Step-by-Step Guide.
Llama, Llama, Llama: one of the most talked-about models in recent times. Llama 2 is like the rockstar of language models, developed by the brilliant minds over at Meta. But what makes it so special? Well, it's not just one model; it's a family of models, ranging from 7 billion to 70 billion parameters. Think of it as the Swiss Army knife of AI language models. While it's built on the well-established transformer architecture (originally introduced by researchers at Google), Meta added their own secret sauce to make it even more powerful. Whether you're looking to chat with an AI assistant or need a language model for a specific task, Llama 2 has you covered.
Now, let's talk about Llama 2's journey to greatness. It started with some serious training: a massive dataset of text and code, including books, articles, code repositories, and more. But what truly sets it apart is the fine-tuning process, where it learned from over 1 million human annotations. This is where it honed its skills, becoming incredibly accurate and fluent. And guess what? Llama 2 doesn't just shine in the lab; it outperforms other open-source language models in various real-world tests. The best part is that you can use it for research and commercial applications, making it a versatile tool with boundless potential. So, buckle up, because Llama 2 is on a mission to redefine the AI landscape.
Let's understand the LLM training process.
There are two main steps:
Pre-training: It's like teaching a language model the ABCs of language by exposing it to a massive amount of text from the internet. Think of it as giving the model a broad understanding of grammar, vocabulary, and common patterns in language. During this phase, the model learns to predict what comes next in a sentence, helping it grasp the structure of language. It's like teaching a student the ABCs before moving on to reading books.
Fine-tuning: Fine-tuning, on the other hand, is where the magic happens. After the model has a general understanding of language from pre-training, fine-tuning narrows its focus. It's like taking that well-rounded student and giving them specific lessons for a particular task. For example, you might fine-tune the model to be an expert in answering questions or generating code. It's like guiding that student to excel in a specific subject in school. Fine-tuning adapts the general language knowledge gained during pre-training to perform specific tasks accurately and effectively.
Even after fine-tuning, the model still has problems. These include occasional generation of incorrect or nonsensical information, sensitivity to input phrasing, susceptibility to bias present in the fine-tuning data, and difficulty handling nuanced context in complex conversations. Additionally, models can struggle with generating coherent long-form content, which can affect their suitability for certain applications like content generation and chatbots. These limitations highlight the need for ongoing research and development efforts to refine fine-tuned models and address these issues for more reliable and ethical AI applications.
Responsible AI is our goal, not just a fine-tuned model.
Reinforcement Learning from Human Feedback: RLHF is like giving your language model a tutor. After pre-training and fine-tuning, RLHF steps in to provide additional training. It's a bit like having a teacher review and grade the model's responses to improve them further. Human feedback, in the form of evaluations and corrections, helps the model learn from its mistakes and refine its language skills. Just as students benefit from feedback to excel in their studies, RLHF helps language models become even better at specific tasks by learning from human guidance.
So it looks like RLHF takes a lot of hard work, which is why a new player has entered the game.
DPO (Direct Preference Optimization) is a new technique designed to address the limitations of Reinforcement Learning from Human Feedback (RLHF) in fine-tuning large language models (LLMs). Unlike RLHF, which relies on learning a separate reward function, DPO simplifies the process by treating it as a classification problem over human preference data.
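For intuition, here is a minimal sketch of the DPO objective in PyTorch (illustrative only, not the exact implementation used by libraries such as TRL). It takes the log-probabilities of the preferred and rejected answers under the model being trained and under a frozen reference model, and optimizes a simple classification-style loss over each preference pair:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # How much more the policy prefers each answer than the frozen reference model does
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Classification-style objective: push the preferred answer above the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities for a single preference pair
loss = dpo_loss(torch.tensor([-1.0]), torch.tensor([-2.0]),
                torch.tensor([-1.5]), torch.tensor([-1.5]))

No reward model and no reinforcement-learning loop are needed, which is what makes DPO so much simpler to run than RLHF.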
Limitations of Google Colab
The Biggest Misconception: Fine-Tuning an LLM on Google Colab
Let's debunk a common myth: yes, you can fine-tune a language model (LLM) on the free version of Google Colab, but with a catch!
Here's the scoop: Google Colab offers a free environment, but there are time limits. You get a generous 12-hour window to run your code continuously, after which the runtime is automatically disconnected. But here's the twist: if there's no activity, it disconnects after just 15 to 30 minutes of inactivity. Colab also has a GPU limitation; you can only use GPUs for around 12 hours per day.
Fine-tuning a large LLM on Google Colab's free version? Not the easiest feat! Due to these constraints, you might find yourself limited to fine-tuning smaller LLMs on smaller datasets; even around 2 epochs on 10k samples can be difficult to squeeze in. So, while it's possible, it can be quite challenging to fine-tune a substantial LLM using Google Colab's free tier.
Step-by-Step Guide to Fine-Tuning Llama 2
We are going to use Llama-2-7B-HF, the smallest pre-trained model in the Llama-2 family, and fine-tune it with the QLoRA technique.
QLoRA (Quantized Low-Rank Adaptation) is an extension of LoRA (Low-Rank Adapters) that uses quantization to improve parameter efficiency during fine-tuning. QLoRA is more memory efficient than LoRA because it loads the pretrained model into GPU memory as 4-bit weights, compared to the 8-bit or 16-bit weights typically used with plain LoRA. This reduces memory demands and speeds up calculations.
In simple terms, we're not going to train the entire model. Instead, we'll add small adapter layers to the model and train only those adapters. This way, we can fine-tune the LLM on a consumer GPU, and training is also faster.
System setup
The system setup we used to fine-tune a model included a Tesla V100 32GB GPU, and it ran on an Ubuntu VM. If you want to set up a similar VM for training LLMs, feel free to reach out to us by email: gathnexorg@gmail.com.
Let's get into the code.
Install required packages
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q trl xformers wandb datasets einops gradio sentencepiece bitsandbytes
- transformers: This library provides APIs for downloading pre-trained models.
- bitsandbytes: It's a library for quantizing a large language model to reduce the memory footprint of the model, especially on GPUs.
- peft: This is used to add a LoRA adapter to the LLM.
- trl: This library contains an SFT (Supervised Fine-Tuning) class to fine-tune a model.
- accelerate and xformers: These libraries are used to speed up training and inference, handling device placement and providing memory-efficient attention.
- wandb: It's used for monitoring the training process.
- datasets: This library is used to load datasets from Hugging Face.
- gradio: It's used for designing simple user interfaces.
Import libraries
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser, TrainingArguments, pipeline, logging, TextStreamer
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
import os,torch, wandb, platform, gradio, warnings
from datasets import load_dataset
from trl import SFTTrainer
from huggingface_hub import notebook_login
Check system spec
def print_system_specs():
    # Check if CUDA is available
    is_cuda_available = torch.cuda.is_available()
    print("CUDA Available:", is_cuda_available)
    # Get the number of available CUDA devices
    num_cuda_devices = torch.cuda.device_count()
    print("Number of CUDA devices:", num_cuda_devices)
    if is_cuda_available:
        for i in range(num_cuda_devices):
            # Get CUDA device properties
            device = torch.device('cuda', i)
            print(f"--- CUDA Device {i} ---")
            print("Name:", torch.cuda.get_device_name(i))
            print("Compute Capability:", torch.cuda.get_device_capability(i))
            print("Total Memory:", torch.cuda.get_device_properties(i).total_memory, "bytes")
    # Get CPU information
    print("--- CPU Information ---")
    print("Processor:", platform.processor())
    print("System:", platform.system(), platform.release())
    print("Python Version:", platform.python_version())
print_system_specs()
Output:
CUDA Available: True
Number of CUDA devices: 1
--- CUDA Device 0 ---
Name: Tesla T4
Compute Capability: (7, 5)
Total Memory: 15835398144 bytes
--- CPU Information ---
Processor: x86_64
System: Linux 5.15.109+
Python Version: 3.10.12
Setting the model variable
# Pre-trained model
model_name = "meta-llama/Llama-2-7b-hf"
# Dataset name
dataset_name = "vicgalle/alpaca-gpt4"
# Hugging Face repository to save the fine-tuned model (create a new repository on Hugging Face and paste its name here)
new_model = "Repository link here"
Log in to the Hugging Face Hub
notebook_login()
Note: You need to enter your Hugging Face access token here. Before that, you also need to request access to the Llama-2 model on Meta's website and accept the license on the model's Hugging Face page.
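If you are running this as a plain script rather than a notebook, notebook_login() will not show its prompt; a minimal alternative (a sketch, with a placeholder token string you would replace with your own) is:

from huggingface_hub import login

# Paste your personal access token from https://huggingface.co/settings/tokens
login(token="hf_your_token_here")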
Load dataset
We are utilizing the pre-processed dataset vicgalle/alpaca-gpt4 from Hugging Face.
# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train[0:10000]")
dataset["text"][0]
Loading the model and tokenizer
We are going to load the Llama-2-7B-HF pre-trained model with 4-bit quantization, and the compute data type will be float16.
# Load base model(llama-2-7b-hf) and tokenizer
bnb_config = BitsAndBytesConfig(
load_in_4bit= True,
bnb_4bit_quant_type= "nf4",
bnb_4bit_compute_dtype= torch.float16,
bnb_4bit_use_double_quant= False,
)
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map={"": 0}
)
model = prepare_model_for_kbit_training(model)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
model.config.pretraining_tp = 1
# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.add_bos_token, tokenizer.add_eos_token  # display the current BOS/EOS settings (notebook cell output)
Monitoring
Apart from the training itself, monitoring is a crucial part of LLM training.
To get started, create a WandB (Weights & Biases) account and log in. After creating your account, paste your authorization token into the code below.
# Monitoring login
wandb.login(key="Enter the Authorization code here")
run = wandb.init(project='Fine tuning llama-2-7B', job_type="training", anonymous="allow")
LoRA config
peft_config = LoraConfig(
lora_alpha= 8,
lora_dropout= 0.1,
r= 16,
bias="none",
task_type="CAUSAL_LM",
target_modules=["q_proj", "k_proj", "v_proj", "o_proj","gate_proj", "up_proj"]
)
Training arguments
training_arguments = TrainingArguments(
output_dir= "./results",
num_train_epochs= 1,
per_device_train_batch_size= 8,
gradient_accumulation_steps= 2,
optim = "paged_adamw_8bit",
save_steps= 1000,
logging_steps= 30,
learning_rate= 2e-4,
weight_decay= 0.001,
fp16= False,
bf16= False,
max_grad_norm= 0.3,
max_steps= -1,
warmup_ratio= 0.3,
group_by_length= True,
lr_scheduler_type= "linear",
report_to="wandb",
)
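As a quick back-of-the-envelope check of what these numbers imply (assuming the 10,000-sample split loaded above and no packing), the effective batch size and rough number of optimizer steps per epoch work out like this:

# Rough estimate of optimizer steps per epoch for this configuration
samples = 10_000                      # size of the train split loaded above
per_device_batch_size = 8
gradient_accumulation_steps = 2
effective_batch_size = per_device_batch_size * gradient_accumulation_steps   # 16 sequences per optimizer step
steps_per_epoch = samples // effective_batch_size                            # roughly 625 steps for one epoch
print(effective_batch_size, steps_per_epoch)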
SFTTrainer arguments
# Setting sft parameters
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
peft_config=peft_config,
max_seq_length= None,
dataset_text_field="text",
tokenizer=tokenizer,
args=training_arguments,
packing= False,
)
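Optionally, you can confirm how small the trainable part of the model really is compared to the full 7B base model. This is a quick sanity check and assumes that, because we passed peft_config, the trainer has wrapped the model with the LoRA adapters:

# Optional: print how many parameters the LoRA adapters add vs. the frozen base model
trainer.model.print_trainable_parameters()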
We're all set to begin the training process.
# Train model
trainer.train()
You can now monitor various training metrics, including loss, GPU usage, RAM usage, and more, directly on the WandB website. The link is printed when you run the code above.
The WandB UI looks like this:
Now, at this crucial phase, it's imperative to closely monitor the training loss. If the loss starts to exhibit unusual behaviour or anomalies, it's a signal to consider stopping the training. Overfitting is a common concern in such cases, and it may be necessary to adjust hyperparameters and retry to achieve the best results.
Good training loss
This image depicts our training loss, showcasing favourable trends. While there may be occasional spikes, we applied an exponential moving average to smooth them out. The total number of epochs used for this run was 5. The primary goal is a gradual reduction of the loss curve over time.
Bad Training loss
This training graph indicates a problematic run rather than healthy learning: the training loss oscillates between 2 and 3 instead of trending downward. This could be due to issues such as inadequately pre-processed data or suboptimal hyperparameter settings.
In many articles, we often come across training loss curves that appear unusual and resemble overfitting, yet they are still included in the blog posts. This highlights the importance of exercising caution when utilizing code from the internet.
What comes after training?
So, after training completes, we need to save the model for testing. Note that this only saves the trained adapter, not the full base model.
# Save the fine-tuned model
trainer.model.save_pretrained(new_model)
wandb.finish()
model.config.use_cache = True
model.eval()
Let's test the model
def stream(user_prompt):
    runtimeFlag = "cuda:0"
    system_prompt = 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n'
    B_INST, E_INST = "### Instruction:\n", "### Response:\n"
    prompt = f"{system_prompt}{B_INST}{user_prompt.strip()}\n\n{E_INST}"
    inputs = tokenizer([prompt], return_tensors="pt").to(runtimeFlag)
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # Despite returning the usual output, the streamer will also print the generated text to stdout.
    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=500)
stream("what is newtons 2rd law and its formula")
Output:
Newton's second law is an equation that describes the relationship between an object's mass, acceleration, and the force acting on it. The equation is F = ma, where F is the force, m is the mass, and a is the acceleration. This means that the force acting on an object is equal to its mass times its acceleration. In other words, the more massive an object is, the more force is required to accelerate it, and the more force is applied to an object, the more it will accelerate. This law is fundamental to understanding how objects move and interact with one another. Its use is essential in many areas of physics, such as Newtonian mechanics, celestial mechanics, and fluid mechanics. Newton's second law is also known as the law of acceleration or the law of motion. It is named after Sir Isaac Newton, the English physicist and mathematician who first formulated the law in the 17th century. Newton's second law is a cornerstone of classical mechanics and is still used today in modern physics and engineering. The law has been proven to be accurate over a wide range of conditions and has been used to explain and predict many phenomena. However, it is important to note that Newton's second law only applies to objects that are not subject to any external forces, such as gravity or friction. When external forces are present, the force acting on an object will be equal to the net force, which is the sum of the external forces acting on the object. This is known as Newton's third law, and it states that for every action, there is an equal and opposite reaction. Together, these laws form the basis of classical mechanics and have been used to explain and predict the motion of objects for centuries. They remain a fundamental part of our understanding of the physical world.
### Formula:
F = ma
Where:
F = Force
m = Mass
a = Acceleration
This equation states that the force acting on an object is equal to its mass times its acceleration. The units of force and mass are newtons (N) and kilograms (kg), respectively. The units of acceleration are meters per second squared (m/sĀ²) or newtons per kilogram (N/kg). The formula can be used to calculate the acceleration of an object if you know its mass and the
The results we've obtained reflect the model's performance during testing. However, one of the primary challenges we're facing with this model pertains to the stopping criteria used during its training. We're committed to addressing this issue through ongoing research and development efforts, with the aim of identifying and implementing more optimal parameters. While the model has demonstrated promise, it's important to acknowledge that there is room for improvement in its performance.
Upload the model to a Hugging Face repository
Step 1: Once you have finished training your model, use the code below to free GPU memory. This is important because it helps prevent the machine from running out of memory, and it can also improve the performance of other programs you are running.
# Clear the memory footprint
del model, trainer
torch.cuda.empty_cache()
Step 2: Merge the adapter with the base model.
base_model = AutoModelForCausalLM.from_pretrained(
model_name, low_cpu_mem_usage=True,
return_dict=True,torch_dtype=torch.float16,
device_map= {"": 0})
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()
# Reload tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
Step 3: Push the merged model to the Hugging Face Hub.
model.push_to_hub(new_model)
tokenizer.push_to_hub(new_model)
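Once the push finishes, the merged model lives in your repository and can be loaded back like any other Hugging Face model. Here is a minimal usage sketch (assuming new_model holds your repository ID and that you are authenticated if the repository is private):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load the merged, fine-tuned model straight from the Hub
tok = AutoTokenizer.from_pretrained(new_model)
mdl = AutoModelForCausalLM.from_pretrained(new_model, torch_dtype=torch.float16, device_map="auto")
generator = pipeline("text-generation", model=mdl, tokenizer=tok)
print(generator("### Instruction:\nWhat is gravity?\n\n### Response:\n", max_new_tokens=100)[0]["generated_text"])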
Conclusion
In conclusion, our assessment indicates that the model's performance is promising but falls short of being outstanding. Recognizing this, our team remains dedicated to continuous Research and Development (R&D) efforts to craft a superior model. We are committed to providing more effective solutions for Language Models (LLMs) that cater to the needs of AI enthusiasts and practitioners.
It's essential to highlight that fine-tuning a model on platforms like Google Colab comes with its own set of challenges. The time limitations and resource constraints can make this task a formidable one. However, our team is actively exploring ways to navigate these difficulties, aiming to make fine-tuning on such platforms more accessible and efficient for everyone.
In essence, our journey in the world of LLMs continues, fueled by the desire to deliver superior models and streamline the fine-tuning process. Stay tuned for more exciting updates!
Thanks & Regards to Gathnex teamš.
Additionally, we'd like to clarify that we've utilized certain images from the internet to enhance our explanations for the audience's better understanding. We want to extend our credits and appreciation to the original owners of these images.
Resources: Google Colab and Hugging Face repo
Reference: Hugging Face