Mistral-7B Fine-Tuning: A Step-by-Step Guide

5 min read · Oct 4


Introducing Mistral 7B: The Powerhouse of Language Models

The Mistral AI team has unveiled the latest addition to the world of generative AI: the Mistral 7B model. At 7.3 billion parameters it is small next to many of the models it competes with, yet it comes with a host of remarkable features and capabilities.

One of the most impressive aspects of Mistral 7B is its ability to outperform larger, well-known language models. According to Mistral AI's published evaluations, it surpasses Llama 2 13B on all reported benchmarks and competes head-to-head with Llama 1 34B on many of them. It also approaches CodeLlama 7B on code-related tasks while remaining strong on English language tasks. That combination of code and language performance is what makes the model so versatile.

Efficiency of Mistral 7B

Mistral 7B isn't just powerful; it's also efficient. It uses grouped-query attention (GQA), in which several query heads share one set of key/value heads, for faster inference, making it a good fit for real-time applications. It also uses sliding window attention (SWA) to handle longer sequences at a lower computational cost: each token attends only to a fixed-size window of preceding tokens rather than to the whole sequence, so the attention cost grows linearly with sequence length instead of quadratically.
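
To make the sliding-window idea concrete, here is a minimal sketch in plain Python. It is not Mistral's implementation; the function names are ours, and the toy window of 3 stands in for the much larger window a real model uses. It only shows which (query, key) pairs a causal sliding window keeps.

```python
def sliding_window_causal_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Return mask[i][j] == True when query position i may attend to key position j."""
    if seq_len < 0 or window < 1:
        raise ValueError("seq_len must be >= 0 and window must be >= 1")
    return [
        # causal (j <= i) and local (at most `window` positions, including i itself)
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Count the (query, key) pairs that attention actually has to compute."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    full = sliding_window_causal_mask(8, window=8)   # window covers everything: plain causal attention
    local = sliding_window_causal_mask(8, window=3)  # each token sees itself and two predecessors
    for row in local:
        print("".join("x" if allowed else "." for allowed in row))
    print("full causal pairs:", attended_pairs(full))
    print("sliding window pairs:", attended_pairs(local))
```

With a fixed window, every additional token adds at most `window` new pairs, which is where the linear growth comes from.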

Open Source and Accessible

The Mistral AI team believes in the spirit of collaboration and knowledge sharing. That’s why they’ve released Mistral 7B under the Apache 2.0 license, allowing developers and researchers to use it without restrictions. You can easily download and implement Mistral 7B using their reference implementation, deploy it on popular cloud platforms like AWS, GCP, or Azure, or access it through Hugging Face.

Fine-Tuning Made Easy

Mistral 7B is designed to be easy to fine-tune for a variety of tasks. As a demonstration, the Mistral AI team has provided a model fine-tuned for chat, Mistral 7B Instruct, which they report outperforms Llama 2 13B on chat-related tasks. This flexibility makes Mistral 7B a strong choice for a wide range of natural language processing tasks.

Unparalleled Performance Across Benchmarks

Detailed evaluations show that Mistral 7B consistently outperforms Llama 2 13B on a wide range of benchmarks. It excels in tasks related to common sense reasoning, world knowledge, reading comprehension, mathematics, and code-related challenges. Its performance in these domains makes it a go-to choice for AI applications that demand a deep understanding of language and context.

Efficiency Meets Performance

Mistral 7B’s unique sliding window attention mechanism not only enhances performance but also ensures efficient use of resources. It can perform as well as a Llama 2 model three times its size in reasoning, comprehension, and STEM reasoning tasks. This translates to significant memory savings and improved throughput.
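
As a rough illustration of what "three times its size" means for memory, the sketch below estimates the space needed just to hold the weights in 16-bit precision. The 7.3 billion figure is the one quoted in this article; the "three times larger" model is hypothetical, and the estimate ignores activations, the KV cache, and optimizer state.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to store the weights alone, in GiB (2 bytes per parameter = fp16/bf16)."""
    return num_params * bytes_per_param / 2**30


mistral_params = 7.3e9              # parameter count quoted in this article
larger_params = 3 * mistral_params  # hypothetical model three times the size

print(f"Mistral 7B weights: {weight_memory_gib(mistral_params):.1f} GiB")
print(f"3x larger weights : {weight_memory_gib(larger_params):.1f} GiB")
```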

Let's move on to building the model.

We need the latest versions of transformers, peft, and accelerate, so we install them directly from their Git repositories. At the time of writing, installing these packages from PyPI results in an error.

!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q trl xformers wandb datasets einops sentencepiece
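
After installing, it can help to confirm which versions actually ended up in the environment. The snippet below is a small optional check using only the standard library; the package list and the helper name are our own choices, not part of the original notebook.

```python
from importlib import metadata

PACKAGES = ["transformers", "peft", "accelerate", "bitsandbytes", "trl"]


def installed_versions(packages):
    """Map each package name to its installed version, or None when it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name:15s} {version or 'not installed'}")
```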

We created the Gath_baize dataset, comprising approximately 210k prompts, to train Mistral-7B. It is a mixture of data from the Alpaca, Stack Overflow, medical, and Quora datasets.

import torch, wandb
from datasets import load_dataset
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, TextStreamer
from trl import SFTTrainer
#Use a sharded model to fine-tune in the free version of Google Colab.
base_model = "mistralai/Mistral-7B-v0.1" #bn22/Mistral-7B-Instruct-v0.1-sharded
dataset_name, new_model = "gathnex/Gath_baize", "gathnex/Gath_mistral_7b"
# Loading a Gath_baize dataset
dataset = load_dataset(dataset_name, split="train")
# Load base model(Mistral 7B)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=False,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,  # load the weights in 4-bit NF4
    device_map={"": 0},  # place the whole model on GPU 0
)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
model.config.pretraining_tp = 1
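# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token
tokenizer.add_eos_token = True  # append EOS to every tokenized sample
print(tokenizer.add_bos_token, tokenizer.add_eos_token)  # sanity check
# Log in to Weights & Biases for monitoring (replace the placeholder with your own API key)
wandb.login(key="Wandb authorization key")
run = wandb.init(project='Fine tuning mistral 7B', job_type="training", anonymous="allow")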
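# Prepare the quantized model for training and attach the LoRA adapter
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    # r, lora_alpha and lora_dropout are left at the PEFT defaults here; tune them for your task
    task_type="CAUSAL_LM",  # assumed setting for a decoder-only model
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj"],
)
model = get_peft_model(model, peft_config)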
# Training Arguments
# Hyperparameters should be adjusted based on the hardware you are using
training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    optim="paged_adamw_8bit",
    save_steps=5000,
    logging_steps=30,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.3,  # ignored by the "constant" scheduler; "constant_with_warmup" would use it
    group_by_length=True,
    lr_scheduler_type="constant",
)
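# Setting sft parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    tokenizer=tokenizer,
    args=training_arguments,
    dataset_text_field="chat_sample",  # assumption: name of the text column; check dataset.column_names
    max_seq_length=None,
    packing=False,
)
trainer.train()
# Save the fine-tuned model
trainer.model.save_pretrained(new_model)
model.config.use_cache = True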
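
To get a feel for how little of the model LoRA actually trains, the sketch below counts the adapter parameters for the five projections targeted above. The layer dimensions are taken from the published Mistral-7B-v0.1 configuration rather than from this article, and the rank of 8 is an assumption (PEFT's default), so treat the result as an order-of-magnitude estimate.

```python
# Published Mistral-7B-v0.1 dimensions (not stated in this article).
HIDDEN = 4096          # model width
KV_DIM = 8 * 128       # 8 key/value heads x 128 dims each (grouped-query attention)
INTERMEDIATE = 14336   # MLP width
NUM_LAYERS = 32

# (input features, output features) of each projection targeted by the adapter.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
}


def lora_trainable_params(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) to every targeted projection."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


rank = 8  # assumption: PEFT's default rank
total = lora_trainable_params(rank)
print(f"trainable LoRA parameters at r={rank}: {total:,}")
print(f"share of a 7.3B-parameter model: {total / 7.3e9:.2%}")
```

Doubling the rank doubles the adapter size, so the rank is the main knob for trading adapter capacity against memory.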

Let’s test the model

def stream(user_prompt):
    runtimeFlag = "cuda:0"
    # Prompt text is kept verbatim; it should match the format the model was trained on
    system_prompt = 'The conversation between Human and AI assisatance named Gathnex\n'
    B_INST, E_INST = "[INST]", "[/INST]"

    prompt = f"{system_prompt}{B_INST}{user_prompt.strip()}\n{E_INST}"

    inputs = tokenizer([prompt], return_tensors="pt").to(runtimeFlag)

    # Print tokens as they are generated, without echoing the prompt
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=200)

stream("Explain large language models")
# Clear the memory footprint
del model, trainer
torch.cuda.empty_cache()

# Reload the base model
base_model_reload = AutoModelForCausalLM.from_pretrained(
    base_model,
    low_cpu_mem_usage=True,
    device_map={"": 0},
)
# Attach the saved LoRA adapter and fold its weights into the base model
model = PeftModel.from_pretrained(base_model_reload, new_model)
model = model.merge_and_unload()

# Reload tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Push the model to the Hub
!huggingface-cli login
model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)
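
For readers wondering what merging the adapter means numerically, here is a toy sketch in plain Python. It is a simplified view of the idea behind merging LoRA weights, not the PEFT source code: the low-rank update B·A, scaled by alpha / rank, is added to the frozen weight once, so inference no longer needs a separate adapter. All matrices and values are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]


# Toy 2x2 layer with a rank-1 adapter.
W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight (out x in)
B = [[0.5], [-1.0]]            # LoRA B (out x rank)
A = [[2.0, 0.0]]               # LoRA A (rank x in)
alpha, rank = 2.0, 1           # LoRA scaling factor is alpha / rank
x = [[1.0], [1.0]]             # one input column vector

delta = scale(matmul(B, A), alpha / rank)

# During training: base output plus the adapter's low-rank correction.
y_adapter = add(matmul(W, x), matmul(delta, x))

# After merging: one dense weight, no adapter needed at inference time.
W_merged = add(W, delta)
y_merged = matmul(W_merged, x)

print("adapter output:", y_adapter)
print("merged output :", y_merged)
```

Both paths give the same output, which is why the merged model can be pushed to the Hub and loaded like any ordinary checkpoint.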

Resources: Google Colab and Hugging Face

The output from the fine-tuned Mistral model was good. We trained it on a dataset of around 210,000 prompt-response pairs, a significant expansion over the Alpaca dataset we used previously. Training ran on a Tesla V100 32GB GPU and took approximately 45 hours. In our upcoming article, we'll cover designing a user interface (UI) for the model and the steps involved in deploying it for production use.
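
The figures above can be turned into a rough throughput estimate. The batch settings come from the training arguments in this article; the dataset size and training time are the approximate values quoted here, and a single GPU is an assumption on our part, so the results are ballpark numbers only.

```python
import math

num_samples = 210_000   # approximate dataset size quoted in the article
per_device_batch = 8    # per_device_train_batch_size
grad_accum = 2          # gradient_accumulation_steps
num_gpus = 1            # assumption: a single Tesla V100
train_hours = 45        # approximate wall-clock time quoted in the article

# Samples consumed per optimizer update.
effective_batch = per_device_batch * grad_accum * num_gpus
# Optimizer updates needed for one pass over the data.
steps_per_epoch = math.ceil(num_samples / effective_batch)

seconds_per_step = train_hours * 3600 / steps_per_epoch
samples_per_second = num_samples / (train_hours * 3600)

print(f"effective batch size : {effective_batch}")
print(f"optimizer steps      : {steps_per_epoch}")
print(f"seconds per step     : {seconds_per_step:.2f}")
print(f"samples per second   : {samples_per_second:.2f}")
```

With save_steps set to 5000, a run of this length would write checkpoints at steps 5000 and 10000 before finishing.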




🤖 Exploring Generative AI & LLM. Join the Gathnex community for cutting-edge discussions and updates! LinkedIn : https://www.linkedin.com/company/gathnex/ 🌟