24 Feb 2024
In this article, we describe the finetuning experiments we ran to illustrate the effectiveness of SFT and DPO on the public state-of-the-art Mixtral 8x7B model. We start from the pretrained mistralai/Mixtral-8x7B-v0.1. Following HuggingFaceH4/zephyr-7b-beta, we use publicly available datasets for SFT and DPO training. The datasets involved in this experiment are listed below (a loading sketch follows the table):
Stage | Dataset | Volume |
---|---|---|
SFT | UltraChat-200K | 200K |
DPO | UltraFeedback | 61K |
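Both datasets are publicly available on the Hugging Face Hub. Below is a minimal loading sketch; it assumes the HuggingFaceH4 releases used for zephyr-7b-beta, so the exact repository ids and split names are our assumption rather than something stated above.

```python
from datasets import load_dataset

# Assumption: the HuggingFaceH4 releases of UltraChat-200K and UltraFeedback,
# i.e. the same data used to train zephyr-7b-beta.
ultrachat = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")            # ~200K dialogues for SFT
ultrafeedback = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")  # ~61K preference pairs for DPO

print(ultrachat[0].keys())       # per-example fields (prompt, messages, ...)
print(ultrafeedback[0].keys())   # per-example fields (prompt, chosen, rejected, ...)
```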
All experiments are performed with our internally developed distributed training framework, built on deepspeed and pytorch_lightning, and are trained in bfloat16 precision. We released the SFT and DPO checkpoints.
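Our internal framework is not public, so purely as an illustration, here is a minimal sketch of an equivalent setup with off-the-shelf pytorch_lightning and its deepspeed strategy in bfloat16. The device count and ZeRO stage are assumptions; the step count and gradient clipping follow the SFT row of the hyperparameter table below.

```python
import pytorch_lightning as pl

# Minimal sketch (not the internal framework): bfloat16 training with the
# built-in deepspeed strategy, using the SFT hyperparameters from the table below.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,                     # assumption: the actual GPU count is not stated
    strategy="deepspeed_stage_3",  # assumption: ZeRO stage is not stated
    precision="bf16-mixed",        # bfloat16 precision (Lightning 2.x name; "bf16" on older versions)
    max_steps=3200,                # SFT steps from the hyperparameter table
    gradient_clip_val=1.0,         # SFT grad_clip from the hyperparameter table
)
# trainer.fit(lightning_module, train_dataloader)  # model wrapper and dataloader omitted
```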
We perform chat instruction finetuning on the UltraChat-200K dataset, a heavily filtered version of the UltraChat dataset that was also used to train Zephyr-7B-β. We use the same prompt template as the Claude models, i.e. prompt template = "\n\nHuman: {prompt}\n\nAssistant: {response}" (see the formatting sketch after the table below). The hyperparameters are in the following table. We train for 1 epoch. The released SFT checkpoint is https://huggingface.co/vistagi/Mixtral-8x7b-v0.1-sft.
Stage | batch size | lr | optimizer | steps | grad_clip |
---|---|---|---|---|---|
SFT | 64 | 5e-5 | adamw | 3200 | 1.0 |
DPO | 64 | 5e-6 | adamw | 2000 | 10.0 |
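To make the prompt template concrete, here is a small sketch of how a multi-turn conversation is flattened into the Claude-style format. The helper name is ours, and the `messages` layout follows the HuggingFaceH4/ultrachat_200k release.

```python
def format_claude_style(messages):
    """Flatten a list of {"role", "content"} turns into the
    "\n\nHuman: ...\n\nAssistant: ..." template used for SFT."""
    text = ""
    for turn in messages:
        speaker = "Human" if turn["role"] == "user" else "Assistant"
        text += f"\n\n{speaker}: {turn['content']}"
    return text

# Example with a single-turn conversation:
example = [
    {"role": "user", "content": "What is Mixtral 8x7B?"},
    {"role": "assistant", "content": "A sparse mixture-of-experts language model."},
]
print(format_claude_style(example))
```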
We perform DPO training on top of the SFT model above, using the UltraFeedback dataset with 61.1K preference pairs in total. We show the loss curve, accuracy, and margin on the eval set.
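As background, the sketch below shows the standard DPO pairwise loss and how the eval-set accuracy and margin are typically computed from the implicit rewards. The per-sequence log-probabilities are assumed to be precomputed, and beta=0.1 is a common default rather than a value reported in this article.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective on a batch of preference pairs.

    Each argument is a tensor of summed per-sequence log-probabilities,
    from either the trainable policy or the frozen SFT reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    # Metrics tracked on the eval set: how often the chosen response gets the
    # higher implicit reward, and by how much on average.
    accuracy = (chosen_rewards > rejected_rewards).float().mean()
    margin = (chosen_rewards - rejected_rewards).mean()
    return loss, accuracy, margin
```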
We evaluate the SFT and DPO models on a set of popular LLM benchmarks and compare them with the pretrained Mixtral model and the Mixtral-Instruct model released by mistral.ai, which is trained with methods similar to those in this article but on an undisclosed dataset. We show the results in the following table.
With SFT training, the model slightly outperforms the pretrained model. With DPO, the improvement is more significant, and the resulting model is roughly on par with the public Mixtral Instruct model.
Capability | Benchmark | Eval Type | Mixtral Pretrained | Mixtral SFT | Mixtral DPO | Mixtral Instruct |
---|---|---|---|---|---|---|
General | MMLU | multi-choice | 67.76% | 67.21% | 67.94% | 68.79% |
Reasoning | BBH | generation | 69.39% | 69.16% | 70.08% | 68.16% |
 | HellaSwag | multi-choice | 84.09% | 84.05% | 85.18% | 85.97% |
Math | GSM8K | generation | 57.62% | 61.79% | 63.38% | 63.91% |
 | MATH | generation | 25.86% | 26.70% | 26.44% | 25.48% |
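As a usage example, the released SFT checkpoint can be loaded with the standard transformers API and prompted with the same Claude-style template used during training. This is a minimal sketch and assumes substantial GPU memory for the full 8x7B model in bfloat16; device_map="auto" shards it across the available GPUs.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vistagi/Mixtral-8x7b-v0.1-sft"  # released SFT checkpoint from this article
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Prompt with the same template used during finetuning.
prompt = "\n\nHuman: Explain what a mixture-of-experts layer is.\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```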