Finetuning Mixtral 8x7B with SFT and DPO

By Bryan Li

24 Feb 2024


In this article, we describe the finetuning experiments we ran to illustrate the effectiveness of SFT and DPO on the public state-of-the-art Mixtral 8x7B model. We start from the pretrained mistralai/Mixtral-8x7B-v0.1. Following HuggingFaceH4/zephyr-7b-beta, we use publicly available datasets for SFT and DPO training. The datasets involved in this experiment are:

| Stage | Dataset | Volume |
| --- | --- | --- |
| SFT | UltraChat-200K | 200K |
| DPO | UltraFeedback | 61K |
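
Both datasets are publicly available on the Hugging Face Hub. As a minimal sketch, they can be loaded with the `datasets` library; the repository and split names below follow the public Zephyr recipe and dataset cards, and are assumptions rather than details stated in this post.

```python
# Minimal sketch: loading the public SFT and DPO datasets with the Hugging Face
# `datasets` library. Repository and split names follow the public dataset cards
# (an assumption; this post does not spell them out).
from datasets import load_dataset

# UltraChat-200K: filtered multi-turn chat data used for SFT.
ultrachat = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

# UltraFeedback (binarized): chosen/rejected preference pairs used for DPO.
ultrafeedback = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")

print(len(ultrachat), len(ultrafeedback))  # roughly 200K and 61K examples, as in the table above
```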

All experiments are performed with our internally developed distributed training framework based on DeepSpeed and PyTorch Lightning, and all models are trained in bfloat16 precision. We release the resulting SFT and DPO checkpoints.
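
Our framework itself is not released; as a rough, illustrative equivalent, a PyTorch Lightning Trainer can be configured with the DeepSpeed strategy and bfloat16 precision along these lines (the GPU count and ZeRO stage are assumptions):

```python
# Illustrative sketch only: an open-source PyTorch Lightning + DeepSpeed setup with
# bfloat16 precision, approximating the training configuration described above.
import pytorch_lightning as pl
from pytorch_lightning.strategies import DeepSpeedStrategy

trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,                            # assumption: GPU count is not stated in this post
    strategy=DeepSpeedStrategy(stage=3),  # assumption: ZeRO stage is not stated either
    precision="bf16-mixed",               # bfloat16 training, as described
    max_steps=3200,                       # SFT step count from the hyperparameter table below
    gradient_clip_val=1.0,                # SFT grad_clip from the hyperparameter table below
)
# trainer.fit(lightning_module, train_dataloader)  # lightning_module wraps the Mixtral model
```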

SFT

We perform chat instruction finetuning on the UltraChat-200K dataset, a heavily filtered version of the UltraChat dataset that was also used to train Zephyr-7B-β. We use the same prompt template as Claude models, i.e. prompt template = "\n\nHuman: {prompt}\n\nAssistant: {response}" (a small formatting sketch follows the table below). The hyperparameters are listed in the following table, and we train for 1 epoch. The released SFT checkpoint is available at https://huggingface.co/vistagi/Mixtral-8x7b-v0.1-sft.

| Model | Batch size | LR | Optimizer | Steps | Grad clip |
| --- | --- | --- | --- | --- | --- |
| SFT | 64 | 5e-5 | AdamW | 3200 | 1.0 |
| DPO | 64 | 5e-6 | AdamW | 2000 | 10.0 |
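
To make the prompt template concrete, the sketch below renders one multi-turn conversation into the Claude-style format described above. The `messages` structure (a list of `role`/`content` dicts) follows the UltraChat-200K dataset card and is an assumption rather than a detail from this post.

```python
# Minimal sketch of the Claude-style prompt template used for SFT.
def format_conversation(messages):
    """Render alternating user/assistant turns into the Human/Assistant template."""
    text = ""
    for message in messages:
        role = "Human" if message["role"] == "user" else "Assistant"
        text += f"\n\n{role}: {message['content']}"
    return text

example = format_conversation([
    {"role": "user", "content": "What is Mixtral 8x7B?"},
    {"role": "assistant", "content": "A sparse mixture-of-experts language model."},
])
# "\n\nHuman: What is Mixtral 8x7B?\n\nAssistant: A sparse mixture-of-experts language model."
```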

DPO

We perform DPO training on top of the SFT model above, using the UltraFeedback dataset, which contains 61.1K preference pairs in total. Below we show the loss curve on the eval set, along with the eval accuracy and reward margin.
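
For readers unfamiliar with the metrics plotted below, the DPO objective and the eval accuracy and reward margin can be written down directly from the policy and reference log-probabilities of the chosen and rejected responses. The sketch below uses beta = 0.1, which is an assumption; the post does not state the value.

```python
# Sketch of the DPO loss and the eval metrics (accuracy, reward margin) shown below.
# Inputs are per-example summed log-probabilities of the chosen/rejected responses
# under the policy being trained and the frozen reference (SFT) model.
import torch.nn.functional as F

def dpo_loss_and_metrics(policy_chosen_logps, policy_rejected_logps,
                         ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards are the beta-scaled log-ratios between policy and reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    margins = chosen_rewards - rejected_rewards       # reward margin per preference pair
    loss = -F.logsigmoid(margins).mean()              # DPO objective
    accuracy = (margins > 0).float().mean()           # fraction of pairs ranked correctly
    return loss, accuracy, margins.mean()
```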

(Figures: DPO eval loss, eval accuracy, and eval reward margin over training.)

Results

We evaluate the SFT and DPO models on a set of popular LLM benchmarks and compare them with the pretrained Mixtral model and the Mixtral-Instruct model released by mistral.ai, which was trained with methods similar to those described in this article, although its training data is not public. The results are shown in the following table.

With SFT training, the model performs comparably to the pretrained model, with a clear gain on GSM8K. With DPO, the improvement is significant, and we are roughly on par with the public Mixtral-Instruct model.

| Capability | Benchmark | Eval Type | Mixtral Pretrained | Mixtral SFT | Mixtral DPO | Mixtral Instruct |
| --- | --- | --- | --- | --- | --- | --- |
| General | MMLU | multi-choice | 67.76% | 67.21% | 67.94% | 68.79% |
| Reasoning | BBH | generation | 69.39% | 69.16% | 70.08% | 68.16% |
| Reasoning | HellaSwag | multi-choice | 84.09% | 84.05% | 85.18% | 85.97% |
| Math | GSM8K | generation | 57.62% | 61.79% | 63.38% | 63.91% |
| Math | MATH | generation | 25.86% | 26.70% | 26.44% | 25.48% |
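
This post does not say which evaluation harness produced these numbers; one common open-source option is EleutherAI's lm-evaluation-harness, which can run several of the benchmarks above. The snippet below is only an illustrative sketch, and the model name, task selection, and batch size are assumptions.

```python
# Illustrative sketch only: evaluating a checkpoint on some of the benchmarks above
# with EleutherAI's lm-evaluation-harness (not necessarily the tool used in this post).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=vistagi/Mixtral-8x7b-v0.1-sft,dtype=bfloat16",
    tasks=["mmlu", "hellaswag", "gsm8k"],  # a subset of the benchmarks in the table
    batch_size=8,
)
print(results["results"])
```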
