Continuous Pre-Training or Fine-Tuning? Let's resolve this.
In the coliseum of machine learning, two gladiators face off in an epic battle: Continuous Pre-Training and Fine-Tuning. The crowd roars as these titans of model optimisation clash, their neural networks sparking with electricity. Continuous Pre-Training, the seasoned veteran, boasts of its ability to endlessly absorb new knowledge, growing ever wiser with each passing day. Fine-Tuning, the agile specialist, counters with its laser-focused approach, honing in on specific tasks with surgical precision. As they grapple, the very fabric of artificial intelligence trembles. Will the jack-of-all-trades triumph, or will the master of one domain emerge victorious? The referee — a team of data scientists — watches closely, knowing that in this arena, perhaps the true champion lies not in choosing a side, but in finding the perfect harmony between these two formidable techniques — Claude.AI
The introduction above was so cool that it had to be included at the start of this blog. I am currently working through the dilemma of whether a model (LLM) should be pre-trained further on generalised data or fine-tuned on a relatively small set of specialised data.
A paper in support of continual pre-training is Don't Stop Pretraining: Adapt Language Models to Domains and Tasks. Let's talk briefly about it:
1. The paper talks about two techniques: DAPT (Domain-Adaptive Pre-Training) and TAPT (Task-Adaptive Pre-Training). Domain data is unlabelled data that is relevant to the task in a broader sense, e.g. biomedical papers for classifying biomedical text. Task-adaptive pre-training instead uses unlabelled data drawn from the task's own distribution.
2. DAPT and TAPT have both been shown to give better results than the baseline model (here, RoBERTa). The two can be applied individually or together in two consecutive phases; a rough sketch of what this continued pre-training looks like in code is given after this list.
3. The paper illustrates the similarity of the pre-training data distributions in a subtle way that was new to me: a pairwise vocabulary overlap of the top 10K words from the different domains, where PT is the baseline RoBERTa pre-training corpus.
4. TAPT generally uses a much smaller text corpus, drawn from the task distribution, and is therefore less compute-intensive than DAPT.
5. The results from the experiments are shown below.
6. Pre-training on task-specific or small domain-specific corpora can significantly improve performance. This suggests that alongside developing larger language models, there’s value in specialising models using relevant domain and task data.
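To make this concrete, here is a minimal sketch of what DAPT/TAPT-style continued MLM pre-training could look like with the Hugging Face transformers and datasets libraries. The corpus file name, batch size, learning rate and epoch count are placeholder assumptions of mine, not values from the paper.

```python
# Minimal sketch: continue masked-language-model pre-training of RoBERTa
# on an unlabelled domain/task corpus (DAPT/TAPT style).
from transformers import (RobertaTokenizerFast, RobertaForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Unlabelled corpus, one document per line (e.g. biomedical abstracts for DAPT,
# or the task's own unlabelled text for TAPT). The file name is a placeholder.
corpus = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# Dynamic masking with the same MLM objective used in the original pre-training.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="roberta-dapt",
                         per_device_train_batch_size=8,
                         num_train_epochs=1,
                         learning_rate=1e-4)
Trainer(model=model, args=args, train_dataset=corpus, data_collator=collator).train()

# The adapted checkpoint in "roberta-dapt" is then fine-tuned on the labelled task data.
```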
Another paper in favour of continued pre-training of small models is Well-Read Students Learn Better. Some of its highlights are given below:
1. Big models are difficult to run for smaller companies with little or no funding. So instead of serving a huge teacher model, say an 80B-parameter one, smaller models are trained to mimic their teachers on various tasks. This is called knowledge distillation: the student learns to predict the teacher model's softmax output distribution.
2. The paper argues that before distillation you can build the smaller model in one of three ways: initialise the student with random parameters, copy some layers' weights from the teacher, or pre-train the student on a large unlabelled corpus with the same masked-language-modelling objective. The resulting student model is 31 times smaller and 16 times faster.
3. Let's break down this algorithm quickly. First, pre-train the student model on a large unlabelled dataset (Wikipedia articles, news articles, books, etc.) using the MLM objective. Then, for each example in a task-relevant unlabelled dataset D_T, compute a loss L between the teacher's predictions and the student's predictions, and update the student via backpropagation. Optionally, fine-tune the resulting student with supervised training on labelled data D_L. A minimal sketch of the distillation step is given after this list.
4. Examples of the datasets:
D_LM: “The quick brown fox jumps over the lazy dog. Climate change is a pressing global issue.”
D_T: “This movie had amazing special effects. The plot twist at the end was unexpected. The lead actor’s performance was disappointing.”
D_L: “This film was a masterpiece!” (Label: Positive) “I fell asleep during the movie.” (Label: Negative) “The movie was okay, nothing special.” (Label: Neutral)
5. The authors observe that pre-trained students can leverage depth much better than width; in contrast, this property is not visible in randomly initialised models. Pre-trained Distillation is also more robust to variations in the amount of unlabelled data in the transfer set than standard distillation.
6. End result: Pre-Training + Distillation outperforms both more sophisticated distillation of task knowledge (Sun et al., 2019a) and more sophisticated pre-training from unlabelled text (Sanh, 2019).
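As a rough illustration of step 3 above, here is a minimal sketch of the distillation update on the transfer set D_T, where the student is trained to match the teacher's softmax distribution. The function name, the temperature value, and the assumption that both models expose Hugging Face-style `.logits` outputs are mine, not details from the paper.

```python
# Minimal sketch of one distillation step on a batch from D_T:
# the student learns to reproduce the teacher's softmax distribution.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, batch, optimizer, temperature=1.0):
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits     # soft labels from the teacher

    student_logits = student(**batch).logits

    # Soft cross-entropy: KL divergence between the temperature-scaled
    # teacher and student distributions.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean") * temperature ** 2

    optimizer.zero_grad()
    loss.backward()                                  # update the student only
    optimizer.step()
    return loss.item()
```

After looping this step over D_T, the optional final stage is ordinary supervised fine-tuning of the student on the labelled set D_L.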
Verdict: After reading these two papers, knowledge distillation looks like a progressive choice for compact models. The next problem statement then becomes: how do you distil knowledge for a single domain? The quest to find the best model will go on forever (till AGI ;)!