ULMFiT Paper Notes: The paper that laid the foundation for fine-tuning
- Universal Language Model Fine-tuning for Text Classification — Excellent paper by Jeremy Howard and Sebastian Ruder
- Based on transfer learning: in short, how a model learned on one task can perform a different task if we give it a little push (fine-tuning).
- ULMFiT is a method that can be applied broadly, in the same spirit as fine-tuning a CV model for classification, to NLP applications like spam or fraud detection.
- Before this, the embeddings of a pre-trained model were only used in the first layer of a model, with the additional burden of training the other layers of the model from scratch.
- Challenges to LMs — LMs overfit to small datasets and suffered catastrophic forgetting when fine-tuned with a classifier.
- Three key techniques — discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing.
- The method is universal because —
It works across tasks varying in document size, number, and label type
It uses a single architecture and training process
It requires no custom feature engineering or preprocessing
It does not require additional in-domain documents or labels.
- Pre-training —
Wikitext-103 (Merity et al., 2017b), consisting of 28,595 preprocessed Wikipedia articles and 103 million words.
- Fine-tuning —
Discriminative Fine-Tuning — fine-tune different layers to different extents by giving each layer its own learning rate; empirically, η^(l−1) = η^l / 2.6 works well, i.e. each lower layer's learning rate is 2.6 times smaller than the layer above it. A small sketch of this follows.
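A minimal sketch of discriminative fine-tuning in PyTorch: a toy three-layer LSTM language model where each lower layer gets a learning rate 2.6× smaller than the layer above it. The layer sizes and the plain SGD optimizer are illustrative assumptions; only the per-layer learning-rate rule follows the paper.

```python
from torch import nn, optim

# Toy stand-in for a stacked language model (sizes are illustrative).
layers = [
    nn.LSTM(400, 1150, batch_first=True),
    nn.LSTM(1150, 1150, batch_first=True),
    nn.LSTM(1150, 400, batch_first=True),
]

base_lr = 0.01   # learning rate for the top (last) layer
decay = 2.6      # eta^(l-1) = eta^l / 2.6

# One optimizer parameter group per layer; the learning rate shrinks by 2.6x
# for every layer below the top one.
param_groups = [
    {"params": layer.parameters(), "lr": base_lr / (decay ** depth)}
    for depth, layer in enumerate(reversed(layers))
]
optimizer = optim.SGD(param_groups, lr=base_lr)
```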
Slanted Triangular Learning Rates — the learning rate first increases linearly for a short warm-up phase (about the first 200 iterations in the paper's plot) and then decreases linearly for the rest of the iterations; a small sketch of the schedule follows.
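A small sketch of the slanted triangular schedule, following the formula in the paper; the default hyperparameters (cut_frac = 0.1, ratio = 32, η_max = 0.01) are the ones the paper reports.

```python
def slanted_triangular_lr(t, T, eta_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at iteration t (0-indexed) out of T total iterations."""
    cut = max(1, int(T * cut_frac))      # iterations spent increasing the LR
    if t < cut:
        p = t / cut                      # linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))   # linear decay
    return eta_max * (1 + p * (ratio - 1)) / ratio

# e.g. slanted_triangular_lr(100, 2000) is still warming up,
# while slanted_triangular_lr(1500, 2000) is already decaying.
```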
- Once we have the pre-trained model, it is augmented with two additional linear blocks, each with batch normalization and dropout, a ReLU activation for the intermediate layer, and a softmax activation at the output that gives a probability distribution over the classes for multi-class classification. A rough sketch of such a head is shown below.
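A rough sketch of that classifier head in PyTorch; the hidden size, dropout probability, and exact ordering inside each block are illustrative assumptions, not the paper's exact settings.

```python
import torch
from torch import nn

class ClassifierHead(nn.Module):
    """Two linear blocks with batch norm and dropout; ReLU in between, softmax out."""

    def __init__(self, in_dim, hidden_dim, num_classes, p=0.2):
        super().__init__()
        self.block1 = nn.Sequential(
            nn.BatchNorm1d(in_dim), nn.Dropout(p),
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
        )
        self.block2 = nn.Sequential(
            nn.BatchNorm1d(hidden_dim), nn.Dropout(p),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        # x: (batch, in_dim) features coming from the pre-trained language model
        return torch.softmax(self.block2(self.block1(x)), dim=-1)
```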
- Gradual Unfreezing — similar in spirit to treating layers differently with per-layer learning rates, the authors propose gradual unfreezing starting from the last layer. It is an iterative process where, with each epoch, one more layer is unfrozen, until all layers are being fine-tuned; see the sketch below.
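A minimal sketch of gradual unfreezing, assuming `layer_groups` is an ordered list of `nn.Module` groups from the embedding layer up to the classifier head, and `train_one_epoch` is a hypothetical user-supplied training loop.

```python
def gradual_unfreeze_and_train(layer_groups, train_one_epoch, epochs=None):
    epochs = epochs or len(layer_groups)

    # Start with every layer group frozen.
    for group in layer_groups:
        for p in group.parameters():
            p.requires_grad = False

    # Each epoch, unfreeze one more group, starting from the topmost (last) layer.
    for epoch in range(epochs):
        if epoch < len(layer_groups):
            for p in layer_groups[-(epoch + 1)].parameters():
                p.requires_grad = True
        train_one_epoch(epoch)   # hypothetical helper: runs one fine-tuning epoch
```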
Observations from the analysis —
- Low-shot learning — transfer learning helps LM training when only a small number of labeled examples is available.
- Impact of pre-training — pre-training improves performance on small, mid-size, and even large datasets.
- Impact of LM fine-tuning — discriminative fine-tuning (‘Discr’) and slanted triangular learning rates (‘Stlr’) are most beneficial for larger datasets.
- Impact of classifier fine-tuning — ULMFiT is the only method that shows excellent performance across the IMDb, TREC-6, and AG datasets when fine-tuned with gradual unfreezing (‘Freez’), discriminative fine-tuning (‘Discr’), and slanted triangular learning rates (‘Stlr’).
- Classifier fine-tuning behavior — ULMFiT is more stable and does not suffer from catastrophic forgetting; performance remains similar or improves until late epochs, which shows the positive effect of the learning rate schedule.