Mixture of Experts vs Dense Transformers?

Recently Meta released their "Llama 3 Herd of Models", the highlight of which is the 405B-parameter model, a significant contribution to the open-source progress of large language models. The thing that captured my attention was the Dense Transformer architecture and its comparison with the Mixture of Experts approach. This will be a brief article on the pros and cons of the two architectural styles of LLMs. Of course, I love explaining with an analogy so that common folks can understand.
Mixture of Experts —
Let’s imagine a shop where you are going to purchase a new machine for your in-house monster rig that will fine-tune and run your Llama 3.1 8B model (since a 405B Llama in FP16 needs nearly 800 GB of memory and you are GPU poor). There is one expert who will guide you on the motherboard, another who will resolve your queries about the NVIDIA H100 GPU, some guy who will fix your GPU clustering issue, and so on. The lesson is —
- The shop under the hood uses different “experts” for different domain knowledge and tasks.
- Each expert will work with high “accuracy” and will allow a modular design approach. Experts can specialize in different types of inputs or tasks.
- Scaling the model increases the total parameter count, but inference time stays roughly the same, since only a few experts are activated for each input even as we add more experts.
- Each expert can be updated with more data, adapting the model’s fit to new data. Hence parts of the model’s parameters can be changed without retraining everything.
- More experts can be fit into a single pipeline to handle additional tasks.
- The topology is that of an expert network with a routing (gating) mechanism that decides which experts handle each input (see the sketch after this list).
- Another example is employing different experts to decode documents in different languages for OCR.
- Uneven training of the experts (load imbalance) can be a challenge, and the routing step of selecting experts adds some overhead to inference.
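To make the routing idea concrete, here is a minimal sketch of an MoE layer in PyTorch. The class name, dimensions, number of experts, and top-2 routing are my own illustrative assumptions, not Llama's or any production implementation.

```python
# Minimal Mixture-of-Experts layer sketch (illustrative assumptions, not Llama's code).
# Each expert is a small feed-forward block; a softmax router picks the top-2
# experts per token and blends their outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.top_k = top_k

    def forward(self, x):                        # x: (batch, seq, d_model)
        gate_logits = self.router(x)             # (batch, seq, n_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the chosen experts
        out = torch.zeros_like(x)
        # Only the selected experts run for each token, so compute per token
        # stays roughly constant even if we add more experts.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)        # tokens routed to expert e at slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: y = MoELayer()(torch.randn(2, 16, 512))  # y has the same shape as the input
```

This is the piece that gives MoE its scaling property: adding experts grows the parameter count, but each token still only pays for the top-k experts it is routed to.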
Dense Transformer —
We will continue the analogy above. Being an intelligent consumer, you find another shop a few steps away that also claims to be an expert in LLM machines. The owner is a single person who was once an NVIDIA employee and left after a 1000x payout on his stock. He has wide domain knowledge and knows LLMs inside out. You hand him your requirements sheet, and he makes sure the system plan he comes up with covers every point you mentioned. The takeaway is —
- The shop owner is one overall expert across the given range of tasks, so there is a cap on how many tasks he can perform well. The overall aim, though, is a generalized model for a range of tasks such as language modelling, summarization, etc.
- All parameters of the model are used in every inference/forward pass.
- Scaling the model is possible by adding new layers to the existing model and either pretraining the new layers or fine-tuning the model end to end. This can result in the model forgetting knowledge gained from past training and developing a bias towards the data distribution of the new data.
- A single base model can be replicated and fine-tuned for different downstream uses; it is easier to develop one base model and then fine-tune it with a language-modelling head (as in ChatGPT) or a classification head.
- Training a SOTA large model is challenging, as it requires huge resources for both training and inference. E.g. Llama 3.1 405B has 405 billion parameters and needs roughly 800 GB of memory to be served in its original BF16 precision, which is more than one HGX node with 8 NVIDIA H100 GPUs at 80 GB VRAM each (640 GB total); a back-of-the-envelope check is sketched after this list.
- The data required to train a single model is huge. E.g. Llama 3.1 was trained on more than 15T tokens, whereas an ensemble of experts can be trained on specialised datasets and still achieve decent results.
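As a rough check on the serving-memory numbers above, here is a small back-of-the-envelope calculation. It counts weight memory only and ignores the KV cache, activations, and framework overhead, which add more on top.

```python
# Back-of-the-envelope serving-memory estimate for dense model weights only
# (ignores KV cache, activations, and framework overhead).
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(n_params: float, precision: str) -> float:
    return n_params * BYTES_PER_PARAM[precision] / 1e9

llama_405b = 405e9
hgx_h100_vram_gb = 8 * 80  # one HGX node: 8 x H100 with 80 GB each = 640 GB

for p in ("bf16", "int8", "int4"):
    need = weight_memory_gb(llama_405b, p)
    fits = "fits" if need <= hgx_h100_vram_gb else "does not fit"
    print(f"{p}: ~{need:.0f} GB -> {fits} in a single 8xH100 node ({hgx_h100_vram_gb} GB)")

# bf16: ~810 GB (does not fit in 640 GB); int8: ~405 GB and int4: ~203 GB would fit.
```

This is why a 405B dense model in BF16 cannot be served on a single 8xH100 node without quantization or multi-node sharding.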
These differences help us understand the capabilities of each technique and in what scenarios to apply them to achieve the desired results.