Can small language models perform better than large language models?
Exploring the Chinchilla Paper

Every week (or sometimes twice a month) at work, we gather on Friday for a voluntary, hour-long online meeting. It's our little AI club where we geek out over the latest AI trends and vent about work challenges.
Recently, I had the honor (or the misfortune, depending on how you see it) of presenting findings from a research paper called "Training Compute-Optimal Large Language Models," popularly known as the Chinchilla paper.
After diving deep into the research, I ended up with notes scattered everywhere and a solid presentation that earned me some praise, so I thought of documenting it here to make it useful for the wider community.
When we consider LLMs from a broader perspective, what is the goal behind building such huge models?
In general, it comes down to two points:
1. Maximize the model's performance.
2. Minimize the loss when predicting tokens.
So, how do we achieve this? We have a couple of options:
Increase the dataset size (i.e., more tokens).
Increase the model size (i.e., more parameters).
Whichever of these two we increase, each step also increases the required compute budget.
Let's look at the relationship between compute budget and model performance, according to the paper.
Based on the above image, there is a clear relationship between compute budget and model performance: as we pump up the compute budget, the model performance improves.
Better results are achieved by using more compute power, training models for longer, or a combination of both.
But in the real world, the compute budget is almost always a hard constraint. Well, unless we are the CEO of a trillion-dollar company 😅 and can simply throw money at the problem to get that perfect model performance. Just kidding, excuse me please!
In other words, we cannot blindly keep increasing the compute budget.
In fact, many people have been trying to answer this question: given a fixed compute budget, what is the optimal number of tokens (dataset size) and number of parameters (model size) that achieves the maximum possible performance the budget allows? This area is still being actively researched.
The Chinchilla paper is all about finding the right balance between dataset size and model size for a given fixed compute budget.
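Before going further, it helps to make "compute budget" concrete. A rule of thumb widely used in the scaling-law literature (including the Chinchilla paper) is that training a transformer with N parameters on D tokens costs roughly C ≈ 6·N·D floating-point operations. A minimal sketch (the helper name is my own):

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: C ~= 6 * N * D FLOPs.

    The constant 6 is the standard rule of thumb, covering the
    forward pass (~2*N*D) plus the backward pass (~4*N*D).
    """
    return 6.0 * n_params * n_tokens

# Chinchilla itself: 70B parameters trained on 1.4T tokens
print(f"{train_flops(70e9, 1.4e12):.2e} FLOPs")  # 5.88e+23 FLOPs
```

This is why the tokens-vs-parameters trade-off exists at all: for a fixed C, doubling N forces you to halve D, and vice versa.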
💡 The Chinchilla paper was originally published in March 2022 by the research team at DeepMind.
According to the paper, they experimented with two different cases:
Case 1: fixed model size and compute budget, varying dataset size. Result: as the volume of training data increases, model performance continues to improve.
Case 2: fixed dataset size and compute budget, varying model size. Result: as the model size increases, model performance continues to improve.
Still, the question remains the same: what is the ideal balance between these two quantities?
To find it, the people behind the Chinchilla paper ran an exhaustive experiment, training over 400 transformer-based language models ranging from 70 million to over 16 billion parameters, on 5 to 500 billion tokens.
They found that many of the LLMs we use today may be:
Over-parameterized (they have more parameters than they need to achieve a good understanding of a language)
Undertrained (they would benefit from seeing more training data)
They then developed a new model called Chinchilla, and when compared against other models of the same period, the compute-optimal Chinchilla outperforms non-compute-optimal models such as GPT-3 on a large range of downstream evaluation tasks.
Models like GPT-3 and BLOOM were trained on datasets smaller than the Chinchilla-optimal size. On the other hand, Llama was trained on 1.4 trillion tokens, which is close to the Chinchilla-recommended number.
💡 The authors named this compute-optimal model Chinchilla after the small furry rodent, because chinchillas are smart.
According to the Chinchilla paper, in pre-training the optimal training dataset size should be about 20x the number of parameters in the model. ("Optimal" here means "the cheapest way to obtain a given loss level.")
Tokens : Parameters ≈ 20 : 1
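As a quick sketch, the 20:1 rule can be applied like this (the helper is my own illustration, not code from the paper; the model and dataset sizes are publicly reported figures):

```python
def chinchilla_optimal_tokens(n_params: float) -> float:
    """Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return 20.0 * n_params

# Chinchilla: 70B parameters -> ~1.4T tokens (matches its actual training set)
print(f"{chinchilla_optimal_tokens(70e9):.2e}")   # 1.40e+12

# GPT-3: 175B parameters -> ~3.5T tokens would be optimal by this rule,
# but it was actually trained on roughly 300B tokens
print(f"{chinchilla_optimal_tokens(175e9):.2e}")  # 3.50e+12
```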
Below are several evaluations based on BIG-bench tasks and common-sense benchmarks; except for 4-5 individual tasks, Chinchilla performed better than Gopher.
All this looks good. Now, let's look at a real-world implementation by Bloomberg: a model with 50 billion parameters and a dataset of 700 billion tokens, 51% of which was financial data.
Looking at the graph below, we can observe that Chinchilla and BloombergGPT are close in terms of parameter count, while the others have a lot more parameters: GPT-3 has somewhere around 200 billion parameters, and Gopher around 300 billion. By comparison, Bloomberg's model looks very small :)
But when we look at the evaluation of these models, we can observe that on finance-specific data, BloombergGPT outperforms these huge models.
And on general-purpose NLP tasks, BloombergGPT again performs pretty well, even though it is smaller than the others.
Essentially, the Chinchilla paper asks: could model size be less important than we imagined?
Below is a heatmap showing the number of parameters vs. the ratio of training tokens to parameters (as of November 2022).
Three key findings from the above heatmap:
1. Many models were far off from the Chinchilla-recommended ratio (even the popular ones).
2. Very few models were close to the Chinchilla-recommended values.
3. Most models were possibly not compute-optimal (they were over-parameterized or undertrained).
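The findings above can be sketched as a tiny classifier. This is purely illustrative, not from the paper: the ±50% tolerance band around the 20:1 target is my own arbitrary choice, and the model figures are publicly reported sizes.

```python
def classify(n_params: float, n_tokens: float,
             target: float = 20.0, tol: float = 0.5) -> str:
    """Classify a model against the Chinchilla 20:1 token/parameter ratio.

    tol=0.5 means a ratio within +/-50% of the target counts as
    'near compute-optimal' (an illustrative threshold, not the paper's).
    """
    ratio = n_tokens / n_params
    if ratio < target * (1 - tol):
        return "undertrained / over-parameterized"
    if ratio > target * (1 + tol):
        return "trained well past the 20:1 rule"
    return "near compute-optimal"

print(classify(175e9, 300e9))   # GPT-3      -> undertrained / over-parameterized
print(classify(70e9, 1.4e12))   # Chinchilla -> near compute-optimal
```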
Now, let's look at the same graph as of December 2023.
Notice how the models trained over the span of a year follow the Chinchilla-recommended ratio of tokens to parameters much more closely.
More recently, Meta released Llama 3, which has an insane ratio of tokens to parameters: 1,875:1.
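A quick sanity check of that ratio, using Meta's reported figures (~15 trillion training tokens, 8 billion parameters):

```python
def tokens_per_param(n_tokens: float, n_params: float) -> float:
    """Ratio of training tokens to model parameters."""
    return n_tokens / n_params

# Llama 3 8B: ~15T tokens / 8B parameters
ratio = tokens_per_param(15e12, 8e9)
print(ratio)       # 1875.0
print(ratio / 20)  # roughly 94x the Chinchilla 20:1 recommendation
```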
Andrej Karpathy (former OpenAI, Tesla, Stanford) tweeted about it saying:
15T is a very very large dataset to train with for a model as "small" as 8B parameters
Meta mentions that even at this point, the model doesn't seem to be "converging" in a standard sense. In other words, the LLMs we work with all the time are significantly undertrained by a factor of maybe 100-1000X or more, nowhere near their point of convergence.
In summary
Current large language models may be significantly undertrained, or over-parameterized.
For a constant compute budget, a model following the Chinchilla-recommended dataset size and number of parameters may perform better.
In pre-training, the optimal training dataset size should be about 20x the number of parameters in the model. ("Optimal" here means "the cheapest way to obtain a given loss level.")
Following this trend, we may see even longer-trained, even smaller models being trained and released.
This is a continuously evolving field of research, with fresh discoveries popping up every week. Many factors play into a model's performance. I'm not an expert yet, but I'm always eager to learn and chat more. Feel free to drop any questions in the comments.
On that note, I'm signing off. Catch you in the next one! Cheers!
https://arxiv.org/pdf/2203.15556
https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/
[Chinchilla data-optimal scaling laws: In plain English – Dr Alan D. Thompson – Life Architect](https://lifearchitect.ai/chinchilla/)



