Reactions to Llama 4 Have Been Mixed
The latest release of Meta's flagship open-source large language model lands in a much more competitive landscape than previous editions
Meta recently released a new version of its flagship open-source Llama LLM. It’s been a fascinating release to follow on social media (by which I mostly mean X), for a few reasons:
The release is all massive mixture-of-experts models, a first for Llama. The smallest of these (more below) is 109B parameters!
It dropped on a Saturday, for some reason.
There have been a bunch of questions from prominent researchers about how good this model is, and how to make sense of the numbers.
So I thought I would try to sum them up here. If you like this, please subscribe:
The Llama 4 Herd
The good:
Massive 10M token context length is completely unbeaten.
All these models are natively multimodal
Per-token active parameter counts are quite small at 17B, ostensibly giving you “big model” performance with “small model” inference times and extremely low costs (see the routing sketch after this list).
Top performance on lmarena (formerly lmsys)
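To make the “big model performance, small model compute” claim concrete, here’s a minimal toy sketch of top-k mixture-of-experts routing; this is my own illustration with made-up layer sizes, not Meta’s actual architecture. Each token only runs through the experts it’s routed to, so per-token compute tracks the active parameter count rather than the total.

```python
# Toy sketch of top-k MoE routing (illustrative only; sizes are made up, and this
# is not Meta's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=16, top_k=1):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                               # x: (tokens, d_model)
        scores = self.router(x)                         # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # route each token to top_k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                   # tokens assigned to expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

# Total parameters grow with n_experts, but each token only touches top_k of them,
# so per-token FLOPs (and latency) look like those of a much smaller dense model.
x = torch.randn(8, 64)
print(ToyMoE()(x).shape)  # torch.Size([8, 64])
```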
The bad:
Weird license - a Llama classic, made more significant now that we have permissively licensed alternatives like DeepSeek R1 (MIT) and Qwen 2.5 (mostly Apache 2.0)
You probably can’t run this model yourself unless you have a ton of compute, at which point, again, why not run DeepSeek R1?
If you’re running it in the cloud anyway, especially for an embodied AI task, it seems like you might as well spend the extra money and time to use Gemini or OpenAI models.
The unclear:
Reasoning model is coming soon. Excited to see how this goes.
The context window is huge, but “needle in a haystack” is a poor way to test it, so it’s not clear whether that long context is actually well utilized (a sketch of what that test amounts to follows this list).
There’s some question as to whether they trained on the test set and “juiced” their benchmark numbers. I’d argue they didn’t, but it really looks like their lmarena entry, which is doing quite well on the leaderboard, is not the same as the models released elsewhere, implying some level of fine-tuning for specific targets.
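Since I’m complaining about needle-in-a-haystack, here’s roughly what that test amounts to, as a minimal sketch; the generate(prompt) function is a hypothetical stand-in for whatever endpoint you’re poking at.

```python
# Minimal sketch of a "needle in a haystack" check. generate(prompt) -> str is a
# hypothetical stand-in for any model endpoint; the token budget is very rough.
import random

def needle_in_haystack_check(generate, approx_context_tokens=100_000):
    filler = "The sky was a uniform shade of grey that day. " * (approx_context_tokens // 10)
    needle = "The secret passphrase is 'orange-falcon-42'."
    pos = random.randint(0, len(filler))  # plant the needle at a random depth
    haystack = filler[:pos] + needle + filler[pos:]
    prompt = haystack + "\n\nWhat is the secret passphrase?"
    return "orange-falcon-42" in generate(prompt)
```

Passing this only shows the model can retrieve one planted string; it says nothing about summarizing, cross-referencing, or reasoning over the other several million tokens.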
So, how good is it?
As far as I’m concerned, both of the released Llama 4 models are “enterprise”-scale: you’re not running them at home for fun the way you would a Qwen, a Gemma, or a Llama 3.x. That means we might as well focus on Maverick, the larger, 400B-parameter version:
And these results are really solid! Despite the huge size of the model, inference is incredibly cheap, and it’s (mostly) state of the art, subject to some caveats (the comparisons exclude “reasoning” models, Gemini 2.5 Pro absolutely blows it away, etc.). It is probably the best available open-source model.
They are also training a truly massive model, the roughly 2-trillion-parameter Llama 4 Behemoth, which promises incredible performance for an open model (it still noticeably lags Gemini 2.5 Pro, but hey, it’s open source!).
But of course, Llama 4 has been released, so people can try out the model themselves. One place is lmarena (lmsys), where it’s doing incredibly well. Let’s look at some specifics, with community reactions.
How good is it at coding?
Specifically, let’s look at coding. It’s a high-value application, and one that LLMs are extremely well suited to: structured outputs, tons of clean data, and success that is easy to verify programmatically, which enables RL.
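To make that last point concrete, here’s a minimal sketch of how a coding reward can be computed programmatically; the harness and names are my own illustration, not any particular training pipeline.

```python
# Sketch of a programmatic reward for generated code: run it against unit tests in a
# subprocess and score 1.0 if everything passes. Purely illustrative.
import pathlib
import subprocess
import sys
import tempfile

def reward_for_solution(solution_code: str, test_code: str, timeout: float = 10.0) -> float:
    """Return 1.0 if the model's code passes the tests, else 0.0."""
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / "candidate.py"
        src.write_text(solution_code + "\n\n" + test_code)
        try:
            result = subprocess.run([sys.executable, str(src)],
                                    capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return 0.0
        return 1.0 if result.returncode == 0 else 0.0

# Example: a (hypothetical) model-generated function plus a tiny test harness.
solution = "def add(a, b):\n    return a + b"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(reward_for_solution(solution, tests))  # 1.0
```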
Deedy on X says it’s not very good at coding. Results from the kscores benchmark put it below models like Gemini 2.0 Flash or Qwen 2.5 Max.
Poor performance on Aider polyglot coding benchmark:
DeepSeek clearly retains its open-source crown here.
Let’s look at some of the lmarena results:
Huh, wow, that’s weird. Incredibly wordy and… not a great response.
Turns out that with “style control” enabled, it drops quite a bit. What’s going on here?
The Controversy
Performance evaluations for this model are… really strange. First, there’s this weird, chatty, slop-heavy version that’s tearing up the lmsys leaderboards, but that doesn’t seem to appear anywhere else, e.g. on together.ai.
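For what it’s worth, my understanding is that lmarena’s “style control” is roughly a Bradley-Terry-style fit over pairwise battles with extra covariates for response style; here’s a simplified sketch of the idea (my approximation, not their actual code), using response-length difference as the lone style feature.

```python
# Rough sketch of style-controlled leaderboard fitting (my approximation): a logistic
# regression on pairwise battles where per-model indicators capture strength and a
# length-difference covariate absorbs the "longer answers win more" effect.
import numpy as np
from sklearn.linear_model import LogisticRegression

def style_controlled_strengths(battles, n_models):
    """battles: iterable of (model_a, model_b, len_a, len_b, a_won) tuples."""
    X, y = [], []
    for a, b, len_a, len_b, a_won in battles:
        row = np.zeros(n_models + 1)
        row[a], row[b] = 1.0, -1.0          # indicator difference between the two models
        row[-1] = (len_a - len_b) / 1000.0  # style covariate: response-length difference
        X.append(row)
        y.append(int(a_won))
    clf = LogisticRegression(fit_intercept=False).fit(np.array(X), np.array(y))
    return clf.coef_[0][:n_models], clf.coef_[0][-1]  # model strengths, length effect

# If a model's raw win rate is propped up by longer, chattier answers, its
# style-controlled strength drops once the length coefficient absorbs that effect.
```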
Others note that it’s no better than smaller open-source models like Phi 14B and Qwen-32B on out-of-distribution tasks. This all seems to imply it’s badly overfit.
There are worse allegations floating around, which I don’t personally believe:
More likely, this is a side effect of the Llama-1 data, which was found to be badly contaminated. This isn’t a Llama-specific issue; it comes up quite often in LLM training. Meta, when training these models, famously cast a very wide net, and it seems to me that this might have backfired.
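For the curious, contamination checks typically boil down to looking for long n-gram overlap between training documents and benchmark test items; here’s a generic sketch of that heuristic (not Meta’s pipeline, just the common idea).

```python
# Generic n-gram overlap heuristic for flagging benchmark contamination.
def ngrams(text: str, n: int = 13) -> set:
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(train_docs, test_examples, n: int = 13) -> float:
    """Fraction of test examples sharing at least one long n-gram with the training set."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for ex in test_examples if ngrams(ex, n) & train_grams)
    return flagged / max(len(test_examples), 1)
```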
Another possibility is that there are issues with the inference implementations; apparently Meta is seeing huge differences between its internal results and those reported by external reproductions.
My Thoughts
Benchmarking problems aside: this is a robotics blog, and I have to admit I’m really disappointed that Llama 4 lacks the small models that are so useful for robotics. I’m not alone there. Nathan Lambert on Interconnects has a great writeup, as always; he suggests they rushed the model out the door.
It’s been entertaining to watch the reaction on Twitter. This is one of those moments, like the LK-99 fiasco, where the whole of the internet is piecing together a mystery (why are the results not matching my vibes? wait, what are the results again??).
Personally, I don’t think there’s anything nefarious going on here. Lots of very smart people work at Meta (I used to be one of them!), and their AI team, to be honest, knows better. What this is, in my opinion, is an important lesson in how bad dataset contamination can be, and how risky it is to pick and choose your benchmarks.
If you liked this post, you can check out my (less dramatic) writeup on DeepSeek:
If you want to see how I’d like to be using Llama 4, check out my writeup on Stretch AI here, where I used Qwen: