Chinese startup Deepseek released a new “reasoning” model, which you can try out on their website. And it’s impressive, as well as kind of delightful. R1 is incredibly charming; its reasoning traces are well-thought-out, and feel like a human trying to make their way through a problem. And it achieves this all through reinforcement learning!
First, how good is it? Deepseek R1’s performance on the really cool ARC-AGI benchmark is on par with OpenAI o1-preview, while being open-source and a fraction of the cost:
This is really starting to look like “intelligence too cheap to meter.” For comparison:
As befits a reasoning model, it also performs very well on PlanBench, a test of challenging symbolic reasoning. It gets 39.8% accuracy on Mystery Blocksworld and 96.6% on Blocksworld, easily beating vanilla LLMs (which do very poorly!) and getting close to the much more expensive OpenAI o1:
And of course you can try it on Deepseek’s website (choose DeepThink to enable reasoning). You can also run it locally with WebGPU.
The model itself is available under an MIT license, and you can find links to weights + more information on GitHub. The company also released a number of distilled versions that you can probably fit on your desktop GPU.
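For instance, here's one way you might try a distilled checkpoint locally with Hugging Face transformers. The model ID is my guess at the smaller Qwen-based distill, so treat it as an assumption and swap in whichever variant fits your hardware:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: one of the published distilled checkpoints; pick whichever fits your GPU.
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many primes are there between 10 and 30?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The model emits its chain of thought inside <think></think> tags before the final answer,
# so leave plenty of room for new tokens.
output_ids = model.generate(input_ids, max_new_tokens=2048)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```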
Thoughts
There are two really important things about this model:
It’s an open, MIT licensed reasoner — the first such model around, and it’s almost at o1 level.
It was trained via reinforcement learning — a huge paradigm shift from how most LLMs have been trained in the past.
So let’s focus on that second point, since this is a huge change from the current, accepted way that these models are trained. According to the R1 paper, it’s trained using Group Relative Policy Optimization:
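Roughly, the GRPO objective looks like this. This is my simplified transcription of the formulation in the paper (dropping the per-token indexing), so the notation may differ slightly: you sample a group of G outputs for each question, score each one, and normalize the rewards within the group to get advantages:

$$
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\left(\min\!\left(\rho_i A_i,\ \operatorname{clip}(\rho_i,\,1-\varepsilon,\,1+\varepsilon)\,A_i\right) - \beta\, \mathbb{D}_{\text{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)\right)\right],
\qquad \rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}
$$

$$
A_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}
$$

The group-relative advantage is the trick: because each output is scored against its siblings for the same question, there's no need for a separate learned value model, which keeps the whole thing cheap.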
And that’s basically it. They start with a strong (but not crazy unique) base model, and they apply reward functions (sketched after this list) which encourage:
Getting verifiable coding and mathematics questions correct
Getting the format correct, i.e. using the correct <think></think> tags
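To make that concrete, here's a minimal sketch of what a rule-based reward in that spirit might look like. The function names, the \boxed{} convention for extracting answers, and the exact scoring are my own illustration, not Deepseek's actual code:

```python
# Hypothetical sketch of a rule-based reward in the spirit of R1's training.
import re


def format_reward(completion: str) -> float:
    """Reward the model for wrapping its reasoning in <think></think> tags."""
    pattern = r"^<think>.*?</think>.*$"
    return 1.0 if re.match(pattern, completion.strip(), flags=re.DOTALL) else 0.0


def accuracy_reward(completion: str, reference_answer: str) -> float:
    """Reward verifiable correctness, e.g. a math answer written inside \\boxed{...}."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0


def total_reward(completion: str, reference_answer: str) -> float:
    # Purely rule-based: no learned reward model grades the answer.
    return accuracy_reward(completion, reference_answer) + format_reward(completion)
```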
They do not use a neural network to guide this; there’s no GPT sitting around grading answers here. Otherwise there’s no way it would be scalable. So it’s really interesting that this model actually captures semantic intelligence to the extent it does — as, after all, its reinforcement learning is pushing it towards just verifiable stuff.
It’s sort of amazing that this behavior is present in a model that’s so heavily trained on verifiable problems like math and coding, though they do mix in some more general question-and-answer data in later stages of the training process.
The Aha Moment
One really interesting thing that Deepseek R1 does — and you can see it in a lot of the chat traces, including the one above — is that it will often stop and reassess what it has already been thinking about. This “aha moment” is something that appears midway through training, and it’s really fascinating. It’s a big part of Deepseek R1’s adorable personality, as expressed by those long inner monologues.

There’s a nice writeup on how the training works from Nathan Lambert on Interconnects. And, of course, you can check out their paper. I’d recommend reading those for more details, especially on how the training actually works beyond the very abstract summary (“it uses GRPO”) that I gave here.