OpenAI recently dropped GPT-OSS, its
first open-weights model since GPT-2 in
2019. It's one of the highest-profile
open-source model launches since
DeepSeek R1 made waves back in January.
But how does GPT-OSS compare to the
other top open-source models out there
architecturally? Let's find out.
[Music]
GPT-OSS is one of OpenAI's most
anticipated recent launches: a large,
fully open-weights model from one of the
leading American AI labs. Let's take a
closer look at the paper to find out how
it was actually engineered and trained.
GPT-OSS is a mixture-of-experts model
available in two sizes, 120 billion
parameters and 20 billion parameters.
Each token activates the top four
experts, meaning only a portion of the
total parameters are used at any given
time. This allows for efficient
inference without sacrificing the
benefits of a larger model. Trained as a
decoder-only transformer, GPT-OSS
incorporates plenty of features typical
of modern LLMs. This includes grouped
query attention, a modified attention
mechanism that lets multiple query heads
share the same key-value pairs to reduce
memory use and speed up inference. It
also includes SwiGLU activations in
the feed-forward network layers, which
allow for more nuanced transformations
than simpler activations like ReLU, as
well as rotary positional embeddings, or
RoPE, which encode token position
directly into the attention mechanism to
support longer contexts. Finally, the
model also makes use of RMSNorm with
pre-normalization, a normalization
method that scales inputs by their root
mean square for more stable training.
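To make a few of those pieces concrete, here is a minimal NumPy sketch of a pre-normalized mixture-of-experts block: RMSNorm first, then a router that picks the top four experts per token, with each expert being a SwiGLU feed-forward network. This is an illustration under stated assumptions, not GPT-OSS's actual code: the function names, the tiny dimensions, and the choice to softmax over only the selected experts' scores are all assumptions made for the example.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """Scale each vector by the reciprocal of its root mean square."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def silu(x):
    """SiLU (swish) activation used inside SwiGLU."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: a SiLU-gated linear unit, then a down projection."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def moe_block(x, norm_weight, router_w, experts, top_k=4):
    """Pre-norm MoE block: normalize, route each token to top_k experts, add residual."""
    h = rms_norm(x, norm_weight)
    logits = h @ router_w                               # (tokens, num_experts)
    top_idx = np.argsort(logits, axis=-1)[:, -top_k:]   # top_k expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = top_idx[t]
        scores = logits[t, chosen]
        gates = np.exp(scores - scores.max())
        gates /= gates.sum()                            # assumption: softmax over chosen experts only
        for gate, e in zip(gates, chosen):
            out[t] += gate * swiglu_ffn(h[t], *experts[e])
    return x + out, top_idx

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d_model, d_ff, num_experts, tokens = 16, 32, 8, 5
x = rng.normal(size=(tokens, d_model))
norm_weight = np.ones(d_model)
router_w = rng.normal(size=(d_model, num_experts))
experts = [
    (rng.normal(size=(d_model, d_ff)) * 0.1,   # gate projection
     rng.normal(size=(d_model, d_ff)) * 0.1,   # up projection
     rng.normal(size=(d_ff, d_model)) * 0.1)   # down projection
    for _ in range(num_experts)
]
y, top_idx = moe_block(x, norm_weight, router_w, experts)
print(y.shape, top_idx.shape)
```

Only four of the eight toy experts run for any given token, which is the sense in which a mixture-of-experts model uses just a portion of its parameters at a time.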
One standout capability of the model is
its 131,000-token context window, which
it achieves by applying YaRN scaling
during pre-training rather than as an
inference-time adjustment. We'll touch
on what this means a little bit later in
the video. For GPT-OSS, OpenAI makes
use of their open-source o200k_harmony
tokenizer. This byte pair encoding
tokenizer has over 200,000 tokens and
builds on the o200k tokenizer used in
models like GPT-4o. As for the dataset
GPT-OSS was trained on, OpenAI has only
disclosed the broad strokes. The model
was trained on a text-only corpus in the
trillions of tokens with a focus on STEM,
coding, and general knowledge. Harmful
content was filtered out for safety, but
beyond that, there's little else known
publicly. Once training was complete,
the model was released in a quantized
format by default, making it lightweight
enough for deployment on modest
hardware. This allows it to be run on
consumer grade GPUs, laptops, or other
resource limited hardware. However,
there's no unquantized version
available. GPT-OSS also underwent
substantial post-training for safety and
alignment, shaping its default behavior
for more controlled outputs. It's worth
noting that some in the open source
community are experimenting with
reducing or removing these layers in
order to explore the raw model's
capabilities. In the broader landscape
of open-source AI, GPT-OSS arrives as a
fully equipped long-context model ready
for immediate use. As impressive as it
is, however, it's just one of several
models in a rapidly expanding field of
open-source LLMs. Qwen 3, the newest
family of models developed by Alibaba
Cloud, dropped this past April to
considerable hype, with benchmark scores
that rivaled those of leading
open-source base models like DeepSeek V3 or
Llama 4. The Qwen 3 family includes both
dense models, which activate all of
their parameters for each query, and
mixture-of-experts models, which only
activate a small subset of their
parameters for each query. The dense
models come in six different size
classes, including a 0.6 billion parameter
model, one of the smallest current-generation
open-weight models around,
while the mixture-of-experts models come
in two different size classes.
Architecturally, Qwen 3
dense models are very similar to the
Qwen 2.5 models, Alibaba's previous
releases. Like Qwen 2.5 and GPT-OSS, Qwen
3 incorporates features like grouped query
attention, SwiGLU, RoPE, and RMSNorm.
Qwen 3's sparse models share the same
fundamental architecture as its dense
models, but add a mixture-of-experts
layer with 128 total experts, of which
eight are activated per token. All Qwen
3 models also use the same tokenizer
used in previous Qwen models, which
implements byte-level byte pair encoding
that allows it to handle any text or
symbol without special pre-processing,
unlike word or character-based
tokenizers. One of the main things that
sets Qwen 3 apart from previous Qwen
models is the way it controls the scale
of the query, key, and value projections
to keep attention scores stable at scale.
It replaces QKV bias, a static offset
that shifts the QKV projections in previous
models, with QK-Norm, a normalization
step that dynamically rescales the
query and key vectors to maintain
constant magnitudes. Dataset-wise, Qwen
3 was trained on 36 trillion
pre-training tokens, twice as many as
the Qwen 2.5 models. In addition to
pulling data from multilingual texts,
STEM and coding sources, and reasoning
tasks, Qwen 3 also uses Qwen 2.5 models
to generate trillions of tokens of
synthetic data in different formats like
textbooks, instructions, and code
snippets. Qwen 3's pre-training occurred
in three stages. In stage one, the
general stage, models were trained on
over 30 trillion tokens covering 119
languages at a sequence length of 4096
tokens. In stage two, the reasoning
stage, models were trained on an
additional 5 trillion higher quality
tokens featuring more STEM, reasoning, and
coding problems. And in stage three,
which the Qwen team calls the long
context stage, context length was
extended to over 32,000 tokens using a
bunch of clever algorithmic
optimizations, including ABF, a
technique to adjust RoPE so positional
signals remain accurate over much longer
sequences, YaRN to further scale for
longer inputs, and Dual Chunk Attention
to process sequences efficiently.
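As a rough illustration of what adjusting RoPE's base frequency does, here is a small NumPy sketch. It rotates pairs of vector components by position-dependent angles, checks that the query-key dot product depends only on the distance between positions, and compares the slowest rotation's wavelength at two base values. The function names and the specific bases (10,000 and 1,000,000) are assumptions chosen for the example, not figures from this video, and this covers only the base-frequency idea, not YaRN or Dual Chunk Attention.

```python
import numpy as np

def rope_inv_freq(head_dim, base):
    """One inverse frequency per pair of dimensions; higher pairs rotate more slowly."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def apply_rope(x, position, base=10_000.0):
    """Rotate consecutive (even, odd) pairs of x by angles proportional to position."""
    angles = position * rope_inv_freq(x.shape[-1], base)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

def longest_wavelength(head_dim, base):
    """Number of positions before the slowest-rotating pair completes a full turn."""
    return 2 * np.pi / rope_inv_freq(head_dim, base)[-1]

rng = np.random.default_rng(0)
q, k = rng.normal(size=128), rng.normal(size=128)

# Same offset (2 positions apart) at two different absolute locations.
near = apply_rope(q, 5) @ apply_rope(k, 3)
far = apply_rope(q, 105) @ apply_rope(k, 103)
print(np.isclose(near, far))

# Raising the base stretches the slowest wavelength (illustrative values).
print(longest_wavelength(128, 10_000.0) < longest_wavelength(128, 1_000_000.0))
```

The second check is the intuition behind raising the base: with a larger base, the slowest-rotating dimensions take far more positions to wrap around, so distant tokens still get distinguishable positional signals.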
Together, all of these optimizations
allow the model to reason over much
longer inputs at inference. Finally,
Qwen uses a four-step post-training
pipeline with two goals: giving users
more control over how much reasoning to
use for a given query and letting them
efficiently distill larger model
capabilities into smaller models. The
first step in the post-training
pipeline is a long chain-of-thought cold
start stage, which involves feeding a
model a curated dataset of challenging
reasoning problems from math, logic, and
STEM with verifiable reference answers
and then filtering outputs to ensure
quality. This is followed by a reasoning
RL stage using GRPO, an RL algorithm
originally developed by DeepSeek
researchers, on roughly 4,000
query-verifier pairs to strengthen complex
problem solving. Personally, I think
it's fascinating that it only takes
4,000 pairs to get great results. The
third step in the post-training
pipeline, thinking mode fusion, is a key
Qwen 3 innovation that integrates
reasoning and non-reasoning into a
single model, letting users switch modes
without changing models. Essentially,
what developers did in this step was
fine-tune the model on a mix of thinking
data, which includes intermediate
reasoning steps, and non-thinking data,
which omits them, and then build a chat
interface to let users toggle modes.
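Here is a hedged sketch of how that mix of data and the toggle could look in plain Python. Everything in it is hypothetical: the `<think>` tag strings, the `User:`/`Assistant:` role markers, and both helper functions are stand-ins for illustration, not Qwen's actual chat template or training code. The idea it shows is that non-thinking examples carry an empty reasoning block, so pre-filling that empty block at inference time nudges the model to answer directly.

```python
# Hypothetical tag strings, used here only for illustration.
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def make_training_example(question, answer, reasoning=None):
    """Format one fine-tuning example; reasoning=None gives a non-thinking example."""
    thoughts = reasoning if reasoning is not None else ""
    target = f"{THINK_OPEN}\n{thoughts}\n{THINK_CLOSE}\n\n{answer}"
    return {"prompt": question, "target": target}

def render_prompt(question, thinking=True):
    """Build the text the model continues from at inference time."""
    prompt = f"User: {question}\nAssistant: "
    if not thinking:
        # Pre-fill an empty reasoning block so the model goes straight to the answer.
        prompt += f"{THINK_OPEN}\n\n{THINK_CLOSE}\n\n"
    return prompt

thinking_example = make_training_example("What is 2 + 2?", "4", reasoning="2 + 2 = 4")
direct_example = make_training_example("What is 2 + 2?", "4")
print(thinking_example["target"])
print(repr(render_prompt("What is 2 + 2?", thinking=False)))
```

Because both kinds of example share one format, a single set of weights can learn both behaviors, and the switch becomes a prompt-level detail rather than a separate model.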
Though this was unique to Qwen when the
model first launched, GPT-5 now features
a similar toggle. The final step,
general RL, broadens capabilities in
instruction following, formatting,
preference alignment, tool use, and
specialized scenarios. Qwen's
developers then use strong-to-weak
distillation, which allows for the
training of smaller models from larger
ones. All in all, Qwen 3's performance
is very impressive, especially given its
relatively small size. But just months
earlier, a different model had already
raised the stakes in open source.
Released in December of last year,
DeepSeek's V3 model was one of the most
ambitious open-source LLMs to come out
of a major lab in recent years.
The chatbot developed in China called
DeepSeek.
DeepSeek is such a fundamental change to
the economics of what's going on.
The most downloaded free app in the US.
This is an update in what people think
is possible. At 671 billion parameters,
it's a massive general-purpose base model
designed for efficiency as much as
capability laying the groundwork for the
reasoning focused R1 model that would
follow. We're not going to get into a
ton of detail about V3's architecture or
training pipeline here because we put
out a comprehensive deep dive into it
back in February. But at a high level, the
thing to know about V3 is that it's a
mixture-of-experts model with several
hardware and algorithmic optimizations,
including training V3 natively in 8-bit
rather than 16- or 32-bit, a huge unlock
for cutting training costs. And just
recently, DeepSeek pushed V3 even
further with an updated version. The
newly released V3.1 builds directly
on the original V3 base checkpoint,
extending it with a two-phase long
context training approach and adding a
hybrid thinking mode that lets the same
model switch between reasoning heavy and
lightweight inference. It also improves
tool use and agent performance thanks to
more advanced post-training. In
practice, this means V3.1 keeps the same
core architecture as V3, but delivers
stronger reasoning, smarter tool use,
and greater performance. One thing that
sets V3 apart is that it uses a
different attention mechanism than GPT-OSS
and Qwen 3. In modern LLMs, a lot of the
compute and memory is tied up in the KV
cache, and so V3 makes use of multi-head
latent attention, or MLA, which
compresses keys and values into a
smaller latent space before caching
them, then decompresses them during
inference. Although MLA is a bit more
complex to implement, the previous
DeepSeek V2 paper found it delivers
greater memory savings and better
modeling performance than GQA,
especially in huge long-context models
like this one. And that's just one of
several areas where DeepSeek V3 takes a
different path. With all that in mind,
let's take a step back from V3 to Qwen
to GPT-OSS. How should we think, at
a high level, about the differences between
these models? One big difference is
size. The Qwen 3 model family is the
only one of the three to offer both
dense and mixture-of-experts variants,
with dense models from 0.6 billion to 32
billion parameters and a mixture-of-experts
lineup that includes a 30
billion parameter model and a 235
billion parameter model. Notably, Qwen's
mixture-of-experts base models matched
the dense models' performance with only a
fifth as many active parameters. On the
other hand, DeepSeek V3 only comes in a
mixture-of-experts architecture with 671
billion parameters, of which 37 billion
are activated for a given token
prediction, so it's considerably larger than
even the biggest Qwen 3 model. GPT-OSS
sits in the middle. It offers two MoE
models: one with 117 billion parameters,
of which 5.1 billion are activated for a
given token, and a smaller one with 21
billion parameters, of which 3.6 billion
are activated for a given token. One of
the most interesting technical
differences lies in how each model
extends its context length. YaRN, short
for Yet another RoPE extensioN, is a
technique for stretching the model's
rotary positional embeddings so that it
can handle far longer sequences than it
was originally trained on. Normally RoPE
starts to break down when you feed it
more tokens than its base frequency was
set for. But YaRN tweaks that frequency,
so the same embedding space covers much
more ground. What's interesting is how
the three models here use it
differently. GPT-OSS applies YaRN right
from pre-training. So its weights have
learned to work natively with 131,000
token contexts. DeepSeek takes a staged
approach, fine-tuning after pre-training
to first reach 32,000 tokens, then
further training to achieve 128,000.
Qwen also fine-tunes to 32,000, but
skips that additional retraining step.
Instead, at inference time, they apply
YaRN scaling again, with a scaling
factor of four, to
reach 128,000 tokens without extra
retraining. In other words, GPT-OSS is
born with long-context ability, DeepSeek
is trained into it step by step, and
Qwen pushes the limits of what a model
trained at 32,000 tokens can do without more long
context training. Personally, I think
one of the most interesting things about
these papers and the state of the art in
deep learning more generally is that a
lot of these read as empirical findings.
Each lab describes the combination of
tools that works well for them, but
almost no one gives a first principles
justification of why one tool is better
than the other. For instance, why MLA is
better than GQA full stop. This is much
different from domains like math or
theoretical physics which are all about
providing first principles explanations
that derive results from axioms or laws.
Also, it's interesting that even though
most of these models have similar
topline benchmark statistics and use
broadly the same tools like attention
mechanisms, activation functions,
positional embeddings, and so on, they
achieve these similar results using
often very different techniques. This is
quite surprising. You'd expect that very
different training methods would lead to
very different results. Also, all of the
major models heavily use reinforcement
learning as part of the post-training and
reasoning portions of their model
training efforts. And it's fascinating
and pretty surprising how some of these
RL efforts require very small amounts
of data: just 4,000 data pairs in the
case of Qwen. Another point here is
that it's very opaque what the
differences in data sets are between the
labs. It's clear from the papers that
there's an enormous amount of work
happening behind the scenes in data set
engineering. This work is probably a
significant aspect of the moat that
makes these companies comfortable
releasing their models. It's very
difficult to replicate what they're
releasing. So the big takeaway when
reading these papers is you shouldn't
focus too much on just the benchmark
performance or topline stats like
context size. Instead, look at the
specific methods that these labs are
using to achieve those results. There
are tons of high-performing open-source
models that we didn't discuss in this
video, like Kimi K2 or Google's Gemma 3.
But when you peek under the hood of many
of these, you'll find nuanced differences
that I find really interesting. I hope
this gives you a framework for how to
understand the latest open source
releases and gives you a toolkit to
start tinkering with them yourself.
Thanks for watching. See you in the next
episode.
[Music]