OpenAI recently dropped GPT-OSS, its 00:00
first open-weights model since GPT-2 in 00:02
2019. It's one of the highest-profile 00:05
open-source model launches since 00:07
DeepSeek R1 made waves back in January. 00:08
But how does GPT-OSS compare to the 00:11
other top open-source models out there 00:12
architecturally? Let's find out. 00:14
[Music] 00:18
GPT-OSS is one of OpenAI's most 00:23
anticipated recent launches: a large, 00:25
fully open-weights model from one of the 00:27
leading American AI labs. Let's take a 00:29
closer look at the paper to find out how 00:31
it was actually engineered and trained. 00:33
GPT-OSS is a mixture-of-experts model 00:35
available in two sizes, 120 billion 00:37
parameters and 20 billion parameters. 00:40
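As a rough illustration of what a mixture-of-experts layer does, here is a minimal sketch of top-k routing in plain Python. The expert count, the toy experts, the router scores, and the choice to softmax over only the selected experts are all assumptions made for this example, not details taken from the GPT-OSS paper.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts on x and mix their outputs by weight."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy setup: 8 hypothetical experts, each of which just scales its input.
experts = [lambda v, s=s: [s * vi for vi in v] for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 1.0]  # made-up router scores

routes = top_k_route(logits, k=4)
y = moe_forward([1.0, -2.0], experts, logits, k=4)
```

Only four of the eight toy experts are evaluated for this token, which is the sense in which a sparse model uses a fraction of its parameters per token.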
Each token activates the top four 00:42
experts, meaning only a portion of the 00:44
total parameters are used at any given 00:45
time. This allows for efficient 00:47
inference without sacrificing the 00:48
benefits of a larger model. Trained as a 00:50
decoder-only transformer, GPT-OSS 00:52
incorporates plenty of features typical 00:54
of modern LLMs. These include grouped-query 00:56
attention, a modified attention 00:58
mechanism that lets multiple query heads 01:00
share the same key-value pairs to reduce 01:02
memory use and speed up inference. It 01:04
also includes SwiGLU activations in 01:06
the feed-forward network layers, which 01:08
allow for more nuanced transformations 01:09
than simpler activations like ReLU, as 01:11
well as rotary positional embeddings, or 01:13
RoPE, which encode token position 01:15
directly into the attention mechanism to 01:17
support longer contexts. Finally, the 01:19
model also makes use of RMSNorm with 01:21
pre-normalization, a normalization 01:23
method that scales inputs by their root 01:25
mean square for more stable training. 01:27
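To make two of these pieces concrete, here is a minimal sketch of RMSNorm and a SwiGLU feed-forward layer combined in a pre-normalization residual block. It uses plain Python lists; the helper names, the tiny dimensions, and the weight values are hypothetical and chosen only for illustration.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root mean square, then apply a learned per-element gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def pre_norm_ffn_block(x, gain, w_gate, w_up, w_down):
    """Pre-normalization: normalize first, then add the result to the residual."""
    update = swiglu_ffn(rms_norm(x, gain), w_gate, w_up, w_down)
    return [xi + ui for xi, ui in zip(x, update)]

# Made-up weights: model width 2, hidden width 3.
x = [3.0, 4.0]
gain = [1.0, 1.0]
w_gate = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
w_up = [[0.2, 0.1], [-0.3, 0.6], [0.7, -0.5]]
w_down = [[0.3, -0.1, 0.2], [0.4, 0.5, -0.6]]

normed = rms_norm(x, gain)
y = pre_norm_ffn_block(x, gain, w_gate, w_up, w_down)
```

After normalization the vector has a mean square of roughly 1 regardless of its original scale, which is the property that helps keep training stable.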
One standout capability of the model is 01:29
its 131,000 token context window, which 01:31
it achieves by applying YaRN scaling 01:33
during pre-training rather than as an 01:35
inference time adjustment. We'll touch 01:37
on what this means a little bit later in 01:39
the video. For GPT-OSS, OpenAI makes 01:40
use of their open-source o200k_harmony 01:43
tokenizer. This byte pair encoding 01:45
tokenizer has over 200,000 tokens and 01:47
builds on the o200k tokenizer used in 01:50
models like GPT-4o. As for the dataset 01:52
GPT-OSS was trained on, OpenAI has only 01:54
disclosed the broad strokes. The model 01:57
was trained on a text-only corpus in the 01:59
trillions of tokens with a focus on STEM, 02:01
coding, and general knowledge. Harmful 02:03
content was filtered out for safety, but 02:05
beyond that, there's little else known 02:07
publicly. Once training was complete, 02:08
the model was released in a quantized 02:10
format by default, making it lightweight 02:11
enough for deployment on modest 02:13
hardware. This allows it to be run on 02:15
consumer-grade GPUs, laptops, or other 02:16
resource-limited hardware. However, 02:19
there's no unquantized version 02:21
available. GPT-OSS also underwent 02:22
substantial post-training for safety and 02:24
alignment, shaping its default behavior 02:26
for more controlled outputs. It's worth 02:28
noting that some in the open-source 02:30
community are experimenting with 02:31
reducing or removing these layers in 02:33
order to explore the raw model's 02:35
capabilities. In the broader landscape 02:36
of open-source AI, GPT-OSS arrives as a 02:38
fully equipped long-context model ready 02:41
for immediate use. As impressive as it 02:43
is, however, it's just one of several 02:45
models in a rapidly expanding field of 02:47
open-source LLMs. Qwen 3, the newest 02:49
family of models developed by Alibaba 02:52
Cloud, dropped this past April to 02:53
considerable hype with benchmark scores 02:55
that rivaled those of leading open-source 02:57
base models like DeepSeek V3 or 02:59
Llama 4. The Qwen 3 family includes both 03:01
dense models, which activate all of 03:03
their parameters for each query, and 03:05
mixture-of-experts models, which only 03:06
activate a small subset of their 03:08
parameters for each query. The dense 03:10
models come in seven different size 03:11
classes, including a 0.6 billion parameter 03:13
model, one of the smallest current 03:15
generation open-weight models around, 03:17
while the MoE models come in two different 03:19
size classes. Architecturally, Qwen 3 03:21
dense models are very similar to the 03:23
Qwen 2.5 models, Alibaba's previous 03:24
releases. Like Qwen 2.5 and GPT-OSS, Qwen 03:26
3 incorporates features like grouped-query 03:29
attention, SwiGLU, RoPE, and RMSNorm. 03:31
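Since grouped-query attention shows up in both model families, a small sketch may help. The function below computes attention for a single new token in plain Python, with several query heads reading from one shared key-value head. The head counts, vector sizes, and numbers are hypothetical; real implementations work on batched tensors and add masking and output projections.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def grouped_query_attention(queries, keys, values):
    """Attention output for one new token.

    queries: [n_q_heads][d]       one query vector per query head
    keys:    [n_kv_heads][T][d]   cached keys, one sequence per KV head
    values:  [n_kv_heads][T][d]   cached values, one sequence per KV head
    Each group of n_q_heads // n_kv_heads query heads shares one KV head.
    """
    n_q, n_kv = len(queries), len(keys)
    assert n_q % n_kv == 0, "query heads must divide evenly into KV heads"
    group = n_q // n_kv
    d = len(queries[0])
    outputs = []
    for head, q in enumerate(queries):
        kv = head // group  # index of the KV head this query head reads
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys[kv]]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values[kv]))
                        for j in range(d)])
    return outputs

# 4 query heads share 2 KV heads over a 3-token cache (made-up numbers).
queries = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
keys = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
        [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]]
values = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
          [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]]

outs = grouped_query_attention(queries, keys, values)
```

In this toy configuration the cache holds two KV heads instead of four, so it is half the size it would be with one KV head per query head.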
Qwen 3's sparse models share the same 03:34
fundamental architecture as its dense 03:36
models, but add a mixture-of-experts 03:37
layer with 128 total experts of which 03:39
eight are activated per token. All Qwen 03:41
3 models also use the same tokenizer 03:44
used in previous Qwen models, which 03:46
implements byte-level byte pair encoding 03:47
that allows it to handle any text or 03:49
symbol without special preprocessing, 03:51
unlike word or character-based 03:52
tokenizers. One of the main things that 03:53
sets Qwen 3 apart from previous Qwen 03:55
models is the way it controls the scale 03:57
of the key, query, and value projections 03:59
to keep attention scores stable at scale. 04:01
It replaces QKV bias, a static offset 04:03
that shifts QKV projections in previous 04:06
models, with QK-Norm, a normalization 04:09
step that dynamically rescales the 04:11
query and key vectors to maintain 04:13
constant magnitudes. Dataset-wise, Qwen 04:15
3 was trained on 36 trillion 04:18
pre-training tokens, twice as many as 04:19
the Qwen 2.5 models. In addition to 04:21
pulling data from multilingual texts, 04:23
STEM and coding sources, and reasoning 04:25
tasks, Qwen 3 also uses Qwen 2.5 models 04:27
to generate trillions of tokens of 04:30
synthetic data in different formats like 04:31
textbooks, instructions, and code 04:33
snippets. Qwen 3's pre-training occurred 04:35
in three stages. In stage one, the 04:37
general stage, models were trained on 04:39
over 30 trillion tokens covering 119 04:40
languages at a sequence length of 4096 04:43
tokens. In stage two, the reasoning 04:45
stage, models were trained on an 04:47
additional 5 trillion higher quality 04:48
tokens featuring more STEM, reasoning, and 04:50
coding problems. And in stage three, 04:52
which the Qwen team calls the long 04:54
context stage, context length was 04:55
extended to over 32,000 tokens using a 04:57
bunch of clever algorithmic 05:00
optimizations, including ABF, a 05:01
technique to adjust RoPE so positional 05:03
signals remain accurate over much longer 05:05
sequences, YaRN to further scale for 05:06
longer inputs, and Dual Chunk Attention 05:09
to process sequences efficiently. 05:11
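As a sketch of why adjusting the RoPE base frequency matters, the snippet below applies rotary embeddings in plain Python and compares the wavelengths produced by two bases. The specific bases (10,000 and 1,000,000) and the head dimension of 128 are illustrative assumptions, not values stated in the video.

```python
import math

def rope_wavelengths(head_dim, base):
    """Wavelength, in tokens, of each rotary frequency pair."""
    return [2 * math.pi * base ** (2 * i / head_dim)
            for i in range(head_dim // 2)]

def rope_rotate(vec, position, base):
    """Rotate each consecutive pair of vec by an angle that grows with position."""
    d = len(vec)
    out = list(vec)
    for i in range(d // 2):
        angle = position * base ** (-2 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# A larger base stretches the slowest rotations over many more tokens.
short_base = max(rope_wavelengths(128, 10_000))
long_base = max(rope_wavelengths(128, 1_000_000))

# Attention scores depend only on the distance between positions.
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
near = dot(rope_rotate(q, 5, 10_000), rope_rotate(k, 3, 10_000))
far = dot(rope_rotate(q, 12, 10_000), rope_rotate(k, 10, 10_000))
```

Here `near` and `far` agree up to floating-point error because both pairs of positions are two tokens apart, and `long_base` is much larger than `short_base`, meaning positions far apart still get distinguishable rotations.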
Together, all of these optimizations 05:12
allow the model to reason over much 05:14
longer inputs at inference. Finally, 05:15
Qwen uses a four-step post-training 05:18
pipeline with two goals: giving users 05:20
more control over how much reasoning to 05:22
use for a given query and letting them 05:24
efficiently distill larger model 05:26
capabilities into smaller models. The 05:28
first step in the post-training 05:30
pipeline is a long chain-of-thought 05:31
cold-start stage, which involves feeding a 05:33
model a curated dataset of challenging 05:34
reasoning problems from math, logic, and 05:36
STEM with verifiable reference answers 05:38
and then filtering outputs to ensure 05:41
quality. This is followed by a reasoning 05:42
RL stage using GRPO, an RL algorithm 05:44
originally developed by DeepSeek 05:47
researchers, on roughly 4,000 05:48
query-verifier pairs to strengthen complex 05:50
problem solving. Personally, I think 05:53
it's fascinating that it only takes 05:55
4,000 pairs to get great results. The 05:56
third step in the post-training 05:58
pipeline, thinking mode fusion, is a key 05:59
Qwen 3 innovation that integrates 06:02
reasoning and non-reasoning into a 06:03
single model, letting users switch modes 06:05
without changing models. Essentially, 06:07
what developers did in this step was 06:09
fine-tune the model on a mix of thinking 06:10
data, which includes intermediate 06:12
reasoning steps, and non-thinking data, 06:14
which omits them, and then build a chat 06:15
interface to let users toggle modes. 06:17
Though this was unique to Qwen when the 06:19
model first launched, GPT-5 now features 06:21
a similar toggle. The final step, 06:23
general RL, broadens capabilities in 06:24
instruction following, formatting, 06:27
preference alignment, tool use, and 06:28
specialized scenarios. Qwen's 06:30
developers then use strong-to-weak 06:32
distillation, which allows for the 06:34
training of smaller models from larger 06:36
ones. All in all, Qwen 3's performance 06:37
is very impressive, especially given its 06:40
relatively small size. But just months 06:41
earlier, a different model had already 06:43
raised the stakes in open source. 06:45
Released in December of last year, 06:46
DeepSeek's V3 model was one of the most 06:48
ambitious open-source LLMs to come out 06:50
of a major lab in recent years. 06:52
The chatbot developed in China called 06:53
DeepSeek. 06:56
DeepSeek is such a fundamental change to 06:57
the economics of what's going on. 06:58
The most downloaded free app in the US. 07:00
This is an update in what people think 07:03
is possible. At 671 billion parameters, 07:05
it's a massive general-purpose base model 07:08
designed for efficiency as much as 07:10
capability, laying the groundwork for the 07:12
reasoning-focused R1 model that would 07:14
follow. We're not going to get into a 07:16
ton of detail about V3's architecture or 07:17
training pipeline here because we put 07:19
out a comprehensive deep dive into it 07:21
back in February. But at a high level, the 07:22
thing to know about V3 is that it's a 07:24
mixture-of-experts model with several 07:26
hardware and algorithmic optimizations, 07:28
including training V3 natively in 8-bit 07:30
rather than 16- or 32-bit, a huge unlock 07:32
for cutting training costs. And just 07:35
recently, DeepSeek pushed V3 even 07:36
further with an updated version. The 07:38
newly released V3.1 builds directly 07:40
on the original V3 base checkpoint, 07:43
extending it with a two-phase long 07:45
context training approach and adding a 07:47
hybrid thinking mode that lets the same 07:48
model switch between reasoning-heavy and 07:50
lightweight inference. It also improves 07:52
tool use and agent performance thanks to 07:54
more advanced post-training. In 07:56
practice, this means V3.1 keeps the same 07:58
core architecture as V3, but delivers 08:01
stronger reasoning, smarter tool use, 08:03
and greater performance. One thing that 08:05
sets V3 apart is that it uses a 08:07
different attention mechanism than GPT-OSS 08:09
and Qwen 3. In modern LLMs, a lot of the 08:11
compute and memory is tied up in the KV 08:14
cache, and so V3 makes use of MLA, which 08:16
compresses keys and values into a 08:19
smaller latent space before caching 08:21
them, then decompresses them during 08:22
inference. Although MLA is a bit more 08:24
complex to implement, the previous 08:26
DeepSeek V2 paper found it delivers 08:28
greater memory savings and better 08:29
modeling performance than GQA, 08:31
especially in huge long-context models 08:33
like this one. And that's just one of 08:35
several areas where DeepSeek V3 takes a 08:36
different path. With all that in mind, 08:38
let's take a step back from V3 to Qwen 08:40
to GPT-OSS. How should we think about, at 08:42
a high level, the differences between 08:45
these models? One big difference is 08:46
size. The Qwen 3 model family is the 08:48
only one of the three to offer both 08:50
dense and mixture-of-experts variants, 08:52
with dense models from 0.6 billion to 32 08:54
billion parameters and a mixture-of- 08:56
experts lineup that includes a 30 08:57
billion parameter model and a 235 08:59
billion parameter model. Notably, Qwen's 09:01
mixture-of-experts base models matched 09:03
the dense models' performance with only a 09:05
fifth as many active parameters. On the 09:07
other hand, DeepSeek V3 only comes in a 09:09
mixture-of-experts architecture with 671 09:11
billion parameters, of which 37 billion 09:13
are activated for a given token 09:16
prediction, so it's considerably larger than 09:17
even the biggest Qwen 3 model. GPT-OSS 09:19
sits in the middle. It offers two MoE 09:21
models: one with 117 billion parameters, 09:23
of which 5.1 billion are activated for a 09:26
given token, and a smaller one with 21 09:28
billion parameters, of which 3.6 billion 09:30
are activated for a given token. One of 09:32
the most interesting technical 09:33
differences lies in how each model 09:34
extends its context length. YaRN, short 09:36
for Yet another RoPE extensioN, is a 09:38
technique for stretching the model's 09:40
rotary positional embeddings so that it 09:41
can handle far longer sequences than it 09:43
was originally trained on. Normally RoPE 09:45
starts to break down when you feed it 09:47
more tokens than its base frequency was 09:48
set for, but YaRN tweaks that frequency 09:50
so the same embedding space covers much 09:52
more ground. What's interesting is how 09:54
the three models here use it 09:56
differently. GPT-OSS applies YaRN right 09:57
from pre-training, so its weights have 10:00
learned to work natively with 131,000-token 10:01
contexts. DeepSeek takes a staged 10:04
approach, fine-tuning after pre-training 10:07
to first reach 32,000 tokens, then 10:08
further training to achieve 128,000. 10:11
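The frequency adjustment at the heart of YaRN can be sketched in a few lines. The version below follows the published YaRN formulation as best it can be reproduced here: fast-rotating dimensions are left alone, slow-rotating ones are divided by the scale factor, and a linear ramp blends the two in between. The ramp thresholds `alpha=1` and `beta=32`, the base of 10,000, the head dimension, and the 32,768 to 4x setup are assumptions for illustration and are not claimed to match any particular model's configuration.

```python
import math

def yarn_frequencies(head_dim, base, scale, orig_ctx, alpha=1.0, beta=32.0):
    """Per-pair RoPE frequencies after a YaRN-style adjustment."""
    freqs = []
    for i in range(head_dim // 2):
        theta = base ** (-2 * i / head_dim)       # original frequency
        turns = orig_ctx * theta / (2 * math.pi)  # full rotations over the original context
        # ramp = 1 keeps the frequency; ramp = 0 divides it by `scale`.
        ramp = min(1.0, max(0.0, (turns - alpha) / (beta - alpha)))
        freqs.append((1 - ramp) * theta / scale + ramp * theta)
    return freqs

def yarn_qk_scale(scale):
    """Factor applied to queries and keys to offset the longer context."""
    return 0.1 * math.log(scale) + 1.0

# Hypothetical setup: stretch a 32,768-token model by a factor of 4.
freqs = yarn_frequencies(head_dim=128, base=10_000, scale=4, orig_ctx=32_768)
qk_scale = yarn_qk_scale(4)
```

The fastest pair keeps its original frequency while the slowest pair is slowed by the full factor of 4, so local position detail is preserved and only the long-range dimensions are stretched.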
Qwen also fine-tunes to 32,000, but 10:14
skips that additional retraining step. 10:16
Instead, at inference time, they apply 10:18
YaRN scaling again, increasing the RoPE 10:20
base frequency by a factor of four to 10:22
reach 128,000 tokens without extra 10:24
retraining. In other words, GPT-OSS is 10:26
born with long-context ability, DeepSeek 10:29
is trained into it step by step, and 10:31
Qwen pushes the limits of what a 10:33
32,000-token-trained model can do without more 10:35
long-context training. Personally, I think 10:37
one of the most interesting things about 10:39
these papers and the state of the art in 10:40
deep learning more generally is that a 10:42
lot of these read as empirical findings. 10:43
Each lab describes the combination of 10:45
tools that works well for them, but 10:47
almost no one gives a first-principles 10:48
justification of why one tool is better 10:50
than the other. For instance, why MLA is 10:52
better than GQA, full stop. This is much 10:54
different from domains like math or 10:56
theoretical physics, which are all about 10:58
providing first-principles explanations 10:59
that derive results from axioms or laws. 11:01
Also, it's interesting that even though 11:04
most of these models have similar 11:05
top-line benchmark statistics and use 11:07
broadly the same tools like attention 11:08
mechanisms, activation functions, 11:10
positional embeddings, and so on, they 11:12
achieve these similar results using 11:13
often very different techniques. This is 11:15
quite surprising. You'd expect that very 11:17
different training methods would lead to 11:19
very different results. Also, all of the 11:20
major models heavily use reinforcement 11:22
learning as part of the post-training and 11:24
reasoning portions of their model 11:26
training efforts. And it's fascinating 11:27
and pretty surprising how some of these 11:29
RL efforts require very small amounts 11:30
of data, just 4,000 data pairs in the 11:32
case of Qwen. Another point here is 11:34
that it's very opaque what the 11:36
differences in data sets are between the 11:37
labs. It's clear from the papers that 11:39
there's an enormous amount of work 11:40
happening behind the scenes in data set 11:41
engineering. This work is probably a 11:43
significant aspect of the moat that 11:44
makes these companies comfortable 11:46
releasing their models. It's very 11:47
difficult to replicate what they're 11:49
releasing. So the big takeaway when 11:50
reading these papers is you shouldn't 11:52
focus too much on just the benchmark 11:53
performance or top-line stats like 11:55
context size. Instead, look at the 11:57
specific methods that these labs are 11:59
using to achieve those results. There 12:01
are tons of high-performing open-source 12:03
models that we didn't discuss in this 12:04
video, like Kimi K2 or Google Gemma 3. 12:06
But when you peek under the hood of many 12:08
of these, you'll find nuanced differences 12:10
that I find really interesting. I hope 12:12
this gives you a framework for how to 12:14
understand the latest open source 12:15
releases and gives you a toolkit to 12:16
start tinkering with them yourself. 12:18
Thanks for watching. See you in the next 12:20
episode. 12:21
[Music] 12:23
