OpenAI recently dropped GPT-OSS, its 00:00

first open-weights model since GPT-2 in 00:02

2019. It's one of the highest profile 00:05

open source model launches since 00:07

DeepSeek R1 made waves back in January. 00:08

But how does GPT-OSS compare to the 00:11

other top open source models out there 00:12

architecturally? Let's find out. 00:14

[Music] 00:18

GPT-OSS is one of OpenAI's most 00:23

anticipated recent launches: a large, 00:25

fully open-weights model from one of the 00:27

leading American AI labs. Let's take a 00:29

closer look at the paper to find out how 00:31

it was actually engineered and trained. 00:33

GPT-OSS is a mixture-of-experts model 00:35

available in two sizes, 120 billion 00:37

parameters and 20 billion parameters. 00:40
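In a mixture-of-experts layer, a router scores the experts and each token is sent only to the best few. A minimal sketch of top-k routing with k = 4 follows. The function names, the toy experts, and the choice to apply softmax after selecting the top k are illustrative assumptions, not details confirmed by the GPT-OSS paper.

```python
import math

def top_k_route(router_logits, k=4):
    """Pick the k highest-scoring experts and softmax over just those scores."""
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax restricted to the selected experts (subtract the max for stability).
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, experts, router_logits, k=4):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)  # experts that were not selected are never evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy setup (hypothetical): 8 experts, each of which just scales its input.
experts = [lambda v, s=s: [s * vi for vi in v] for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, 1.5, -0.5]
routes = top_k_route(logits, k=4)
output = moe_layer([1.0], experts, logits, k=4)
```

Only four of the eight toy experts run for this token, which is where the inference savings come from: the parameter count grows with the number of experts, but the per-token compute grows only with k.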

Each token activates the top four 00:42

experts, meaning only a portion of the 00:44

total parameters are used at any given 00:45

time. This allows for efficient 00:47

inference without sacrificing the 00:48

benefits of a larger model. Trained as a 00:50

decoder-only transformer, GPT-OSS 00:52

incorporates plenty of features typical 00:54

of modern LLMs. This includes grouped 00:56

query attention, a modified attention 00:58

mechanism that lets multiple query heads 01:00

share the same key-value pairs to reduce 01:02

memory use and speed up inference. It 01:04

also includes SwiGLU activations in 01:06

the feed-forward network layers, which 01:08

allow for more nuanced transformations 01:09

than simpler activations like ReLU, as 01:11

well as rotary positional embeddings, or 01:13

RoPE, which encode token position 01:15

directly into the attention mechanism to 01:17

support longer contexts. Finally, the 01:19

model also makes use of RMSNorm with 01:21

pre-normalization, a normalization 01:23

method that scales inputs by their root 01:25

mean square for more stable training. 01:27
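Two of the components just described, RMSNorm with pre-normalization and a SwiGLU feed-forward layer, can be sketched together as one residual block. This is a minimal pure-Python illustration of the standard textbook formulations; the function names, the epsilon value, and the tiny weight matrices are assumptions, and it does not reproduce any model-specific variant.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def silu(z):
    """SiLU (Swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: (SiLU(W_gate x) * (W_up x)) projected back by W_down."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def pre_norm_ffn_block(x, gain, w_gate, w_up, w_down):
    """Pre-normalization: normalize first, run the sublayer, then add the residual."""
    y = swiglu_ffn(rms_norm(x, gain), w_gate, w_up, w_down)
    return [xi + yi for xi, yi in zip(x, y)]

# Toy shapes (hypothetical): model width 2, hidden width 3.
x = [3.0, 4.0]
gain = [1.0, 1.0]
w_gate = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
w_up = [[0.6, 0.5], [0.4, 0.3], [0.2, 0.1]]
w_down = [[0.1, 0.0, -0.1], [0.0, 0.2, 0.0]]
block_output = pre_norm_ffn_block(x, gain, w_gate, w_up, w_down)
```

Because the normalization sits inside the residual branch, the unnormalized input is carried forward unchanged by the skip connection, which is the property usually credited for the more stable training mentioned above.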

One standout capability of the model is 01:29

its 131,000-token context window, which 01:31

it achieves by applying YaRN scaling 01:33

during pre-training rather than as an 01:35

inference-time adjustment. We'll touch 01:37

on what this means a little bit later in 01:39

the video. For GPT-OSS, OpenAI makes 01:40

use of its open-source o200k_harmony 01:43

tokenizer. This byte pair encoding 01:45

tokenizer has over 200,000 tokens and 01:47

builds on the o200k tokenizer used in 01:50

models like GPT-4o. As for the dataset 01:52

GPT-OSS was trained on, OpenAI has only 01:54

disclosed the broad strokes. The model 01:57

was trained on a text-only corpus in the 01:59

trillions of tokens with a focus on STEM, 02:01

coding, and general knowledge. Harmful 02:03

content was filtered out for safety, but 02:05

beyond that, there's little else known 02:07

publicly. Once training was complete, 02:08

the model was released in a quantized 02:10

format by default, making it lightweight 02:11

enough for deployment on modest 02:13

hardware. This allows it to be run on 02:15

consumer-grade GPUs, laptops, or other 02:16

resource-limited hardware. However, 02:19

there's no unquantized version 02:21

available. GPT-OSS also underwent 02:22

substantial post-training for safety and 02:24

alignment, shaping its default behavior 02:26

for more controlled outputs. It's worth 02:28

noting that some in the open source 02:30

community are experimenting with 02:31

reducing or removing these layers in 02:33

order to explore the raw model's 02:35

capabilities. In the broader landscape 02:36

of open-source AI, GPT-OSS arrives as a 02:38

fully equipped long-context model ready 02:41

for immediate use. As impressive as it 02:43

is, however, it's just one of several 02:45

models in a rapidly expanding field of 02:47

open source LLMs. Qwen 3, the newest 02:49

family of models developed by Alibaba 02:52

Cloud, dropped this past April to 02:53

considerable hype with benchmark scores 02:55

that rivaled those of leading open-source 02:57

base models like DeepSeek V3 or 02:59

Llama 4. The Qwen 3 family includes both 03:01

dense models, which activate all of 03:03

their parameters for each query, and 03:05

mixture-of-experts models, which only 03:06

activate a small subset of their 03:08

parameters for each query. The dense 03:10

models come in seven different size 03:11

classes, including a 0.6-billion-parameter 03:13

model, one of the smallest current 03:15

generation open-weight models around, 03:17

while the MoE models come in two different 03:19

size classes. Architecturally, Qwen 3 03:21

dense models are very similar to the 03:23

Qwen 2.5 models, Alibaba's previous 03:24

releases. Like Qwen 2.5 and GPT-OSS, Qwen 03:26

3 incorporates features like grouped-query 03:29

attention, SwiGLU, RoPE, and RMSNorm. 03:31
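Grouped-query attention, which both GPT-OSS and Qwen 3 use, saves memory because several query heads read from one shared key-value head, so fewer keys and values need to be cached. The sketch below shows the head mapping and the resulting cache size. The layer count, head counts, head dimension, and sequence length are hypothetical values chosen for round numbers, not the configuration of any model discussed here.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Return the key-value head that a given query head reads from."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the cached keys and values for one sequence."""
    # Factor of 2: one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical configuration, not taken from any of the models discussed.
N_LAYERS, N_QUERY_HEADS, N_KV_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 8, 128, 4096

# Standard multi-head attention caches one KV head per query head.
mha_bytes = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# Grouped-query attention caches only the shared KV heads.
gqa_bytes = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN)
# Which KV head each of the 32 query heads uses.
groups = [kv_head_for(h, N_QUERY_HEADS, N_KV_HEADS) for h in range(N_QUERY_HEADS)]
```

With four query heads per KV head, the cache in this toy configuration shrinks by a factor of four, and the saving scales linearly with sequence length.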

Qwen 3's sparse models share the same 03:34

fundamental architecture as its dense 03:36

models, but add a mixture-of-experts 03:37

layer with 128 total experts, of which 03:39

eight are activated per token. All Qwen 03:41

3 models also use the same tokenizer 03:44

used in previous Qwen models, which 03:46

implements byte-level byte pair encoding 03:47

that allows it to handle any text or 03:49

symbol without special pre-processing, 03:51

unlike word or character-based 03:52

tokenizers. One of the main things that 03:53

sets Qwen 3 apart from previous Qwen 03:55

models is the way it controls the scale 03:57

of the key, query, and value projections 03:59

to keep attention scores stable at scale. 04:01

It replaces QKV bias, a static offset 04:03

that shifts QKV projections in previous 04:06

models, with QK-Norm, a normalization 04:09

step that dynamically rescales the 04:11

query and key vectors to maintain 04:13

constant magnitudes. Dataset-wise, Qwen 04:15

3 was trained on 36 trillion 04:18

pre-training tokens, twice as many as 04:19

the Qwen 2.5 models. In addition to 04:21

pulling data from multilingual texts, 04:23

STEM and coding sources, and reasoning 04:25

tasks, Qwen 3 also uses Qwen 2.5 models 04:27

to generate trillions of tokens of 04:30

synthetic data in different formats like 04:31

textbooks, instructions, and code 04:33

snippets. Qwen 3's pre-training occurred 04:35

in three stages. In stage one, the 04:37

general stage, models were trained on 04:39

over 30 trillion tokens covering 119 04:40

languages at a sequence length of 4,096 04:43

tokens. In stage two, the reasoning 04:45

stage, models were trained on an 04:47

additional 5 trillion higher-quality 04:48

tokens featuring more STEM, reasoning, and 04:50

coding problems. And in stage three, 04:52

which the Qwen team calls the 04:54

long-context stage, context length was 04:55

extended to over 32,000 tokens using a 04:57

bunch of clever algorithmic 05:00

optimizations, including ABF, a 05:01

technique to adjust RoPE so positional 05:03

signals remain accurate over much longer 05:05

sequences, YaRN to further scale for 05:06

longer inputs, and Dual Chunk Attention 05:09

to process sequences efficiently. 05:11
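To see why adjusting the RoPE base frequency (ABF) helps with long sequences, it is useful to look at the wavelengths of the rotations RoPE applies. The sketch below compares two base values. The head dimension of 128 and the bases 10,000 and 1,000,000 are illustrative assumptions, not settings quoted from the transcript.

```python
import math

def rope_wavelengths(head_dim, base):
    """Wavelength, in tokens, of each RoPE rotation frequency."""
    # Dimension pair i rotates by position * base ** (-2i / head_dim) radians,
    # so it completes a full turn every 2*pi * base ** (2i / head_dim) tokens.
    return [2.0 * math.pi * base ** (2.0 * i / head_dim)
            for i in range(head_dim // 2)]

HEAD_DIM = 128  # hypothetical head dimension
short_base = rope_wavelengths(HEAD_DIM, 10_000.0)
long_base = rope_wavelengths(HEAD_DIM, 1_000_000.0)

# Ratio between the slowest rotations under the two bases.
stretch = long_base[-1] / short_base[-1]
```

The fastest rotation is identical under both bases, so nearby tokens are still distinguished the same way, while the slowest rotation becomes far longer with the larger base. Positions that would have wrapped around under the smaller base therefore remain distinguishable.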

Together, all of these optimizations 05:12

allow the model to reason over much 05:14

longer inputs at inference. Finally, 05:15

Qwen uses a four-step post-training 05:18

pipeline with two goals: giving users 05:20

more control over how much reasoning to 05:22

use for a given query and letting them 05:24

efficiently distill larger model 05:26

capabilities into smaller models. The 05:28

first step in the post-training 05:30

pipeline is a long chain-of-thought 05:31

cold-start stage, which involves feeding a 05:33

model a curated dataset of challenging 05:34

reasoning problems from math, logic, and 05:36

STEM with verifiable reference answers 05:38

and then filtering outputs to ensure 05:41

quality. This is followed by a reasoning 05:42

RL stage using GRPO, an RL algorithm 05:44

originally developed by DeepSeek 05:47

researchers, on roughly 4,000 05:48

query-verifier pairs to strengthen complex 05:50

problem solving. Personally, I think 05:53

it's fascinating that it only takes 05:55

4,000 pairs to get great results. The 05:56

third step in the post-training 05:58

pipeline, thinking mode fusion, is a key 05:59

Qwen 3 innovation that integrates 06:02

reasoning and non-reasoning into a 06:03

single model, letting users switch modes 06:05

without changing models. Essentially, 06:07

what developers did in this step was 06:09

fine-tune the model on a mix of thinking 06:10

data, which includes intermediate 06:12

reasoning steps, and non-thinking data, 06:14

which omits them, and then build a chat 06:15

interface to let users toggle modes. 06:17

Though this was unique to Qwen when the 06:19

model first launched, GPT-5 now features 06:21

a similar toggle. The final step, 06:23

general RL, broadens capabilities in 06:24

instruction following, formatting, 06:27

preference alignment, tool use, and 06:28

specialized scenarios. Qwen's 06:30

developers then use strong-to-weak 06:32

distillation, which allows for the 06:34

training of smaller models from larger 06:36

ones. All in all, Qwen 3's performance 06:37

is very impressive, especially given its 06:40

relatively small size. But just months 06:41

earlier, a different model had already 06:43

raised the stakes in open source. 06:45

Released in December of last year, 06:46

DeepSeek's V3 model was one of the most 06:48

ambitious open source LLMs to come out 06:50

of a major lab in recent years. 06:52

The chatbot developed in China called 06:53

DeepSeek. 06:56

DeepSeek is such a fundamental change to 06:57

the economics of what's going on. 06:58

The most downloaded free app in the US. 07:00

This is an update in what people think 07:03

is possible. At 671 billion parameters, it 07:05

is a massive general-purpose base model 07:08

designed for efficiency as much as 07:10

capability, laying the groundwork for the 07:12

reasoning-focused R1 model that would 07:14

follow. We're not going to get into a 07:16

ton of detail about V3's architecture or 07:17

training pipeline here because we put 07:19

out a comprehensive deep dive into it 07:21

back in February. But at a high level, the 07:22

thing to know about V3 is that it's a 07:24

mixture-of-experts model with several 07:26

hardware and algorithmic optimizations, 07:28

including training V3 natively in 8-bit 07:30

rather than 16- or 32-bit, a huge unlock 07:32

for cutting training costs. And just 07:35

recently, DeepSeek pushed V3 even 07:36

further with an updated version. The 07:38

newly released V3.1 builds directly 07:40

on the original V3 base checkpoint, 07:43

extending it with a two-phase 07:45

long-context training approach and adding a 07:47

hybrid thinking mode that lets the same 07:48

model switch between reasoning-heavy and 07:50

lightweight inference. It also improves 07:52

tool use and agent performance thanks to 07:54

more advanced post-training. In 07:56

practice, this means V3.1 keeps the same 07:58

core architecture as V3, but delivers 08:01

stronger reasoning, smarter tool use, 08:03

and greater performance. One thing that 08:05

sets V3 apart is that it uses a 08:07

different attention mechanism than GPT-OSS 08:09

and Qwen 3. In modern LLMs, a lot of the 08:11

compute and memory is tied up in the KV 08:14

cache, and so V3 makes use of MLA, which 08:16

compresses keys and values into a 08:19

smaller latent space before caching 08:21

them, then decompresses them during 08:22

inference. Although MLA is a bit more 08:24

complex to implement, the previous 08:26

DeepSeek V2 paper found it delivers 08:28

greater memory savings and better 08:29

modeling performance than GQA, 08:31

especially in huge long-context models 08:33

like this one. And that's just one of 08:35

several areas where DeepSeek V3 takes a 08:36

different path. With all that in mind, 08:38

let's take a step back from V3 to Qwen 08:40

to GPT-OSS. How should we think about, at 08:42

a high level, the differences between 08:45

these models? One big difference is 08:46

size. The Qwen 3 model family is the 08:48

only one of the three to offer both 08:50

dense and mixture-of-experts variants, 08:52

with dense models from 0.6 billion to 32 08:54

billion parameters and a mixture of 08:56

experts lineup that includes a 30 08:57

billion parameter model and a 235 08:59

billion parameter model. Notably, Qwen's 09:01

mixture-of-experts base models matched 09:03

the dense models' performance with only a 09:05

fifth as many active parameters. On the 09:07

other hand, DeepSeek V3 only comes in a 09:09

mixture-of-experts architecture, with 671 09:11

billion parameters, of which 37 billion 09:13

are activated for a given token 09:16

prediction. So considerably larger than 09:17

even the biggest Qwen 3 model. GPT-OSS 09:19

sits in the middle. It offers two MoE 09:21

models: one with 117 billion parameters, 09:23

of which 5.1 billion are activated for a 09:26

given token, and a smaller one with 21 09:28

billion parameters, of which 3.6 billion 09:30

are activated for a given token. One of 09:32

the most interesting technical 09:33

differences lies in how each model 09:34

extends its context length. YaRN, short 09:36

for yet another RoPE extension, is a 09:38

technique for stretching the model's 09:40

rotary positional embeddings so that it 09:41

can handle far longer sequences than it 09:43

was originally trained on. Normally, RoPE 09:45

starts to break down when you feed it 09:47

more tokens than its base frequency was 09:48

set for. But YaRN tweaks that frequency. 09:50
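The frequency tweak can be sketched as a per-dimension blend between the original RoPE frequency and a copy slowed down by the extension factor. The code below follows the frequency-dependent interpolation as recalled from the YaRN paper: fast-rotating dimensions are left alone and slow-rotating ones are divided by the scale. The ramp thresholds `alpha` and `beta`, the base, and the context lengths are assumptions for illustration, and YaRN's additional attention-temperature adjustment is omitted, so treat this as a partial sketch rather than a full implementation.

```python
import math

def rope_inv_freqs(head_dim, base=10_000.0):
    """Standard RoPE rotation frequencies (radians per token) for each pair."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def yarn_inv_freqs(head_dim, scale, orig_ctx, base=10_000.0, alpha=1.0, beta=32.0):
    """Blend original and interpolated frequencies, dimension by dimension."""
    scaled = []
    for theta in rope_inv_freqs(head_dim, base):
        # Number of full rotations this dimension makes over the original context.
        rotations = orig_ctx * theta / (2.0 * math.pi)
        # Ramp is 0 for slow dimensions (< alpha turns), 1 for fast ones (> beta turns).
        ramp = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        # Fast dimensions keep theta; slow dimensions are interpolated to theta / scale.
        scaled.append((1.0 - ramp) * theta / scale + ramp * theta)
    return scaled

# Hypothetical setting: stretch a 4,096-token model by a factor of 4.
plain = rope_inv_freqs(64)
stretched = yarn_inv_freqs(64, scale=4.0, orig_ctx=4096)
```

Leaving the fast dimensions untouched preserves fine-grained local ordering, while slowing only the long-wavelength dimensions lets positions beyond the original window map onto rotation angles the model has already seen.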

So the same embedding space covers much 09:52

more ground. What's interesting is how 09:54

the three models here use it 09:56

differently. GPT-OSS applies YaRN right 09:57

from pre-training, so its weights have 10:00

learned to work natively with 10:01

131,000-token contexts. DeepSeek takes a staged 10:04

approach, fine-tuning after pre-training 10:07

to first reach 32,000 tokens, then 10:08

further training to achieve 128,000. 10:11

Qwen also fine-tunes to 32,000, but 10:14

skips that additional retraining step. 10:16

Instead, at inference time, they apply 10:18

YaRN scaling again, increasing the RoPE 10:20

base frequency by a factor of four to 10:22

reach 128,000 tokens without extra 10:24

retraining. In other words, GPT-OSS is 10:26

born with long-context ability, DeepSeek 10:29

is trained into it step by step, and 10:31

Qwen pushes the limits of what a model 10:33

trained at 32,000 tokens can do without more 10:35

long-context training. Personally, I think 10:37

one of the most interesting things about 10:39

these papers and the state of the art in 10:40

deep learning more generally is that a 10:42

lot of these read as empirical findings. 10:43

Each lab describes the combination of 10:45

tools that works well for them, but 10:47

almost no one gives a first-principles 10:48

justification of why one tool is better 10:50

than the other. For instance, why MLA is 10:52

better than GQA, full stop. This is much 10:54

different from domains like math or 10:56

theoretical physics, which are all about 10:58

providing first-principles explanations 10:59

that derive results from axioms or laws. 11:01

Also, it's interesting that even though 11:04

most of these models have similar 11:05

top-line benchmark statistics and use 11:07

broadly the same tools like attention 11:08

mechanisms, activation functions, 11:10

positional embeddings, and so on, they 11:12

achieve these similar results using 11:13

often very different techniques. This is 11:15

quite surprising. You'd expect that very 11:17

different training methods would lead to 11:19

very different results. Also, all of the 11:20

major models heavily use reinforcement 11:22

learning as part of the post-training and 11:24

reasoning portions of their model 11:26

training efforts. And it's fascinating 11:27

and pretty surprising how some of these 11:29

RL efforts require very small amounts 11:30

of data: just 4,000 data pairs in the 11:32

case of Qwen. Another point here is 11:34

that it's very opaque what the 11:36

differences in data sets are between the 11:37

labs. It's clear from the papers that 11:39

there's an enormous amount of work 11:40

happening behind the scenes in data set 11:41

engineering. This work is probably a 11:43

significant aspect of the moat that 11:44

makes these companies comfortable 11:46

releasing their models. It's very 11:47

difficult to replicate what they're 11:49

releasing. So the big takeaway when 11:50

reading these papers is you shouldn't 11:52

focus too much on just the benchmark 11:53

performance or top-line stats like 11:55

context size. Instead, look at the 11:57

specific methods that these labs are 11:59

using to achieve those results. There 12:01

are tons of high-performing open source 12:03

models that we didn't discuss in this 12:04

video, like Kimi K2 or Google's Gemma 3. 12:06

But when you peek under the hood of many 12:08

of these, you'll find nuanced differences 12:10

that I find really interesting. I hope 12:12

this gives you a framework for how to 12:14

understand the latest open source 12:15

releases and gives you a toolkit to 12:16

start tinkering with them yourself. 12:18

Thanks for watching. See you in the next 12:20

episode. 12:21

[Music] 12:23
