
OpenAI recently dropped gpt-oss, its

first open-weights model since GPT-2 in

2019. It's one of the highest profile

open source model launches since

DeepSeek R1 made waves back in January.

But how does gpt-oss compare to the

other top open source models out there

architecturally? Let's find out.

[Music]

gpt-oss is one of OpenAI's most

anticipated recent launches: a large,

fully open-weights model from one of the

leading American AI labs. Let's take a

closer look at the paper to find out how

it was actually engineered and trained.

gpt-oss is a mixture-of-experts model

available in two sizes, 120 billion

parameters and 20 billion parameters.
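
To make the routing idea concrete, here is a minimal Python sketch of how a mixture-of-experts layer can send a token to only its top-scoring experts. It's an illustration under assumptions, not OpenAI's implementation: the function names, the toy experts, the router scores, and the choice to renormalize with a softmax over just the selected experts are all made up for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Assumption: weights are renormalized over the selected experts only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(x, experts, router_logits, k):
    """Weighted sum of the outputs of the selected experts; the rest never run."""
    out = [0.0] * len(x)
    for idx, w in route_top_k(router_logits, k):
        y = experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out


# Hypothetical toy experts: expert i just scales its input by (i + 1).
experts = [lambda x, s=s: [s * v for v in x] for s in range(1, 9)]
# Hypothetical router scores for one token.
router_logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, 1.5, -0.5]

chosen = route_top_k(router_logits, k=4)
output = moe_layer([1.0, 2.0], experts, router_logits, k=4)
```

With eight toy experts and k set to four, only half of the experts run for this token, which is the sense in which only a portion of the total parameters is used at any given time.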

Each token activates the top four

experts, meaning only a portion of the

total parameters are used at any given

time. This allows for efficient

inference without sacrificing the

benefits of a larger model. Trained as a

decoder-only transformer, gpt-oss

incorporates plenty of features typical

of modern LLMs. This includes grouped

query attention, a modified attention

mechanism that lets multiple query heads

share the same key-value pairs to reduce

memory use and speed up inference. It

also includes SwiGLU activations in

the feed-forward network layers, which

allow for more nuanced transformations

than simpler activations like ReLU, as

well as rotary positional embeddings, or

RoPE, which encode token position

directly into the attention mechanism to

support longer contexts. Finally, the

model also makes use of RMSNorm with

pre-normalization, a normalization

method that scales inputs by their root

mean square for more stable training.
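
As a rough sketch of that last idea, the Python below scales a vector by its root mean square and shows where the normalization sits in a pre-norm residual block. The function names, the epsilon value, and the toy input are assumptions for illustration, not details taken from the gpt-oss paper.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root mean square, then apply a learned per-feature gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * scale for g, v in zip(gain, x)]


def pre_norm_block(x, sublayer, gain):
    """Pre-normalization: normalize first, run the sublayer, then add the residual."""
    normed = rms_norm(x, gain)
    return [xi + yi for xi, yi in zip(x, sublayer(normed))]


x = [3.0, -4.0, 0.0, 0.0]
y = rms_norm(x, gain=[1.0] * 4)

# A hypothetical sublayer that doubles its input, standing in for attention or an MLP.
z = pre_norm_block(x, lambda v: [2.0 * t for t in v], gain=[1.0] * 4)
```

Unlike layer normalization, nothing is subtracted here: the vector is only rescaled, so the normalized output has a root mean square of about one.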

One standout capability of the model is

its 131,000-token context window, which

it achieves by applying YaRN scaling

during pre-training rather than as an

inference time adjustment. We'll touch

on what this means a little bit later in

the video. For gpt-oss, OpenAI makes

use of their open-source o200k_harmony

tokenizer. This byte pair encoding

tokenizer has over 200,000 tokens and

builds on the o200k tokenizer used in

models like GPT-4o. As for the data set

gpt-oss was trained on, OpenAI has only

disclosed the broad strokes. The model

was trained on a text-only corpus in the

trillions of tokens with a focus on STEM,

coding, and general knowledge. Harmful

content was filtered out for safety, but

beyond that, there's little else known

publicly. Once training was complete,

the model was released in a quantized

format by default, making it lightweight

enough for deployment on modest

hardware. This allows it to be run on

consumer-grade GPUs, laptops, or other

resource-limited hardware. However,

there's no unquantized version

available. gpt-oss also underwent

substantial post-training for safety and

alignment, shaping its default behavior

for more controlled outputs. It's worth

noting that some in the open source

community are experimenting with

reducing or removing these layers in

order to explore the raw model's

capabilities. In the broader landscape

of open source AI, gpt-oss arrives as a

fully equipped long-context model ready

for immediate use. As impressive as it

is, however, it's just one of several

models in a rapidly expanding field of

open source LLMs. Qwen 3, the newest

family of models developed by Alibaba

Cloud, dropped this past April to

considerable hype, with benchmark scores

that rivaled those of leading open

source base models like DeepSeek V3 or

Llama 4. The Qwen 3 family includes both

dense models, which activate all of

their parameters for each query, and

mixture-of-experts models, which only

activate a small subset of their

parameters for each query. The dense

models come in seven different size

classes, including a 0.6 billion parameter

model, one of the smallest

current-generation open-weight models around,

while the MoE models come in two different

size classes. Architecturally, Qwen 3

dense models are very similar to the

Qwen 2.5 models, Alibaba's previous

releases. Like Qwen 2.5 and gpt-oss, Qwen

3 incorporates features like grouped query

attention, SwiGLU, RoPE, and RMSNorm.
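
Of the shared features just listed, grouped query attention is the easiest to show in a few lines. The sketch below is a toy, single-position version in plain Python: the head counts, vectors, and function names are assumptions chosen for illustration, and real implementations work on batched tensors.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Scaled dot-product attention for one query over cached keys and values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]


def grouped_query_attention(q_heads, k_heads, v_heads):
    """Each group of query heads reads from one shared key/value head."""
    n_q, n_kv = len(q_heads), len(k_heads)
    assert n_q % n_kv == 0, "query heads must divide evenly into KV heads"
    group = n_q // n_kv
    return [attend(q, k_heads[h // group], v_heads[h // group])
            for h, q in enumerate(q_heads)]


# Hypothetical setup: 4 query heads share 2 KV heads, with 2 cached positions.
q_heads = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
k_heads = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.0, 0.0]]]
v_heads = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0], [0.0, 0.0]]]

outputs = grouped_query_attention(q_heads, k_heads, v_heads)
kv_head_for_query = [h // (len(q_heads) // len(k_heads)) for h in range(len(q_heads))]
```

Here the cache holds keys and values for two heads instead of four, which is where the memory saving comes from.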

Qwen 3's sparse models share the same

fundamental architecture as its dense

models, but add a mixture-of-experts

layer with 128 total experts, of which

eight are activated per token. All Qwen

3 models also use the same tokenizer

used in previous Qwen models, which

implements byte-level byte pair encoding

that allows it to handle any text or

symbol without special pre-processing,

unlike word or character-based

tokenizers. One of the main things that

sets Qwen 3 apart from previous Qwen

models is the way it controls the scale

of the key, query, and value projections

to keep attention scores stable at scale.

It replaces QKV bias, a static offset

that shifts QKV projections in previous

models, with QK norm, a normalization

step that dynamically rescales the

query and key vectors to maintain

constant magnitudes. Data set wise, Qwen

3 was trained on 36 trillion

pre-training tokens, twice as many as

the Qwen 2.5 models. In addition to

pulling data from multilingual texts,

STEM and coding sources, and reasoning

tasks, Qwen 3 also uses Qwen 2.5 models

to generate trillions of tokens of

synthetic data in different formats like

textbooks, instructions, and code

snippets. Qwen 3's pre-training occurred

in three stages. In stage one, the

general stage, models were trained on

over 30 trillion tokens covering 119

languages at a sequence length of 4096

tokens. In stage two, the reasoning

stage, models were trained on an

additional 5 trillion higher quality

tokens featuring more STEM, reasoning, and

coding problems. And in stage three,

which the Qwen team calls the long

context stage, context length was

extended to over 32,000 tokens using a

bunch of clever algorithmic

optimizations, including ABF, a

technique to adjust RoPE so positional

signals remain accurate over much longer

sequences, YaRN to further scale for

longer inputs, and dual chunk attention

to process sequences efficiently.
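
As a rough illustration of the RoPE adjustment mentioned here, the sketch below applies rotary embeddings to a small vector and compares the slowest rotation wavelength under two base values. The function names and the two bases are assumptions picked for the example, and the adjacent-pair layout is just one common convention. It does not implement YaRN or dual chunk attention.

```python
import math


def rope_frequencies(head_dim, base=10_000.0):
    """One rotation frequency per pair of dimensions, from fast to slow."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(vec, position, base=10_000.0):
    """Rotate each adjacent pair of dimensions by position * frequency."""
    out = []
    for i, freq in enumerate(rope_frequencies(len(vec), base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        out.extend([x * c - y * s, x * s + y * c])
    return out


def longest_wavelength(head_dim, base):
    """Number of positions the slowest-rotating pair needs for a full turn."""
    return 2 * math.pi / rope_frequencies(head_dim, base)[-1]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Attention scores depend only on the distance between positions (2 in both cases).
near = dot(apply_rope(q, 5), apply_rope(k, 3))
far = dot(apply_rope(q, 12), apply_rope(k, 10))

# Raising the base stretches the slowest wavelength (illustrative values).
short_base = longest_wavelength(64, 10_000.0)
long_base = longest_wavelength(64, 1_000_000.0)
```

The larger base gives the slowest pair a much longer wavelength, so distant positions remain distinguishable over longer sequences, which is the intuition behind adjusting the base frequency.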

Together, all of these optimizations

allow the model to reason over much

longer inputs at inference. Finally,

Qwen uses a four-step post-training

pipeline with two goals: giving users

more control over how much reasoning to

use for a given query and letting them

efficiently distill larger model

capabilities into smaller models. The

first step in the post-training

pipeline is a long chain-of-thought cold

start stage, which involves feeding a

model a curated data set of challenging

reasoning problems from math, logic, and

STEM with verifiable reference answers

and then filtering outputs to ensure

quality. This is followed by a reasoning

RL stage using GRPO, an RL algorithm

originally developed by DeepSeek

researchers, on roughly 4,000

query-verifier pairs to strengthen complex

problem solving. Personally, I think

it's fascinating that it only takes

4,000 pairs to get great results. The

third step in the post-training

pipeline, thinking mode fusion, is a key

Qwen 3 innovation that integrates

reasoning and non-reasoning into a

single model, letting users switch modes

without changing models. Essentially,

what developers did in this step was

fine-tune the model on a mix of thinking

data, which includes intermediate

reasoning steps, and non-thinking data,

which omits them, and then build a chat

interface to let users toggle modes.

Though this was unique to Qwen when the

model first launched, GPT-5 now features

a similar toggle. The final step,

general RL, broadens capabilities in

instruction following, formatting,

preference alignment, tool use, and

specialized scenarios. Qwen's

developers then use strong-to-weak

distillation, which allows for the

training of smaller models from larger

ones. All in all, Qwen 3's performance

is very impressive, especially given its

relatively small size. But just months

earlier, a different model had already

raised the stakes in open source.

Released in December of last year,

DeepSeek's V3 model was one of the most

ambitious open source LLMs to come out

of a major lab in recent years.

The chatbot developed in China called

DeepSeek.

DeepSeek is such a fundamental change to

the economics of what's going on.

The most downloaded free app in the US.

This is an update in what people think

is possible. At 671 billion parameters,

it's a massive general-purpose base model

designed for efficiency as much as

capability, laying the groundwork for the

reasoning-focused R1 model that would

follow. We're not going to get into a

ton of detail about V3's architecture or

training pipeline here because we put

out a comprehensive deep dive into it

back in February. But at a high level, the

thing to know about V3 is that it's a

mixture-of-experts model with several

hardware and algorithmic optimizations,

including training V3 natively in 8-bit

rather than 16- or 32-bit, a huge unlock

for cutting training costs. And just

recently, DeepSeek pushed V3 even

further with an updated version. The

newly released V3.1 builds directly

on the original V3 base checkpoint,

extending it with a two-phase long

context training approach and adding a

hybrid thinking mode that lets the same

model switch between reasoning-heavy and

lightweight inference. It also improves

tool use and agent performance thanks to

more advanced post-training. In

practice, this means V3.1 keeps the same

core architecture as V3, but delivers

stronger reasoning, smarter tool use,

and greater performance. One thing that

sets V3 apart is that it uses a

different attention mechanism than gpt-oss

and Qwen 3. In modern LLMs, a lot of the

compute and memory is tied up in the KV

cache, and so V3 makes use of MLA, which

compresses keys and values into a

smaller latent space before caching

them, then decompresses them during

inference. Although MLA is a bit more

complex to implement, the previous

DeepSeek V2 paper found it delivers

greater memory savings and better

modeling performance than GQA,

especially in huge, long-context models

like this one. And that's just one of

several areas where DeepSeek V3 takes a

different path. With all that in mind,

let's take a step back. From V3 to Qwen

to gpt-oss, how should we think about, at

a high level, the differences between

these models? One big difference is

size. The Qwen 3 model family is the

only one of the three to offer both

dense and mixture-of-experts variants,

with dense models from 0.6 billion to 32

billion parameters and a mixture-of-experts

lineup that includes a 30

billion parameter model and a 235

billion parameter model. Notably, Qwen's

mixture-of-experts base models matched

the dense models' performance with only a

fifth as many active parameters. On the

other hand, DeepSeek V3 only comes in a

mixture-of-experts architecture with 671

billion parameters, of which 37 billion

are activated for a given token

prediction. So considerably larger than

even the biggest Qwen 3 model. gpt-oss

sits in the middle. It offers two MoE

models: one with 117 billion parameters,

of which 5.1 billion are activated for a

given token and a smaller one with 21

billion parameters of which 3.6 billion

are activated for a given token. One of

the most interesting technical

differences lies in how each model

extends its context length. YaRN, short

for Yet another RoPE extensioN, is a

technique for stretching the model's

rotary positional embeddings so that it

can handle far longer sequences than it

was originally trained on. Normally RoPE

starts to break down when you feed it

more tokens than its base frequency was

set for. But YaRN tweaks that frequency,

so the same embedding space covers much

more ground. What's interesting is how

the three models here use it

differently. gpt-oss applies YaRN right

from pre-training, so its weights have

learned to work natively with

131,000-token contexts. DeepSeek takes a staged

approach, fine-tuning after pre-training

to first reach 32,000 tokens, then

further training to achieve 128,000.

Qwen also fine-tunes to 32,000, but

skips that additional retraining step.

Instead, at inference time, they apply

YaRN scaling again, increasing the RoPE

base frequency by a factor of four to

reach 128,000 tokens without extra

retraining. In other words, gpt-oss is

born with long-context ability, DeepSeek

is trained into it step by step, and

Qwen pushes the limits of what a model

trained at 32,000 tokens can do without more long

context training. Personally, I think

one of the most interesting things about

these papers and the state-of-the-art in

deep learning more generally is that a

lot of these read as empirical findings.

Each lab describes the combination of

tools that works well for them, but

almost no one gives a first-principles

justification of why one tool is better

than the other. For instance, why MLA is

better than GQA, full stop. This is much

different from domains like math or

theoretical physics, which are all about

providing first-principles explanations

that derive results from axioms or laws.

Also, it's interesting that even though

most of these models have similar

top-line benchmark statistics and use

broadly the same tools like attention

mechanisms, activation functions,

positional embeddings, and so on, they

achieve these similar results using

often very different techniques. This is

quite surprising. You'd expect that very

different training methods would lead to

very different results. Also, all of the

major models heavily use reinforcement

learning as part of the post-training and

reasoning portions of their model

training efforts. And it's fascinating

and pretty surprising how some of these

RL efforts require very small amounts

of data: just 4,000 data pairs in the

case of Qwen. Another point here is

that it's very opaque what the

differences in data sets are between the

labs. It's clear from the papers that

there's an enormous amount of work

happening behind the scenes in data set

engineering. This work is probably a

significant aspect of the moat that

makes these companies comfortable

releasing their models. It's very

difficult to replicate what they're

releasing. So the big takeaway when

reading these papers is you shouldn't

focus too much on just the benchmark

performance or top-line stats like

context size. Instead, look at the

specific methods that these labs are

using to achieve those results. There

are tons of high-performing open source

models that we didn't discuss in this

video, like Kimi K2 or Google's Gemma 3.

But when you peek under the hood of many

of these, you'll find nuanced differences

that I find really interesting. I hope

this gives you a framework for how to

understand the latest open source

releases and gives you a toolkit to

start tinkering with them yourself.

Thanks for watching. See you in the next

episode.

[Music]
