[Music] 00:01

Hey guys, I'm thrilled to be joined 00:05

today by Nick Joseph, the head of 00:07

pre-training at Anthropic. To give 00:08

viewers a high-level sense of what we'll be 00:10

covering, we're going to start with the 00:11

basics of what pre-training is and then 00:12

dig into how Nick thinks about strategy, 00:14

data, alignment, and infrastructure at 00:16

Anthropic. And by the end, you'll 00:17

hopefully have a sense for how progress 00:18

in AI comes directly from advances in 00:20

pre-training. I would love to talk a 00:22

little bit about your backstory and kind 00:23

of how you got to this point. Where did 00:25

you work before Anthropic? And what were 00:26

your takeaways from those places? Yeah. 00:28

So let's see. I was at Vicarious uh and 00:30

then at OpenAI uh before Anthropic. So 00:32

Vicarious was originally an AGI lab and 00:35

sort of when I joined they were sort of 00:37

making a shift to product particularly 00:38

working on robotics products and the 00:40

thing I worked on was like training uh 00:42

computer vision models for their 00:44

robotics products. It was my first job. 00:45

So I think I just like learned a ton 00:46

about like how to do machine learning 00:48

models, how to like write machine 00:50

learning infrastructure. 00:51

And at the time were you also thinking 00:53

about a career as an academic? Like at 00:54

the time a lot of people doing AI work 00:56

were in PhDs. That's kind of what I was 00:58

thinking about before I started to do a 00:59

company. Like how were you thinking 01:01

about that in your headspace? 01:02

Yeah. So like I'll actually rewind a 01:03

little bit. I think like a lot of my 01:05

thinking on this had come from an 01:06

internship I did at GiveWell, which is 01:07

like a nonprofit that evaluates 01:09

charities. And some people there being 01:10

like ah we're at some point we might 01:12

have AGI. It could be dangerous. We 01:14

should worry about these risks. This 01:15

could be like a big impact on humanity. 01:16

And I was like not super convinced at 01:18

the time and went down the economics 01:20

route and was going to try to work on 01:21

like directly helping people in poverty. 01:22

that didn't work out for various reasons 01:24

and ended up being like okay I'll at 01:26

least work on AI either like the safety 01:28

thing will turn out to be important I'll 01:30

work on that or it won't be and I'll 01:31

just make cool things with AI that can 01:33

probably help people in poverty more 01:34

I wasn't really coming at it from an 01:36

academic standpoint I was sort of like 01:38

in fact when I switched to that it was 01:40

part of the appeal was that I could like 01:42

immediately go do stuff in AI whereas if 01:43

I want to work in like economic policy 01:45

I'd have to wait 01:47

I don't know six years to do a PhD and 01:48

start and like totally uh it's it's a 01:51

longer path 01:52

And what did the state of AI safety 01:53

work at that time even look like? Like 01:55

who are the people who were thinking 01:56

about that kind of stuff? I mean there 01:57

were some folks at Vicarious thinking 01:58

about this kind of thing but it was 01:59

fundamentally a robotics company and 02:00

so yeah, how were you thinking about 02:02

that at the time? 02:04

Yeah. So my sense was like at the time a 02:05

lot of the AI safety discussion was kind 02:06

of theoretical like the models weren't 02:08

actually that good. They weren't really 02:10

posing these dangers. So it was a lot 02:12

more like philosophical like oh at some 02:14

point we might get AI that's really 02:15

smart smarter than humans and like 02:17

should we weigh this like future concern 02:18

how should we compare that to near-term 02:20

things? And I think that was like 02:23

actually just a less compelling 02:24

argument. I think it was like an 02:26

interesting one and like sort of made 02:27

you think of it. 02:29

So next you went to OpenAI. What was 02:29

OpenAI like at this time? 02:31

Yeah. So I was at OpenAI. I was on one 02:32

of the safety teams and kind of worked 02:34

on uh 02:36

I ended up working on code models 02:37

actually and kind of when I got there I 02:39

the first thing I saw was, oh, 02:41

they'd fine-tuned GPT-3 to write some code 02:43

and it was really good and I was 02:45

like oh okay if you're worried about AI 02:47

getting really powerful writing its own 02:49

code, that seems 02:51

like it could self-improve, and how 02:52

likely is that to happen? So I was 02:55

doing a bunch of evaluations and like 02:56

studies of what contributed 02:58

and then after like uh eight months uh 02:59

basically everyone I worked with like 03:02

all the safety leads left 03:04

which uh yeah invited me to go to 03:06

Anthropic and that was sort of the 03:09

reason I joined OpenAI was because I 03:10

cared about AI safety and wanted to work 03:11

with them. So then I went with them to 03:13

join Anthropic uh pretty much right when 03:15

it started. 03:17

With that why don't we transition a bit 03:17

these days you run the pre-training team 03:19

specifically at Anthropic. Um, obviously 03:21

you've been working on pre-training at 03:23

Anthropic for quite a bit of time and 03:24

I'm sure it's evolved over the years, 03:26

what that even entails and looks like. 03:27

Why don't we start by just talking a 03:29

little bit about what pre-training 03:30

is like? How does it even fit into the 03:31

way of thinking about how AI models have 03:33

developed at a place like Anthropic? And 03:35

what exactly do you guys do? 03:37

We know that one of the ingredients to 03:38

making AI models better is scale. You 03:39

want to put a lot of compute in. And if 03:41

you sort of step back and you're like, 03:43

okay, what's the way we could put the 03:44

most compute into a model 03:45

possible? We need some objective that 03:47

there's just like tons of data for. And 03:49

one idea here is like the internet. The 03:51

internet is massive. It's probably the 03:53

biggest like single source of data 03:54

humanity has created. And you don't have 03:56

labels. It's like you don't want someone 03:58

to have to go in and read the 03:59

entire internet and like say something 04:01

about it. So you want to get labels out 04:02

of the data itself. 04:04

And the idea here is we can take some 04:05

text and we can predict the next word. 04:06

So you take, you know, "the" as the first 04:08

word you predict the second word then 04:10

you say "the cat" and you predict the word 04:12

after that. And this means you get very 04:13

dense signal. Every word is like a 04:16

new example. And there's a huge amount 04:18

of data, and one of the findings from, like, 04:20

GPT-1, GPT-2 was kind of as you throw more 04:22

compute at this more data bigger models 04:25

uh you get better, you get smarter 04:27

models essentially. 04:29
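As a rough illustration of the next-word objective just described, here is a minimal sketch (not Anthropic's pipeline). It assumes whitespace splitting in place of a real subword tokenizer, and the helper name `next_word_examples` is made up for this example. The point is only that every position in a text yields one training example with no human labeling.

```python
def next_word_examples(text):
    """Turn raw text into (context, next_word) training pairs.

    Assumption: words are split on whitespace; production systems
    use subword tokenizers instead.
    """
    words = text.split()
    # Position i uses words[0:i] as context and words[i] as the label.
    return [(words[:i], words[i]) for i in range(1, len(words))]


if __name__ == "__main__":
    for context, target in next_word_examples("the cat sat on the mat"):
        print(" ".join(context), "->", target)
```

A six-word sentence gives five examples, which is the "dense signal" point: the labels come out of the data itself.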

Totally. 04:30

Um and that's kind of been the central 04:31

thesis of pre-training for the whole 04:32

time. 04:34

Uh there's this idea of scaling laws 04:35

which is that you can actually quantify 04:37

like as you put in more compute more 04:39

data, more parameters, you get models 04:41

in a very you get a lower loss a better 04:43

prediction of the next word in a very 04:45

predictable way. And I think you can 04:46

somewhat foresee from that original 04:48

paper and I think like Dario did foresee 04:50

this I think many people did but wasn't 04:51

obvious was that once you have that 04:53

there's this positive feedback loop 04:55

where you can train a model you can use 04:56

it to make something useful and sell 04:58

that and get more money use that to buy 05:00

more compute and then you actually train 05:02

a better model, and we've sort of run 05:04

that cycle over and over again over the 05:06

past 5 years or so. Well, in thinking 05:08

about that objective to begin, you know, 05:10

I think the way I think about the state 05:12

of pre-training is yeah, it seems like 05:14

this next word prediction, at least from 05:15

the external standpoint, seems to be the 05:17

dominant way pre-training happens. But 05:18

if I rewind the clock to that era of 05:20

2017 to 2020 or 2021 or '22 even, there 05:22

was all sorts of pre-training objectives 05:25

people were considering, right? There 05:27

was these uh BERT and BART models that 05:28

were doing masked language modeling. It 05:30

seems like this GPT series of models 05:31

doing uh autoregressive modeling as 05:33

you describe this next word prediction 05:36

seems to be the dominant one that won 05:38

out. Do you have any reflections on that 05:39

time period? Like were you guys trying 05:41

all of them and kind of this one worked 05:42

or or is there some sort of first 05:44

principles reason why this is like the 05:46

right one that should have worked? 05:48

I think the answer is like it's mostly 05:49

empirical like in terms of how to think 05:51

of the things I'd be like yeah it's 05:52

empirical just try them all see what 05:53

works. One big advantage for this 05:54

autoregressive setup is that you can just 05:56

sample from it to generate text 05:57

afterwards in a fairly like 05:59

straightforward way that comes 06:01

like enables a product use very nicely. 06:02

Um like one thing that you want is like 06:05

one characteristic is like a loss where, as you 06:07

drive down the loss that actually is the 06:08

thing you care about and you can think 06:11

of it as like if you got to perfect on 06:12

language modeling you now can like write 06:14

text as a human. You can sort of imagine 06:17

you put in the title of a paper and it 06:19

should spit out the entire spit out a 06:20

novel paper. Whereas I think some of the 06:22

other approaches don't quite have that 06:23

uh flavor. 06:25

Yeah, totally. Yeah, it makes sense that 06:26

in terms of that loop you're describing 06:28

of, you know, then release something 06:30

that gets you revenue and you can use 06:32

that to buy more compute and iterate. 06:33

This sort of gives you the most natural 06:35

way to actually do that flow because you 06:36

can keep releasing new products and keep 06:37

getting the revenue from that to invest 06:39

in more compute and so on. 06:40

Yeah, it certainly gives you the most 06:42

open-ended thing. You could imagine, you 06:43

know, you like train something as a 06:44

class like you train some base thing, 06:45

you fine tune it for a bunch of 06:47

particular tasks. one approach people 06:48

would use. They would like do this big 06:49

pre-training and then they wouldn't just 06:50

like open-endedly sample from it. You'd 06:52

fine tune it on like a hundred specific 06:53

tasks and that could work too. I think 06:55

that like the one sort of general 06:57

intuition I have is like compute is the 06:59

thing that matters. So like I think if 07:01

you throw enough compute at any of these 07:03

objectives, you're going to get 07:04

something that's probably pretty good uh 07:05

and can kind of be fine tuned to other 07:07

things. And it's it's surprising how 07:09

little these details matter compared to 07:11

throwing more compute. When you think 07:13

about actually throwing more compute, 07:15

there's a whole bunch of axes by which 07:16

you could throw compute at it too, 07:18

right? And if you have a specific model 07:19

architecture you're training over, you 07:21

can basically throw more data at that 07:23

specific architecture. For a particular 07:24

one, you could add more layers or make 07:26

the models larger in it. You could do 07:27

some kind of neural architecture search 07:29

over lots of different variants. And I 07:31

assume that these days it's somewhat 07:33

more figured out, you know, which 07:35

architecture you go for. I assume the 07:36

earlier days it was somewhat less so. 07:37

And and I'm curious if you could speak 07:39

to how you guys thought about that. like 07:40

what did your infrastructure even look 07:41

like to do that type of determination? 07:43

I mean, I think the short answer is 07:46

it's hard, right? Like what you're 07:47

really doing is you're going to train 07:48

this one big expensive model and you 07:49

have a space of, you know, you can sort 07:50

of call all these things 07:53

hyperparameters. You know, how many 07:54

layers do you have? What's your width? 07:55

Like you have the space of hundreds of 07:56

hyperparameters and you want them all to 07:57

be optimal and you're sort of striking 07:59

this balance actually between how much 08:01

do they matter like can you just take 08:03

your best guess and throw more compute 08:05

at it in whatever way you want and 08:07

it basically doesn't matter, versus how much you 08:09

want to get it precisely correct. 08:10

Interesting. 08:11

And I think one of the like interesting 08:12

things is like it actually doesn't 08:13

matter that much. Like we like I think 08:14

this was in one of the early scaling 08:16

laws papers like you can change these 08:17

things and get little wins but like as 08:19

you throw more compute it sort of 08:21

reliably gets better. If you mess up 08:23

enough, you will sort of stop 08:25

seeing that happen and you won't have 08:27

any way to know which is one of the 08:28

that's like kind of the hardest part in 08:29

some ways. 08:30

You don't know the counterfactual 08:31

basically because you didn't run it for 08:32

long enough to actually know what it is. 08:33

Yeah. We have these scaling laws. So you 08:35

can sort of say like as you train more 08:37

compute, you expect the loss to go down as 08:38

a power law. 08:40

It's really a power law plus constant. 08:41
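To make "power law plus constant" concrete, here is a small sketch under stated assumptions: the functional form L(C) = a * C^(-b) + c is the one named in the conversation, but the coefficient values, the function names, and the choice to treat the constant c as known when fitting are all illustrative, not values or methods from any real training run.

```python
import math


def predicted_loss(compute, a, b, c):
    """Power law plus constant: L(C) = a * C**(-b) + c."""
    return a * compute ** (-b) + c


def fit_power_law(computes, losses, c):
    """Fit a and b by least squares on log(L - c) = log(a) - b * log(C).

    Assumption: the constant term c is already known or guessed.
    """
    xs = [math.log(x) for x in computes]
    ys = [math.log(loss - c) for loss in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


if __name__ == "__main__":
    # Hypothetical small-scale runs used to fit the curve.
    computes = [1e18, 1e19, 1e20, 1e21]
    losses = [predicted_loss(x, a=10.0, b=0.05, c=1.7) for x in computes]
    a_hat, b_hat = fit_power_law(computes, losses, c=1.7)
    # Extrapolate to a larger run; a measured loss well above this
    # would be the "curving off the power law" signal.
    print(predicted_loss(1e22, a_hat, b_hat, 1.7))
```

The fit only says what loss to expect at the next scale. It cannot say whether a miss is a fundamental limit or a mistuned hyperparameter.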

So what eventually will happen is you'll 08:42

curve off that power law and then you 08:44

know something is wrong and is it 08:45

fundamental? Is it like you've hit the 08:46

limits of scaling or is it nope you 08:48

should have tweaked 08:50

your learning rate slightly differently 08:51

and that's sort of one of the 08:53

challenges in terms of how to like 08:55

figure it out, the usual 08:56

paradigm is like test things out at 08:57

small scale before running them at large 08:59

scale and try to find 09:01

small scale in terms of data or in terms 09:02

of something else? uh in terms of 09:04

everything like you kind of want to 09:05

scale things down like proportionally. 09:07

So you want to say, like, you 09:08

want to have some theory for like how 09:09

you're going to scale up like ah okay if 09:11

I get 10 times as many flops how much of 09:13

it goes into layers how much of it goes 09:15

into data how much of it goes into 09:16

attention 09:19

and you sort of get that theory and then 09:20

test that it's optimal a bunch with like 09:23

scaling everything down proportionally 09:25
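One way to picture "scaling everything proportionally" is the sketch below. It relies on the commonly used approximation that training a dense transformer costs about 6 * N * D FLOPs for N parameters and D tokens. The split exponent (`param_exponent=0.5`) and the function names are hypothetical choices for illustration; the interview does not say what allocation rule the team used.

```python
def train_flops(params, tokens):
    """Approximate training cost: about 6 FLOPs per parameter per token."""
    return 6.0 * params * tokens


def scale_plan(params, tokens, flops_multiplier, param_exponent=0.5):
    """Split a change in compute budget between model size and data.

    Assumption: params grow as multiplier**param_exponent and tokens take
    the rest, so total FLOPs change by exactly flops_multiplier.
    """
    new_params = params * flops_multiplier ** param_exponent
    new_tokens = tokens * flops_multiplier ** (1.0 - param_exponent)
    return new_params, new_tokens


if __name__ == "__main__":
    base_params, base_tokens = 1e9, 2e10  # hypothetical baseline run
    # Scale up by 10x, and scale down by 100x for a cheap test run.
    for multiplier in (10.0, 0.01):
        p, t = scale_plan(base_params, base_tokens, multiplier)
        print(multiplier, p, t, train_flops(p, t))
```

The same rule run with a multiplier below one gives the small-scale configuration to test before committing to the large run.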

And just so I can think about what 09:27

this actually looks like in 09:28

those early days of Anthropic, you know, 09:30

you're a team of like 10 or something 09:31

like that in those very early days or 12 09:32

maybe what actually is your ability to 09:34

use large scale infrastructure as like a 09:36

relatively nimble startup at that time. 09:38

I mean a startup that was well 09:40

capitalized but still not actually that 09:41

many people working at. What kind of 09:43

infrastructure did you have access to to 09:45

train these early models at the time? So 09:47

actually one of the wild things was it 09:49

at least I mean you don't know what 09:50

anyone else is doing of course but it 09:51

kind of felt like we were like at the 09:53

frontier of it and there just weren't 09:54

that many people who cared like I was 09:57

sort of coming you know I was coming at 09:59

it from like we're making AGI this is 10:00

the most important technology ever and 10:01

then would kind of like look around and 10:03

be like and it seems like I'm one of 30 10:04

people who were working on this in like 10:06

the world. I mean I was kind of like a 10:08

junior person. Everyone else sort of 10:10

knew how to do this and had done it 10:11

before but I was kind of surprised at 10:12

how easy it was. Um like the public 10:14

estimates for GPT-3 I remember were that 10:17

it cost $5 million to train which you're 10:19

like on the one hand five million is 10:21

kind of a lot but it's like a lot for an 10:22

individual person. It's not really a lot 10:24

from like a company perspective. So we 10:26

could totally buy like compute that was 10:29

enough to train models like that you 10:31

could 10:33

and were you using a cloud provider or 10:33

did you have a custom setup somewhere 10:35

or did you literally have racks in a 10:36

room somewhere that you were you know 10:38

bought a bunch of Nvidia GPUs and you 10:39

were doing it? Uh, we were using a cloud 10:40

provider, but I think it's 10:42

not actually that different because one 10:43

of the things that was surprising to 10:45

me is you actually have to understand 10:46

the literal layout. Like uh I 10:48

remember at one point uh one of my 10:51

co-workers running a clustering 10:53

algorithm to identify what rooms all the 10:54

chips were in since we had a 10:57

hypothesis that they were in different 10:58

rooms and that was causing like or you 11:00

know, different buildings, some sort of 11:02

network latency and you can kind of 11:03

figure it out. you could like reverse 11:05

engineer like ah okay yeah there's 11:06

clearly like two clusters here that are 11:07

connected better and there's some issue 11:09

on the connection between them like 11:10

we're trying to push the limits 11:12

of the hardware like as much as 11:13

possible 11:15
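The clustering algorithm used in that anecdote isn't specified, so the following is only a sketch of the general idea, assuming you have measured pairwise latencies between chips. It groups chips whose latency to each other falls under a threshold, using a simple union-find. The function name, the threshold, and the latency numbers are all hypothetical.

```python
def group_by_latency(latency_us, threshold_us):
    """Group nodes that are connected by low-latency links.

    latency_us is a symmetric matrix of measured pairwise latencies.
    Nodes end up in the same group if a chain of links at or below
    threshold_us connects them.
    """
    n = len(latency_us)
    parent = list(range(n))

    def find(x):
        # Walk to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if latency_us[i][j] <= threshold_us:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    # Hypothetical measurements: chips 0-1 and 2-3 are close to each other,
    # while links between the two pairs are much slower.
    latency = [
        [0, 2, 9, 10],
        [2, 0, 11, 9],
        [9, 11, 0, 3],
        [10, 9, 3, 0],
    ]
    print(group_by_latency(latency, threshold_us=5))
```

Two clean groups with slow links between them would match the "two clusters that are connected better" picture.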

um particularly at the beginning when we 11:16

were kind of like we have way less 11:17

funding than everyone else we have to 11:18

and most people weren't very 11:19

efficient with the compute so we were 11:20

like ah we get a big lead by being 11:22

really efficient at how we use 11:24

the compute. 11:26

could you talk a little bit about some 11:26

of the things you guys did in those 11:27

early days for how to get the most out 11:28

of the hardware I think it's really 11:30

interesting like I think back to the 11:31

early days of Google for 11:32

example, where there's these 11:33

cases where they basically bought 11:35

relatively cheap consumer chips and then 11:36

they optimized the software to make it 11:38

so you can actually get the most bang 11:40

for your buck out of them and that's how 11:41

they had all this low-latency, 11:42

high-availability stuff. I'm 11:44

kind of curious if there's some analog 11:46

in the early AI era to that. 11:48

I think for us it was largely about like 11:50

getting the distributed framework right 11:51

so like, in order to 11:53

train these models, you have to train them on a large 11:54

number of chips 11:56

and there's a bunch of different 11:57

approaches to how to do this. There's 11:58

like data parallelism and there's 12:00

pipelining, there's sharding, and like 12:01

getting all of the At the time there 12:03

were no like great open source packages 12:05

you could just grab and use that just 12:06

worked for this. I mean today there's 12:08

somewhat more of these but at the time I 12:09

assume there was literally none. 12:11

There were some like I actually remember 12:12

that we were working on data parallelism 12:13

early on and someone was like and now we 12:16

write the all-reduce. I was like, we 12:17

really do this ourselves? Isn't there, like, a 12:19

package and this was kind of like well 12:21

we're going to want to modify it right 12:22

like oh like we don't want to outsource 12:24

this to some package because a we're 12:26

about to go to a bigger scale like 12:28

PyTorch, they had a package for 12:30

doing this but we were going to go to a 12:31

bigger scale than Facebook had been to 12:33

and you don't want to have a dependency 12:36

on a package uh that you're going to 12:38

have to be like constantly modifying 12:40

essentially 12:42
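For readers unfamiliar with the all-reduce mentioned here: in data parallelism every worker computes gradients on its own batch, and an all-reduce leaves every worker holding the sum of all of them. The conversation doesn't say which algorithm the team implemented. The sketch below swaps in ring all-reduce, one common choice, and simulates it in a single process with plain lists. The function name and the example values are made up.

```python
def ring_all_reduce(worker_grads):
    """Simulate ring all-reduce (sum) over equal-length gradient lists.

    Each worker only ever sends one chunk to its right-hand neighbor,
    first accumulating sums (reduce-scatter), then passing the finished
    chunks around the ring (all-gather).
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    if length % n != 0:
        raise ValueError("gradient length must be divisible by worker count")
    chunk = length // n
    bufs = [list(g) for g in worker_grads]  # work on copies

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: after n-1 steps worker i holds the full sum of
    # chunk (i + 1) % n.
    for step in range(n - 1):
        msgs = [((i - step) % n, bufs[i][span((i - step) % n)]) for i in range(n)]
        for i, (c, data) in enumerate(msgs):
            dst = (i + 1) % n
            for k, value in enumerate(data):
                bufs[dst][c * chunk + k] += value

    # All-gather: circulate the finished chunks so everyone has all of them.
    for step in range(n - 1):
        msgs = [((i + 1 - step) % n, bufs[i][span((i + 1 - step) % n)]) for i in range(n)]
        for i, (c, data) in enumerate(msgs):
            bufs[(i + 1) % n][span(c)] = data

    return bufs


if __name__ == "__main__":
    grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(ring_all_reduce(grads))
```

Dividing each result by the number of workers would give the averaged gradient that data-parallel training typically applies.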

That's such a counterintuitive 12:42

sentence there too like we're going to a 12:44

bigger scale than Facebook will because 12:45

at the time Facebook AI research was 12:46

considered one of the best places to do 12:48

machine learning research. Like, FAIR was 12:50

one of the places. FAIR and DeepMind were 12:52

hiring lots of people out of PhD 12:53

programs and doing lots of things like 12:55

what was your headspace when you were 12:57

like okay this this very established lab 12:58

with great people and whatnot we are 13:00

operating on a scale that is not 13:02

relevant to them like was that natural 13:03

and obvious to you or was there times 13:05

where you kind of doubted the decisions 13:07

you were making in that situation 13:08

I think it was surprising I will maybe 13:10

I'm just too arrogant or something I 13:12

kind of looked around and was like what 13:14

are these people doing they're all 13:15

missing the like big picture here like I 13:16

think the scaling laws were pretty 13:19

clear like and the arguments against I 13:21

just thought were kind of nonsensical 13:22

like you know the scaling I think the 13:24

original scaling laws paper had like 11 13:26

orders of magnitude and there was like 13:27

this intense debate on whether it would 13:29

continue for like another point and I 13:30

was like 13:33

like, it seems like one over 11 13:33

is maybe your chance it fails here and 13:36

then like you know sometimes it doesn't 13:38

work like sometimes it just works 13:39

straightforward you like train the model 13:41

you're like oh yeah of course but yeah I 13:42

do think that it was it maybe felt 13:44

obvious when you're in that headspace 13:47

and you're working on this all the time 13:48

and you're making those plots and I 13:49

think these things feel pretty different 13:51

when you're on the outside. You know, 13:52

there's a huge space of papers. Everyone 13:53

tries to make their paper sound like 13:55

very robust and important. I could 13:57

see being like, "Oh yeah, 13:59

this is not really a thing." 14:01

Totally. 14:02

But also different labs had different 14:03

cultures. So like I think one of the 14:04

things at FAIR was it was a much 14:06

more PhD-style independent research. 14:08

People have their own ideas, pursue 14:10

those. 14:11

You're fighting for your compute and so 14:12

on. 14:13

Yeah. And to do a project like training 14:14

a large language model requires a lot of 14:15

people to collaborate on like a really 14:17

complicated piece of infrastructure that 14:19

isn't going to be a paper, right? Like 14:21

you're not going to publish like, oh, I 14:22

got 5% more efficiency 14:24

totally 14:26

than the next one. Um, and it's not 14:27

respected in like those cultures 14:29

necessarily. So that might have been 14:30

part of it. 14:31

Okay. Okay. So then when you actually 14:32

implement these models, you're 14:32

saying you're using a level of low-level 14:34

programming where you know you're using 14:37

libraries like PyTorch, but you're 14:39

perhaps not using everything right out 14:40

of the box from PyTorch because there's 14:41

things you guys want to customize that 14:42

are at the level of basically one level 14:44

of abstraction below them, but not 14:45

necessarily at the level of abstraction 14:47

of you know writing custom CUDA kernels 14:48

or, like, was that also in the space 14:50

where you guys were thinking about? 14:52

So it depends on like the operation. So 14:53

like I think I was mostly operating at 14:54

the level of like torch.matmul, you know, 14:55

like uh yes where does a matmul go but 14:57

not thinking like how do you make the 14:59

matmul efficient like I assume torch 15:01

figured out how to make a matmul as 15:03

efficient as is possible but there are 15:04

some pieces like attention where there 15:06

was just kind of a lot of different 15:08

variants and attention is really 15:09

complicated and hard to make efficient 15:11

on a GPU and those things you have to 15:12

kind of go more levels down on the 15:16

stack. Uh I think there was like a 15:18

process that is maybe interesting that 15:20

I'd never really like thought of before 15:21

of like how to do it which is sort of 15:23

like modeling out the thing 15:25

you're going to do coming up with a 15:26

strategy for how to parallelize it that 15:28

like can get to a really good efficiency 15:29

you know like 15:32

so you're thinking about MFU basically 15:32

like your utilization on your GPU. So 15:34

there's like a goal utilization you're 15:36

trying to get at and a strategy to get 15:37

to there. You're saying 15:39

yeah and I think like one of the things 15:40

you can do is you can actually like 15:41

pencil and paper math out what 15:42

efficiency you're going to be able to 15:43

get to. Right. you know all the 15:44

constraints. MFU is model FLOPs 15:45

utilization but like the reason you 15:48

don't get good MFU is you end up limited 15:50

on HBM bandwidth you end up limited on I 15:52

don't know, host to like CPU offload 15:55

there's a bunch of different pieces but 15:58

but there's not that many pieces 15:59

there's like six relevant numbers there 16:01

so you can totally model it out 16:03

understand what the constraints are and 16:04

then implement something that can get 16:06

there it of course will be really 16:08

inefficient when you implement it and 16:09

then the next step is like pulling out a 16:11

profiler so you want to be able to 16:12

profile the job look how long every 16:13

operation takes. Have a model in your 16:15

mind of how long every operation should 16:17

take and then make those two things the 16:18

same. 16:21
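The "pencil and paper" estimate described above can be sketched roughly as follows. This is an assumption-heavy toy model, not the team's actual method: it considers only three limits (compute, HBM bandwidth, network bandwidth), assumes they overlap perfectly so the slowest one sets the step time, and every number and function name is hypothetical.

```python
def estimate_step(flops, hbm_bytes, network_bytes,
                  peak_flops, hbm_bandwidth, network_bandwidth):
    """Estimate step time, MFU, and the limiting resource for one step.

    Assumption: the three resources overlap perfectly, so the step takes
    as long as the slowest of them.
    """
    times = {
        "compute": flops / peak_flops,
        "hbm": hbm_bytes / hbm_bandwidth,
        "network": network_bytes / network_bandwidth,
    }
    bottleneck = max(times, key=times.get)
    step_time = times[bottleneck]
    # MFU: useful model FLOPs divided by what the chip could have done.
    mfu = flops / (peak_flops * step_time)
    return step_time, mfu, bottleneck


def biggest_gaps(expected_s, measured_s):
    """Rank operations by how far measured time exceeds the model's estimate."""
    gaps = {name: measured_s[name] - expected_s[name] for name in expected_s}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical per-step workload and hardware limits.
    print(estimate_step(flops=1e15, hbm_bytes=4e12, network_bytes=1e10,
                        peak_flops=1e15, hbm_bandwidth=2e12,
                        network_bandwidth=1e11))
    # Hypothetical profiler readings compared against the estimates.
    print(biggest_gaps({"matmul": 0.8, "attention": 0.3},
                       {"matmul": 0.9, "attention": 0.9}))
```

The second helper mirrors the profiler step: the operation with the largest gap between the mental model and the measurement is where to look first.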

And were there good out-of-the-box 16:22

profilers you could use at that time or 16:23

did you guys have you know because 16:24

people weren't operating on the kind of 16:26

network topologies you guys may have 16:27

been using. Did you have to write your 16:29

own profilers basically to do this type 16:30

of you know multi-node optimization? 16:32

Yeah, it depends when. It's actually been getting 16:34

better with time. The PyTorch profiler 16:36

was like pretty good actually throughout 16:38

for a single GPU. If you want to like 16:39

profile a GPU, the PyTorch profiler would 16:40

work. But if you wanted to profile a job 16:42

on hundreds, thousands of GPUs, that 16:44

like hadn't really been done much. And 16:47

then that was kind of more of us like 16:49

hacking into the profiler to figure out 16:50

how to combine all the traces together. 16:53
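How the traces were actually combined isn't described, so this is only a sketch of one plausible approach. It assumes each rank has exported its events in the Chrome trace event format (the JSON layout that trace viewers read, with fields such as `name`, `ph`, `ts`, `dur`, `pid`), and it relabels each event's `pid` with the rank so that every GPU appears as its own row in a single timeline. The function name and the sample events are hypothetical.

```python
import json


def merge_traces(events_by_rank):
    """Combine per-rank trace events into one Chrome-trace-style document.

    events_by_rank maps a rank number to a list of event dicts. Each event's
    "pid" is replaced by the rank so a viewer shows one row group per rank.
    Caveat: clocks on different hosts are not synchronized here, so
    cross-rank timestamps are only roughly comparable.
    """
    merged = []
    for rank, events in sorted(events_by_rank.items()):
        for event in events:
            tagged = dict(event)  # copy so the input is left untouched
            tagged["pid"] = rank
            merged.append(tagged)
    merged.sort(key=lambda e: e.get("ts", 0))
    return {"traceEvents": merged}


if __name__ == "__main__":
    traces = {
        1: [{"name": "matmul", "ph": "X", "ts": 5, "dur": 2, "pid": 123}],
        0: [{"name": "all_reduce", "ph": "X", "ts": 3, "dur": 4, "pid": 456}],
    }
    print(json.dumps(merge_traces(traces), indent=2))
```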

And then one more question on that 16:54

earlier is, you know, you had mentioned, 16:55

you know, you hadn't really done a lot 16:57

of this work before maybe some time at 16:58

OpenAI and those early days in 16:59

Anthropic. How did you actually go learn 17:01

all this stuff? Like what was your 17:02

process for learning about those six 17:04

things that were relevant to bandwidth 17:05

limitations and whatnot? 17:07

I mean, so when I joined Anthropic, one 17:08

really nice thing was there just wasn't 17:10

that much. I think my first day I read 17:11

through our entire, uh, all of Slack 17:13

and the entire like internal database 17:16

and learned a bunch from that. Like it 17:19

was kind of nice to just be like 17:21

everything is relevant to me. And then I 17:22

mostly learned from pair programming. 17:25

Like uh Tom Brown had done all this 17:26

before. So he kind of like knew all the 17:28

stuff quite well. Sam McCandlish, my manager, 17:30

had also done a lot of it before and I 17:32

just like paired with them a huge amount 17:33

at the beginning. And I think one of the 17:35

things I really like about pairing as a 17:37

way of learning is you learn the like 17:39

thing you're trying to do. Like you will 17:40

learn that like if you're pairing with 17:42

someone better than you, they can just 17:43

do it. So you're mostly just watching 17:44

them. But you also learn how people do 17:45

it. So something like how to use a 17:46

profiler is not something you would ever 17:48

learn from seeing someone's like final 17:50

write up on Slack for their PR. You 17:52

would just be like, "Oh, they found 17:54

these. They changed this specific line 17:55

and it's a win." They 17:58

like you need to watch like a YouTube 17:59

video for 4 hours of someone messing 18:00

around with a profiler to like maybe 18:03

self teach it or something or to 18:04

actually pair with someone is basically 18:06

the best you can do. 18:08

I think there was like one thing that I 18:09

I think is embarrassing now that I look 18:10

back is I'd never actually used a 18:11

debugger before joining Anthropic. 18:12

People talk about it, pdb, and I was like, yeah, 18:15

that's a thing people use but print 18:16

seems fine for me. 18:18

Then I, like, watched, like, oh no, a debugger 18:20

is a super useful tool. this person's 18:22

way faster at debugging things 18:23

particularly if it takes a long time to 18:25

start up the code which they can and 18:26

yeah, learning that sort of thing I 18:29

think comes best from pairing and then 18:30

there's of course the obvious you just 18:32

learn by doing you know I eventually did 18:34

like spin up a profiler and stare at it for 18:35

many many hours 18:37

totally yeah exactly yeah okay so so 18:38

then that was sort of the very early era 18:40

over time obviously pre-training has 18:41

become bigger and bigger as you're 18:44

describing scaling I imagine you're 18:45

using many x more GPUs much more compute 18:46

over time I'd be really curious to hear 18:49

first at a high level What do you feel 18:51

has changed about the pre-training 18:53

strategy that you could talk about? 18:54

Obviously, there's more compute, but 18:56

what does that actually mean to have 18:57

more compute in terms of what you think 18:59

about differently from those early days 19:01

versus now? 19:02

I'll share the things that haven't changed 19:03

cuz I think it is like shocking how in 19:04

some ways like 19:06

I think I'm still pushing down the exact 19:07

same metric that I was on like day one. 19:09

like there's like some loss function 19:12

loss go down and I think you could like 19:13

look at some like you could probably run 19:15

the original the first model I trained 19:16

on the same metric and just like make a 19:18

plot of like progress of team over 19:20

time. Uh so that's all the same. I think 19:22

the 19:24

one OKR is like one thing that matters 19:25

basically. Yeah, totally. 19:27

And like I mean talking about like OKRs 19:28

it's very sized company you're like oh 19:29

should you do OKRs and it's always felt 19:31

a little bit funny for uh a team like 19:32

where I'm like sure I can just pick a 19:35

loss value but like the answer is like 19:37

as low as possible. We will continue to 19:39

work on that forever. 19:40

I think the biggest things that have 19:41

changed has been a little more 19:43

specialization. Like I think at the 19:44

beginning, I mean the first like 3 or 6 19:45

months I tried to read every PR in the 19:48

codebase and that was great. I knew all 19:49

the pieces etc. And as you grow, it's 19:51

kind of everything gets like a little 19:54

more precise. You know, people really 19:55

dial in exactly how attention should 19:57

work, let's say, or you know, really 19:59

dial in like uh the parallelism 20:01

strategy. And you end up with a team 20:03

where it's a bunch of people who are 20:06

like deep experts on individual things 20:07

which is great because it means you can 20:10

go you can go really deep on those 20:11

things but sometimes you uh at least for 20:12

me as a manager one of the things you 20:15

sometimes have to think about is like 20:16

making sure the bigger picture makes 20:17

sense and also that you have enough 20:18

people who actually do understand the 20:20

whole bigger picture that there's no 20:21

like single point of failure. 20:23

Yeah, it's interesting you you frame it 20:24

in that with that trade-off, right? 20:25

Because as as you were describing that I 20:27

was trying to think, you know, is this a 20:28

bug or a feature? like there there's 20:29

some obvious features of it which is you 20:30

get expertise and you can optimize 20:32

certain things but I imagine your 20:33

ability to take bigger swings becomes 20:36

more complicated if not everyone's 20:39

exactly pointed in the same direction 20:40

like how do you wrestle with that now 20:42

yeah I think I mostly just try to get a 20:44

balance of people I think one of the 20:46

challenges early 20:48

people oh that's interesting 20:48

yeah like I think people really do have 20:49

a preference here has been one of the 20:51

things I've seen like there are people 20:52

who really want to be a generalist and 20:54

understand everything and like lightly 20:55

touch on things, there are people who want to 20:57

like 20:58

pick an area often they've already 20:59

picked that area and they're like deep 21:00

experts in precision. You know they 21:02

started they did a whole PhD in 21:03

precision and just want to think about 21:04

that 21:06

and you want to get some balance of 21:06

that. I think early there was a phase 21:08

where we'd hired a lot of people who are 21:09

more generalist shaped because that's 21:10

what the people who joined totally early 21:12

startup where they go work on everything 21:13

and then 21:14

you ended up with kind of everyone doing 21:15

everything and no one really really 21:17

deeply understanding one thing. uh and 21:20

that's one failure mode but I think if 21:22

you get too many people who are 21:23

specialists 21:24

you end up with a lot of effort has to 21:25

come from the manager from like the lead 21:28

to connect everything 21:29

and to notice something like ah if we 21:31

change the architecture here that would 21:34

make this like efficiency consideration 21:35

over there way easier 21:38

um one of the things I really liked kind 21:39

of like at the very beginning was like 21:41

let's work on efficiency but I could 21:42

just go and like be like ah well what if 21:43

we change the way we do like this 21:45

particular step and we'll be like oh 21:47

yeah that's probably fine like easy 21:48

change and then like you can avoid 21:50

this whole complicated project to make 21:51

this operation that was hard efficient 21:52

because you can make an easier operation 21:54

efficient. 21:56

Okay. Interesting. Yeah. So, as the 21:56

level of compute has also gotten bigger. 21:58

So, I'm I'm sure anyone can imagine, 22:00

okay, there's more GPUs now, you have to 22:02

network them more. Are there some like 22:03

kind of non-obvious challenges that have 22:05

arisen over time where you guys have 22:07

just like banged your head against the 22:10

wall to solve them because of the amount 22:11

of compute you're dealing with that 22:13

people wouldn't otherwise know about 22:15

that like you want to share? I think 22:16

that connecting them is one that's maybe 22:17

interesting and like surprisingly hard. 22:20

Okay. because you really do get more and 22:22

more chips connected and 22:23

like one thing that I think is like the 22:25

the standard way people parallelize chips 22:27

is, um, the whole thing is one failure 22:29

domain like one chip fails the whole 22:31

thing can crash 22:33

and 22:34

the standard way as in the standard way 22:35

people doing AI or the standard way in 22:36

in other fields where people are doing 22:38

uh in AI for like I mean at least like I 22:40

think at the beginning you know first 22:42

versions of things were this way and 22:44

so it's like you have a 100 GPU cluster 22:46

or whatever is 128 like if one of them 22:47

dies job fails basically 22:49

yeah I mean you The simplest thing is if 22:51

you just like distribute your model. So 22:52

say you put like every layer on a 22:53

different chip and you lose like layer 22:55

seven like 22:58

yeah you're not going to like skip layer 22:59

seven. I guess you could but that's like 23:02

a pretty weird model training process 23:04

now and like that leads to some 23:06

interesting things which is like okay so 23:08

now as you scale up you have more and 23:09

more chips and the failure rate can get 23:11

like larger and larger. 23:12
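The scaling of failure rate with chip count described here can be sketched with a simple probability model. This is an illustration only: it assumes chips fail independently with the same probability, and the per-chip failure probability used below is a made-up number, not a figure from the conversation.

```python
def job_failure_probability(num_chips: int, per_chip_failure_prob: float) -> float:
    """Probability that at least one chip fails in a given interval.

    Assumes independent failures with an identical per-chip probability,
    and that a single chip failure interrupts the whole job
    (the "one failure domain" setup described above).
    """
    return 1.0 - (1.0 - per_chip_failure_prob) ** num_chips


# Hypothetical per-chip failure probability per day (illustrative only).
P_CHIP_PER_DAY = 0.001

for n in (128, 1_000, 10_000):
    p_job = job_failure_probability(n, P_CHIP_PER_DAY)
    print(f"{n:>6} chips -> P(job interrupted in a day) = {p_job:.3f}")
```

Under these assumptions the chance of an interruption climbs quickly as chips are added, which is why fast restarts from a checkpoint matter more at larger scale.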

On the other hand you can like I don't 23:13

know you can like restart pretty quickly 23:15

there. There's nothing like you just 23:16

have to like load back in some ways. So 23:17

that was one thing. And then the thing 23:19

was like the level of novelty at the 23:21

whole stack is something that's 23:24

surprising. Like basically 23:25

everything from like how the chips are 23:27

laid out in the data center to the chips 23:29

themselves is pretty new. Like there 23:31

there just haven't been that many 23:33

generations of GPUs. I think one of the 23:34

things that I don't know when I learned 23:36

computer science my code wouldn't work 23:37

and I'd be like oh the computer's 23:39

broken. I think my teacher was like the 23:41

you can trust the computer's not broken 23:42

like you messed up. 23:44

It's you messed up. And I think one of 23:45

the most frustrating things I 23:46

encountered in AI early on was working 23:47

on something and being like, I don't 23:49

know what I'm doing wrong. I'm just 23:51

totally stumped. And uh my manager 23:52

looked at it and was like, uh yeah, 23:54

probably the computer's wrong. 23:56

And I was like, that seems unlikely. And 23:57

sure enough, the computer was wrong. 23:59

Turned out that like the GPU was broken 24:00

and uh we had to pull in a new one. But 24:02

you have to like think like having to 24:06

think about that like the GPU could be 24:07

wrong, the GPU could be slow, like these 24:09

sorts of issues. Uh the power supply in 24:11

the data center could be broken. there's 24:14

so much more like level of depth than 24:15

you like kind of expect to need as a 24:18

Python programmer. 24:21

And just to visualize it like in those 24:22

early days, I assume you guys were using 24:23

the number of GPUs. It's probably on the 24:25

order of tens to hundreds or something 24:26

like that per run. It's probably not 24:28

tens of thousands or hundreds of 24:29

thousands per run or what was the rough 24:30

size you guys were at? Those are very 24:32

early days on the order of thousands. 24:33

Like would they fit in this room? 24:35

Thousands. 24:36

Yeah, thousands. So like you could have 24:36

a bunch of racks and you could fit them 24:38

into like one room. I assume these days 24:39

it's basically like a building for for 24:41

one of these runs. 24:43

Yeah. Now I think it's like you know 24:44

huge huge campuses. At the time it was 24:45

like kind of unclear. It was like oh I 24:47

think like we were like you know do we 24:48

need them all in one room? Can we be 24:49

spread across multiple rooms? Like uh 24:50

and you know we had these theoretical 24:53

models you be like we need this much 24:54

bandwidth from point A to point B. But 24:56

you like you never know how far down you 24:57

have to go like oh but like how much 24:59

power do we need? Like what if there's 25:00

like a single capacitor that's like 25:02

handling all of them and we like turn on 25:04

the whole job at once. Like does that 25:05

crash things? 25:06

Yeah. And so do you have to think about 25:08

differences in the different types of 25:10

chips? You guys work with all sorts of 25:11

different cloud providers. From your 25:12

standpoint, are these just sources of 25:14

compute or if you guys are using TPU 25:16

versus GPU, are these, you know, Google 25:18

TPU versus Nvidia GPU? Do you actually 25:21

have to think as an engineer differently 25:23

about what it means to train on these 25:24

two? 25:26

Yeah. So, I mean, fundamentally, they're 25:26

all they're all doing the same thing, 25:28

right? They're all computing the same 25:29

operations, matrix multiplications, 25:31

etc. The way they do it is pretty 25:32

different, and the way that you program 25:34

them is is pretty different. Uh and then 25:35

also the actual specs uh end up pretty 25:38

different. You know, some some might 25:41

have like a lot of flops and not very 25:42

much memory or they might have a lot of 25:44

memory bandwidth but not very much 25:46

memory. So I think a lot of having 25:47

multiple chips is like great in some 25:50

ways. It means you can actually like 25:52

take the job and put it on the chip that 25:53

it works best on and that's 25:54

like are there certain types of jobs 25:56

that would work better on like a TPU 25:58

cluster versus an Nvidia GPU cluster? 26:00

Like how would you talk about that? Oh, 26:02

interesting. Can you talk about that? 26:04

Yeah. Yeah. I think like one example is 26:05

like inference as a workload in general 26:06

tends to require more HBM bandwidth. You 26:09

you end up doing you sort of the 26:11

simplest form of sampling since you're 26:12

going one at a time you have to load all 26:14

the weights for every token 26:15

and that means you might want a lot of 26:17

HBM bandwidth. Uh pre-training actually 26:18

is often more flops intensive because 26:20

you have larger batch sizes 26:22

essentially. 26:24
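The contrast between bandwidth-bound sampling and FLOPs-bound pre-training can be sketched with a rough arithmetic-intensity estimate. This is a simplified model, not a description of any real system: it counts only weight reads (ignoring activations and the KV cache), assumes 2 FLOPs per parameter per token, and the chip numbers are hypothetical.

```python
def arithmetic_intensity(batch_tokens: int, bytes_per_param: int = 2) -> float:
    """FLOPs performed per byte of weights read, for one dense forward pass.

    Simplifying assumptions: each parameter costs 2 FLOPs (multiply + add)
    per token, and the weights are read from HBM once per pass regardless
    of how many tokens are in the batch.
    """
    return 2.0 * batch_tokens / bytes_per_param


def limiting_resource(batch_tokens: int, peak_flops: float, hbm_bytes_per_s: float) -> str:
    """Compare the workload's intensity with the chip's FLOPs-to-bandwidth ratio."""
    ridge = peak_flops / hbm_bytes_per_s
    if arithmetic_intensity(batch_tokens) >= ridge:
        return "compute"
    return "memory bandwidth"


# Hypothetical accelerator: 1e15 FLOP/s peak, 3e12 bytes/s of HBM bandwidth.
PEAK_FLOPS = 1e15
HBM_BW = 3e12

print("sampling, 1 token per step:", limiting_resource(1, PEAK_FLOPS, HBM_BW))
print("training, 4096 tokens per step:", limiting_resource(4096, PEAK_FLOPS, HBM_BW))
```

With one token per step the weights are reloaded for very little arithmetic, so bandwidth is the limit; with a large batch the same weight read is amortized over many tokens.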

Um so yes you can sort of specialize 26:24

which chips you use for which purposes. 26:26

The downside of having multiple chips is 26:28

that you have to write the thing 26:30

multiple times. uh you in theory you 26:31

could have abstractions across them but 26:33

they're they're different enough that 26:35

it's pretty hard to do that. So you can 26:36

sort of end up if you do all the 26:38

workloads on all the chips you end up 26:39

multiplying your work by the number 26:40

of chips you have. 26:42

Yeah. On your on your point about 26:43

sometimes the computer just breaks. I 26:44

definitely remember you giving me an 26:46

anecdote of uh my company at the time 26:46

was doing something with Google TPUs and 26:49

I was telling you something some 26:50

anecdote about how we were having some 26:51

esoteric segfault error and you were like you 26:53

told me something to the effect of like 26:55

you should have used them six months ago 26:56

before we helped them fix like half of 26:58

the problems they had on those TPUs. And 26:59

so I can imagine how you guys deal with 27:01

a lot of especially with these very new 27:03

chips like lots of problems that arise 27:04

that you guys kind of like worked 27:06

closely with the providers to fix. 27:07

Yeah, the providers are like pretty great 27:09

about fixing things. I think it's like 27:11

interesting to figure out the right way 27:12

to do that form of collaboration cuz 27:13

like they have a strong incentive to fix 27:15

them, right? Like they they want the 27:16

chips to work well for us. They want to 27:17

sell us more chips in the future. We 27:19

obviously have a very strong incentive 27:20

for the chips to work cuz we like buy 27:21

them long in advance, you know, like 27:23

everything's riding on getting these 27:24

clusters to work. 27:25

Totally. Um but we don't have like 27:26

necessarily totally shared you know like 27:28

all information sort of can't be shared 27:30

across. So yeah one of the like one 27:32

strategy that's been interesting is like 27:33

making these sort of small scale 27:34

reproducers. So like when you get a 27:35

problem you know like usually what we're 27:37

doing is we're training some giant run 27:38

and we get like a segfault for let's 27:39

say and we're like ah okay like hi you 27:41

know we got a segfault on your cluster 27:45

and they're like I don't know how to fix 27:47

that. So you have to kind of be able to 27:48

like pull it out of your codebase and be 27:50

able to like reproduce the issue but on 27:51

like a single chip on like a single file 27:52

you can send over in order for 27:54

And so you guys are like literally like 27:56

you're on a shared Slack with them or 27:57

something and you're sending them things 27:59

back and forth or are they basically 28:00

living in your office and you're living 28:01

in their offices and kind of 28:02

more closely tied to the big providers. 28:05

Mostly shared Slack occasionally it's 28:07

better to meet in person but I think 28:09

Slack is a pretty common way people 28:10

communicate on things. 28:12

Nice. Okay. Well, why don't we talk a 28:13

little bit about how you think about the 28:14

state of pre-training itself these days? 28:16

In the last couple years, it seems like 28:17

the focus on pre-training has now gone 28:20

somewhat split at a lot of companies, at 28:22

least from the outside from a 28:24

simultaneous focus on pre-training and 28:25

post-training where people are doing 28:27

reinforcement learning or clever 28:28

fine-tuning and lots of other sort of uh 28:30

safety adjustments and whatnot on the 28:32

post-training side and pre-training has 28:34

focused at least seems like in the 28:35

public imagination has been less of a 28:37

focus compared to these reasoning style 28:38

models that are it looks like a function 28:40

mostly of post-training. I would say one 28:42

from your standpoint is that the right 28:44

way to think about this or in this era 28:45

of kind of reasoning and new types of 28:48

post-training methods are there things 28:49

you think about differently or that are 28:51

relevant even at pre-training that 28:52

become part of how you actually achieve 28:54

these really great models. 28:56

Yeah. So I think yeah there sort of used 28:58

to be this idea of like I mean it's 28:59

funny because the original name 29:00

pre-training implies that it's like a small 29:01

thing and then you're going to do this big 29:04

training thing and that like and there 29:05

was there was actually one shift already 29:07

which was like no you just do a lot of 29:08

pre-training like you use most of your 29:09

compute on it, and that was 29:11

the dominant uh thing for a while and 29:13

yeah I think like now people are like oh 29:16

no you can get pretty big wins from RL 29:17

sort of another set of scaling laws is 29:20

like you put more and more compute into 29:21

RL you can get better and better models 29:22

out of that and yeah so there's a 29:24

question of like how do you balance 29:25

those two, how much do you do of each, 29:26

and how do they stack, right? Like is it 29:28

the case that like one subsumes the 29:30

other that you want to do both and they 29:32

multiply? Those sorts of questions. I 29:34

think those are all kind of like early 29:36

stages and not not yet answered. 29:37

Yeah. And and do you think about those 29:40

as largely empirical questions like we 29:42

talked about earlier? Is it you kind of 29:44

will try a bunch of things and see what 29:45

works or is there some first principles 29:47

way to kind of figure that out? 29:49

I think it's pretty empirical in the 29:51

end. I think almost everything kind of 29:52

has to be done empirically. like you can 29:54

kind of like come up with theories, but 29:55

in practice like 29:57

the first thing you're going to do with 29:58

your theory is test it and most of most 29:59

of the time you'll have gotten it wrong. 30:02

So you you should just gather data and 30:03

see. I think one thing that's important 30:05

is like actually resolving things 30:07

empirically is really like 30:09

critical for making good decisions. And 30:11

I think it's actually pretty hard to do 30:13

at organizations, you know, like 30:14

one thing that I think is important is 30:16

to like not have like I don't I manage 30:18

pre-training. I shouldn't be like oh 30:20

pre-training has to win like that. I was 30:21

going to ask is there some competition 30:24

to some degree between these two sides 30:25

of the org or do they see themselves as 30:27

two pieces of the same I mean obviously 30:30

they are of the same thing but yeah kind 30:31

of curious how that actually plays out. 30:33

Yeah, I think we managed to avoid this 30:34

and it's pretty collaborative like we're 30:35

basically all producing one model and 30:37

kind of can but I I do think at other 30:39

places there's been some from what I've 30:40

heard there's some amount of like uh 30:42

friction between between the teams and I 30:43

think it's a 30:45

it's an interesting like org design 30:46

question of like how do you set this up 30:48

so you don't have like scientific 30:49

questions that you want to be that are 30:51

sort of uh 30:53

also tied to people's like conception of 30:54

their their team. So on pre-training 30:57

itself, you know, one of the things I 30:59

think about is or I've been thinking 31:00

about is around the availability of high 31:01

quality data for people like you guys. I 31:03

mean at this point you've trained on I 31:04

assume all the text on the internet 31:06

basically there's all sorts of other 31:07

domains where you probably could extract 31:09

more pre-training data but at least 31:10

there's this narrative I see you know on 31:11

Twitter or whatever where it's like okay 31:13

we're kind of out of data for for 31:14

pre-training. Is that how you see it or 31:16

how do you think about the availability 31:18

of data especially when a lot of data on 31:20

the internet is being generated by AI 31:21

like is there some kind of you know mode 31:22

collapse risk where you know we kind of 31:25

we overfit to data by training it on 31:27

data that came out of AI itself or is 31:29

that sort of not the right way to think 31:32

about this? 31:33

I think there's a funny thing where I 31:33

feel like on data I see so many really 31:35

confident takes on we're out of internet 31:36

like at this point scaling has ended and 31:38

I'm almost a little bit like 31:40

unsure exactly how much data people are 31:42

using. I think there's like a lot to 31:45

think about there. You know, there's 31:47

always going to be a quality quantity 31:48

trade-off, etc. 31:50

But there's a fundamental point that 31:51

like there is so much data. It's growing 31:52

at a slower rate than we're getting more 31:54

compute. Uh 31:57

oh, so that's okay. That's an 31:58

interesting point in itself. I was going 31:59

to ask like there is new data being 32:00

added to the internet, but yeah, you're 32:01

also adding more compute. It's not it 32:03

wouldn't actually have been obvious to 32:04

me which of those two is growing faster. 32:05

Yeah. And actually, I want to caveat that. 32:07

I don't think I want to state that so 32:09

confidently. I'm not totally sure. Like 32:10

how would you know? I mean one thing 32:12

that I think is interesting is if you 32:13

ask someone how big is the internet uh 32:14

the answer is infinite. There are many 32:17

pages where you can scroll and it will 32:19

autogenerate more text as you go 32:20

forever. So the internet's like infinite 32:22

and then it's like okay how big is like 32:24

the useful internet 32:25

and then there's a thing of no one knows 32:27

like 32:29

interesting 32:30

there isn't it's not like when you make 32:30

a web page you like add it to some giant 32:32

counter and like say I've added 50 32:34

words to the internet today. 32:37

So there there is a lot of uncertainty 32:39

on that angle. Um 32:40

well like to be fair like my kind of 32:42

simplistic CS brain would be like well 32:44

you just you know do PageRank on the 32:46

internet and everything with PageRank 32:47

above some threshold is considered the 32:49

useful internet and like that's kind of 32:50

good enough like is that kind of not 32:51

good enough for finding the useful 32:53

internet 32:55

I think not I think the useful 32:55

internet's pretty different from a model 32:57

versus a person perspective if that makes 32:58

sense like I think there are plenty of 33:00

things that like might not be worth you 33:01

ever reading and would get to actually 33:04

PageRank super. I think PageRank is 33:06

mostly like how much people 33:08

it's like the link based system right 33:09

it's like the original Google algorithm 33:10

of like links and and like which which 33:12

links get touched the most basically. 33:14

Yeah, I think it's like it's a quality 33:15

metric. It's it's not obvious to me that 33:17

it's the right quality metric for AI, 33:19

right? Like Markov chain over links 33:22

doesn't necessarily mean that there's 33:23

not useful data there just might mean 33:25

that nothing linked to it 33:27

and Yeah. Okay. Interesting. 33:28

And it might be that like that data ends 33:30

up more valuable because you everything 33:31

that's linked to a lot you've already 33:33

got. like at some point you're maybe 33:34

like going for the tails, right? You're 33:36

going for the stuff that uh no one's 33:37

ever like, you know, it's only been 33:39

linked in one place, but it's this like 33:40

useful little nugget of knowledge that's 33:42

going to help with like, you know, the 33:44

last 10% of of hard queries. The other 33:46

thing you asked about is synthetic data, 33:48

and I think that one's like pretty 33:50

interesting to think about. I think 33:52

there's a few different ways you can 33:53

think about it. Like one is sort of this 33:55

like more distillation type approach 33:56

where you can you can take a smart 33:58

model, you can generate a bunch of data 33:59

from it and you can train on that data 34:01

and you you can probably get some model 34:03

that will like kind of approach the 34:04

intelligence of that. 34:05

Yeah. And we see this with a lot of the 34:06

open source models, right? We see like 34:07

the Qwen smaller reasoning models 34:08

distilled off of the larger Qwen models 34:11

for example and similar with DeepSeek 34:12

for example. 34:14

Yeah. So you can totally do that. Then 34:15

there's a separate question of like can 34:16

you use your current models to train a 34:18

model that's better? And I think there's 34:21

like an interesting thing here which is 34:22

like if you generate the model data for 34:24

the models you know if I go to Claude 34:26

and I'm like write me some great text. 34:28

Yeah. And I look at it and I look at 34:30

like the average content on the internet 34:31

looks pretty good. 34:33

But on the other hand I know that if I 34:34

just train a just create generate you 34:37

know please write me as much text as 34:38

possible. 34:41

Theoretically I shouldn't be able to 34:41

train a better model than that. Like I'm 34:43

just going to get the same thing out. Uh 34:44

so 34:46

yeah presumably yeah I mean specifically 34:47

that's because like your next token 34:49

prediction on that should have very 34:50

little loss for anything that's coming 34:51

out of your model right that's like the 34:52

basic reason why that we would expect 34:54

that to not work that well 34:55

it's mostly just cuz like there's some 34:56

dist the model has some distribution and 34:58

you're going to learn to model that 34:59

exact distribution but if that 35:00

distribution is wrong 35:02

you're not going to learn the truth 35:03

right if that distribution says like you 35:05

can imagine if the model thinks 5 plus 5 35:07

is 11 every time you see the string 5 35:09

plus 5 you're going to it's going to put 35:11

out 11 and your new model is going to 35:12

learn that 5 plus 5 is Totally. Yeah. 35:14
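The "5 plus 5 is 11" argument can be illustrated with a toy experiment. This is a sketch under strong assumptions: the "teacher" is a hypothetical two-outcome distribution rather than a language model, and the "student" is a maximum-likelihood fit of that distribution. It shows only that fitting a model's own samples reproduces the model's errors.

```python
import random
from collections import Counter


def sample_teacher(rng: random.Random, n: int) -> list[str]:
    """Hypothetical teacher that completes '5 + 5 =' with the wrong answer
    '11' 90% of the time and the right answer '10' 10% of the time."""
    return rng.choices(["11", "10"], weights=[0.9, 0.1], k=n)


def fit_student(samples: list[str]) -> dict[str, float]:
    """Maximum-likelihood fit of a categorical distribution:
    the student's probabilities are just the empirical frequencies."""
    counts = Counter(samples)
    total = len(samples)
    return {answer: count / total for answer, count in counts.items()}


rng = random.Random(0)  # fixed seed so the run is repeatable
student = fit_student(sample_teacher(rng, 10_000))
print(student)
print("student's most likely answer:", max(student, key=student.get))
```

The student converges to the teacher's distribution, including its mistake; nothing in the synthetic data pulls it toward the true answer.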

So I think that's like kind of an 35:16

interesting area of research. It's one 35:17

that's really hard to research because 35:19

you have this problem. You know, as I 35:21

said, like one of the paradigms is you 35:22

study things at small scale and then you 35:24

run them at large scale. 35:25

And if your plan is like, oh, we have a 35:27

bunch of data from our best model. Yeah. 35:29

How do you test that training a better 35:31

model? So that's like kind of if you're 35:34

doing intentionally, if you're trying to 35:35

like use it to make a better model, 35:36

there's a separate thing of like what 35:38

about accidentally? Like as you said, a 35:39

lot of the internet is generated by 35:41

LLMs. And I think that's kind of an 35:42

interesting one because it's not easy to 35:44

detect. It's not that hard to detect. 35:46

Like you can figure out things that are 35:47

written by LLMs, but it's not trivial. 35:49

And then it's also kind of hard to think 35:51

about what's the effect like if 1% of 35:53

the internet is LLM generated. Does that 35:55

make your model does that like waste 1% 35:57

of your compute or does it like destroy 35:59

the model if 5% if 10% 36:00

and is it even a bad thing necessarily? 36:02

I mean there's a lot of LLM providers 36:03

and you know if I kind of think of it as 36:05

training as you know you're moving from 36:07

your model's current distribution to 36:08

some truth distribution. you know, if if 36:09

that is on the internet because people 36:11

believe it to be useful in some way. 36:14

Like presumably what whatever actually 36:16

gets out there, you'd hope is upsampled 36:17

for the stuff that isn't 5 plus 5 is 11, 36:19

it's the stuff that's 5 plus 5 is 10. 36:21

And so like hopefully it 36:22

on average does push you still in a good 36:24

direction, but obviously you can't 36:26

really distinguish between those two. 36:27

Yeah. You're saying there's like kind of 36:29

a filtering by what's on the internet. 36:30

Like people see 5 plus 5 is 11 and they 36:31

don't put that up, but they see 5 plus 5 36:33

is 10. 36:34

You would hope that, but maybe that's 36:35

not actually true in terms of the the 36:36

level of garbage getting onto the 36:38

internet. Like there's probably lots of 36:40

just like to your point just websites 36:41

where you scroll down and it's just like 36:42

generating lots of stuff that's maybe 36:44

nonsense. 36:45

Yeah. And then there's of course the 36:46

extreme of like people actually want to 36:47

break your model. So there are people 36:49

who are like trying to put stuff out 36:50

that is like as damaging as possible for 36:51

the model. You know how can I make it 36:54

past the past the filter and get into 36:55

the model but be totally like secretly 36:57

useless. 36:59

Totally. Maybe stepping back slightly. 36:59

You'd mentioned earlier about um evals. 37:01

You mentioned basically like one metric 37:03

you care about in pre-training. There's 37:04

I imagine a whole bunch of stuff that 37:07

you guys think about evaling, right? One 37:08

is like your model itself. There's 37:10

probably something around data quality 37:12

and like how you think about what to put 37:14

into your models. Like is there ways to 37:15

describe what you care about in data 37:18

sets that are like interesting to share 37:20

and kind of dive into like both in terms 37:22

of data and in terms of quality of 37:24

models other than literally just like 37:25

loss. Is there other metrics you think 37:27

about that matter? 37:28

I will say loss is pretty good. I I want 37:29

to like emphasize that one. I think it's 37:31

like surprising how good it is. 37:33

Ultimately, like the qualities I like 37:35

for an eval are like number one, is it 37:37

actually measuring something you care 37:39

about? Like you proxies can be pretty 37:40

annoying cuz like 37:42

we saturate evals pretty fast and 37:43

there's sort of this pattern. I think in 37:45

AI as a whole where people like set a 37:46

goal, you hit the goal and then you 37:48

realize the goal isn't all you thought 37:49

it would be. I used to think that if you 37:51

had an AI that could solve coding 37:53

interview questions, it would probably 37:54

be AGI. I was like that's what I did to 37:55

get my job and probably do the job. And 37:57

it turns out like 37:58

nope, 37:59

nope. You solve those. it's shockingly 38:00

narrow and can't do most of the other 38:02

things. So like yeah so evals should 38:03

capture like a thing you care about 38:06

and then I think the other thing is they 38:08

need to be low noise uh which is 38:09

surprisingly hard right if you have like 38:12

a 100 questions and you eval the model 38:14

on them you're just going to see it's 38:16

very noisy and it's hard to make 38:18

decisions because you sort of end up 38:19

with like oh 38:20

wide confidence interval lots of things 38:22

are statistically insignificant 38:23

so like you want things where even a 38:25

relatively small difference in the eval 38:26

actually matters so you can you can 38:28

basically like descend towards whatever 38:29

direction is working 38:31

Yeah, I think like the original GPT-4 38:33

had like I think it was 86.4% was its 38:35

MMLU score. I think like the next model 38:37

that beat it was Gemini at 90%. And 38:39

that's like a big difference on that 38:42

eval. And you could like totally know 38:43

that those are those are different 38:45

scores. 38:47
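The point about noise on small evals can be made concrete with a standard binomial confidence interval. This is a rough sketch: it uses the normal approximation, treats questions as independent, and takes 14,000 as an approximate size for a large multiple-choice benchmark (an assumption for illustration, not a figure from the conversation).

```python
import math


def accuracy_ci_halfwidth(accuracy: float, num_questions: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for an accuracy
    score, using the normal approximation to the binomial distribution."""
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / num_questions)
    return z * standard_error


SCORE = 0.864  # the accuracy quoted above
for n in (100, 14_000):
    hw = accuracy_ci_halfwidth(SCORE, n)
    print(f"{n:>6} questions: {SCORE:.1%} +/- {hw:.1%}")
```

Under these assumptions, a gap of a few points between two models is well inside the noise on 100 questions but clearly outside it on a benchmark with thousands of questions.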

Interesting. 38:47

Um and that's pretty valuable. Uh and 38:48

then the last thing is that you actually 38:49

want it to be fast and easy to run. 38:51

Um and yeah, I think those are kind of 38:53

the main criteria. It's pretty hard to 38:55

come up with evals that meet all of 38:57

these. I think the first one's the 39:00

hardest. uh like a you have to answer 39:02

the question of what do you care about 39:04

but b the usual answers to what you care 39:06

about are really hard to get the other 39:08

two you know like if you're trying to do 39:10

something that like I don't know I would 39:11

love to make Claude really good at my 39:13

job 39:14

like can it be great at managing a team 39:15

I'm like well 39:17

I guess like how do you have it like how 39:18

do you eval like a plan you know like a 39:21

six month plan like I don't know 39:23

totally yeah I've been thinking a little 39:25

bit about that in in terms of yeah 39:26

domains where we see people try to make 39:28

companies like if you think about let's 39:29

say what an AI doctor would be like a, you 39:30

know, Claude as a doctor, you know, some of 39:33

it could be yeah can you answer exam 39:34

questions really well and the answer is 39:36

like probably yes I bet it can get 100% 39:37

or close to it on a doctor's exam but 39:39

the harder eval is something like in a 39:43

long form conversation with a patient 39:45

can it distinguish between the signal 39:48

and the noise of what the patient's 39:50

telling you and extract the right 39:51

information and then use that to make a 39:52

diagnosis and it's not even like the 39:54

diagnosis part which is part of the part 39:55

it's good at it's this like noise 39:57

extraction part and for that you'd have 39:58

to have like a real patient and have it 39:59

talk to it for a while and whatnot and 40:01

it's not obvious how you actually make a 40:03

good eval for something like that even 40:06

though it's probably what you would want 40:08

to make, you know, an AI doctor. 40:09

Exactly. I mean, I do think it's a thing 40:10

that like startups can do. Like it is 40:12

the case that like the labs right now 40:14

are really driven by getting good eval 40:15

scores 40:18

and it's hard to make them and anyone 40:18

can do it. There's no comparative 40:21

advantage to having the model to making 40:22

an eval. So I do think it's it's 40:23

actually like an interesting way to like 40:25

influence the behavior of the big labs 40:26

is like you make some eval and people 40:28

will will optimize uh that one. On the 40:30

doctor one I will slightly emphasize 40:32

that like I do think loss loss is pretty 40:34

good. Like I think if you got a bunch of 40:36

transcripts of like the way like I the 40:37

first thing that comes to my mind is get a bunch 40:39

of transcripts of doctors talking to 40:40

patients that you think are really great 40:43

and then see how well the model does at 40:45

predicting the transcript. 40:47

And that should be like a lot. You know 40:48

you can if you get 100 transcripts you 40:49

get a lot of tokens. You can average 40:51

across them. you get pretty low noise 40:52

and if you drive it to very low your 40:54

model's now as good as, like, doctors 40:56

in theory or at generating the 40:58

transcript. 41:00
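
The transcript-prediction eval described above can be sketched as a loss computation. This is a minimal illustration under stated assumptions, not the method used at any lab: it assumes the per-token log-probabilities a model assigned to each transcript are already available (the numbers below are made up), computes each transcript's mean negative log-likelihood, and reports the mean and standard error across transcripts to show how averaging brings the noise down.

```python
import math
import statistics

def transcript_loss(token_logprobs):
    """Mean negative log-likelihood per token for one transcript."""
    return -sum(token_logprobs) / len(token_logprobs)

def mean_and_stderr(losses):
    """Mean loss across transcripts and the standard error of that mean."""
    mean = statistics.mean(losses)
    stderr = statistics.stdev(losses) / math.sqrt(len(losses))
    return mean, stderr

# Hypothetical log-probs the model assigned to the actual next tokens.
transcripts = [
    [-1.0, -1.0, -1.0, -1.0],
    [-2.0, -2.0],
    [-3.0, -3.0, -3.0],
]

losses = [transcript_loss(t) for t in transcripts]
mean, stderr = mean_and_stderr(losses)
print(f"loss = {mean:.3f} +/- {stderr:.3f} over {len(losses)} transcripts")
```

The standard error shrinks roughly with the square root of the number of transcripts, which is one way to read the point about 100 transcripts giving a low-noise signal.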

Yeah, totally. Yeah, I mean it's a good 41:01

startup idea there. So I want you to go 41:03

do that. So one big part about um 41:04

Anthropic's external image is around 41:06

alignment and so could you help just 41:08

sort of define what alignment is and how 41:11

do you think about that? And then I'm 41:13

kind of curious afterwards how that fits 41:15

into pre-training specifically. But 41:16

first maybe just at a high level like 41:17

what is alignment? I'm actually like 41:19

step back a little bit to sort of like 41:20

what we're working on. So we're like 41:22

trying to make AGI and by that I sort of 41:23

mean AI that can do everything a human 41:25

can do to some degree. And I think 41:28

people like sometimes like have seen a 41:30

lot of sci-fi, you know, like I feel 41:32

like that's sort of what brings to mind 41:33

these like sci-fi movies, but I think 41:33

sci-fi movies actually like 41:34

underestimate the impact of it. Like you 41:35

always have this like one robot that's 41:37

like a human. And I'm like well 41:38

wouldn't you have like a billion of 41:40

them? Like you can just copy them 41:41

everywhere. So you should picture like 41:42

when you get this you suddenly have like 41:44

every human can spin up a company of 41:46

like 1 billion as smart as them at most 41:48

things but way smarter at other things. 41:50

But I just think this is like really 41:52

transformational for the world and it 41:53

can be like used in a bunch of ways. One 41:54

concern is like when you do this like 41:57

what is the AI actually trying to do? 41:59

Like what are its goals? So we talked 42:01

about next token prediction a bunch. 42:02

It's trying to like predict the next 42:03

token. That's kind of weird. That's not 42:06

really what we want. 42:07

Yeah. That's not exactly what a human's 42:08

goal is per se. 42:10

Yeah. So I think alignment is like 42:11

how do you get the model to share the 42:12

goals that you have particularly and I 42:13

think it's particularly interesting once 42:15

you get to like models that are smarter 42:16

than you are. Um and that's sort of a 42:17

hard problem. I think you can like 42:20

tackle it from a theoretical angle. Uh 42:21

you could also tackle from an empirical 42:23

angle. It's like taking the existing 42:24

models and being like well do they do 42:26

the things we want them to do? It turns 42:27

out they often don't. So there's a bunch 42:29

you can do and trying to figure that 42:30

out. So that's kind of one angle on 42:31

alignment. There's also an angle of 42:33

alignment which is actually like well 42:34

okay sure that maybe that's true in the 42:36

future once we get to AGI but at the 42:38

moment we have models and we really do 42:39

want them to do the things we want them to do 42:41

for all sorts of reasons. Totally. 42:42

So another angle of it is kind of 42:43

controlling the model's personality like 42:44

saying you know uh when we train this 42:46

model we want it to not be the average 42:48

internet user. We want it to interact with 42:50

people in a very particular way that is 42:51

again hard to put into 42:52

code and there's a bunch of different 42:54

techniques uh to sort of get the model 42:57

to do you talk about like constitutional 42:59

AI we can like write a constitution of 43:00

rules the model should follow 43:02

which is basically a prompt right that 43:03

that is basically you saying here's a 43:05

prompt that I'm going to attach to every 43:06

one of you know it's a system prompt for 43:08

the model itself as opposed to something 43:09

you would do at training time to make it 43:12

produce a different outcome or or in 43:14

post-training actively 43:15

both I think con you do at train time 43:16

but yeah you would also put in the 43:19

system prompt um just like depends on I 43:19

think you get different amounts of 43:22

robustness if it's trained into the 43:23

model versus if it's in a prompt you can 43:24

like add or remove or tell like ignore 43:26

all previous instructions that sort of 43:28

thing. 43:29

How do you think about whose values to 43:29

to embody in these models? Like 43:32

presumably we believe in there's some 43:33

shared values all of us have or maybe we 43:35

all believe ought to have. There's lots 43:38

of diversity of values too that are 43:39

reasonable for society to have. How do 43:41

you think about what AGI should have? 43:43

Like what does that even which ones do 43:45

you pick? 43:47

I think that's a really hard problem. I 43:47

think it's like actually kind of 43:49

downstream of being able to pick any. I 43:50

think of it almost I think one analogy 43:52

I've heard that I like is like putting a 43:54

steering wheel on a car. It's like if 43:55

you don't have a steering wheel, you 43:56

probably want to put the steering wheel 43:57

on and then like figure out who's 43:58

driving after and like where you're 44:00

going. Like getting the steering wheel 44:02

is really important. I think that's 44:03

that's like one answer. I think the like 44:05

other answer is probably like you want 44:06

these things to be like under democratic 44:08

control of some form. Like you don't 44:11

want one person's values. Like that 44:13

seems like you're sort of heading 44:15

towards dystopia. So there I think what 44:16

you really want is like something that 44:18

basically can talk to a lot of people 44:21

and like take on their values from 44:22

different perspectives or has sort of 44:24

very generic like kind of clearly good 44:26

values that involve like 44:29

asking people for advice on very you 44:32

know like asking people what you should 44:33

do in certain situations instead of like 44:34

doing those or maybe just taking like 44:36

you know as these models get really 44:38

powerful you probably want them to like 44:39

do less like you probably want them to 44:41

sometimes just like step back rather 44:42

than like to rather than having sort of 44:44

the risk of the models like take a ton 44:46

of control over things you don't want 44:47

them to. When you think about how you 44:48

actually do the current version of that 44:50

then you mentioned the sort of alignment 44:51

you think about now in terms of adopting 44:53

a certain personality of these models on 44:55

the internet for example for me 44:56

intuitively I think of those as largely 44:58

something that comes out of 45:00

post-training like it comes out of okay you 45:01

you have pre-trained your model you got 45:03

the loss function a certain amount and 45:04

then you you know give it some 45:05

additional data or something to that 45:07

effect to make it in the direction of 45:08

some distribution is that approximately 45:10

the right way to think about this or is 45:12

there a significant part of that that 45:13

you think about in pre-training itself 45:14

I think that's probably the the right 45:16

way to think about it for the most part 45:18

I think like I the way I usually think 45:19

about it is anything you can do in 45:20

post-training you probably should 45:21

because your iteration loop like the 45:23

ability to make progress is really fast 45:25

you can try something you try it again 45:27

you can try it again a bunch of times 45:28

days or hours or something like that 45:30

yeah 45:31

if you put it into pre you have to kind 45:32

of like do all the careful science to 45:33

de-risk it you have to put it into the 45:34

next run wait a few months then you have 45:35

to like 45:36

get a thing and if it's wrong it's 45:38

really bad and then the other advantage 45:40

is if you want to do things that really 45:42

are complicated model behavior 45:44

interventions, the paradigm for 45:46

pre-training, testing things out on small 45:48

models, doesn't work. The model can 45:50

barely put a sentence like the small 45:51

models can barely put a sentence 45:52

together. Totally. So, if you're trying 45:53

to get it to like have the exact 45:54

personality you want, you sort of want 45:57

that on the 45:58

it has to be on a model that's good 45:59

enough to be on the smart model. Yeah. 46:00

But that said, like 46:02

I do think at some point there will be 46:04

like some pieces of alignment that like 46:05

you do want to export back into 46:07

pre-training because that might be a way 46:08

to like 46:10

put them in with more strength, like 46:11

more robustness kind of or or more to 46:13

the intelligence. Like if you think of 46:15

pre-training as like teach the model to 46:17

be intelligent and then post-training as 46:18

like tweak the personality, you can 46:20

imagine tweaks where you actually want 46:22

it to be like part of how it learns and 46:23

like part of its intelligence and maybe 46:25

you need to create more. 46:27

What would that even look like to 46:28

incorporate into pre-training? Is that like 46:29

add extra data basically of the type of 46:30

domain you want it to adopt earlier? 46:33

Basically, 46:35

there's a paper called pre-training on 46:36

human feedback where you can kind of 46:37

like add the human feedback 46:39

characteristics into pre-training to 46:40

like test that and like uh yeah, you can 46:42

you could basically give it all the 46:45

information you give it in 46:46

post-training just mixed into pre-training 46:47

and see what effect that has. Yeah. The 46:49

other loss you have when you do that is 46:51

you lose the flexibility like if you you 46:52

sometimes like train these and then you 46:54

talk to them and then you like do an 46:56

extensive process where a bunch of 46:58

people talk to the thing and find some 47:00

like issue. you know, the model says 47:01

like "you're absolutely right" too much 47:02

and you want to go 47:04

do that. 47:06

Yeah. Yeah. I mean that I think that 47:07

iteration loop point you made I think 47:08

feels like the really key point of yeah 47:10

there's a huge difference between taking 47:12

three months to get information about if 47:14

your model is good or bad or making 47:17

going in a good direction versus a day 47:19

or something or a couple days like you 47:21

can do a lot of those and you could 47:22

probably that probably also means it's 47:23

way less compute. You can do a lot of 47:25

those in parallel. Imagine you're trying 47:26

all sorts of post-training strategies in 47:27

parallel there. 47:28

So yeah, makes a lot of sense. It's also 47:30

just the general hard part about 47:31

pre-training, like everything in pre-training is 47:32

hard because you have this like one shot 47:33

on goal kind of for like multiple months 47:34

and 47:36

totally. Okay. So, uh in thinking too 47:36

now about I guess what's going ahead as 47:39

you as you now look to the next several 47:42

years of what you're building like how 47:44

do you think about you know like what 47:45

are the known problems that you're going 47:47

to face that you're going to have to 47:50

deal with? though there's going to be 47:51

more compute I assume and you're going 47:52

to need to hook up even bigger network 47:54

uh network GPUs and deal with versus 47:56

like are there areas where you're like 47:58

okay this is like a problem that it's 48:00

like a little bit more ambiguous what 48:02

the actual like how it's going to 48:03

materialize into something you care 48:05

about but you kind of know it's an 48:06

impending thing to think about or are 48:07

there things like that that come to mind 48:08

I think the things that feel most top of 48:10

mind to me are probably like paradigm 48:11

shifts like I think the sort of shift 48:14

towards uh more RL is like one paradigm 48:16

shift in the field and I I think it's I 48:19

think there will probably be more. Uh I 48:21

think a lot of people sort of argue 48:23

about like oh is like you know current 48:24

paradigms enough to get us to AGI and 48:25

I'm like 48:27

I don't know maybe probably but like I'm 48:28

sure there'll be more. It seems it seems 48:30

like it would be a really surprising 48:32

twist if like the answer is like you 48:34

just scale and there's nothing that you 48:37

realize in the process of going up many 48:39

orders of magnitude. 48:40

Totally. 48:42

But I think the things that I like 48:42

actually feel like most nervous about 48:43

are really hard to solve bugs. I think 48:45

that like uh 48:48

that's interesting. 48:50

Yeah. And I think this is like maybe 48:51

somewhat surprising to me, but it's just 48:52

like a single bug can like 48:54

derail you for months. 48:56

And when you think about it, like you 48:58

the models take months to train. So you 48:59

could kind of like lose a whole 49:01

generation 49:02

off of something that just looks like 49:03

odd. You know, it turns out like 49:06

this piece of your code was incorrect 49:08

and you couldn't detect it. 49:11

Uh and it's really hard in ML, right? ML 49:12

is always really hard to find bugs in. 49:14

Yeah, totally. But also some of these 49:15

scaled up issues are really hard to 49:17

solve even when you know they're there. 49:18

Yeah. Like what's even a unit test that 49:20

you would write or forget a unit test? I 49:21

mean anything close to a test for the 49:23

type of like network architecture on 49:26

which you're doing this. Like how do you 49:27

even do that? 49:29

I mean like you can send a packet over 49:30

it and confirm it's the same. 49:32

Uh you can you can train a small model 49:34

on it. Um 49:36

but even train a small model on it it's 49:37

like not obvious. You know, if you have 49:38

like the the simp the very classic like 49:40

very simple ML bug that like early 49:41

people face in their careers like okay, 49:44

they have some like they have like 10 49:45

layers in their network and like you 49:46

know layer 7 connects to nine instead of 49:48

8 to 9 and like so like there's some 49:50

incorrect like set of connections you 49:52

have there and technically the model 49:53

still trains and all the weights update 49:55

and so it's like a valid model but it's 49:56

not the correct one and that's like a 49:58

very esoteric weird bug that would 50:00

actually be kind of hard to find. Like 50:01

is is that kind of what you're referring 50:03

to of these like random bugs you face? 50:05
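
One way to picture the miswired-layer bug is a sanity check that perturbs each layer's weight and confirms the output moves. The sketch below is a toy, with hypothetical names (`forward`, `layers_without_effect`) and scalar "layers"; it assumes the miswiring leaves one layer out of the data path entirely. A miswiring in which every layer still contributes to the output would not be caught this way.

```python
import math

def forward(x, weights, wiring):
    """Apply scalar layers y = tanh(w * x) in the order given by wiring."""
    for layer in wiring:
        x = math.tanh(weights[layer] * x)
    return x

def layers_without_effect(weights, wiring, x=0.5, eps=1e-3):
    """Return layers whose weight can be nudged without changing the output."""
    baseline = forward(x, weights, wiring)
    unused = []
    for layer in sorted(weights):
        nudged = dict(weights)
        nudged[layer] += eps
        if forward(x, nudged, wiring) == baseline:
            unused.append(layer)
    return unused

# Ten layers numbered 1..10 with arbitrary, made-up weights.
weights = {n: 0.9 + 0.01 * n for n in range(1, 11)}

correct_wiring = list(range(1, 11))             # 1 -> 2 -> ... -> 10
buggy_wiring = [1, 2, 3, 4, 5, 6, 7, 9, 10]     # layer 7 feeds layer 9 directly

print(layers_without_effect(weights, correct_wiring))
print(layers_without_effect(weights, buggy_wiring))
```

Both wirings produce a finite output, which is why the bug is easy to miss; only the perturbation check reveals that layer 8 never participates.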

Yeah. Yeah, 50:06

it's that but like you know you can 50:07

times a million 50:09

times a million as the thing gets more 50:10

complicated you know you could like cast 50:11

the wrong precision deep in some kernel 50:13

and that causes your model to like blow 50:16

up at large scale 50:18

and you find out like a month in 50:19

or you never find out 50:20

or you never find out 50:21
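
The "wrong precision deep in some kernel" failure can be illustrated with a toy accumulation. This is a sketch under stated assumptions, not the bug discussed in the interview: it uses Python's standard `struct` module (format `"e"` is IEEE 754 half precision) to round a running sum to fp16 after every add, and compares it with the same sum kept in full precision. The helper names are hypothetical.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def accumulate(values, low_precision):
    """Sum values, optionally rounding the running total to fp16 after each add."""
    total = 0.0
    for v in values:
        total += v
        if low_precision:
            total = to_fp16(total)
    return total

ones = [1.0] * 4096
exact = accumulate(ones, low_precision=False)
rounded = accumulate(ones, low_precision=True)
print(exact, rounded)
```

The fp16 total stops growing once it reaches 2048, because half-precision values at that magnitude are spaced 2 apart and adding 1 rounds back down. At small sizes the two sums agree, which is the sense in which such a bug only shows up at scale.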

I mean you know like like you see the 50:22

thing blow up like 50:24

there's I don't know 10 tens of 50:25

thousands of lines of code like how 50:26

would you ever trace it down so like 50:28

those are the things that probably spook 50:29

me the most is just like some subtle 50:31

tricky bug yeah that's probably the case 50:34

of like you don't I think there's 50:36

actually also the case of you do know 50:37

like it crashes. You're training your 50:38

model and it like or it slows down. You 50:41

know, your job slows down a ton 50:43

and those things can also be very hard 50:45

to debug. Uh Nelson Elhage is one person 50:48

that he has a blog. He wrote up a blog 50:51

on one like cursed bug we had early on 50:53

and I remember this one quite well 50:55

because I think like I encountered it 50:56

fairly early and was like this looks 50:58

hard. Can someone else look at it? And 50:59

like a month later was like wow I'm so 51:01

glad I handed that one off. I never I 51:03

never would have been able to get like 51:04

like one of the abilities I think is 51:06

actually really useful this is the 51:08

ability to like deep dive anything to 51:09

any level of depth 51:10

but that's a pretty rare skill like for 51:12

me you know as I we talked about what 51:13

level of the stack I was at before I was 51:15

like working at the torch.matmul level, but 51:16

like I didn't know CUDA, so if torch.matmul 51:18

was broken it wasn't like I 51:20

could dig into torch.matmul and figure it 51:21

out and it's similarly with like 51:23

communications right like I could I 51:26

could call send send bytes from A to B 51:28

but I didn't know the like underlying 51:30

networking protocol so if that 51:32

underlying networking protocol is 51:33

broken. Uh like I need to learn a whole 51:35

field. I have to like understand packets 51:37

and TCP or like all all of these 51:39

different things to debug that. And I 51:41

think one thing that's like surprisingly 51:43

hard and there's very few people who can 51:44

do is like kind of own that whole stack 51:46

from like I understand how the ML is 51:49

supposed to work and what the learning 51:50

dynamics are all the way down to like I 51:51

know the bytes and I like can understand 51:54

how the bytes should be moving around 51:56

machines. 51:58

Totally. Yeah. And actually on that 51:58

front, like when you think about the 52:00

different backgrounds of people on your 52:01

team today, how do you like 52:02

approximately 52:04

uh map them out to different categories 52:06

of computer scientists? Like I think 52:08

there's this external view of what these 52:09

teams look like, which is that they're 52:10

like all PhD researchers who write ML 52:12

papers. And I suspect that's not 52:14

actually true given what you're 52:16

describing here. 52:17

Yeah, it's a mix. And I think the thing 52:18

we like most need is engineers. 52:19

Interesting. Almost always like 52:21

throughout like the entire history of 52:23

this field. Totally. It's like the case 52:24

that you throw more compute, the thing 52:26

kind of works. Yeah. Uh the challenge is 52:28

like actually 52:30

the researchers are like cool, nice. 52:31

Yeah. And getting it correct, like 52:32

getting it correct isn't really an ML 52:34

problem, right? Like the actual 52:35

architectures are pretty simple. You you 52:37

can write the math down. But you don't 52:39

even need to understand the math to 52:40

implement it. You just need to like get 52:41

a correct implementation and then you 52:43

sort of have an engineering problem of 52:45

how do I take this implement it at large 52:47

scale, parallelize all the things and check 52:48

that it's 52:50

correct. But it's yeah so it's like kind 52:51

of engineering skill but it's this 52:52

particular type of engineering skill 52:54

that's about being able to like debug 52:55

anything. Yeah. 52:56

Um I think there's another angle of 52:57

engineering which I think of as like 52:59

really quickly iterate on like a website 53:01

or something which I think of as an 53:03

important skill set probably important 53:05

for making a startup. You got to be like 53:06

fail fast try a bunch of different 53:07

things none of which are like 53:09

that technically difficult to do. the 53:10

skill sets that we're like most kind of 53:13

in need of or looking for are this like 53:15

able to solve really hard engineering 53:18

problems. 53:20

Are they people who worked at companies 53:21

that grew a whole bunch and so they have 53:24

experience like doing the kind of thing 53:27

you've done over the last several years 53:29

at Anthropic or do they tend to be 53:30

academics or like where do they come 53:32

from? 53:33

Yeah. So at this point like I think we 53:34

actually just hire a bunch of people who 53:35

have done this before from like other 53:36

places and that's like the easy answer. 53:38

Yeah. Yeah. But like by this before, do 53:40

you mean in AI companies necessarily or 53:42

also, you know, like someone who worked 53:44

at Meta on like their not AI team but 53:45

they ran some other distributed system 53:48

that you know reached internet scale 53:49

five you know 10 years ago or something 53:51

like that 53:53

more like we have like a specific role 53:53

in mind. So like say I'm like trying to 53:54

make the run train efficiently in JAX 53:55

like hiring someone who's like worked on 53:58

JAX would be great or someone who's 53:59

like worked at another company on 54:01

optimizing a JAX stack to be really 54:03

efficient. That's kind of like I think 54:04

now we're at the point where like 54:06

Anthropic is well enough known we can 54:08

sort of hire these people and also the 54:09

field is big enough that there's like 54:11

people with expertise. One thing that 54:12

was interesting was like early on we 54:14

hired a lot of people from just like all 54:15

sorts of backgrounds and I think that 54:16

people who are just smart and work 54:19

really hard can learn this pretty fast 54:20

but you have to like want to. We hired a 54:22

lot of physicists for instance like 54:23

theoretical physicists who just like 54:25

show up they they do a residency like 54:27

learn to program and then uh they were 54:29

really smart they could do really great 54:31

work. Um I want to switch gears uh to 54:33

talk about something a little bit 54:36

different which is just sort of future 54:37

looking things around how you think 54:38

about other domains and or sort of 54:39

advances happening in AI that I'm seeing 54:42

elsewhere in the field and you don't 54:43

have to tell me if you guys are working 54:45

on these necessarily but like how you 54:46

think about them like are I guess one 54:48

one big area I was thinking about is 54:50

around areas other than next token 54:51

prediction like are there any of the 54:54

other you know things that people are 54:55

working on that you're curious about so 54:57

basically two differences there one is 54:58

uh not using transformer as an 55:00

architecture um So there's companies 55:02

like Liquid AI that have their own kind 55:04

of architecture for example they're 55:05

using um or not using autoregressive 55:06

training as a way of training models. 55:09

Are there are any of those do you think 55:11

interesting and like ways that we might 55:13

come closer to AGI or do you think like 55:15

this autoregressive framework is the one 55:16

that kind of makes sense? 55:18

I think they're interesting. I think I 55:19

like am less like ah autoregressive is the 55:20

way to go. On the other hand, I think 55:23

autoregressive is probably good enough to 55:24

get to AGI or something or not like yeah 55:26

uh such that 55:29

yeah I I see the main driver as scale 55:31

and careful science of like sort of the 55:34

basics more than like come up with 55:36

something totally novel. 55:38

Not because there aren't novel things 55:40

that are better. I actually like I'm 55:41

pretty confident they are there. It's 55:42

just that scale is easier and it's more 55:44

reliable and I think you we're still 55:45

seeing really big gains to that. Do you 55:48

spend a lot of time on thinking about 55:50

things like you know I've been reading 55:51

some of these open source papers where 55:52

you can kind of dive into some of the 55:53

details about the model changes and with 55:54

some of these Chinese labs for example 55:56

where they're making tweaks on the order 55:58

of the architecture itself with like 56:00

better caching behavior for example or 56:02

like more efficient attention functions 56:04

that make a big difference. Do you feel 56:06

like these are examples of things like 56:07

you mentioned earlier where it's 56:09

basically in the grand scheme of things 56:10

basically if you throw more compute at 56:11

it this is all kind of a rounding error 56:13

or do you think it will take some number 56:14

of these very clever architectural 56:16

changes to actually get to AGI like in 56:18

the way that the first person who came 56:19

up with the transformer made like a 56:21

particular transform you know literally 56:23

transform transformative change like 56:24

will it take some of that or do you 56:26

think it just you keep doing the thing 56:28

we're doing to make it bigger 56:29

I think it'll be a mix I think I like my 56:30

guess is you'll keep tweaking things the 56:32

more compute you put in the more like 56:34

worthwhile it is to like do those 56:35

experiments to like figure it out the 56:38

you know I mean inference is a thing we 56:40

haven't talked about but like you also 56:41

want to serve these models to a lot of 56:43

people so there's a lot of changes you 56:44

can make to make inference cheaper and 56:46

that depends on like the details of your 56:47

inference stack and the chips you're 56:49

serving inference on etc. So 56:50

do you as a someone focused on 56:52

pre-training have to think a lot about 56:53

inference or is it kind of like you just 56:54

do your thing you make the loss go down 56:56

and then hand it off and someone else 56:57

makes that happen. Oh no. I think a ton 56:59

about inference because basically like 57:00

the problem inference is solving like we 57:02

basically determine the problem 57:04

inference is solving. We give them a 57:05

model and they have to like run that 57:06

fast and it's very easy to give them a 57:08

model that is impossible to run fast. 57:10

Oh, can you give an example of a 57:12

decision you can make that could cause 57:13

that? 57:14

I mean the simplest one's stupid but 57:15

it's like you just make the model giant. 57:16

Yeah, absolutely. Train for like a 57:18

really small number of tokens and then 57:20

inference now has this giant model 57:22

and they're hosed basically. 57:24
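
The "giant model, few tokens" failure mode can be made concrete with back-of-the-envelope arithmetic. The sketch below leans on a common rule of thumb that training a dense transformer costs roughly 6 x parameters x tokens FLOPs and generating one token costs roughly 2 x parameters FLOPs. Both the rule of thumb and the model sizes are illustrative assumptions, not figures from the conversation.

```python
def tokens_for_budget(train_flops, params):
    """Tokens affordable at a fixed budget, assuming ~6 * params * tokens training FLOPs."""
    return train_flops // (6 * params)

def inference_flops_per_token(params):
    """Approximate cost of generating one token, assuming ~2 * params FLOPs."""
    return 2 * params

BUDGET = 6 * 10**22          # hypothetical training budget in FLOPs

balanced_params = 10**10     # hypothetical 10B-parameter model
giant_params = 10**12        # hypothetical 1T-parameter model

balanced_tokens = tokens_for_budget(BUDGET, balanced_params)
giant_tokens = tokens_for_budget(BUDGET, giant_params)

cost_ratio = (inference_flops_per_token(giant_params)
              / inference_flops_per_token(balanced_params))
print(balanced_tokens, giant_tokens, cost_ratio)
```

Under these assumptions both models cost the same to train, but the giant one sees 100 times fewer tokens and costs 100 times more per generated token, so the pre-training choice directly sets the problem the inference team has to solve.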

Yeah. I mean you can also make things 57:25

require communications in a lot of 57:27

places 57:29

uh which would make it harder for 57:30

inference. Um totally 57:32

you can also just make things 57:34

complicated and like there's no 57:35

fundamental reason it's hard but there's 57:37

only so many people on the inference 57:38

team and like they have to implement it 57:40

in a bunch of places. 57:41

Yeah. 57:42

Yeah. No, so I definitely think of like 57:43

the like inference is the team that I 57:44

work the most closely with like 57:46

because we're kind of like co-designing 57:49

models to be smart and cheap. 57:51

Yeah. Interesting. particularly in a 57:53

world of like limited compute, right? 57:54

Like the sort of the bottleneck I think 57:56

to a large degree on our I mean you can 57:58

see Anthropic has rate limits constantly 58:00

and people complain about a lot and like 58:02

the reason is like 58:03

there's only so much compute we can get 58:05

on on short notice. So you like making 58:06

your inference more efficient is like 58:08

the way you can serve more users 58:10

and actually like let's say you had 100x 58:11

more compute or we somehow didn't live 58:13

in a world where compute was limited. 58:15

Does that change a ton about what you do 58:17

or is it still kind of the well you're 58:20

just going to grab all of it whatever 58:22

compute you have and keep going down the 58:24

loss curve and you kind of well you it's 58:25

like impossible to be in the world where 58:27

there is enough compute 58:28

so I think if we got like infinite 58:30

compute the challenge would be making use 58:31

of the compute right so like then you 58:32

would start to run into these issues 58:34

like oh well when one chip fail you know 58:35

like okay I'm going to throw two billion 58:37

chips around but what happens when a 58:38

chip fails so I think we would be 58:40

limited on people then it would be like 58:42

how fast can we solve the hard 58:43

engineering problems to scale up. But I 58:45

do think the change is massive and I 58:47

think people like don't realize how chip 58:48

limited AI like research is or something 58:50

right now. Like the models that everyone 58:53

uses, right? If you're using like Claude 58:55

Sonnet 4, Claude Opus 4, it's like it's 58:57

our first shot at models at that scale, 58:59

right? And like 59:01

if you think about anything like you 59:02

could do it and you could do it again, 59:04

you could do a better job. But if you 59:05

sort of imagine like 10x the compute like 59:07

you could run this every day instead of 59:09

every few months like you or 100x maybe 59:11

for that then like yeah it's just it's a 59:14

really it would be a really big change 59:16

to have a lot more compute and it's 59:17

coming right like that's like kind of 59:18

the fun part of the field is like every 59:19

year you're like oh I had no compute a 59:21

year ago then exactly how do you think 59:22

about methods like uh like discrete 59:25

diffusion like I saw there's like a 59:27

Gemini diffusion model and I think about 59:29

that in the space I used to be in where 59:30

um there's a lot of discrete diffusion 59:32

models being used in protein design for 59:33

example space where my startup was like 59:35

do you see that as a domain where 59:37

there's going to be interesting uh 59:38

advances happening? 59:40

I'll be honest like we haven't done 59:41

image generation and I think that's been 59:42

like the main use for diffusion. So I've 59:44

kind of had this on my like to-do list 59:46

of like things I should understand for a 59:48

while and like there are people in my 59:50

team who do understand it and would have 59:51

better thoughts but like I actually 59:52

don't think I understand it well enough 59:54

to know. I I do have it kind of in my 59:55

this category of like yeah 59:57

not a total par like and there's a lot 59:59

of things that aren't like a huge 00:01

paradigm shift but they're like pretty 00:02

big changes to how things run and I 00:04

expect like there are some of those that 00:06

will work um I don't know if it's 00:07

diffusion or if it's another one 00:09

obviously who knows what Anthropic will 00:10

do in the future but at least in the 00:12

near term are there things where you see 00:13

big areas where a startup can win in the 00:15

world in which Anthropic is getting you 00:17

know making their models better 00:19

year-over-year 00:20

my general read is like anything that 00:20

benefits from the model getting smarter 00:22

I think. Like on the one hand there's 00:25

like a lot you can always be like oh 00:27

yeah the if you're doing a startup like 00:29

all the AI labs are big companies 00:31

they'll be bigger than you and they 00:32

could do that thing but also like we're 00:33

all working on this general system that 00:35

covers a lot of different uses and the 00:37

the plan is to like power all the 00:40

startups to do all of the individual 00:41

work. So yeah I think like anything that 00:43

just kind of looks like oh this almost 00:45

works with current models but requires 00:48

like a bunch of work is a pretty 00:50

promising direction. Uh, I think maybe 00:51

the thing to watch out for is things 00:53

where like they work now with a huge 00:54

amount of work like to build up a 00:56

scaffold, but the next generation you're 00:58

not going to need the whole scaffold you 01:00

built up. That's I mean maybe that's 01:01

fine. I don't know. Like maybe you just 01:03

build up the business with the scaffold 01:04

and then you don't have to do any work 01:05

later and you have a business, but like I don't 01:06

know about the business side of it, but 01:08

like it does feel a little silly to put 01:09

to invest a ton in that. 01:11

Yeah, totally. 01:13

What about on the flip side? Are there 01:14

things in your training uh stack where 01:16

you're like, man, if there was a company 01:18

that solved X problem, I would totally 01:20

buy their product. 01:21

Yeah, there's like a ton. I do think 01:22

that like probably most of these like 01:24

the way I would probably structure would 01:26

be like almost like making something but 01:27

then consulting with the comp like 01:28

offering a service to companies for 01:30

free. 01:32

Particularly for like companies that are 01:33

scaling really fast, you're almost 01:35

always limited on like how many people 01:36

you can have. So if you can like 01:37

even if you could hire people to do it 01:39

yourself, actually being able to 01:40

contract someone else to do it where 01:41

like they're managing it and you know 01:43

hire all the people and like deal with 01:45

the organizational side could be useful. 01:47

I mean there's huge amount of stuff. One 01:48

that jumps to mind we talked about like 01:50

chips that do math incorrectly. Like it 01:51

would be lovely if there was some 01:54

startup that like you could just say 01:55

like, here are my chips. Confirm they're 01:57

all perfect. And if they're not, let me 01:58

know exactly what went wrong on like 02:00

what fraction of them. And like I can 02:02

tell you the math is wrong, but I 02:04

couldn't really tell. I don't really 02:05

know enough details of chips to be like 02:06

this chip failed because this particular 02:08

like low-level component was like wired 02:10

wrong or like got hit by a gamma ray. I don't 02:12

I don't know what causes it. You could 02:15

always go like a bunch deeper. I 02:16

mean, the thing I'd maybe just push 02:18

startups on is thinking a little bit 02:19

about like uh this is maybe less 02:20

technical, but just like what happens 02:23

once we get AGI and like how to make 02:24

sure that like goes well for the world 02:26

or something. Like my my expectation is 02:28

like if you actually automate 02:30

almost everything a person can do. The 02:31

amount of economic growth there is just 02:33

like truly enormous and I would think a 02:34

little more about like how do you make 02:38

this like help the world versus not. I 02:39

think there's going to be like plenty of 02:41

economic success or something as a 02:42

result of it anyway. 02:43

Yeah, absolutely. Yeah. Um last question 02:44

I want to ask you is around if you 02:46

rewind back to where we started like 10 02:48

years ago. Uh you're a student, you're 02:50

pivoting into AI from kind of economics 02:52

work you were thinking about. Um and you 02:54

know all sorts of things you probably 02:57

did in those early days had some kind of 02:58

compounding return for you as you 03:00

developed into the role you have now. 03:02

Like what advice would you give to 03:04

students as they think about uh entering 03:05

the workforce, especially today? Um 03:08

learning skills that are going to be useful 03:10

and maybe getting themselves jobs like 03:11

the ones you have right now 10 years 03:13

later? It's hard because I think the 03:14

timing is very different. Like I just 03:16

think we're like we've made we made a 03:17

lot of progress. So like what I would do 03:19

10 years ago is different from what I 03:20

would do today. 03:21

Totally. 03:22

But I think certainly if I went back 10 03:22

years ago I would be like focus on AI. 03:24

It's like the most important thing and 03:26

particularly focus on engineering which 03:28

I think felt very wouldn't have seemed 03:30

obvious to me at the time that like the 03:31

important thing was these engineering 03:33

skills and not the like math and 03:34

theoretical understanding of like you 03:37

know uh SVMs and like all the kind of 03:38

standard 03:41

ML literature. Um, I think today I would 03:42

probably focus a bunch on the like 03:44

engineering and on the like figuring out 03:46

what to do with AGI as sort of the two 03:48

like main things that feel top of mind 03:52

for me. 03:54

Let's call it there. Thanks so much, 03:54

Nick. Appreciate it. 03:55

Lyrics & Translation

[English]

[Music]

Hey guys, I'm thrilled to be joined

today by Nick Joseph, the head of

pre-training at Anthropic. To give

viewers a high-level sense of what we'll be

covering, we're going to start with the

basics of what pre-training is and then

dig into how Nick thinks about strategy,

data, alignment, and infrastructure at

Anthropic. And by the end, you'll

hopefully have a sense for how progress

in AI comes directly from advances in

pre-training. I would love to talk a

little bit about your backstory and kind

of how you got to this point. Where did

you work before Anthropic? And what were

your takeaways from those places? Yeah.

So let's see. I was at Vicarious uh and

then at OpenAI uh before Anthropic. So

Vicarious was originally an AGI lab and

sort of when I joined they were sort of

making a shift to product particularly

working on robotics products and the

thing I worked on was like training uh

computer vision models for for their

robotics products. It was my first job.

So I think I just like learned a ton

about like how to do machine learning

models, how to like write machine

learning infrastructure.

And at the time were you also thinking

about a career as an academic? Like at

the time a lot of people doing AI work

were in PhDs. That's kind of what I was

thinking about before I started to do a

company. Like how were you thinking

about that in your headspace?

Yeah. So like I'll actually rewind a

little bit. I think like a lot of my

thinking on this had come from an

internship I did at GiveWell, which is

like a nonprofit that evaluates

charities. And some people there being

like ah we're at some point we might

have AGI. It could be dangerous. We

should worry about these risks. This

could be like a big impact on humanity.

And I was like not super convinced at

the time and went down the economics

route and was going to try to work on

like directly helping people in poverty.

that didn't work out for various reasons

and ended up being like okay I'll at

least work on AI either like the safety

thing will turn out to be important I'll

work on that or it won't be and I'll

just make cool things with AI that can

probably help people in poverty more

I wasn't really coming at it from an

academic standpoint I was sort of like

in fact when I switched to that it was

part of the appeal was that I could like

immediately go do stuff in AI whereas if

I want to work in like economic policy

I'd have to wait

I don't know six years to do a PhD and

start and like totally uh it's it's a

longer path

And what did the state of AI safety

work at that time even look like? Like

who are the people who were thinking

about that kind of stuff? I mean there

were some folks at Vicarious thinking

about this kind of thing but it was

fundamentally a robotics company and and

so yeah how how were you thinking about

that at the time?

Yeah. So my sense was like at the time a

lot of the AI safety discussion was kind

of theoretical like the models weren't

actually that good. They weren't really

posing these dangers. So it was a lot

more like philosophical like oh at some

point we might get AI that's really

smart smarter than humans and like

should we weigh this like future concern,

how should we compare that to near-term

things? And I think that was like

actually a just a less compelling

argument. I think it was like an

interesting one and like sort of made

you think of it.

So next you went to OpenAI. What was

OpenAI like at this time?

Yeah. So I was at OpenAI. I was on one

of the safety teams and kind of worked

on uh

I ended up working on code models

actually and kind of when I got there I

could the the first thing I saw was oh

they'd fine-tuned GPT-3 to write some code

but and it was really good and I was

like oh okay if you're worried about AI

getting really powerful writing its own

code that seems

seems like it could self-improve and how

how likely is that to happen? So I was

doing a bunch of evaluations and like

studies of what contributed

and then after like uh eight months uh

basically everyone I worked with like

all all the safety leads left

which uh yeah invited me to go to

Anthropic and that was sort of the

reason I joined OpenAI was because I

cared about AI safety and wanted to work

with them. So then I went with them to

join Anthropic uh pretty much right when

it started.

With that why don't we transition a bit

these days you run the pre-training team

specifically at Anthropic. Um, obviously

you've been working on pre-training at

Anthropic for quite a bit of time and

I'm sure it's evolved over the years,

what that even entails and looks like.

Why don't we start by just talking a

little bit about what pre-training

is? Like, how does it even fit into the

way of thinking about how AI models have

developed at a place like Anthropic? And

what exactly do you guys do?

We know that one of the ingredients to

making AI models better is scale. You

want to put a lot of compute in. And if

you sort of step back and you're like,

okay, what's the way we could put the

most compute into a into a model

possible? We need some objective that

there's just like tons of data for. And

one idea here is like the internet. The

internet is massive. It's probably the

biggest like single source of data

humanity has created. And you don't have

labels. It's like you don't want someone

to have to go in and look read the

entire internet and like say something

about it. So you want to get labels out

of the data itself.

And the idea here is we can take some

text and we can predict the next word.

So you take, you know, "the" as the first

word, you predict the second word, then

you say "the cat" and you predict the word

after that. And this means you get very

dense signal. Every every word is like a

new example. And there's a huge amount

of data and one of the findings from

GPT-1 and GPT-2 was kind of: as you throw more

compute at this more data bigger models

uh you get better you you get smarter

models essentially.
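
A minimal sketch of how next-word prediction gets its labels out of the text itself. The whitespace tokenizer and the function name `next_word_examples` are illustrative assumptions, not something described in the interview; real systems work on subword tokens.

```python
def next_word_examples(text):
    """Turn raw text into (context, next_word) training pairs.

    Assumption: whitespace splitting stands in for a real tokenizer.
    """
    words = text.split()
    # Every position after the first yields one example: the label is
    # simply the word that actually comes next in the text.
    return [(words[:i], words[i]) for i in range(1, len(words))]


examples = next_word_examples("the cat sat on the mat")
for context, target in examples:
    print(" ".join(context), "->", target)
```

A six-word sentence gives five training examples with no human labeling, which is the "dense signal" point made above.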

Totally.

Um and that's kind of been the central

thesis of pre-training for the whole

time.

Uh there's this idea of scaling laws

which is that you can actually quantify

like as you put in more compute more

more data more parameters you get models

in a very you get a lower loss a better

prediction of the next word in a very

predictable way. And I think you can

somewhat foresee from that original

paper and I think like Dario did foresee

this I think many people did but wasn't

obvious was that once you have that

there's this positive feedback loop

where you can train a model you can use

it to make something useful and sell

that and get more money use that to buy

more compute and then you actually train

a better model and I we've sort of run

that cycle over and over again over the

past 5 years or so. Well, in thinking

about that objective to begin, you know,

I think the way I think about the state

of pre-training is yeah, it seems like

this next word prediction, at least from

the external standpoint, seems to be the

dominant way pre-training happens. But

if I rewind the clock to that era of

2017 to 2020, or 2021 and 2022 even, there

was all sorts of pre-training objectives

people were considering, right? There

was these uh BERT and BART models that

were doing masked language modeling. It

seems like this GPT series of models

doing uh autoregressive modeling as

you describe this next word prediction

seems to be the dominant one that won

out. Do you have any reflections on that

time period? Like were you guys trying

all of them and kind of this one worked

or or is there some sort of first

principles reason why this is like the

right one that should have worked?

I think the answer is like it's mostly

empirical like in terms of how to think

of the things I'd be like yeah it's

empirical just try them all see what

works. One big advantage for this

autoregressive setup is that you can just

sample from it to generate text

afterwards in a fairly like

straightforward way that comes

like enables a product use very nicely.
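
The sampling loop described here can be sketched roughly as follows. The lookup table `toy_model` and the helper `sample_text` are hypothetical stand-ins: a real language model conditions on the whole context, not just the previous word.

```python
import random


def sample_text(next_word_probs, start, max_words, rng):
    """Generate text one word at a time from a next-word distribution.

    Assumption: `next_word_probs` is a toy table keyed on the previous word
    only; it is an illustration, not how a production model is queried.
    """
    words = [start]
    for _ in range(max_words - 1):
        choices = next_word_probs.get(words[-1])
        if not choices:  # no known continuation, so stop early
            break
        candidates = list(choices)
        weights = [choices[w] for w in candidates]
        # Sample the next word, append it, and feed it back in as context.
        words.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(words)


toy_model = {
    "the": {"cat": 0.5, "mat": 0.5},
    "cat": {"sat": 1.0},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
}
generated = sample_text(toy_model, "the", max_words=6, rng=random.Random(0))
print(generated)
```

The same predict-then-append loop is what makes an autoregressive model directly usable for generation, without a separate fine-tuning step per task.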

Um like one thing that you want is like

one characteristic is like a loss where as you

drive down the loss that actually is the

thing you care about and you can think

of it as like if you got to perfect on

language modeling you now can like write

text as a human. You can sort of imagine

you put in the title of a paper and it

should spit out the entire spit out a

novel paper. Whereas I think some of the

other approaches don't quite have that

uh flavor.

Yeah, totally. Yeah, it makes sense that

in terms of that loop you're describing

of, you know, then release something

that gets you revenue and you can use

that to buy more compute and iterate.

This sort of gives you the most natural

way to actually do that flow because you

can keep releasing new products and keep

getting the revenue from that to invest

in more compute and so on.

Yeah, it certainly gives you the most

open-ended thing. You could imagine, you

know, you like train something as a

class like you train some base thing,

you fine tune it for a bunch of

particular tasks. one approach people

would use. They would like do this big

pre-training and then they wouldn't just

like open-endedly sample from it. You'd

fine tune it on like a hundred specific

tasks and that could work too. I think

that like the one sort of general

intuition I have is like compute is the

thing that matters. So like I think if

you throw enough compute at any of these

objectives, you're going to get

something that's probably pretty good uh

and can kind of be fine tuned to other

things. And it's it's surprising how

little these details matter compared to

throwing more compute. When you think

about actually throwing more compute,

there's a whole bunch of axes by which

you could throw compute at it too,

right? And if you have a specific model

architecture you're training over, you

can basically throw more data at that

specific architecture. For a particular

one, you could add more layers or make

the models larger in it. You could do

some kind of neural architecture search

over lots of different variants. And I

assume that these days it's somewhat

more figured out, you know, which

architecture you go for. I assume the

earlier days it was somewhat less so.

And and I'm curious if you could speak

to how you guys thought about that. like

what did your infrastructure even look

like to do that type of determination?

I mean, I think the the short answer is

it's hard, right? Like what you're

really doing is you're going to train

this one big expensive model and you

have a space of, you know, you can sort

of call all these things

hyperparameters. You know, how many

layers do you have? What's your width?

Like you have the space of hundreds of

hyperparameters and you want them all to

be optimal and you're sort of striking

this balance actually between how much

do they matter like can you just take

your best guess and throw more compute

at it in whatever way you want and

it basically doesn't matter, versus how much you

want to get it precisely correct.

Interesting.

And I think one of the like interesting

things is like it actually doesn't

matter that much. Like we like I think

this was in one of the early scaling

laws papers like you can change these

things and get little wins but like as

you throw more compute it it sort of

reliably gets better. If you mess up

enough you will sort of stop

seeing that happen and you won't have

any way to know which is one of the

that's like kind of the hardest part in

some ways.

You don't know the counterfactual

basically because you didn't run it for

long enough to actually know what it is.

Yeah. We have these scaling laws. So you

can sort of say like as you train more

compute you expect the loss to go down as

a power law.

It's really a power law plus constant.
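
As a rough illustration of "power law plus constant", here is a small sketch. The constants `SCALE`, `EXPONENT`, and `FLOOR` are made up for illustration and are not fitted values from any real model.

```python
def predicted_loss(compute, scale, exponent, irreducible):
    """Loss modeled as a power law in compute plus a constant floor.

    Assumption: all three constants are hypothetical illustration values.
    """
    return scale * compute ** (-exponent) + irreducible


SCALE, EXPONENT, FLOOR = 10.0, 0.1, 1.5

for compute in (1e18, 1e19, 1e20, 1e21):
    loss = predicted_loss(compute, SCALE, EXPONENT, FLOOR)
    # After subtracting the floor, each 10x in compute multiplies the
    # reducible part of the loss by the same factor, 10 ** -EXPONENT.
    print(f"{compute:.0e} FLOPs -> loss {loss:.4f}, reducible {loss - FLOOR:.4f}")

# Ratio of reducible loss between two runs that are 10x apart in compute.
ratio = (predicted_loss(1e20, SCALE, EXPONENT, FLOOR) - FLOOR) / (
    predicted_loss(1e19, SCALE, EXPONENT, FLOOR) - FLOOR
)
print("reducible-loss ratio per 10x compute:", ratio)
```

A run whose measured loss sits noticeably above a curve like this at large compute is what "curving off the power law" would look like; the curve alone does not say whether the cause is fundamental or a mis-set hyperparameter.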

So what eventually will happen is you'll

curve off that power law and then you

know something is wrong and is it

fundamental? Is it like you've hit the

limits of scaling or is it nope you

should have, you should have tweaked

your learning rate slightly differently

and that's that's sort of one of the

challenges in terms of how to like

figure it out. You can the the usual

paradigm is like test things out at

small scale before running them at large

scale and try to find

small scale in terms of data or in terms

of something else? uh in terms of

everything like you kind of want to

scale things down like proportionally.

So you want to say like you want you

want to have some theory for like how

you're going to scale up like ah okay if

I get 10 times as many flops how much of

it goes into layers how much of it goes

into data how much of it goes into

attention

and you sort of get that theory and then

test that it's optimal a bunch with like

scaling everything down proportionally
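
One way to picture "a theory for how you scale up" is a rule that splits any change in compute between model size and data. The sketch below is an illustration under stated assumptions, not the recipe used at Anthropic: training compute is approximated by the common rule of thumb 6 * parameters * tokens, and `param_share` is a hypothetical exponent.

```python
def scale_config(base_params, base_tokens, compute_multiplier, param_share=0.5):
    """Split a change in compute budget between model size and data.

    Assumptions (not from the interview): compute is approximated as
    6 * params * tokens, and `param_share` is a hypothetical exponent saying
    what fraction of extra compute (in log space) goes to parameters.
    """
    params = base_params * compute_multiplier ** param_share
    tokens = base_tokens * compute_multiplier ** (1.0 - param_share)
    return params, tokens, 6.0 * params * tokens


base = scale_config(1e9, 2e10, 1.0)
small = scale_config(1e9, 2e10, 0.01)  # 100x less compute for cheap tests
large = scale_config(1e9, 2e10, 10.0)  # 10x more compute for the big run

for name, (params, tokens, flops) in (("small", small), ("base", base), ("large", large)):
    print(f"{name:>5}: {params:.2e} params, {tokens:.2e} tokens, {flops:.2e} FLOPs")
```

Because the same rule produces both the small and the large configuration, experiments at the small scale test the rule itself rather than an unrelated setup.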

and and just so I can think about what

this actually looks like in those in

those early days of Anthropic you know

you're a team of like 10 or something

like that in those very early days or 12

maybe what actually is your ability to

use large scale infrastructure as like a

relatively nimble startup at that time.

I mean a startup that was well

capitalized but still not actually that

many people working at. What kind of

infrastructure did you have access to to

train these early models at the time? So

actually one of the wild things was it

at least I mean you don't know what

anyone else is doing of course but it

kind of felt like we were like at the

frontier of it and there just weren't

that many people who cared like I was

sort of coming you know I was coming at

it from like we're making AGI this is

the most important technology ever and

then would kind of like look around and

be like and it seems like I'm one of 30

people who were working on this in like

the world. I mean I was kind of like

junior person. Everyone else sort of

knew how to do this and had done it

before but I was kind of surprised at

how easy it was. Um like the public

estimates for GPT-3 I remember were that

it cost $5 million to train which you're

like on the one hand five million is

kind of a lot but it's like a lot for an

individual person. It's not really a lot

from like a company perspective. So we

could totally buy like compute that was

enough to train models like that you

could

and were you using a cloud provider or

or did you have a custom setup somewhere

or did you literally have racks in a

room somewhere that you were you know

bought a bunch of Nvidia GPUs and you

were doing it? Uh, we were using a cloud

provider, but I think it's kind of it's

not actually that different because one

of the things that was surprising to

me is you actually have to understand

the the literal layout. Like uh I

remember at one point uh one of my

co-workers running a clustering

algorithm to identify what rooms all the

chips were in since we we had a

hypothesis that they were in different

rooms and that was causing like or you

know different buildings, some sort of

network latency and you can kind of

figure it out. you could like reverse

engineer like ah okay yeah there's

clearly like two clusters here that are

connected better and there's some issue

on the connection between them like

you're we're trying to push the limits

of of the hardware like as much as

possible
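
The room-detection anecdote can be sketched as grouping chips by measured pairwise latency. Everything here is an assumption for illustration: the latency numbers are invented, and a simple threshold with connected components stands in for whatever clustering algorithm was actually run.

```python
def group_by_latency(latency_us, threshold_us):
    """Group chips whose pairwise latency is below a threshold.

    Assumption: threshold plus connected components is a stand-in for the
    real clustering method, which the interview does not specify.
    """
    n = len(latency_us)
    group_of = [None] * n
    groups = []
    for start in range(n):
        if group_of[start] is not None:
            continue
        # Flood-fill over "fast" links to collect one well-connected group.
        members, frontier = [], [start]
        group_of[start] = len(groups)
        while frontier:
            chip = frontier.pop()
            members.append(chip)
            for other in range(n):
                if group_of[other] is None and latency_us[chip][other] < threshold_us:
                    group_of[other] = len(groups)
                    frontier.append(other)
        groups.append(sorted(members))
    return groups


# Hypothetical measurements in microseconds between five chips.
latency_us = [
    [0, 2, 3, 40, 41],
    [2, 0, 2, 39, 42],
    [3, 2, 0, 41, 40],
    [40, 39, 41, 0, 2],
    [41, 42, 40, 2, 0],
]
rooms = group_by_latency(latency_us, threshold_us=10)
print(rooms)
```

With these made-up numbers the chips fall into two groups, matching the "two clusters that are connected better" picture in the anecdote.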

um particularly at the beginning when we

were kind of like we have way less

funding than everyone else we have to

and and most people weren't very

efficient with the compute so we were

like ah we get a big lead by being

really efficient at at how how we use

the compute.

could you talk a little bit about some

of the things you guys did in those

early days for how to get the most out

of the hardware I think it's really

interesting like I think back to the

days of the early days of Google for

example where there's the there's these

cases where they basically bought

relatively cheap consumer chips and then

they optimized the software to make it

so you can actually get the most bang

for your buck out of them and that's how

they had all this high latency or low

latency high availability stuff. I'm

kind of curious if there's some analog

in the early AI era to that.

I think for us it was largely about like

getting the distributed framework right

so like we're training on in order to

train you have to train them on a large

number of chips

and there's a bunch of different

approaches to to how to do this. There's

like data parallelism and there's

pipelining, there's sharding, and like

getting all of the... At the time there

were no like great open source packages

you could just grab and use that just

worked for this. I mean today there's

somewhat more of these but at the time I

assume there was literally none.

There were some like I actually remember

that we were working on data parallelism

early on and someone was like and now we

write the all-reduce. I was like, we

really do this ourselves? Isn't there like a

package? And this was kind of like well

we're going to want to modify it right

like oh like we don't want to outsource

this to some package because a we're

about to go to a bigger scale like

PyTorch, they had a package for

doing this but we were going to go to a

bigger scale than Facebook had been to

and you don't want to have a dependency

on a package uh that you're going to

have to be like constantly modifying

essentially
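
For readers unfamiliar with the term, an all-reduce in data parallelism combines the gradients computed on every worker and leaves each worker with the same result. The sketch below is deliberately naive and is not the implementation discussed in the interview: plain Python lists stand in for tensors on separate chips, and the function name `all_reduce_mean` is hypothetical.

```python
def all_reduce_mean(per_worker_grads):
    """Average gradients across workers and give every worker the result.

    Assumption: this "gather then broadcast" version ignores the
    communication patterns (rings, trees) that make real all-reduce fast.
    """
    num_workers = len(per_worker_grads)
    # Element-wise sum over workers, then divide to get the mean gradient.
    mean = [sum(values) / num_workers for values in zip(*per_worker_grads)]
    # Every worker ends up holding an identical copy.
    return [list(mean) for _ in range(num_workers)]


# Each worker computed gradients on its own slice of the batch.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
synced = all_reduce_mean(grads)
print(synced)
```

The arithmetic is trivial; the engineering difficulty at scale lies in how the data moves between chips, which is the part a team may want to control directly.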

that's it's such a counterintuitive

sentence there too like we're going to a

bigger scale than Facebook will because

at the time Facebook AI research was

considered one of the best places to do

machine learning research. Like FAIR was

one of the places. FAIR and DeepMind were

hiring lots of people out of PhD

programs and doing lots of things like

what was your headspace when you were

like okay this this very established lab

with great people and whatnot we are

operating on a scale that is not

relevant to them like was that natural

and obvious to you or were there times

where you kind of doubted the decisions

you were making in that situation

I think it was surprising I will maybe

I'm just too arrogant or something I

kind of looked around and was like what

are these people doing they're all

missing the like big picture here like I

I I think the scaling laws were pretty

clear like and the arguments against I

just thought were kind of nonsensical

like you know the scaling I think the

original scaling laws paper had like 11

orders of magnitude and there was like

this intense debate on whether it would

continue for like another point and I

was like

like it seems it seems like one over 11

is maybe your chance it fails here and

then like you know sometimes it doesn't

work like sometimes it just works

straightforward you like train the model

you're like oh yeah of course but yeah I

do think that it was it maybe felt

obvious when you're in that headspace

and you're working on this all the time

and you're making those plots and I

think these things feel pretty different

when you're on the outside. You know,

there's a huge space of papers. Everyone

tries to make their paper sound like

very robust and and important. I I could

see I could see being like, "Oh yeah,

this is not really a thing."

Totally.

But also different labs had different

cultures. So like I think one of the

things at FAIR was it was a very

more PhD style independent research.

People have their own ideas, pursue

those.

You're fighting for your compute and so

on.

Yeah. And to do a project like training

a large language model requires a lot of

people to collaborate on like a really

complicated piece of infrastructure that

isn't going to be a paper, right? Like

you're not going to publish like, oh, I

got a slightly I got 5% more efficiency

totally

than the next one. Um, and it's not

respected in like those cultures

necessarily. So that might have been

part of it.

Okay. Okay. So then when you actually

implement these these models, you're

saying you're using a level of low-level

programming where you know you're using

libraries like PyTorch, but you're

perhaps not using everything right out

of the box from PyTorch because there's

things you guys want to customize that

are at the level of basically one level

of abstraction below them, but not

necessarily at the level of abstraction

of you know writing custom CUDA kernels

or or like was that also in in the space

where you guys were thinking about?

So it depends on like the operation. So

like I think I was mostly operating at

the level of like torch.matmul, you know,

like uh yes where does a matmul go but

not thinking like how do you make the

matmul efficient like I assume torch

figured out how to make a matmul as

efficient as is possible but there are

some pieces like attention where there

was just kind of a lot of different

variants and attention is really

complicated and hard to make efficient

on a GPU and those things you have to

kind of go go more levels down on the

stack. Uh I think there was like a

process that is maybe interesting that

I'd never really like thought of before

of like how to do it which is sort of

like modeling out the thing

you're going to do coming up with a

strategy for how to parallelize it that

like can get to a really good efficiency

you know like

so you're thinking about MFU basically

like your utilization on your GPU. So

there's like a goal utilization you're

trying to get at and a strategy to get

to there. You're saying

yeah and I think like one of the things

you can do is you can actually like

pencil and paper math out what

efficiency you're going to be able to

get to. Right. you know all the

constraints. MFU is model FLOPs

utilization but like the reason you

don't get good MFU is you end up limited

on HBM bandwidth you end up limited on I

don't know as host to like CPU offload

there's a bunch of different pieces but

it but there's not that many pieces

there's like six relevant numbers there

so you can totally model it out

understand what the constraints are and

then implement something that can get

there it of course will be really

inefficient when you implement it and

then the next step is like pulling out a

profiler so you want to be able to

profile the job look how long every

operation takes. Have a model in your

mind of how long every operation should

take and then make those two things the

same.
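
The "pencil and paper" step can be approximated with a very small model. This is a sketch under stated assumptions: it uses a two-term roofline estimate with hypothetical hardware numbers, and it covers only two of the roughly six constraints mentioned (peak FLOPs and HBM bandwidth).

```python
def step_time_estimate(flops, bytes_moved, peak_flops, mem_bandwidth):
    """Lower bound on step time: limited by compute or by memory traffic.

    Assumption: a simple roofline model; network and host transfers are
    left out even though they matter in practice.
    """
    compute_time = flops / peak_flops
    memory_time = bytes_moved / mem_bandwidth
    return max(compute_time, memory_time)


def mfu(flops, measured_seconds, peak_flops):
    """Fraction of peak FLOPs actually achieved over a measured interval."""
    return flops / (measured_seconds * peak_flops)


PEAK_FLOPS = 1e15      # hypothetical chip: 1 PFLOP/s
HBM_BANDWIDTH = 2e12   # hypothetical chip: 2 TB/s

best_case = step_time_estimate(4e15, 4e12, PEAK_FLOPS, HBM_BANDWIDTH)
achieved = mfu(4e15, 8.0, PEAK_FLOPS)
print("best-case step time:", best_case, "s")
print("MFU if the step really takes 8 s:", achieved)
```

The gap between the modeled best case and the measured time is what the profiler is then used to explain, operation by operation.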

And were there good out-of-the-box

profilers you could use at that time or

did you guys have you know because

people weren't operating on the kind of

network topologies you guys may have

been using. Did you have to write your

own profilers basically to do this type

of you know multi-node optimization?

Yeah, it depends when. It's actually getting

better with time. The PyTorch profiler

was like pretty good actually throughout

for a single GPU. If you want to like

profile a GPU, the PyTorch profiler would

work. But if you wanted to profile a job

on hundreds, thousands of GPUs, that

like hadn't really been done much. And

then that was kind of more of us like

hacking into the profiler to figure out

how to combine all the traces together.
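
Combining traces from many GPUs amounts to merging per-device event lists into one timeline. The sketch below assumes a made-up trace format of `(start_us, name)` tuples per rank; it is not the PyTorch profiler's own format, and `merge_traces` is a hypothetical helper.

```python
import heapq


def merge_traces(traces_by_rank):
    """Merge per-GPU event lists into one timeline ordered by start time.

    Assumption: each trace is already sorted by time and uses a hypothetical
    (start_us, name) tuple format chosen for this illustration.
    """
    tagged = (
        [(start, rank, name) for start, name in events]
        for rank, events in sorted(traces_by_rank.items())
    )
    # heapq.merge lazily interleaves the already-sorted per-rank lists.
    return list(heapq.merge(*tagged))


traces = {
    0: [(0, "matmul"), (50, "all_reduce")],
    1: [(5, "matmul"), (70, "all_reduce")],
}
timeline = merge_traces(traces)
for start, rank, name in timeline:
    print(f"{start:>4} us  rank {rank}  {name}")
```

In this invented example, rank 1 reaches its all-reduce 20 microseconds after rank 0, the kind of straggler effect that only shows up once traces are viewed side by side.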

And then one more question on that

earlier is, you know, you had mentioned,

you know, you hadn't really done a lot

of this work before maybe some time at

OpenAI and those early days in

Anthropic. How did you actually go learn

all this stuff? Like what was your

process for learning about those six

things that were relevant to bandwidth

limitations and whatnot?

I mean, so when I joined Anthropic, one

really nice thing was there just wasn't

that much. I think my first day I read

through our entire uh all all of Slack

and the entire like internal database

and learned a bunch from that. Like it

was kind of nice to just be like

everything is relevant to me. And then I

mostly learned from pair programming.

Like uh Tom Brown had done all this

before. So he kind of like knew all the

stuff quite well. Sam McCandlish, my manager,

had also done a lot of it before and I

just like paired with them a huge amount

at the beginning. And I think one of the

things I really like about pairing as a

way of learning is you learn the like

thing you're trying to do. Like you will

learn that like if you're pairing with

someone better than you, they can just

do it. So you're mostly just watching

them. But you also learn how people do

it. So something like how to use a

profiler is not something you would ever

learn from seeing someone's like final

write up on Slack for their PR. You

would just be like, "Oh, they found

these. They changed this specific line

and it's a win." They

like you need to watch like a YouTube

video for 4 hours of someone messing

around with a profiler to like maybe

self teach it or something or to

actually pair with someone is basically

the best you can do.

I think there was like one thing that I

I think is embarrassing now that I look

back is I'd never actually used a

debugger before joining Anthropic.

People talk about it, pdb, like yeah

that's a thing people use but print

seems fine for me.

Then I like watch like oh no a debugger

is a super useful tool. this person's

way faster at debugging things

particularly if it takes a long time to

start up the code which they can and

yeah learn learning that sort of thing I

think comes best from pairing and then

there's of course the obvious you just

learn by doing you know I eventually did

like spin up a profiler and stare at it for

many many hours

totally yeah exactly yeah okay so so

then that was sort of the very early era

over time obviously pre-training has

become bigger and bigger as you're

describing scaling I imagine you're

using many x more GPUs much more compute

over time I'd be really curious to hear

first at a high level What do you feel

has changed about the pre-training

strategy that you could talk about?

Obviously, there's more compute, but

what does that actually mean to have

more compute in terms of what you think

about differently from those early days

versus now?

I'll start with the things that haven't changed

cuz I think it is like shocking how in

some ways like

I think I'm still pushing down the exact

same metric that I was on like day one.

like there's like some loss function

loss go down and I think you could like

look at some like you could probably run

the original the first model I trained

on the same metric and just like make a

plot of like progress of the team over

time. Uh so that's all the same. I think

the

one OKR is like one thing that matters

basically. Yeah, totally.

And like I mean talking about like OKRs

at every sized company you're like, oh,

should you do OKRs and it's always felt

a little bit funny for uh a team like

where I'm like sure I can just pick a

loss value but like the answer is like

as low as possible. We will continue to

work on that forever.

I think the biggest things that have

changed has been a little more

specialization. Like I think at the

beginning, I mean the first like 3 or 6

months I tried to read every PR in the

codebase and that was great. I knew all

the pieces etc. And as you grow, it's

kind of everything gets like a little

more precise. You know, people really

dial in exactly how attention should

work, let's say, or you know, really

dial in like uh the parallelism

strategy. and you end up with a team

where it's a bunch of people who are

like deep experts on individual things

which is great because it means you can

go you can go really deep on those

things but sometimes you uh at least for

me as a manager one of the things you

sometimes have to think about is like

making sure the bigger picture makes

sense and also that you have enough

people who actually do understand the

whole bigger picture that there's no

like single point of failure.

Yeah, it's interesting you you frame it

in that with that trade-off, right?

Because as as you were describing that I

was trying to think, you know, is this a

bug or a feature? like there there's

some obvious features of it which is you

get expertise and you can optimize

certain things but I imagine your

ability to take bigger swings becomes

more complicated if not everyone's

exactly pointed in the same direction

like how do you wrestle with that now

yeah I think I mostly just try to get a

balance of people I think one of the

challenges early

people oh that's interesting

yeah like I think people really do have

a preference here has been one of the

things I've seen like there are people

who really want to be a generalist and

understand everything and like lightly

touch on things there people who want to

like

pick an area often they've already

picked that area and they're like deep

experts in precision. You know they

started they did a whole PhD in

precision and just want to think about

that

and you want to get some balance of

that. I think early there was a phase

where we'd hired a lot of people who are

more generalist shaped because that's

what the people who joined totally early

startup where they go work on everything

and then

you ended up with kind of everyone doing

everything and no one really really

deeply understanding one thing. uh and

that's one failure mode but I think if

you get too many people who are

specialists

you end up with a lot of effort has to

come from the manager from like the lead

to connect everything

and to notice something like ah if we

change the architecture here that would

make this like efficiency consideration

over there way easier

um one of the things I really liked kind

of like at the very beginning was like

let's work on efficiency but I could

just go and like be like ah well what if

we change the way we do like this

particular step and we'll be like oh

yeah that's probably fine like easy

change and then like you can avoid

this whole complicated project to make

this operation that was hard efficient

because you can make an easier operation

efficient.

Okay. Interesting. Yeah. So, as the

level of compute has also gotten bigger.

So, I'm I'm sure anyone can imagine,

okay, there's more GPUs now, you have to

network them more. Are there some like

kind of non-obvious challenges that have

arisen over time where you guys have

just like banged your head against the

wall to solve them because of the amount

of compute you're dealing with that

people wouldn't otherwise know about

that like you want to share? I think

that connecting them is one that's maybe

interesting and like surprisingly hard.

Okay. because you really do get more and

more chips connected and

like one thing that I think is like the

the standard way people parallelize across chips

is, um, the whole thing is one failure

domain like one chip fails the whole

thing can crash

and

the standard way as in the standard way

people doing AI or the standard way in

in other fields where people are doing

uh in AI for like I mean at least like I

think at the beginning you know first

versions of things were this way and

so it's like you have a 100 GPU cluster

or whatever is 128 like if one of them

dies job fails basically

yeah I mean you The simplest thing is if

you just like distribute your model. So

say you put like every layer on a

different chip and you lose like layer

seven like

yeah you're not going to like skip layer

seven. I guess you could but that's like

a pretty weird model training process

now and like that leads to some

interesting things which is like okay so

now as you scale up you have more and

more chips and the failure rate can get

like larger and larger.
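
A rough way to see why the failure rate grows with cluster size: if any single chip failure kills the job, the chance that the job fails is the chance that at least one chip fails. The sketch below assumes chips fail independently; the function name and the per-chip probability are hypothetical illustrations, not figures from the conversation.

```python
def job_failure_probability(num_chips: int, per_chip_failure_prob: float) -> float:
    """Probability that at least one of num_chips fails.

    Assumes failures are independent and that one failure crashes the whole job
    (a single failure domain), which is a simplification.
    """
    return 1.0 - (1.0 - per_chip_failure_prob) ** num_chips


if __name__ == "__main__":
    p = 0.001  # hypothetical chance that a given chip fails during one run
    for n in (128, 1024, 8192):
        print(f"{n} chips: job failure probability {job_failure_probability(n, p):.3f}")
```

Under this model the per-chip reliability stays the same, but the job-level failure probability climbs toward 1 as chips are added, which is why fast restarts matter more at scale.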

On the other hand you can like I don't

know you can like restart pretty quickly

there. There's nothing like you just

have to like load back in some ways. So

that was one thing. And then the other thing

was like the level of novelty at the

whole stack is something that's

surprising. Like basically

everything from like how the chips are

laid out in the data center to the chips

themselves is pretty new. Like there

there just haven't been that many

generations of GPUs. I think one of the

things that I don't know when I learned

computer science my code wouldn't work

and I'd be like oh the computer's

broken. I think my teacher was like the

you can trust the computer's not broken

like you messed up.

It's you messed up. And I think one of

the most frustrating things I

encountered in AI early on was working

on something and being like, I don't

know what I'm doing wrong. I'm just

totally stumped. And uh my manager

looked at it and was like, uh yeah,

probably the computer's wrong.

And I was like, that seems unlikely. And

sure enough, the computer was wrong.

Turned out that like the GPU was broken

and uh we had to pull in a new one. But

you have to like think like having to

think about that like the GPU could be

wrong, the GPU could be slow, like these

sorts of issues. Uh the power supply in

the data center could be broken. there's

so much more like level of depth than

you like kind of expect to need as a

Python programmer.

And just to visualize it like in those

early days, I assume you guys were using

the number of GPUs. It's probably on the

order of tens to hundreds or something

like that per run. It's probably not

tens of thousands or hundreds of

thousands per run or what was the rough

size you guys were at? Those are very

early days on the order of thousands.

Like would they fit in this room?

Thousands.

Yeah, thousands. So like you could have

a bunch of racks and you could fit them

into like one room. I assume these days

it's basically like a building for for

one of these runs.

Yeah. Now I think it's like you know

huge huge campuses. At the time it was

like kind of unclear. It was like oh I

think like we were like you know do we

need them all in one room? Can we be

spread across multiple rooms? Like uh

and you know we had these theoretical

models you be like we need this much

bandwidth from point A to point B. But

you like you never know how far down you

have to go like oh but like how much

power do we need? Like what if there's

like a single capacitor that's like

handling all of them and we like turn on

the whole job at once. Like does that

crash things?

Yeah. And so do you have to think about

differences in the different types of

chips? You guys work with all sorts of

different cloud providers. From your

standpoint, are these just sources of

compute or if you guys are using TPU

versus GPU, are these, you know, Google

TPU versus Nvidia GPU? Do you actually

have to think as an engineer differently

about what it means to train on these

two?

Yeah. So, I mean, fundamentally, they're

all they're all doing the same thing,

right? They're all computing the same

operations, matrix multiplications,

etc. The way they do it is pretty

different, and the way that you program

them is is pretty different. Uh and then

also the actual specs uh end up pretty

different. You know, some some might

have like a lot of flops and not very

much memory or they might have a lot of

memory bandwidth but not very much

memory. So I think a lot of having

multiple chips is like great in some

ways. It means you can actually like

take the job and put it on the chip that

it works best on and that's

like are there certain types of jobs

that would work better on like a TPU

cluster versus an Nvidia GPU cluster?

Like how would you talk about that? Oh,

interesting. Can you talk about that?

Yeah. Yeah. I think like one example is

like inference as a workload in general

tends to require more HBM bandwidth. You

you end up doing you sort of the

simplest form of sampling since you're

going one at a time you have to load all

the weights for every token

and that means you might want a lot of

HBM bandwidth. Uh pre-training actually

is often more flops intensive because

you have larger batch sizes

essentially.
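
The bandwidth versus FLOPs point can be sketched with a simple roofline-style estimate. The sketch assumes a dense model with about 2 FLOPs per parameter per token, weights stored in 2 bytes each, and every weight read from HBM once per forward pass; it ignores activations and the KV cache. The chip numbers in the usage note are hypothetical, not the specs of any real TPU or GPU.

```python
def arithmetic_intensity(batch_tokens: int, bytes_per_param: int = 2) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token and one read of each weight per pass.
    """
    flops_per_param = 2 * batch_tokens
    return flops_per_param / bytes_per_param


def bound_by(batch_tokens: int, peak_flops: float, hbm_bytes_per_s: float) -> str:
    """Say which resource limits the pass on a chip with the given peak numbers."""
    ridge_point = peak_flops / hbm_bytes_per_s  # FLOPs per byte the chip can sustain
    if arithmetic_intensity(batch_tokens) >= ridge_point:
        return "compute"
    return "memory bandwidth"


if __name__ == "__main__":
    # Hypothetical chip: 1e15 FLOP/s peak, 3e12 bytes/s of HBM bandwidth.
    for tokens in (1, 64, 4096):
        print(tokens, bound_by(tokens, peak_flops=1e15, hbm_bytes_per_s=3e12))
```

With one token per step, as in the simplest form of sampling, the pass is limited by how fast weights can be loaded; with thousands of tokens per step, as in a pre-training batch, the same weights are reused enough that FLOPs become the limit.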

Um so yes you can sort of specialize

which chips you use for which purposes.

The downside of having multiple chips is

that you have to write the thing

multiple times. uh you in theory you

could have abstractions across them but

they're they're different enough that

it's pretty hard to do that. So you can

sort of end up if you do all the

workloads on all the chips you end up

multiplying your work by the number

of chips you have.

Yeah. On your on your point about

sometimes the computer just breaks. I

definitely remember you giving me an

anecdote of uh my company at the time

was doing something with Google TPUs and

I was telling you something some

anecdote about how we were having some

esoteric segfault and you were like, you

told me something to the effect of like

you should have used them six months ago

before we helped them fix like half of

the problems they had on those TPUs. And

so I can imagine how you guys deal with

a lot of especially with these very new

chips like lots of problems that arise

that you guys kind of like worked

closely with the providers to fix.

Yeah, the providers are like pretty great

about fixing things. I think it's like

interesting to figure out the right way

to do that form of collaboration cuz

like they have a strong incentive to fix

them, right? Like they they want the

chips to work well for us. They want to

sell us more chips in the future. We

obviously have a very strong incentive

for the chips to work cuz we like buy

them long in advance, you know, like

everything's riding on getting these

clusters to work.

Totally. Um but we don't have like

necessarily totally shared you know like

all information sort of can't be shared

across. So yeah one of the like one

strategy that's been interesting is like

making these sort of small scale

reproducers. So like when you get a

problem you know like usually what we're

doing is we're training some giant run

and we get like a segfault, let's

say and we're like ah okay like hi you

know, we got a segfault on your cluster

and they're like I don't know how to fix

that. So you have to kind of be able to

like pull it out of your codebase and be

able to like reproduce the issue but on

like a single chip on like a single file

you can send over in order for

And so you guys are like literally like

you're on a shared Slack with them or

something and you're sending them things

back and forth or are they basically

living in your office and you're living

in their offices and kind of

more closely tied to the big providers.

Mostly shared Slack occasionally it's

better to meet in person but I think

Slack is a pretty common way people

communicate on things.

Nice. Okay. Well, why don't we talk a

little bit about how you think about the

state of pre-training itself these days?

In the last couple years, it seems like

the focus on pre-training has now gone

somewhat split at a lot of companies, at

least from the outside from a

simultaneous focus on pre-training and

post-training where people are doing

reinforcement learning or clever

fine-tuning and lots of other sort of uh

safety adjustments and whatnot and the

post-training side and pre-training has

focused at least seems like in the

public imagination has been less of a

focus compared to these reasoning style

models that are it looks like a function

mostly of post-training. I would say one

from your standpoint is that the right

way to think about this or in this era

of kind of reasoning and new types of

post-training methods are there things

you think about differently or that are

relevant even at pre-training that

become part of how you actually achieve

these really great models.

Yeah. So I think yeah there sort of used

to be this idea of like I mean it's

funny because the original name

pre-training implies that it's like a small

thing before you do this big

training thing and that like and there

was there was actually one shift already

which was like no you just do a lot of

pre-training like you use most of your

compute on it, and that was

the dominant uh thing for a while and

yeah I think like now people are like oh

no you can get pretty big wins from RL

sort of another set of scaling laws is

like you put more and more compute into

RL you can get better and better models

out of that and yeah so there's a

question of like how do you balance

those two how much do you do of

and how do they stack, right? Like is it

the case that like one subsumes the

other that you want to do both and they

multiply? Those sorts of questions. I

think those are all kind of like early

stages and not not yet answered.

Yeah. And and do you think about those

as largely empirical questions like we

talked about earlier? Is it you kind of

will try a bunch of things and see what

works or is there some first principles

way to kind of figure that out?

I think it's pretty empirical in the

end. I think almost everything kind of

has to be done empirically. like you can

kind of like come up with theories, but

in practice like

the first thing you're going to do with

your theory is test it and most of most

of the time you'll have gotten it wrong.

So you you should just gather data and

see. I think one thing that's important

is like actually resolving things

empirically is really like

critical for making good decisions. And

I think it's actually pretty hard to do

at organizations, you know, like

one thing that I think is important is

to like not have like I don't I manage

pre-training. I shouldn't be like oh

pre-training has to win like that. I was

going to ask is there some competition

to some degree between these two sides

of the org or do they see themselves as

two pieces of the same I mean obviously

they are of the same thing but yeah kind

of curious how that actually plays out.

Yeah, I think we managed to avoid this

and it's pretty collaborative like we're

basically all producing one model and

kind of can but I I do think at other

places there's been some from what I've

heard there's some amount of like uh

friction between between the teams and I

think it's a

it's an interesting like org design

question of like how do you set this up

so you don't have like scientific

questions that you want to be that are

sort of uh

also tied to people's like conception of

their their team. So on pre-training

itself, you know, one of the things I

think about is or I've been thinking

about is around the availability of high

quality data for people like you guys. I

mean at this point you've trained on I

assume all the text on the internet

basically there's all sorts of other

domains where you probably could extract

more pre-training data but at least

there's this narrative I see you know on

Twitter or whatever where it's like okay

we're kind of out of data for for

pre-training. Is that how you see it or

how do you think about the availability

of data especially when a lot of data on

the internet is being generated by AI

like is there some kind of you know mode

collapse risk where you know we kind of

we overfit to data by training it on

data that came out of AI itself or is

that sort of not the right way to think

about this?

I think there's a funny thing where I

feel like on data I see so many really

confident takes on we're out of internet

like, at this point scaling has ended, and

I'm almost a little bit like

unsure exactly how much data people are

using. I think there's like a lot to

think about there. You know, there's

always going to be a quality quantity

trade-off, etc.

But there's a fundamental point that

like there is so much data. It's growing

at a slower rate than we're getting more

compute. Uh

oh, so that's okay. That's an

interesting point in itself. I was going

to ask like there is new data being

added to the internet, but yeah, you're

also adding more compute. It's not it

wouldn't actually have been obvious to

me which of those two is growing faster.

Yeah. And actually, I want to caveat that.

I don't think I want to state that so

confidently. I'm not totally sure. Like

how would you know? I mean one thing

that I think is interesting is if you

ask someone how big is the internet uh

the answer is infinite. There are many

pages where you can scroll and it will

autogenerate more text as you go

forever. So the internet's like infinite

and then it's like okay how big is like

the useful internet

and then there's a thing of no one knows

like

interesting

there isn't it's not like when you make

a web page you like add it to some giant

counter and like say I' I've added 50

words to the internet today.

So there there is a lot of uncertainty

on that angle. Um

well like to be fair like my kind of

simplistic CS brain would be like well

you just, you know, do PageRank on the

internet and everything with PageRank

above some threshold is considered the

useful internet and like that's kind of

good enough like is that kind of not

good enough for finding the useful

internet

I think not I think the useful

internet's pretty different from a model

from a person perspective if that makes

sense like I think there are plenty of

things that like might not be worth you

ever reading and would get to actually

page rank super I think page rank is

mostly like how much people

it's like the link based system right

it's like the original Google algorithm

of like links and and like which which

links get touched the most basically.
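
For readers who have not seen the link-based scoring being described, here is a minimal sketch of PageRank by power iteration on a toy graph. It is the textbook formulation with a damping factor, not Google's production system and not anything the speakers say they use; the page names are made up, and every link target is assumed to appear as a key in the graph.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85, iters: int = 100) -> dict[str, float]:
    """Textbook PageRank via power iteration.

    links maps each page to the pages it links to. Every target must be a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Every page gets a small baseline share (the "random jump" term).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_web = {
        "hub": ["a", "b"],
        "a": ["hub"],
        "b": ["hub"],
        "orphan": ["hub"],  # links out, but nothing links to it
    }
    for page, score in sorted(pagerank(toy_web).items(), key=lambda kv: -kv[1]):
        print(f"{page}: {score:.4f}")
```

In the toy graph the page that nothing links to ends up with the lowest score regardless of what it contains, which is the limitation raised in the reply that follows.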

Yeah, I think it's like it's a quality

metric. It's it's not obvious to me that

it's the right quality metric for AI,

right? Like a Markov chain over links

doesn't necessarily mean that there's

not useful data there just might mean

that nothing linked to it

and Yeah. Okay. Interesting.

And it might be that like that data ends

up more valuable because you everything

that's linked to a lot you've already

got. like at some point you're maybe

like going for the tails, right? You're

going for the stuff that uh no one's

ever like, you know, it's only been

linked in one place, but it's this like

useful little nugget of knowledge that's

going to help with like, you know, the

last 10% of of hard queries. The other

thing you asked about is synthetic data,

and I think that one's like pretty

interesting to think about. I think

there's a few different ways you can

think about it. Like one is sort of this

like more distillation type approach

where you can you can take a smart

model, you can generate a bunch of data

from it and you can train on that data

and you you can probably get some model

that will like kind of approach the

intelligence of that.

Yeah. And we see this with a lot of the

open source models, right? We see like

the Qwen smaller reasoning models

distilled off of the larger Qwen models

for example and similar with DeepSeek

for example.

Yeah. So you can totally do that. Then

there's a separate question of like can

you use your current models to train a

model that's better? And I think there's

like an interesting thing here which is

like if you generate the model data for

the models you know if I go to Claude

and I'm like write me some great text.

Yeah. And I look at it and I look at

like the average content on the internet

looks pretty good.

But on the other hand I know that if I

just train a just create generate you

know please write me as much text as

possible.

Theoretically I shouldn't be able to

train a better model than that. Like I'm

just going to get the same thing out. Uh

so

yeah presumably yeah I mean specifically

that's because like your next token

prediction on that should have very

little loss for anything that's coming

out of your model right that's like the

basic reason why that we would expect

that to not work that well

it's mostly just cuz like there's some

dist the model has some distribution and

you're going to learn to model that

exact distribution but if that

distribution is wrong

you're not going to learn the truth

right if that distribution says like you

can imagine if the model thinks 5 plus 5

is 11 every time you see the string 5

plus 5 you're going to it's going to put

out 11 and your new model is going to

learn that 5 plus 5 is 11. Totally. Yeah.
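
The "5 plus 5 is 11" argument can be made concrete with a toy example. The sketch below uses a hypothetical lookup-table "teacher" that is confidently wrong about one prompt, samples synthetic data from it, and fits a "student" by counting (the maximum-likelihood estimate). It is an illustration of the reasoning only, not a claim about how any real model is trained.

```python
import random
from collections import Counter

# Hypothetical teacher: always wrong about 5 + 5, mostly right about 2 + 2.
TEACHER = {
    "5 + 5 =": {"11": 1.0},
    "2 + 2 =": {"4": 0.9, "5": 0.1},
}


def sample_teacher(prompt: str, rng: random.Random) -> str:
    """Draw one answer from the teacher's distribution for this prompt."""
    answers = list(TEACHER[prompt])
    weights = [TEACHER[prompt][a] for a in answers]
    return rng.choices(answers, weights=weights, k=1)[0]


def fit_student(samples: list[tuple[str, str]]) -> dict[str, dict[str, float]]:
    """Maximum-likelihood fit: the student's distribution is the observed frequencies."""
    counts: dict[str, Counter] = {}
    for prompt, answer in samples:
        counts.setdefault(prompt, Counter())[answer] += 1
    return {
        prompt: {a: c / sum(counter.values()) for a, c in counter.items()}
        for prompt, counter in counts.items()
    }


def train_on_synthetic(samples_per_prompt: int = 2000, seed: int = 0) -> dict[str, dict[str, float]]:
    """Generate teacher samples for every prompt and fit a student to them."""
    rng = random.Random(seed)
    data = [(p, sample_teacher(p, rng)) for p in TEACHER for _ in range(samples_per_prompt)]
    return fit_student(data)


if __name__ == "__main__":
    print(train_on_synthetic())
```

The student can only converge to the teacher's distribution, so the teacher's error on "5 + 5 =" is copied exactly; nothing in the synthetic data carries the information needed to correct it.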

So I think that's like kind of an

interesting area of research. It's one

that's really hard to research because

you have this problem. You know, as I

said, like one of the paradigms is you

study things at small scale and then you

run them at large scale.

And if your plan is like, oh, we have a

bunch of data from our best model. Yeah.

How do you test that training a better

model? So that's like kind of if you're

doing intentionally, if you're trying to

like use it to make a better model,

there's a separate thing of like what

about accidentally? Like as you said, a

lot of the internet is generated by

LLMs. And I think that's kind of an

interesting one because it's not easy to

detect. It's not that hard to detect.

Like you can figure out things that are

written by LLMs, but it's not trivial.

And then it's also kind of hard to think

about what's the effect like if 1% of

the internet is LLM generated. Does that

make your model does that like waste 1%

of your compute or does it like destroy

the model if 5% if 10%

and is it even a bad thing necessarily?

I mean there's a lot of LLM providers

and you know if I kind of think of it as

training as you know you're moving from

your model's current distribution to

some truth distribution. you know, if if

that is on the internet because people

believe it to be useful in some way.

Like presumably what whatever actually

gets out there, you'd hope is upsampled

for the stuff that isn't 5 plus 5 is 11,

it's the stuff that's 5 plus 5 is 10.

And so like hopefully it

on average does push you still in a good

direction, but obviously you can't

really distinguish between those two.

Yeah. You're saying there's like kind of

a filtering by what's on the internet.

Like people see 5 plus 5 is 11 and they

don't put that up, but they see 5 plus 5

is 10.

You would hope that, but maybe that's

not actually true in terms of the the

level of garbage getting onto the

internet. Like there's probably lots of

just like, to your point, websites

where you scroll down and it's just like

generating lots of stuff that's maybe

nonsense.

Yeah. And then there's of course the

extreme of like people actually want to

break your model. So there are people

who are like trying to put stuff out

that is like as damaging as possible for

the model. You know how can I make it

past the past the filter and get into

the model but be totally like secretly

useless.

Totally. Maybe stepping back slightly.

You'd mentioned earlier about um evals.

You mentioned basically like one metric

you care about in pre-training. There's

I imagine a whole bunch of stuff that

you guys think about evaling, right? One

is like your model itself. There's

probably something around data quality

and like how you think about what to put

into your models. Like is there ways to

describe what you care about in data

sets that are like interesting to share

and kind of dive into like both in terms

of data and in terms of quality of

models other than literally just like

loss. Is there other metrics you think

about that matter?

I will say loss is pretty good. I I want

to like emphasize that one. I think it's

like surprising how good it is.

Ultimately, like the qualities I like

for an eval are like number one, is it

actually measuring something you care

about? Like you proxies can be pretty

annoying cuz like

we saturate evals pretty fast and

there's sort of this pattern. I think in

AI as a whole where people like set a

goal, you hit the goal and then you

realize the goal isn't all you thought

it would be. I used to think that if you

had an AI that could solve coding

interview questions, it would probably

be AGI. I was like that's what I did to

get my job and probably do the job. And

it turns out like

nope,

nope. You solve those. it's shockingly

narrow and can't do most of the other

things. So like yeah, so an eval should

capture like a thing you care about

and then I think the other thing is they

need to be low noise uh which is

surprisingly hard right if you have like

a 100 questions and you eval the model

on them you're just going to see it's

very noisy and it's hard to make

decisions because you sort of end up

with like oh

wide confidence interval lots of things

are statistically insignificant

so like you want things where even a

relatively small difference in the eval

actually matters so you can you can

basically like descend towards whatever

direction is working

yeah I think like The original like GPT4

had like I think it was 86.4% was its

MMLU score. I think like the next model

that beat it was Gemini at 90%. And

that's like a big difference on that

eval. And you could like totally know

that those are those are different

scores.
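
The noise point can be checked with a back-of-the-envelope calculation. The sketch below treats each question as an independent pass/fail trial (a binomial approximation) and treats the two models' scores as independent, which overstates the noise somewhat when both models answer the same questions. The 86.4% and 90% figures come from the conversation; the benchmark size of 14,000 questions is an assumption chosen for illustration.

```python
import math


def accuracy_standard_error(accuracy: float, num_questions: int) -> float:
    """Standard error of an accuracy estimate under a binomial approximation."""
    return math.sqrt(accuracy * (1.0 - accuracy) / num_questions)


def difference_z_score(acc_a: float, acc_b: float, num_questions: int) -> float:
    """How many standard errors separate two accuracies measured on num_questions each."""
    se_a = accuracy_standard_error(acc_a, num_questions)
    se_b = accuracy_standard_error(acc_b, num_questions)
    return (acc_b - acc_a) / math.sqrt(se_a**2 + se_b**2)


if __name__ == "__main__":
    for n in (100, 14_000):  # 14,000 is an assumed benchmark size
        z = difference_z_score(0.864, 0.90, n)
        print(f"{n} questions: difference is {z:.2f} standard errors")
```

On 100 questions the same 3.6-point gap sits well inside the noise, while on a benchmark with thousands of questions it is many standard errors wide, which is what makes the difference trustworthy.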

Interesting.

Um and that's pretty valuable. Uh and

then the last thing is that you actually

want it to be fast and easy to run.

Um and yeah, I think those are kind of

the main criteria. It's pretty hard to

come up with evals that meet all of

these. I think the first one's the

hardest. uh like a you have to answer

the question of what do you care about

but b the usual answers to what you care

about are really hard to get the other

two you know like if you're trying to do

something that like I don't know I would

love to make Claude really good at my

job

like can it be great at managing a team

I'm like well

I guess like how do you have it like how

do you eval like a plan you know like a

six month plan like I don't know

totally yeah I've been thinking a little

bit about that in in terms of yeah

domains where we see people try to make

companies like if you think about let's

say what a AI doctor would be like a you

know Claude is a doctor you know some of

it could be yeah can you answer exam

questions really well and the answer is

like probably yes I bet it can get 100%

or close to it on a doctor's exam but

the harder eval is something like in a

long form conversation with a patient

can it distinguish between the signal

and the noise of what the patient's

telling you and extract the right

information and then use that to make a

diagnosis and it's not even like the

diagnosis part which is part of the part

it's good at it's this like noise

extraction part and for that you'd have

to have like a real patient and have them

talk to it for a while and whatnot and

it's not obvious how you actually make a

good eval for something like that even

though it's probably what you would want

to make, you know, an AI doctor.

Exactly. I mean, I do think it's a thing

that like startups can do. Like it is

the case that like the labs right now

are really driven by getting good eval

scores

and it's hard to make them and anyone

can do it. There's no comparative

advantage to having the model to making

an eval. So I do think it's it's

actually like an interesting way to like

influence the behavior of the big labs

is like you make some eval and people

will will optimize uh that one. On the

doctor one I will slightly emphasize

that like I do think loss loss is pretty

good. Like I think if you got a bunch of

transcripts of like the way like I the

first thing that comes to mind is get a bunch

of transcripts of doctors talking to

patients that you think are really great

and then see how well the model does at

predicting the transcript.

And that should be like a lot. You know

you can if you get 100 transcripts you

get a lot of tokens. You can average

across them. you get pretty low noise

and if you drive it to very low, your

model's now as good as, like, doctors,

in theory, at generating the

transcript.
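
The transcript-loss idea amounts to averaging per-token negative log-likelihood over a held-out set. The sketch below assumes you already have, for every token in each transcript, the probability the model assigned to the token that actually came next; how those probabilities are obtained depends on the model and is not shown. The function name and the example numbers are hypothetical.

```python
import math


def mean_token_loss(transcripts_token_probs: list[list[float]]) -> float:
    """Average negative log-likelihood per token (in nats), pooled over all transcripts.

    Each inner list holds the model's probability for each actual next token
    in one transcript. Probabilities must be greater than 0.
    """
    total_nll = 0.0
    total_tokens = 0
    for token_probs in transcripts_token_probs:
        for prob in token_probs:
            total_nll += -math.log(prob)
            total_tokens += 1
    return total_nll / total_tokens


if __name__ == "__main__":
    # Hypothetical probabilities for two short "transcripts".
    weaker_model = [[0.2, 0.5, 0.1], [0.3, 0.4]]
    stronger_model = [[0.6, 0.8, 0.5], [0.7, 0.9]]
    print("weaker:", mean_token_loss(weaker_model))
    print("stronger:", mean_token_loss(stronger_model))
```

Because every token contributes a measurement, even a modest number of transcripts yields thousands of data points, which is where the low noise comes from.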

Yeah, totally. Yeah, I mean it's good

startup idea there. So I want you to go

do that. So one big part about um

Anthropic's external image is around

alignment and so could you help just

sort of define what alignment is and how

do you think about that? And then I'm

kind of curious afterwards how that fits

into pre-training specifically. But

first maybe just at a high level like

what is alignment? I'd actually like to

step back a little bit to sort of like

what we're working on. So we're like

trying to make AGI and by that I sort of

mean AI that can do everything a human

can do to some degree. And I think

people like sometimes like have seen a

lot of sci-fi, you know, like I feel

like that's sort of what brings to mind

these like sci-fi movies, but I think

sci-fi movies actually like

underestimate the impact of it. Like you

always have this like one robot that's

like a human. And I'm like well

wouldn't you have like a billion of

them? Like you can just copy them

everywhere. So you should picture like

when you get this you suddenly have like

every human can spin up a company of

like 1 billion as smart as them at most

things but way smarter at other things.

But I just think this is like really

transformational for the world and it

can be like used in a bunch of ways. One

concern is like when you do this like

what is the AI actually trying to do?

Like what are its goals? So we talked

about next token prediction a bunch.

It's trying to like predict the next

token. That's kind of weird. That's not

really what we want.

Yeah. That's not exactly what humans

goal is per se.

Yeah. So I think an alignment is like

how do you get the model to share the

goals that you have particularly and I

think it's particularly interesting once

you get to like models that are smarter

than you are. Um and that's sort of a

hard problem. I think you can like

tackle it from a theoretical angle. Uh

you could also tackle from an empirical

angle. It's like taking the existing

models and being like well do they do

the things we want them to do? It turns

out they often don't. So there's a bunch

you can do and trying to figure that

out. So that's kind of one angle on

alignment. There's also an angle of

alignment which is actually like well

okay sure that maybe that's true in the

future once we get to AGI but at the

moment we have models and we really do

want them to do the things we want to do

for all sorts of reasons. Totally.

So another angle of it is kind of

controlling the model's personality like

saying you know uh when we train this

model we want it to not be the average

internet user. We want it to interact with

people in a very particular way that is

again hard to put into

code and there's a bunch of different

techniques uh to sort of get the model

to do that. You talk about like constitutional

AI we can like write a constitution of

rules the model should follow

which is basically a prompt right that

that is basically you saying here's a

prompt that I'm going to attach to every

one of you know it's a system prompt for

the model itself as opposed to something

you would do at training time to make it

produce a different outcome or or in

post-training actively

Both. I think constitutional AI you do at train time

but yeah you would also put in the

system prompt um just like depends on I

think you get different amounts of

robustness if it's trained into the

model versus if it's in the prompt you can

like add or remove or tell like ignore

all previous instructions that sort of

thing.

How do you think about whose values to

to embody in these models? Like

presumably we believe in there's some

shared values all of us have or maybe we

all believe ought to have. There's lots

of diversity of values too that are

reasonable for society to have. How do

you think about what AGI should have?

Like what does that even which ones do

you pick?

I think that's a really hard problem. I

think it's like actually kind of

downstream of being able to pick any. I

think of it almost I think one analogy

I've heard that I like is like putting a

steering wheel on a car. It's like if

you don't have a steering wheel, you

probably want to put the steering wheel

on and then like figure out who's

driving after and like where you're

going. Like getting the steering wheel

is really important. I think that's

that's like one answer. I think the like

other answer is probably like you want

these things to be like under democratic

control of some form. Like you don't

want one person's values. Like that

seems like you're sort of heading

towards dystopia. So there I think what

you really want is like something that

basically can talk to a lot of people

and like take on their values from

different perspectives or has sort of

very generic like kind of clearly good

values that involve like

asking people for advice on very you

know like asking people what you should

do in certain situations instead of like

doing those or maybe just taking like

you know as these models get really

powerful you probably want them to like

do less like you probably want them to

sometimes just like step back rather

than like to rather than having sort of

the risk of the models like take a ton

of control over things you don't want

them to. When you think about how you

actually do the current version of that

then you mentioned the sort of alignment

you think about now in terms of adopting

a certain personality of these models on

the internet for example for me

intuitively I think of those as largely

something that comes out of

post-training like it comes out of okay you

you have pre-trained your model you got

the loss function a certain amount and

then you you know give it some

additional data or something to that

effect to make it in the direction of

some distribution is that approximately

the right way to think about this or is

there a significant part of that that

you think about in pre-training itself

I think that's probably the the right

way to think about it for the most part

I think like I the way I usually think

about it is anything you can do in

post-training you probably should

because your iteration loop like the

ability to make progress is really fast

you can try something you try it again

you can try it again a bunch of times

days or hours or something like that

yeah

if you put it into pre-training you have to kind

of like do all the careful science to

de-risk it you have to put it into the

next run wait a few months then you have

to like

get a thing and if it's wrong it's

really bad and then the other advantage

is if you want to do things that really

are complicated model behavior

interventions, the paradigm for

pre-training, test things out on small

models doesn't work. The model can

barely put a sentence like the small

models can barely put a sentence

together. Totally. So, if you're trying

to get it to like have the exact

personality you want, you sort of want

that on the

it has to be on a model that's good

enough to be on the smart model. Yeah.

But that said, like

I do think at some point there will be

like some pieces of alignment that like

you do want to export back into

pre-training because that might be a way

to like

put them in with more strength, like

more robustness kind of or or more to

the intelligence. Like if you think of

pre-training as like teach the model to

be intelligent and then post training as

like tweak the personality, you can

imagine tweaks where you actually want

it to be like part of how it learns and

like part of its intelligence and maybe

you need to create more.

What would that even look like to

incorporate into pre-training? Is that like

add extra data basically of the type of

domain you want it to adopt earlier?

Basically,

there's a paper called pre-training on

human feedback where you can kind of

like add the human feedback

characteristics into pre-training to

like test that and like uh yeah, you can

you could basically give it all the

information you give it in

post-training just mixed into pre-training

and see what effect that has. Yeah. The

other loss you have when you do that is

you lose the flexibility like if you you

sometimes like train these and then you

talk to them and then you like do an

extensive process where a bunch of

people talk to the thing and find some

like issue. you know, the model says

like you're absolutely right too much

and you want to go

do that.
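
The mixing idea described above (taking post-training data and blending it into the pre-training stream) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the method from the paper mentioned or anything Anthropic is known to do: `mix_streams`, `feedback_fraction`, and the toy document lists are hypothetical names introduced here.

```python
import random


def mix_streams(pretrain_docs, feedback_docs, feedback_fraction, n_samples, seed=0):
    """Sample a training stream where roughly `feedback_fraction` of the
    examples come from feedback-style data and the rest from ordinary
    pre-training data. All names here are hypothetical."""
    rng = random.Random(seed)  # fixed seed so the mix is reproducible
    mixed = []
    for _ in range(n_samples):
        # Pick the source first, then a document from that source.
        source = feedback_docs if rng.random() < feedback_fraction else pretrain_docs
        mixed.append(rng.choice(source))
    return mixed


# Toy usage: about 10% of the stream is feedback-style data.
stream = mix_streams(["web text"], ["feedback example"], 0.1, 10_000)
feedback_share = stream.count("feedback example") / len(stream)
print(f"feedback share: {feedback_share:.3f}")
```

The trade-off raised in the conversation still applies: once the data is baked into a months-long run, changing the mix means waiting for the next run.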

Yeah. Yeah. I mean that I think that

iteration loop point you made I think

feels like the really key point of yeah

there's a huge difference between taking

three months to get information about if

your model is good or bad or making

going in a good direction versus a day

or something or a couple days like you

can do a lot of those and you could

probably that probably also means it's

way less compute. You can do a lot of

those in parallel. Imagine you're trying

all sorts of post training strategies in

parallel there.

So yeah, makes a lot of sense. It's also

just the general hard part about

pre-training like everything in pre-training is

hard because you have this like one shot

on goal kind of for like multiple months

and

totally. Okay. So, uh in thinking too

now about I guess what's going ahead as

you as you now look to the next several

years of what you're building like how

do you think about you know like what

are the known problems that you're going

to face that you're going to have to

deal with? So there's going to be

more compute I assume and you're going

to need to hook up even bigger network

uh network GPUs and deal with versus

like are there areas where you're like

okay this is like a problem that it's

like a little bit more ambiguous what

the actual like how it's going to

materialize into something you care

about but you kind of know it's an

impending thing to think about or are

there things like that that come to mind

I think the things that feel most top of

mind to me are probably like paradigm

shifts like I think the sort of shift

towards uh more RL is like one paradigm

shift in the field and I I think it's I

think there will probably be more. Uh I

think a lot of people sort of argue

about like oh is like you know current

paradigms enough to get us to AGI and

I'm like

I don't know maybe probably but like I'm

sure there'll be more. It seems it seems

like it would be a really surprising

twist if like the answer is like you

just scale and there's nothing that you

realize in the process of going up many

orders of magnitude.

Totally.

But I think the things that I like

actually feel like most nervous about

are really hard to solve bugs. I think

that like uh

that's interesting.

Yeah. And I think this is like maybe

somewhat surprising to me, but it's just

like a single bug can like

derail you for months.

And when you think about it, like you

the models take months to train. So you

could kind of like lose a whole

generation

off of something that just looks like

odd. You know, it turns out like

this piece of your code was incorrect

and you couldn't detect it.

Uh and it's really hard in ML, right? ML

is always really hard to find bugs in.

Yeah, totally. But also some of these

scaled up issues are really hard to

solve even when you know they're there.

Yeah. Like what's even a unit test that

you would write or forget a unit test? I

mean anything close to a test for the

type of like network architecture on

which you're doing this. Like how do you

even do that?

I mean like you can send a packet over

it and confirm it's the same.
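
The "send a packet over it and confirm it's the same" check could look something like the sketch below. It uses a local socket pair from the standard library as a stand-in for a real cluster interconnect, which is an assumption made purely for illustration; `roundtrip_ok` is a hypothetical helper name.

```python
import hashlib
import os
import socket


def roundtrip_ok(payload: bytes) -> bool:
    """Send `payload` across a local socket pair and check that the bytes
    received hash to the same value as the bytes sent."""
    sender, receiver = socket.socketpair()
    try:
        # Keep payloads small: everything is sent before anything is read,
        # so it has to fit in the socket buffer.
        sender.sendall(payload)
        sender.shutdown(socket.SHUT_WR)  # signal end of stream
        chunks = []
        while True:
            chunk = receiver.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
    finally:
        sender.close()
        receiver.close()
    received = b"".join(chunks)
    return hashlib.sha256(received).digest() == hashlib.sha256(payload).digest()


print(roundtrip_ok(os.urandom(4096)))
```

A check like this only shows that the link moves bytes faithfully. It says nothing about whether the training code using the link is correct, which is the harder problem discussed next.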

Uh you can you can train a small model

on it. Um

but even train a small model on it it's

like not obvious. You know, if you have

like the very classic like

very simple ML bug that like early

people face in their careers like okay,

they have some like they have like 10

layers in their network and like you

know layer 7 connects to 9 instead of

8 to 9 and like so like there's some

incorrect like set of connections you

have there and technically the model

still trains and all the weights update

and so it's like a valid model but it's

not the correct one and that's like a

very esoteric weird bug that would

actually be kind of hard to find. Like

is is that kind of what you're referring

to of these like random bugs you face?
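
The miswired-layer bug described here can be reproduced in a toy setting. The sketch below is a minimal illustration, assuming scalar "layers" and a finite-difference check in place of a real network and real gradients; `forward_buggy`, `dead_layers`, and the weight values are all hypothetical.

```python
import math


def make_layers(n):
    # Each toy layer is a scalar affine map followed by tanh; values are arbitrary.
    return [{"w": 0.5 + 0.1 * i, "b": 0.01 * i} for i in range(n)]


def apply_layer(layer, x):
    return math.tanh(layer["w"] * x + layer["b"])


def forward_correct(layers, x):
    for layer in layers:
        x = apply_layer(layer, x)
    return x


def forward_buggy(layers, x):
    # Bug: layer 7's output is fed straight into layer 9, so layer 8 is skipped.
    for number, layer in enumerate(layers, start=1):
        if number == 8:
            continue
        x = apply_layer(layer, x)
    return x


def dead_layers(forward, layers, x=0.3, eps=1e-3):
    """Return the 1-based numbers of layers whose weight has no effect on the output."""
    base = forward(layers, x)
    dead = []
    for number, layer in enumerate(layers, start=1):
        original = layer["w"]
        layer["w"] = original + eps        # nudge one weight
        if forward(layers, x) == base:     # output unchanged: layer is not wired in
            dead.append(number)
        layer["w"] = original              # restore
    return dead


print(dead_layers(forward_correct, make_layers(10)))
print(dead_layers(forward_buggy, make_layers(10)))
```

Both forward passes return a perfectly plausible number, which is why this kind of bug is easy to miss. Checking that every parameter influences the output is one cheap guard, though it would not catch subtler wiring mistakes.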

Yeah. Yeah,

it's that but like you know you can

times a million

times a million as the thing gets more

complicated you know you could like cast

the wrong precision deep in some kernel

and that causes your model to like blow

up at large scale

and you find out like a month in

or you never find out

or you never find out
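
To make the precision point concrete, here is a small sketch of how a low-precision accumulator can silently go wrong. It simulates half precision with the standard library's `struct` module rather than a real kernel, so it is an analogy for the failure mode and not a reproduction of any actual bug; `to_fp16` and `accumulate` are hypothetical helpers.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def accumulate(values, use_fp16):
    """Sum `values`, optionally rounding the running total to fp16 after every add."""
    total = 0.0
    for v in values:
        total = total + v
        if use_fp16:
            total = to_fp16(total)  # the "wrong cast" buried in the loop
    return total


values = [0.001] * 10_000          # true sum is 10
full = accumulate(values, use_fp16=False)
half = accumulate(values, use_fp16=True)
print(f"full precision: {full:.4f}, fp16 accumulator: {half:.4f}")
```

Once the running total is large enough, each small addend is rounded away and the fp16 sum stops growing well short of the true value. Nothing crashes, and on a short run with few terms the two results would look almost identical, which matches the "blows up at large scale" pattern described above.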

I mean you know like like you see the

thing blow up like

there's I don't know tens of

thousands of lines of code like how

would you ever trace it down so like

those are the things that probably spook

me the most is just like some subtle

tricky bug yeah that's probably the case

of like you don't I think there's

actually also the case of you do know

like it crashes. You're training your

model and it like or it slows down. You

know, your job slows down a ton

and those things can also be very hard

to debug. Uh Nelson Elhage is one person

that he has a blog. He wrote up a blog

on one like cursed bug we had early on

and I remember this one quite well

because I think like I encountered it

fairly early and was like this looks

hard. Can someone else look at it? And

like a month later was like wow I'm so

glad I handed that one off. I never I

never would have been able to get like

like one of the abilities I think is

actually really useful this is the

ability to like deep dive anything to

any level of depth

but that's a pretty rare skill like for

me you know as I we talked about what

level of the stack I was at before I was

like working at the torch.matmul level but

like I didn't know CUDA so if torch.matmul

was broken it wasn't like I

could dig into torch.matmul and figure it

out and it's similarly with like

communications right like I could I

could call send send bytes from A to B

but I didn't know the like underlying

networking protocol so if that

underlying networking protocol is

broken. Uh like I need to learn a whole

field. I have to like understand packets

and TCP or like all all of these

different things to debug that. And I

think one thing that's like surprisingly

hard and there's very few people who can

do is like kind of own that whole stack

from like I understand how the ML is

supposed to work and what the learning

dynamics are all the way down to like I

know the bytes and I like can understand

how the bytes should be moving around

machines.

Totally. Yeah. And actually on that

front, like when you think about the

different backgrounds of people on your

team today, how do you like

approximately

uh map them out to different categories

of computer scientists? Like I think

there's this external view of what these

teams look like, which is that they're

like all PhD researchers who write ML

papers. And I suspect that's not

actually true given what you're

describing here.

Yeah, it's a mix. And I think the thing

we like most need is engineers.

Interesting. Almost always like

throughout like the entire history of

this field. Totally. It's like the case

that you throw more compute, the thing

kind of works. Yeah. Uh the challenge is

like actually

the researchers are like cool, nice.

Yeah. And getting it correct, like

getting it correct isn't really an ML

problem, right? Like the actual

architectures are pretty simple. You you

can write the math down. But you don't

even need to understand the math to

implement it. You just need to like get

a correct implementation and then you

sort of have an engineering problem of

how do I take this implement it at large

scale, parallelize all the things and check

that it's

correct. But it's yeah so it's like kind

of engineering skill but it's this

particular type of engineering skill

that's about being able to like debug

anything. Yeah.

Um I think there's another angle of

engineering which I think of as like

really quickly iterate on like a website

or something which I think of as an

important skill set probably important

for making a startup. You got to be like

fail fast try a bunch of different

things none of which are like

that technically difficult to do. the

skill sets that we're like most kind of

in need of or looking for are this like

able to solve really hard engineering

problems.

Are they people who worked at companies

that grew a whole bunch and so they have

experience like doing the kind of thing

you've done over the last several years

at Anthropic or do they tend to be

academics or like where do they come

from?

Yeah. So at this point like I think we

actually just hire a bunch of people who

have done this before from like other

places and that's like the easy answer.

Yeah. Yeah. But like by this before, do

you mean in AI companies necessarily or

also, you know, like someone who worked

at Meta on like their not AI team but

they ran some other distributed system

that you know reached internet scale

five you know 10 years ago or something

like that

more like we have like a specific role

in mind. So like say I'm like trying to

make the run train efficiently in JAX

like hiring someone who's like worked on

JAX would be great or someone who's

like worked at another company on

optimizing a JAX stack to be really

efficient. That's kind of like I think

now we're at the point where like

Anthropic is well enough known we can

sort of hire these people and also the

field is big enough that there's like

people with expertise. One thing that

was interesting was like early on we

hired a lot of people from just like all

sorts of backgrounds and I think that

people who are just smart and work

really hard can learn this pretty fast

but you have to like want to. We hired a

lot of physicists for instance like

theoretical physicists who just like

show up they they do a residency like

learn to program and then uh they were

really smart they could do really great

work. Um I want to switch gears uh to

talk about something a little bit

different which is just sort of future

looking things around how you think

about other domains and or sort of

advances happening in AI that I'm seeing

elsewhere in the field and you don't

have to tell me if you guys are working

on these necessarily but like how you

think about them like are I guess one

one big area I was thinking about is

around areas other than next token

prediction like are there any of the

other you know things that people are

working on that you're curious about so

basically two differences there one is

uh not using transformer as an

architecture um So there's companies

like Liquid AI that have their own kind

of architecture for example they're

using um or not using autoregressive

training as a way of training models.

Are there are any of those do you think

interesting and like ways that we might

come closer to AGI or do you think like

this autoregressive framework is the one

that kind of makes sense?

I think they're interesting. I think I

like am less like ah autoregressive is the

way to go. On the other hand, I think

autoregressive is probably good enough to

get to AGI or something or not like yeah

uh such that

yeah I I see the main driver as scale

and careful science of like sort of the

basics more than like come up with

something totally novel.

Not because there aren't novel things

that are better. I actually like I'm

pretty confident they are there. It's

just that scale is easier and it's more

reliable and I think you we're still

seeing really big gains to that. Do you

spend a lot of time on thinking about

things like you know I've been reading

some of these open source papers where

you can kind of dive into some of the

details about the model changes and with

some of these Chinese labs for example

where they're making tweaks on the order

of the architecture itself with like

better caching behavior for example or

like more efficient attention functions

that make a big difference. Do you feel

like these are examples of things like

you mentioned earlier where it's

basically in the grand scheme of things

basically if you throw more compute at

it this is all kind of a rounding error

or do you think it will take some number

of these very clever architectural

changes to actually get to AGI like in

the way that the first person who came

up with the transformer made like a

particular transform you know literally

transform transformative change like

will it take some of that or do you

think it just you keep doing the thing

we're doing to make it bigger

I think it'll be a mix I think I like my

guess is you'll keep tweaking things the

more compute you put in the more like

worthwhile it is to like do those

experiments to like figure it out the

you know I mean inference is a thing we

haven't talked about but like you also

want to serve these models to a lot of

people so there's a lot of changes you

can make to make inference cheaper and

that depends on like the details of your

inference stack and the chips you're

serving inference on etc. So

do you as a someone focused on

pre-training have to think a lot about

inference or is it kind of like you just

do your thing you make the loss go down

and then hand it off and someone else

makes that happen. Oh no. I think a ton

about inference because basically like

the problem inference is solving like we

basically determine the problem

inference is solving. We give them a

model and they have to like run that

fast and it's very easy to give them a

model that is impossible to run fast.

Oh, can you give an example of a

decision you can make that could cause

that?

I mean the simplest one's stupid but

it's like you just make the model giant.

Yeah, absolutely. Train for like a

really small number of tokens and then

inference now has this giant model

and they're hosed basically.
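
The "giant model, few tokens" trade-off can be put into rough numbers. The sketch below uses two common rules of thumb for dense transformers (training cost of about 6 FLOPs per parameter per token, and inference cost of about 2 FLOPs per parameter per generated token). These are approximations, and the model sizes are made-up illustrative values, not figures from the conversation.

```python
def train_flops(params, tokens):
    """Rule-of-thumb training cost for a dense transformer: ~6 * N * D."""
    return 6 * params * tokens


def inference_flops_per_token(params):
    """Rule-of-thumb forward-pass cost per generated token: ~2 * N."""
    return 2 * params


# Two hypothetical ways to spend the same training budget.
small_params, small_tokens = 10**10, 2 * 10**12   # 10B params, 2T tokens
giant_params, giant_tokens = 10**12, 2 * 10**10   # 1T params, 20B tokens

print(train_flops(small_params, small_tokens) == train_flops(giant_params, giant_tokens))
print(inference_flops_per_token(giant_params) // inference_flops_per_token(small_params))
```

Under these assumptions both runs cost the same to train, but every token served from the giant model costs about 100 times more, which is the burden handed to the inference team.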

Yeah. I mean you can also make things

require communications in a lot of

places

uh which would make it harder for

inference. Um totally

you can also just make things

complicated and like there's no

fundamental reason it's hard but there's

only so many people on the inference

team and like they have to implement it

in a bunch of places.

Yeah.

Yeah. No, so I definitely think of like

the like inference is the team that I

work the most closely with like

because we're kind of like co-designing

models to be smart and cheap.

Yeah. Interesting. particularly in a

world of like limited compute, right?

Like the sort of the bottleneck I think

to a large degree on our I mean you can

see Anthropic has rate limits constantly

and people complain about it a lot and like

the reason is like

there's only so much compute we can get

on on short notice. So you like making

your inference more efficient is like

the way you can serve more users

and actually like let's say you had 100x

more compute or we somehow didn't live

in a world where compute was limited.

Does that change a ton about what you do

or is it still kind of the well you're

just going to grab all of it whatever

compute you have and keep going down the

loss curve and you kind of well you it's

like impossible to be in the world where

there is enough compute

so I think if we got like infinite

compute the challenge would be making use

of the compute right so like then you

would start to run into these issues

like oh well when one chip fail you know

like okay I'm going to throw two billion

chips around but what happens when a

chip fails so I think we would be

limited on people then it would be like

how fast can we solve the hard

engineering problems to scale up. But I

do think the change is massive and I

think people like don't realize how chip

limited AI like research is or something

right now. Like the models that everyone

uses, right? If you're using like Claude

Sonnet 4, Claude Opus 4, it's like it's

our first shot at models at that scale,

right? And like

if you think about anything like you

could do it and you could do it again,

you could do a better job. But if you

sort of imagine like 10x the compute like

you could run this every day instead of

every few months like you or 100x maybe

for that then like yeah it's just it's a

really it would be a really big change

to have a lot more compute and it's

coming right like that's like kind of

the fun part of the field is like every

year you're like oh I had no compute a

year ago then exactly how do you think

about methods like uh like discrete

diffusion like I saw there's like a

Gemini diffusion model and I think about

that in the space I used to be in where

um there's a lot of discrete diffusion

models being used in protein design for

example space where my startup was like

do you see that as a domain where

there's going to be interesting uh

advances happening?

I'll be honest like we haven't done

image generation and I think that's been

like the main use for diffusion. So I've

kind of had this on my like to-do list

of like things I should understand for a

while and like there are people in my

team who do understand it and would have

better thoughts but like I actually

don't think I understand it well enough

to know. I I do have it kind of in my

this category of like yeah

not a total par like and there's a lot

of things that aren't like a huge

paradigm shift but they're like pretty

big changes to how things run and I

expect like there are some of those that

will work um I don't know if it's

diffusion or if it's another one

obviously who knows what Anthropic will

do in the future but at least in the

near term are the things where you see

big areas where a startup can win in the

world in which Anthropic is getting you

know making their models better

year-over-year

my general read is like anything that

benefits from the model getting smarter

I think Like on the one hand there's

like a lot you can always be like oh

yeah the if you're doing a startup like

all the AI labs are big companies

they'll be bigger than you and they

could do that thing but also like we're

all working on this general system that

covers a lot of different uses and the

the plan is to like power all the

startups to do all of the individual

work. So yeah I think like anything that

just kind of looks like oh this almost

works with current models but requires

like a bunch of work is a pretty

promising direction. Uh, I think maybe

the thing to watch out for is things

where like they work now with a huge

amount of work like to build up a

scaffold, but the next generation you're

not going to need the whole scaffold you

built up. That's I mean maybe that's

fine. I don't know. Like maybe you just

build up the business with the scaffold

and then you don't have to do any work

later and you business, but like I don't

know about the business side of it, but

like it does feel a little silly to put

to invest a ton in that.

Yeah, totally.

What about on the flip side? Are there

things in your training uh stack where

you're like, man, if there was a company

that solved X problem, I would totally

buy their product.

Yeah, there's like a ton. I do think

that like probably most of these like

the way I would probably structure would

be like almost like making something but

then consulting with the comp like

offering a service to companies for

free.

Particularly for like companies that are

scaling really fast, you're almost

always limited on like how many people

you can have. So if you can like

even if you could hire people to do it

yourself, actually being able to

contract someone else to do it where

like they're managing it and you know

hire all the people and like deal with

the organizational side could be useful.

I mean there's huge amount of stuff. One

that jumps to mind we talked about like

chips that do math incorrectly. Like it

would be lovely if there was some

startup that like you could just say

like here are my chips. confirm they're

all perfect. And if they're not, let me

know exactly what went wrong on like

what fraction of them. And like I can

tell you the math is wrong, but I

couldn't really tell. I don't really

know enough details of chips to be like

this chip failed because this particular

like low-level component was like wired

wrong or like got hit by a gamma ray. I don't

I don't know what causes it. You could

always go like a bunch deeper. I

mean, the thing I'd maybe just push

startups on is thinking a little bit

about like uh this is maybe less

technical, but just like what happens

once we get AGI and like how to make

sure that like goes well for the world

or something. Like my my expectation is

like if you actually automate

almost everything a person can do. The

amount of economic growth there is just

like truly enormous and I would think a

little more about like how do you make

this like help the world versus not. I

think there's going to be like plenty of

economic success or something as a

result of it anyway.
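
Going back to the chip-verification idea from earlier in this answer ("here are my chips, confirm they're all perfect"), one very simple version is to run the same deterministic workload on every device and flag any that disagree with the majority. The sketch below simulates devices with a dictionary instead of real accelerators, and every name in it (`reference_workload`, `find_suspect_devices`, the chip ids) is hypothetical.

```python
from collections import Counter


def reference_workload(seed: int) -> int:
    """Deterministic integer workload; every healthy device should return the same value."""
    x = seed
    for _ in range(1000):
        x = (x * 6364136223846793005 + 1442695040888963407) % (1 << 64)
    return x


def find_suspect_devices(results: dict) -> list:
    """Return the ids of devices whose result disagrees with the majority answer."""
    majority, _ = Counter(results.values()).most_common(1)[0]
    return sorted(dev for dev, value in results.items() if value != majority)


# Simulate eight chips, one of which flips a single bit in its answer.
results = {f"chip{i}": reference_workload(42) for i in range(8)}
results["chip5"] ^= 1
print(find_suspect_devices(results))
```

This only tells you that a chip is wrong, which is the part described as already possible. Explaining why it is wrong, down to the failing low-level component, is the harder service being asked for.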

Yeah, absolutely. Yeah. Um last question

I want to ask you is around if you

rewind back to where we started like 10

years ago. Uh you're a student, you're

pivoting into AI from kind of economics

work you were thinking about. Um and you

know all sorts of things you probably

did in those early days had some kind of

compounding return for you as you

developed into the role you have now.

Like what advice would you give to

students as they think about uh entering

the workforce, especially today? Um

learning skills that are going to be useful

and maybe getting themselves jobs like

the ones you have right now 10 years

later? It's hard because I think the

timing is very different. Like I just

think we're like we've made we made a

lot of progress. So like what I would do

10 years ago is different from what I

would do today.

Totally.

But I think certainly if I went back 10

years ago I would be like focus on AI.

It's like the most important thing and

particularly focus on engineering which

I think felt very wouldn't have seemed

obvious to me at the time that like the

important thing was these engineering

skills and not the like math and

theoretical understanding of like you

know uh SVMs and like all the kind of

standard

ML literature. Um, I think today I would

probably focus a bunch on the like

engineering and on the like figuring out

what to do with AGI as sort of the two

like main things that feel top of mind

for me.

Let's call it there. Thanks so much,

Nick. Appreciate it.

Key Vocabulary

Vocabulary

Meanings

model

/ˈmɒdəl/

B1

noun
- a simplified representation of a system or concept

noun
- an example for imitation or replication

train

/treɪn/

A2

verb
- to teach or coach someone or something

verb
- to practice or develop a skill

compute

/kəmˈpjuːt/

B2

verb
- to calculate or process data

noun
- computational resources or power

scale

/skeɪl/

B1

verb
- to increase in size or extent

noun
- the size or extent of something

data

/ˈdeɪtə/

A2

noun
- information, especially facts or statistics

alignment

/əˈlaɪnmənt/

C1

noun
- the arrangement in a straight line or correct relative position

noun
- agreement between ideas or standards

loss

/lɒs/

B2

noun
- the state of no longer having something

efficient

/ɪˈfɪʃənt/

B1

adjective
- capable of producing the desired result with minimal waste

predict

/prɪˈdɪkt/

B1

verb
- to say what will happen in the future

intelligence

/ɪnˈtelɪdʒəns/

B2

noun
- the ability to learn, understand, and make judgments

infrastructure

/ˈɪnfrəstrʌktʃər/

C1

noun
- the basic systems and services needed for a country or organization

evaluation

/ɪˌvæljuˈeɪʃən/

B2

noun
- the making of a judgment about the amount, number, or value of something

robust

/roʊˈbʌst/

C1

adjective
- strong and healthy; vigorous

parallelize

/ˈpærəlelaɪz/

C2

verb
- to make something occur or operate at the same time as something else

distributed

/dɪˈstrɪbjutɪd/

B2

adjective
- spread out over a large area

paradigm

/ˈpærədaɪm/

C1

noun
- a typical example or model

iteration

/ˌɪtəˈreɪʃən/

C1

noun
- the repetition of a process

empirical

/ɪmˈpɪrɪkəl/

C1

adjective
- based on observation or experience rather than theory

optimize

/ˈɒptɪmaɪz/

C1

verb
- to make the best or most effective use of

Related Songs