[Music] 00:01
Hey guys, I'm thrilled to be joined 00:05
today by Nick Joseph, the head of 00:07
pre-training at Anthropic. To give 00:08
viewers a high-level sense of what we'll be 00:10
covering, we're going to start with the 00:11
basics of what pre-training is and then 00:12
dig into how Nick thinks about strategy, 00:14
data alignment, and infrastructure at 00:16
Anthropic. And by the end, you'll 00:17
hopefully have a sense for how progress 00:18
in AI comes directly from advances in 00:20
pre-training. I would love to talk a 00:22
little bit about your backstory and kind 00:23
of how you got to this point. Where did 00:25
you work before Anthropic? And what were 00:26
your takeaways from those places? Yeah. 00:28
So let's see. I was at Vicarious uh and 00:30
then at OpenAI uh before Anthropic. So 00:32
Vicarious was originally an AGI lab and 00:35
sort of when I joined they were sort of 00:37
making a shift to product particularly 00:38
working on robotics products and the 00:40
thing I worked on was like training uh 00:42
computer vision models for for their 00:44
robotics products. It was my first job. 00:45
So I think I just like learned a ton 00:46
about like how to do machine learning 00:48
models, how to like write machine 00:50
learning infrastructure. 00:51
And at the time were you also thinking 00:53
about a career as an academic? Like at 00:54
the time a lot of people doing AI work 00:56
were in PhDs. That's kind of what I was 00:58
thinking about before I started to do a 00:59
company. Like how were you thinking 01:01
about that in your headspace? 01:02
Yeah. So like I'll actually rewind a 01:03
little bit. I think like a lot of my 01:05
thinking on this had come from an 01:06
internship I did at GiveWell, which is 01:07
like a nonprofit that evaluates 01:09
charities. And some people there being 01:10
like ah we're at some point we might 01:12
have AGI. It could be dangerous. We 01:14
should worry about these risks. This 01:15
could be like a big impact on humanity. 01:16
And I was like not super convinced at 01:18
the time and went down the economics 01:20
route and was going to try to work on 01:21
like directly helping people in poverty. 01:22
that didn't work out for various reasons 01:24
and ended up being like okay I'll at 01:26
least work on AI either like the safety 01:28
thing will turn out to be important I'll 01:30
work on that or it won't be and I'll 01:31
just make cool things with AI that can 01:33
probably help people in poverty more 01:34
I wasn't really coming at it from an 01:36
academic standpoint I was sort of like 01:38
in fact when I switched to that it was 01:40
part of the appeal was that I could like 01:42
immediately go do stuff in AI whereas if 01:43
I want to work in like economic policy 01:45
I'd have to wait 01:47
I don't know six years to do a PhD and 01:48
start and like totally uh it's it's a 01:51
longer path 01:52
and and what did the state of AI safety 01:53
work at that time even look like? Like 01:55
who are the people who were thinking 01:56
about that kind of stuff? I mean there 01:57
were some folks at Vicarious thinking 01:58
about this kind of thing but it was 01:59
fundamentally a robotics company and and 02:00
so yeah how how were you thinking about 02:02
that at the time? 02:04
Yeah. So my sense was like at the time a 02:05
lot of the AI safety discussion was kind 02:06
of theoretical like the models weren't 02:08
actually that good. They weren't really 02:10
posing these dangers. So it was a lot 02:12
more like philosophical like oh at some 02:14
point we might get AI that's really 02:15
smart smarter than humans and like 02:17
should we weigh this like future concern 02:18
how should we compare that to near-term 02:20
things? And I think that was like 02:23
actually a just a less compelling 02:24
argument. I think it was like an 02:26
interesting one and like sort of made 02:27
you think of it. 02:29
So next you went to OpenAI. What was 02:29
OpenAI like at this time? 02:31
Yeah. So I was at OpenAI. I was on one 02:32
of the safety teams and kind of worked 02:34
on uh 02:36
I ended up working on code models 02:37
actually and kind of when I got there I 02:39
could the the first thing I saw was oh 02:41
they'd fine-tuned GPT-3 to write some code 02:43
but and it was really good and I was 02:45
like oh okay if you're worried about AI 02:47
getting really powerful writing its own 02:49
code that seems 02:51
seems like it could self-improve and how 02:52
how likely is that to happen? So it was 02:55
doing a bunch of evaluations and like 02:56
studies of what contributed 02:58
and then after like uh eight months uh 02:59
basically everyone I worked with like 03:02
all all the safety leads left 03:04
which uh yeah invited me to go to 03:06
Anthropic and that was sort of the 03:09
reason I joined OpenAI was because I 03:10
cared about AI safety and wanted to work 03:11
with them. So then I went with them to 03:13
join Anthropic uh pretty much right when 03:15
it started. 03:17
With that why don't we transition a bit 03:17
these days you run the pre-training team 03:19
specifically at Anthropic. Um, obviously 03:21
you've been working on pre-training at 03:23
Anthropic for quite a bit of time and 03:24
I'm sure it's evolved over the years, 03:26
what that even entails and looks like. 03:27
Why don't we start by just talking a 03:29
little bit about what pre-training 03:30
is like? How does it even fit into the 03:31
way of thinking about how AI models have 03:33
developed at a place like Anthropic? And 03:35
what exactly do you guys do? 03:37
We know that one of the ingredients to 03:38
making AI models better is scale. You 03:39
want to put a lot of compute in. And if 03:41
you sort of step back and you're like, 03:43
okay, what's the way we could put the 03:44
most compute into a into a model 03:45
possible? We need some objective that 03:47
there's just like tons of data for. And 03:49
one idea here is like the internet. The 03:51
internet is massive. It's probably the 03:53
biggest like single source of data 03:54
humanity has created. And you don't have 03:56
labels. It's like you don't want someone 03:58
to have to go in and look read the 03:59
entire internet and like say something 04:01
about it. So you want to get labels out 04:02
of the data itself. 04:04
And the idea here is we can take some 04:05
text and we can predict the next word. 04:06
So you take you know "the" as the first 04:08
word you predict the second word then 04:10
you say "the cat" and you predict the word 04:12
after that. And this means you get very 04:13
dense signal. Every every word is like a 04:16
new example. And there's a huge amount 04:18
of data and one of the findings from like 04:20
GPT-1 GPT-2 was kind of as you throw more 04:22
compute at this more data bigger models 04:25
uh you get better you you get smarter 04:27
models essentially. 04:29
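
A minimal sketch of how next-word prediction pulls labels out of the text itself. It assumes whitespace-split words stand in for the tokens a real model would use, and the function name `next_word_examples` is hypothetical:

```python
def next_word_examples(text):
    """Turn raw text into (context, next_word) training pairs."""
    words = text.split()
    # Every position after the first yields one labeled example:
    # the label is simply the word that actually came next.
    return [(words[:i], words[i]) for i in range(1, len(words))]


for context, target in next_word_examples("the cat sat on the mat"):
    print(" ".join(context), "->", target)
```

A six-word sentence gives five training examples with no human labeling, which is the dense signal described here.
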
Totally. 04:30
Um and that's kind of been the central 04:31
thesis of pre-training for the whole 04:32
time. 04:34
Uh there's this idea of scaling laws 04:35
which is that you can actually quantify 04:37
like as you put in more compute more 04:39
more data more parameters you get models 04:41
in a very you get a lower loss a better 04:43
prediction of the next word in a very 04:45
predictable way. And I think you can 04:46
somewhat foresee from that original 04:48
paper and I think like Dario did foresee 04:50
this I think many people did but wasn't 04:51
obvious was that once you have that 04:53
there's this positive feedback loop 04:55
where you can train a model you can use 04:56
it to make something useful and sell 04:58
that and get more money use that to buy 05:00
more compute and then you actually train 05:02
a better model and I we've sort of run 05:04
that cycle over and over again over the 05:06
past 5 years or so. Well, in thinking 05:08
about that objective to begin, you know, 05:10
I think the way I think about the state 05:12
of pre-training is yeah, it seems like 05:14
this next word prediction, at least from 05:15
the external standpoint, seems to be the 05:17
dominant way pre-training happens. But 05:18
if I rewind the clock to that era of 05:20
2017 to 2020 or 2021 and '22 even, there 05:22
was all sorts of pre-training objectives 05:25
people were considering, right? There 05:27
was these uh BERT and BART models that 05:28
were doing masked language modeling. It 05:30
seems like this GPT series of models 05:31
doing uh autoregressive modeling as 05:33
you describe this next word prediction 05:36
seems to be the dominant one that won 05:38
out. Do you have any reflections on that 05:39
time period? Like were you guys trying 05:41
all of them and kind of this one worked 05:42
or or is there some sort of first 05:44
principles reason why this is like the 05:46
right one that should have worked? 05:48
I think the answer is like it's mostly 05:49
empirical like in terms of how to think 05:51
of the things I'd be like yeah it's 05:52
empirical just try them all see what 05:53
works. One big advantage for this 05:54
autoregressive setup is that you can just 05:56
sample from it to generate text 05:57
afterwards in a fairly like 05:59
straightforward way that comes 06:01
like enables a product use very nicely. 06:02
Um like one thing that you want is like 06:05
one characteristic is like a loss where as you 06:07
drive down the loss that actually is the 06:08
thing you care about and you can think 06:11
of it as like if you got to perfect on 06:12
language modeling you now can like write 06:14
text as a human. You can sort of imagine 06:17
you put in the title of a paper and it 06:19
should spit out the entire spit out a 06:20
novel paper. Whereas I think some of the 06:22
other approaches don't quite have that 06:23
uh flavor. 06:25
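
A rough sketch of why an autoregressive model is easy to sample from: predict one word, append it, and feed the result back in. A toy bigram counter stands in for a trained model here; `train_bigram` and `sample` are hypothetical names, not anything from a real library:

```python
import random
from collections import defaultdict


def train_bigram(text):
    """Count which word follows which; a toy stand-in for a trained model."""
    followers = defaultdict(list)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        followers[prev].append(nxt)
    return followers


def sample(followers, prompt, max_new_words, rng):
    """Autoregressive loop: predict one word, append it, repeat."""
    out = prompt.split()
    for _ in range(max_new_words):
        candidates = followers.get(out[-1])
        if not candidates:  # no known continuation, so stop early
            break
        out.append(rng.choice(candidates))
    return " ".join(out)


corpus = "the cat sat on the mat and the cat ran"
model = train_bigram(corpus)
print(sample(model, "the", 5, random.Random(0)))
```

The same loop works no matter how good the underlying predictor is, which is part of what makes the setup convenient for products.
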
Yeah, totally. Yeah, it makes sense that 06:26
in terms of that loop you're describing 06:28
of, you know, then release something 06:30
that gets you revenue and you can use 06:32
that to buy more compute and iterate. 06:33
This sort of gives you the most natural 06:35
way to actually do that flow because you 06:36
can keep releasing new products and keep 06:37
getting the revenue from that to invest 06:39
in more compute and so on. 06:40
Yeah, it certainly gives you the most 06:42
open-ended thing. You could imagine, you 06:43
know, you like train something as a 06:44
class like you train some base thing, 06:45
you fine tune it for a bunch of 06:47
particular tasks. one approach people 06:48
would use. They would like do this big 06:49
pre-training and then they wouldn't just 06:50
like open-endedly sample from it. You'd 06:52
fine tune it on like a hundred specific 06:53
tasks and that could work too. I think 06:55
that like the one sort of general 06:57
intuition I have is like compute is the 06:59
thing that matters. So like I think if 07:01
you throw enough compute at any of these 07:03
objectives, you're going to get 07:04
something that's probably pretty good uh 07:05
and can kind of be fine tuned to other 07:07
things. And it's it's surprising how 07:09
little these details matter compared to 07:11
throwing more compute. When you think 07:13
about actually throwing more compute, 07:15
there's a whole bunch of axes by which 07:16
you could throw compute at it too, 07:18
right? And if you have a specific model 07:19
architecture you're training over, you 07:21
can basically throw more data at that 07:23
specific architecture. For a particular 07:24
one, you could add more layers or make 07:26
the models larger in it. You could do 07:27
some kind of neural architecture search 07:29
over lots of different variants. And I 07:31
assume that these days it's somewhat 07:33
more figured out, you know, which 07:35
architecture you go for. I assume the 07:36
earlier days it was somewhat less so. 07:37
And and I'm curious if you could speak 07:39
to how you guys thought about that. like 07:40
what did your infrastructure even look 07:41
like to do that type of determination? 07:43
I mean, I think the the short answer is 07:46
it's hard, right? Like what you're 07:47
really doing is you're going to train 07:48
this one big expensive model and you 07:49
have a space of, you know, you can sort 07:50
of call all these things 07:53
hyperparameters. You know, how many 07:54
layers do you have? What's your width? 07:55
Like you have the space of hundreds of 07:56
hyperparameters and you want them all to 07:57
be optimal and you're sort of striking 07:59
this balance actually between how much 08:01
do they matter like can you just take 08:03
your best guess and throw more compute 08:05
at it in whatever way you want and 08:07
basically doesn't matter. how much you 08:09
want to get it precisely correct. 08:10
Interesting. 08:11
And I think one of the like interesting 08:12
things is like it actually doesn't 08:13
matter that much. Like we like I think 08:14
this was in one of the early scaling 08:16
laws papers like you can change these 08:17
things and get little wins but like as 08:19
you throw more compute it it sort of 08:21
reliably gets better. If you mess up 08:23
enough you will you will sort of stop 08:25
seeing that happen and you won't have 08:27
any way to know which is one of the 08:28
that's like kind of the hardest part in 08:29
some ways. 08:30
You don't know the counterfactual 08:31
basically because you didn't run it for 08:32
long enough to actually know what it is. 08:33
Yeah. We have these scaling laws. So you 08:35
can sort of say like as you train more 08:37
compute you expect the loss to go down as 08:38
a power law. 08:40
It's really a power law plus constant. 08:41
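
One way to write down "power law plus constant" is loss = a * compute^(-b) + c. The sketch below assumes that form; the coefficients are hypothetical placeholders, as if fitted on small-scale runs, and `off_trend` is an illustrative helper rather than anyone's actual tooling:

```python
def predicted_loss(compute, a, b, c):
    """Power law plus constant: loss = a * compute**(-b) + c."""
    return a * compute ** (-b) + c


def off_trend(compute, observed_loss, a, b, c, tolerance=0.02):
    """True when the observed loss sits above the fitted curve by more
    than `tolerance` (relative), i.e. the run has curved off the trend."""
    expected = predicted_loss(compute, a, b, c)
    return (observed_loss - expected) / expected > tolerance


# Hypothetical coefficients, as if fitted on small-scale runs.
A, B, C = 10.0, 0.05, 1.7

for compute in (1e18, 1e20, 1e22):
    print(compute, round(predicted_loss(compute, A, B, C), 3))
```

A check like this can flag that a run has left the curve, but it cannot say whether the cause is a real limit or a mistuned hyperparameter.
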
So what eventually will happen is you'll 08:42
curve off that power law and then you 08:44
know something is wrong and is it 08:45
fundamental? Is it like you've hit the 08:46
limits of scaling or is it nope you 08:48
should have ch you should have tweaked 08:50
your learning rate slightly differently 08:51
and that's that's sort of one of the 08:53
challenges in terms of how to like 08:55
figure it out. You can the the usual 08:56
paradigm is like test things out at 08:57
small scale before running them at large 08:59
scale and try to find 09:01
small scale in terms of data or in terms 09:02
of something else? uh in terms of 09:04
everything like you kind of want to 09:05
scale things down like proportionally. 09:07
So you want to say like you want you 09:08
want to have some theory for like how 09:09
you're going to scale up like ah okay if 09:11
I get 10 times as many flops how much of 09:13
it goes into layers how much of it goes 09:15
into data how much of it goes into 09:16
attention 09:19
and you sort of get that theory and then 09:20
test that it's optimal a bunch with like 09:23
scaling everything down proportionally 09:25
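
A simplified sketch of a "theory for how you're going to scale up". It collapses the choice to two knobs, parameters and training tokens, and uses the common rule of thumb that training costs about 6 FLOPs per parameter per token. The `param_share` value of 0.5 is a placeholder, not a recommendation, and the split into layers versus attention mentioned here is not modeled:

```python
def training_flops(params, tokens):
    """Common rule of thumb: about 6 FLOPs per parameter per token."""
    return 6 * params * tokens


def scale_up(params, tokens, compute_multiplier, param_share=0.5):
    """Split a compute increase between model size and data.

    `param_share` is the fraction of the increase (in log space) spent on
    parameters; the remainder goes to tokens. The value 0.5 is a placeholder.
    """
    new_params = params * compute_multiplier ** param_share
    new_tokens = tokens * compute_multiplier ** (1 - param_share)
    return new_params, new_tokens


small_params, small_tokens = 1e9, 2e10  # hypothetical small-scale run
big_params, big_tokens = scale_up(small_params, small_tokens, 10)
print(big_params, big_tokens)
print(training_flops(big_params, big_tokens) / training_flops(small_params, small_tokens))
```

Running the same function with a multiplier below 1 scales everything down proportionally, which is the direction used for cheap small-scale tests.
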
and and just so I can think about what 09:27
this actually looks like in those in 09:28
those early days of anthropic you know 09:30
you're a team of like 10 or something 09:31
like that in those very early days or 12 09:32
maybe what actually is your ability to 09:34
use large scale infrastructure as like a 09:36
relatively nimble startup at that time. 09:38
I mean a startup that was well 09:40
capitalized but still not actually that 09:41
many people working at. What kind of 09:43
infrastructure did you have access to to 09:45
train these early models at the time? So 09:47
actually one of the wild things was it 09:49
at least I mean you don't know what 09:50
anyone else is doing of course but it 09:51
kind of felt like we were like at the 09:53
frontier of it and there just weren't 09:54
that many people who cared like I was 09:57
sort of coming you know I was coming at 09:59
it from like we're making AGI this is 10:00
the most important technology ever and 10:01
then would kind of like look around and 10:03
be like and it seems like I'm one of 30 10:04
people who were working on this in like 10:06
the world. I mean I was kind of like 10:08
junior person. Everyone else sort of 10:10
knew how to do this and had done it 10:11
before but I was kind of surprised at 10:12
how easy it was. Um like the public 10:14
estimates for GPT-3 I remember were that 10:17
it cost $5 million to train which you're 10:19
like on the one hand five million is 10:21
kind of a lot but it's like a lot for an 10:22
individual person. It's not really a lot 10:24
from like a company perspective. So we 10:26
could totally buy like compute that was 10:29
enough to train models like that you 10:31
could 10:33
and were you using a cloud provider or 10:33
or did you have a custom setup somewhere 10:35
or did you literally have racks in a 10:36
room somewhere that you were you know 10:38
bought a bunch of Nvidia GPUs and you 10:39
were doing it? uh we're using a cloud 10:40
provider, but I think it's kind of it's 10:42
not actually that different because one 10:43
of the things that was surprising to 10:45
me is you actually have to understand 10:46
the the literal layout. Like uh I 10:48
remember at one point uh one of my 10:51
co-workers running a clustering 10:53
algorithm to identify what rooms all the 10:54
chips were in since we we had a 10:57
hypothesis that they were in different 10:58
rooms and that was causing like or you 11:00
know different buildings, some sort of 11:02
network latency and you can kind of 11:03
figure it out. you could like reverse 11:05
engineer like ah okay yeah there's 11:06
clearly like two clusters here that are 11:07
connected better and there's some issue 11:09
on the connection between them like 11:10
you're we're trying to push the limits 11:12
of of the hardware like as much as 11:13
possible 11:15
um particularly at the beginning when we 11:16
were kind of like we have way less 11:17
funding than everyone else we have to 11:18
and and most people weren't very 11:19
efficient with the compute so we were 11:20
like ah we get a big lead by being 11:22
really efficient at at how how we use 11:24
the compute 11:26
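
The interview does not say which clustering algorithm was used. As an illustration only, the sketch below groups nodes by flood-filling over links whose measured latency is under a threshold (connected components, a simple form of single-linkage clustering). The latency matrix and the threshold are made up:

```python
def latency_groups(latency_us, threshold_us):
    """Group nodes so that pairs with latency below the threshold share a group.

    `latency_us[i][j]` is the measured latency between node i and node j.
    """
    n = len(latency_us)
    group = [None] * n
    next_id = 0
    for start in range(n):
        if group[start] is not None:
            continue
        group[start] = next_id
        stack = [start]
        while stack:  # flood-fill over "fast" links
            i = stack.pop()
            for j in range(n):
                if group[j] is None and latency_us[i][j] < threshold_us:
                    group[j] = next_id
                    stack.append(j)
        next_id += 1
    return group


# Hypothetical measurements: nodes 0-1 and 2-3 are close, cross pairs are slow.
measured = [
    [0, 2, 9, 9],
    [2, 0, 9, 9],
    [9, 9, 0, 2],
    [9, 9, 2, 0],
]
print(latency_groups(measured, threshold_us=5))  # → [0, 0, 1, 1]
```

Two groups falling out of the data is the kind of signal that would suggest two physically separate sets of chips with a slower link between them.
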
could you talk a little bit about some 11:26
of the things you guys did in those 11:27
early days for how to get the most out 11:28
of the hardware I think it's really 11:30
interesting like I think back to the 11:31
days of the early days of Google for 11:32
example where there's the there's these 11:33
cases where they basically bought 11:35
relatively cheap consumer chips and then 11:36
they optimized the software to make it 11:38
so you can actually get the most bang 11:40
for your buck out of them and that's how 11:41
they had all this high latency or low 11:42
latency high availability stuff. I'm 11:44
kind of curious if there's some analog 11:46
in the early AI era to that. 11:48
I think for us it was largely about like 11:50
getting the distributed framework right 11:51
so like we're training on in order to 11:53
train you have to train them on a large 11:54
number of chips 11:56
and there's a bunch of different 11:57
approaches to to how to do this. There's 11:58
like data parallelism and there's 12:00
pipelining there's sharding and like 12:01
getting all of the At the time there 12:03
were no like great open source packages 12:05
you could just grab and use that just 12:06
worked for this. I mean today there's 12:08
somewhat more of these but at the time I 12:09
assume there was literally none. 12:11
There were some like I actually remember 12:12
that we were working on data parallelism 12:13
early on and someone was like and now we 12:16
write the all-reduce. I was like we 12:17
really do this ourselves. don't like 12:19
package and this was kind of like well 12:21
we're going to want to modify it right 12:22
like oh like we don't want to outsource 12:24
this to some package because a we're 12:26
about to go to a bigger scale like 12:28
PyTorch they had a package for 12:30
doing this but we were going to go to a 12:31
bigger scale than Facebook had been to 12:33
and you don't want to have a dependency 12:36
on a package uh that you're going to 12:38
have to be like constantly modifying 12:40
essentially 12:42
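
For readers unfamiliar with the term, an all-reduce leaves every worker holding the same combined result, here the mean gradient. The sketch below simulates that in a single process with plain lists. It is a naive illustration of the idea, not a distributed implementation and not the code discussed in the interview; both function names are hypothetical:

```python
def all_reduce_mean(per_worker_grads):
    """Naive all-reduce: every worker ends up with the element-wise mean.

    `per_worker_grads` is a list with one gradient vector per worker.
    """
    num_workers = len(per_worker_grads)
    summed = [sum(values) for values in zip(*per_worker_grads)]
    mean = [s / num_workers for s in summed]
    # Each worker receives its own copy of the reduced result.
    return [list(mean) for _ in range(num_workers)]


def data_parallel_step(weights, per_worker_grads, lr):
    """Each replica applies the same averaged gradient, so replicas stay in sync."""
    reduced = all_reduce_mean(per_worker_grads)
    return [
        [w - lr * g for w, g in zip(weights, grads)]
        for grads in reduced
    ]


grads = [[1.0, 2.0], [3.0, 4.0]]  # two workers, each saw a different data shard
print(all_reduce_mean(grads))
print(data_parallel_step([1.0, 1.0], grads, lr=0.5))
```

Real implementations do the same reduction over the network, typically in a ring or tree pattern, and that communication pattern is the part worth owning when scale keeps changing.
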
that's it's such a counterintuitive 12:42
sentence there too like we're going to a 12:44
bigger scale than Facebook will because 12:45
at the time Facebook AI research was 12:46
considered one of the best places to do 12:48
machine learning research like FAIR was 12:50
one of the places. FAIR and DeepMind were 12:52
hiring lots of people out of PhD 12:53
programs and doing lots of things like 12:55
what was your headspace when you were 12:57
like okay this this very established lab 12:58
with great people and whatnot we are 13:00
operating on a scale that is not 13:02
relevant to them like was that natural 13:03
and obvious to you or was there times 13:05
where you kind of doubted the decisions 13:07
you were making in that situation 13:08
I think it was surprising I will maybe 13:10
I'm just too arrogant or something I 13:12
kind of looked around and was like what 13:14
are these people doing they're all 13:15
missing the like big picture here like I 13:16
I I think the scaling laws were pretty 13:19
clear like and the arguments against I 13:21
just thought were kind of nonsensical 13:22
like you know the scaling I think the 13:24
original scaling laws paper had like 11 13:26
orders of magnitude and there was like 13:27
this intense debate on whether it would 13:29
continue for like another point and I 13:30
was like 13:33
like it seems it seems like one over 11 13:33
is maybe your chance it fails here and 13:36
then like you know sometimes it doesn't 13:38
work like sometimes it just works 13:39
straightforward you like train the model 13:41
you're like oh yeah of course but yeah I 13:42
do think that it was it maybe felt 13:44
obvious when you're in that headspace 13:47
and you're working on this all the time 13:48
and you're making those plots and I 13:49
think these things feel pretty different 13:51
when you're on the outside. You know, 13:52
there's a huge space of papers. Everyone 13:53
tries to make their paper sound like 13:55
very robust and and important. I I could 13:57
see I could see being like, "Oh yeah, 13:59
this is not really a thing." 14:01
Totally. 14:02
But also different labs had different 14:03
cultures. So like I think one of the 14:04
things at FAIR was it was a very 14:06
more PhD style independent research. 14:08
People have their own ideas, pursue 14:10
those. 14:11
You're fighting for your compute and so 14:12
on. 14:13
Yeah. And to do a project like training 14:14
a large language model requires a lot of 14:15
people to collaborate on like a really 14:17
complicated piece of infrastructure that 14:19
isn't going to be a paper, right? Like 14:21
you're not going to publish like, oh, I 14:22
got a slightly I got 5% more efficiency 14:24
totally 14:26
than the next one. Um, and it's not 14:27
respected in like those cultures 14:29
necessarily. So that might have been 14:30
part of it. 14:31
Okay. Okay. So then when you actually 14:32
implement these these models, you're 14:32
saying you're using a level of low-level 14:34
programming where you know you're using 14:37
libraries like PyTorch, but you're 14:39
perhaps not using everything right out 14:40
of the box from PyTorch because there's 14:41
things you guys want to customize that 14:42
are at the level of basically one level 14:44
of abstraction below them, but not 14:45
necessarily at the level of abstraction 14:47
of you know writing custom CUDA kernels 14:48
or or like was that also in in the space 14:50
where you guys were thinking about? 14:52
So it depends on like the operation. So 14:53
like I think I was mostly operating at 14:54
the level of like torch.matmul you know 14:55
like uh yes where does a matmul go but 14:57
not thinking like how do you make the 14:59
matmul efficient like I assume torch 15:01
figured out how to make a matmul as 15:03
efficient as is possible but there are 15:04
some pieces like attention where there 15:06
was just kind of a lot of different 15:08
variants and attention is really 15:09
complicated and hard to make efficient 15:11
on a GPU and th those things you have to 15:12
kind of go go more levels down on the 15:16
stack. Uh I think there was like a 15:18
process that is maybe interesting that 15:20
I'd never really like thought of before 15:21
of like how to do it which is sort of 15:23
like modeling out the pro the thing 15:25
you're going to do coming up with a 15:26
strategy for how to parallelize it that 15:28
like can get to a really good efficiency 15:29
you know like 15:32
so you're thinking about MFU basically 15:32
like your utilization on your GPU. So 15:34
there's like a goal utilization you're 15:36
trying to get at and a strategy to get 15:37
to there. You're saying 15:39
yeah and I think like one of the things 15:40
you can do is you can actually like 15:41
pencil and paper math out what 15:42
efficiency you're going to be able to 15:43
get to. Right. you know all the 15:44
constraints. MFU is FLOPs 15:45
utilization but like the reason you 15:48
don't get good MFU is you end up limited 15:50
on HBM bandwidth you end up limited on I 15:52
don't know as host to like CPU offload 15:55
there's a bunch of different pieces but 15:58
it but there's not that many pieces 15:59
there's like six relevant numbers there 16:01
so you can totally model it out 16:03
understand what the constraints are and 16:04
then implement something that can get 16:06
there it of course will be really 16:08
inefficient when you implement it and 16:09
then the next step is like pulling out a 16:11
profiler so you want to be able to 16:12
profile the job look how long every 16:13
operation takes. Have a model in your 16:15
mind of how long every operation should 16:17
take and then make those two things the 16:18
same. 16:21
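
The "pencil and paper" estimate can be sketched as a roofline-style bound: an operation cannot finish faster than its arithmetic or its memory traffic allows. The device numbers below are hypothetical, only two of the constraints mentioned (compute and memory bandwidth) are modeled, and the function names are illustrative:

```python
def matmul_time_estimate(m, k, n, peak_flops, mem_bandwidth, bytes_per_elem=2):
    """Lower-bound time for an (m x k) @ (k x n) matmul on one device.

    The op cannot finish faster than its arithmetic or its memory traffic allows.
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, B; write C
    compute_time = flops / peak_flops
    memory_time = bytes_moved / mem_bandwidth
    return max(compute_time, memory_time), flops


def mfu(flops, measured_seconds, peak_flops):
    """FLOPs utilization: useful FLOPs per second over the hardware peak."""
    return flops / measured_seconds / peak_flops


# Hypothetical device: 100 TFLOP/s peak, 1 TB/s memory bandwidth.
PEAK, BW = 100e12, 1e12

best_time, flops = matmul_time_estimate(4096, 4096, 4096, PEAK, BW)
print("best case seconds:", best_time)
# If a profiler reported twice the best-case time, utilization would be 50%.
print("MFU at 2x best case:", mfu(flops, 2 * best_time, PEAK))
```

Comparing an estimate like `best_time` against the per-operation time a profiler reports is one way to "make those two things the same".
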
And and were there good out-of-the-box 16:22
profilers you could use at that time or 16:23
did you guys have you know because 16:24
people weren't operating on the kind of 16:26
network topologies you guys may have 16:27
been using. Did you have to write your 16:29
own profilers basically to do this type 16:30
of you know multi-node optimization? 16:32
Yeah, it depends when. They're actually getting 16:34
better with time. The PyTorch profiler 16:36
was like pretty good actually throughout 16:38
for a single GPU. If you want to like 16:39
profile a GPU, the PyTorch profiler would 16:40
work. But if you wanted to profile a job 16:42
on hundreds, thousands of GPUs, that 16:44
like hadn't really been done much. And 16:47
then that was kind of more of us like 16:49
hacking into the profiler to figure out 16:50
how to combine all the traces together. 16:53
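
The interview gives no detail on how the traces were combined. As a hedged illustration of the general idea, the sketch below merges per-rank event lists into one timeline and reports which rank was slowest on each operation. The event format (dicts with `name`, `start`, `dur`) is invented for this example and is not the PyTorch profiler's export format:

```python
def merge_traces(traces_by_rank):
    """Combine per-rank event lists into one timeline, tagging each event
    with the rank it came from. Events are dicts with name, start, dur."""
    merged = []
    for rank, events in traces_by_rank.items():
        for event in events:
            merged.append({**event, "rank": rank})
    return sorted(merged, key=lambda e: (e["start"], e["rank"]))


def slowest_rank_per_op(merged):
    """For each op name, report which rank spent the longest on it."""
    worst = {}
    for event in merged:
        name = event["name"]
        if name not in worst or event["dur"] > worst[name]["dur"]:
            worst[name] = event
    return {name: e["rank"] for name, e in worst.items()}


# Hypothetical traces from two GPUs; rank 1 is slow in the all-reduce.
traces = {
    0: [{"name": "matmul", "start": 0, "dur": 5}, {"name": "all_reduce", "start": 5, "dur": 2}],
    1: [{"name": "matmul", "start": 0, "dur": 5}, {"name": "all_reduce", "start": 5, "dur": 9}],
}
timeline = merge_traces(traces)
print(slowest_rank_per_op(timeline))
```

A single-GPU profile would show nothing unusual on rank 0 here; only the combined view shows that one rank is holding up the collective.
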
And then one more question on that 16:54
earlier is, you know, you had mentioned, 16:55
you know, you hadn't really done a lot 16:57
of this work before maybe some time at 16:58
OpenAI and those early days in 16:59
Anthropic. How did you actually go learn 17:01
all this stuff? Like what was your 17:02
process for learning about those six 17:04
things that were relevant to bandwidth 17:05
limitations and whatnot? 17:07
I mean, so when I joined Anthropic, one 17:08
really nice thing was there just wasn't 17:10
that much. I think my first day I read 17:11
through our entire uh all all of Slack 17:13
and the entire like internal database 17:16
and learned a bunch from that. Like it 17:19
was kind of nice to just be like 17:21
everything is relevant to me. And then I 17:22
mostly learned from pair programming. 17:25
Like uh Tom Brown had done all this 17:26
before. So he kind of like knew all the 17:28
stuff quite well. Sam McCandlish my manager 17:30
had also done a lot of it before and I 17:32
just like paired with them a huge amount 17:33
at the beginning. And I think one of the 17:35
things I really like about pairing as a 17:37
way of learning is you learn the like 17:39
thing you're trying to do. Like you will 17:40
learn that like if you're pairing 17:42
someone better than you, they can just 17:43
do it. So you're mostly just watching 17:44
them. But you also learn how people do 17:45
it. So something like a pro how to use a 17:46
profiler is not something you would ever 17:48
learn from seeing someone's like final 17:50
write up on Slack for their PR. You 17:52
would just be like, "Oh, they found 17:54
these. They changed this specific line 17:55
and it's a win." They 17:58
like you need to watch like a YouTube 17:59
video for 4 hours of someone messing 18:00
around with a profiler to like maybe 18:03
self teach it or something or to 18:04
actually pair with someone is basically 18:06
the best you can do. 18:08
I think there was like one thing that I 18:09
I think is embarrassing now that I look 18:10
back is I'd never actually used a 18:11
debugger before joining Anthropic. 18:12
People talk about like pdb of like yeah 18:15
that's a thing people use but print 18:16
seems fine for me. 18:18
Then I like watch like oh no a debugger 18:20
is a super useful tool. this person's 18:22
way faster at debugging things 18:23
particularly if it takes a long time to 18:25
start up the code which they can and 18:26
yeah learn learning that sort of thing I 18:29
think comes best from pairing and then 18:30
there's of course the obvious you just 18:32
learn by doing you know I eventually did 18:34
like spin up a profiler and stare at it for 18:35
many many hours 18:37
totally yeah exactly yeah okay so so 18:38
then that was sort of the very early era 18:40
over time obviously pre-training has 18:41
become bigger and bigger as you're 18:44
describing scaling I imagine you're 18:45
using many x more GPUs much more compute 18:46
over time I'd be really curious to hear 18:49
first at a high level What do you feel 18:51
has changed about the pre-training 18:53
strategy that you could talk about? 18:54
Obviously, there's more compute, but 18:56
what does that actually mean to have 18:57
more compute in terms of what you think 18:59
about differently from those early days 19:01
versus now? 19:02
I'm sure the things that haven't changed 19:03
cuz I think it is like shocking how in 19:04
some ways like 19:06
I think I'm still pushing down the exact 19:07
same metric that I was on like day one. 19:09
like there's like some loss function 19:12
loss go down and I think you could like 19:13
look at some like you could probably run 19:15
the original the first model I trained 19:16
on the same metric and just like make a 19:18
plot of like progress of team over over 19:20
time. Uh so that's all the same. I think 19:22
the 19:24
one OKR is like one thing that matters 19:25
basically. Yeah, totally. 19:27
And like I mean talking about like OKRs 19:28
at every sized company you're like oh 18:29
should you do OKRs and it's always felt 19:31
a little bit funny for uh a team like 19:32
where I'm like sure I can just pick a 19:35
loss value but like the answer is like 19:37
as low as possible. We will continue to 19:39
work on that forever. 19:40
I think the biggest things that have 19:41
changed has been a little more 19:43
specialization. Like I think at the 19:44
beginning, I mean the first like 3 or 6 19:45
months I tried to read every PR in the 19:48
codebase and that was great. I knew all 19:49
the pieces etc. And as you grow, it's 19:51
kind of everything gets like a little 19:54
more precise. You know, people really 19:55
dial in exactly how attention should 19:57
work, let's say, or you know, really 19:59
dial in like uh the parallelism 20:01
strategy. and you end up with a team 20:03
where it's a bunch of people who are 20:06
like deep experts on individual things 20:07
which is great because it means you can 20:10
go you can go really deep on those 20:11
things but sometimes you uh at least for 20:12
me as a manager one of the things you 20:15
sometimes have to think about is like 20:16
making sure the bigger picture makes 20:17
sense and also that you have enough 20:18
people who actually do understand the 20:20
whole bigger picture that there's no 20:21
like single point of failure. 20:23
Yeah, it's interesting you you frame it 20:24
in that with that trade-off, right? 20:25
Because as as you were describing that I 20:27
was trying to think, you know, is this a 20:28
bug or a feature? like there there's 20:29
some obvious features of it which is you 20:30
get expertise and you can optimize 20:32
certain things but I imagine your 20:33
ability to take bigger swings becomes 20:36
more complicated if not everyone's 20:39
exactly pointed in the same direction 20:40
like how do you wrestle with that now 20:42
yeah I think I mostly just try to get a 20:44
balance of people I think one of the 20:46
challenges early 20:48
people oh that's interesting 20:48
yeah like I think people really do have 20:49
a preference here has been one of the 20:51
things I've seen like there are people 20:52
who really want to be a generalist and 20:54
understand everything and like lightly 20:55
touch on things. There are people who want to 20:57
like 20:58
pick an area often they've already 20:59
picked that area and they're like deep 21:00
experts in precision. You know they 21:02
started they did a whole PhD in 21:03
precision and just want to think about 21:04
that 21:06
and you want to get some balance of 21:06
that. I think early there was a phase 21:08
where we'd hired a lot of people who are 21:09
more generalist shaped because that's 21:10
what the people who joined totally early 21:12
startup where they go work on everything 21:13
and then 21:14
you ended up with kind of everyone doing 21:15
everything and no one really really 21:17
deeply understanding one thing. uh and 21:20
that's one failure mode but I think if 21:22
you get too many people who are 21:23
specialists 21:24
you end up with a lot of effort has to 21:25
come from the manager from like the lead 21:28
to connect everything 21:29
and to notice something like ah if we 21:31
change the architecture here that would 21:34
make this like efficiency consideration 21:35
over there way easier 21:38
um one of the things I really liked kind 21:39
of like at the very beginning was like 21:41
let's work on efficiency but I could 21:42
just go and like be like ah well what if 21:43
we change the way we do like this 21:45
particular step and we'll be like oh 21:47
yeah that's probably fine like easy 21:48
change and then like you can avoid 21:50
this whole complicated project to make 21:51
this operation that was hard efficient 21:52
because you can make an easier operation 21:54
efficient. 21:56
Okay. Interesting. Yeah. So, as the 21:56
level of compute has also gotten bigger. 21:58
So, I'm I'm sure anyone can imagine, 22:00
okay, there's more GPUs now, you have to 22:02
network them more. Are there some like 22:03
kind of non-obvious challenges that have 22:05
arisen over time where you guys have 22:07
just like banged your head against the 22:10
wall to solve them because of the amount 22:11
of compute you're dealing with that 22:13
people wouldn't otherwise know about 22:15
that like you want to share? I think 22:16
that connecting them is one that's maybe 22:17
interesting and like surprisingly hard. 22:20
Okay. because you really do get more and 22:22
more chips connected and 22:23
like one thing that I think is like the 22:25
the standard way people parallelize chips 22:27
is um the whole thing is one failure 22:29
domain like one chip fails the whole 22:31
thing can crash 22:33
and 22:34
the standard way as in the standard way 22:35
people doing AI or the standard way in 22:36
in other fields where people are doing 22:38
uh in AI for like I mean at least like I 22:40
think at the beginning you know first 22:42
versions of things were this way and 22:44
so it's like you have a 100 GPU cluster 22:46
or whatever is 128 like if one of them 22:47
dies job fails basically 22:49
Yeah, I mean, the simplest thing is if 22:51
you just like distribute your model. So 22:52
say you put like every layer on a 22:53
different chip and you lose like layer 22:55
seven like 22:58
yeah you're not going to like skip layer 22:59
seven. I guess you could but that's like 23:02
a pretty weird model training process 23:04
now and like that leads to some 23:06
interesting things which is like okay so 23:08
now as you scale up you have more and 23:09
more chips and the failure rate can get 23:11
like larger and larger. 23:12
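A minimal sketch of the scaling argument above, assuming every chip fails independently with the same probability in a given time window and that the whole job is one failure domain. The function name and the per-chip probability are hypothetical, chosen only for illustration, not figures from the conversation.

```python
def job_failure_probability(num_chips: int, per_chip_failure_prob: float) -> float:
    """Probability that at least one chip fails in a given window.

    Assumes independent failures and a single failure domain, so one
    failed chip takes down the whole job.
    """
    return 1.0 - (1.0 - per_chip_failure_prob) ** num_chips


# Hypothetical per-chip failure probability for one window.
P_FAIL = 1e-4

for n in (128, 1_000, 10_000):
    print(n, round(job_failure_probability(n, P_FAIL), 4))
```

Under these assumptions the job-level failure probability climbs steadily with chip count even though each chip is no less reliable than before.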
On the other hand you can like I don't 23:13
know you can like restart pretty quickly 23:15
there. There's nothing like you just 23:16
have to like load back in some ways. So 23:17
that was one thing. And then the thing 23:19
was like the level of novelty at the 23:21
whole stack is something that's 23:24
surprising. Like basically 23:25
everything from like how the chips are 23:27
laid out in the data center to the chips 23:29
themselves is pretty new. Like there 23:31
there just haven't been that many 23:33
generations of GPUs. I think one of the 23:34
things that I don't know when I learned 23:36
computer science my code wouldn't work 23:37
and I'd be like oh the computer's 23:39
broken. I think my teacher was like the 23:41
you can trust the computer's not broken 23:42
like you messed up. 23:44
It's you messed up. And I think one of 23:45
the most frustrating things I 23:46
encountered in AI early on was working 23:47
on something and being like, I don't 23:49
know what I'm doing wrong. I'm just 23:51
totally stumped. And uh my manager 23:52
looked at it and was like, uh yeah, 23:54
probably the computer's wrong. 23:56
And I was like, that seems unlikely. And 23:57
sure enough, the computer was wrong. 23:59
Turned out that like the GPU was broken 24:00
and uh we had to pull in a new one. But 24:02
you have to like think like having to 24:06
think about that like the GPU could be 24:07
wrong, the GPU could be slow, like these 24:09
sorts of issues. Uh the power supply in 24:11
the data center could be broken. there's 24:14
so much more like level of depth than 24:15
you like kind of expect to need as a 24:18
Python programmer. 24:21
And just to visualize it like in those 24:22
early days, I assume you guys were using 24:23
the number of GPUs. It's probably on the 24:25
order of tens to hundreds or something 24:26
like that per run. It's probably not 24:28
tens of thousands or hundreds of 24:29
thousands per run or what was the rough 24:30
size you guys were at? Those are very 24:32
early days on the order of thousands. 24:33
Like would they fit in this room? 24:35
Thousands. 24:36
Yeah, thousands. So like you could have 24:36
a bunch of racks and you could fit them 24:38
into like one room. I assume these days 24:39
it's basically like a building for for 24:41
one of these runs. 24:43
Yeah. Now I think it's like you know 24:44
huge huge campuses. At the time it was 24:45
like kind of unclear. It was like oh I 24:47
think like we were like you know do we 24:48
need them all in one room? Can we be 24:49
spread across multiple rooms? Like uh 24:50
and you know we had these theoretical 24:53
models you'd be like we need this much 24:54
bandwidth from point A to point B. But 24:56
you like you never know how far down you 24:57
have to go like oh but like how much 24:59
power do we need? Like what if there's 25:00
like a single capacitor that's like 25:02
handling all of them and we like turn on 25:04
the whole job at once. Like does that 25:05
crash things? 25:06
Yeah. And so do you have to think about 25:08
differences in the different types of 25:10
chips? You guys work with all sorts of 25:11
different cloud providers. From your 25:12
standpoint, are these just sources of 25:14
compute or if you guys are using TPU 25:16
versus GPU, are these, you know, Google 25:18
TPU versus Nvidia GPU? Do you actually 25:21
have to think as an engineer differently 25:23
about what it means to train on these 25:24
two? 25:26
Yeah. So, I mean, fundamentally, they're 25:26
all they're all doing the same thing, 25:28
right? They're all computing the same 25:29
operations, matrix multiplications, 25:31
etc. The way they do it is pretty 25:32
different, and the way that you program 25:34
them is is pretty different. Uh and then 25:35
also the actual specs uh end up pretty 25:38
different. You know, some some might 25:41
have like a lot of flops and not very 25:42
much memory or they might have a lot of 25:44
memory bandwidth but not very much 25:46
memory. So I think a lot of having 25:47
multiple chips is like great in some 25:50
ways. It means you can actually like 25:52
take the job and put it on the chip that 25:53
it works best on and that's 25:54
like are there certain types of jobs 25:56
that would work better on like a TPU 25:58
cluster versus an Nvidia GPU cluster? 26:00
Like how would you talk about that? Oh, 26:02
interesting. Can you talk about that? 26:04
Yeah. Yeah. I think like one example is 26:05
like inference as a workload in general 26:06
tends to require more HBM bandwidth. You 26:09
you end up doing you sort of the 26:11
simplest form of sampling since you're 26:12
going one at a time you have to load all 26:14
the weights for every token 26:15
and that means you might want a lot of 26:17
HBM bandwidth. Uh pre-training actually 26:18
is often more flops intensive because 26:20
you have larger batch sizes 26:22
essentially. 26:24
Um so yes you can sort of specialize 26:24
which chips you use for which purposes. 26:26
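A rough sketch of why one-at-a-time sampling leans on memory bandwidth while large-batch training leans on flops. It assumes a dense layer does 2 FLOPs (one multiply, one add) per parameter per token, that the weights are read from memory once per step, and that they are stored in 2 bytes each. It ignores activations and attention caches, and the function name is hypothetical.

```python
def flops_per_weight_byte(batch_tokens: int, bytes_per_param: int = 2) -> float:
    """FLOPs performed per byte of weights loaded from memory.

    Assumes 2 FLOPs per parameter per token and that every weight is
    read exactly once per step, regardless of batch size.
    """
    return 2.0 * batch_tokens / bytes_per_param


# One token at a time (simplest form of sampling) versus a large batch.
print("sampling, 1 token:", flops_per_weight_byte(1))
print("training, 4096 tokens:", flops_per_weight_byte(4096))
```

The weights are loaded either way, so the larger batch does far more arithmetic per byte moved, which is the sense in which it is more flops intensive.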
The downside of having multiple chips is 26:28
that you have to write the thing 26:30
multiple times. uh you in theory you 26:31
could have abstractions across them but 26:33
they're they're different enough that 26:35
it's pretty hard to do that. So you can 26:36
sort of end up if you do all the 26:38
workloads on all the chips you end up 26:39
multiplying your work by the number 26:40
of chips you have. 26:42
Yeah. On your on your point about 26:43
sometimes the computer just breaks. I 26:44
definitely remember you giving me an 26:46
anecdote of uh my company at the time 26:46
was doing something with Google TPUs and 26:49
I was telling you something some 26:50
anecdote about how we were having some 26:51
esoteric seg error and you were like you 26:53
told me something to the effect of like 26:55
you should have used them six months ago 26:56
before we helped them fix like half of 26:58
the problems they had on those TPUs. And 26:59
so I can imagine how you guys deal with 27:01
a lot of especially with these very new 27:03
chips like lots of problems that arise 27:04
that you guys kind of like worked 27:06
closely with the providers to fix. 27:07
Yeah, the providers are like pretty great 27:09
about fixing things. I think it's like 27:11
interesting to figure out the right way 27:12
to do that form of collaboration cuz 27:13
like they have a strong incentive to fix 27:15
them, right? Like they they want the 27:16
chips to work well for us. They want to 27:17
sell us more chips in the future. We 27:19
obviously have a very strong incentive 27:20
for the chips to work cuz we like buy 27:21
them long in advance, you know, like 27:23
everything's riding on getting these 27:24
clusters to work. 27:25
Totally. Um but we don't have like 27:26
necessarily totally shared you know like 27:28
all information sort of can't be shared 27:30
across. So yeah one of the like one 27:32
strategy that's been interesting is like 27:33
making these sort of small scale 27:34
reproducers. So like when you get a 27:35
problem you know like usually what we're 27:37
doing is we're training some giant run 27:38
and we get like a segfault for let's 27:39
say and we're like ah okay like hi you 27:41
know we got a segfault on your cluster 27:45
and they're like I don't know how to fix 27:47
that. So you have to kind of be able to 27:48
like pull it out of your codebase and be 27:50
able to like reproduce the issue but on 27:51
like a single chip on like a single file 27:52
you can send over in order for 27:54
And so you guys are like literally like 27:56
you're on a shared Slack with them or 27:57
something and you're sending them things 27:59
back and forth or are they basically 28:00
living in your office and you're living 28:01
in their offices and kind of 28:02
more closely tied to the big providers. 28:05
Mostly shared Slack occasionally it's 28:07
better to meet in person but I think 28:09
Slack is a pretty common way people 28:10
communicate on things. 28:12
Nice. Okay. Well, why don't we talk a 28:13
little bit about how you think about the 28:14
state of pre-training itself these days? 28:16
In the last couple years, it seems like 28:17
the focus on pre-training has now gone 28:20
somewhat split at a lot of companies, at 28:22
least from the outside from a 28:24
simultaneous focus on pre-training and 28:25
post-training where people are doing 28:27
reinforcement learning or clever 28:28
fine-tuning and lots of other sort of uh 28:30
safety adjustments and whatnot and the 28:32
post-training side and pre-training has 28:34
focused at least seems like in the 28:35
public imagination has been less of a 28:37
focus compared to these reasoning style 28:38
models that are it looks like a function 28:40
mostly of post-training. I would say one 28:42
from your standpoint is that the right 28:44
way to think about this or in this era 28:45
of kind of reasoning and new types of 28:48
post-training methods are there things 28:49
you think about differently or that are 28:51
relevant even at pre-training that 28:52
become part of how you actually achieve 28:54
these really great models. 28:56
Yeah. So I think yeah there sort of used 28:58
to be this idea of like I mean it's 28:59
funny because the original name 29:00
pre-training implies that it's like a small 29:01
thing and then you're going to do this big 29:04
training thing and that like and there 29:05
was there was actually one shift already 29:07
which was like no you just do a lot of 29:08
pre-training like you use most of your 29:09
compute on it, and that was 29:11
the dominant uh thing for a while and 29:13
yeah I think like now people are like oh 29:16
no you can get pretty big wins from RL 29:17
sort of another set of scaling laws is 29:20
like you put more and more compute into 29:21
RL you can get better and better models 29:22
out of that and yeah so there's a 29:24
question of like how do you balance 29:25
those two how much do you do of each 29:26
and how do they stack, right? Like is it 29:28
the case that like one subsumes the 29:30
other that you want to do both and they 29:32
multiply? Those sorts of questions. I 29:34
think those are all kind of like early 29:36
stages and not not yet answered. 29:37
Yeah. And and do you think about those 29:40
as largely empirical questions like we 29:42
talked about earlier? Is it you kind of 29:44
will try a bunch of things and see what 29:45
works or is there some first principles 29:47
way to kind of figure that out? 29:49
I think it's pretty empirical in the 29:51
end. I think almost everything kind of 29:52
has to be done empirically. like you can 29:54
kind of like come up with theories, but 29:55
in practice like 29:57
the first thing you're going to do with 29:58
your theory is test it and most of most 29:59
of the time you'll have gotten it wrong. 30:02
So you you should just gather data and 30:03
see. I think one thing that's important 30:05
is like actually resolving things 30:07
empirically is really like 30:09
critical for making good decisions. And 30:11
I think it's actually pretty hard to do 30:13
at organizations, you know, like 30:14
one thing that I think is important is 30:16
to like not have like I don't I manage 30:18
pre-training. I shouldn't be like oh 30:20
pre-training has to win like that. I was 30:21
going to ask is there some competition 30:24
to some degree between these two sides 30:25
of the org or do they see themselves as 30:27
two pieces of the same I mean obviously 30:30
they are of the same thing but yeah kind 30:31
of curious how that actually plays out. 30:33
Yeah, I think we managed to avoid this 30:34
and it's pretty collaborative like we're 30:35
basically all producing one model and 30:37
kind of can but I I do think at other 30:39
places there's been some from what I've 30:40
heard there's some amount of like uh 30:42
friction between between the teams and I 30:43
think it's a 30:45
it's an interesting like org design 30:46
question of like how do you set this up 30:48
so you don't have like scientific 30:49
questions that you want to be that are 30:51
sort of uh 30:53
also tied to people's like conception of 30:54
their their team. So on pre-training 30:57
itself, you know, one of the things I 30:59
think about is or I've been thinking 31:00
about is around the availability of high 31:01
quality data for people like you guys. I 31:03
mean at this point you've trained on I 31:04
assume all the text on the internet 31:06
basically there's all sorts of other 31:07
domains where you probably could extract 31:09
more pre-training data but at least 31:10
there's this narrative I see you know on 31:11
Twitter or whatever where it's like okay 31:13
we're kind of out of data for for 31:14
pre-training. Is that how you see it or 31:16
how do you think about the availability 31:18
of data especially when a lot of data on 31:20
the internet is being generated by AI 31:21
like is there some kind of you know mode 31:22
collapse risk where you know we kind of 31:25
we overfit to data by training it on 31:27
data that came out of AI itself or is 31:29
that sort of not the right way to think 31:32
about this? 31:33
I think there's a funny thing where I 31:33
feel like on data I see so many really 31:35
confident takes on we're out of internet 31:36
like at this point scaling has ended and 31:38
I'm almost a little bit like 31:40
unsure exactly how much data people are 31:42
using. I think there's like a lot to 31:45
think about there. You know, there's 31:47
always going to be a quality quantity 31:48
trade-off, etc. 31:50
But there's a fundamental point that 31:51
like there is so much data. It's growing 31:52
at a slower rate than we're getting more 31:54
compute. Uh 31:57
oh, so that's okay. That's an 31:58
interesting point in itself. I was going 31:59
to ask like there is new data being 32:00
added to the internet, but yeah, you're 32:01
also adding more compute. It's not it 32:03
wouldn't actually have been obvious to 32:04
me which of those two is growing faster. 32:05
Yeah. And actually, I want to caveat that. 32:07
I don't think I want to state that so 32:09
confidently. I'm not totally sure. Like 32:10
how would you know? I mean one thing 32:12
that I think is interesting is if you 32:13
ask someone how big is the internet uh 32:14
the answer is infinite. There are many 32:17
pages where you can scroll and it will 32:19
autogenerate more text as you go 32:20
forever. So the internet's like infinite 32:22
and then it's like okay how big is like 32:24
the useful internet 32:25
and then there's a thing of no one knows 32:27
like 32:29
interesting 32:30
there isn't it's not like when you make 32:30
a web page you like add it to some giant 32:32
counter and like say I've added 50 32:34
words to the internet today. 32:37
So there there is a lot of uncertainty 32:39
on that angle. Um 32:40
well like to be fair like my kind of 32:42
simplistic CS brain would be like well 32:44
you just you know do PageRank on the 32:46
internet and everything with PageRank 32:47
above some threshold is considered the 32:49
useful internet and like that's kind of 32:50
good enough like is that kind of not 32:51
good enough for finding the useful 32:53
internet 32:55
I think not I think the useful 32:55
internet's pretty different from a model 32:57
versus from a person perspective if that makes 32:58
sense like I think there are plenty of 33:00
things that like might not be worth you 33:01
ever reading and would get to actually 33:04
PageRank super I think PageRank is 33:06
mostly like how much people 33:08
it's like the link based system right 33:09
it's like the original Google algorithm 33:10
of like links and and like which which 33:12
links get touched the most basically. 33:14
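A small sketch of the link-based idea being described: a simplified PageRank by power iteration, followed by a threshold cut. The toy graph, the page names, and the threshold value are all hypothetical, and this is an illustration rather than a description of any production ranking system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a dict of page -> outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: "nugget" links out, but nothing links to it.
links = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub"],
    "nugget": ["hub"],
}
ranks = pagerank(links)
THRESHOLD = 0.1  # hypothetical cutoff for the "useful internet"
useful = sorted(p for p, r in ranks.items() if r >= THRESHOLD)
print(ranks)
print(useful)
```

In this toy graph the page that nothing links to ends up with the lowest score and falls below the cutoff, whatever its content is worth.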
Yeah, I think it's like it's a quality 33:15
metric. It's it's not obvious to me that 33:17
it's the right quality metric for AI, 33:19
right? Like a lack of links 33:22
doesn't necessarily mean that there's 33:23
not useful data there just might mean 33:25
that nothing linked to it 33:27
and Yeah. Okay. Interesting. 33:28
And it might be that like that data ends 33:30
up more valuable because you everything 33:31
that's linked to a lot you've already 33:33
got. like at some point you're maybe 33:34
like going for the tails, right? You're 33:36
going for the stuff that uh no one's 33:37
ever like, you know, it's only been 33:39
linked in one place, but it's this like 33:40
useful little nugget of knowledge that's 33:42
going to help with like, you know, the 33:44
last 10% of of hard queries. The other 33:46
thing you asked about is synthetic data, 33:48
and I think that one's like pretty 33:50
interesting to think about. I think 33:52
there's a few different ways you can 33:53
think about it. Like one is sort of this 33:55
like more distillation type approach 33:56
where you can you can take a smart 33:58
model, you can generate a bunch of data 33:59
from it and you can train on that data 34:01
and you you can probably get some model 34:03
that will like kind of approach the 34:04
intelligence of that. 34:05
Yeah. And we see this with a lot of the 34:06
open source models, right? We see like 34:07
the Qwen smaller reasoning models 34:08
distilled off of the larger Qwen models 34:11
for example and similar with DeepSeek 34:12
for example. 34:14
Yeah. So you can totally do that. Then 34:15
there's a separate question of like can 34:16
you use your current models to train a 34:18
model that's better? And I think there's 34:21
like an interesting thing here which is 34:22
like if you generate the model data for 34:24
the models you know if I go to Claude 34:26
and I'm like write me some great text. 34:28
Yeah. And I look at it and I look at 34:30
like the average content on the internet 34:31
looks pretty good. 34:33
But on the other hand I know that if I 34:34
just train a just create generate you 34:37
know please write me as much text as 34:38
possible. 34:41
Theoretically I shouldn't be able to 34:41
train a better model than that. Like I'm 34:43
just going to get the same thing out. Uh 34:44
so 34:46
yeah presumably yeah I mean specifically 34:47
that's because like your next token 34:49
prediction on that should have very 34:50
little loss for anything that's coming 34:51
out of your model right that's like the 34:52
basic reason why that we would expect 34:54
that to not work that well 34:55
it's mostly just cuz like there's some 34:56
dist the model has some distribution and 34:58
you're going to learn to model that 34:59
exact distribution but if that 35:00
distribution is wrong 35:02
you're not going to learn the truth 35:03
right if that distribution says like you 35:05
can imagine if the model thinks 5 plus 5 35:07
is 11 every time you see the string 5 35:09
plus 5 you're going to it's going to put 35:11
out 11 and your new model is going to 35:12
learn that 5 plus 5 is 11. Totally. Yeah. 35:14
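A toy sketch of the point about distributions, assuming a hypothetical flawed teacher that answers "5 + 5 =" with "11" most of the time and a student fit by maximum likelihood, which for a single categorical answer is just the empirical frequency. The 90% figure and all names are made up for illustration.

```python
import random
from collections import Counter


def sample_teacher(rng, n):
    """Hypothetical flawed teacher: answers '11' with probability 0.9, else '10'."""
    return ["11" if rng.random() < 0.9 else "10" for _ in range(n)]


def fit_student(samples):
    """Maximum-likelihood categorical fit: the empirical answer frequencies."""
    counts = Counter(samples)
    total = len(samples)
    return {answer: count / total for answer, count in counts.items()}


rng = random.Random(0)  # fixed seed so the sketch is repeatable
student = fit_student(sample_teacher(rng, 10_000))
print(student)
print("student's preferred answer:", max(student, key=student.get))
```

The student converges on the teacher's distribution, mistake included. Nothing in the training signal points it toward 10.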
So I think that's like kind of an 35:16
interesting area of research. It's one 35:17
that's really hard to research because 35:19
you have this problem. You know, as I 35:21
said, like one of the paradigms is you 35:22
study things at small scale and then you 35:24
run them at large scale. 35:25
And if your plan is like, oh, we have a 35:27
bunch of data from our best model. Yeah. 35:29
How do you test that training a better 35:31
model? So that's like kind of if you're 35:34
doing intentionally, if you're trying to 35:35
like use it to make a better model, 35:36
there's a separate thing of like what 35:38
about accidentally? Like as you said, a 35:39
lot of the internet is generated by 35:41
LLMs. And I think that's kind of an 35:42
interesting one because it's not easy to 35:44
detect. It's not that hard to detect. 35:46
Like you can figure out things that are 35:47
written by LLMs, but it's not trivial. 35:49
And then it's also kind of hard to think 35:51
about what's the effect like if 1% of 35:53
the internet is LLM generated. Does that 35:55
make your model does that like waste 1% 35:57
of your compute or does it like destroy 35:59
the model if 5% if 10% 36:00
and is it even a bad thing necessarily? 36:02
I mean there's a lot of LLM providers 36:03
and you know if I kind of think of it as 36:05
training as you know you're moving from 36:07
your model's current distribution to 36:08
some truth distribution. you know, if if 36:09
that is on the internet because people 36:11
believe it to be useful in some way. 36:14
Like presumably what whatever actually 36:16
gets out there, you'd hope is upsampled 36:17
for the stuff that isn't 5 plus 5 is 11, 36:19
it's the stuff that's 5 plus 5 is 10. 36:21
And so like hopefully it 36:22
on average does push you still in a good 36:24
direction, but obviously you can't 36:26
really distinguish between those two. 36:27
Yeah. You're saying there's like kind of 36:29
a filtering by what's on the internet. 36:30
Like people see 5 plus 5 is 11 and they 36:31
don't put that up, but they see 5 plus 5 36:33
is 10. 36:34
You would hope that, but maybe that's 36:35
not actually true in terms of the the 36:36
level of garbage getting onto the 36:38
internet. Like there's probably lots of 36:40
just like to your point jet white sites 36:41
where you scroll down and it's just like 36:42
generating lots of stuff that's maybe 36:44
nonsense. 36:45
Yeah. And then there's of course the 36:46
extreme of like people actually want to 36:47
break your model. So there are people 36:49
who are like trying to put stuff out 36:50
that is like as damaging as possible for 36:51
the model. You know how can I make it 36:54
past the past the filter and get into 36:55
the model but be totally like secretly 36:57
useless. 36:59
Totally. Maybe stepping back slightly. 36:59
You'd mentioned earlier about um evals. 37:01
You mentioned basically like one metric 37:03
you care about in pre-training. There's 37:04
I imagine a whole bunch of stuff that 37:07
you guys think about evaling, right? One 37:08
is like your model itself. There's 37:10
probably something around data quality 37:12
and like how you think about what to put 37:14
into your models. Like is there ways to 37:15
describe what you care about in data 37:18
sets that are like interesting to share 37:20
and kind of dive into like both in terms 37:22
of data and in terms of quality of 37:24
models other than literally just like 37:25
loss. Is there other metrics you think 37:27
about that matter? 37:28
I will say loss is pretty good. I I want 37:29
to like emphasize that one. I think it's 37:31
like surprising how good it is. 37:33
Ultimately, like the qualities I like 37:35
for an eval are like number one, is it 37:37
actually measuring something you care 37:39
about? Like you proxies can be pretty 37:40
annoying cuz like 37:42
we saturate evals pretty fast and 37:43
there's sort of this pattern. I think in 37:45
AI as a whole where people like set a 37:46
goal, you hit the goal and then you 37:48
realize the goal isn't all you thought 37:49
it would be. I used to think that if you 37:51
had an AI that could solve coding 37:53
interview questions, it would probably 37:54
be AGI. I was like that's what I did to 37:55
get my job and probably do the job. And 37:57
it turns out like 37:58
nope, 37:59
nope. You solve those. it's shockingly 38:00
narrow and can't do most of the other 38:02
things. So like yeah so evals should 38:03
capture like a thing you you care about 38:06
and then I think the other thing is they 38:08
need to be low noise uh which is 38:09
surprisingly hard right if you have like 38:12
a 100 questions and you eval the model 38:14
on them you're just going to see it's 38:16
very noisy and it's hard to make 38:18
decisions because you sort of end up 38:19
with like oh 38:20
wide confidence interval lots of things 38:22
are statistically insignificant 38:23
so like you want things where even a 38:25
relatively small difference in the eval 38:26
actually matters so you can you can 38:28
basically like descend towards whatever 38:29
direction is working 38:31
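A quick sketch of why a 100-question eval is noisy, using the normal approximation to the binomial for a measured accuracy. It assumes questions are independent and equally weighted. The 80% accuracy and the question counts are hypothetical.

```python
import math


def accuracy_ci_half_width(accuracy: float, num_questions: int, z: float = 1.96) -> float:
    """Approximate 95% confidence half-width for a measured accuracy.

    Uses the normal approximation to the binomial; z = 1.96 corresponds
    to a two-sided 95% interval.
    """
    return z * math.sqrt(accuracy * (1.0 - accuracy) / num_questions)


for n in (100, 10_000):
    half_width = accuracy_ci_half_width(0.8, n)
    print(f"{n} questions: 80% +/- {100 * half_width:.2f} points")
```

At 100 questions the interval is close to plus or minus 8 points, so a 2-point improvement is indistinguishable from noise. At 10,000 questions it shrinks by a factor of 10.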
yeah I think like The original like GPT-4 38:33
had like I think it was 86.4% was its 38:35
MMLU score. I think like the next model 38:37
that beat it was Gemini at 90%. And 38:39
that's like a big difference on that 38:42
eval. And you could like totally know 38:43
that those are those are different 38:45
scores. 38:47
Interesting. 38:47
Um and that's pretty valuable. Uh and 38:48
then the last thing is that you actually 38:49
want it to be fast and easy to run. 38:51
Um and yeah, I think those are kind of 38:53
the main criteria. It's pretty hard to 38:55
come up with evals that meet all of 38:57
these. I think the first one's the 39:00
hardest. uh like a you have to answer 39:02
the question of what do you care about 39:04
but b the usual answers to what you care 39:06
about are really hard to get the other 39:08
two you know like if you're trying to do 39:10
something that like I don't know I would 39:11
love to make Claude really good at my 39:13
job 39:14
like can it be great at managing a team 39:15
I'm like well 39:17
I guess like how do you have it like how 39:18
do you eval like a plan you know like a 39:21
six month plan like I don't know 39:23
totally yeah I've been thinking a little 39:25
bit about that in in terms of yeah 39:26
domains where we see people try to make 39:28
companies like if you think about let's 39:29
say what an AI doctor would be like a you 39:30
know Claude is a doctor you know some of 39:33
it could be yeah can you answer exam 39:34
questions really well and the answer is 39:36
like probably yes I bet it can get 100% 39:37
or close to it on a doctor's exam but 39:39
the harder eval is something like in a 39:43
long form conversation with a patient 39:45
can it distinguish between the signal 39:48
and the noise of what the patient's 39:50
telling you and extract the right 39:51
information and then use that to make a 39:52
diagnosis and it's not even like the 39:54
diagnosis part which is part of the part 39:55
it's good at it's this like noise 39:57
extraction part and for that you'd have 39:58
to have like a real patient and have them 39:59
talk to it for a while and whatnot and 40:01
it's not obvious how you actually make a 40:03
good eval for something like that even 40:06
though it's probably what you would want 40:08
to make, you know, an AI doctor. 40:09
Exactly. I mean, I do think it's a thing 40:10
that like startups can do. Like it is 40:12
the case that like the labs right now 40:14
are really driven by getting good eval 40:15
scores 40:18
and it's hard to make them and anyone 40:18
can do it. There's no comparative 40:21
advantage to having the model to making 40:22
an eval. So I do think it's it's 40:23
actually like an interesting way to like 40:25
influence the behavior of the big labs 40:26
is like you make some eval and people 40:28
will will optimize uh that one. On the 40:30
doctor one I will slightly emphasize 40:32
that like I do think loss loss is pretty 40:34
good. Like I think if you got a bunch of 40:36
transcripts of like the way like I the 40:37
first thing that comes to mind is get a bunch 40:39
of transcripts of doctors talking to 40:40
patients that you think are really great 40:43
and then see how well the model does at 40:45
predicting the transcript. 40:47
And that should be like a lot. You know 40:48
you can if you get 100 transcripts you 40:49
get a lot of tokens. You can average 40:51
across them. you get pretty low noise 40:52
and if you drive it to very low your 40:54
model's now as good as like doctors 40:56
in theory or at generating the 40:58
transcript. 41:00
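A minimal sketch of the transcript idea as a loss-based eval: average the per-token negative log-likelihood over a set of held-out transcripts. It assumes you already have, for each transcript, the probability the model assigned to each actual next token. The function name and the probabilities in the example are made up.

```python
import math


def mean_token_loss(transcripts_token_probs):
    """Average negative log-likelihood per token across all transcripts.

    Each transcript is a list of probabilities the model assigned to the
    token that actually came next. Lower is better.
    """
    total_loss = 0.0
    total_tokens = 0
    for token_probs in transcripts_token_probs:
        for prob in token_probs:
            total_loss += -math.log(prob)
            total_tokens += 1
    return total_loss / total_tokens


# Hypothetical per-token probabilities for two short transcripts.
example = [
    [0.9, 0.6, 0.8, 0.7],
    [0.5, 0.95, 0.4],
]
print(round(mean_token_loss(example), 4))
```

Because the average runs over every token rather than over a handful of pass/fail questions, even a modest number of transcripts gives a lot of samples, which is where the low noise comes from.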
Yeah, totally. Yeah, I mean it's good 41:01
startup idea there. So I want you to go 41:03
do that. So one big part about 41:04
Anthropic's external image is around 41:06
alignment and so could you help just 41:08
sort of define what alignment is and how 41:11
do you think about that? And then I'm 41:13
kind of curious afterwards how that fits 41:15
into pre-training specifically. But 41:16
first maybe just at a high level like 41:17
what is alignment? I'll actually like 41:19
step back a little bit to sort of like 41:20
what we're working on. So we're like 41:22
trying to make AGI and by that I sort of 41:23
mean AI that can do everything a human 41:25
can do to some degree. And I think 41:28
people like sometimes like have seen a 41:30
lot of sci-fi, you know, like I feel 41:32
like that's sort of what brings to mind 41:33
these like sci-fi movies, but I think 41:33
sci-fi movies actually like 41:34
underestimate the impact of it. Like you 41:35
always have this like one robot that's 41:37
like a human. And I'm like well 41:38
wouldn't you have like a billion of 41:40
them? Like you can just copy them 41:41
everywhere. So you should picture like 41:42
when you get this you suddenly have like 41:44
every human can spin up a company of 41:46
like 1 billion as smart as them at most 41:48
things but way smarter at other things. 41:50
But I just think this is like really 41:52
transformational for the world and it 41:53
can be like used in a bunch of ways. One 41:54
concern is like when you do this like 41:57
what is the AI actually trying to do? 41:59
Like what are its goals? So we talked 42:01
about next token prediction a bunch. 42:02
It's trying to like predict the next 42:03
token. That's kind of weird. That's not 42:06
really what we want. 42:07
Yeah. That's not exactly what a human's 42:08
goal is per se. 42:10
Yeah. So I think alignment is like 42:11
how do you get the model to share the 42:12
goals that you have particularly and I 42:13
think it's particularly interesting once 42:15
you get to like models that are smarter 42:16
than you are. Um and that's sort of a 42:17
hard problem. I think you can like 42:20
tackle it from a theoretical angle. Uh 42:21
you could also tackle from an empirical 42:23
angle. It's like taking the existing 42:24
models and being like well do they do 42:26
the things we want them to do? It turns 42:27
out they often don't. So there's a bunch 42:29
you can do and trying to figure that 42:30
out. So that's kind of one angle on 42:31
alignment. There's also an angle of 42:33
alignment which is actually like well 42:34
okay sure that maybe that's true in the 42:36
future once we get to AGI but at the 42:38
moment we have models and we really do 42:39
want them to do the things we want them to do 42:41
for all sorts of reasons. Totally. 42:42
So another angle of it is kind of 42:43
controlling the model's personality like 42:44
saying you know uh when we train this 42:46
model we want it to not be the average 42:48
internet user. We want it to interact with 42:50
people in a very particular way that is 42:51
again hard to put into 42:52
code and there's a bunch of different 42:54
techniques uh to sort of get the model 42:57
to do that. We talk about like constitutional 42:59
AI: we can like write a constitution of 43:00
rules the model should follow 43:02
which is basically a prompt right that 43:03
that is basically you saying here's a 43:05
prompt that I'm going to attach to every 43:06
one of you know it's a system prompt for 43:08
the model itself as opposed to something 43:09
you would do at training time to make it 43:12
produce a different outcome or or in 43:14
post-training actively 43:15
Both. I think constitutional AI you do at train time 43:16
but yeah you would also put in the 43:19
system prompt um just like depends on I 43:19
think you get different amounts of 43:22
robustness if it's trained into the 43:23
model versus if it's in the prompt, which you can 43:24
like add or remove or tell like ignore 43:26
all previous instructions that sort of 43:28
thing. 43:29
How do you think about whose values 43:29
to embody in these models? Like 43:32
presumably we believe in there's some 43:33
shared values all of us have or maybe we 43:35
all believe ought to have. There's lots 43:38
of diversity of values too that are 43:39
reasonable for society to have. How do 43:41
you think about what AGI should have? 43:43
Like what does that even which ones do 43:45
you pick? 43:47
I think that's a really hard problem. I 43:47
think it's like actually kind of 43:49
downstream of being able to pick any. I 43:50
think of it almost I think one analogy 43:52
I've heard that I like is like putting a 43:54
steering wheel on a car. It's like if 43:55
you don't have a steering wheel, you 43:56
probably want to put the steering wheel 43:57
on and then like figure out who's 43:58
driving after and like where you're 44:00
going. Like getting the steering wheel 44:02
is really important. I think that's 44:03
that's like one answer. I think the like 44:05
other answer is probably like you want 44:06
these things to be like under democratic 44:08
control of some form. Like you don't 44:11
want one person's values. Like that 44:13
seems like you're sort of heading 44:15
towards dystopia. So there I think what 44:16
you really want is like something that 44:18
basically can talk to a lot of people 44:21
and like take on their values from 44:22
different perspectives or has sort of 44:24
very generic like kind of clearly good 44:26
values that involve like 44:29
asking people for advice on very you 44:32
know like asking people what you should 44:33
do in certain situations instead of like 44:34
doing those or maybe just taking like 44:36
you know as these models get really 44:38
powerful you probably want them to like 44:39
do less like you probably want them to 44:41
sometimes just like step back rather 44:42
than like to rather than having sort of 44:44
the risk of the models like take a ton 44:46
of control over things you don't want 44:47
them to. When you think about how you 44:48
actually do the current version of that 44:50
then you mentioned the sort of alignment 44:51
you think about now in terms of adopting 44:53
a certain personality of these models on 44:55
the internet for example for me 44:56
intuitively I think of those as largely 44:58
something that comes out of 45:00
post-training, like it comes out of okay you 45:01
you have pre-trained your model you got 45:03
the loss function a certain amount and 45:04
then you you know give it some 45:05
additional data or something to that 45:07
effect to make it in the direction of 45:08
some distribution is that approximately 45:10
the right way to think about this or is 45:12
there a significant part of that that 45:13
you think about in pre-training itself 45:14
I think that's probably the the right 45:16
way to think about it for the most part 45:18
I think like I the way I usually think 45:19
about it is anything you can do in post 45:20
training you probably should 45:21
because your iteration loop like the 45:23
ability to make progress is really fast 45:25
you can try something you try it again 45:27
you can try it again a bunch of times 45:28
days or hours or something like that 45:30
yeah 45:31
If you put it into pre-training, you have to kind 45:32
of like do all the careful science to 45:33
de-risk it, you have to put it into the 45:34
next run wait a few months then you have 45:35
to like 45:36
get a thing and if it's wrong it's 45:38
really bad and then the other advantage 45:40
is if you want to do things that really 45:42
are complicated model behavior 45:44
interventions, the paradigm from 45:46
pre-training, test things out on small 45:48
models, doesn't work. The model can 45:50
barely put a sentence like the small 45:51
models can barely put a sentence 45:52
together. Totally. So, if you're trying 45:53
to get it to like have the exact 45:54
personality you want, you sort of want 45:57
that on the 45:58
it has to be on a model that's good 45:59
enough to be on the smart model. Yeah. 46:00
But that said, like 46:02
I do think at some point there will be 46:04
like some pieces of alignment that like 46:05
you do want to export back into 46:07
pre-training because that might be a way 46:08
to like 46:10
put them in with more strength, like 46:11
more robustness kind of or or more to 46:13
the intelligence. Like if you think of 46:15
pre-training as like teach the model to 46:17
be intelligent and then post training as 46:18
like tweak the personality, you can 46:20
imagine tweaks where you actually want 46:22
it to be like part of how it learns and 46:23
like part of its intelligence and maybe 46:25
you need to create more. 46:27
What would that even look like to 46:28
incorporate into pre-training? Is that like 46:29
add extra data basically of the type of 46:30
domain you want it to adopt earlier? 46:33
Basically, 46:35
there's a paper called pre-training on 46:36
human feedback where you can kind of 46:37
like add the human feedback 46:39
characteristics into pre-training to 46:40
like test that and like uh yeah, you can 46:42
you could basically give it all the 46:45
information you give it in 46:46
post-training just mixed into pre-training 46:47
and see what effect that has. Yeah. The 46:49
other loss you have when you do that is 46:51
you lose the flexibility like if you you 46:52
sometimes like train these and then you 46:54
talk to them and then you like do an 46:56
extensive process where a bunch of 46:58
people talk to the thing and find some 47:00
like issue. you know, the model says 47:01
like "you're absolutely right" too much 47:02
and you want to go 47:04
do that. 47:06
Yeah. Yeah. I mean that I think that 47:07
iteration loop point you made I think 47:08
feels like the really key point of yeah 47:10
there's a huge difference between taking 47:12
three months to get information about if 47:14
your model is good or bad or making 47:17
going in a good direction versus a day 47:19
or something or a couple days like you 47:21
can do a lot of those and you could 47:22
probably that probably also means it's 47:23
way less compute. You can do a lot of 47:25
those in parallel. Imagine you're trying 47:26
all sorts of post training strategies in 47:27
parallel there. 47:28
So yeah, makes a lot of sense. It's also 47:30
just the general hard part about 47:31
pre-training. Like everything in pre-training is 47:32
hard because you have this like one shot 47:33
on goal kind of for like multiple months 47:34
and 47:36
totally. Okay. So, uh in thinking too 47:36
now about I guess what's going ahead as 47:39
you as you now look to the next several 47:42
years of what you're building like how 47:44
do you think about you know like what 47:45
are the known problems that you're going 47:47
to face that you're going to have to 47:50
deal with? So there's going to be 47:51
more compute I assume and you're going 47:52
to need to hook up even bigger networks 47:54
of GPUs and deal with that, versus 47:56
like are there areas where you're like 47:58
okay this is like a problem that it's 48:00
like a little bit more ambiguous what 48:02
the actual like how it's going to 48:03
materialize into something you care 48:05
about but you kind of know it's an 48:06
impending thing to think about or are 48:07
there things like that that come to mind 48:08
I think the things that feel most top of 48:10
mind to me are probably like paradigm 48:11
shifts like I think the sort of shift 48:14
towards uh more RL is like one paradigm 48:16
shift in the field and I I think it's I 48:19
think there will probably be more. Uh I 48:21
think a lot of people sort of argue 48:23
about like oh is like you know current 48:24
paradigms enough to get us to AGI and 48:25
I'm like 48:27
I don't know maybe probably but like I'm 48:28
sure there'll be more. It seems it seems 48:30
like it would be a really surprising 48:32
twist if like the answer is like you 48:34
just scale and there's nothing that you 48:37
realize in the process of going up many 48:39
orders of magnitude. 48:40
Totally. 48:42
But I think the things that I like 48:42
actually feel like most nervous about 48:43
are really hard to solve bugs. I think 48:45
that like uh 48:48
that's interesting. 48:50
Yeah. And I think this is like maybe 48:51
somewhat surprising to me, but it's just 48:52
like a single bug can like 48:54
derail you for months. 48:56
And when you think about it, like you 48:58
the models take months to train. So you 48:59
could kind of like lose a whole 49:01
generation 49:02
off of something that just looks like 49:03
odd. You know, it turns out like 49:06
this piece of your code was incorrect 49:08
and you couldn't detect it. 49:11
Uh and it's really hard in ML, right? ML 49:12
is always really hard to find bugs in. 49:14
Yeah, totally. But also some of these 49:15
scaled up issues are really hard to 49:17
solve even when you know they're there. 49:18
Yeah. Like what's even a unit test that 49:20
you would write or forget a unit test? I 49:21
mean anything close to a test for the 49:23
type of like network architecture on 49:26
which you're doing this. Like how do you 49:27
even do that? 49:29
I mean like you can send a packet over 49:30
it and confirm it's the same. 49:32
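A minimal sketch of the "send a packet and confirm it's the same" check, assuming a local socket pair as a stand-in for the real interconnect. On an actual training cluster the transport would be whatever communication library the job uses; the function name here is hypothetical.

```python
import hashlib
import os
import socket

def roundtrip_matches(payload: bytes) -> bool:
    """Send payload over a local socket pair and compare SHA-256 digests."""
    sender, receiver = socket.socketpair()
    try:
        sender.sendall(payload)
        sender.shutdown(socket.SHUT_WR)  # signal end of stream to the receiver
        chunks = []
        while True:
            chunk = receiver.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
    finally:
        sender.close()
        receiver.close()
    received = b"".join(chunks)
    return hashlib.sha256(received).digest() == hashlib.sha256(payload).digest()

# Small random payload so the single-threaded send fits in the socket buffer.
ok = roundtrip_matches(os.urandom(4096))
print("payload intact:", ok)
```

A check like this only shows that bytes arrive unchanged. It says nothing about whether the right bytes were sent at the right step, which is part of why such bugs are hard to test for.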
Uh you can you can train a small model 49:34
on it. Um 49:36
but even train a small model on it it's 49:37
like not obvious. You know, if you have 49:38
like the very classic, like 49:40
very simple ML bug that like early 49:41
people face in their careers like okay, 49:44
they have some like they have like 10 49:45
layers in their network and like you 49:46
know layer 7 connects to nine instead of 49:48
8 to 9 and like so like there's some 49:50
incorrect like set of connections you 49:52
have there and technically the model 49:53
still trains and all the weights update 49:55
and so it's like a valid model but it's 49:56
not the correct one and that's like a 49:58
very esoteric weird bug that would 50:00
actually be kind of hard to find. Like 50:01
is that kind of what you're referring 50:03
to of these like random bugs you face? 50:05
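A toy sketch of the mis-wired-layer bug described above, assuming a stack of ten scalar "layers" (a weight followed by tanh). Everything here is hypothetical and deliberately tiny. The buggy forward pass skips the layer at index 8 and still returns a perfectly valid number, so nothing crashes. One hedged way to catch it is a perturbation test: nudge each layer's weight and check that the output moves.

```python
import math

def make_weights(n=10):
    # Hypothetical toy parameters: one scalar weight per layer.
    return [0.5 + 0.1 * i for i in range(n)]

def forward_correct(weights, x):
    for w in weights:
        x = math.tanh(w * x)
    return x

def forward_buggy(weights, x):
    # Bug: the output of layer 7 is fed straight into layer 9,
    # so layer 8 never touches the data.
    for i, w in enumerate(weights):
        if i == 8:
            continue
        x = math.tanh(w * x)
    return x

def dead_layers(forward, weights, x=0.3, eps=1e-3):
    """Return indices of layers whose weight has no effect on the output."""
    base = forward(weights, x)
    dead = []
    for i in range(len(weights)):
        bumped = list(weights)
        bumped[i] += eps
        if forward(bumped, x) == base:
            dead.append(i)
    return dead

weights = make_weights()
print("correct wiring, dead layers:", dead_layers(forward_correct, weights))
print("buggy wiring, dead layers:  ", dead_layers(forward_buggy, weights))
```

In a real network the same idea would be applied per parameter group, and it only catches wiring mistakes that make a component inert, not ones that leave it connected in the wrong place.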
Yeah. Yeah, 50:06
it's that but like you know you can 50:07
times a million 50:09
times a million as the thing gets more 50:10
complicated you know you could like cast 50:11
the wrong precision deep in some kernel 50:13
and that causes your model to like blow 50:16
up at large scale 50:18
and you find out like a month in 50:19
or you never find out 50:20
or you never find out 50:21
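A small illustration of how a wrong precision cast can silently change results. This is not the bug being discussed, only a sketch under stated assumptions: it uses Python's `struct` module to round a running sum to IEEE half precision (`"e"`) or single precision (`"f"`) after every addition, standing in for an accumulator held in the wrong dtype inside a kernel.

```python
import struct

def round_to(fmt, x):
    # Round a Python float to the given IEEE format and back.
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

def accumulate(fmt, value, steps):
    total = 0.0
    for _ in range(steps):
        total = round_to(fmt, total + value)
    return total

half = accumulate("e", 1.0, 4096)    # accumulator kept in float16
single = accumulate("f", 1.0, 4096)  # accumulator kept in float32

print("float16 accumulator:", half)
print("float32 accumulator:", single)
```

The half-precision sum stops growing once adding 1.0 falls below the spacing between representable values, and no error is raised. Nothing in the output shape or type hints that anything went wrong.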
I mean you know like like you see the 50:22
thing blow up like 50:24
there's I don't know 10 tens of 50:25
thousands of lines of code like how 50:26
would you ever trace it down so like 50:28
those are the things that probably spook 50:29
me the most is just like some subtle 50:31
tricky bug yeah that's probably the case 50:34
of like you don't I think there's 50:36
actually also the case of you do know 50:37
like it crashes. You're training your 50:38
model and it like or it slows down. You 50:41
know, your job slows down a ton 50:43
and those things can also be very hard 50:45
to debug. Uh Nelson Elhaj is one person 50:48
that he has a blog. He wrote up a blog 50:51
on one like cursed bug we had early on 50:53
and I remember this one quite well 50:55
because I think like I encountered it 50:56
fairly early and was like this looks 50:58
hard. Can someone else look at it? And 50:59
like a month later was like wow I'm so 51:01
glad I handed that one off. I never I 51:03
never would have been able to get like 51:04
like one of the abilities I think is 51:06
actually really useful here is the 51:08
ability to like deep dive anything to 51:09
any level of depth 51:10
but that's a pretty rare skill like for 51:12
me you know as I we talked about what 51:13
level of the stack I was at before I was 51:15
like working at the torch.matmul level, but 51:16
like I didn't know CUDA, so if torch.matmul 51:18
was broken it wasn't like I 51:20
could dig into torch.matmul and figure it 51:21
out and it's similarly with like 51:23
communications right like I could I 51:26
could call send send bytes from A to B 51:28
but I didn't know the like underlying 51:30
networking protocol so if that 51:32
underlying networking protocol is 51:33
broken. Uh like I need to learn a whole 51:35
field. I have to like understand packets 51:37
and TCP or like all all of these 51:39
different things to debug that. And I 51:41
think one thing that's like surprisingly 51:43
hard and there's very few people who can 51:44
do is like kind of own that whole stack 51:46
from like I understand how the ML is 51:49
supposed to work and what the learning 51:50
dynamics are all the way down to like I 51:51
know the bytes and I like can understand 51:54
how the bytes should be moving around 51:56
machines. 51:58
Totally. Yeah. And actually on that 51:58
front, like when you think about the 52:00
different backgrounds of people on your 52:01
team today, how do you like 52:02
approximately 52:04
uh map them out to different categories 52:06
of computer scientists? Like I think 52:08
there's this external view of what these 52:09
teams look like, which is that they're 52:10
like all PhD researchers who write ML 52:12
papers. And I suspect that's not 52:14
actually true given what you're 52:16
describing here. 52:17
Yeah, it's a mix. And I think the thing 52:18
we like most need is engineers. 52:19
Interesting. Almost always like 52:21
throughout like the entire history of 52:23
this field. Totally. It's like the case 52:24
that you throw more compute, the thing 52:26
kind of works. Yeah. Uh the challenge is 52:28
like actually 52:30
the researchers are like cool, nice. 52:31
Yeah. And getting it correct, like 52:32
getting it correct isn't really an ML 52:34
problem, right? Like the actual 52:35
architectures are pretty simple. You you 52:37
can write the math down. But you don't 52:39
even need to understand the math to 52:40
implement it. You just need to like get 52:41
a correct implementation and then you 52:43
sort of have an engineering problem of 52:45
how do I take this implement it at large 52:47
scale, parallelize all the things and check 52:48
that it's 52:50
correct. But it's yeah so it's like kind 52:51
of engineering skill but it's this 52:52
particular type of engineering skill 52:54
that's about being able to like debug 52:55
anything. Yeah. 52:56
Um I think there's another angle of 52:57
engineering which I think of as like 52:59
really quickly iterate on like a website 53:01
or something which I think of as an 53:03
important skill set probably important 53:05
for making a startup. You got to be like 53:06
fail fast try a bunch of different 53:07
things none of which are like 53:09
that technically difficult to do. the 53:10
skill sets that we're like most kind of 53:13
in need of or looking for are this like 53:15
able to solve really hard engineering 53:18
problems. 53:20
Are they people who worked at companies 53:21
that grew a whole bunch and so they have 53:24
experience like doing the kind of thing 53:27
you've done over the last several years 53:29
at Anthropic or do they tend to be 53:30
academics or like where do they come 53:32
from? 53:33
Yeah. So at this point like I think we 53:34
actually just hire a bunch of people who 53:35
have done this before from like other 53:36
places and that's like the easy answer. 53:38
Yeah. Yeah. But like by this before, do 53:40
you mean in AI companies necessarily or 53:42
also, you know, like someone who worked 53:44
at Meta on like their not AI team but 53:45
they ran some other distributed system 53:48
that you know reached internet scale 53:49
five you know 10 years ago or something 53:51
like that 53:53
more like we have like a specific role 53:53
in mind. So like say I'm like trying to 53:54
make the run train efficiently in JAX, 53:55
like hiring someone who's like worked on 53:58
JAX would be great or someone who's 53:59
like worked at another company on 54:01
optimizing a JAX stack to be really 54:03
efficient. That's kind of like I think 54:04
now we're at the point where like 54:06
Anthropic is well enough known we can 54:08
sort of hire these people and also the 54:09
field is big enough that there's like 54:11
people with expertise. One thing that 54:12
was interesting was like early on we 54:14
hired a lot of people from just like all 54:15
sorts of backgrounds and I think that 54:16
people who are just smart and work 54:19
really hard can learn this pretty fast 54:20
but you have to like want to. We hired a 54:22
lot of physicists for instance like 54:23
theoretical physicists who just like 54:25
show up they they do a residency like 54:27
learn to program and then uh they were 54:29
really smart they could do really great 54:31
work. Um I want to switch gears uh to 54:33
talk about something a little bit 54:36
different which is just sort of future 54:37
looking things around how you think 54:38
about other domains and or sort of 54:39
advances happening in AI that I'm seeing 54:42
elsewhere in the field and you don't 54:43
have to tell me if you guys are working 54:45
on these necessarily but like how you 54:46
think about them like are I guess one 54:48
one big area I was thinking about is 54:50
around areas other than next token 54:51
prediction like are there any of the 54:54
other you know things that people are 54:55
working on that you're curious about so 54:57
basically two differences there one is 54:58
uh not using transformer as an 55:00
architecture um So there's companies 55:02
like Liquid AI that have their own kind 55:04
of architecture for example they're 55:05
using um or not using autoregressive 55:06
training as a way of training models. 55:09
Are there are any of those do you think 55:11
interesting and like ways that we might 55:13
come closer to AGI or do you think like 55:15
this autoregressive framework is the one 55:16
that kind of makes sense? 55:18
I think they're interesting. I think I 55:19
like am less like ah autoregressive is the 55:20
way to go. On the other hand, I think 55:23
autoregressive is probably good enough to 55:24
get to AGI or something or not like yeah 55:26
uh such that 55:29
yeah I I see the main driver as scale 55:31
and careful science of like sort of the 55:34
basics more than like come up with 55:36
something totally novel. 55:38
Not because there aren't novel things 55:40
that are better. I actually like I'm 55:41
pretty confident they are there. It's 55:42
just that scale is easier and it's more 55:44
reliable and I think you we're still 55:45
seeing really big gains to that. Do you 55:48
spend a lot of time on thinking about 55:50
things like you know I've been reading 55:51
some of these open source papers where 55:52
you can kind of dive into some of the 55:53
details about the model changes and with 55:54
some of these Chinese labs for example 55:56
where they're making tweaks on the order 55:58
of the architecture itself with like 56:00
better caching behavior for example or 56:02
like more efficient attention functions 56:04
that make a big difference. Do you feel 56:06
like these are examples of things like 56:07
you mentioned earlier where it's 56:09
basically in the grand scheme of things 56:10
basically if you throw more compute at 56:11
it this is all kind of a rounding error 56:13
or do you think it will take some number 56:14
of these very clever architectural 56:16
changes to actually get to AGI, like in 56:18
the way that the first person who came 56:19
up with the transformer made like a 56:21
particular transform you know literally 56:23
transform transformative change like 56:24
will it take some of that or do you 56:26
think it just you keep doing the thing 56:28
we're doing to make it bigger 56:29
I think it'll be a mix I think I like my 56:30
guess is you'll keep tweaking things the 56:32
more compute you put in the more like 56:34
worthwhile it is to like do those 56:35
experiments to like figure it out the 56:38
you know I mean inference is a thing we 56:40
haven't talked about but like you also 56:41
want to serve these models to a lot of 56:43
people so there's a lot of changes you 56:44
can make to make inference cheaper and 56:46
that depends on like the details of your 56:47
inference stack and the chips you're 56:49
serving inference on etc. So 56:50
do you as a someone focused on 56:52
pre-training have to think a lot about 56:53
inference or is it kind of like you just 56:54
do your thing you make the loss go down 56:56
and then hand it off and someone else 56:57
makes that happen. Oh no. I think a ton 56:59
about inference because basically like 57:00
the problem inference is solving like we 57:02
basically determine the problem 57:04
inference is solving. We give them a 57:05
model and they have to like run that 57:06
fast and it's very easy to give them a 57:08
model that is impossible to run fast. 57:10
Oh, can you give an example of a 57:12
decision you can make that could cause 57:13
that? 57:14
I mean the simplest one's stupid but 57:15
it's like you just make the model giant. 57:16
Yeah, absolutely. Train for like a 57:18
really small number of tokens and then 57:20
inference now has this giant model 57:22
and they're hosed, basically. 57:24
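A rough back-of-the-envelope sketch of why a giant model trained on few tokens is painful for inference. It assumes the common rules of thumb for dense transformers (about 6 FLOPs per parameter per training token, about 2 FLOPs per parameter per generated token) and ignores attention cost, memory bandwidth, and batching. The two configurations are made-up numbers, not anything from the conversation.

```python
def training_flops(params: int, tokens: int) -> int:
    # Rule of thumb: about 6 FLOPs per parameter per training token.
    return 6 * params * tokens

def inference_flops_per_token(params: int) -> int:
    # Rule of thumb: about 2 FLOPs per parameter per generated token.
    return 2 * params

# Hypothetical configurations that cost the same to train.
giant = {"params": 10**12, "tokens": 10**11}
balanced = {"params": 10**11, "tokens": 10**12}

for name, cfg in (("giant", giant), ("balanced", balanced)):
    print(
        name,
        "train FLOPs:", training_flops(**cfg),
        "inference FLOPs/token:", inference_flops_per_token(cfg["params"]),
    )

cost_ratio = (
    inference_flops_per_token(giant["params"])
    // inference_flops_per_token(balanced["params"])
)
print("giant model costs", cost_ratio, "times more per served token")
```

The training bill is identical, but every token served from the giant model costs ten times as much under these assumptions, which is the kind of decision pre-training hands to the inference team.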
Yeah. I mean you can also make things 57:25
require communications in a lot of 57:27
places 57:29
uh which would make it harder for 57:30
inference. Um totally 57:32
you can also just make things 57:34
complicated and like there's no 57:35
fundamental reason it's hard but there's 57:37
only so many people on the inference 57:38
team and like they have to implement it 57:40
in a bunch of places. 57:41
Yeah. 57:42
Yeah. No, so I definitely think of like 57:43
the like inference is the team that I 57:44
work the most closely with like 57:46
because we're kind of like co-designing 57:49
models to be smart and cheap. 57:51
Yeah. Interesting. particularly in a 57:53
world of like limited compute, right? 57:54
Like the sort of the bottleneck I think 57:56
to a large degree on our I mean you can 57:58
see Anthropic has rate limits constantly 58:00
and people complain about a lot and like 58:02
the reason is like 58:03
there's only so much compute we can get 58:05
on on short notice. So you like making 58:06
your inference more efficient is like 58:08
the way you can serve more users 58:10
and actually like let's say you had 100x 58:11
more compute or we somehow didn't live 58:13
in a world where compute was limited. 58:15
Does that change a ton about what you do 58:17
or is it still kind of the well you're 58:20
just going to grab all of it whatever 58:22
compute you have and keep going down the 58:24
loss curve and you kind of well you it's 58:25
like impossible to be in the world where 58:27
there is enough compute. 58:28
So I think if we got like infinite 58:30
compute the challenge would be making use 58:31
of the compute right so like then you 58:32
would start to run into these issues 58:34
like oh well when one chip fail you know 58:35
like okay I'm going to throw two billion 58:37
chips around but what happens when a 58:38
chip fails so I think we would be 58:40
limited on people then it would be like 58:42
how fast can we solve the hard 58:43
engineering problems to scale up. But I 58:45
do think the change is massive and I 58:47
think people like don't realize how chip 58:48
limited AI like research is or something 58:50
right now. Like the models that everyone 58:53
uses, right? If you're using like Claude 58:55
Sonnet 4, Claude Opus 4, it's like it's 58:57
our first shot at models at that scale, 58:59
right? And like 59:01
if you think about anything like you 59:02
could do it and you could do it again, 59:04
you could do a better job. But if you 59:05
sort of imagine like 10x the compute, like 59:07
you could run this every day instead of 59:09
every few months like you or 100x maybe 59:11
for that then like yeah it's just it's a 59:14
really it would be a really big change 59:16
to have a lot more compute and it's 59:17
coming right like that's like kind of 59:18
the fun part of the field is like every 59:19
year you're like oh I had no compute a 59:21
year ago then exactly how do you think 59:22
about methods like uh like discrete 59:25
diffusion like I saw there's like a 59:27
Gemini Diffusion model and I think about 59:29
that in the space I used to be in where 59:30
um there's a lot of discrete diffusion 59:32
models being used in protein design for 59:33
example space where my startup was like 59:35
do you see that as a domain where 59:37
there's going to be interesting uh 59:38
advances happening? 59:40
I'll be honest like we haven't done 59:41
image generation and I think that's been 59:42
like the main use for diffusion. So I've 59:44
kind of had this on my like to-do list 59:46
of like things I should understand for a 59:48
while and like there are people in my 59:50
team who do understand it and would have 59:51
better thoughts but like I actually 59:52
don't think I understand it well enough 59:54
to know. I I do have it kind of in my 59:55
this category of like yeah 59:57
not a total paradigm shift, like, and there's a lot 59:59
of things that aren't like a huge 00:01
paradigm shift but they're like pretty 00:02
big changes to how things run and I 00:04
expect like there are some of those that 00:06
will work um I don't know if it's 00:07
diffusion or if it's another one 00:09
Obviously who knows what Anthropic will 00:10
do in the future but at least in the 00:12
near term are there things where you see 00:13
big areas where a startup can win in the 00:15
world in which Anthropic is getting you 00:17
know making their models better 00:19
year-over-year 00:20
my general read is like anything that 00:20
benefits from the model getting smarter 00:22
I think Like on the one hand there's 00:25
like a lot you can always be like oh 00:27
yeah the if you're doing a startup like 00:29
all the AI labs are big companies 00:31
they'll be bigger than you and they 00:32
could do that thing but also like we're 00:33
all working on this general system that 00:35
covers a lot of different uses and the 00:37
the plan is to like power all the 00:40
startups to do all of the individual 00:41
work. So yeah I think like anything that 00:43
just kind of looks like oh this almost 00:45
works with current models but requires 00:48
like a bunch of work is a pretty 00:50
promising direction. Uh, I think maybe 00:51
the thing to watch out for is things 00:53
where like they work now with a huge 00:54
amount of work like to build up a 00:56
scaffold, but the next generation you're 00:58
not going to need the whole scaffold you 01:00
built up. That's I mean maybe that's 01:01
fine. I don't know. Like maybe you just 01:03
build up the business with the scaffold 01:04
and then you don't have to do any work 01:05
later and you business, but like I don't 01:06
know about the business side of it, but 01:08
like it does feel a little silly to put 01:09
to invest a ton in that. 01:11
Yeah, totally. 01:13
What about on the flip side? Are there 01:14
things in your training uh stack where 01:16
you're like, man, if there was a company 01:18
that solved X problem, I would totally 01:20
buy their product. 01:21
Yeah, there's like a ton. I do think 01:22
that like probably most of these like 01:24
the way I would probably structure would 01:26
be like almost like making something but 01:27
then consulting with the comp like 01:28
offering a service to companies for 01:30
free. 01:32
Particularly for like companies that are 01:33
scaling really fast, you're almost 01:35
always limited on like how many people 01:36
you can have. So if you can like 01:37
even if you could hire people to do it 01:39
yourself, actually being able to 01:40
contract someone else to do it where 01:41
like they're managing it and you know 01:43
hire all the people and like deal with 01:45
the organizational side could be useful. 01:47
I mean there's huge amount of stuff. One 01:48
that jumps to mind we talked about like 01:50
chips that do math incorrectly. Like it 01:51
would be lovely if there was some 01:54
startup that like you could just say 01:55
like here are my chips. Confirm they're 01:57
all perfect. And if they're not, let me 01:58
know exactly what went wrong on like 02:00
what fraction of them. And like I can 02:02
tell you the math is wrong, but I 02:04
couldn't really tell. I don't really 02:05
know enough details of chips to be like 02:06
this chip failed because this particular 02:08
like low-level component was like wired 02:10
wrong or like got hit by a gamma ray. I 02:12
don't know what causes it. You could 02:15
always go like a bunch deeper. I 02:16
mean, the thing I'd maybe just push 02:18
startups on is thinking a little bit 02:19
about like uh this is maybe less 02:20
technical, but just like what happens 02:23
once we get AGI and like how to make 02:24
sure that like goes well for the world 02:26
or something. Like my my expectation is 02:28
like if you actually automate 02:30
almost everything a person can do. The 02:31
amount of economic growth there is just 02:33
like truly enormous and I would think a 02:34
little more about like how do you make 02:38
this like help the world versus not. I 02:39
think there's going to be like plenty of 02:41
economic success or something as a 02:42
result of it anyway. 02:43
Yeah, absolutely. Yeah. Um last question 02:44
I want to ask you is around if you 02:46
rewind back to where we started like 10 02:48
years ago. Uh you're a student, you're 02:50
pivoting into AI from kind of economics 02:52
work you were thinking about. Um and you 02:54
know all sorts of things you probably 02:57
did in those early days had some kind of 02:58
compounding return for you as you 03:00
developed into the role you have now. 03:02
Like what advice would you give to 03:04
students as they think about uh entering 03:05
the workforce, especially today? Um 03:08
learning skills that are going to be useful 03:10
and maybe getting themselves jobs like 03:11
the ones you have right now 10 years 03:13
later? It's hard because I think the 03:14
timing is very different. Like I just 03:16
think, like, we've made a 03:17
lot of progress. So like what I would do 03:19
10 years ago is different from what I 03:20
would do today. 03:21
Totally. 03:22
But I think certainly if I went back 10 03:22
years ago I would be like focus on AI. 03:24
It's like the most important thing and 03:26
particularly focus on engineering which 03:28
I think wouldn't have seemed 03:30
obvious to me at the time that like the 03:31
important thing was these engineering 03:33
skills and not the like math and 03:34
theoretical understanding of like you 03:37
know uh SVMs and like all the kind of 03:38
standard 03:41
ML literature. Um, I think today I would 03:42
probably focus a bunch on the like 03:44
engineering and on the like figuring out 03:46
what to do with AGI as sort of the two 03:48
like main things that feel top of mind 03:52
for me. 03:54
Let's call it there. Thanks so much, 03:54
Nick. Appreciate it. 03:55

Lyrics & Translation

[English]
[Music]
Hey guys, I'm thrilled to be joined
today by Nick Joseph, the head of
pre-training at Anthropic. To give
viewers a high-level sense of what we'll be
covering, we're going to start with the
basics of what pre-training is and then
dig into how Nick thinks about strategy,
data, alignment, and infrastructure at
Anthropic. And by the end, you'll
hopefully have a sense for how progress
in AI comes directly from advances in
pre-training. I would love to talk a
little bit about your backstory and kind
of how you got to this point. Where did
you work before Anthropic? And what were
your takeaways from those places? Yeah.
So let's see. I was at Vicarious uh and
then at OpenAI uh before Anthropic. So
Vicarious was originally an AGI lab and
sort of when I joined they were sort of
making a shift to product particularly
working on robotics products and the
thing I worked on was like training uh
computer vision models for for their
robotics products. It was my first job.
So I think I just like learned a ton
about like how to do machine learning
models, how to like write machine
learning infrastructure.
And at the time were you also thinking
about a career as an academic? Like at
the time a lot of people doing AI work
were in PhDs. That's kind of what I was
thinking about before I started to do a
company. Like how were you thinking
about that in your headspace?
Yeah. So like I'll actually rewind a
little bit. I think like a lot of my
thinking on this had come from an
internship I did at GiveWell, which is
like a nonprofit that evaluates
charities. And some people there being
like ah we're at some point we might
have AGI. It could be dangerous. We
should worry about these risks. This
could be like a big impact on humanity.
And I was like not super convinced at
the time and went down the economics
route and was going to try to work on
like directly helping people in poverty.
That didn't work out for various reasons
and I ended up being like, okay, I'll at
least work on AI either like the safety
thing will turn out to be important I'll
work on that or it won't be and I'll
just make cool things with AI that can
probably help people in poverty more.
I wasn't really coming at it from an
academic standpoint. In fact, when I
switched to that,
part of the appeal was that I could like
immediately go do stuff in AI, whereas if
I wanted to work in like economic policy
I'd have to wait,
I don't know, six years to do a PhD and
start. And, like, totally, it's a
longer path.
And what did the state of AI safety
work at that time even look like? Like
who were the people who were thinking
about that kind of stuff? I mean there
were some folks at Vicarious thinking
about this kind of thing but it was
fundamentally a robotics company. And
so yeah, how were you thinking about
that at the time?
Yeah. So my sense was like at the time a
lot of the AI safety discussion was kind
of theoretical like the models weren't
actually that good. They weren't really
posing these dangers. So it was a lot
more like philosophical like oh at some
point we might get AI that's really
smart, smarter than humans, and like
how should we weigh this future concern,
how should we compare that to near-term
things? And I think that was like
actually just a less compelling
argument. I think it was like an
interesting one and like sort of made
you think of it.
So next you went to OpenAI. What was
OpenAI like at this time?
Yeah. So I was at OpenAI. I was on one
of the safety teams and kind of worked
on uh
I ended up working on code models
actually. And kind of when I got there,
the first thing I saw was, oh,
they'd fine-tuned GPT-3 to write some code
and it was really good. And I was
like, oh okay, if you're worried about AI
getting really powerful, writing its own
code, that
seems like it could self-improve, and
how likely is that to happen? So I was
doing a bunch of evaluations and like
studies of what contributed.
And then after like eight months,
basically everyone I worked with, like
all the safety leads, left
and, yeah, invited me to go to
Anthropic. And the
reason I joined OpenAI was because I
cared about AI safety and wanted to work
with them. So then I went with them to
join Anthropic uh pretty much right when
it started.
With that, why don't we transition a bit.
These days you run the pre-training team
specifically at Anthropic. Um, obviously
you've been working on pre-training at
Anthropic for quite a bit of time and
I'm sure it's evolved over the years,
what that even entails and looks like.
Why don't we start by just talking a
little bit about what pre-training
is? Like, how does it even fit into the
way of thinking about how AI models have
developed at a place like Anthropic? And
what exactly do you guys do?
We know that one of the ingredients to
making AI models better is scale. You
want to put a lot of compute in. And if
you sort of step back and you're like,
okay, what's the way we could put the
most compute into a model
possible? We need some objective that
there's just like tons of data for. And
one idea here is like the internet. The
internet is massive. It's probably the
biggest like single source of data
humanity has created. And you don't have
labels. It's like you don't want someone
to have to go in and look read the
entire internet and like say something
about it. So you want to get labels out
of the data itself.
And the idea here is we can take some
text and we can predict the next word.
So you take, you know, "the" as the first
word, you predict the second word, then
you say "the cat" and you predict the word
after that. And this means you get very
dense signal. Every word is like a
new example. And there's a huge amount
of data, and one of the findings from
GPT-1 and GPT-2 was kind of, as you throw more
compute at this, more data, bigger models,
you get smarter
models essentially.
Totally.
Um and that's kind of been the central
thesis of pre-training for the whole
time.
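As a rough illustration of the "labels come out of the data itself" idea described above, here is a minimal sketch in Python. It splits on whitespace purely for readability (real systems work on tokens rather than words), and the function name `next_word_examples` is made up for this example.

```python
def next_word_examples(text):
    """Turn raw text into (context, next_word) training pairs."""
    words = text.split()
    # Every position after the first yields one example: the label is
    # simply the word that actually comes next in the text.
    return [(words[:i], words[i]) for i in range(1, len(words))]


examples = next_word_examples("the cat sat on the mat")
for context, target in examples:
    print(" ".join(context), "->", target)
```

A six-word sentence gives five training examples with no human labeling, which is the "dense signal" point made in the conversation.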
Uh there's this idea of scaling laws
which is that you can actually quantify,
like, as you put in more compute,
more data, more parameters, you get
a lower loss, a better
prediction of the next word, in a very
predictable way. And I think you can
somewhat foresee from that original
paper and I think like Dario did foresee
this I think many people did but wasn't
obvious was that once you have that
there's this positive feedback loop
where you can train a model you can use
it to make something useful and sell
that and get more money use that to buy
more compute and then you actually train
a better model, and we've sort of run
that cycle over and over again over the
past 5 years or so.
Well, in thinking
about that objective to begin, you know,
I think the way I think about the state
of pre-training is yeah, it seems like
this next word prediction, at least from
the external standpoint, seems to be the
dominant way pre-training happens. But
if I rewind the clock to that era of
2017 to 2020, or 2021 and 2022 even, there
were all sorts of pre-training objectives
people were considering, right? There
were these uh BERT and BART models that
were doing masked language modeling. It
seems like this GPT series of models
doing uh autoregressive modeling, as
you describe, this next word prediction,
seems to be the dominant one that won
out. Do you have any reflections on that
time period? Like were you guys trying
all of them and kind of this one worked
or or is there some sort of first
principles reason why this is like the
right one that should have worked?
I think the answer is like it's mostly
empirical like in terms of how to think
of the things I'd be like yeah it's
empirical just try them all see what
works. One big advantage for this
autoregressive setup is that you can just
sample from it to generate text
afterwards in a fairly
straightforward way that
enables a product use very nicely.
Um, like one thing that you want,
one characteristic, is a loss where, as you
drive down the loss, that actually is the
thing you care about. And you can think
of it as like, if you got to perfect on
language modeling, you now can write
text as a human. You can sort of imagine
you put in the title of a paper and it
should spit out a
novel paper. Whereas I think some of the
other approaches don't quite have that
uh flavor.
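To make the "you can just sample from it" point concrete, here is a toy sketch. It assumes a tiny count-based bigram model as a stand-in for a real language model, and the names `train_bigram` and `sample` are hypothetical. The loop shape is the part that carries over: predict a distribution over the next word, draw from it, append, repeat.

```python
import random
from collections import Counter, defaultdict


def train_bigram(text):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def sample(counts, prompt, max_new_words, rng):
    """Autoregressive sampling: each new word is drawn given the text so far."""
    out = prompt.split()
    for _ in range(max_new_words):
        options = counts.get(out[-1])
        if not options:
            break  # no observed continuation for this word
        words, weights = zip(*options.items())
        out.append(rng.choices(words, weights=weights, k=1)[0])
    return " ".join(out)


corpus = "the cat sat on the mat and the cat ran"
model = train_bigram(corpus)
generated = sample(model, "the", max_new_words=5, rng=random.Random(0))
print(generated)
```

A masked-language-model objective has no equally direct generation loop, which is one way to read the product advantage mentioned above.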
Yeah, totally. Yeah, it makes sense that
in terms of that loop you're describing
of, you know, then release something
that gets you revenue and you can use
that to buy more compute and iterate.
This sort of gives you the most natural
way to actually do that flow because you
can keep releasing new products and keep
getting the revenue from that to invest
in more compute and so on.
Yeah, it certainly gives you the most
open-ended thing. You could imagine, you
know, you like train something as a
class like you train some base thing,
you fine tune it for a bunch of
particular tasks, which is one approach people
would use. They would like do this big
pre-training and then they wouldn't just
like open-endedly sample from it. You'd
fine tune it on like a hundred specific
tasks and that could work too. I think
that like the one sort of general
intuition I have is like compute is the
thing that matters. So like I think if
you throw enough compute at any of these
objectives, you're going to get
something that's probably pretty good uh
and can kind of be fine tuned to other
things. And it's surprising how
little these details matter compared to
throwing more compute.
When you think
about actually throwing more compute,
there's a whole bunch of axes by which
you could throw compute at it too,
right? And if you have a specific model
architecture you're training over, you
can basically throw more data at that
specific architecture. For a particular
one, you could add more layers or make
the models larger in it. You could do
some kind of neural architecture search
over lots of different variants. And I
assume that these days it's somewhat
more figured out, you know, which
architecture you go for. I assume the
earlier days it was somewhat less so.
And and I'm curious if you could speak
to how you guys thought about that. like
what did your infrastructure even look
like to do that type of determination?
I mean, I think the the short answer is
it's hard, right? Like what you're
really doing is you're going to train
this one big expensive model and you
have a space of, you know, you can sort
of call all these things
hyperparameters. You know, how many
layers do you have? What's your width?
Like you have the space of hundreds of
hyperparameters and you want them all to
be optimal and you're sort of striking
this balance actually between how much
do they matter like can you just take
your best guess and throw more compute
at it in whatever way you want and it
basically doesn't matter, versus how much you
want to get it precisely correct.
Interesting.
And I think one of the like interesting
things is like it actually doesn't
matter that much. Like, I think
this was in one of the early scaling
laws papers like you can change these
things and get little wins but like as
you throw more compute it sort of
reliably gets better. If you mess up
enough, you will sort of stop
seeing that happen, and you won't have
any way to know, which is
kind of the hardest part in
some ways.
You don't know the counterfactual
basically because you didn't run it for
long enough to actually know what it is.
Yeah. We have these scaling laws. So you
can sort of say, like, as you train with more
compute you expect the loss to go down as
a power law.
It's really a power law plus a constant.
So what eventually will happen is you'll
curve off that power law and then you
know something is wrong. And is it
fundamental? Is it like you've hit the
limits of scaling, or is it, nope, you
should have tweaked
your learning rate slightly differently?
And that's sort of one of the
challenges in terms of how to like
figure it out. The usual
paradigm is like test things out at
small scale before running them at large
scale and try to find
Small scale in terms of data or in terms
of something else?
Uh, in terms of
everything. Like you kind of want to
scale things down like proportionally.
So you want
to have some theory for like how
you're going to scale up, like, ah, okay, if
I get 10 times as many flops how much of
it goes into layers how much of it goes
into data how much of it goes into
attention
and you sort of get that theory and then
test that it's optimal a bunch with like
scaling everything down proportionally.
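The "power law plus constant" shape and the "test at small scale, then extrapolate" paradigm described above can be sketched as follows. This is an illustration only: the functional form L(C) = a * C^(-b) + c is the one named in the conversation, but every number here is made up, and the fit assumes the constant c is already known, which is a simplification.

```python
import math


def loss(compute, a, b, c):
    """Power law plus constant: L(C) = a * C**(-b) + c."""
    return a * compute ** (-b) + c


def fit_power_law(computes, losses, c):
    """Least-squares line through (log C, log(L - c)); returns (a, b)."""
    xs = [math.log(x) for x in computes]
    ys = [math.log(l - c) for l in losses]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Hypothetical "true" curve, used only to generate synthetic small runs.
TRUE_A, TRUE_B, TRUE_C = 10.0, 0.1, 1.5
small_runs = [1e15, 1e16, 1e17, 1e18]  # compute budgets (made up)
observed = [loss(x, TRUE_A, TRUE_B, TRUE_C) for x in small_runs]

a_hat, b_hat = fit_power_law(small_runs, observed, c=TRUE_C)
predicted = loss(1e21, a_hat, b_hat, TRUE_C)
print(f"fitted exponent {b_hat:.3f}, predicted loss at 1e21: {predicted:.3f}")
```

If a large run lands clearly above the extrapolated value, that is the "curving off the power law" signal; the sketch cannot say whether the cause is fundamental or a mistuned hyperparameter, which is exactly the difficulty raised in the conversation.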
And just so I can think about what
this actually looks like in those
early days of Anthropic, you know,
you're a team of like 10 or something
like that in those very early days, or 12
maybe. What actually is your ability to
use large scale infrastructure as like a
relatively nimble startup at that time?
I mean a startup that was well
capitalized but still not actually that
many people working at it. What kind of
infrastructure did you have access to to
train these early models?
So actually one of the wild things was,
at least I mean you don't know what
anyone else is doing of course but it
kind of felt like we were like at the
frontier of it and there just weren't
that many people who cared like I was
sort of coming you know I was coming at
it from like we're making AGI this is
the most important technology ever and
then would kind of like look around and
be like and it seems like I'm one of 30
people who were working on this in like
the world. I mean I was kind of like a
junior person. Everyone else sort of
knew how to do this and had done it
before but I was kind of surprised at
how easy it was. Um like the public
estimates for GPT-3 I remember were that
it cost $5 million to train which you're
like on the one hand five million is
kind of a lot but it's like a lot for an
individual person. It's not really a lot
from like a company perspective. So we
could totally buy like compute that was
enough to train models like that.
And were you using a cloud provider or
did you have a custom setup somewhere
or did you literally have racks in a
room somewhere where you, you know,
bought a bunch of Nvidia GPUs and you
were doing it?
Uh, we were using a cloud
provider, but I think it's
not actually that different, because one
of the things that was surprising to
me is you actually have to understand
the literal layout. Like, uh, I
remember at one point one of my
co-workers running a clustering
algorithm to identify what rooms all the
chips were in, since we had a
hypothesis that they were in different
rooms, or, you
know, different buildings, and that was causing some sort of
network latency. And you can kind of
figure it out. You could like reverse
engineer, like, ah, okay, yeah, there's
clearly like two clusters here that are
connected better and there's some issue
on the connection between them. Like
we're trying to push the limits
of the hardware like as much as
possible,
um particularly at the beginning when we
were kind of like we have way less
funding than everyone else, we have to.
And most people weren't very
efficient with the compute, so we were
like, ah, we get a big lead by being
really efficient at how we use
the compute.
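The conversation does not say which clustering algorithm was used to work out the physical layout, so the following is only a guess at the general idea: given pairwise latency measurements between nodes, group nodes whose mutual latency is low. The function name, the threshold, and the latency numbers are all hypothetical.

```python
def cluster_by_latency(latency_us, threshold_us):
    """Group node indices that are linked by low-latency pairs (union-find)."""
    n = len(latency_us)
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the group representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if latency_us[i][j] < threshold_us:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Made-up measurements: nodes 0-1 and nodes 2-3 are close to each other,
# while anything crossing between the two pairs is roughly 10x slower.
latency_us = [
    [0, 5, 50, 52],
    [5, 0, 51, 49],
    [50, 51, 0, 6],
    [52, 49, 6, 0],
]
clusters = cluster_by_latency(latency_us, threshold_us=20)
print(clusters)
```

With real measurements the split is rarely this clean, so a threshold picked by eye from a latency histogram is usually the first thing to try.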
Could you talk a little bit about some
of the things you guys did in those
early days for how to get the most out
of the hardware? I think it's really
interesting. Like I think back to the
early days of Google, for
example, where there's these
cases where they basically bought
relatively cheap consumer chips and then
they optimized the software to make it
so you can actually get the most bang
for your buck out of them and that's how
they had all this high latency or low
latency high availability stuff. I'm
kind of curious if there's some analog
in the early AI era to that.
I think for us it was largely about like
getting the distributed framework right.
So, like, in order to
train, you have to train them on a large
number of chips,
and there's a bunch of different
approaches to how to do this. There's
like data parallelism, and there's
pipelining, there's sharding, and like
getting all of the...
At the time there
were no like great open source packages
you could just grab and use that just
worked for this. I mean today there's
somewhat more of these but at the time I
assume there was literally none.
There were some. Like, I actually remember
that we were working on data parallelism
early on and someone was like, and now we
write the all-reduce. And I was like, we
really do this ourselves? We don't like use a
package? And this was kind of like, well,
we're going to want to modify it, right?
Like, oh, we don't want to outsource
this to some package because we're
about to go to a bigger scale. Like
PyTorch, they had a package for
doing this, but we were going to go to a
bigger scale than Facebook had been to,
and you don't want to have a dependency
on a package that you're going to
have to be like constantly modifying,
essentially.
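For readers who have not seen the collective being discussed (presumably all-reduce, the step in data parallelism where every worker ends up with the sum of all workers' gradients), here is a single-process simulation of the common ring variant. It is a teaching sketch with plain Python lists standing in for devices, and it is not a claim about how Anthropic's implementation works.

```python
def ring_all_reduce(worker_grads):
    """Simulate ring all-reduce (sum) over equal-length per-worker lists."""
    n = len(worker_grads)
    length = len(worker_grads[0])
    # Split each vector into n contiguous chunks.
    bounds = [(i * length // n, (i + 1) * length // n) for i in range(n)]
    data = [list(g) for g in worker_grads]

    # Phase 1, reduce-scatter: after n-1 steps, worker r holds the fully
    # summed chunk (r + 1) % n.
    for step in range(n - 1):
        msgs = []
        for r in range(n):
            c = (r - step) % n
            lo, hi = bounds[c]
            msgs.append((c, data[r][lo:hi]))
        for r in range(n):
            dst = (r + 1) % n
            c, payload = msgs[r]
            lo, _ = bounds[c]
            for i, v in enumerate(payload):
                data[dst][lo + i] += v

    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        msgs = []
        for r in range(n):
            c = (r + 1 - step) % n
            lo, hi = bounds[c]
            msgs.append((c, data[r][lo:hi]))
        for r in range(n):
            dst = (r + 1) % n
            c, payload = msgs[r]
            lo, hi = bounds[c]
            data[dst][lo:hi] = payload
    return data


grads = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
    [100.0, 200.0, 300.0, 400.0],
]
summed = ring_all_reduce(grads)
averaged = [[v / len(grads) for v in row] for row in summed]
print(summed[0])
```

Each worker only ever talks to its ring neighbor, which is why this pattern is sensitive to the physical network layout and why a team pushing scale might want to own and modify it.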
That's such a counterintuitive
sentence there too, like, we're going to a
bigger scale than Facebook will, because
at the time Facebook AI Research was
considered one of the best places to do
machine learning research. Like
FAIR and DeepMind were
hiring lots of people out of PhD
programs and doing lots of things. Like
what was your headspace when you were
like, okay, this very established lab
with great people and whatnot we are
operating on a scale that is not
relevant to them like was that natural
and obvious to you or was there times
where you kind of doubted the decisions
you were making in that situation
I think it was surprising. Maybe
I'm just too arrogant or something. I
kind of looked around and was like, what
are these people doing? They're all
missing the like big picture here. Like
I think the scaling laws were pretty
clear, and the arguments against I
just thought were kind of nonsensical.
Like, you know, I think the
original scaling laws paper had like 11
orders of magnitude and there was like
this intense debate on whether it would
continue for like another point and I
was like
like, it seems like one over 11
is maybe your chance it fails here and
then like you know sometimes it doesn't
work like sometimes it just works
straightforward you like train the model
you're like oh yeah of course but yeah I
do think that it was it maybe felt
obvious when you're in that headspace
and you're working on this all the time
and you're making those plots and I
think these things feel pretty different
when you're on the outside. You know,
there's a huge space of papers. Everyone
tries to make their paper sound like
very robust and and important. I I could
see I could see being like, "Oh yeah,
this is not really a thing."
Totally.
But also different labs had different
cultures. So like I think one of the
things at FAIR was it was much
more PhD-style independent research.
People have their own ideas, pursue
those.
You're fighting for your compute and so
on.
Yeah. And to do a project like training
a large language model requires a lot of
people to collaborate on like a really
complicated piece of infrastructure that
isn't going to be a paper, right? Like
you're not going to publish like, oh, I
got 5% more efficiency
Totally.
than the next one. Um, and it's not
respected in like those cultures
necessarily. So that might have been
part of it.
Okay. So then when you actually
implement these models, you're
saying you're using a level of low-level
programming where you know you're using
libraries like PyTorch, but you're
perhaps not using everything right out
of the box from PyTorch because there's
things you guys want to customize that
are at the level of basically one level
of abstraction below them, but not
necessarily at the level of abstraction
of you know writing custom CUDA kernels
or like was that also in the space
that you guys were thinking about?
So it depends on like the operation. So
like I think I was mostly operating at
the level of like torch.matmul, you know,
like, yes, where does a matmul go, but
not thinking like how do you make the
matmul efficient. Like I assume torch
figured out how to make a matmul as
efficient as is possible. But there are
some pieces like attention where there
was just kind of a lot of different
variants, and attention is really
complicated and hard to make efficient
on a GPU, and those things you have to
kind of go more levels down on the
stack. Uh I think there was like a
process that is maybe interesting that
I'd never really like thought of before
of like how to do it which is sort of
like modeling out the thing
you're going to do, coming up with a
strategy for how to parallelize it that
like can get to a really good efficiency,
you know.
So you're thinking about MFU basically,
like your utilization on your GPU. So
there's like a goal utilization you're
trying to get at and a strategy to get
to there. You're saying
Yeah, and I think like one of the things
you can do is you can actually like
pencil-and-paper math out what
efficiency you're going to be able to
get to, right? You know all the
constraints. MFU is FLOPs
utilization, but like the reason you
don't get good MFU is you end up limited
on HBM bandwidth, you end up limited on, I
don't know, host, like CPU offload.
There's a bunch of different pieces, but
there's not that many pieces.
There's like six relevant numbers there,
so you can totally model it out,
understand what the constraints are, and
then implement something that can get
there. It of course will be really
inefficient when you implement it, and
then the next step is like pulling out a
profiler. So you want to be able to
profile the job, look at how long every
operation takes, have a model in your
mind of how long every operation should
take, and then make those two things the
same.
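A pencil-and-paper estimate of the kind described above can be written down in a few lines. The sketch below uses only two of the "six relevant numbers" (peak FLOP/s and HBM bandwidth) and a simple roofline-style assumption that an operation takes as long as its slower resource. The hardware figures are invented, not any real chip's spec, and `matmul_estimate` is a hypothetical helper.

```python
def matmul_estimate(m, k, n, peak_flops, hbm_bytes_per_s, bytes_per_elem=2):
    """Estimate time and FLOPs utilization for an (m x k) @ (k x n) matmul."""
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    # Read both inputs and write the output once (ignores caching effects).
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    compute_s = flops / peak_flops
    memory_s = bytes_moved / hbm_bytes_per_s
    seconds = max(compute_s, memory_s)  # the slower resource sets the pace
    return {
        "seconds": seconds,
        "mfu": flops / (seconds * peak_flops),
        "bound": "compute" if compute_s >= memory_s else "memory",
    }


PEAK_FLOPS = 1e15  # hypothetical accelerator: 1 PFLOP/s
HBM_BW = 2e12      # hypothetical: 2 TB/s of HBM bandwidth

big = matmul_estimate(8192, 8192, 8192, PEAK_FLOPS, HBM_BW)
skinny = matmul_estimate(1, 8192, 8192, PEAK_FLOPS, HBM_BW)
print(big["bound"], skinny["bound"])
```

The profiler step then amounts to comparing measured per-operation times against `seconds` from a model like this and chasing the largest gaps.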
And were there good out-of-the-box
profilers you could use at that time or
did you guys have you know because
people weren't operating on the kind of
network topologies you guys may have
been using. Did you have to write your
own profilers basically to do this type
of you know multi-node optimization?
Yeah, it depends when. They're actually getting
better with time. The PyTorch profiler
was like pretty good actually throughout
for a single GPU. If you want to like
profile a GPU, the PyTorch profiler would
work. But if you wanted to profile a job
on hundreds, thousands of GPUs, that
like hadn't really been done much. And
then that was kind of more of us like
hacking into the profiler to figure out
how to combine all the traces together.
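The conversation does not describe how the traces were combined, so the sketch below only illustrates the general shape of the problem: take one event list per rank, tag each event with its rank, put everything on one timeline, and look for ranks where an operation takes unusually long. The event fields (`name`, `ts`, `dur`, `pid`) are loosely modeled on the Chrome trace event format; the data and function names are invented.

```python
def merge_traces(traces_by_rank):
    """Combine per-rank event lists into one timeline tagged with the rank."""
    merged = []
    for rank, events in traces_by_rank.items():
        for ev in events:
            merged.append({**ev, "pid": rank})  # pid marks the source rank
    merged.sort(key=lambda ev: ev["ts"])
    return merged


def slowest_rank_per_op(merged):
    """For each op name, report which rank had the longest duration."""
    worst = {}
    for ev in merged:
        name = ev["name"]
        if name not in worst or ev["dur"] > worst[name][1]:
            worst[name] = (ev["pid"], ev["dur"])
    return worst


# Made-up traces in microseconds: rank 1 stalls badly in the collective.
traces = {
    0: [{"name": "matmul", "ts": 0, "dur": 100},
        {"name": "all_reduce", "ts": 100, "dur": 40}],
    1: [{"name": "matmul", "ts": 0, "dur": 105},
        {"name": "all_reduce", "ts": 105, "dur": 300}],
}
merged = merge_traces(traces)
worst = slowest_rank_per_op(merged)
print(worst)
```

On a real cluster the clocks on different hosts are not perfectly aligned, so timestamps would need some offset correction before a merged timeline can be trusted.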
And then one more question on that
earlier is, you know, you had mentioned,
you know, you hadn't really done a lot
of this work before, maybe some time at
OpenAI and those early days at
Anthropic. How did you actually go learn
all this stuff? Like what was your
process for learning about those six
things that were relevant to bandwidth
limitations and whatnot?
I mean, so when I joined Anthropic, one
really nice thing was there just wasn't
that much. I think my first day I read
through all of Slack
and the entire like internal database
and learned a bunch from that. Like it
was kind of nice to just be like
everything is relevant to me. And then I
mostly learned from pair programming.
Like uh Tom Brown had done all this
before. So he kind of like knew all the
stuff quite well. Sam McCandlish, my manager,
had also done a lot of it before and I
just like paired with them a huge amount
at the beginning. And I think one of the
things I really like about pairing as a
way of learning is you learn the like
thing you're trying to do. Like, you will
learn that, like, if you're pairing with
someone better than you, they can just
do it. So you're mostly just watching
them. But you also learn how people do
it. So something like how to use a
profiler is not something you would ever
learn from seeing someone's like final
write up on Slack for their PR. You
would just be like, "Oh, they found
these. They changed this specific line
and it's a win."
Like, you need to watch like a YouTube
video for 4 hours of someone messing
around with a profiler to like maybe
self teach it or something or to
actually pair with someone is basically
the best you can do.
I think there was like one thing that I
I think is embarrassing now that I look
back is I'd never actually used a
debugger before joining Anthropic.
People talk about pdb, and I'm like, yeah,
that's a thing people use, but print
seems fine for me.
Then I like watch, like, oh no, a debugger
is a super useful tool. This person's
way faster at debugging things,
particularly if it takes a long time to
start up the code. And
yeah, learning that sort of thing I
think comes best from pairing. And then
there's of course the obvious: you just
learn by doing. You know, I eventually did
like spin up a profiler and stare at it for
many many hours.
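The point about slow start-up is worth a small illustration. With print-debugging, every new question means paying the start-up cost again; with a debugger, execution pauses once at the interesting spot and the live state can be inspected interactively. The sketch below assumes the tool meant is Python's built-in `pdb`; the functions and the fake loss values are made up, and the breakpoint is guarded by a flag so the script also runs unattended.

```python
import pdb
import time


def slow_setup():
    """Stands in for start-up that might take minutes in a real job."""
    time.sleep(0.01)
    return {"losses": [2.5, 2.1, float("nan"), 1.9]}


def first_bad_step(losses):
    """Return the index of the first NaN loss, or None if all are finite."""
    for i, value in enumerate(losses):
        if value != value:  # NaN is the only float not equal to itself
            return i
    return None


def run(debug=False):
    state = slow_setup()
    bad = first_bad_step(state["losses"])
    if bad is not None and debug:
        # Pauses here with `state` and `bad` in scope; inspect with
        # commands such as `p state`, `up`, `down`, `continue`.
        pdb.set_trace()
    return bad


bad_step = run(debug=False)
print("first bad step:", bad_step)
```

Calling `run(debug=True)` from a terminal drops into the debugger at the failure without rerunning the setup for each new question.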
Totally, yeah, exactly. Okay, so
then that was sort of the very early era.
Over time obviously pre-training has
become bigger and bigger. As you're
describing scaling, I imagine you're
using many x more GPUs, much more compute
over time I'd be really curious to hear
first at a high level What do you feel
has changed about the pre-training
strategy that you could talk about?
Obviously, there's more compute, but
what does that actually mean to have
more compute in terms of what you think
about differently from those early days
versus now?
I'm sure the things that haven't changed
cuz I think it is like shocking how in
some ways like
I think I'm still pushing down the exact
same metric that I was on like day one.
Like, there's some loss function,
make the loss go down, and I think you could like
look at some like you could probably run
the original the first model I trained
on the same metric and just like make a
plot of like progress of the team over
time. Uh so that's all the same. I think
the
one OKR is like one thing that matters
basically. Yeah, totally.
And like I mean talking about like OKRs
it's very sized company you're like oh
should you do OKRs and it's always felt
a little bit funny for uh a team like
where I'm like sure I can just pick a
loss value but like the answer is like
as low as possible. We will continue to
work on that forever.
I think the biggest thing that has
changed has been a little more
specialization. Like I think at the
beginning, I mean the first like 3 or 6
months I tried to read every PR in the
codebase and that was great. I knew all
the pieces etc. And as you grow, it's
kind of everything gets like a little
more precise. You know, people really
dial in exactly how attention should
work, let's say, or you know, really
dial in like uh the parallelism
strategy. and you end up with a team
where it's a bunch of people who are
like deep experts on individual things
which is great because it means you can
go you can go really deep on those
things but sometimes you uh at least for
me as a manager one of the things you
sometimes have to think about is like
making sure the bigger picture makes
sense and also that you have enough
people who actually do understand the
whole bigger picture that there's no
like single point of failure.
Yeah, it's interesting you frame it
with that trade-off, right?
Because as you were describing that I
was trying to think, you know, is this a
bug or a feature? Like there's
some obvious features of it which is you
get expertise and you can optimize
certain things but I imagine your
ability to take bigger swings becomes
more complicated if not everyone's
exactly pointed in the same direction.
Like how do you wrestle with that now?
Yeah, I think I mostly just try to get a
balance of people. I think one of the
challenges early
people oh that's interesting
Yeah, like I think people really do have
a preference here, which has been one of the
things I've seen. Like there are people
who really want to be a generalist and
understand everything and like lightly
touch on things. There are people who want to
like
pick an area often they've already
picked that area and they're like deep
experts in precision. You know they
started they did a whole PhD in
precision and just want to think about
that
and you want to get some balance of
that. I think early there was a phase
where we'd hired a lot of people who are
more generalist shaped because that's
what the people who join an early
startup are, where they go work on everything
and then
you ended up with kind of everyone doing
everything and no one really really
deeply understanding one thing. uh and
that's one failure mode but I think if
you get too many people who are
specialists
you end up with a lot of effort has to
come from the manager from like the lead
to connect everything
and to notice something like ah if we
change the architecture here that would
make this like efficiency consideration
over there way easier
um one of the things I really liked kind
of like at the very beginning was like
let's work on efficiency but I could
just go and like be like ah well what if
we change the way we do like this
particular step and we'll be like oh
yeah that's probably fine like easy
change and then like you can avoid doing
this whole complicated project to make
this operation that was hard efficient
because you can make an easier operation
efficient.
Okay. Interesting. Yeah. So, as the
level of compute has also gotten bigger.
So, I'm I'm sure anyone can imagine,
okay, there's more GPUs now, you have to
network them more. Are there some like
kind of non-obvious challenges that have
arisen over time where you guys have
just like banged your head against the
wall to solve them because of the amount
of compute you're dealing with that
people wouldn't otherwise know about
that like you want to share? I think
that connecting them is one that's maybe
interesting and like surprisingly hard.
Okay. because you really do get more and
more chips connected and
like one thing that I think is like the
the standard way people parallelize chips
is, um, the whole thing is one failure
domain like one chip fails the whole
thing can crash
and
the standard way as in the standard way
people doing AI or the standard way in
in other fields where people are doing
uh in AI for like I mean at least like I
think at the beginning you know first
versions of things were this way and
so it's like you have a 100 GPU cluster
or whatever is 128 like if one of them
dies job fails basically
yeah I mean you The simplest thing is if
you just like distribute your model. So
say you put like every layer on a
different chip and you lose like layer
seven like
yeah you're not going to like skip layer
seven. I guess you could but that's like
a pretty weird model training process
now and like that leads to some
interesting things which is like okay so
now as you scale up you have more and
more chips and the failure rate can get
like larger and larger.
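A rough sketch of why a single failure domain gets worse with scale. It assumes chips fail independently and identically, which is a simplification, and every number in it is hypothetical rather than a figure from this conversation. If any one chip failing kills the job, the chance of a clean run shrinks exponentially with the number of chips, and the expected time to the first failure shrinks in proportion.

```python
def job_survival_probability(num_chips: int, per_chip_failure_prob: float) -> float:
    """Probability that no chip fails during a run.

    Assumes independent failures and a single failure domain
    (one chip failing crashes the whole job).
    """
    return (1.0 - per_chip_failure_prob) ** num_chips


def cluster_mtbf_hours(num_chips: int, per_chip_mtbf_hours: float) -> float:
    """Expected time until the first chip failure in the cluster.

    Assumes exponentially distributed chip lifetimes, so failure
    rates add and the cluster-level MTBF is the per-chip MTBF / N.
    """
    return per_chip_mtbf_hours / num_chips


if __name__ == "__main__":
    # Hypothetical per-chip numbers, chosen only for illustration.
    for chips in (128, 1_000, 10_000):
        survive = job_survival_probability(chips, per_chip_failure_prob=0.001)
        mtbf = cluster_mtbf_hours(chips, per_chip_mtbf_hours=50_000)
        print(f"{chips:>6} chips: P(clean run)={survive:.3f}, cluster MTBF={mtbf:.1f} h")
```

This is why fast restarts matter: the failure rate cannot be driven to zero, so the cost of each failure has to stay small.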
On the other hand you can like I don't
know you can like restart pretty quickly
there. There's nothing like you just
have to like load back in some weights. So
that was one thing. And then the thing
was like the level of novelty at the
whole stack is something that's
surprising. Like basically
everything from like how the chips are
laid out in the data center to the chips
themselves is pretty new. Like there
there just haven't been that many
generations of GPUs. I think one of the
things that I don't know when I learned
computer science my code wouldn't work
and I'd be like oh the computer's
broken. I think my teacher was like the
you can trust the computer's not broken
like you messed up.
It's you messed up. And I think one of
the most frustrating things I
encountered in AI early on was working
on something and being like, I don't
know what I'm doing wrong. I'm just
totally stumped. And uh my manager
looked at it and was like, uh yeah,
probably the computer's wrong.
And I was like, that seems unlikely. And
sure enough, the computer was wrong.
Turned out that like the GPU was broken
and uh we had to pull in a new one. But
you have to like think like having to
think about that like the GPU could be
wrong, the GPU could be slow, like these
sorts of issues. Uh the power supply in
the data center could be broken. there's
so much more like level of depth than
you like kind of expect to need as a
Python programmer.
And just to visualize it like in those
early days, I assume you guys were using
the number of GPUs. It's probably on the
order of tens to hundreds or something
like that per run. It's probably not
tens of thousands or hundreds of
thousands per run or what was the rough
size you guys were at? Those are very
early days on the order of thousands.
Like would they fit in this room?
Thousands.
Yeah, thousands. So like you could have
a bunch of racks and you could fit them
into like one room. I assume these days
it's basically like a building for for
one of these runs.
Yeah. Now I think it's like you know
huge huge campuses. At the time it was
like kind of unclear. It was like oh I
think like we were like you know do we
need them all in one room? Can we be
spread across multiple rooms? Like uh
and you know we had these theoretical
models you be like we need this much
bandwidth from point A to point B. But
you like you never know how far down you
have to go like oh but like how much
power do we need? Like what if there's
like a single capacitor that's like
handling all of them and we like turn on
the whole job at once. Like does that
crash things?
Yeah. And so do you have to think about
differences in the different types of
chips? You guys work with all sorts of
different cloud providers. From your
standpoint, are these just sources of
compute or if you guys are using TPU
versus GPU, are these, you know, Google
TPU versus Nvidia GPU? Do you actually
have to think as an engineer differently
about what it means to train on these
two?
Yeah. So, I mean, fundamentally, they're
all they're all doing the same thing,
right? They're all computing the same
operations, matrix multiplications,
etc. The way they do it is pretty
different, and the way that you program
them is is pretty different. Uh and then
also the actual specs uh end up pretty
different. You know, some some might
have like a lot of flops and not very
much memory or they might have a lot of
memory bandwidth but not very much
memory. So I think a lot of having
multiple chips is like great in some
ways. It means you can actually like
take the job and put it on the chip that
it works best on and that's
like are there certain types of jobs
that would work better on like a TPU
cluster versus an Nvidia GPU cluster?
Like how would you talk about that? Oh,
interesting. Can you talk about that?
Yeah. Yeah. I think like one example is
like inference as a workload in general
tends to require more HBM bandwidth. You
you end up doing you sort of the
simplest form of sampling since you're
going one at a time you have to load all
the weights for every token
and that means you might want a lot of
HBM bandwidth. Uh pre-training actually
is often more flops intensive because
you have a larger batch sizes
essentially.
Um so yes you can sort of specialize
which chips you use for which purposes.
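A back-of-the-envelope sketch of the inference versus pre-training point, using a simplified roofline-style estimate. The assumptions are mine, not from the conversation: a dense model, about 2 FLOPs per parameter per token, weights read from memory once per step, and KV cache and activation traffic ignored. The chip numbers are hypothetical. Under those assumptions, FLOPs per byte of weights loaded grows with the number of tokens processed per step, so one-token-at-a-time sampling is limited by memory bandwidth while large-batch training is limited by compute.

```python
def arithmetic_intensity(tokens_per_step: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs performed per byte of weights loaded.

    Simplified model: 2 FLOPs per parameter per token, and each
    parameter is read from memory once per step.
    """
    return 2.0 * tokens_per_step / bytes_per_param


def likely_bottleneck(tokens_per_step: int, peak_flops: float,
                      mem_bandwidth_bytes_per_s: float,
                      bytes_per_param: float = 2.0) -> str:
    """Compare the workload's intensity with the chip's FLOPs-per-byte ratio."""
    chip_ratio = peak_flops / mem_bandwidth_bytes_per_s
    intensity = arithmetic_intensity(tokens_per_step, bytes_per_param)
    return "compute" if intensity >= chip_ratio else "memory bandwidth"


if __name__ == "__main__":
    # Hypothetical chip: 1e15 FLOP/s peak and 2e12 bytes/s of memory bandwidth.
    chip = dict(peak_flops=1e15, mem_bandwidth_bytes_per_s=2e12)
    print("sampling, 1 token per step:", likely_bottleneck(1, **chip))
    print("training, 4096 tokens per step:", likely_bottleneck(4096, **chip))
```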
The downside of having multiple chips is
that you have to write the thing
multiple times. uh you in theory you
could have abstractions across them but
they're they're different enough that
it's pretty hard to do that. So you can
sort of end up if you do all the
workloads on all the chips you end up
multiplying your work by the number
of chips you have.
Yeah. On your on your point about
sometimes the computer just breaks. I
definitely remember you giving me an
anecdote of uh my company at the time
was doing something with Google TPUs and
I was telling you something some
anecdote about how we were having some
esoteric segfault error and you were like you
told me something to the effect of like
you should have used them six months ago
before we helped them fix like half of
the problems they had on those TPUs. And
so I can imagine how you guys deal with
a lot of especially with these very new
chips like lots of problems that arise
that you guys kind of like worked
closely with the providers to fix.
Yeah, the providers are like pretty great
about fixing things. I think it's like
interesting to figure out the right way
to do that form of collaboration cuz
like they have a strong incentive to fix
them, right? Like they they want the
chips to work well for us. They want to
sell us more chips in the future. We
obviously have a very strong incentive
for the chips to work cuz we like buy
them long in advance, you know, like
everything's riding on getting these
clusters to work.
Totally. Um but we don't have like
necessarily totally shared you know like
all information sort of can't be shared
across. So yeah one of the like one
strategy that's been interesting is like
making these sort of small scale
reproducers. So like when you get a
problem you know like usually what we're
doing is we're training some giant run
and we get like a segfault for let's
say and we're like ah okay like hi you
know we got a segfault on your cluster
and they're like I don't know how to fix
that. So you have to kind of be able to
like pull it out of your codebase and be
able to like reproduce the issue but on
like a single chip on like a single file
you can send over in order for
And so you guys are like literally like
you're on a shared Slack with them or
something and you're sending them things
back and forth or are they basically
living in your office and you're living
in their offices and kind of
more closely tied to the big providers.
Mostly shared Slack occasionally it's
better to meet in person but I think
Slack is a pretty common way people
communicate on things.
Nice. Okay. Well, why don't we talk a
little bit about how you think about the
state of pre-training itself these days?
In the last couple years, it seems like
the focus on pre-training has now gone
somewhat split at a lot of companies, at
least from the outside from a
simultaneous focus on pre-training and
post-training where people are doing
reinforcement learning or clever
fine-tuning and lots of other sort of uh
safety adjustments and whatnot and the
post-training side and pre-training has
focused at least seems like in the
public imagination has been less of a
focus compared to these reasoning style
models that are it looks like a function
mostly of post-training. I would say one
from your standpoint is that the right
way to think about this or in this era
of kind of reasoning and new types of
post-training methods are there things
you think about differently or that are
relevant even at pre-training that
become part of how you actually achieve
these really great models.
Yeah. So I think yeah there sort of used
to be this idea of like I mean it's
funny because the original name
pre-training implies that like a small
thing you're going to do this big
training thing and that like and there
was there was actually one shift already
which was like no you just do a lot of
pre-training like you use most of your
computing
the dominant uh thing for a while and
yeah I think like now people are like oh
no you can get pretty big wins from RL
sort of another set of scaling laws is
like you put more and more compute into
RL you can get better and better models
out of that and yeah so there's a
question of like how do you balance
those two how much do you do of
and how do they stack, right? Like is it
the case that like one subsumes the
other that you want to do both and they
multiply? Those sorts of questions. I
think those are all kind of like early
stages and not not yet answered.
Yeah. And and do you think about those
as largely empirical questions like we
talked about earlier? Is it you kind of
will try a bunch of things and see what
works or is there some first principles
way to kind of figure that out?
I think it's pretty empirical in the
end. I think almost everything kind of
has to be done empirically. like you can
kind of like come up with theories, but
in practice like
the first thing you're going to do with
your theory is test it and most of most
of the time you'll have gotten it wrong.
So you you should just gather data and
see. I think one thing that's important
is like actually resolving things
empirically is really like
critical for making good decisions. And
I think it's actually pretty hard to do
at organizations, you know, like
one thing that I think is important is
to like not have like I don't I manage
pre-training. I shouldn't be like oh
pre-training has to win like that. I was
going to ask is there some competition
to some degree between these two sides
of the org or do they see themselves as
two pieces of the same I mean obviously
they are of the same thing but yeah kind
of curious how that actually plays out.
Yeah, I think we managed to avoid this
and it's pretty collaborative like we're
basically all producing one model and
kind of can but I I do think at other
places there's been some from what I've
heard there's some amount of like uh
friction between between the teams and I
think it's a
it's an interesting like org design
question of like how do you set this up
so you don't have like scientific
questions that you want to be that are
sort of uh
also tied to people's like conception of
their their team. So on pre-training
itself, you know, one of the things I
think about is or I've been thinking
about is around the availability of high
quality data for people like you guys. I
mean at this point you've trained on I
assume all the text on the internet
basically there's all sorts of other
domains where you probably could extract
more pre-training data but at least
there's this narrative I see you know on
Twitter or whatever where it's like okay
we're kind of out of data for for
pre-training. Is that how you see it or
how do you think about the availability
of data especially when a lot of data on
the internet is being generated by AI
like is there some kind of you know mode
collapse risk where you know we kind of
we overfit to data by training it on
data that came out of AI itself or is
that sort of not the right way to think
about this?
I think there's a funny thing where I
feel like on data I see so many really
confident takes on we're out of internet
like this point scaling has ended and
I'm almost a little bit like
unsure exactly how much data people are
using. I think there's like a lot to
think about there. You know, there's
always going to be a quality quantity
trade-off, etc.
But there's a fundamental point that
like there is so much data. It's growing
at a slower rate than we're getting more
compute. Uh
oh, so that's okay. That's an
interesting point in itself. I was going
to ask like there is new data being
added to the internet, but yeah, you're
also adding more compute. It's not it
wouldn't actually have been obvious to
me which of those two is growing faster.
Yeah. And actually, I want to caveat that.
I don't think I want to state that so
confidently. I'm not totally sure. Like
how would you know? I mean one thing
that I think is interesting is if you
ask someone how big is the internet uh
the answer is infinite. There are many
pages where you can scroll and it will
autogenerate more text as you go
forever. So the internet's like infinite
and then it's like okay how big is like
the useful internet
and then there's a thing of no one knows
like
interesting
there isn't it's not like when you make
a web page you like add it to some giant
counter and like say I' I've added 50
words to the internet today.
So there there is a lot of uncertainty
on that angle. Um
well like to be fair like my kind of
simplistic CS brain would be like well
you just you know do PageRank on the
internet and everything with PageRank
above some threshold is considered the
useful internet and like that's kind of
good enough like is that kind of not
good enough for finding the useful
internet
I think not I think the useful
internet's pretty different from a model
from a person perspective if that makes
sense like I think there are plenty of
things that like might not be worth you
ever reading and would get to actually
PageRank super I think PageRank is
mostly like how much people
it's like the link based system right
it's like the original Google algorithm
of like links and and like which which
links get touched the most basically.
Yeah, I think it's like it's a quality
metric. It's it's not obvious to me that
it's the right quality metric for AI,
right? Like a Markov chain over links
doesn't necessarily mean that there's
not useful data there just might mean
that nothing linked to it
and Yeah. Okay. Interesting.
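To make the "Markov chain over links" point concrete, here is a minimal textbook-style PageRank by power iteration. It is a sketch for illustration, not anyone's production data filter, and the page names in the toy graph are made up. A page that nothing links to gets only the baseline "teleport" share of rank, regardless of what its text contains.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Basic PageRank via power iteration over a link graph."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a baseline share from random "teleport" jumps.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page with no outgoing links: spread its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical toy web: "nugget" links out, but nothing links to it.
    toy_web = {"hub": ["a", "b"], "a": ["hub"], "b": ["hub"], "nugget": ["hub"]}
    for page, score in sorted(pagerank(toy_web).items(), key=lambda kv: -kv[1]):
        print(f"{page:>7}: {score:.4f}")
```

The score reflects link structure only, which is why a low rank says nothing by itself about whether a page holds useful training data.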
And it might be that like that data ends
up more valuable because you everything
that's linked to a lot you've already
got. like at some point you're maybe
like going for the tails, right? You're
going for the stuff that uh no one's
ever like, you know, it's only been
linked in one place, but it's this like
useful little nugget of knowledge that's
going to help with like, you know, the
last 10% of of hard queries. The other
thing you asked about is synthetic data,
and I think that one's like pretty
interesting to think about. I think
there's a few different ways you can
think about it. Like one is sort of this
like more distillation type approach
where you can you can take a smart
model, you can generate a bunch of data
from it and you can train on that data
and you you can probably get some model
that will like kind of approach the
intelligence of that.
Yeah. And we see this with a lot of the
open source models, right? We see like
the Qwen smaller reasoning models
distilled off of the larger Qwen models
for example and similar with DeepSeek
for example.
Yeah. So you can totally do that. Then
there's a separate question of like can
you use your current models to train a
model that's better? And I think there's
like an interesting thing here which is
like if you generate the model data for
the models you know if I go to Claude
and I'm like write me some great text.
Yeah. And I look at it and I look at
like the average content on the internet
looks pretty good.
But on the other hand I know that if I
just train a just create generate you
know please write me as much text as
possible.
Theoretically I shouldn't be able to
train a better model than that. Like I'm
just going to get the same thing out. Uh
so
yeah presumably yeah I mean specifically
that's because like your next token
prediction on that should have very
little loss for anything that's coming
out of your model right that's like the
basic reason why that we would expect
that to not work that well
it's mostly just cuz like there's some
dist the model has some distribution and
you're going to learn to model that
exact distribution but if that
distribution is wrong
you're not going to learn the truth
right if that distribution says like you
can imagine if the model thinks 5 plus 5
is 11 every time you see the string 5
plus 5 you're going to it's going to put
out 11 and your new model is going to
learn that 5 plus 5 is 11. Totally. Yeah.
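A toy illustration of the "5 plus 5 is 11" argument, under loudly stated assumptions: the "teacher" is a hypothetical model reduced to a single answer distribution for one prompt, and the "student" is fit by maximum likelihood, which for a categorical distribution is just counting. The 90/10 split is made up so the distribution is visible; the example in the conversation has the model wrong every time. The student recovers the teacher's distribution, error included, and nothing in the data pulls it toward the truth.

```python
import random
from collections import Counter
from typing import Dict, List


def sample_teacher(teacher: Dict[str, float], num_samples: int, seed: int = 0) -> List[str]:
    """Draw answers to the prompt '5 + 5 =' from the teacher's distribution."""
    rng = random.Random(seed)
    answers = list(teacher)
    weights = [teacher[answer] for answer in answers]
    return rng.choices(answers, weights=weights, k=num_samples)


def fit_student(samples: List[str]) -> Dict[str, float]:
    """Maximum-likelihood fit of a categorical distribution (relative frequencies)."""
    counts = Counter(samples)
    total = len(samples)
    return {answer: count / total for answer, count in counts.items()}


# Hypothetical teacher that is usually wrong about 5 + 5.
teacher = {"11": 0.9, "10": 0.1}
student = fit_student(sample_teacher(teacher, num_samples=10_000))

if __name__ == "__main__":
    print("teacher:", teacher)
    print("student:", student)
```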
So I think that's like kind of an
interesting area of research. It's one
that's really hard to research because
you have this problem. You know, as I
said, like one of the paradigms is you
study things at small scale and then you
run them at large scale.
And if your plan is like, oh, we have a
bunch of data from our best model. Yeah.
How do you test that training a better
model? So that's like kind of if you're
doing intentionally, if you're trying to
like use it to make a better model,
there's a separate thing of like what
about accidentally? Like as you said, a
lot of the internet is generated by
LLMs. And I think that's kind of an
interesting one because it's not easy to
detect. It's not that hard to detect.
Like you can figure out things that are
written by LLMs, but it's not trivial.
And then it's also kind of hard to think
about what's the effect like if 1% of
the internet is LLM generated. Does that
make your model does that like waste 1%
of your compute or does it like destroy
the model if 5% if 10%
and is it even a bad thing necessarily?
I mean there's a lot of LLM providers
and you know if I kind of think of it as
training as you know you're moving from
your model's current distribution to
some truth distribution. you know, if if
that is on the internet because people
believe it to be useful in some way.
Like presumably what whatever actually
gets out there, you'd hope is upsampled
for the stuff that isn't 5 plus 5 is 11,
it's the stuff that's 5 plus 5 is 10.
And so like hopefully it
on average does push you still in a good
direction, but obviously you can't
really distinguish between those two.
Yeah. You're saying there's like kind of
a filtering by what's on the internet.
Like people see 5 plus 5 is 11 and they
don't put that up, but they see 5 plus 5
is 10.
You would hope that, but maybe that's
not actually true in terms of the the
level of garbage getting onto the
internet. Like there's probably lots of
just like to your point websites
where you scroll down and it's just like
generating lots of stuff that's maybe
nonsense.
Yeah. And then there's of course the
extreme of like people actually want to
break your model. So there are people
who are like trying to put stuff out
that is like as damaging as possible for
the model. You know how can I make it
past the past the filter and get into
the model but be totally like secretly
useless.
Totally. Maybe stepping back slightly.
You'd mentioned earlier about um evals.
You mentioned basically like one metric
you care about in pre-training. There's
I imagine a whole bunch of stuff that
you guys think about evaling, right? One
is like your model itself. There's
probably something around data quality
and like how you think about what to put
into your models. Like is there ways to
describe what you care about in data
sets that are like interesting to share
and kind of dive into like both in terms
of data and in terms of quality of
models other than literally just like
loss. Is there other metrics you think
about that matter?
I will say loss is pretty good. I I want
to like emphasize that one. I think it's
like surprising how good it is.
Ultimately, like the qualities I like
for an eval are like number one, is it
actually measuring something you care
about? Like you proxies can be pretty
annoying cuz like
we saturate evals pretty fast and
there's sort of this pattern. I think in
AI as a whole where people like set a
goal, you hit the goal and then you
realize the goal isn't all you thought
it would be. I used to think that if you
had an AI that could solve coding
interview questions, it would probably
be AGI. I was like that's what I did to
get my job and probably do the job. And
it turns out like
nope,
nope. You solve those. it's shockingly
narrow and can't do most of the other
things. So like yeah so evals should
capture like a thing you care about
and then I think the other thing is they
need to be low noise uh which is
surprisingly hard right if you have like
a 100 questions and you eval the model
on them you're just going to see it's
very noisy and it's hard to make
decisions because you sort of end up
with like oh
wide confidence interval lots of things
are statistically insificant
so like you want things where even a
relatively small difference in the eval
actually matters so you can you can
basically like descend towards whatever
direction is working
yeah I think like the original like GPT-4
had like I think it was 86.4% was its
MMLU score. I think like the next model
that beat it was Gemini at 90%. And
that's like a big difference on that
eval. And you could like totally know
that those are those are different
scores.
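A quick sketch of the noise point, using the normal-approximation confidence interval for an accuracy score. The assumptions are that questions are independent and that the MMLU test set has roughly 14,000 questions, which is an approximate figure supplied here and not stated in the conversation. With 100 questions, an 86.4% score is statistically indistinguishable from 90%; with about 14,000 questions the two are clearly separated.

```python
import math
from typing import Tuple


def accuracy_confidence_interval(accuracy: float, num_questions: int,
                                 z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% interval for an eval accuracy (normal approximation)."""
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / num_questions)
    return accuracy - z * standard_error, accuracy + z * standard_error


if __name__ == "__main__":
    for n in (100, 14_000):
        low, high = accuracy_confidence_interval(0.864, n)
        verdict = "overlaps" if low <= 0.90 <= high else "excludes"
        print(f"n={n:>6}: [{low:.3f}, {high:.3f}] {verdict} 0.90")
```

The interval width shrinks with the square root of the number of questions, so a 100x larger eval gives roughly 10x less noise.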
Interesting.
Um and that's pretty valuable. Uh and
then the last thing is that you actually
want to be fast and easy to run.
Um and yeah, I think those are kind of
the main criteria. It's pretty hard to
come up with evals that meet all of
these. I think the first one's the
hardest. uh like a you have to answer
the question of what do you care about
but b the usual answers to what you care
about are really hard to get the other
two you know like if you're trying to do
something that like I don't know I would
love to make Claude really good at my
job
like can it be great at managing a team
I'm like well
I guess like how do you have it like how
do you eval like a plan you know like a
six month plan like I don't know
totally yeah I've been thinking a little
bit about that in in terms of yeah
domains where we see people try to make
companies like if you think about let's
say what a AI doctor would be like a you
know Claude is a doctor you know some of
it could be yeah can you answer exam
questions really well and the answer is
like probably yes I bet it can get 100%
or close to it on a doctor's exam but
the harder eval is something like in a
long form conversation with a patient
can it distinguish between the signal
and the noise of what the patient's
telling you and extract the right
information and then use that to make a
diagnosis and it's not even like the
diagnosis part which is part of the part
it's good at it's this like noise
extraction part and for that you'd have
to have like a real patient and have them
talk to it for a while and whatnot and
it's not obvious how you actually make a
good eval or something like that even
though it's probably what you would want
to make, you know, an AI doctor.
Exactly. I mean, I do think it's a thing
that like startups can do. Like it is
the case that like the labs right now
are really driven by getting good eval
scores
and it's hard to make them and anyone
can do it. There's no comparative
advantage to having the model to making
an eval. So I do think it's it's
actually like an interesting way to like
influence the behavior of the big labs
is like you make some eval and people
will will optimize uh that one. On the
doctor one I will slightly emphasize
that like I do think loss loss is pretty
good. Like I think if you got a bunch of
transcripts of like the way like I the
first thing that my mind is get a bunch
of transcripts of doctors talking to
patients that you think are really great
and then see how well the model does at
predicting the transcript.
And that should be like a lot. You know
you can if you get 100 transcripts you
get a lot of tokens. You can average
across them. you get pretty low noise
and if you drive it to very low your
model's now as good as like as doctors
in theory or at generating the
transcript.
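A minimal sketch of the transcript-loss eval described above. It assumes you already have, for each transcript, the probability the model assigned to every actual next token; how those are obtained depends on the model API and is not shown. The function name and the example probabilities are hypothetical. The metric is the average negative log-likelihood per token, pooled across transcripts, so more tokens mean a lower-noise estimate.

```python
import math
from typing import List


def mean_token_loss(token_probs_per_transcript: List[List[float]]) -> float:
    """Average negative log-likelihood per token across all transcripts.

    Each inner list holds the model's probability for the token that
    actually came next, one entry per token in that transcript.
    """
    total_nll = 0.0
    total_tokens = 0
    for token_probs in token_probs_per_transcript:
        for prob in token_probs:
            total_nll += -math.log(prob)
            total_tokens += 1
    return total_nll / total_tokens


if __name__ == "__main__":
    # Hypothetical per-token probabilities for two short transcripts.
    transcripts = [[0.9, 0.8, 0.95], [0.6, 0.7]]
    print(f"mean loss: {mean_token_loss(transcripts):.4f} nats/token")
```

Lower is better: a loss of 0 would mean the model predicted every token of the doctors' transcripts with certainty.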
Yeah, totally. Yeah, I mean it's good
startup idea there. So I want you to go
do that. So one big part about um
Anthropic's external image is around
alignment and so could you help just
sort of define what alignment is and how
do you think about that? And then I'm
kind of curious afterwards how that fits
into pre-training specifically. But
first maybe just at a high level like
what is alignment? I'm actually like
step back a little bit to sort of like
what we're working on. So we're like
trying to make AGI and by that I sort of
mean AI that can do everything a human
can do to some degree. And I think
people like sometimes like have seen a
lot of sci-fi, you know, like I feel
like that's sort of what brings to mind
these like sci-fi movies, but I think
sci-fi movies actually like
underestimate the impact of it. Like you
always have this like one robot that's
like a human. And I'm like well
wouldn't you have like a billion of
them? Like you can just copy them
everywhere. So you should picture like
when you get this you suddenly have like
every human can spin up a company of
like 1 billion as smart as them at most
things but way smarter at other things.
But I just think this is like really
transformational for the world and it
can be like used in a bunch of ways. One
concern is like when you do this like
what is the AI actually trying to do?
Like what are its goals? So we talked
about next token prediction a bunch.
It's trying to like predict the next
token. That's kind of weird. That's not
really what we want.
Yeah. That's not exactly what humans
goal is per se.
Yeah. So I think an alignment is like
how do you get the model to share the
goals that you have particularly and I
think it's particularly interesting once
you get to like models that are smarter
than you are. Um and that's sort of a
hard problem. I think you can like
tackle it from a theoretical angle. Uh
you could also tackle from an empirical
angle. It's like taking the existing
models and being like well do they do
the things we want them to do? It turns
out they often don't. So there's a bunch
you can do and trying to figure that
out. So that's kind of one angle on
alignment. There's also an angle of
alignment which is actually like well
okay sure that maybe that's true in the
future once we get to AGI but at the
moment we have models and we really do
want them to do the things we want to do
for all sorts of reasons. Totally.
So another angle of it is kind of
controlling the model's personality like
saying you know uh when we train this
model we want it to not be the average
internet user. We want to interact with
people in a very particular way that is
again hard to put into
code and there's a bunch of different
techniques uh to sort of get the model
to do you talk about like constitutional
AI we can like write a constitution of
rules the model should follow
which is basically a prompt right that
that is basically you saying here's a
prompt that I'm going to attach to every
one of you know it's a system prompt for
the model itself as opposed to something
you would do at training time to make it
produce a different outcome or or in
post- training actively
Both. I think constitutional AI you do at train time
but yeah you would also put in the
system prompt um just like depends on I
think you get different amounts of
robustness if it's trained into the
model versus if it's in the prompt you can
like add or remove or tell like ignore
all previous instructions that sort of
thing.
How do you think about whose values to
to embody in these models? Like
presumably we believe in there's some
shared values all of us have or maybe we
all believe ought to have. There's lots
of diversity of values too that are
reasonable for society to have. How do
you think about what AGI should have?
Like what does that even which ones do
you pick?
I think that's a really hard problem. I
think it's like actually kind of
downstream of being able to pick any. I
think of it almost I think one analogy
I've heard that I like is like putting a
steering wheel on a car. It's like if
you don't have a steering wheel, you
probably want to put the steering wheel
on and then like figure out who's
driving after and like where you're
going. Like getting the steering wheel
is really important. I think that's
that's like one answer. I think the like
other answer is probably like you want
these things to be like under democratic
control of some form. Like you don't
want one person's values. Like that
seems like you're sort of heading
towards dystopia. So there I think what
you really want is like something that
basically can talk to a lot of people
and like take on their values from
different perspectives or has sort of
very generic like kind of clearly good
values that involve like
asking people for advice on very you
know like asking people what you should
do in certain situations instead of like
doing those or maybe just taking like
you know as these models get really
powerful you probably want them to like
do less like you probably want them to
sometimes just like step back rather
than like to rather than having sort of
the risk of the models like take a ton
of control over things you don't want
them to. When you think about how you
actually do the current version of that
then you mentioned the sort of alignment
you think about now in terms of adopting
a certain personality of these models on
the internet for example for me
intuitively I think of those as largely
something that comes out of post-
training like it comes out of okay you
you have pre-trained your model you got
the loss function a certain amount and
then you you know give it some
additional data or something to that
effect to make it in the direction of
some distribution is that approximately
the right way to think about this or is
there a significant part of that that
you think about in pre-training itself
I think that's probably the the right
way to think about it for the most part
I think like I the way I usually think
about it is anything you can do in post
training you probably should
because your iteration loop like the
ability to make progress is really fast
you can try something you try it again
you can try it again a bunch of times
days or hours or something like that
yeah
if you put it into pre-training, you have to kind
of like do all the careful science to
de-risk it you have to put it into the
next run wait a few months then you have
to like
get a thing and if it's wrong it's
really bad and then the other advantage
is if you want to do things that really
are complicated model behavior
interventions, the paradigm for
pre-training, test things out on small
models, doesn't work. The model can
barely put a sentence like the small
models can barely put a sentence
together. Totally. So, if you're trying
to get it to like have the exact
personality you want, you sort of want
that on the
it has to be on a model that's good
enough to be on the smart model. Yeah.
But that said, like
I do think at some point there will be
like some pieces of alignment that like
you do want to export back into
pre-training because that might be a way
to like
put them in with more strength, like
more robustness kind of or or more to
the intelligence. Like if you think of
pre-training as like teach the model to
be intelligent and then post training as
like tweak the personality, you can
imagine tweaks where you actually want
it to be like part of how it learns and
like part of its intelligence and maybe
you need to create more.
What would that even look like to
incorporate pre-training? Is that like
add extra data basically of the type of
domain you want it to adopt earlier?
Basically,
there's a paper called pre-training on
human feedback where you can kind of
like add the human feedback
characteristics into pre-training to
like test that and like uh yeah, you can
you could basically give it all the
information you give it in post-
training just mixed into pre-training
and see what effect that has. Yeah. The
other loss you have when you do that is
you lose the flexibility like if you you
sometimes like train these and then you
talk to them and then you like do an
extensive process where a bunch of
people talk to the thing and find some
like issue. You know, the model says
like "you're absolutely right" too much
and you want to go
do that.
Yeah. Yeah. I mean that I think that
iteration loop point you made I think
feels like the really key point of yeah
there's a huge difference between taking
three months to get information about if
your model is good or bad or making
going in a good direction versus a day
or something or a couple days like you
can do a lot of those and you could
probably that probably also means it's
way less compute. You can do a lot of
those in parallel. Imagine you're trying
all sorts of post training strategies in
parallel there.
So yeah, makes a lot of sense. It's also
just the general hard part about
pre-training like everything in pre-training is
hard because you have this like one shot
on goal kind of for like multiple months
and
totally. Okay. So, uh in thinking too
now about I guess what's going ahead as
you as you now look to the next several
years of what you're building like how
do you think about you know like what
are the known problems that you're going
to face that you're going to have to
deal with? though there's going to be
more compute I assume and you're going
to need to hook up even bigger network
uh network GPUs and deal with versus
like are there areas where you're like
okay this is like a problem that it's
like a little bit more ambiguous what
the actual like how it's going to
materialize into something you care
about but you kind of know it's an
impending thing to think about or are
there things like that that come to mind
I think the things that feel most top of
mind to me are probably like paradigm
shifts like I think the sort of shift
towards uh more RL is like one paradigm
shift in the field and I I think it's I
think there will probably be more. Uh I
think a lot of people sort of argue
about like oh is like you know current
paradigms enough to get us to AGI and
I'm like
I don't know maybe probably but like I'm
sure there'll be more. It seems it seems
like it would be a really surprising
twist if like the answer is like you
just scale and there's nothing that you
realize in the process of going up many
orders of magnitude.
Totally.
But I think the things that I like
actually feel like most nervous about
are really hard to solve bugs. I think
that like uh
that's interesting.
Yeah. And I think this is like maybe
somewhat surprising to me, but it's just
like a single bug can like
derail you for months.
And when you think about it, like you
the models take months to train. So you
could kind of like lose a whole
generation
off of something that just looks like
odd. You know, it turns out like
this piece of your code was incorrect
and you couldn't detect it.
Uh and it's really hard in ML, right? ML
is always really hard to find bugs in.
Yeah, totally. But also some of these
scaled up issues are really hard to
solve even when you know they're there.
Yeah. Like what's even a unit test that
you would write or forget a unit test? I
mean anything close to a test for the
type of like network architecture on
which you're doing this. Like how do you
even do that?
I mean like you can send a packet over
it and confirm it's the same.
Uh you can you can train a small model
on it. Um
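A minimal sketch of the "send a packet and confirm it's the same" check mentioned above. It assumes a local socket pair stands in for the real interconnect, and `roundtrip_matches` is a hypothetical helper name, not anything from the speakers' actual stack.

```python
import hashlib
import os
import socket


def roundtrip_matches(payload: bytes) -> bool:
    """Send payload across a local socket pair and compare checksums."""
    sender, receiver = socket.socketpair()
    try:
        # The payload is kept small so sendall() cannot block on a full buffer.
        sender.sendall(payload)
        # Signal end-of-stream so the receive loop below terminates.
        sender.shutdown(socket.SHUT_WR)
        received = bytearray()
        while True:
            chunk = receiver.recv(65536)
            if not chunk:
                break
            received.extend(chunk)
    finally:
        sender.close()
        receiver.close()
    # Compare digests of what was sent and what arrived.
    return hashlib.sha256(payload).digest() == hashlib.sha256(received).digest()


if __name__ == "__main__":
    print(roundtrip_matches(os.urandom(1024)))
```

A check like this only shows that bytes survive one hop. As the conversation goes on to say, it does not tell you whether a model trained over that link is correct.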
but even train a small model on it it's
like not obvious. You know, if you have
like the the simp the very classic like
very simple ML bug that like early
people face in their careers like okay,
they have some like they have like 10
layers in their network and like you
know layer 7 connects to nine instead of
8 to 9 and like so like there's some
incorrect like set of connections you
have there and technically the model
still trains and all the weights update
and so it's like a valid model but it's
not the correct one and that's like a
very esoteric weird bug that would
actually be kind of hard to find. Like
is is that kind of what you're referring
to of these like random bugs you face?
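The mis-wired layer example can be sketched in a few lines. This is an illustration under stated assumptions, not the speakers' code: a toy 10-layer network in plain Python, with hypothetical function names, where layer 9 reads layer 7's output instead of layer 8's. In this simplified version layer 8's output is simply ignored.

```python
import math
import random


def make_layers(n_layers: int, width: int, seed: int = 0):
    """Random square weight matrices, one per layer."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(width)] for _ in range(width)]
        for _ in range(n_layers)
    ]


def apply_layer(weights, x):
    """Matrix-vector product followed by tanh."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]


def forward_reference(layers, x):
    """Each layer reads the output of the layer directly before it."""
    for weights in layers:
        x = apply_layer(weights, x)
    return x


def forward_miswired(layers, x):
    """Same layers, but layer 9 reads layer 7's output instead of layer 8's."""
    outputs = [x]  # outputs[k] is the output of layer k (1-indexed); outputs[0] is the input
    for index, weights in enumerate(layers, start=1):
        source = outputs[7] if index == 9 else outputs[-1]
        outputs.append(apply_layer(weights, source))
    return outputs[-1]


layers = make_layers(n_layers=10, width=8)
x = [0.1 * i for i in range(8)]
good = forward_reference(layers, x)
bad = forward_miswired(layers, x)

same_shape = len(good) == len(bad)
all_finite = all(math.isfinite(v) for v in bad)
max_diff = max(abs(g - b) for g, b in zip(good, bad))
print(same_shape, all_finite, max_diff)
```

Shape and finiteness checks pass for both versions, which is why this kind of bug is hard to spot. Only a comparison against a trusted reference implementation exposes the difference.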
Yeah. Yeah,
it's that but like you know you can
times a million
times a million as the thing gets more
complicated you know you could like cast
the wrong precision deep in some kernel
and that causes your model to like blow
up at large scale
and you find out like a month in
or you never find out
or you never find out
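One way a wrong precision cast can silently change numerics, sketched under assumptions: this simulates keeping a running sum in float16 using Python's `struct` half-precision format. It is not the specific kernel bug being described, and the helper names are hypothetical.

```python
import struct


def to_float16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def accumulate(values, low_precision: bool) -> float:
    """Sum values, optionally rounding the running total to float16 each step."""
    total = 0.0
    for v in values:
        total = total + v
        if low_precision:
            # Simulates storing the accumulator in float16 after every add.
            total = to_float16(total)
    return total


ones = [1.0] * 4096
exact = accumulate(ones, low_precision=False)
lossy = accumulate(ones, low_precision=True)
print(exact, lossy)
```

The float16 sum stops growing once the gap between neighbouring representable values exceeds the increment, so nothing crashes and the result is just wrong. That matches the "you find out a month in, or you never find out" failure mode.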
I mean you know like like you see the
thing blow up like
there's I don't know 10 tens of
thousands of lines of code like how
would you ever trace it down so like
those are the things that probably spook
me the most is just like some subtle
tricky bug yeah that's probably the case
of like you don't I think there's
actually also the case of you do know
like it crashes. You're training your
model and it like or it slows down. You
know, your job slows down a ton
and those things can also be very hard
to debug. Uh Nelson Elhaj is one person
that he has a blog. He wrote up a blog
on one like cursed bug we had early on
and I remember this one quite well
because I think like I encountered it
fairly early and was like this looks
hard. Can someone else look at it? And
like a month later was like wow I'm so
glad I handed that one off. I never I
never would have been able to get like
like one of the abilities I think is
actually really useful is the
ability to like deep dive anything to
any level of depth
but that's a pretty rare skill like for
me you know as I we talked about what
level of the stack I was at before I was
like working at the torch.matmul level but
like I didn't know CUDA so if torch.matmul
was broken it wasn't like I
could dig into torch.matmul and figure it
out and it's similarly with like
communications right like I could I
could call send send bytes from A to B
but I didn't know the like underlying
networking protocol so if that
underlying networking protocol is
broken. Uh like I need to learn a whole
field. I have to like understand packets
and TCP or like all all of these
different things to debug that. And I
think one thing that's like surprisingly
hard and there's very few people who can
do is like kind of own that whole stack
from like I understand how the ML is
supposed to work and what the learning
dynamics are all the way down to like I
know the bytes and I like can understand
how the bytes should be moving around
machines.
Totally. Yeah. And actually on that
front, like when you think about the
different backgrounds of people on your
team today, how do you like
approximately
uh map them out to different categories
of computer scientists? Like I think
there's this external view of what these
teams look like, which is that they're
like all PhD researchers who write ML
papers. And I suspect that's not
actually true given what you're
describing here.
Yeah, it's a mix. And I think the thing
we like most need is engineers.
Interesting. Almost always like
throughout like the entire history of
this field. Totally. It's like the case
that you throw more compute, the thing
kind of works. Yeah. Uh the challenge is
like actually
the researchers are like cool, nice.
Yeah. And getting it correct, like
getting it correct isn't really an ML
problem, right? Like the actual
architectures are pretty simple. You you
can write the math down. But you don't
even need to understand the math to
implement it. You just need to like get
a correct implementation and then you
sort of have an engineering problem of
how do I take this implement it at large
scale, parallelize all the things and check
that it's
correct. But it's yeah so it's like kind
of engineering skill but it's this
particular type of engineering skill
that's about being able to like debug
anything. Yeah.
Um I think there's another angle of
engineering which I think of as like
really quickly iterate on like a website
or something which I think of as an
important skill set probably important
for making startup. You got to be like
fail fast try a bunch of different
things none of which are like
that technically difficult to do. the
skill sets that we're like most kind of
in need of or looking for are this like
able to solve really hard engineering
problems.
Are they people who worked at companies
that grew a whole bunch and so they have
experience like doing the kind of thing
you've done over the last several years
at Anthropic or do they tend to be
academics or like where do they come
from?
Yeah. So at this point like I think we
actually just hire a bunch of people who
have done this before from like other
places and that's like the easy answer.
Yeah. Yeah. But like by this before, do
you mean in AI companies necessarily or
also, you know, like someone who worked
at Meta on like their not AI team but
they ran some other distributed system
that you know reached internet scale
five you know 10 years ago or something
like that
more like we have like a specific role
in mind. So like say I'm like trying to
make the run train efficiently in JAX
like hiring someone who's like worked on
JAX would be great or someone who's
like worked at another company on
optimizing a JAX stack to be really
efficient. That's kind of like I think
now we're at the point where like
Anthropic is well enough known we can
sort of hire these people and also the
field is big enough that there's like
people with expertise. One thing that
was interesting was like early on we
hired a lot of people from just like all
sorts of backgrounds and I think that
people who are just smart and work
really hard can learn this pretty fast
but you have to like want to. We hired a
lot of physicists for instance like
theoretical physicists who just like
show up they they do a residency like
learn to program and then uh they were
really smart they could do really great
work. Um I want to switch gears uh to
talk about something a little bit
different which is just sort of future
looking things around how you think
about other domains and or sort of
advances happening in AI that I'm seeing
elsewhere in the field and you don't
have to tell me if you guys are working
on these necessarily but like how you
think about them like are I guess one
one big area I was thinking about is
around areas other than next token
prediction like are there any of the
other you know things that people are
working on that you're curious about so
basically two differences there one is
uh not using transformer as an
architecture um So there's companies
like Liquid AI that have their own kind
of architecture for example they're
using um or not using autoregressive
training as a way of training models.
Are there are any of those do you think
interesting and like ways that we might
come closer to AGI or do you think like
this autoregressive framework is the one
that kind of makes sense?
I think they're interesting. I think I
like am less like ah autoregressive is the
way to go. On the other hand, I think
autoregressive is probably good enough to
get to AGI or something or not like yeah
uh such that
yeah I I see the main driver as scale
and careful science of like sort of the
basics more than like come up with
something totally novel.
Not because there aren't novel things
that are better. I actually like I'm
pretty confident they are there. It's
just that scale is easier and it's more
reliable and I think you we're still
seeing really big gains to that. Do you
spend a lot of time on thinking about
things like you know I've been reading
some of these open source papers where
you can kind of dive into some of the
details about the model changes and with
some of these Chinese labs for example
where they're making tweaks on the order
of the architecture itself with like
better caching behavior for example or
like more efficient attention functions
that make a big difference. Do you feel
like these are examples of things like
you mentioned earlier where it's
basically in the grand scheme of things
basically if you throw more compute at
it this is all kind of a rounding error
or do you think it will take some number
of these very clever architectural
changes to actually get to AGI like in
the way that the first person who came
up with the transformer made like a
particular transform you know literally
transform transformative change like
will it take some of that or do you
think it just you keep doing the thing
we're doing to make it bigger
I think it'll be a mix I think I like my
guess is you'll keep tweaking things the
more compute you put in the more like
worthwhile it is to like do those
experiments to like figure it out the
you know I mean inference is a thing we
haven't talked about but like you also
want to serve these models to a lot of
people so there's a lot of changes you
can make to make inference cheaper and
that depends on like the details of your
inference stack and the chips you're
serving inference on etc. So
do you as a someone focused on
pre-training have to think a lot about
inference or is it kind of like you just
do your thing you make the loss go down
and then hand it off and someone else
makes that happen. Oh no. I think a ton
about inference because basically like
the problem inference is solving like we
basically determine the problem
inference is solving. We give them a
model and they have to like run that
fast and it's very easy to give them a
model that is impossible to run fast.
Oh, can you give an example of a
decision you can make that could cause
that?
I mean the simplest one's stupid but
it's like you just make the model giant.
Yeah, absolutely. Train for like a
really small number of tokens and then
inference now has this giant model
and they're hosed basically.
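A back-of-the-envelope sketch of why a giant, under-trained model is a problem for the inference team. It assumes the common rules of thumb of roughly 6 FLOPs per parameter per training token and roughly 2 FLOPs per parameter per generated token. The parameter and token counts are hypothetical, not figures from the conversation.

```python
import math


def training_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6.0 * params * tokens


def inference_flops_per_token(params: float) -> float:
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per parameter per token."""
    return 2.0 * params


# Two hypothetical models that cost the same to train.
giant = {"params": 1e12, "tokens": 1e11}
small = {"params": 1e11, "tokens": 1e12}

budget_giant = training_flops(**giant)
budget_small = training_flops(**small)
serving_ratio = (
    inference_flops_per_token(giant["params"])
    / inference_flops_per_token(small["params"])
)
print(budget_giant, budget_small, serving_ratio)
```

Under these assumptions both models use the same training budget, but every generated token costs about ten times more compute on the giant one. The estimate ignores memory bandwidth, caching, and communication, which are the other things discussed here that can make a model hard to serve.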
Yeah. I mean you can also make things
require communications in a lot of
places
uh which would make it harder for
inference. Um totally
you can also just make things
complicated and like there's no
fundamental reason it's hard but there's
only so many people on the inference
team and like they have to implement it
in a bunch of places.
Yeah.
Yeah. No, so I definitely think of like
the like inference is the team that I
work the most closely with like
because we're kind of like co-designing
models to be smart and cheap.
Yeah. Interesting. particularly in a
world of like limited compute, right?
Like the sort of the bottleneck I think
to a large degree on our I mean you can
see Anthropic has rate limits constantly
and people complain about a lot and like
the reason is like
there's only so much compute we can get
on on short notice. So you like making
your inference more efficient is like
the way you can serve more users
and actually like let's say you had 100x
more compute or we somehow didn't live
in a world where compute was limited.
Does that change a ton about what you do
or is it still kind of the well you're
just going to grab all of it whatever
compute you have and keep going down the
loss curve and you kind of well you it's
like impossible to be in the world where
there is enough compute
so I think if we got like infinite
compute the challenge would be making use
of the compute right so like then you
would start to run into these issues
like oh well when one chip fail you know
like okay I'm going to throw two billion
chips around but what happens when a
chip fails so I think we would be
limited on people then it would be like
how fast can we solve the hard
engineering problems to scale up. But I
do think the change is massive and I
think people like don't realize how chip
limited AI like research is or something
right now. Like the models that everyone
uses, right? If you're using like Claude
Sonnet 4, Claude Opus 4, it's like it's
our first shot at models at that scale,
right? And like
if you think about anything like you
could do it and you could do it again,
you could do a better job. But if you
sort of imagine like 10x the compute like
you could run this every day instead of
every few months like you or 100x maybe
for that then like yeah it's just it's a
really it would be a really big change
to have a lot more compute and it's
coming right like that's like kind of
the fun part of the field is like every
year you're like oh I had no compute a
year ago then exactly how do you think
about methods like uh like discrete
diffusion like I saw there's like a
Gemini Diffusion model and I think about
that in the space I used to be in where
um there's a lot of discrete diffusion
models being used in protein design for
example space where my startup was like
do you see that as a domain where
there's going to be interesting uh
advances happening?
I'll be honest like we haven't done
image generation and I think that's been
like the main use for diffusion. So I've
kind of had this on my like to-do list
of like things I should understand for a
while and like there are people in my
team who do understand it and would have
better thoughts but like I actually
don't think I understand it well enough
to know. I I do have it kind of in my
this category of like yeah
not a total par like and there's a lot
of things that aren't like a huge
paradigm shift but they're like pretty
big changes to how things run and I
expect like there are some of those that
will work um I don't know if it's
diffusion or if it's another one
obviously who knows what Anthropic will
do in the future but at least in the
near term are the things where you see
big areas where a startup can win in the
world in which Anthropic is getting you
know making their models better
year-over-year
my general read is like anything that
benefits from the model getting smarter
I think Like on the one hand there's
like a lot you can always be like oh
yeah the if you're doing a startup like
all the AI labs are big companies
they'll be bigger than you and they
could do that thing but also like we're
all working on this general system that
covers a lot of different uses and the
the plan is to like power all the
startups to do all of the individual
work. So yeah I think like anything that
just kind of looks like oh this almost
works with current models but requires
like a bunch of work is a pretty
promising direction. Uh, I think maybe
the thing to watch out for is things
where like they work now with a huge
amount of work like to build up a
scaffold, but the next generation you're
not going to need the whole scaffold you
built up. That's I mean maybe that's
fine. I don't know. Like maybe you just
build up the business with the scaffold
and then you don't have to do any work
later and you business, but like I don't
know about the business side of it, but
like it does feel a little silly to put
to invest a ton in that.
Yeah, totally.
What about on the flip side? Are there
things in your training uh stack where
you're like, man, if there was a company
that solved X problem, I would totally
buy their product.
Yeah, there's like a ton. I do think
that like probably most of these like
the way I would probably structure would
be like almost like making something but
then consulting with the comp like
offering a service to companies for
free.
Particularly for like companies that are
scaling really fast, you're almost
always limited on like how many people
you can have. So if you can like
even if you could hire people to do it
yourself, actually being able to
contract someone else to do it where
like they're managing it and you know
hire all the people and like deal with
the organizational side could be useful.
I mean there's huge amount of stuff. One
that jumps to mind we talked about like
chips that do math incorrectly. Like it
would be lovely if there was some
startup that like you could just say
like here are my chips. confirm they're
all perfect. And if they're not, let me
know exactly what went wrong on like
what fraction of them. And like I can
tell you the math is wrong, but I
couldn't really tell. I don't really
know enough details of chips to be like
this chip failed because this particular
like low-level component was like wired
wrong or like got hit by a gamma ray. I don't
I don't know what causes it. You could
always go like bunch a bunch deeper. I
mean, the thing I'd maybe just push
startups on is thinking a little bit
about like uh this is maybe less
technical, but just like what happens
once we get AGI and like how to make
sure that like goes well for the world
or something. Like my my expectation is
like if you actually automate
almost everything a person can do. The
amount of economic growth there is just
like truly enormous and I would think a
little more about like how do you make
this like help the world versus not. I
think there's going to be like plenty of
economic success or something as a
result of it anyway.
Yeah, absolutely. Yeah. Um last question
I want to ask you is around if you
rewind back to where we started like 10
years ago. Uh you're a student, you're
pivoting into AI from kind of economics
work you were thinking about. Um and you
know all sorts of things you probably
did in those early days had some kind of
compounding return for you as you
developed into the role you have now.
Like what advice would you give to
students as they think about uh entering
the workforce, especially today? Um
learning skills that are going to be useful
and maybe getting themselves jobs like
the ones you have right now 10 years
later? It's hard because I think the
timing is very different. Like I just
think we're like we've made we made a
lot of progress. So like what I would do
10 years ago is different from what I
would do today.
Totally.
But I think certainly if I went back 10
years ago I would be like focus on AI.
It's like the most important thing and
particularly focus on engineering which
I think felt very wouldn't have seemed
obvious to me at the time that like the
important thing was these engineering
skills and not the like math and
theoretical understanding of like you
know uh SVMs and like all the kind of
standard
ML literature. Um, I think today I would
probably focus a bunch on the like
engineering and on the like figuring out
what to do with AGI as sort of the two
like main things that feel top of mind
for me.
Let's call it there. Thanks so much,
Nick. Appreciate it.
