Hi everyone. Um, I'm really excited to

talk about developing general purpose

robots and how we might uh actually like

truly develop and bring intelligence

into the physical world. So, um, to

start off, I'd like to talk about this

problem, which is that if you want to

truly solve a robotics application, you

essentially need to build an entire

company around that application. uh you

need to build a different company for

logistics, for wet lab automation, for

robots and kitchens, for surgical robots

and so on. And this is really really

hard to do because that company needs to

make new hardware, develop custom

software, design unique movement

primitives for that application, handle

edge cases and so on. And you have to do

all of that from scratch uh if you want

to solve a robotics problem. And as a

result, uh, a lot of robotics companies

haven't been very successful in actually

bringing robots into the physical world

successfully, uh, in our daily lives. I

co-founded a company called Physical

Intelligence, uh, that's trying to solve

this problem. And in particular, we're

trying to develop a general purpose

model that can enable any robot to do

any task in any environment. uh and we

think that this sort of generalist model

may work better and be easier to use

than purpose-built models just like

we've seen in the development of

foundation models for language and other

applications. Uh for example, if you

want to build uh a coding assistant, you

don't nowadays develop something

specifically for coding, but you develop

and you build on models that were

trained on large amounts of data, not

just on code. And essentially this is

the problem of trying to develop these

sorts of foundation models and bring

this sort of intelligence into the

physical world rather than the digital

world where they largely are today. So

how do we do this? Uh in this talk I'd

like to talk about how we go about doing

this. And if we were to take a lesson

from language models we know that

language models have taught us the

importance of scale. And so one possible

conclusion would be that perhaps scale

is the most important ingredient for

developing these models. And if you were

to say this conclusion is true, then

you might look to certain data sources

for largecale uh data. So for example,

we might look at data from industrial

automation and you get tons and tons of

data of robots uh doing tasks over and

over again like this. But this sort of

data isn't going to allow robots to go

into disaster zones or to make a

sandwich uh or to bag groceries. And so

this massive scale doesn't have the

diversity of behaviors that we need in

order to solve this general problem.

Alternatively, maybe we look at data

from YouTube, which is also a massive

data source and many videos of humans

doing tasks uh that could be useful for

training robots. uh but at the same time

we don't learn how to write by watching

other people write and we don't become

expert tennis players by watching

Wimbledon. Uh and so even though there's

a massive scale of data here, it's very

challenging to use and there's also a

gap between the embodiment of robots and

humans. Um and lastly, we might look at

data from simulation and you can also

get a massive scale of data here, but uh

this data lacks realism and also has a

gap from reality. And so I think the

lesson here is that scale is necessary

for developing these models that can

generalize in open world conditions, but

it's subordinate to actually solving

the problem. So you need scale, but it's

not sufficient uh for the entire

problem. And so at Physical

Intelligence, we've been, um, this is an

example of a data episode uh that we've

collected. Uh this is uh in honor of our

first anniversary, which was a few

months ago. uh where we um here you can

see a teleoperator, uh, in person who's

operating um some leader arms to control

the robot uh to light a match and light

a candle with the match and with this

sort of data we can train robots to do a

variety of different tasks and so um

what I'd like to talk about is some of

our recent results at trying to develop

sort of physical intelligence with

large-scale real robot data. I should

mention this is large scale by today's

robot standards and arguably a minuscule

amount of data compared to the sorts of

robot data that we should have in the

years to come. And so in particular,

we'll be looking at whether robots can

do a variety of dextrous long horizon

tasks, whether robots can succeed in

places they've never been, whether

robots can respond to open-ended prompts

and interjections. Uh, and even if

you're not excited about robotics, I

think that the lessons uh that we've

learned from trying to address these

problems are applicable outside of the

physical world. So um can we develop

robots that can, uh, complete

dextrous long horizon tasks? And in

particular uh in this first part I'd

like to talk about how we trained uh a

pi zero foundation model to do this task

which is to unload a dryer and fold

laundry. Uh and to date I think this is

the most impressive thing that I've seen

uh a robot do in the physical world.

It's really hard.

[Applause]

This is an incredibly difficult problem.

You can see that it's not perfect. Uh

here it's making some misgrasps, making

some mistakes, but it's really really

hard because you have to deal with the

variability in the clothes and the way

in which they might be positioned and

crumpled uh and be able to handle all

those sorts of things. And as you're

doing this task, which takes about 10

minutes for the robot, there's many

opportunities to fail uh to fail

catastrophically. For example, dropping

um things on the ground, which is hard

to recover from. uh and you have to be

able to recover from even small

mistakes. I was personally actually

working quite a bit um on this laundry

folding robot along with Michael and

Siraj uh and of course supported uh and

with contributions from the whole

Physical Intelligence team. Uh so how do

you even approach this sort of problem?

This is a really, really hard thing

for a robot to do and what we did is we

started simple. Uh we started with can a

robot fold a single size single brand

shirt uh and can a robot dynamically

flatten one shirt again single brand

single sized and if you start simple

this makes the problem quite a bit

easier uh we collected some data with

teleoperation and trained a policy

with imitation learning and our model

had around 100 million parameters

mapping from images from the robot's

cameras to target joint positions

on the robot arms and we do this sort

of control at 50 hertz on the robot
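
A minimal sketch of the imitation-learning setup described here, assuming a toy linear policy in place of the roughly 100-million-parameter network. The observation size, the "expert" weights, and the synthetic demonstrations are all made up for illustration; only the 50 Hz control rate comes from the talk.

```python
import random

CONTROL_HZ = 50                   # control rate mentioned in the talk
STEP_SECONDS = 1.0 / CONTROL_HZ   # time between consecutive commands

def predict(weights, obs):
    """Linear policy: maps an observation vector to one target joint position."""
    return sum(w * x for w, x in zip(weights, obs))

def behavior_cloning(demos, obs_dim, epochs=200, lr=0.5):
    """Fit the policy to (observation, action) pairs by gradient descent on MSE."""
    weights = [0.0] * obs_dim
    for _ in range(epochs):
        grad = [0.0] * obs_dim
        for obs, action in demos:
            err = predict(weights, obs) - action
            for i, x in enumerate(obs):
                grad[i] += 2.0 * err * x / len(demos)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights

rng = random.Random(0)
expert_weights = [0.5, -1.0, 2.0]  # hypothetical stand-in for the teleoperator
demos = []
for _ in range(100):
    obs = [rng.uniform(-1, 1) for _ in range(3)]  # stand-in for image features
    demos.append((obs, predict(expert_weights, obs)))

learned = behavior_cloning(demos, obs_dim=3)
```

The real policy consumes camera images and outputs positions for every joint; the sketch only shows the supervised "copy the demonstrator" objective.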

Uh, and uh, we founded the company in

kind of mid-March of 2024. Uh, and a

couple months later after we had set

everything up, we were able to get a

policy that could fairly reliably fold a

single size single brand shirt. Uh, you

can see that I'm testing the policy

right here. Uh, and we also wanted to

test some dynamic motions because you

need to be able to match the control

frequency accurately in order to do

these sorts of dynamic motions. Um and

so these were some of our very initial

tests at uh addressing this sort of

laundry folding problem. Then from there

we wanted to make the problem

incrementally harder. Uh and so we

instead of starting from the shirt flat

on the table, we started in a crumpled

position like these. And it turns out

that this actually makes it a lot

harder. Uh and so here are some videos

of some of our initial attempts at

trying to train the robot to fold these

shirts. And the robot struggles. uh the

the robot does some things that kind of

look somewhat sensible but generally

isn't able to make progress on the task.

Uh with many tests we frequently were

getting 0% success rate in our tests of

this uh system and really struggling to

make progress. So really, this

introduces this challenge of handling

the sorts of variability in the ways in

which shirts might be crumpled on the

table. We had some initial signs of life

in late June, uh, of last year. Uh and

so in this case, the robot was able to

kind of make progress on flattening the

shirt. Uh it's also then able to fold

the shirt uh decently well uh from that

initial state. Still not perfect. Uh and

as you can see, it takes quite a while

to do this. So this is a video that was

sped up 8x. Uh so not something that

you might have the patience uh for a

robot to do.

Um, so with some initial signs of life,

though also a very low success rate, we

started to transition to a slightly

harder version of the task where the

laundry starts in a laundry basket. We

also introduced variable size shirts and

shorts into the mix. Uh, and again, the

robot really struggled. So in many of

our tests, we were getting 0% success

rate across the board, and we were really

struggling to actually get the robots to

learn how to do these tasks. At this

point, we were trying to consider a lot

of different things. uh we thought that

maybe the robot needs memory, needs

history in some way. Uh maybe we need to

just train our models for longer. Maybe

we should be doing control in end-effector

space rather than in joint space of the

robot. Uh maybe our encoders, we knew

that there were calibration issues and

maybe we need that calibration to be

more consistent. Uh maybe we need to

condition the model on more information

about the data. Uh maybe we need

hierarchy because this is a pretty long

horizon task and it needs to break it

down into different subtasks. Maybe we

need higher resolution images. uh maybe

we need to introduce kind of

interventions in data collection. A lot

of these things we also tried. We had

around two to three months of failure

where nothing was really working at

addressing this task. But then at some

point we actually had a bit of a

breakthrough uh which was that um we

found one thing that really seemed to

make a difference in the robot's ability

to do the task. And this was actually to

take some inspiration from the world of

language modeling to actually instead of

just training a policy on all of our

data, we pre-train on all the data and

then fine-tune on a curated,

consistent, high-quality set of

demonstration data. When we did this,

uh, we found that the robot was actually

able to make progress and a lot more

reliably fold articles of clothing. Uh,

and so I think that this video was the

first video where the robot was able to

fold five items in a row and stack them.

Uh, I went home very excited this day.

Uh, this was in September of 2024, so

multiple months after our initial tests.

Uh, now this is far from perfect. Uh, it

takes 20 minutes to fold five items of

clothes. Uh and um at the same time

though it kind of suggested that this

sort of recipe was able to unlock uh the

capability in the robot to actually fold

these articles of clothing. So you can

see these sorts of failures here. In

this case, it attempted to fold the

blue shirt around seven times uh before

eventually actually figuring out how to

do that. Um there's also other failure

modes as well. So, here's an example

where the robot pushes the stack to the

corner of the table uh and decides to

kind of fiddle with it a bit uh and then

eventually uh slides it off the table

and then it proceeds as if nothing had

happened and it's going to continue to

fold. We continue to iterate on this

recipe. We uh selected and worked on our

curation strategy for curating a higher

quality set of demonstration data. Uh we

got it from 20 minutes down to 12

minutes uh for these five items. This is

kind of how we were evaluating uh how

good our robot system was. uh it still

makes mistakes. The fold

quality still varies, but um it's still

significantly better than our previous

curation recipe. Now, at this point, we

were still training models largely um

kind of we were pre-training uh and

fine-tuning only on laundry data, and we

weren't leveraging uh kind of

pre-trained models in the community. And

there were some folks working at

Physical Intelligence that were working

on developing a pre-trained model

trained on all of the robot data. And um

we then started to try to introduce

these models into our um into our

recipe. And so we took an open-source

vision language model, a three billion

parameter model, uh, called PaliGemma.

Previously, the models in the previous

videos all had, like, 100 to 300

million parameters that we were iterating

on. Um this model takes as input images

from the robot also a language command

uh and then has a head a diffusion head

that's going to attend to all the

internal values of the vision language

model and, uh, with the joint angles, uh,

predict a chunk of 50 actions into the

future. So about 1 second uh of of

action steps, and we're using flow

matching, a variant of diffusion, uh, to

actually output these actions and output

continuous actions. Um so we took this

pre-trained uh this model and instead of

pre-training only on laundry, we

pre-trained on all of the robot data

that we had collected. Uh and then we

just fine-tuned it with the same exact

post-training recipe that we had

developed uh without using the vision

language models. Uh and when we did

this, we actually saw the robot uh

continue to actually get better when we

just plugged in that new pre-trained

model. Uh and so in the left video, it's

able to do five items in 9 minutes,

which was faster than the 12 minutes we

had before. In the right videos, we were

testing with um some novel clothing

items and found that it was also quite

efficient at folding multiple items in a

row. Uh, and we also saw as a result

there was also more consistent fold

quality by using this model that was

about 10 times larger um, and had seen

more robot data as input. To look at a

few highlights of this, here's a pair of

shorts that the robot hasn't seen

before. And this is kind of a tricky

scenario where to flatten it, it

actually kind of needs to reach under

the kind of the bottom of the shorts.

And it's able to do that. It's able to

kind of figure out that it should reach

under, um, the left part of the shorts

in order to uh eventually flatten it. Uh

and then um once it actually

successfully flattens it, uh it's able

to fold it successfully. It also has to

do something similar at times to fold

shirts. So in this case, it needs to

actually kind of fold the shirt over on

itself, which actually puts it in a more

crumpled state arguably, but allows it

to find the corners of the shirt and

then uh go ahead and fold it.
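
The action head described earlier predicts a chunk of 50 actions using flow matching, which at 50 Hz covers about one second. A minimal sketch of the sampling side follows: start from noise and integrate a velocity field with Euler steps. The `toy_velocity` function is an assumption standing in for the learned network (which in the talk is conditioned on images and a language command); it is the exact straight-line field toward one known target chunk. `ACTION_DIM` and the target ramp are hypothetical.

```python
import random

CHUNK_LEN = 50    # actions per chunk, from the talk
CONTROL_HZ = 50   # control rate, from the talk
ACTION_DIM = 2    # hypothetical; a real robot has more joints

def toy_velocity(x, t, target):
    """Stand-in for the learned network: velocity that carries any point on a
    straight-line path toward a single known target chunk."""
    return [[(tv - xv) / (1.0 - t) for xv, tv in zip(x_row, t_row)]
            for x_row, t_row in zip(x, target)]

def sample_chunk(target, num_steps=10, seed=0):
    """Draw Gaussian noise, then Euler-integrate the velocity field from t=0 to t=1."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)] for _ in range(CHUNK_LEN)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # always strictly less than 1, so no division by zero
        v = toy_velocity(x, t, target)
        x = [[xv + dt * vv for xv, vv in zip(x_row, v_row)]
             for x_row, v_row in zip(x, v)]
    return x

# Hypothetical target: a smooth ramp for each of the two joints.
target_chunk = [[i / CHUNK_LEN, -i / CHUNK_LEN] for i in range(CHUNK_LEN)]
chunk = sample_chunk(target_chunk)
chunk_seconds = CHUNK_LEN / CONTROL_HZ
```

With a learned field the result is a sample from a distribution over action chunks rather than one fixed target; the loop structure is the part this sketch is meant to show.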

Uh, and then like I mentioned, it also

is able to handle unseen clothing items.

So, uh, here's an example of a shirt

with a V-neck, uh, that it's able to fold

even though, um, this shirt was completely held

out and the post-training data set

didn't have any V-necks, uh, as input in

the data set. It's also able to fold

shirts with buttons. So, it has some

degree of generalization to different

clothing items.

Um, and then lastly, because this policy

is a neural network and it's kind of uh

taking as input the current image,

it's able to handle interruptions. So

here, Michael is uh continuing to mess

with the robot and the robot uh figures

out that it should put the the shirt

away uh while it's trying to fold the

other shirt. In this case, Michael's

going to continue messing with the

robot. So, Michael unfolds one side

and the robot reacts.

Michael goes in again

and the robot makes some mistakes here

but is able to recover. Michael messes it

up again. So those are some results of

what the robot's able to do. Now I

talked about this pre-training and

post-training recipe being really

important. We can actually

quantitatively measure that and actually

make sure that this is actually what's

leading to improvement. So, we compared

this pre-training and post-training

recipe to not using any pre-training and

only training on the curated data set

versus no post-training where you're

training on all of the data rather than

fine-tuning on the curated data set. Uh,

and we evaluated these models in terms

of their progress on the task where you

you make partial progress for getting it

out of the bin, which is the easiest

part, and then further progress for

flattening, folding, and stacking the

items. And we see that the pre-training

and post-training recipe is able to get

far higher performance than omitting

pre-training and omitting post-training.
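
A minimal sketch of the kind of progress-based scoring described above, where an episode earns partial credit for each stage it reaches. The stage names follow the talk (out of the bin, flattened, folded, stacked); the equal weighting per stage and the stop-at-first-missing-stage rule are assumptions, since the exact rubric is not given.

```python
STAGES = ["out_of_bin", "flattened", "folded", "stacked"]

def progress_score(completed):
    """Fraction of stages completed in order; stops at the first missing stage."""
    done = 0
    for stage in STAGES:
        if stage not in completed:
            break
        done += 1
    return done / len(STAGES)

def mean_progress(episodes):
    """Average progress over a list of evaluation episodes."""
    return sum(progress_score(e) for e in episodes) / len(episodes)

# Hypothetical evaluation log: one set of reached stages per episode.
episodes = [
    {"out_of_bin"},
    {"out_of_bin", "flattened", "folded"},
    set(STAGES),
]
average = mean_progress(episodes)
```

A score like this separates a policy that only gets items out of the bin from one that flattens and folds them, which a binary success rate would not.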

Uh and notably omitting pre-training and

post-training is basically able to get

it out of the bin and make very little

progress after that. Whereas when we

combine pre-training and curated

post-training, we get far higher

performance, where it's able to reliably, uh,

flatten and fold objects. Um and then

the last thing that I'll mention on this

note is that uh nothing in this recipe

is specific to laundry. And so we took

the same recipe um and fine-tuned on

other tasks. So here uh the task is to

um kind of clean up a table. And the

robot's also able to successfully uh do

this task uh despite the fact that we

primarily were iterating a lot on

laundry, but it's able to also apply

this recipe to this task. It also um is

able to scoop uh coffee beans into a

coffee grinder. Uh this task is pretty

hard. It has to construct the bottom

part of a cardboard box uh which

requires uh quite a bit of dexterity and

then um lastly autonomously lighting a

candle with a match again with this kind

of same pre-training and post-training

recipe.
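
The two-stage recipe used throughout this part can be sketched as follows: train first on all the data, then continue training from those weights on a small curated set. This is a toy under stated assumptions: a two-weight linear model, synthetic demonstrations, and made-up noise levels standing in for "all data" versus "curated, consistent data". It shows the order of the stages, not why the recipe helps on real robots.

```python
import random

def predict(weights, obs):
    return sum(w * x for w, x in zip(weights, obs))

def mse(weights, data):
    """Mean squared error of the linear model on (observation, action) pairs."""
    return sum((predict(weights, o) - a) ** 2 for o, a in data) / len(data)

def train(weights, data, epochs, lr):
    """Plain gradient descent on MSE, starting from the given weights."""
    weights = list(weights)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for obs, act in data:
            err = predict(weights, obs) - act
            for i, x in enumerate(obs):
                grad[i] += 2.0 * err * x / len(data)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights

def make_demos(rng, n, noise):
    """Synthetic demonstrations; `noise` models how inconsistent they are."""
    true_w = [1.0, -2.0]  # hypothetical ground-truth behavior
    demos = []
    for _ in range(n):
        obs = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        demos.append((obs, predict(true_w, obs) + rng.gauss(0.0, noise)))
    return demos

rng = random.Random(0)
all_data = make_demos(rng, 300, noise=0.5)   # large and inconsistent
curated = make_demos(rng, 30, noise=0.05)    # small and consistent

pretrained = train([0.0, 0.0], all_data, epochs=100, lr=0.3)   # stage 1: pre-train
finetuned = train(pretrained, curated, epochs=100, lr=0.3)     # stage 2: post-train
```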

And so this is pointing at this kind of

the benefit of foundation models that I

alluded to before which is that to do

these different tasks you don't have to

start completely from scratch. you can

actually leverage pre-training across

multiple robots and across multiple

tasks. And then we're also able to apply

that same recipe to robots at other

companies. Uh this is a robot that I've

actually never seen in person before. Uh

they collected data. They sent the data

to us. We fine-tuned our model on their

data. We actually didn't even know

exactly how the robot is being

controlled. Uh exactly the

representation of their actions. uh but

by fine-tuning the model on this new

robot, the model is able to control the

robot in order to uh make a cup of

coffee in this case. So um some

takeaways for this part uh we were able

to independently develop post-training

and pre-training and decouple the

problem um and then eventually get the

best of both. Uh we found that training

on all the data doesn't work for complex

tasks and this sort of

pre-training and post-training on

curated data leads to far better

performance. And then we broke up this

really hard problem of folding laundry

by gradually starting with folding

single shirts and going to more and more

complex versions of the task. Now

there's a number of limitations here and

one limitation I'd like to point out is

that these robots inevitably um in this

case were trained in the environments

that they were tested in. Uh and so this

means that in principle you could use

these methods to collect a lot of data

in one environment and then deploy them

in one environment. But ultimately,

there's going to be things that change

about an environment and scenarios where

we would want to actually apply these

robots to environments that they've

never seen before. And so, how can

robots actually succeed in places that

they've never been? The lesson we've

learned from machine learning in other

places is that we should collect diverse

data. Uh, and so we started by

collecting data of tidying bedrooms and

kitchens in many different environments.

Uh, and here's an example, kind of a

sample of that data. uh and we collected

robot data in homes across San Francisco

here uh and also collected data in

diverse mock kitchens and mock bedrooms

and in total we had more than 100 unique

rooms represented in the data set that

ended up being uh a part of a bigger

pre-training mixture. So we trained on

this diverse mobile manipulation data uh

including the low-level action

prediction as well as predicting high-level

subtask commands for how to complete the

task. But we also trained on previously

collected static manipulation data that

was also fairly diverse. Um static

manipulation data that we had collected

in our office and in labs as well as web

data, um, and high-level instructional data.

And I should point out here that the

mobile manipulation data of tidying

bedrooms and kitchens only accounted for

2.4% of the overall pre-training mix.
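
A minimal sketch of sampling training examples from a weighted pre-training mixture like the one described. Only the 2.4% share for mobile manipulation comes from the talk; the other three weights are hypothetical placeholders chosen so the mixture sums to one, and the source names are made up.

```python
import random

MIXTURE = {
    "mobile_manipulation_homes": 0.024,  # share stated in the talk
    "static_manipulation": 0.576,        # hypothetical weight
    "web_data": 0.300,                   # hypothetical weight
    "high_level_instructions": 0.100,    # hypothetical weight
}

def sample_sources(num_examples, seed=0):
    """Pick a data source for each training example according to MIXTURE."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=num_examples)

batch = sample_sources(100_000)
mobile_fraction = batch.count("mobile_manipulation_homes") / len(batch)
```

Even at such a small share, the target task still appears in a few thousand of every hundred thousand samples, while the rest of the mixture supplies the broader coverage.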

And so the lesson here is that you were

basically able to spin up a new task and

actually an entirely new robot. the rest

of the mixture didn't have any mobile

manipulation data with this particular

mobile manipulator in it um without

redoing all of the data collection.

We're able to build upon everything that

had been done before. And it's kind of

this kind of same story of foundation

models being able to make it easier to

spin up um a new problem, a new

application without starting from

scratch. Um now this wasn't completely

easy. Um we had a couple challenges. One

of the challenges that we ran into is

that naively uh this model can ignore

language instructions. So we had

actually in this case asked it to pick

up the cutting board and it chose to

pick up the plate instead. Now we're

again asking it to pick up the cutting

board. Uh and instead the robot had a

mind of its own and decided to pick up the

plate. Uh and then we tell it to put the

plate in the sink. And eventually it

decides that well after kind of moving

away from the cutting board it

eventually decided that it would

actually pick up the cutting board. And

so in the early development of our

model, we found that it often ignored

language.

And to solve this, we thought about how

vision language models actually follow

language well. And so maybe there's a

way to preserve the inherent abilities

of the pre-trained models when

addressing this task. Uh and so what we

did is with this pi zero architecture,

this action head that's using diffusion

is randomly initialized. And this ends

up actually deteriorating the

pre-trained knowledge that's present in

the vision language model. Uh and we

found that if we can prevent this

deterioration, we might be able to get

better language following. Uh and so the

recipe that we came up with was actually

in some ways fairly similar, but instead

we're going to be predicting tokenized

actions. And then when we have the

diffusion head, we'll be stopping the

gradient from the randomly initialized

diffusion head to prevent it from

deteriorating the language following

abilities of the VLM backbone. Uh and we

found that this first led to faster

training because the tokenized actions

are a more direct supervision signal.

And second, it also followed language

far better. Uh an 80% follow rate rather

than a 20% follower rate. Uh which

suggests that we're able to preserve the

kind of pre-training in the vision

language model backbone. So, we put

those pieces together. We took that

recipe and trained it um pre-trained it

on all of our data, including the mobile

manipulation data. We fine-tuned it on

mobile manipulation data in a variety of

environments. And then we tested the

model in places it's never been in

before. So, we rented uh three Airbnbs

that uh we had never been to before. Uh

we put the robot in those homes, in this

case, in the kitchen, and I asked it to

close the cabinet. I asked it to put

away the dishes. It has also never seen

these dishes, um, or these forks,

these objects. And the robot's able to

succeed even though it's never been

here before. There's different uh

countertops, different furniture,

different objects, and so forth. Uh

lastly, I asked it to clean up the

spill, and the robot is able to oblige

and wipe down the spill and eventually

put the sponge into the sink.

Uh it's also able to do this for

bedrooms. So Laura asked it in this case

just clean the bedroom and it puts uh

articles of clothing in. Uh it throws

away the trash and uh then is able to

tidy the bed by putting the uh putting

the pillow at the top of the bed and uh

tidying the the blanket or the comforter

of the bed.

So, quantitatively, I

talked about how, kind of, there's

only 2.7% or something of the

mixture and so how much does that other

data actually help? Uh could we actually

just train on that kind of 2.7%.

And we find that these kind of bars on

the right which are excluding data from

static robots in labs and environments

and so forth um reduces performance

significantly. So the performance goes

down to less than 60% when you exclude

that data when evaluated in novel homes

compared to if you use the full

pre-training mixture it has uh more than

20% higher performance. Lastly we also

looked at is the diversity of data

helpful? Is it important? And so we

increase the amount of data from these

environments to test this. It's always

good to like you can kind of do vibe

eval but it's really helpful to actually

measure how well uh these things work

and so this is what this is measuring

and we find that if we actually increase

the amount of homes the amount of uh

locations that are represented in the

data the performance increases which is

great uh and it actually gets to the

same level of performance as if we train

on data from that target environment and

so it means we're actually mostly

closing the generalization gap and

suggests that the bottlenecks at this

point for this sort of task lie not in

collecting more diverse data but in

actually getting higher reliability and

higher performance. Um now I should also

mention that there's failure modes like

this the success rate was around 80%.

There's lots of room for improvement. Uh

here are a couple examples of those

failure modes. So um here it's told to

put the items in the drawer. Uh it is

able to put it in the drawer but the

item isn't fully in the drawer at the

end and it decides that it's done and

kind of moves on to the next thing. Uh

here the robot uh needs to put the

clothes in the laundry basket. It drives

over the shirt um and then it gets stuck

and it's not able to lift it up. Uh here

we asked it to put the dishes in the

sink and it successfully is able to put

a number of the dishes in the sink but

it struggles to pick up the cutting

board uh in this particular case because

it's very thin and it's flush against

the surface of uh the countertop. Uh and

in the last case, probably my

favorite case, um it's told to put the

spatula into a drawer and it decides

that the oven looks a lot like a drawer

and so it opens the oven um and uh yeah,

tries to to put it in there. Um and

beyond this, there's also challenges

with regard to speed, partial

observability, uh long-term planning um

and so uh yeah, lots of work to do

still. So the takeaway here is that with

diverse data, uh, robots can follow a

variety of instructions in environments

that the robot has never been in before.

Uh, which is a big step up from a lot of

robotic scenarios where they're trained

in the scenarios that they are being

tested. Now the last kind of bit I'd

like to talk about is this model has a

fairly limited instruction set. It can

only follow kind of a certain set of

commands. And if we think about how

other forms of AI technology have been

deployed, people really like to

customize and actually tell the robot

what they want or tell the system what

they want from these kinds of models.

And so just like we prompt language

models, can we allow robots to respond

to open-ended prompts and open-ended

interjections?

Uh so to do this and actually to do the

past work, we're actually leveraging

hierarchical uh vision language action

models. So we're going to have a high

level policy break down uh the prompt

into uh intermediate uh verbal responses

and intermediate atomic language

commands. So the high-level prompt might be

kind of can you make me a sandwich uh

and this high-level policy will break it

down into the subtask of pick up one

slice of bread. This will be passed to a

low-level model that actually executes

and predicts target joint angles um to

fulfill the low-level command of picking

up one slice of bread. Now, on its own,

this isn't going to be able to follow

all sorts of prompts, and it's actually

fairly tricky to handle open-ended

language because it's going to be

challenging to collect a large number of

human robot interactions with the real

robot in the loop. And this is also

going to be fairly hard to scale. Uh and

so what we did is we kind of took all of

our existing robot data and we can

actually generate synthetic data for the

existing robot data. In particular, we

can use language models to relabel and

generate hypothetical human prompts for

the scenarios that the robots are in.

And so what this looks like is we'll

take data that says um here's a kind of

a video and then the next skill is to

pick up a Kit Kat because that's what

the robot does next in terms of just

like basic low-level annotation. And

then for this scenario where the robot

is about to pick up the Kit Kat, we can

ask a vision language model, what is a

hypothetical prompt that a human might

have asked that led to this um this

particular scenario and the robot to

actually choose to pick up a Kit Kat.

And then we can train our high level

policy on these synthetic prompts to

basically augment the robot data with

various human interactions that might

have led to those different situations.
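
A minimal sketch of the relabeling step just described: take an existing episode with its low-level skill annotation, ask a vision language model what a person might have said to cause it, and store the answer as a synthetic prompt. The query wording, the text-only scene description (the talk uses video), the `ask_vlm` callback, and the canned reply in `fake_vlm` are all assumptions for illustration, not the actual pipeline.

```python
def build_relabel_query(skill_label, scene_description):
    """Compose the question posed to the vision language model."""
    return (
        "A robot is in this scene: " + scene_description + "\n"
        "Its next skill is: " + skill_label + "\n"
        "What might a person have asked that led the robot to do this?"
    )

def relabel_episode(episode, ask_vlm):
    """Return a copy of the episode with a synthetic human prompt attached."""
    query = build_relabel_query(episode["skill"], episode["scene"])
    return {**episode, "synthetic_prompt": ask_vlm(query)}

def fake_vlm(query):
    # Placeholder for a real model call; always returns the same made-up reply.
    return "Can you get me something sweet?"

episode = {
    "scene": "snacks on a table next to a basket",  # hypothetical description
    "skill": "pick up a Kit Kat",                    # annotation from the talk's example
}
augmented = relabel_episode(episode, fake_vlm)
```

The high-level policy is then trained on pairs of synthetic prompt and annotated skill, so no new robot interactions have to be collected.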

And as a result of this, we're able to

actually allow robots to follow a

variety of different prompts. So on the

left, we ask, "Hi, robot. Can you make

me a ham and cheese sandwich?"

The robot says, "Sure, I'll start with

the bread and add ham and cheese next."

And it's able to break down this task

into the various subtasks of picking up

a slice of bread, putting on the cutting

board, picking up a slice of cheese,

putting it on the bread, um picking up

some ham, um and so on and so forth. It

can also follow more complicated prompts

like, "Hi robot, can you make me a vegan

sandwich? I don't like pickles, though."

uh and in this case it's able to break it

down and decide that it's going to add

lettuce and tomatoes to the sandwich uh

and not add pickles, not add cheese, not

add um meat as well.
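
A minimal sketch of the hierarchical structure behind these examples: a high-level policy turns the prompt into a verbal reply plus a list of atomic language commands, and a low-level policy executes each command. Both policies here are toy stand-ins (a lookup and a string formatter) rather than learned models; the reply and subtasks are taken from the ham and cheese example, and the fallback reply is invented.

```python
def high_level_policy(prompt):
    """Toy stand-in for the high-level model: returns (verbal reply, subtasks)."""
    if "ham and cheese sandwich" in prompt:
        reply = "Sure, I'll start with the bread and add ham and cheese next."
        subtasks = [
            "pick up one slice of bread",
            "put it on the cutting board",
            "pick up a slice of cheese",
            "put it on the bread",
            "pick up some ham",
        ]
        return reply, subtasks
    return "Sorry, I don't know that task.", []  # invented fallback

def low_level_policy(subtask):
    """Toy stand-in: a real low-level model would predict target joint angles."""
    return "executing: " + subtask

def run(prompt):
    """Break the prompt down, then hand each subtask to the low-level policy."""
    reply, subtasks = high_level_policy(prompt)
    return reply, [low_level_policy(s) for s in subtasks]

reply, log = run("Hi, robot. Can you make me a ham and cheese sandwich?")
```

Keeping the interface between the two levels as plain language is what lets the high-level policy be trained on synthetic prompts without touching the low-level controller.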

In addition to prompts, we're also able

to train the robot to handle different

interjections. Um actually here's a

case of a different kind of

prompt. So on the left we train the

robot to clean tables. So put trash away

and put dishes into the bin. And on the

right we ask the robot clean up only the

trash but not the dishes. And the

robot's able to understand what that

means and connect that to its low-level

actions and only put away the trash and

complete when it um when the trash is

all put away. And then lastly, it's able

to handle interjections and situated

corrections. So in this case, um the

robot is uh kind of getting items for a

user. The user interjects and says, "Get

me something sweet that's not in the

basket." Right after it had put a Kit

Kat into the basket and the robot um

says, "Uh, sure. Let me get you some

Skittles." uh and reasons through kind

of basic reasoning of how to uh what how

to fulfill the user's request and is

able to um respond to those kinds of

corrections situated in the world that

the robot is in. Now you might also

wonder whether some existing

foundation models could serve as a

high-level planner for robots and do this

sort of high level reasoning without

actually training a separate model. And

so we also evaluated that, and we

found that their performance (shown in blue) at

following instructions and making

progress on the task was substantially

lower than the performance of our system,

which is shown in green. And in

general we found that these frontier

models generally struggle with visual

understanding as it pertains to robotics

which makes sense because in general

these models aren't really

targeting many physical applications

and have very little data in the

physical world. Okay. So, to start to

wrap up, and then we'll have

some time for questions: I talked a

bit about how robots can do a variety of

dexterous, long-horizon tasks with

pre-training and post-training, how

robots can succeed in places that

they've never been, and how they can

respond to open-ended prompts and

interjections by leveraging synthetic

data from language models on top of the

robot data that we had collected. Now,

some closing notes. We've seen a

few different scenarios in this talk

where general purpose robots might be

more successful than specialist robots,

because rather

than starting from scratch for every single

application, we can build upon a much

broader foundation for physical

intelligence in the real world. We

also saw that large-scale data in

the real world is really helpful for

developing these things. I think that it's necessary

but not sufficient for physical

intelligence. There are a lot of

challenges, and we need more research

to be done, by ourselves and through open

source contributions, before robots

will be truly ready to tackle the

open world. I'd also like to mention

that at Physical Intelligence we're

hiring for a number of roles. If you're

excited about some of the things that we

talked about, you can see a list of the

open roles on the pi pi. As well,

awesome.

Happy to take some questions. Let's

start on the left.

>> Hi Chelsea. First, I want to

say thank you for all your work on robot

learning. It's all really impressive.

I mainly have two

questions regarding

the post-training part you mentioned.

First, you

mentioned that in post-training the

most important part is to have high

quality action data. So I'm wondering

what the components of that would be.

The second question is what role do you

think RL will play in

post-training?

>> Yeah, absolutely. So I think that, of the

different components, a lot of it

comes down to consistency of the data

and the strategy being followed, and

whether the

data completes the task efficiently and

with a reliable strategy. And then on

the second question, I think that

reinforcement learning can play a very

large role in

post-training. I think that online data from

the robots uh which reinforcement

learning allows you to use can allow

robots to have a much higher success

rate and also uh be faster than if

they're just trained with imitation

learning.

>> Yeah, thank you.

>> Hi, thank you so much for your talk. Uh

so your work is really fascinating and

there is no doubt that it will have a

lot of impact in the future. But um can

I ask, at this stage, how can you

find funding? Because honestly I

can't imagine how hard it can be to

convince people to invest in a robot

that folds clothes and deals with the

dishes.

>> Yeah. So it's a good

question. I think that well I guess

first I'll mention that we aren't just

focused on applications in the home. uh

we really want to solve this broader

problem of physical intelligence and

we've been starting with those

applications because they're ones that

are kind of easy to make progress on. Um

but we've also been doing tasks like

inserting an Ethernet cable, which I

put in the talk, as well as constructing

a cardboard box. And generally I

think that this sort of problem has a

ton of potential for making

impact in all sorts of realms, not just

in domestic tasks.

And even in domestic

tasks, I think there's a huge market for

this kind of technology. We

ourselves haven't had a lot of

difficulty with fundraising, and I think

that a lot of robotics companies

recently have also done a great job um

and found that there's actually a lot of

excitement around this sort of

technology because I think things are

actually starting to work. Uh I started

working on this technology uh more than

10 years ago at this point and things

really weren't working then and so uh

yeah, I think that there's a lot of

excitement that this is starting to mature

and actually be ready for

the real world. I think that there's a

lot more work to do uh but generally it

seems like there's a lot of people

excited about this technology and and

eager to actually put funds behind it.

>> Okay, thank you so much.

>> Yeah.

>> Hi. Thank you so much. I have two

questions, one more broad and

one more technical. The technical one

is that VLAs, at

least to my understanding, are a

framework that is a bit

separate from world modeling, and I

wonder how the two of them

will interplay with each other and

whether you have actually planned

to somehow use them together.

As I see it right now, VLAs are more

like policies that could actually

benefit a lot from world modeling. And

from a broader perspective, I wonder which

kind of infrastructure layers could be

the most useful to work on, such as

explainability, traceability, or

safety in general, to deploy such

models in the real world.

>> Yeah, great question. So on the first

point, there are actually fairly natural

ways to incorporate world model

objectives into vision language action

models, and we've done some work where,

instead of only predicting the next

action, you predict some intermediate

subgoal image, like what should happen

in the future in order to accomplish the

task, and then predict an action from

there. We've seen some

signs of life that that seems to be

quite promising. So I think there's ways

to merge the two paradigms.

At the same time, I think there's a lot

of challenges that come up with world

modeling, because

the data that you put

into it is not necessarily

reflective of the ways in which you're

going to use it. You might train it on

demonstration data of successfully

completing the task and then

try to use it to evaluate

actions that are not optimally

completing the task. And then the world

model will hallucinate a video of

completing the task successfully even if

the actions that you provide as input

weren't actually going to

lead to a good outcome.

So there are challenges there to overcome,

but there are also

ways to integrate it into the VLA

paradigm. And then could you remind me

your second question?

>> What are the infrastructure

layers you would want people to work

on in the short term to bring

the most

improvements, let's say?

>> To actually run these models on robots,

we have a real-time

system that needs to be

hitting a certain frequency to

execute actions successfully.

And if you have lag in that system and

so forth, it introduces all sorts of

challenges. And so thinking about fast

inference and infrastructure

that's actually going to be on the robot

is a big part of what our software

team does. And then also thinking about

like large scale machine learning

infrastructure, training large models,

ingesting large amounts of data. The

data that we have is different from a

lot of typical data sets because

it's very multimodal in nature. It's

videos, actions, language

segments, and various other

components as well. So yeah, some

interesting infrastructure problems, I

think, both on the robot side and on

the model training side.
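
The point about hitting a fixed control frequency can be illustrated with a fixed-rate loop that counts missed deadlines. This is only a sketch under stated assumptions: `run_control_loop` and `policy_step` are hypothetical names, the 50 Hz rate is arbitrary, and plain Python gives no real-time guarantees, so an on-robot system would need something stricter.

```python
import time
from typing import Callable


def run_control_loop(policy_step: Callable[[], None],
                     hz: float,
                     n_steps: int,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Call policy_step at a fixed rate and return how many deadlines were missed."""
    period = 1.0 / hz
    overruns = 0
    next_deadline = clock() + period
    for _ in range(n_steps):
        policy_step()  # inference plus sending the action (hypothetical)
        now = clock()
        if now > next_deadline:
            # Inference lagged past the deadline: record it and resynchronize.
            overruns += 1
            next_deadline = now + period
        else:
            # Finished early: wait out the rest of the period.
            sleep(next_deadline - now)
            next_deadline += period
    return overruns


if __name__ == "__main__":
    # A fake policy that takes about 5 ms per step, run at 50 Hz.
    missed = run_control_loop(lambda: time.sleep(0.005), hz=50.0, n_steps=20)
    print("missed deadlines:", missed)
```

Passing `clock` and `sleep` as parameters is a design choice for this sketch: it lets the loop be exercised with a simulated clock instead of wall time.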

>> Thank you so much.

>> Yep.

>> Hi, I'm Frederick and I have got a

question about model sizes in general.

So I think what we're seeing right now

is that in general larger model sizes

lead to better accuracy, for example

in your experiments, and it's also

what OpenAI, Anthropic, and others are

doing right now with their LLMs.

However, there's also the approach of

using a quite small model and then

outsourcing the world knowledge into a

database of some sort with which the

model can interact. Um what is your take

on that? Do you think that's like a

valid approach or do you think

encapsulating all the world knowledge

inside of the model is better or works

better?

>> Yeah, it's an interesting question.

My experience working on

retrieval-based systems is that it

actually is a little bit tricky to,

first, figure out what should be

offloaded versus actually done by the

model, and second, sometimes the model

will ignore the retrieved content and

try to generate something itself, and it

actually seems to be quite

tricky to get that technically to work

exactly the way you want it. I

think it's probably going to depend on

the application and the use case

whether

that might make sense, but in my

experience, it ends up being quite

tricky to figure out what the division

of labor is. And even the model

part of it will need to have some degree

of intelligence in order to

actually make use of the retrieved

information and so forth. So I think

it's a really fascinating research

problem, but it also needs a

lot of research to

make that work successfully.

>> Thank you.

>> Yeah.

>> Hi, Chelsea. My name is Charu Thomas.

First off, I really appreciate the

talk. It was really fascinating, and I have

been a big fan of your work since

meta-learning. When you think about

how software and hardware are going

to continue to evolve, what are the

biggest opportunities for builders today

for your vision of physical

intelligence?

>> I mean, I think that, yeah,

there's lots of different

opportunities to make things work a lot

better and a lot of open questions.

Like what I was

mentioning before, thinking about

better ways of having infrastructure on

the robot side.

There's

some open source code for that sort of

thing, but there's a lot of

opportunities to make robot

infrastructure better, and not a lot

of people, I think, are working on

that aspect of the problem. There are also lots of

opportunities. I guess one of the

things I love about AI and

computer science as a whole is there's a

really big open source community, and I

think that there's a ton of opportunity

to actually do open source work and

contribute to a broader community

that's trying to collect data, open

source models, fix bugs on those models,

fine-tune those models, and figure out new

recipes for fine-tuning those models.

So yeah, all sorts of questions also

on the research side, especially in the

open source realm.

>> yeah thank you

>> Hi, Chelsea. I also, just like

everyone else, am a big fan of all your

work. So, thank you for putting that all

out. I've been reading through a lot

of your group's work recently and

particularly enjoyed reading

Siraj's PhD thesis. It taught me a lot

about scaling real world robotics with

data. And a question I have is how do

you think synthetic data will

scale for robotics in the future? As

we've seen with LLMs, we've

moved away, not

from pre-training, but from

human-collected data toward creating

synthetic data, with a lot of filtering

and a lot of self-grading. So, how do

you think using generative synthetic

data for creating environments or reward

models will impact robotics?

>> Yeah, I have many thoughts on this

topic. I think that at the end of the

day there's going to be no replacement

for real data, and so large

amounts of real robot data are going to

be a necessary component of any

system that's going to work in a

generalizable way. So we're going to

need that. At the same time, I do

think that there's a role for

simulation and synthetic data

to play, especially on the evaluation

side, because as you are

generalizing to many environments, it's

very tricky to actually evaluate how

well that model generalizes not just in

one new environment but in 10 new

environments because then you actually

need to bring the robot to those 10

environments or construct 10

environments. Uh whereas in simulation

that gets a lot easier. And so

I'm really excited about

simulation and synthetic data for that

use case. I should also mention that I

think that the analog of synthetic data

in language models is actually not

necessarily simulation in robotics but

closer to something like reinforcement

learning. Uh I think that a lot of

synthetic data is generated by the model

that's actually trying to do the task

and then trying to kind of reason

through different ways of doing the

task. And I think that the analogy there

is a robot that's trying to attempt the

task and learn from its own attempts and

get better from its own attempts. And

that sort of online data from the model

I think will also play a really critical

role in post-training, and it's something that

we're working on quite a bit.

So yeah, that, I think, is really

important and really helpful.

>> Thank you.

>> Cool. I think we have time for one more

question. Sorry we won't be able to get

to everyone. Yeah.

>> Hi. It's super cool to see you, as an MIT

EECS alum, now working in really cool

robotics and talking to us about

robotics and entrepreneurship. But

I've been wondering how robotics

research that involves hardware

components plays out differently in

academia versus industry and are there

typically more resources, fewer

constraints or broader applications in

one setting over the other? And what

kind of people or goals do you think

might be better suited for each path?

>> Yeah, it's an interesting question.

I still love startup,

academic, and industry

environments. I think they all have

various pros and cons. Certainly, I

think that

generally academic environments aren't

quite as well resourced in terms of data

collection throughput, eval throughput,

and compute as startups and

industry labs. But at the same time, I

think that there's a lot of problems

that you can solve without large amounts

of resources that we need to

figure out on the algorithm side.

So I think that there's a lot of

really interesting work to be done

there. And then in

industry and in startups,

actually trying to do some of the

research on these big models, scaling up

data, and seeing what happens at

large scales is really great to do

there. Yeah, I think that

there's a place for both. I also

think that the gap isn't as large as

people often make it seem.

Oftentimes people in industry

environments wish they had more

compute. You always wish

that you had more resources. And

sometimes when you have a lot of

resources, you don't actually think as

carefully and as critically about what

runs you're going to be doing and so

forth, and you end up sometimes being

more wasteful of compute than if you

were more compute constrained.

So there's also actually downsides to

having more resources in my experience.

>> I'm really sorry. Can I just ask

one quick question on architecture? I know

that the scaling laws have worked

well for transformer-based architectures,

and I was wondering, do you currently see

limits in VLM-based architectures,

which are made for text

tokens, because they don't have

modules for physical awareness?

And how do you deal with that?

>> Yeah. So, we tokenized the actions,

and so I'd encourage you to take a look

at the FAST tokenizer paper that we

put out as a way to

accomplish that. And yeah, we should

wrap up there. Thanks everyone, and

I hope you enjoy the event.
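
To make the idea of tokenizing actions concrete, here is a minimal sketch that maps continuous action values to discrete token ids by uniform binning. This is a deliberately simple stand-in and not the method from the tokenizer paper mentioned in the answer; the function names, the [-1, 1] range, and the 256 bins are all assumptions chosen for illustration.

```python
from typing import List


def tokenize(actions: List[float], low: float = -1.0, high: float = 1.0,
             n_bins: int = 256) -> List[int]:
    """Clip each action value to [low, high] and map it to the nearest bin index."""
    tokens = []
    for a in actions:
        a = min(max(a, low), high)
        idx = int((a - low) / (high - low) * (n_bins - 1) + 0.5)
        tokens.append(idx)
    return tokens


def detokenize(tokens: List[int], low: float = -1.0, high: float = 1.0,
               n_bins: int = 256) -> List[float]:
    """Map token ids back to the continuous value at the center of each bin."""
    return [low + t / (n_bins - 1) * (high - low) for t in tokens]


joint_targets = [-1.0, -0.25, 0.0, 0.6, 1.0]  # made-up normalized action values
tokens = tokenize(joint_targets)
recovered = detokenize(tokens)
print(tokens)
print(recovered)
```

Once actions are integers in a fixed vocabulary, a model that predicts text tokens can predict them in the same way; the price is a quantization error of at most about half a bin width per value.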
