Hi everyone. Um, I'm really excited to
talk about developing general purpose
robots and how we might uh actually like
truly develop and bring intelligence
into the physical world. So, um, to
start off, I'd like to talk about this
problem, which is that if you want to
truly solve a robotics application, you
essentially need to build an entire
company around that application. uh you
need to build a different company for
logistics, for wet lab automation, for
robots and kitchens, for surgical robots
and so on. And this is really really
hard to do because that company needs to
make new hardware, develop custom
software, design unique movement
primitives for that application, handle
edge cases and so on. And you have to do
all of that from scratch uh if you want
to solve a robotics problem. And as a
result, uh, a lot of robotics companies
haven't been very successful in actually
bringing robots into the physical world
successfully, uh, in our daily lives. I
co-founded a company called Physical
Intelligence, uh, that's trying to solve
this problem. And in particular, we're
trying to develop a general purpose
model that can enable any robot to do
any task in any environment. uh and we
think that this sort of generalist model
may work better and be easier to use
than purpose-built models just like
we've seen in the development of
foundation models for language and other
applications. Uh for example, if you
want to build uh a coding assistant, you
don't nowadays develop something
specifically for coding, but you develop
and you build on models that were
trained on large amounts of data, not
just on code. And essentially this is
the problem of trying to develop these
sorts of foundation models and bring
this sort of intelligence into the
physical world rather than the digital
world where they largely are today. So
how do we do this? Uh in this talk I'd
like to talk about how we go about doing
this. And if we were to take a lesson
from language models we know that
language models have taught us the
importance of scale. And so one possible
conclusion would be that perhaps scale
is the most important ingredient for
developing these models. And if you were
to say this conclusion is true, then
you might look to certain data sources
for large-scale uh data. So for example,
we might look at data from industrial
automation and you get tons and tons of
data of robots uh doing tasks over and
over again like this. But this sort of
data isn't going to allow robots to go
into disaster zones or to make a
sandwich uh or to bag groceries. And so
this massive scale doesn't have the
diversity of behaviors that we need in
order to solve this general problem.
Alternatively, maybe we look at data
from YouTube, which is also a massive
data source and many videos of humans
doing tasks uh that could be useful for
training robots. uh but at the same time
we don't learn how to write by watching
other people write and we don't become
expert tennis players by watching
Wimbledon. Uh and so even though there's
a massive scale of data here, it's very
challenging to use and there's also a a
gap between the embodiment of robots and
humans. Um and lastly, we might look at
data from simulation and you can also
get a massive scale of data here, but uh
this data lacks realism and also has a
gap from reality. And so I think the
lesson here is that scale is necessary
for developing these models that can
generalize in open world conditions, but
it's subordinate to actually solving
the problem. So you need scale, but it's
not sufficient uh for the entire
problem. And so at Physical
Intelligence, um, this is an
example of a data episode that we've
collected. Uh, this is in honor of our
first anniversary, which was a few
months ago. Here you can
see a teleoperator in person who's
operating some leader arms to control
the robot uh to light a match and light
a candle with the match and with this
sort of data we can train robots to do a
variety of different tasks and so um
what I'd like to talk about is some of
our recent results at trying to develop
sort of physical intelligence with
large-scale real robot data. I should
mention this is large scale by today's
robot standards and arguably a minuscule
amount of data compared to the sorts of
robot data that we should have in the
years to come. And so in particular,
we'll be looking at whether robots can
do a variety of dextrous long horizon
tasks, whether robots can succeed in
places they've never been, whether
robots can respond to open-ended prompts
and interjections. Uh, and even if
you're not excited about robotics, I
think that the lessons uh that we've
learned from trying to address these
problems are applicable outside of the
physical world. So um can we develop
robots that can complete
dextrous long horizon tasks? And in
particular uh in this first part I'd
like to talk about how we trained uh a
pi zero foundation model to do this task
which is to unload a dryer and fold
laundry. Uh and to date I think this is
the most impressive thing that I've seen
uh a robot do in the physical world.
It's really hard.
[Applause]
This is an incredibly difficult problem.
You can see that it's not perfect. Uh
here it's making some misgrasps, making
some mistakes, but it's really really
hard because you have to deal with the
variability in the clothes and the way
in which they might be positioned and
crumpled uh and be able to handle all
those sorts of things. And as you're
doing this task, which takes about 10
minutes for the robot, there's many
opportunities to fail uh to fail
catastrophically. For example, dropping
um things on the ground, which is hard
to recover from. uh and you have to be
able to recover from even small
mistakes. I was personally actually
working quite a bit um on this laundry
folding robot along with Michael and
Siraj uh and of course supported uh and
with contributions from the whole
Physical Intelligence team. Uh so how do
you even approach this sort of problem?
It's this is a really really hard thing
for a robot to do and what we did is we
started simple. Uh we started with can a
robot fold a single size single brand
shirt uh and can a robot dynamically
flatten one shirt again single brand
single sized and if you start simple
this makes the problem quite a bit
easier. Uh, we collected some data with
teleoperation and trained a policy
with imitation learning, and our model
had around 100 million parameters,
mapping from images from the robot's
cameras to target joint positions
on the robot arms, and we do this sort
of control at 50 hertz on the robot.
Uh, and we founded the company in
kind of mid-March of 2024. Uh, and a
couple months later after we had set
everything up, we were able to get a
policy that could fairly reliably fold a
single size single brand shirt. Uh, you
can see that I'm testing the policy
right here. Uh, and we also wanted to
test some dynamic motions because you
need to be able to match the control
frequency accurately in order to do
these sorts of dynamic motions. Um and
so these were some of our very initial
tests at uh addressing this sort of
laundry folding problem. Then from there
we wanted to make the problem
incrementally harder. Uh and so we
instead of starting from the shirt flat
on the table, we started in a crumpled
position like these. And it turns out
that this actually makes it a lot
harder. Uh and so here are some videos
of some of our initial attempts at
trying to train the robot to fold these
shirts. And the robot struggles. uh the
the robot does some things that kind of
look somewhat sensible but generally
isn't able to make progress on the task.
Uh with many tests we frequently were
getting 0% success rate in our tests of
this uh system and really struggling to
make progress. So really, this
introduces the challenge of handling
the sorts of variability in the ways in
which shirts might be crumpled on the
table. We had some initial signs of life
in late June of last year. Uh and
so in this case, the robot was able to
kind of make progress on flattening the
shirt. Uh it's also then able to fold
the shirt uh decently well uh from that
initial state. Still not perfect. Uh and
as you can see, it takes quite a while
to do this. So this is a video that was
sped up 8x. Uh so not something that
you might have the patience uh for a
robot to do.
Um, so with some initial signs of life,
though still a very low success rate, we
started to transition to a slightly
harder version of the task where the
laundry starts in a laundry basket. We
also introduced variable size shirts and
shorts into the mix. Uh, and again, the
robot really struggled. So in many of
our tests, we were getting 0% success
rate across the board, and we were really
struggling to actually get the robots to
learn how to do these tasks. At this
point, we were trying to consider a lot
of different things. uh we thought that
maybe the robot needs memory, needs
history in some way. Uh maybe we need to
just train our models for longer. Maybe
we should be doing control in end-effector
space rather than in joint space of the
robot. Uh maybe our encoders, we knew
that there were calibration issues and
maybe we need that calibration to be
more consistent. Uh maybe we need to
condition the model on more information
about the data. Uh maybe we need
hierarchy because this is a pretty long
horizon task and it needs to break it
down into different subtasks. Maybe we
need higher resolution images. uh maybe
we need to introduce kind of
interventions in data collection. A lot
of these things we also tried. We had
around two to three months of failure
where nothing was really working at
addressing this task. But then at some
point we actually had a bit of a
breakthrough uh which was that um we
found one thing that really seemed to
make a difference in the robot's ability
to do the task. And this was actually to
take some inspiration from the world of
language modeling to actually instead of
just training a policy on all of our
data, we pre-train on all the data and
then fine-tune on a curated,
consistent, high-quality set of
demonstration data. When we did this,
uh, we found that the robot was actually
able to make progress and a lot more
reliably fold articles of clothing. Uh,
and so I think that this video was the
first video where the robot was able to
fold five items in a row and stack them.
Uh, I went home very excited this day.
Uh, this was in September of 2024, so
multiple months after our initial tests.
Uh, now this is far from perfect. Uh, it
takes 20 minutes to fold five items of
clothes. Uh and um at the same time
though it kind of suggested that this
sort of recipe was able to unlock uh the
capability in the robot to actually fold
these articles of clothing. So you can
see these sorts of failures here. In
this case, it attempted to fold the the
blue shirt around seven times uh before
eventually actually figuring out how to
do that. Um there's also other failure
modes as well. So, here's an example
where the robot pushes the stack to the
corner of the table uh and decides to
kind of fiddle with it a bit uh and then
eventually uh slides it off the table
and then it proceeds as if nothing had
happened and it's going to continue to
fold. We continue to iterate on this
recipe. We uh selected and worked on our
curation strategy for curating a higher
quality set of demonstration data. Uh we
got it from 20 minutes down to 12
minutes uh for these five items. This is
kind of how we were evaluating uh how
good our robot system was. uh it still
makes mistakes. The fold
quality still varies, but um it's still
significantly better than our previous
curation recipe. Now, at this point, we
were still training models largely um
kind of we were pre-training uh and
fine-tuning only on laundry data, and we
weren't leveraging uh kind of
pre-trained models in the community. And
there were some folks working at
Physical Intelligence that were working
on developing a pre-trained model
trained on all of the robot data. And um
we then started to try to introduce
these models into our um into our
recipe. And so we took an open-source
vision language model, a three billion
parameter model called PaliGemma.
The previous
videos were all with models of like 100 to 300
million parameters that we were iterating
on. Um this model takes as input images
from the robot and also a language command,
and then has a head, a diffusion head,
that's going to attend to all the
internal values of the vision language
model and, given the joint angles,
predict a chunk of 50 actions into the
future, so about 1 second of
action steps. And we're using flow
matching, a variant of diffusion, to
actually output these actions as
continuous actions. Um so we took this
pre-trained uh this model and instead of
pre-training only on laundry, we
pre-trained on all of the robot data
that we had collected. Uh and then we
just fine-tuned it with the same exact
post-training recipe that we had
developed uh without using the vision
language models. Uh and when we did
this, we actually saw the robot uh
continue to actually get better when we
just plugged in that new pre-trained
model. Uh and so in the left video, it's
able to do five items in 9 minutes,
which was faster than the 12 minutes we
had before. In the right videos, we were
testing with um some novel clothing
items and found that it was also quite
efficient at folding multiple items in a
row. Uh, and we also saw as a result
there was also more consistent fold
quality by using this model that was
about 10 times larger um, and had seen
more robot data as input. To look at a
few highlights of this, here's a pair of
shorts that the robot hasn't seen
before. And this is kind of a tricky
scenario where to flatten it, it
actually kind of needs to reach under
the kind of the bottom of the shorts.
And it's able to do that. It's able to
kind of figure out that it should reach
under um the the left part of the shorts
in order to uh eventually flatten it. Uh
and then um once it actually
successfully flattens it, uh it's able
to fold it successfully. It also has to
do something similar at times to fold
shirts. So in this case, it needs to
actually kind of fold the shirt over on
itself, which actually puts it in a more
crumpled state arguably, but allows it
to find the corners of the shirt and
then uh go ahead and fold it.
Uh, and then like I mentioned, it also
is able to handle unseen clothing items.
So, uh, here's an example of a shirt
with a V-neck that it's able to fold
even though this shirt was completely held
out and the post-training data set
didn't have any V-necks in it.
It's also able to fold
shirts with buttons. So, it has some
degree of generalization to different
clothing items.
Um, and then lastly, because this policy
is a neural network and it's kind of uh
taking as input the current image,
it's able to handle interruptions. So
here, Michael is uh continuing to mess
with the robot and the robot uh figures
out that it should put the the shirt
away uh while it's trying to fold the
other shirt. In this case, Michael's
going to continue messing with the
robot. So, Michael unfolds one side
and the robot reacts.
Michael goes in again
and the robot makes some mistakes here
but is able to recover. Michael messes it
up again. So those are some results of
of what the robot's able to do. Now I
talked about this pre-training and
post-training recipe being really
important. We can actually
quantitatively measure that and actually
make sure that this is actually what's
leading to improvement. So, we compared
this pre-training and post-training
recipe to not using any pre-training and
only training on the curated data set
versus no post-training where you're
training on all of the data rather than
fine-tuning on the curated data set. Uh,
and we evaluated these models in terms
of their progress on the task where you
make partial progress for getting it
out of the bin, which is the easiest
part, and then further progress for
flattening, folding, and stacking the
items. And we see that the pre-training
and post-training recipe is able to get
far higher performance than omitting
pre-training and omitting post-training.
Uh and notably, when omitting pre-training and
post-training, the robot is basically able to get
it out of the bin and make very little
progress after that. Whereas when we
combine pre-training and curated
post-training, we get far higher
performance, where it's able to reliably
flatten and fold objects. Um and then
the last thing that I'll mention on this
note is that uh nothing in this recipe
is specific to laundry. And so we took
the same recipe um and fine-tuned on
other tasks. So here uh the task is to
um kind of clean up a table. And the
robot's also able to successfully uh do
this task uh despite the fact that we
primarily were iterating a lot on
laundry, but it's able to also apply
this recipe to this task. It also um is
able to scoop uh coffee beans into a
coffee grinder. Uh this task is pretty
hard. it has to construct the bottom
part of a cardboard box uh which
requires uh quite a bit of dexterity and
then um lastly autonomously lighting a
candle with a match again with this kind
of same pre-training and post-training
recipe.
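
The progress-based evaluation mentioned earlier in this part (partial progress for getting an item out of the bin, then more for flattening, folding, and stacking) could be scored roughly like the sketch below. This is only an illustration: the stage names come from the talk, but the equal weighting per stage and the helper names `progress_score` and `mean_progress` are assumptions, not the rubric that was actually used.

```python
# Stages named in the talk; equal weight per stage is an assumption.
STAGES = ["out_of_bin", "flattened", "folded", "stacked"]


def progress_score(completed):
    """Fraction of stages completed in order, stopping at the first missed stage."""
    score = 0
    for stage in STAGES:
        if stage not in completed:
            break
        score += 1
    return score / len(STAGES)


def mean_progress(episodes):
    """Average progress over a list of evaluation episodes."""
    return sum(progress_score(e) for e in episodes) / len(episodes)


# Hypothetical evaluation episodes for one model variant.
episodes = [
    {"out_of_bin"},
    {"out_of_bin", "flattened", "folded"},
    set(STAGES),
]
print(mean_progress(episodes))
```

A score like this gives credit to a policy that only gets items out of the bin, while still separating it clearly from one that flattens and folds reliably.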
And so this is pointing at this kind of
the benefit of foundation models that I
alluded to before which is that to do
these different tasks you don't have to
start completely from scratch. you can
actually leverage pre-training across
multiple robots and across multiple
tasks. And then we're also able to apply
that same recipe to robots at other
companies. Uh this is a robot that I've
actually never seen in person before. Uh
they collected data. They sent the data
to us. We fine-tuned our model on their
data. We actually didn't even know
exactly how the model is being
controlled. Uh exactly the
representation of their actions. uh but
by fine-tuning the model on this new
robot, the model is able to control the
robot in order to uh make a cup of
coffee in this case. So um some
takeaways for this part uh we were able
to independently develop post-training
and pre-training and decouple the
problem um and then eventually get the
best of both. Uh we found that training
on all the data doesn't work for complex
tasks, and this sort of
pre-training and post-training on
curated data leads to far better
performance. And then we broke up this
really hard problem of folding laundry
by gradually starting with folding
single shirts and going to more and more
complex versions of the task. Now
there's a number of limitations here and
one limitation I'd like to point out is
that these robots inevitably um in this
case were trained in the environments
that they were tested. Uh and so this
means that in principle you could use
these methods to collect a lot of data
in one environment and then deploy them
in that environment. But ultimately,
there's going to be things that change
about an environment and scenarios where
we would want to actually apply these
robots to environments that they've
never been in before. And so, how can
robots actually succeed in places that
they've never been? The lesson we've
learned from machine learning in other
places is that we should collect diverse
data. Uh, and so we started by
collecting data of tidying bedrooms and
kitchens in many different environments.
Uh, and here's an example, kind of a
sample of that data. uh and we collected
robot data in homes across San Francisco
here uh and also collected data in
diverse mock kitchens and mock bedrooms
and in total we had more than 100 unique
rooms represented in the data set that
ended up being uh a part of a bigger
pre-training mixture. So we trained on
this diverse mobile manipulation data uh
including the low-level action
prediction as well as predicting high-level
subtask commands for how to complete the
task. But we also trained on previously
collected static manipulation data that
was also fairly diverse. Um static
manipulation data that we had collected
in our office and in labs as well as web
data um and high-level instructional data.
And I should point out here that the
mobile manipulation data of tidying
bedrooms and kitchens only accounted for
2.4% of the overall pre-training mix.
And so the lesson here is that we were
basically able to spin up a new task, and
actually an entirely new robot (the rest
of the mixture didn't have any mobile
manipulation data with this particular
mobile manipulator in it), without
redoing all of the data collection.
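
To make the mixture idea concrete, here is a minimal sketch of sampling training examples from a weighted pre-training mix. Only the 2.4% share for the mobile manipulation data comes from the talk; the other source names and weights are placeholders chosen so the weights sum to one, and `sample_sources` is a hypothetical helper.

```python
import random

# Only the 0.024 entry is from the talk; the rest are illustrative placeholders.
MIXTURE = {
    "mobile_manipulation_homes": 0.024,
    "static_manipulation": 0.576,
    "web_data": 0.30,
    "high_level_instructions": 0.10,
}


def sample_sources(n, seed=0):
    """Draw n data-source names in proportion to the mixture weights."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


batch = sample_sources(100_000)
mobile_fraction = batch.count("mobile_manipulation_homes") / len(batch)
print(f"mobile manipulation share of sampled batch: {mobile_fraction:.3f}")
```

Even at a small share like this, the new robot's data is seen throughout training alongside everything collected before it.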
We're able to build upon everything that
had been done before. And it's kind of
this kind of same story of foundation
models being able to make it easier to
spin up um a new problem, a new
application without starting from
scratch. Um now this wasn't completely
easy. Um we had a couple challenges. One
of the challenges that we ran into is
that naively uh this model can ignore
language instructions. So we had
actually in this case asked it to pick
up the cutting board and it chose to
pick up the plate instead. Now we're
again asking it to pick up the cutting
board. Uh and instead the robot had a
mind of its own decided to pick up the
plate. Uh and then we tell it to put the
plate in the sink. And eventually it
decides that well after kind of moving
away from the cutting board it
eventually decided that it would
actually pick up the cutting board. And
so in the early development of our
model, we found that it often ignored
language.
And to solve this, we thought about how
vision language models actually follow
language well. And so maybe there's a
way to preserve the inherent abilities
of the pre-trained models when
addressing this task. Uh and so,
with this pi zero architecture,
this action head that's using diffusion
is randomly initialized. And this ends
up actually deteriorating the
pre-trained knowledge that's present in
the vision language model. Uh and we
found that if we can prevent this
deterioration, we might be able to get
better language following. Uh and so the
recipe that we came up with was actually
in some ways fairly similar, but instead
we're going to be predicting tokenized
actions. And then when we have the
diffusion head, we'll be stopping the
gradient from the randomly initialized
diffusion head to prevent it from
deteriorating the language following
abilities of the VLM backbone. Uh and we
found that this first led to faster
training because the tokenized actions
are a more direct supervision signal.
And second, it also followed language
far better. Uh an 80% follow rate rather
than a 20% follow rate. Uh which
suggests that we're able to preserve the
the kind of pre-training in the vision
language model backbone. So, we put
those pieces together. We took that
recipe and trained it um pre-trained it
on all of our data, including the mobile
manipulation data. We fine-tuned it on
mobile manipulation data in a variety of
environments. And then we tested the
model in places it's never been in
before. So, we rented uh three Airbnbs
that uh we had never been to before. Uh
we put the robot in those homes, in this
case, in the kitchen, and I asked it to
close the cabinet. I asked it to put
away the dishes. It has also never seen
these dishes or these forks,
these objects. And the robot's able to
succeed even though it's never been
here before. There's different uh
countertops, different furniture,
different objects, and so forth. Uh
lastly, I asked it to clean up the
spill, and the robot is able to oblige
and wipe down the spill and eventually
put the sponge into the sink.
Uh it's also able to do this for
bedrooms. So Laura asked it in this case
just clean the bedroom and it puts uh
articles of clothing in. Uh it throws
away the trash and uh then is able to
tidy the bed by putting the uh putting
the pillow at the top of the bed and uh
tidying the the blanket or the comforter
of the bed.
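
Going back to the training recipe described earlier in this part, the effect of stopping the gradient from the randomly initialized action head can be illustrated with a toy scalar model. This is a sketch under heavy assumptions, not the actual architecture: the "backbone" is a single weight `w`, the "action head" is a single weight `h`, both losses are squared errors, and the gradients are written out by hand. In a real framework this would typically be done with a stop-gradient or detach operation on the backbone features.

```python
def gradients(w, h, x, token_target, action_target, stop_gradient):
    """Return (dL/dw, dL/dh) for L = token_loss + action_loss.

    token_loss  = (w * x - token_target) ** 2        (supervises the backbone)
    action_loss = (h * (w * x) - action_target) ** 2 (supervises the head)
    """
    feat = w * x
    token_resid = feat - token_target
    action_resid = h * feat - action_target

    grad_w = 2 * token_resid * x       # backbone gradient from the token loss
    grad_h = 2 * action_resid * feat   # head gradient from the action loss
    if not stop_gradient:
        # Without the stop, the head's loss also flows back into the backbone.
        grad_w += 2 * action_resid * h * x
    return grad_w, grad_h


with_stop = gradients(1.0, 0.5, 2.0, 1.0, 3.0, stop_gradient=True)
without_stop = gradients(1.0, 0.5, 2.0, 1.0, 3.0, stop_gradient=False)
print(with_stop, without_stop)
```

With the stop in place, the backbone weight is updated only by the tokenized-action loss, while the head still receives its full gradient; without it, the head's error changes the backbone update as well.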
So, quantitatively, I
talked about how the mobile manipulation data is
only 2.7% or something of the
mixture, and so how much does that other
data actually help? Uh could we actually
just train on that kind of 2.7%?
And we find that these bars on
the right, which exclude data from
static robots in labs and environments
and so forth, show performance reduced
significantly. So the performance goes
down to less than 60% when you exclude
that data when evaluated in novel homes
compared to if you use the full
pre-training mixture it has uh more than
20% higher performance. Lastly we also
looked at is the diversity of data
helpful? Is it important? And so we
increase the amount of data from these
environments to test this. It's always
good to like you can kind of do vibe
eval but it's really helpful to actually
measure how well uh these things work
and so this is what this is measuring
and we find that if we actually increase
the number of homes, the number of
locations that are represented in the
data the performance increases which is
great uh and it actually gets to the
same level of performance as if we train
on data from that target environment and
so it means we're actually mostly
closing the generalization gap and
suggests that the bottlenecks at this
point for this sort of task lie not in
collecting more diverse data but in
actually getting higher reliability and
higher performance. Um now I should also
mention that there are failure modes:
the success rate was around 80%.
There's lots of room for improvement. Uh
here are a couple examples of those
failure modes. So um here it's told to
put the items in the drawer. Uh it is
able to put it in the drawer but the
item isn't fully in the drawer at the
end and it decides that it's done and
kind of moves on to the next thing. Uh
here the robot uh needs to put the
clothes in the laundry basket. It drives
over the shirt um and then it gets stuck
and it's not able to lift it up. Uh here
we asked it to put the dishes in the
sink and it successfully is able to put
a number of the dishes in the sink but
it struggles to pick up the cutting
board uh in this particular case because
it's very thin and it's flush against
the surface of uh the countertop. Uh and
in the last case, my probably my
favorite case, um it's told to put the
spatula into a drawer and it decides
that the oven looks a lot like a drawer
and so it opens the oven um and uh yeah,
tries to to put it in there. Um and
beyond this, there's also challenges
with regard to speed, partial
observability, uh long-term planning um
and so uh yeah, lots of work to do
still. So the takeaway here is that with
diverse data, uh, robots can follow a
variety of instructions in environments
that the robot has never been in before.
Uh, which is a big step up from a lot of
robotic scenarios where they're trained
in the scenarios that they are being
tested. Now the last kind of bit I'd
like to talk about is this model has a
fairly limited instruction set. It can
only follow kind of a certain set of
commands. And if we think about how
other forms of AI technology have been
deployed, people really like to
customize and actually tell the robot
what they want or tell the system what
they want from these kinds of models.
And so just like we prompt language
models, can we allow robots to respond
to open-ended prompts and open-ended
interjections?
Uh so to do this and actually to do the
past work, we're actually leveraging
hierarchical uh vision language action
models. So we're going to have a high
level policy break down uh the prompt
into uh intermediate uh verbal responses
and intermediate atomic language
commands. So the high-level prompt might be
kind of, "Can you make me a sandwich?"
And this high-level policy will break it
down into the subtask of pick up one
slice of bread. This will be passed to a
low-level model that actually executes
and predicts target joint angles um to
fulfill the low-level command of picking
up one slice of bread. Now, on its own,
this isn't going to be able to follow
all sorts of prompts, and it's actually
fairly tricky to handle open-ended
language because it's going to be
challenging to collect a large number of
human robot interactions with the real
robot in the loop. And this is also
going to be fairly hard to scale. Uh and
so what we did is we kind of took all of
our existing robot data and we can
actually generate synthetic data for the
existing robot data. In particular, we
can use language models to relabel and
generate hypothetical human prompts for
the scenarios that the robots are in.
And so what this looks like is we'll
take data that says um here's a kind of
a video and then the next skill is to
pick up a Kit Kat because that's what
the robot does next in terms of just
like basic low-level annotation. And
then for this scenario where the robot
is about to pick up the Kit Kat, we can
ask a vision language model, what is a
hypothetical prompt that a human might
have asked that led to this um this
particular scenario and the robot to
actually choose to pick up a Kit Kat.
And then we can train our high level
policy on these synthetic prompts to
basically augment the robot data with
various human interactions that might
have led to those different situations.
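
The relabeling step described above can be sketched as a small data pipeline. This is a minimal illustration, assuming a `Segment` record with a frame reference and a low-level skill annotation; `relabel` and `fake_vlm` are hypothetical names, and `fake_vlm` is a stub standing in for a real vision language model call.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    frame_id: str     # placeholder reference to the video frames
    next_skill: str   # basic low-level annotation, e.g. "pick up a Kit Kat"


def fake_vlm(segment):
    """Stub for a vision language model that proposes a plausible human prompt."""
    return f"User request that could lead to: {segment.next_skill}"


def relabel(segments, prompt_generator):
    """Pair each annotated segment with a synthetic prompt for high-level training."""
    examples = []
    for seg in segments:
        examples.append({
            "observation": seg.frame_id,
            "prompt": prompt_generator(seg),
            "target_skill": seg.next_skill,
        })
    return examples


examples = relabel([Segment("episode0_t120", "pick up a Kit Kat")], fake_vlm)
print(examples[0])
```

The existing robot data is left unchanged; only the prompt side of each training example is synthesized.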
And as a result of this, we're able to
actually allow robots to follow a
variety of different prompts. So on the
left, we ask, "Hi, robot. Can you make
me a ham and cheese sandwich?"
The robot says, "Sure, I'll start with
the bread and add ham and cheese next."
And it's able to break down this task
into the various subtasks of picking up
a slice of bread, putting on the cutting
board, picking up a slice of cheese,
putting it on the bread, um picking up
some ham, um and so on and so forth. It
can also follow more complicated prompts
like, "Hi robot, can you make me a vegan
sandwich? I don't like pickles, though."
Uh, and in this case it's able to break it
down and decide that it's going to add
lettuce and tomatoes to the sandwich uh
and not add pickles, not add cheese, not
add um meat as well.
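
The hierarchy behind these examples (a high-level policy that produces a verbal response and atomic language commands, and a low-level policy that turns each command into target joint angles) can be sketched as below. Everything here is a toy stand-in: the lookup table replaces a learned high-level model, `low_level_policy` returns dummy joint targets, and the 7-joint arm is an assumption.

```python
def high_level_policy(prompt):
    """Toy lookup standing in for a learned high-level policy."""
    plans = {
        "Can you make me a ham and cheese sandwich?": (
            "Sure, I'll start with the bread and add ham and cheese next.",
            [
                "pick up one slice of bread",
                "put it on the cutting board",
                "pick up a slice of cheese",
                "put it on the bread",
                "pick up some ham",
            ],
        ),
    }
    return plans.get(prompt, ("Sorry, I don't know how to do that.", []))


def low_level_policy(subtask, num_joints=7):
    """Placeholder for the low-level model: returns dummy target joint angles."""
    return [0.0] * num_joints


def run(prompt):
    """Break a prompt into subtasks and query the low-level policy for each."""
    reply, subtasks = high_level_policy(prompt)
    trace = [(subtask, low_level_policy(subtask)) for subtask in subtasks]
    return reply, trace


reply, trace = run("Can you make me a ham and cheese sandwich?")
print(reply)
for subtask, joints in trace:
    print(subtask, joints)
```

In the system described in the talk, both levels are learned models and the high-level policy also sees images, so it can revise the next subtask when the user interjects.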
In addition to prompts, we're also able
to train the robot to handle different
interjections. Um, actually, here's a
case of a different kind of
prompt. So on the left we train the
robot to clean tables. So put trash away
and put dishes into the bin. And on the
right we ask the robot, "Clean up only the
trash but not the dishes." And the
robot's able to understand what that
means and connect that to its low-level
actions and only put away the trash and
complete when it um when the trash is
all put away. And then lastly, it's able
to handle interjections and situated
corrections. So in this case, um the
robot is uh kind of getting items for a
user. The user interjects and says, "Get
me something sweet that's not in the
basket." Right after it had put a Kit
Kat into the basket and the robot um
says, "Uh, sure. Let me get you some
Skittles." Uh, and it goes through some
basic reasoning about how
to fulfill the user's request and is
able to um respond to those kinds of
corrections situated in the world that
the robot is in. Now you might also
wonder like maybe some existing
foundation models could serve as a
high-level planner for robots and do this
sort of high level reasoning without
actually training a separate model. And
so we also evaluated that um and we
found that in blue the performance at
following instructions and making
progress on the task was substantially
lower than the performance of our system
which is shown in green. Uh and in
general we found that these frontier
models generally struggle with visual
understanding as it pertains to robotics
which makes sense because in general
these models aren't kind of really
targeting uh many physical applications
and have very little data in the
physical world. Okay. So to start to
wrap up, and then we'll have
some time for questions: I talked a
bit about how robots can do a variety of
dexterous long-horizon tasks with
pre-training and post-training, how
robots can succeed in places that
they've never been, and how they can
respond to open-ended prompts and
interjections by leveraging synthetic
data from language models on top of the
robot data that we had collected. Now,
some closing notes. We've seen a
few different scenarios in this talk
where general-purpose robots might be
more successful than specialist robots,
because rather
than starting from scratch for every single
application, we can build upon a much
broader foundation for physical
intelligence in the real world. We
also saw that large-scale data in
the real world is really helpful for
developing these things, and I think
that it's necessary
but not sufficient for physical
intelligence. There are a lot of
challenges, and we need more research
to be done, by ourselves and through open
source contributions, before robots, I
think, will be truly ready to tackle the
open world. I'd also like to mention
that at Physical Intelligence we're
hiring for a number of roles. If you're
excited about some of the things that we
talked about, you can see a list of the
open roles on the Pi website as well.
Awesome.
Happy to take some questions. Let's
start on the left.
>> Uh hi Chelsea. So, uh first I want to
say thank you for all your work on robot
learning. They're all really impressive.
Yeah. So mainly I have two
questions, especially regarding
the post-training part you mentioned.
So the first thing is, you
mentioned that in post-training the
most important part is to have
high-quality action data. So I'm wondering
what the components of that would be. And
then the second question is, what part do you
think RL will play in
post-training?
>> Yeah, absolutely. So on the
different components of it, a lot of it
comes down to consistency of the data
and the strategy being followed, and
whether the
data completes the task efficiently and
with a reliable strategy. And then on
the second question, I think that
reinforcement learning can play a very
large role in post-training.
I think that online data from
the robots uh which reinforcement
learning allows you to use can allow
robots to have a much higher success
rate and also uh be faster than if
they're just trained with imitation
learning.
>> Yeah, thank you.
>> Hi, thank you so much for your talk. Uh
so your work is really fascinating and
there is no doubt that it will have a
lot of impact in the future. But can
I ask, at this stage, how do you
find funding? Because honestly I
can't imagine how hard it can be to
convince people to invest in a robot
that folds clothes and deals with the
dishes.
>> Yeah. So it's a good
question. Well, I guess
first I'll mention that we aren't just
focused on applications in the home. uh
we really want to solve this broader
problem of physical intelligence and
we've been starting with those
applications because they're ones that
are kind of easy to make progress on.
But we've also been doing tasks like
inserting an Ethernet cable, which I
put in the talk, as well as constructing
a cardboard box. And generally I
think that this sort of problem has a
ton of potential for making an
impact in all sorts of realms, not just
in domestic tasks. And even in domestic
tasks, I think there's a huge market
for this kind of technology. We
ourselves haven't had um a lot of
challenge with fundraising and I think
that a lot of robotics companies
recently have also done a great job um
and found that there's actually a lot of
excitement around this sort of
technology because I think things are
actually starting to work. Uh I started
working on this technology uh more than
10 years ago at this point and things
really weren't working then. And so
yeah, I think that there's a lot of
excitement that it is starting to mature
and actually be ready for
the real world. I think that there's a
lot more work to do, but generally it
seems like there are a lot of people
excited about this technology and
eager to actually put funds behind it.
>> Okay, thank you so much.
>> Yeah.
>> Hi. Thank you so much. I have two
questions, one more broad and
one more technical. So the technical one
is: VLAs, at
least to my understanding, are a
framework that is a bit
separate from world modeling, and I
wonder how the two of them
will interplay with each other and
whether you have actually planned
to somehow use them together.
As I see it right now, VLAs are more
like policies that could actually
benefit a lot from world modeling. And
from a broader perspective, I wonder which
kind of infrastructure layers would be
the most useful to work on, such as
explainability, traceability, or
safety in general, to deploy such
models in the real world.
>> Yeah, great question. So on the first
point, there are actually fairly natural
ways to incorporate world model
objectives into vision-language-action
models, and we've done some work where,
instead of only predicting the next
action, you predict some intermediate
subgoal image, like what should happen
in the future in order to accomplish the
task, and then predict an action from
there. And we've seen some
signs of life that that seems to be
quite promising. So I think there are ways
to merge the two paradigms.
At the same time, I think there are a lot
of challenges that come up with world
modeling, because the data that you put
into it is not necessarily
reflective of the ways in which you're
going to use it. You might train it on
demonstration data of successfully
completing the task and then
try to actually use it to evaluate
actions that are not optimally
completing the task. And then the world
model will hallucinate a video of
completing the task successfully even if
the actions that you provide as input
weren't actually going to
lead to a good outcome.
So there are challenges there to overcome,
but there are also
ways to integrate it into the VLA
paradigm. And then could you remind me
your second question?
>> Um what are like the infrastructure
layers like you want the chess to work
on uh in the shortest term to bring like
the most
>> um
improvements let's say
>> To actually run these models on robots,
we have a real-time
system that needs to be
hitting a certain frequency to actually
execute actions successfully.
And if you have lag in that system and
so forth, it introduces all sorts of
challenges. And so thinking about fast
inference and infrastructure
that's actually going to be on the robot
is a big part of what our software
team does. And then also thinking about
like large scale machine learning
infrastructure, training large models,
ingesting large amounts of data. Um the
data that we have is different from a
lot of kind of typical data sets because
it's very multimodal in nature. Um it's
kind of videos, actions, language
segments um and and various other uh
components as well. So um yeah, some
interesting infrastructure problems I
think both on the robot side uh and on
the kind of model training side.
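To make the real-time constraint in this answer concrete, here is a minimal Python sketch of a fixed-rate control loop that records how late each step finishes. It is only an illustration under stated assumptions: the function name `run_control_loop`, the injectable `clock` and `sleep` parameters, and the 50 Hz rate in the usage example are hypothetical and do not come from the talk or from any Physical Intelligence code.

```python
import time
from typing import Callable, List


def run_control_loop(
    policy_step: Callable[[], None],
    hz: float,
    n_steps: int,
    clock: Callable[[], float] = time.perf_counter,
    sleep: Callable[[float], None] = time.sleep,
) -> List[float]:
    """Call policy_step at a target rate and return per-step overruns in seconds.

    An overrun of 0.0 means the step finished before its deadline; a positive
    value is the lag that inference (or anything else) introduced.
    """
    period = 1.0 / hz
    overruns: List[float] = []
    next_deadline = clock() + period
    for _ in range(n_steps):
        policy_step()  # hypothetical: run inference and send one action
        now = clock()
        overruns.append(max(0.0, now - next_deadline))
        if now < next_deadline:
            sleep(next_deadline - now)  # wait out the rest of the period
        # Re-anchor after a late step so the loop does not burst to catch up.
        next_deadline = max(next_deadline, now) + period
    return overruns


if __name__ == "__main__":
    # Assumed 50 Hz target with a dummy step that takes about 5 ms.
    lags = run_control_loop(lambda: time.sleep(0.005), hz=50.0, n_steps=10)
    print(f"worst overrun: {max(lags) * 1000:.2f} ms")
```

Passing the clock and sleep functions in as parameters is a design choice made here so the timing logic can be checked deterministically with a fake clock; a real on-robot system would have many more concerns than this sketch covers.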
>> Thank you so much.
>> Yep.
>> Hi, I'm Frederick and I have got a
question about model sizes in general.
So I think what we're seeing right now
is that in general larger model sizes
lead to better accuracy. For example,
also in your experiments, or it's also
what OpenAI and Anthropic and others are
doing right now with their LLMs.
However, there's also the approach of
using a quite small model and then
outsourcing the world knowledge into a
database of some sort with which the
model can interact. Um what is your take
on that? Do you think that's like a
valid approach or do you think
encapsulating all the world knowledge
inside of the model is better or works
better?
>> Yeah, it's an interesting question. So
my experience working on
retrieval-based systems is that it
actually is a little bit tricky to,
first, figure out what should be
offloaded versus actually done by the
model, and second, sometimes the model
will ignore the retrieved content and
try to generate something itself, and
it actually seems to be quite
tricky to get that technically to work
exactly the way you want it. I
think it's probably going to depend on
the application and the use case in
terms of whether
that might make sense, but in my
experience, it ends up being quite
tricky to figure out what the division
of labor is. And even the model
part of it will need to have some degree
of intelligence in order to
actually make use of the retrieved
information and so forth. So I think
it's a really fascinating research
problem, but it also needs a
lot of research to
make that work successfully.
>> Thank you.
>> Yeah.
>> Hi, Chelsea. My name is Charu Thomas.
Um, first off, really appreciate the
talk. It was really fascinating, and I have
been a big fan of your work since
meta-learning. When you think about
how software and hardware are going
to continue to evolve, what are the
biggest opportunities for builders today
for your vision of physical
intelligence?
>> I mean, I think that, yeah,
there are lots of different
opportunities to make things work a lot
better and a lot of open questions.
I think kind of like what I was
mentioning before, uh, thinking about
better ways of having infrastructure on
the robot side. I think
there's
some open source code for that sort of
thing, but there are a lot of
opportunities to make robot
infrastructure better, and not a lot
of people, I think, are working on
that aspect of the problem. Also,
one of the
things I love about AI and
computer science as a whole is there's a
really big open source community, and I
think that there's a ton of opportunity
to actually do open source work and
contribute to a broader community
that's trying to collect data, open-source
models, fix bugs on those models,
fine-tune those models, figure out new
recipes for fine-tuning those models.
So yeah, all sorts of questions also
on the research side, especially in the
open source realm.
>> yeah thank you
>> Hi, Chelsea. I also, just like
everyone else, am a big fan of all your
work. So, thank you for putting that all
out. Uh, I've been reading through a lot
of your group's work recently and
particularly enjoyed reading
Siraj's PhD thesis. It taught me a lot
about scaling real-world robotics with
data. And a question I have is how do
you think synthetic data will
scale for robotics in the future? As
we've seen with LLMs, we've
moved away, not
from pre-training, but from
human-collected data toward creating
synthetic data, with a lot of filtering
and a lot of self-grading. So, how do
you think using generative synthetic
data for creating environments or reward
models will impact robotics?
>> Yeah, I have many thoughts on this
topic. I think that at the end of the
day there's going to be no replacement
for real data, and so large
amounts of real robot data are going to
be a necessary component of any
system that's going to work in a
generalizable way. So we're going to
need that. At the same time, I do
think that there's a role for
simulation and synthetic data
to play, especially on the evaluation
side, because as you, for example, are
generalizing to many environments, it's
very tricky to actually evaluate how
well that model generalizes, not just in
one new environment but in 10 new
environments, because then you actually
need to bring the robot to those 10
environments or construct 10
environments, whereas in simulation
that gets a lot easier. And so I
think I'm really excited about kind of
simulation and synthetic data for that
use case. I should also mention that I
think that the analog of synthetic data
in language models is actually not
necessarily simulation in robotics but
closer to something like reinforcement
learning. Uh I think that a lot of
synthetic data is generated by the model
that's actually trying to do the task
and then trying to kind of reason
through different ways of doing the
task. And I think that the analogy there
is a robot that's trying to attempt the
task and learn from its own attempts and
get better from its own attempts. And
that sort of online data from the model
I think will also play a really critical
role in post-training, and it's something that
we're working on quite a bit. And
so yeah, that I think is really
important and really helpful.
>> Thank you.
>> Cool. I think we have time for one more
question. Sorry we won't be able to get
to everyone. Yeah.
>> Hi. It's super cool to see you as an MIT
EECS alum now working in really cool
robotics and talking to us about
robotics and entrepreneurship. But
I've been wondering how robotics
research that involves hardware
components plays out differently in
academia versus industry and are there
typically more resources, fewer
constraints or broader applications in
one setting over the other? And what
kind of people or goals do you think
might be better suited for each path?
>> Yeah, it's an interesting question.
I still love startup,
academic, and industry
environments. I think they all have
various pros and cons. Certainly I
think that
generally academic environments aren't
quite as well resourced in terms of data
collection throughput, eval throughput,
and compute as startups and
industry labs. But at the same time I
think that there are a lot of problems
that you can solve without large amounts
of resources, that we need to
figure out on the algorithm side.
So I think that there's a lot of
really interesting work to be done
there. And then in
industry and in startups,
actually trying to do some of the
research on these big models, scaling up
data, seeing what things happen at
large scales, is really great to do
there. Yeah, I think that
there's a place for both. I also
think that the gap isn't as large as
people often make it seem. And
oftentimes people in industry
environments kind of wish they had more
compute. Like you kind of always wish
that you had more resources. uh and
sometimes when you have a lot of
resources, you don't actually think as
carefully and as critically about what
runs you're going to be doing and so
forth and you uh end up being sometimes
more wasteful of compute uh than if you
were kind of more compute constrained.
So there's also actually downsides to
having more resources in my experience.
>> I'm really sorry. Can I just ask one
quick question on architecture? I know
that the scaling laws have worked
well for transformer-based architectures,
and I was wondering, do you currently see
limits in VLM-based architectures,
which are kind of made for text
tokens, because they don't have
modules for physical awareness?
And how do you deal with that?
>> Yeah. So we tokenize the actions,
and I'd encourage you to take a look
at the FAST tokenizer paper that we
put out as a way to
accomplish that. And yeah, we should
wrap up there. Thanks everyone, and
yeah, hope you enjoy the event.
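The last answer mentions tokenizing actions so that a token-based model can emit them. As a rough illustration of what that can mean in its simplest form, the Python sketch below maps continuous action values to discrete token ids by uniform binning and maps them back. This is a generic scheme chosen for clarity and is not the FAST tokenizer, which the paper the speaker refers to describes. The function names, the [-1, 1] action range, and the 256 bins are all assumptions made for the example.

```python
from typing import List, Sequence


def tokenize_actions(
    actions: Sequence[float],
    low: float = -1.0,
    high: float = 1.0,
    n_bins: int = 256,
) -> List[int]:
    """Map continuous action values to integer token ids in [0, n_bins - 1]."""
    tokens: List[int] = []
    for a in actions:
        clipped = min(max(a, low), high)       # keep values inside the range
        frac = (clipped - low) / (high - low)  # rescale to [0, 1]
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def detokenize_actions(
    tokens: Sequence[int],
    low: float = -1.0,
    high: float = 1.0,
    n_bins: int = 256,
) -> List[float]:
    """Map token ids back to the center value of their bin."""
    width = (high - low) / n_bins
    return [low + (t + 0.5) * width for t in tokens]


if __name__ == "__main__":
    # Hypothetical 3-dimensional action, already normalized to [-1, 1].
    action = [-0.73, 0.1, 0.999]
    ids = tokenize_actions(action)
    print(ids, detokenize_actions(ids))
```

With these assumed settings, each value is recovered to within half a bin width (1/256), which shows the basic trade-off of any discretization: more bins give finer actions at the cost of a larger vocabulary.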