Watch YouTube to learn your way...

Robot watches videos to detect objects and how to grasp them. “Hmmm, so that’s how to hold and slice a soft spherical red object by holding a slicing-tool with a ‘power small diameter’ type grip, while avoiding damage to my gripper hand. Got it.” Credit: Yezhou Yang et al.

It is not unusual for me to turn to YouTube to learn some specific task, be it upgrading the RAM on my Mac or following a food recipe. And I bet that goes for you as well.

What is more unusual is to imagine a robot accessing YouTube to learn how to do things. On the other hand, we have so much knowledge on the Web that it just makes sense to let robots tap into it and learn by themselves.

We have seen robots built to be capable of learning by being shown how to do a specific task; one of the first examples is probably Baxter. But so far there has always been a person showing them the way (by demonstrating the task or by programming it).

RoboBrain made the first leap in learning by accessing the Web to assimilate "concepts". So far it has downloaded one billion images, 120,000 YouTube clips and 100 million how-to documents and manuals (finally, someone is reading instruction manuals!).

Now, a team of researchers at the University of Maryland, in cooperation with a team at NICTA, Australia, is using the latest advances in "deep neural networks" to ease the analysis of the material that can be found on YouTube. This is much trickier than it might seem at first glance. Just because you don't even have to think when you see a person cracking an egg shell by hitting it on the edge of a bowl does not mean that a robot watching the same scene will reach a straightforward interpretation. It could be the way that person is holding the egg; it might even be that oval-shaped objects break their outer shell when touching a surface. Of course, to us these interpretations are pure nonsense, and we won't even consider them, but that is because we have been trained by life experience as we grew up... and that is not the case for a robot!

The first step, of course, is to recognise the various objects, then to recognise the action and the way an action involves several objects (the hand, the arm, the edge of the bowl and the egg). This image recognition is achieved through deep neural networks. Then the robot needs to extract a meaning and accumulate that knowledge, and that is still not the end of it. Once you have the knowledge, you need to learn when to apply it... So many hierarchies in what seems trivial to us.
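To make the hierarchy concrete, here is a minimal sketch of the last step of such a pipeline: aggregating noisy per-frame guesses into a single command. The two classifiers (object detector and grasp-type recogniser) are assumed to exist and are stubbed out as pre-computed per-frame predictions; all names and labels (`FramePrediction`, "power-small", etc.) are illustrative, not the researchers' actual code or label set.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class FramePrediction:
    """Hypothetical per-frame output of the two recognition networks."""
    objects: tuple  # objects detected in the frame
    grasp: str      # grasp-type label for the visible hand
    action: str     # per-frame action guess

def summarise_clip(frames):
    """Fuse per-frame predictions into one (grasp, action, object) command.

    A simple majority vote over frames: a crude stand-in for the
    sentence-level parsing a real system would perform.
    """
    grasp = Counter(f.grasp for f in frames).most_common(1)[0][0]
    action = Counter(f.action for f in frames).most_common(1)[0][0]
    objs = Counter(obj for f in frames for obj in f.objects)
    target = objs.most_common(1)[0][0]
    return (grasp, action, target)
```

For example, a clip in which most frames show a "power-small" grip, a "slice" action and a tomato would be summarised as `("power-small", "slice", "tomato")`, even if a few frames are misclassified, which is exactly why the aggregation step matters.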

I find it fascinating to observe how much insight we are gaining into our own thinking processes by attempting to have a robot perform what seems to be a very simple action, like picking up an egg and cracking it open to use in a recipe...

Author - Roberto Saracco

© 2010-2018 EIT Digital IVZW. All rights reserved. Legal notice