For many years researchers have tried to teach computers to recognise objects in an image and then to recognise its "meaning" (e.g. moving from recognising a dog and a kid to recognising that the dog scared the kid, or that the kid was playing with the dog). This recognition and meaning extraction, which seems so easy and effortless for us, has proved hard and elusive for machines.
Now we may have arrived at a point where a computer can not just recognise objects in an image but also extract meaning from it, thanks to the work being done by researchers at Google.
The researchers have been able to create a learning system that lets a computer analyse an image and generate a caption describing, in text, the meaning of the scene. As an example, look at the caption created for the image shown. It is not just about recognising vegetables and people: it is about understanding that the image represents a marketplace where people go to sell and buy vegetables!
This is an amazing result.
The technology at the base of this achievement is rooted in the one used to translate from one language to another. There, a Recurrent Neural Network (RNN) creates a vector representation of a sentence's meaning, and a second RNN transforms that representation into the target language.
The approach used by the researchers was to substitute the first RNN with a Convolutional Neural Network (CNN), the kind of network used to recognise images and the objects within them. Normally, the CNN produces a set of features that a final Softmax layer, trained on many classes of objects, converts into the actual recognition. In this case, the output of the CNN is instead fed to the RNN, which produces the caption.
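To make the pipeline concrete, here is a minimal toy sketch of the idea: a stubbed "CNN" produces an image feature vector, which is fed as the first input to a tiny recurrent decoder that greedily emits one word per step until it reaches an end token. All names, weights, and the five-word vocabulary are made up for illustration — a real system learns these weights from millions of captioned images; the hand-picked "shift" weights here form a deliberately degenerate recurrent cell so the output is predictable.

```python
import math

# Toy vocabulary (hypothetical): a real captioner uses tens of thousands of words.
VOCAB = ["<start>", "a", "dog", "plays", "<end>"]
V = len(VOCAB)

def cnn_encode(image):
    """Stand-in for the CNN encoder: returns a feature vector the same size
    as a decoder input (here it happens to resemble the <start> token)."""
    return [1.0, 0.0, 0.0, 0.0, 0.0]

def one_hot(i):
    return [1.0 if j == i else 0.0 for j in range(V)]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

# Hand-picked input weights: a "shift" matrix, so each word scores its
# successor highest. Chosen purely so this sketch is verifiable by hand.
W_x = [[1.0 if c == r - 1 else 0.0 for c in range(V)] for r in range(V)]

def decode(image, max_len=10):
    x = cnn_encode(image)          # image features act as the first input
    words = []
    for _ in range(max_len):
        h = [math.tanh(v) for v in matvec(W_x, x)]  # one recurrent step
        idx = max(range(V), key=lambda i: h[i])     # greedy word choice
        if VOCAB[idx] == "<end>":
            break
        words.append(VOCAB[idx])
        x = one_hot(idx)           # feed the chosen word back in
    return " ".join(words)

print(decode(image=None))  # → "a dog plays"
```

The key design point survives even in this toy: the image embedding is simply treated as the first element of the sequence the decoder consumes, which is what lets a translation-style decoder be reused for captioning.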
The researchers are planning to use this semantic image recognition to help blind people understand images and to make the search of images easier.
It might seem easy, but the more experienced you are in the area of image recognition and semantic understanding, the more impressed you are by this result!
Are we done? Not quite! As can be seen in the second figure, the results vary. The first column shows examples of good understanding; in the second, minor errors were made; in the third, the caption is related to the scene but does not actually capture its essence; and in the fourth column we see a caption that is simply useless. This highlights how much progress is still needed to match our human capability of understanding a scene.