Multimedia search making progress

Eight randomly chosen keyframes of videos contained in the video/sound geotagged database used in this research. Credit: Xavier Sevillano et al./Information Sciences

Just ten years ago the only way to search was to input some text and hope for a matching text. Then search engines got much smarter and started to make assumptions about the meaning of your search. That broadened the potential target, and further filtering ordered the results by the likelihood of matching your need.

Today when you search for a sentence, or even a single word, you get millions (literally) of hits, but the ones presented on the first page are usually good enough for you.

Then, a few years ago, sound search became a possibility. Applications like Shazam let you point your cell phone at a piece of music you like but whose title you don't know, and they provide you with all the relevant information, including the possibility of buying the track on the spot.

More recently Google activated image search, and I became an addicted user. This opens up a whole new domain of searches. It is also very stimulating in making you try to guess the algorithm used, in particular what it means for two images to be "similar". If you have half an hour to spend (waste?), try it yourself. It is as good as the "spot the difference" game you find in several newspapers. In this case, of course, it is all about "spot the similarities". Some are obvious, some are real head-scratchers.

Now I read of work done at Ramón Llull University in Spain, reported by FECYT, the Spanish Foundation for Science and Technology, where researchers have developed an algorithm that searches for clues derivable from a clip's video and sound to identify the location where it was taped.

They started from a database of some nine thousand geotagged clips containing both images and sound, and have been able to localise (a fragment of) clips with an accuracy of 30 km. Higher accuracy has been possible for a tiny number of clips.

The recognition success rate is very low, 3%, but it is sufficient to prove that it is indeed possible to extract clues from a video clip based on its sound and images and, using a reference database, identify the place where that clip was filmed.
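The paper's actual features and matching method are not described here, but the retrieval idea itself can be sketched in a few lines: reduce each geotagged reference clip to a feature vector, find the reference clip nearest to the query in feature space, borrow its coordinates, and count the answer as correct if it lands within 30 km of the true location. All names, feature values, and coordinates below are illustrative stand-ins, not the researchers' data.

```python
import math

# Toy reference database: (feature vector, (latitude, longitude)).
# Real systems would use audio/visual descriptors; these are hand-made stand-ins.
REFERENCE_DB = [
    ([0.9, 0.1, 0.3], (41.3874, 2.1686)),   # e.g. a clip taped in Barcelona
    ([0.2, 0.8, 0.5], (40.4168, -3.7038)),  # e.g. a clip taped in Madrid
    ([0.4, 0.4, 0.9], (48.8566, 2.3522)),   # e.g. a clip taped in Paris
]

def euclidean(a, b):
    """Distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def locate(query_features):
    """Return the geotag of the reference clip closest in feature space."""
    best = min(REFERENCE_DB, key=lambda entry: euclidean(query_features, entry[0]))
    return best[1]

def is_hit(predicted, true_location, tolerance_km=30):
    """Count a localisation as correct if it falls within a 30 km radius."""
    return haversine_km(predicted, true_location) <= tolerance_km

# A query clip whose features resemble the first reference entry:
guess = locate([0.85, 0.15, 0.35])
print(guess, is_hit(guess, (41.3874, 2.1686)))  # → (41.3874, 2.1686) True
```

With only three reference clips this nearest-neighbour lookup is trivial, which is exactly why the database size matters so much: the more densely the world is covered by geotagged clips, the more likely the nearest match in feature space is also a near match on the map.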

Improving the match rate is a matter of growing the geotagged database from thousands to millions of clips. It is also a matter of seeing how well the algorithm can scale, but in this area there has been amazing progress, so I am quite confident that scalability will not be a major obstacle.

Also interesting is the global evolution of search. Shazam announced at the Mobile World Congress that it has received $30M in funding to explore ways of identifying objects in general, and food more specifically. You are at a restaurant and see a guy at the next table receiving a plate that looks appetising. Take a photo of the food with your cell phone and ask Shazam for information about that dish (calories, history, a good matching wine...). Very interesting from a technical point of view. In the meantime I'll continue as I do now: ask the fellow how it tastes and ask the waiter for a wine recommendation.

Author - Roberto Saracco

© 2010-2020 EIT Digital IVZW. All rights reserved.
