MIT researchers have developed an add-on module for artificial intelligence (AI) systems that, by analysing just a few frames of a video feed, can predict how objects will be changed or transformed by human action.
The module, called the Temporal Relation Network (TRN), gives AI systems the ability to learn how objects change at different points in a video.
The researchers at the Massachusetts Institute of Technology aim to build AI systems with better activity recognition and a deeper comprehension of what is happening in the world around them.
Bolei Zhou, a former PhD student in MIT’s Computer Science and Artificial Intelligence Laboratory, commented in a blog post: “We built an artificial intelligence system to recognise the transformation of objects, rather than the appearance of objects.”
“The system doesn’t go through all the frames — it picks up key frames and, using the temporal relation of frames, recognises what’s going on. That improves the efficiency of the system and makes it run accurately in real time.”
“That’s important for robotics applications; you want [a robot] to anticipate and forecast what will happen early on, when you do a specific action,” Zhou – currently an assistant professor of computer science at the Chinese University of Hong Kong – added.
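Zhou’s point about picking key frames rather than decoding every frame can be illustrated with a simple sparse-sampling sketch. This is illustrative only, not code from the released TRN implementation:

```python
def pick_key_frames(total_frames, num_samples):
    """Uniformly pick a sparse set of key-frame indices instead of
    decoding every frame of the video."""
    step = total_frames / num_samples
    # Take the midpoint of each equal-length segment of the video.
    return [int(step * i + step / 2) for i in range(num_samples)]

# For a 300-frame clip, sample only 8 frames spread across its length.
print(pick_key_frames(300, 8))  # [18, 56, 93, 131, 168, 206, 243, 281]
```

Sampling a fixed, small number of frames keeps the cost of recognition independent of video length, which is what allows the system to run in real time.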
The researchers trained and tested the module on three crowd-sourced datasets of videos containing footage of various activities being performed.
The first, made by the company TwentyBN, features 200,000 videos across 174 action categories, such as a hand poking or knocking over a stack of cans.
The second, dubbed Jester, contains 150,000 videos showing 27 different hand gestures, while the last, called Charades, teaches the module what everyday activities look like, such as playing basketball or carrying a bicycle.
According to MIT, when the TRN is fed a video it “simultaneously processes ordered frames in groups of two, three, and four — spaced some time apart”. It then judges whether an object’s transformation across those key frames is the result of a specific activity.
“If it processes two frames, where the later frame shows an object at the bottom of the screen and the earlier shows the object at the top, it will assign a high probability to the activity class ‘moving object down’,” the MIT researchers note.
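The frame-group processing MIT describes can be sketched in a toy example. This is a hypothetical illustration, not the authors’ released model; `group_score` stands in for the learned network that maps an ordered group of frame features to a score for one activity class:

```python
import random

def sample_ordered_groups(num_frames, group_size, num_groups=3, seed=0):
    """Sample several ordered groups of frame indices spread over the video.
    Illustrative only -- the real TRN uses groups of two, three and four."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(num_frames), group_size))
            for _ in range(num_groups)]

def temporal_relation_score(features, groups, group_score):
    """Average an activity score over all sampled frame groups.
    `group_score` stands in for the learned network that maps ordered
    frame features to a score for one activity class."""
    return sum(group_score([features[i] for i in g]) for g in groups) / len(groups)

# Toy example: each "feature" is just the object's vertical position
# (larger = nearer the top of the screen).
positions = [9, 7, 5, 3, 1]
groups = sample_ordered_groups(len(positions), group_size=2)
# Score "moving object down": the later frame shows the object lower down.
score = temporal_relation_score(positions, groups,
                                lambda g: 1.0 if g[-1] < g[0] else 0.0)
print(score)  # 1.0 -- every ordered pair shows the object moving down
```

Because the object descends in every frame, every sampled ordered pair supports “moving object down”, so the averaged score is high; in the real network these scores come from trained layers rather than a hand-written rule.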
The researchers’ next step will be to integrate object recognition with the activity-recognition software. Fortunately, work is already well under way on training AI to identify objects in video frames.
A harder task will be to train the machine to learn ‘intuitive physics’, which would give the AI a better understanding of the real-world properties that objects possess.
“Because we know a lot of the physics inside these videos, we can train [the] module to learn such physics laws and use those in recognising new videos. We also open-sourced all the code and models. Activity understanding is an exciting area of artificial intelligence right now,” commented Zhou.