“We’re trying to understand, from a fundamental principles point of view, what it means to see.”
Computational vision scientist Richard Wildes, a professor of electrical engineering and computer science at York University, studies both biological and computer systems in formulating his answer.
With a biological system, the eyes collect visual information that is sent to the brain for processing. In an artificial system, a camera might record video that is sent to a computer for processing. But both operate in the same physical world, and in many cases both have the same end goal.
“In either case, you’re trying to recognize faces, you’re trying to recognize actions, you’re trying to recognize the scene in which the actions are occurring. It makes sense that the processing in between is the same at a certain level of abstraction,” adds Wildes.
From historic moments to everyday activities, a plethora of personal videos are now shared online. An incredible 300 hours of video are uploaded to YouTube every minute, and that’s just one of many video repositories. These collections contain a wealth of information, but it would be nearly impossible for any one person to sift through them all.
Through a combination of image processing and artificial intelligence, Wildes is trying to train computers to recognize and categorize video content automatically.
“For example, if we’re taking a video of somebody being interviewed, how can we actually analyze that video to be able to say, ‘Here’s a man, he’s answering questions, he’s telling a story about his research,’” explains Wildes.
Wildes believes we are very close to deploying tools that can automatically analyze videos for the actions and environments they capture. Such true video understanding will make it easier to extract information from the vast collections of shared videos from around the world.