Why computers are so bad at comparing things

Spot the difference.
Image: REUTERS/Christian Hartmann
Stay up to date:
Emerging Technologies
New research sheds light on why computers are so bad at a class of tasks that even young children have no problem with: determining whether two objects in an image are the same or different.
“There’s a lot of excitement about what computer vision has been able to achieve…”
Computer vision algorithms have come a long way in the past decade. They’ve been shown to be as good or better than people at tasks like categorizing dog or cat breeds, and they have the remarkable ability to identify specific faces out of a sea of millions.
In a paper they presented last week at the annual meeting of the Cognitive Science Society, the team examines why computer vision algorithms fail at comparison tasks and suggests avenues toward smarter systems.
This vs. that
“There’s a lot of excitement about what computer vision has been able to achieve, and I share a lot of that,” says Thomas Serre, associate professor of cognitive, linguistic, and psychological sciences at Brown University and the paper’s senior author. “But we think that by working to understand the limitations of current computer vision systems as we’ve done here, we can really move toward new, much more advanced systems rather than simply tweaking the systems we already have.”
Even after hundreds of thousands of training examples, the algorithms were no better than chance at recognizing the appropriate relationship.
For the study, Serre and his colleagues used state-of-the-art computer vision algorithms to analyze simple black-and-white images containing two or more randomly generated shapes. In some cases the objects were identical; sometimes they were the same but with one object rotated in relation to the other; sometimes the objects were completely different. The computer was asked to identify the same-or-different relationship.
The study showed that, even after hundreds of thousands of training examples, the algorithms were no better than chance at recognizing the appropriate relationship. The question, then, was why these systems are so bad at this task.
Serre and his colleagues had a suspicion that it has something to do with the inability of these computer vision algorithms to individuate objects. When computers look at an image, they can’t actually tell where one object in the image stops and the background, or another object, begins. They just see a collection of pixels that have similar patterns to collections of pixels they’ve learned to associate with certain labels.
That works fine for identification or categorization problems, but falls apart when trying to compare two objects.
One at a time
To show that this was indeed why the algorithms were breaking down, Serre and his team performed experiments that relieved the computer from having to individuate objects on its own. Instead of showing the computer two objects in the same image, the researchers showed the computer the objects one at a time in separate images.

The experiments showed that the algorithms had no problem learning same-or-different relationship as long as they didn’t have to view the two objects in the same image.
The source of the problem in individuating objects, Serre says, is the architecture of the machine learning systems that power the algorithms. The algorithms use convolutional neural networks—layers of connected processing units that loosely mimic networks of neurons in the brain. A key difference from the brain is that the artificial networks are exclusively “feed-forward”—meaning information has a one-way flow through the layers of the network. That’s not how the visual system in humans works, according to Serre.
“If you look at the anatomy of our own visual system, you find that there are a lot of recurring connections, where the information goes from a higher visual area to a lower visual area and back through,” Serre says.
While it’s not clear exactly what those feedbacks do, Serre says, it’s likely that they have something to do with our ability to pay attention to certain parts of our visual field and make mental representations of objects in our minds.
“Presumably people attend to one object, building a feature representation that is bound to that object in their working memory,” Serre says. “Then they shift their attention to another object. When both objects are represented in working memory, your visual system is able to make comparisons like same-or-different.”
Serre and his colleagues hypothesize that the reason computers can’t do anything like that is because feed-forward neural networks don’t allow for the kind of recurrent processing required for this individuation and mental representation of objects. It could be, Serre says, that making computer vision smarter will require neural networks that more closely approximate the recurrent nature of human visual processing.
The National Science Foundation and DARPA funded the research.
Don't miss any update on this topic
Create a free account and access your personalized content collection with our latest publications and analyses.
License and Republishing
World Economic Forum articles may be republished in accordance with the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License, and in accordance with our Terms of Use.
The views expressed in this article are those of the author alone and not the World Economic Forum.
Forum Stories newsletter
Bringing you weekly curated insights and analysis on the global issues that matter.
More on Emerging TechnologiesSee all
Ellen de Ruiter
April 10, 2025
Nii Simmonds and Obinna Isiadinso
April 9, 2025
Katia Moskvitch
April 9, 2025
Marc Alexander Penzel
April 9, 2025
Tariq Bin Hendi
April 8, 2025