CREDIT: MODIFIED FROM ISTOCK.COM / EYEEM MOBILE GMBH

Some computer vision programs have been thrown off by tricks such as manipulating the pixels in an image.

Computers are getting much better at learning to “see”

The machine-learning programs that underpin image-recognition still have blind spots, but will they for much longer?


Anyone with a computer has been asked to “select every image containing a traffic light” or “type the letters shown below” to prove that they are human. These log-in hurdles, called reCAPTCHA tests, may prompt some head-scratching (does the corner of that red light count?), but they rest on a long-standing assumption: vision is a reliable way to tell humans and computers apart. Now computers are catching up.

The quest to create computers that can “see” has made huge progress in recent years. Fifteen years ago, computers could correctly identify what an image contains about 60 percent of the time. Now, it’s common to see success rates near 90 percent. But many computer systems still fail some of the simplest vision tests — thus reCAPTCHA’s continued usefulness.

Newer approaches aim to more closely resemble the human visual system by training computers to see images as they are — made up of actual objects — rather than as just a collection of pixels. These efforts are already yielding success, for example in helping develop robots that can “see” and grab objects.

Better neural networks

Computer vision models employ what are called visual neural networks. These networks are built from interconnected units called artificial neurons that, much as neurons do in the brain, forge connections with one another as the system learns. Typically, the networks are trained on a set of images paired with descriptions, and eventually they can correctly guess what is in a new image they haven’t encountered before.
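
The learning loop described above can be shrunk to a single artificial neuron. The sketch below is a toy illustration only, not an image model: a perceptron adjusts its connection weights whenever it mislabels a training example, and the data points standing in for two “image classes” are made up.

```python
# Toy illustration of supervised learning with one artificial neuron (a
# perceptron). Real vision networks chain millions of such units, but the
# core idea is the same: nudge connection weights using labeled examples.

def train(examples, epochs=20, lr=0.1):
    """Learn weights from (features, label) pairs; labels are 0 or 1."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, label in examples:
            pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = label - pred          # 0 if correct, +1 or -1 if wrong
            w[0] += lr * err * x[0]     # shift weights toward the answer
            w[1] += lr * err * x[1]
            b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Hypothetical "training set": two clusters standing in for two classes.
data = [([1.0, 1.0], 1), ([1.2, 0.8], 1),
        ([-1.0, -1.0], 0), ([-0.8, -1.2], 0)]
w, b = train(data)
print(predict(w, b, [0.9, 1.1]))   # an unseen point near class 1; prints 1
```

After training, the neuron correctly labels a point it has never seen, the miniature analogue of a network recognizing a new image.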

A major leap forward in this technology came in 2012 when, using a powerful version of what’s called a convolutional neural network, a model called AlexNet was able to correctly label images it hadn’t encountered before after teaching itself to recognize images on a training set. It won, by a large margin, the ImageNet Large Scale Visual Recognition Challenge, a contest that’s considered a benchmark for evaluating computer vision tasks. (AlexNet was developed by two students of computer scientist Geoffrey Hinton, the “Godfather of AI” who shared the Nobel Prize in physics in 2024.)

Despite this vastly improved performance, visual neural networks still make puzzling mistakes. In a classic example from 2017, a student-run AI research group at MIT tricked a neural network into labeling a picture of a cat as guacamole. By adding an imperceptible amount of pixel “noise” to the cat image, the model was completely thrown off.

“I was shocked that this was so easy to do — to make the models think the wrong thing,” says computer scientist Andrew Ilyas, a member of that student team who will start a new position in January at Carnegie Mellon University in Pittsburgh.


In a classic example of tripping up an image-recognition program, a team in 2017 introduced some imperceptible noise into an image of a cat. Google’s InceptionV3 image classifier then mislabeled the image as guacamole.

CREDIT: A. ILYAS ET AL / PROCEEDINGS OF THE 35TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING 2018

Moving every pixel in an image just slightly to the left or right can also confuse these visual networks. When researchers did this with images of otters, airplanes and binoculars, the models could no longer identify the images, even though the shifted versions looked identical to a person, computer scientists Yair Weiss and Aharon Azulay of the Hebrew University of Jerusalem reported in 2019.
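
It is easy to see why a pixel-level learner finds a shift disorienting. In this toy sketch (the 4x4 “image” and its values are invented for illustration), shifting a bright square one pixel to the right changes a quarter of the raw values the network reads, even though a person would see the same square.

```python
# A one-pixel shift leaves an image looking the same to a person, but a
# model reading raw pixel values receives a noticeably different input.
# Toy 4x4 "image": a bright 2x2 square (value 9) on a dark background.

image = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]

def shift_right(img):
    """Shift every row one pixel right, padding with background (0)."""
    return [[0] + row[:-1] for row in img]

shifted = shift_right(image)

# Count pixels whose value changed between the two versions.
changed = sum(
    1
    for r1, r2 in zip(image, shifted)
    for a, b in zip(r1, r2)
    if a != b
)
print(changed)  # 4 of the 16 pixel values now differ
```

A model that has memorized pixel patterns, rather than the concept “bright square,” has to relearn each shifted position separately.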

This susceptibility to minute changes stems from the piecemeal way that visual neural networks learn. Instead of identifying a cat based on a true understanding of what a cat looks like, these networks latch onto a set of statistical features they associate with “cat.” Those features, however, are not inherent to the notion of “cat,” a gap that Ilyas and his colleagues exploited in their often-cited guacamole example.

“Computers learn lazy shortcuts that are easily tampered with,” Ilyas says.
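The shortcut-tampering idea can be sketched with a classifier reduced to a single linear layer. This is not the cat-to-guacamole attack itself, just its core principle in miniature: nudge every input value a tiny amount in whichever direction pushes the score toward the wrong class, the same logic behind “fast gradient sign” attacks on real image models. All weights and inputs below are made up.

```python
# Sketch of an adversarial perturbation against a linear classifier.
# For a linear model, the gradient of the score with respect to the input
# is just the weight vector, so stepping against sign(w) lowers the score.

def score(w, x):
    """Positive score means class A, negative means class B."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sign(v):
    return 1.0 if v > 0 else -1.0 if v < 0 else 0.0

def perturb(w, x, eps):
    """Move each input value eps in the direction that lowers the score."""
    return [xi - eps * sign(wi) for wi, xi in zip(w, x)]

w = [0.1] * 52 + [-0.1] * 48   # made-up weights for a 100-"pixel" input
x = [0.5] * 100                # the clean input, classified as class A

adv = perturb(w, x, eps=0.03)  # each pixel shifts by at most 0.03

print(score(w, x) > 0)    # True: the clean input lands in class A
print(score(w, adv) > 0)  # False: the barely-changed input flips class
```

Because the change is spread across many inputs, each individual pixel barely moves, yet the combined effect is enough to flip the decision, which is what makes such perturbations invisible to people but fatal to shortcut-based models.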

Today, convolutional neural networks are gradually being replaced by what are called vision transformers (ViTs). Typically trained on millions or even billions of images, ViTs divide images into groups of pixels called patches and cluster regions based on properties such as color and shape. These groupings are identified as physical features, such as a body part or a piece of furniture.
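The first step of that pipeline, carving an image into patches, can be sketched directly. The helper below is a simplified illustration (real ViTs then project each patch into an embedding vector); the 4x4 image and 2x2 patch size are chosen only to keep the output readable.

```python
# How a vision transformer first carves an image into patches: the pixel
# grid is cut into fixed-size tiles, and each tile becomes one "token"
# that the model reasons over.

def patchify(img, p):
    """Split a square image (list of rows) into p x p patches, each
    flattened into a single list, in row-major patch order."""
    n = len(img)
    patches = []
    for top in range(0, n, p):
        for left in range(0, n, p):
            patch = [img[top + r][left + c]
                     for r in range(p) for c in range(p)]
            patches.append(patch)
    return patches

image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(patchify(image, 2))
# Four patches: [[1, 2, 5, 6], [3, 4, 7, 8],
#                [9, 10, 13, 14], [11, 12, 15, 16]]
```

Treating each patch as a token lets the transformer's attention mechanism relate any region of the image to any other in a single step, which is one reason ViTs combine information across an image so efficiently.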

Vision transformers often perform better than previous approaches because they synthesize information from different areas of an image more efficiently, says machine learning researcher Alexey Dosovitskiy, who worked on ViTs at Google.


Blind spots in computer vision programs can be revealed via subtly altered images. The bottom row features four such “adversarial images,” which are still recognizable to human eyes but tripped up the computer.

CREDIT: A. ILYAS ET AL / PROCEEDINGS OF THE 35TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING 2018

Mimicking how the brain sees

Some researchers are now combining elements of various visual neural networks to make computers see more as humans do.

Object-centric neural networks aim to do just that. They evaluate images as compositions of objects rather than just grouping similar properties, such as “yellow.” These models’ image-recognition success comes from their ability to recognize an object as separate from its background.

In one recent example, researchers compared object-centric neural networks to other visual neural networks via a series of tests that required the computers to match identical shapes. All the models were trained on regular polygons and performed similarly on these kinds of shapes, but the object-centric models were much better at applying what they learned to irregular, colored and striped shapes.

The top object-centric model correctly matched the abnormal shapes 86.4 percent of the time, while the other visual model succeeded only 65.1 percent of the time, as reported earlier this year by Jeffrey Bowers, a psychologist who focuses on machine learning at the University of Bristol in England, and his colleague Guillermo Puebla, a psychologist at Universidad de Tarapacá in Providencia, Chile.

Object-centric models’ success has expanded beyond two-dimensional images. Newer systems can watch videos and reason about what they saw, correctly answering questions such as “How good are this person’s badminton skills?”

Object-centric algorithms also have been incorporated into robots. Some of these can more accurately grab and rotate objects in three dimensions, completing tasks such as opening drawers and turning faucets. One company is even building flying robots that use these types of visual recognition strategies to harvest apples, peaches and plums. These robots’ precise object detection abilities allow them to determine when fruit looks ripe and deftly swoop in between trees to pick the fruit without damaging its delicate skin.

Scientists expect even more progress in visual neural networks, yet there’s a long way to go before they can compete with the brain’s capabilities.

“There are ways in which the human visual system does strange stuff,” Bowers says, “but never is a cat mistaken as guacamole.”
