CREDIT: MODIFIED FROM ISTOCK.COM / EYEEM MOBILE GMBH

Some computer vision programs have been thrown off by tricks such as manipulating the pixels in an image.

Computers are getting much better at learning to “see”

The machine-learning programs that underpin image-recognition still have blind spots, but will they for much longer?


Anyone with a computer has been asked to “select every image containing a traffic light” or “type the letters shown below” to prove that they are human. These log-in hurdles, called reCAPTCHA tests, may prompt some head-scratching (does the corner of that red light count?), but they rest on a long-standing assumption: vision is a reliable way to tell humans and computers apart. Now computers are catching up.

The quest to create computers that can “see” has made huge progress in recent years. Fifteen years ago, computers could correctly identify what an image contains about 60 percent of the time. Now, it’s common to see success rates near 90 percent. But many computer systems still fail some of the simplest vision tests — thus reCAPTCHA’s continued usefulness.

Newer approaches aim to more closely resemble the human visual system by training computers to see images as they are — made up of actual objects — rather than as just a collection of pixels. These efforts are already yielding success, for example in helping develop robots that can “see” and grab objects.

Better neural networks

Computer vision models employ what are called visual neural networks. These networks are built from interconnected units called artificial neurons that, much as neurons do in the brain, forge connections with one another as the system learns. Typically, the networks are trained on a set of images paired with descriptions, and eventually they can correctly guess what is in a new image they haven’t encountered before.
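
The learning loop described above can be shrunk to a single artificial neuron. The sketch below is a toy illustration only, not an image model: a perceptron adjusts its connection weights whenever it mislabels a training example, and the data points standing in for two “image classes” are made up.

```python
# Toy illustration of supervised learning with one artificial neuron (a
# perceptron). Real vision networks chain millions of such units, but the
# core idea is the same: nudge connection weights using labeled examples.

def train(examples, epochs=20, lr=0.1):
    """Learn weights from (features, label) pairs; labels are 0 or 1."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for x, label in examples:
            pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = label - pred          # 0 if correct, +1 or -1 if wrong
            w[0] += lr * err * x[0]     # shift weights toward the answer
            w[1] += lr * err * x[1]
            b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Hypothetical "training set": two clusters standing in for two classes.
data = [([1.0, 1.0], 1), ([1.2, 0.8], 1),
        ([-1.0, -1.0], 0), ([-0.8, -1.2], 0)]
w, b = train(data)
print(predict(w, b, [0.9, 1.1]))   # an unseen point near class 1; prints 1
```

After training, the neuron correctly labels a point it has never seen, the miniature analogue of a network recognizing a new image.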

A major leap forward in this technology came in 2012 when, using a powerful version of what’s called a convolutional neural network, a model called AlexNet was able to correctly label images it hadn’t encountered before after teaching itself to recognize images on a training set. It won, by a large margin, the ImageNet Large Scale Visual Recognition Challenge, a contest that’s considered a benchmark for evaluating computer vision tasks. (AlexNet was developed by two students of computer scientist Geoffrey Hinton, the “Godfather of AI” who shared the Nobel Prize in physics in 2024.)

Despite this vastly improved performance, visual neural networks still make puzzling mistakes. In a classic example from 2017, a student-run AI research group at MIT tricked a neural network into labeling a picture of a cat as guacamole. By adding an imperceptible amount of pixel “noise” to the cat image, the model was completely thrown off.

“I was shocked that this was so easy to do — to make the models think the wrong thing,” says computer scientist Andrew Ilyas, a member of that student team who will start a new position in January at Carnegie Mellon University in Pittsburgh.


In a classic example of tripping up an image-recognition program, a team in 2017 introduced some imperceptible noise into an image of a cat. Google’s InceptionV3 image classifier then mislabeled the image as guacamole.

CREDIT: A. ILYAS ET AL / PROCEEDINGS OF THE 35TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING 2018

Moving every pixel in an image just slightly to the left or right can also confuse these visual networks. When researchers did this with images of otters, airplanes and binoculars, the models could no longer identify the images, even though the shifted versions looked identical to a person, computer scientists Yair Weiss and Aharon Azulay of the Hebrew University of Jerusalem reported in 2019.
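
It is easy to see why a pixel-level learner finds a shift disorienting. In this toy sketch (the 4x4 “image” and its values are invented for illustration), shifting a bright square one pixel to the right changes a quarter of the raw values the network reads, even though a person would see the same square.

```python
# A one-pixel shift leaves an image looking the same to a person, but a
# model reading raw pixel values receives a noticeably different input.
# Toy 4x4 "image": a bright 2x2 square (value 9) on a dark background.

image = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]

def shift_right(img):
    """Shift every row one pixel right, padding with background (0)."""
    return [[0] + row[:-1] for row in img]

shifted = shift_right(image)

# Count pixels whose value changed between the two versions.
changed = sum(
    1
    for r1, r2 in zip(image, shifted)
    for a, b in zip(r1, r2)
    if a != b
)
print(changed)  # 4 of the 16 pixel values now differ
```

A model that has memorized pixel patterns, rather than the concept “bright square,” has to relearn each shifted position separately.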

This susceptibility to minute changes stems from the piecemeal way that visual neural networks learn. Instead of identifying a cat based on a true understanding of what a cat looks like, these networks latch onto a set of statistical features they associate with “cat.” Those features, however, are not inherent to the notion of “cat,” a gap that Ilyas and his colleagues exploited in their often-cited guacamole example.

“Computers learn lazy shortcuts that are easily tampered with,” Ilyas says.
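The shortcut-tampering idea can be sketched with a classifier reduced to a single linear layer. This is not the cat-to-guacamole attack itself, just its core principle in miniature: nudge every input value a tiny amount in whichever direction pushes the score toward the wrong class, the same logic behind “fast gradient sign” attacks on real image models. All weights and inputs below are made up.

```python
# Sketch of an adversarial perturbation against a linear classifier.
# For a linear model, the gradient of the score with respect to the input
# is just the weight vector, so stepping against sign(w) lowers the score.

def score(w, x):
    """Positive score means class A, negative means class B."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sign(v):
    return 1.0 if v > 0 else -1.0 if v < 0 else 0.0

def perturb(w, x, eps):
    """Move each input value eps in the direction that lowers the score."""
    return [xi - eps * sign(wi) for wi, xi in zip(w, x)]

w = [0.1] * 52 + [-0.1] * 48   # made-up weights for a 100-"pixel" input
x = [0.5] * 100                # the clean input, classified as class A

adv = perturb(w, x, eps=0.03)  # each pixel shifts by at most 0.03

print(score(w, x) > 0)    # True: the clean input lands in class A
print(score(w, adv) > 0)  # False: the barely-changed input flips class
```

Because the change is spread across many inputs, each individual pixel barely moves, yet the combined effect is enough to flip the decision, which is what makes such perturbations invisible to people but fatal to shortcut-based models.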

Today, convolutional neural networks are gradually being replaced by what are called vision transformers (ViTs). Typically trained on millions or even billions of images, ViTs divide images into groups of pixels called patches and cluster regions based on properties such as color and shape. These groupings are identified as physical features, such as a body part or a piece of furniture.
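The first step of that pipeline, carving an image into patches, can be sketched directly. The helper below is a simplified illustration (real ViTs then project each patch into an embedding vector); the 4x4 image and 2x2 patch size are chosen only to keep the output readable.

```python
# How a vision transformer first carves an image into patches: the pixel
# grid is cut into fixed-size tiles, and each tile becomes one "token"
# that the model reasons over.

def patchify(img, p):
    """Split a square image (list of rows) into p x p patches, each
    flattened into a single list, in row-major patch order."""
    n = len(img)
    patches = []
    for top in range(0, n, p):
        for left in range(0, n, p):
            patch = [img[top + r][left + c]
                     for r in range(p) for c in range(p)]
            patches.append(patch)
    return patches

image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(patchify(image, 2))
# Four patches: [[1, 2, 5, 6], [3, 4, 7, 8],
#                [9, 10, 13, 14], [11, 12, 15, 16]]
```

Treating each patch as a token lets the transformer's attention mechanism relate any region of the image to any other in a single step, which is one reason ViTs combine information across an image so efficiently.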

Vision transformers often perform better than previous approaches because they synthesize information from different areas of an image more efficiently, says machine learning researcher Alexey Dosovitskiy, who worked on ViTs at Google.


Blind spots in computer vision programs can be revealed via subtly altered images. The bottom row features four such “adversarial images,” which are still recognizable to human eyes but tripped up the computer.

CREDIT: A. ILYAS ET AL / PROCEEDINGS OF THE 35TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING 2018

Mimicking how the brain sees

Some researchers are now combining elements of various visual neural networks to make computers see more as humans do.

Object-centric neural networks aim to do just that. They evaluate images as compositions of objects rather than just grouping similar properties, such as “yellow.” These models’ image-recognition success comes from their ability to recognize an object as separate from its background.

In one recent example, researchers compared object-centric neural networks to other visual neural networks via a series of tests that required the computers to match identical shapes. All the models were trained on regular polygons and performed similarly on these kinds of shapes, but the object-centric models were much better at applying what they learned to irregular, colored and striped shapes.

The top object-centric model correctly matched the abnormal shapes 86.4 percent of the time, while the other visual model succeeded only 65.1 percent of the time, as reported earlier this year by Jeffrey Bowers, a psychologist who focuses on machine learning at the University of Bristol in England, and his colleague Guillermo Puebla, a psychologist at Universidad de Tarapacá in Providencia, Chile.

Object-centric models’ success has expanded beyond two-dimensional images. Newer systems can watch videos and reason about what they saw, correctly answering questions such as “How good are this person’s badminton skills?”

Object-centric algorithms also have been incorporated into robots. Some of these can more accurately grab and rotate objects in three dimensions, completing tasks such as opening drawers and turning faucets. One company is even building flying robots that use these types of visual recognition strategies to harvest apples, peaches and plums. These robots’ precise object detection abilities allow them to determine when fruit looks ripe and deftly swoop in between trees to pick the fruit without damaging its delicate skin.

Scientists expect even more progress in visual neural networks, yet there’s a long way to go before they can compete with the brain’s capabilities.

“There are ways in which the human visual system does strange stuff,” Bowers says, “but never is a cat mistaken as guacamole.”
