Author: Jaimie Patterson

Image caption: Stop sign with prism kaleidoscope effect.

Real-life applications are increasingly using machine learning to power their classification systems, which categorize everything from spam emails to pictures of dogs (or muffins). But most modern ML-based classification systems are susceptible to attacks that take the form of small changes in their input. For example, were an attacker to change a few pixels in an image of a stop sign—equivalent to spray-painting it in real life—the recognition system in a self-driving car might suddenly mistake the stop sign for a speed limit sign.

In a paper published in Transactions on Machine Learning Research, Johns Hopkins researchers present an approach to creating ML classifiers that are provably resistant to these attacks, which, if successful, could have serious real-world consequences.

The authors of the paper are Ambar Pal, a doctoral candidate in the Department of Computer Science who is affiliated with the Mathematical Institute for Data Science, and Jeremias Sulam, an assistant professor of biomedical engineering and computer science. The pair investigated when it’s most beneficial to combine two common defensive techniques: randomized smoothing and noise-augmented training.

Randomized smoothing is a popular approach to defending ML classifiers from adversarial attacks. In the stop sign example, this technique would take the “spray-painted” image and sprinkle more paint at random spots on the sign to produce multiple “noisy” images. The ML classifier would then look at all the noisy images and take a majority vote of its guesses to make a final prediction of what the real image is most likely to be.
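
In code, the majority-vote idea looks roughly like the following sketch; the base_classifier function, the noise level, and the sample count are illustrative placeholders rather than details from the paper:

    import numpy as np

    def smoothed_predict(base_classifier, image, noise_std=0.25, num_samples=100, seed=0):
        """Classify many noisy copies of `image` and return the majority-vote label."""
        rng = np.random.default_rng(seed)
        votes = {}
        for _ in range(num_samples):
            noisy = image + rng.normal(scale=noise_std, size=image.shape)  # sprinkle random "paint"
            label = base_classifier(noisy)         # assumed to return a single class label
            votes[label] = votes.get(label, 0) + 1
        return max(votes, key=votes.get)           # the most frequently predicted class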

Noise-augmented training, on the other hand, concerns the training data the classifier is fed in the first place: By adding noise, or flaws, to the data, researchers can train a classifier so that it’s better at recognizing flawed images in real life.
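
A minimal sketch of that augmentation step, assuming the training images are stored as a NumPy array (the helper name and noise level are placeholders, not details from the paper):

    import numpy as np

    def add_training_noise(images, noise_std=0.25, seed=0):
        """Return noisy copies of the training images (Gaussian noise added to every pixel)."""
        rng = np.random.default_rng(seed)
        return images + rng.normal(scale=noise_std, size=images.shape)

    # During training, the classifier would be fit on add_training_noise(images)
    # together with the original labels, so it learns to tolerate flawed inputs.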

Although these two techniques are frequently used individually to defend against adversarial attacks, it was not previously understood how their interaction affects the performance of a trained ML classifier.

In their work, the Hopkins researchers demonstrate that the successful combination of these two techniques depends on the “interference distance” between the types of images a classifier is looking at, or the average distance between images of different classes.

“The ‘distance’ between two images can be thought of as taking their difference and summing the squares of their pixel intensities,” Pal explains.
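
Concretely, that quantity is the squared Euclidean distance between the two images; a minimal sketch, assuming both images are arrays of pixel intensities of the same size:

    import numpy as np

    def squared_distance(image_a, image_b):
        """Sum of squared pixel-wise differences between two same-sized images."""
        diff = np.asarray(image_a, dtype=float) - np.asarray(image_b, dtype=float)
        return float(np.sum(diff ** 2))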

The team determined that the combination of noise augmentation and randomized smoothing is best at preventing adversarial attacks when the interference distance between different types of images is small. When that distance is large, it's better not to use randomized smoothing at all, as it can hurt a classifier's resistance to adversarial attacks.

Recognizing this interference distance “sweet spot” is important for training classifiers to be resistant to adversarial attacks, the researchers say.

Joined by René Vidal, formerly the Herschel Seder Professor of Biomedical Engineering and now a professor at the University of Pennsylvania, the team developed another approach that explores how the mathematical properties of input data may determine if countermeasures to prevent adversarial attacks are even possible. They presented their work at the 37th Annual Conference on Neural Information Processing Systems in December.

Back to the spray-painted road sign example: The team uses the known properties of the data they're working with—such as "No stop sign is white" or "No speed limit sign is red"—to find the "closest" cleaned-up version of the stop sign image. Using this cleaned-up image, their ML model can correctly classify the image as a stop sign, they say.
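
One way to picture the "closest cleaned-up image" step is as a search over images that satisfy the known properties of the data. The sketch below is a simplified illustration of that idea, not the construction from the paper, and the set of allowed candidate images is hypothetical:

    import numpy as np

    def closest_clean_image(attacked_image, clean_candidates):
        """Return the allowed image with the smallest squared distance to the input."""
        best_image, best_dist = None, float("inf")
        for candidate in clean_candidates:
            diff = np.asarray(candidate, dtype=float) - np.asarray(attacked_image, dtype=float)
            dist = float(np.sum(diff ** 2))
            if dist < best_dist:
                best_image, best_dist = candidate, dist
        return best_image                          # classify this cleaned-up image instead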

“As an example, we can mathematically prove that as long as you modify a stop sign image by only a certain amount, our classifier will still think it is a stop sign,” says Pal.
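
Written out, a guarantee of that form says the classifier's prediction cannot change under any sufficiently small modification. In the generic notation below (which is not taken from the paper), x is the original stop-sign image, δ is the attacker's modification, ε is the certified amount of allowed change, and f is the classifier; how the size of δ is measured depends on the particular construction:

    f(x + \delta) = f(x) = \text{"stop sign"} \quad \text{for every modification } \delta \text{ with } \|\delta\| \le \epsilon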

The researchers discovered that if most of the road signs a classifier sees in the real world are similar to one another—such as speed limit signs that differ only in their MPH limits—then it's difficult to create a classifier that will resist adversarial "spray paint." Conversely, the greater the variety of signs a classifier sees, the more likely it is to correctly classify an image even if an attacker has subtly altered it.

“The main takeaway is that properties of the data distribution should be closely considered while designing safe ML classifiers,” explains Pal. “We find that utilizing such properties gives principled classifier constructions and associated proofs of robustness, or resistance to attacks.”

The team's research represents an important step toward deploying ML classification systems in practice. The researchers are now working to extend their methods to handle more substantial input modifications.

“The fragility of modern ML classifiers to attackers is one of the primary reasons they cannot be trusted to be fully deployed in real-world applications,” says Pal. “Our work creating classifiers with associated theoretical guarantees is a step in the direction of safe, real world deployment of these promising systems.”