Able to extract data, extrapolate results, and accelerate processes faster than humanly possible, artificial intelligence (AI) and machine learning (ML) algorithms have become crucial tools in scientific discovery. But are these models safe to use in mission-critical scenarios involving classified information or where lives may be at stake? Livermore’s AI/ML experts are finding out.

Tasked with a range of national security objectives, LLNL is advancing the safety of AI/ML models in materials design, bioresilience, cybersecurity, stockpile surveillance, and many other areas. A key line of inquiry is model robustness, or how well a model defends against adversarial attacks. A paper accepted to the renowned 2024 International Conference on Machine Learning explores this issue in detail.

Generally, AI/ML model performance improves as scale increases—more data, more parameters, more compute. However, Brian Bartoldson points out that before the team’s work, a model’s safety and robustness, or lack thereof, had no such clear relationship with scale. “A larger model may have more vulnerabilities, or more compute power might not help it fend off attacks,” he says.

Models that classify images are particularly vulnerable to attacks. Even tiny perturbations to color, texture, or other features can fool a model into identifying a horse as a dog. Bartoldson explains, “Classifiers are exposed to a finite set of images during training without seeing all variations of the horse. So, after training, the model has a good sense of what a horse looks like but might not be robust to slight changes in pixels. This gap in the training dataset allows the model to be attacked.”
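To make the idea concrete, the sketch below shows one simple way such a perturbation can be generated: a single gradient step (the fast gradient sign method) that nudges every pixel in the direction that most increases the classifier's loss. This is an illustrative PyTorch example, not the attack used in the team's evaluation, which relies on the far stronger AutoAttack; the epsilon budget shown is a common but assumed choice.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, epsilon=8 / 255):
    """Single-step attack (FGSM): nudge each pixel by at most epsilon in the
    direction that most increases the classifier's loss. Illustrative only;
    the paper's evaluation uses the much stronger AutoAttack."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()  # barely visible change
    return adversarial.clamp(0, 1).detach()
```

An epsilon of 8/255 changes each pixel's intensity by roughly 3 percent at most, yet perturbations of that size routinely flip a standard classifier's prediction.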

In “Adversarial Robustness Limits via Scaling-Law and Human-Alignment Studies,” Bartoldson, James Diffenderfer, Konstantinos Parasyris, and Bhavya Kailkhura studied the effect of scaling robust image classifiers—using a method called adversarial training—to develop the first scaling laws for robustness. The team’s adversarial training approach alters pixels where the model seems most vulnerable, thus providing the model with a more continuous view of the data distribution. The work was funded by the Laboratory Directed Research and Development program.
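A minimal sketch of what one adversarial training step can look like appears below, assuming a PGD-style attack that searches for the pixel changes the model is currently most vulnerable to and then trains on the perturbed batch. The loop structure, attack strength, and hyperparameters are illustrative assumptions, not the team's exact recipe.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, images, labels, epsilon=8 / 255, alpha=2 / 255, steps=10):
    """Iteratively search, within a small epsilon-ball around each image,
    for the pixel changes the model is most vulnerable to."""
    adv = images.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = F.cross_entropy(model(adv), labels)
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()
        adv = images + (adv - images).clamp(-epsilon, epsilon)  # stay within budget
        adv = adv.clamp(0, 1)
    return adv.detach()

def adversarial_training_step(model, optimizer, images, labels):
    """One optimization step on the perturbed batch instead of the clean one."""
    adv_images = pgd_attack(model, images, labels)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(adv_images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In effect, each training batch is replaced by a worst-case neighbor, so the model learns a decision boundary that holds up under the same kind of pixel tinkering an attacker would try.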

[Image: three panels showing a blurry white animal, labeled left to right as Original, Training Attack, and AutoAttack]
Figure 1: An image (left) is perturbed by a weak attacker (middle), then by a stronger attacker called AutoAttack (right). Does it look more like a horse or a dog? View other examples and test your image identification skills by taking the team’s Human Perception Robustness Quiz.

State-of-the-art approaches had found anecdotal evidence that scaling the dataset sizes and model sizes used during adversarial training improved robustness. But when tested on the public CIFAR-10 dataset, which is commonly used to train image classifiers, the best robustness methods had only managed about 70% success against attacks. To produce their scaling laws, the LLNL team expanded the traditional paradigm of considering dataset and model sizes to include dataset quality, then systematically analyzed how all three variables affected the robustness of models they had adversarially trained.

Data quality is measurable using the Fréchet Inception Distance (FID) score, which compares synthetically generated images to actual images from the target dataset. Researchers trying to solve the image robustness problem often generate tens of millions of synthetic data points to augment CIFAR-10’s 50,000 model training images.
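The FID score itself reduces to a closed-form distance between two Gaussians fitted to Inception-network features of the real and synthetic images. The sketch below assumes those feature means and covariances have already been computed (libraries such as torchmetrics can handle that end to end); it is a schematic of the standard formula, not the team's evaluation code.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(mu_real, sigma_real, mu_synth, sigma_synth):
    """FID = ||mu_r - mu_s||^2 + Tr(S_r + S_s - 2 * sqrt(S_r @ S_s)),
    where mu and sigma are the mean and covariance of Inception features.
    Lower scores mean the synthetic images better match the real dataset."""
    diff = mu_real - mu_synth
    covmean, _ = linalg.sqrtm(sigma_real @ sigma_synth, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(sigma_real + sigma_synth - 2.0 * covmean))
```

A score near zero means the synthetic generator has essentially matched the statistics of the real CIFAR-10 images.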

Bartoldson and colleagues took the usual process a step further and varied the quality of their synthetic images, then noted how robustness corresponded to the synthetic image quality (measured by FID score). He states, “We wanted to figure out how much data quality mattered. Does the synthetic training data need to look exactly like the CIFAR-10 data in order to produce a model that is robust against attacks?”
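The experiment boils down to a sweep: hold the training procedure fixed, swap in synthetic datasets of progressively better (lower) FID, and record the robustness that adversarial training achieves on each. The outline below is hypothetical; the function names are placeholders, not the team's pipeline.

```python
def data_quality_sweep(synthetic_datasets, train_adversarially, measure_robustness):
    """synthetic_datasets maps an FID score to a synthetic training set;
    the two callables stand in for the training and evaluation steps."""
    results = {}
    for fid_score, dataset in sorted(synthetic_datasets.items()):
        model = train_adversarially(dataset)            # e.g., PGD training
        results[fid_score] = measure_robustness(model)  # e.g., AutoAttack accuracy
    return results
```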

As it turns out, better data quality provides significant benefits to the robustness produced by adversarial training—up to the point where images from the synthetic dataset are so high quality that they could pass as images from the actual dataset. Further emphasizing the importance of data, models smaller than those used in prior research were highly robust as long as they saw a sufficient amount of high-quality data. The team raised the state of the art to 74% adversarial robustness and also outperformed the prior best with a model three times smaller that was trained on three times more data.

The new scaling laws are equations that predict robustness against unseen attacks, given model size and training dataset quality and size. They can be used to recommend optimal allocations of compute resources and identify opportunities for reducing inefficiently large models. In particular, researchers can specify constraints on factors like model size, dataset size, and compute, then deploy the most robust model that meets those constraints quickly and efficiently.
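The published equations are fit to the team's experiments, but their general shape can be illustrated with a saturating power law in model size and dataset size. In the schematic below, only the 90 percent asymptote echoes a number reported in this article; every other coefficient is an arbitrary placeholder, and the real laws also fold in dataset quality.

```python
def predicted_robust_accuracy(model_size, dataset_size, asymptote=0.90,
                              a=2.0, b=2.0, alpha=0.3, beta=0.3):
    """Schematic scaling law: robust accuracy rises toward an asymptote as a
    power law in model parameters (N) and training examples (D). The 0.90
    asymptote mirrors the ~90% limit mentioned in the article; a, b, alpha,
    and beta are illustrative placeholders, not the paper's fitted values."""
    return asymptote - a / model_size**alpha - b / dataset_size**beta

# Compare two hypothetical training budgets before spending any compute.
small_run = predicted_robust_accuracy(model_size=5e6, dataset_size=5e7)
large_run = predicted_robust_accuracy(model_size=5e7, dataset_size=5e8)
print(f"predicted robust accuracy: {small_run:.3f} (small) vs. {large_run:.3f} (large)")
```

Used this way, fitted laws let researchers rank candidate model-and-dataset configurations on paper and spend compute only on the most promising one.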

AI/ML model robustness will always be relevant in mission-critical settings. For example, Bartoldson suggests, “Imagine an attacker looking for sensitive information in order to create a weapon. A model may be trained to withhold that information, but the attack could eventually coax the model to do what it wants. You don’t want models to be susceptible to attackers’ tricky input perturbations in the wild.”

[Image: adversarial CIFAR-10 test accuracy plotted against training FLOPs for synthetic data at three FID scores, with markers for the state-of-the-art training limit and average human performance; at right, two perturbed images labeled horse and dog]
Figure 2: Even with increasing compute costs (measured by floating-point operations, or FLOPs), state-of-the-art (SOTA) techniques cannot solve the CIFAR-10 adversarial robustness problem. At right, an adversarially perturbed CIFAR-10 image caused humans to disagree with the ground-truth label “horse.” The team’s scaling laws predicted asymptotic performance near 90%, and their human study corroborated that limit, estimating that the original label is no longer appropriate for about 10% of adversarial data. A lower FID score (i.e., the purple dashed line) indicates higher dataset quality.

—Holly Auten