Adversarial Attacks

Challenge 5: Adversarial Attacks

Generative models are particularly vulnerable to adversarial attacks — small, often imperceptible perturbations of the input that cause large changes in the output. A barely-visible patch on an input image can cause an image classifier to misclassify it spectacularly. For generative models that condition on inputs, the same dynamic applies, and the consequences can be much worse than misclassification.

Adversarial attack
Generative models are vulnerable to adversarial attacks (small perturbations in input data can lead to significant changes in output) [Source].

Mitigation 1: Adversarial training

Generate adversarial examples (using a known attack method like FGSM or PGD), include them in your training data, and retrain. The model becomes more robust to that specific attack. This is a cat-and-mouse game — robust to one attack does not imply robust to all attacks — but it's a meaningful improvement and is standard practice in safety-critical deployments.

Adversarial training
Adversarial training combines clean and adversarial examples in the training dataset to make the model more robust to adversarial attacks.

Mitigation 2: Input sanitization

Apply preprocessing, quality assessment, or content filtering on the conditioning inputs before they reach the model. If a user is uploading an image as conditioning, run it through a check first. This adds latency but adds a meaningful security layer. Combine with rate limiting, anomaly detection on inputs, and logging of conditioning inputs for audit.

Interactive · Input Sanitization Pipeline

Choose an input type to see how it moves through the sanitization pipeline before reaching the model. Each check adds latency — but catches a different class of threat.

Select input type

A normal user photo — correct format, good quality, no manipulation.

1
Format validation+20ms
2
Quality assessment+45ms
3
Perturbation detector+80ms
4
Content filter+60ms
5
Anomaly detection+35ms
6
Rate limiting+5ms
7
Audit logging+8ms

The latency trade-off

For clean image, this pipeline adds approximately 253ms before the model sees the input. That's the cost of the security layer — meaningful, but small compared to model inference time (often 500ms–5s). Blocking early means cheaper rejections: a bad input caught at format validation never reaches the expensive perturbation detector.

Passed check
Flagged — pipeline halted
Running

Select an input type and run the pipeline to see which checks catch it, how much latency each layer adds, and whether the input reaches the model.

Checkpoint

Adversarial training is a common mitigation for adversarial attacks on generative models. What is the key limitation of this approach?

💭Reflection

A hospital wants to deploy a VAE-based anomaly detection system that flags unusual MRI scans for radiologist review. Walk through at least three of the challenges covered here in Chapter 7 and explain how you would mitigate each one in this specific deployment context.