Understanding FGSM Attacks
How small pixel changes can fool neural networks — and what we can do about it.
The Fast Gradient Sign Method (FGSM) is one of the best-known adversarial attacks in Adversarial Machine Learning. It perturbs an input image by a small amount in the direction of the gradient of the loss with respect to the input, i.e. the direction that most increases the model's loss. Although the perturbation is typically imperceptible to the human eye, it can completely change the model's prediction.
🔍 The Concept
Given an image x and its true label y, we compute the gradient of the model loss
with respect to the input pixels. FGSM then adjusts the image slightly in the direction that increases
this loss:
x_adv = x + ε · sign(∇_x J(θ, x, y))
Here, ε controls the strength of the perturbation. Even a small value (e.g. 0.01 for images scaled to [0, 1]) can be enough to make the model misclassify the image, which shows how fragile deep networks can be.
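As a quick sanity check of the update rule (a toy illustration, not from the original post): suppose a pixel has value 0.50, ε = 0.01, and the gradient of the loss at that pixel is −3.2. Then sign(−3.2) = −1, so the adversarial pixel becomes 0.50 + 0.01 · (−1) = 0.49. Every pixel moves by exactly ±ε; only the sign of its gradient matters, not the magnitude.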
⚙️ Example Implementation (PyTorch)
import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, epsilon):
    # Work on a fresh copy and track gradients w.r.t. the input pixels.
    images = images.clone().detach().requires_grad_(True)
    # Forward pass and loss against the true labels.
    outputs = model(images)
    loss = F.cross_entropy(outputs, labels)
    # Backward pass populates images.grad.
    model.zero_grad()
    loss.backward()
    # Single FGSM step: move each pixel by ±epsilon along the gradient sign,
    # then clamp back to the valid [0, 1] pixel range.
    perturbation = epsilon * images.grad.sign()
    adv_images = torch.clamp(images + perturbation, 0, 1)
    return adv_images.detach()
The function above applies the sign of the gradient to every pixel, nudging each one by ±ε in the direction that increases the loss and pushes the prediction away from the true label.
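A minimal usage sketch (assuming a pretrained classifier model and a DataLoader named test_loader with inputs already scaled to [0, 1]; these names are illustrative, not from the original post):

# Toy evaluation loop: compare clean vs. adversarial accuracy.
# `model` and `test_loader` are assumed to exist; epsilon matches the text above.
model.eval()
epsilon = 0.01
clean_correct, adv_correct, total = 0, 0, 0
for images, labels in test_loader:
    adv_images = fgsm_attack(model, images, labels, epsilon)
    with torch.no_grad():
        clean_correct += (model(images).argmax(dim=1) == labels).sum().item()
        adv_correct += (model(adv_images).argmax(dim=1) == labels).sum().item()
    total += labels.size(0)
print(f"clean accuracy: {clean_correct / total:.3f}, "
      f"adversarial accuracy: {adv_correct / total:.3f}")

The drop from clean to adversarial accuracy gives a direct measure of how much the perturbation at a given ε hurts the model.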
🧠 Visual Effect
While the original and adversarial images may appear identical to humans, their internal feature representations in the neural network differ dramatically.
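One way to see this numerically (a sketch, assuming the model exposes an intermediate feature layer; the hook target model.layer4 is a ResNet-style name used as an example and may differ for other architectures):

# Compare feature-space distance between a clean batch and its FGSM version.
# `model`, `images`, `labels`, and `fgsm_attack` are assumed from the snippets above.
features = {}

def save_features(module, inp, out):
    features["value"] = out.detach()

# Hook an intermediate layer; "layer4" is an illustrative ResNet-style layer name.
handle = model.layer4.register_forward_hook(save_features)

adv_images = fgsm_attack(model, images, labels, epsilon=0.01)
with torch.no_grad():
    model(images)
    clean_feats = features["value"].flatten(start_dim=1)
    model(adv_images)
    adv_feats = features["value"].flatten(start_dim=1)
handle.remove()

pixel_dist = (adv_images - images).norm().item()
feat_dist = (adv_feats - clean_feats).norm().item()
print(f"pixel-space L2 distance: {pixel_dist:.3f}, "
      f"feature-space L2 distance: {feat_dist:.3f}")

Typically the pixel-space distance stays tiny (bounded by ε per pixel) while the feature-space distance grows large enough to flip the prediction.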
🛡️ Common Defenses
- Adversarial Training: retrain models on adversarial samples generated during training (see the sketch after this list).
- Gradient Masking: obscure gradient information (often bypassable).
- Input Preprocessing: JPEG compression or random noise smoothing.
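A minimal sketch of FGSM-based adversarial training (assuming model, train_loader, and the fgsm_attack function above already exist; the SGD settings, the 50/50 clean/adversarial mix, and the ε value are illustrative choices, not prescribed by the original post):

# One epoch of FGSM adversarial training: train on a mix of clean and adversarial batches.
# `model`, `train_loader`, and `fgsm_attack` are assumed from the snippets above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
epsilon = 0.01

model.train()
for images, labels in train_loader:
    # Craft adversarial versions of the current batch with the model's current weights.
    adv_images = fgsm_attack(model, images, labels, epsilon)

    optimizer.zero_grad()
    # Average the loss on clean and adversarial inputs (a common 50/50 mix).
    loss = 0.5 * F.cross_entropy(model(images), labels) \
         + 0.5 * F.cross_entropy(model(adv_images), labels)
    loss.backward()
    optimizer.step()

Because the adversarial examples are regenerated every batch with the current weights, the model is continually exposed to the perturbations it is currently most vulnerable to.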
