Understanding FGSM Attacks

How small pixel changes can fool neural networks — and what we can do about it.

Marcos Martín
FGSM Attack illustration

The Fast Gradient Sign Method (FGSM) is one of the best-known adversarial attacks in the field of Adversarial Machine Learning. It works by adding a small, deliberately crafted perturbation to the input image, taken in the direction of the gradient that maximizes the model's loss. Although these perturbations are imperceptible to the human eye, they can completely change the model's predictions.

🔍 The Concept

Given an image x and its true label y, we compute the gradient of the model loss with respect to the input pixels. FGSM then adjusts the image slightly in the direction that increases this loss:

x_adv = x + ε · sign(∇_x J(θ, x, y))

Here, ε controls the strength of the perturbation. For images scaled to [0, 1], even a small value such as 0.01 is often enough to make the model misclassify the image, showing how fragile deep networks can be.
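
As a quick, made-up numerical example, suppose three pixels of x have loss gradients of −3.2, 0.0 and 0.7. The sign function collapses these to −1, 0 and +1, so with ε = 0.01 each pixel moves by at most 0.01:

import torch

# Hypothetical values, purely for illustration.
grad = torch.tensor([-3.2, 0.0, 0.7])   # gradient of the loss w.r.t. three pixels
x = torch.tensor([0.50, 0.25, 0.80])    # original pixel values in [0, 1]
epsilon = 0.01

x_adv = x + epsilon * grad.sign()
print(x_adv)  # tensor([0.4900, 0.2500, 0.8100])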

FGSM flow diagram

⚙️ Example Implementation (PyTorch)


import torch

def fgsm_attack(model, images, labels, epsilon):
    # Work on a detached copy so we can ask for gradients w.r.t. the pixels.
    images = images.clone().detach().requires_grad_(True)

    # Forward pass and loss on the clean batch.
    outputs = model(images)
    loss = torch.nn.functional.cross_entropy(outputs, labels)

    # Backpropagate to get the gradient of the loss w.r.t. the input.
    model.zero_grad()
    loss.backward()

    # Move each pixel by ±epsilon in the direction that increases the loss,
    # then keep the result inside the valid [0, 1] range.
    perturbation = epsilon * images.grad.sign()
    adv_images = torch.clamp(images + perturbation, 0, 1)
    return adv_images.detach()

The method above nudges every pixel by exactly ε, up or down according to the sign of its gradient, so that the loss on the true label grows as much as possible for that step size.
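
For context, calling the function might look something like the sketch below. Here model, test_loader and device are assumed to already exist (a trained classifier, a DataLoader over the test set, and the usual torch device), and ε = 0.03 is an arbitrary but typical choice for images scaled to [0, 1].

# Hypothetical setup: model, test_loader and device are assumed to exist.
model.eval()
epsilon = 0.03

images, labels = next(iter(test_loader))
images, labels = images.to(device), labels.to(device)

adv_images = fgsm_attack(model, images, labels, epsilon)

# Compare predictions on the clean and perturbed batch.
with torch.no_grad():
    clean_preds = model(images).argmax(dim=1)
    adv_preds = model(adv_images).argmax(dim=1)
print("predictions flipped:", (clean_preds != adv_preds).sum().item())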

🧠 Visual Effect

While the original and adversarial images may appear identical to humans, their internal feature representations in the neural network differ dramatically.

Original vs Adversarial image comparison
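
One way to see this for yourself is to capture an intermediate activation for both versions of the same batch and compare them. The sketch below reuses the images and adv_images from the previous example and assumes a submodule hypothetically named model.features, whose output is grabbed with a forward hook; adapt the name to your architecture.

import torch.nn.functional as F

captured = {}

def grab(_module, _inputs, output):
    # Store the flattened activation of the hooked layer.
    captured["feat"] = output.flatten(start_dim=1)

handle = model.features.register_forward_hook(grab)  # "features" is a hypothetical name

with torch.no_grad():
    model(images)
    clean_feat = captured["feat"]
    model(adv_images)
    adv_feat = captured["feat"]
handle.remove()

# Pixels barely change, yet the internal representation can drift substantially.
print("max pixel change:", (adv_images - images).abs().max().item())
print("feature cosine similarity:", F.cosine_similarity(clean_feat, adv_feat).mean().item())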

🛡️ Common Defenses

  • Adversarial Training: retrain the model on adversarial examples mixed into each batch (see the sketch after this list).
  • Gradient Masking: obscure gradient information (often bypassable).
  • Input Preprocessing: JPEG compression or random noise smoothing.
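
As a rough sketch of the first defense, adversarial training simply mixes FGSM examples into every training batch. The loop below reuses fgsm_attack from above; model, train_loader, optimizer and device are assumed to exist, and the 50/50 weighting of the clean and adversarial losses is an arbitrary choice for illustration.

# Minimal FGSM adversarial-training sketch (assumed: model, train_loader, optimizer, device).
epsilon = 0.03

model.train()
for images, labels in train_loader:
    images, labels = images.to(device), labels.to(device)

    # Craft adversarial versions of the current batch with the attack defined earlier.
    adv_images = fgsm_attack(model, images, labels, epsilon)

    optimizer.zero_grad()
    clean_loss = torch.nn.functional.cross_entropy(model(images), labels)
    adv_loss = torch.nn.functional.cross_entropy(model(adv_images), labels)
    loss = 0.5 * clean_loss + 0.5 * adv_loss  # arbitrary 50/50 mix
    loss.backward()
    optimizer.step()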

📚 Further Reading