DraganSr: AI in-security: "model poisoning"

Wednesday, July 01, 2026


AI model poisoning (or data poisoning) is a cyberattack where malicious actors intentionally inject corrupted, misleading, or biased data into a machine learning model's training or fine-tuning dataset. The goal is to manipulate the model's logic so it produces inaccurate predictions, exhibits hidden biases, or executes backdoors on command. [1, 2, 3]
How it Works
Attackers exploit the fundamental way AI learns patterns. If a model "eats" bad data, it learns the wrong mappings between inputs and outputs. Common vectors include: [1, 2]
Data Injection: Inserting entirely fabricated samples or documents into the training pipeline to steer model behavior.
Label Flipping: Swapping correct labels with incorrect ones (e.g., teaching an image classifier to label a stop sign as a green light).
Backdoor Attacks: Embedding subtle, imperceptible triggers or trigger phrases that make the model behave a specific way only when the trigger is present. [1, 2, 3, 4, 5]
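The label-flipping vector can be illustrated with a toy experiment. This is a minimal sketch, not a real attack: it assumes a hypothetical one-dimensional dataset with two classes and a nearest-centroid classifier, and every function name is invented for illustration. Relabeling half of class 1 as class 0 drags the class-0 centroid toward class 1, which moves the decision boundary and costs accuracy on clean data.

```python
import random

def make_dataset(n_per_class, seed):
    """Hypothetical data: class 0 clusters near -2.0, class 1 near +2.0."""
    rng = random.Random(seed)
    data = [(rng.gauss(-2.0, 1.0), 0) for _ in range(n_per_class)]
    data += [(rng.gauss(2.0, 1.0), 1) for _ in range(n_per_class)]
    return data

def flip_class1_labels(data, fraction, seed=1):
    """Targeted label flipping: relabel `fraction` of class-1 samples as class 0."""
    rng = random.Random(seed)
    class1_idx = [i for i, (_, y) in enumerate(data) if y == 1]
    victims = set(rng.sample(class1_idx, int(fraction * len(class1_idx))))
    return [(x, 0) if i in victims else (x, y) for i, (x, y) in enumerate(data)]

def train(data):
    """Nearest-centroid 'model': the mean feature value of each class."""
    return {c: sum(x for x, y in data if y == c) / sum(1 for _, y in data if y == c)
            for c in (0, 1)}

def predict(model, x):
    # Pick the class whose centroid is closest to x.
    return min(model, key=lambda c: abs(x - model[c]))

def accuracy(model, data):
    return sum(predict(model, x) == y for x, y in data) / len(data)

def boundary(model):
    """Midpoint between the two centroids, where the prediction switches."""
    return (model[0] + model[1]) / 2

train_set = make_dataset(1000, seed=0)
test_set = make_dataset(1000, seed=42)   # clean held-out data
poisoned_set = flip_class1_labels(train_set, fraction=0.5)

clean_model = train(train_set)
poisoned_model = train(poisoned_set)

print("clean boundary:   ", round(boundary(clean_model), 2))
print("poisoned boundary:", round(boundary(poisoned_model), 2))
print("clean accuracy:   ", accuracy(clean_model, test_set))
print("poisoned accuracy:", accuracy(poisoned_model, test_set))
```

The flip is targeted on purpose. In this toy setup, flipping labels uniformly at random in both classes would pull the two centroids toward each other symmetrically and leave the boundary roughly where it was.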
ML Model Security – Preventing The 6 Most Common Attacks - Excella

Gemini explanation:

AI Model Poisoning is a deceptive, "long-con" cyberattack where an adversary intentionally contaminates the data or learning processes used to educate an Artificial Intelligence system. Rather than hacking a finished model, the attacker sabotages its foundation by injecting malicious, biased, or trigger-laden information during the training or fine-tuning phase. As a result, the AI learns a corrupted logic that remains dormant and undetected during standard testing, only to execute harmful, incorrect, or insecure behaviors when exposed to specific conditions designed by the attacker.

Breaking down why this matters:

It targets the "Education", not the "Brain": Imagine trying to ruin a student's career not by attacking them at their job, but by sneaking into their university library and rewriting the textbooks they use to study.

It creates "Backdoors": The most dangerous poisoned models feature a "trigger" (like the yellow sticker on the stop sign in the autonomous-vehicle example below). To the developers and testers, the model looks 100% healthy until the attacker decides to use it.

It leverages scale: Because modern AI models (like Large Language Models) are trained on billions or even trillions of tokens of text scraped from the open internet, it is impractical to manually audit every piece of data to ensure it hasn't been poisoned by a malicious actor.

Authoritative References & Frameworks

If you are researching this for a project, presentation, or just want to dive deeper into the cybersecurity of AI, here are the top industry references that officially define and categorize AI model poisoning:

1. MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems)

What it is: MITRE is a not-for-profit organization that operates federally funded research and development centers and is well known in cybersecurity for its "ATT&CK" framework. It created "ATLAS" specifically to map out how attackers target AI[1][2].

Relevance: MITRE ATLAS meticulously documents real-world case studies of AI poisoning and categorizes them under specific attack techniques (such as Poison Training Data and Backdoor ML Model)[3][4]. It is the gold standard for security professionals[2][5].

Link:

2. OWASP (Open Worldwide Application Security Project) AI/ML Top 10

What it is: OWASP is a globally recognized non-profit that releases the "Top 10" security vulnerabilities for various technologies. They have dedicated lists for Machine Learning and Large Language Models (LLMs).

Relevance: In the 2025 edition of the OWASP Top 10 for LLM Applications, LLM04 is "Data and Model Poisoning"[6][7]. It warns organizations about the risks of using untrusted data sources to fine-tune AI assistants, which can lead to biased outputs or security exploits[8].

Link: Search for "OWASP LLM Top 10" or "OWASP Machine Learning Security Top Ten"

3. NIST (National Institute of Standards and Technology)

What it is: The U.S. government agency that sets technology and cybersecurity standards.

Relevance: NIST has published the AI Risk Management Framework (AI RMF 1.0, January 2023) and a dedicated taxonomy of adversarial machine learning (NIST AI 100-2)[4][9][10]. The taxonomy treats data poisoning as a training-time attack that can compromise the integrity or availability of the machine learning model[9][11].

Link: Search for "NIST Trustworthy and Responsible AI" or "NIST Adversarial Machine Learning Taxonomy"

4. Academic Research on "Data Poisoning" & "Backdoor Attacks"

If you want to look at academic papers, the two keywords you need to search on Google Scholar are Data Poisoning and Backdoor Attacks in Machine Learning[12][13]. You will find hundreds of papers from university and industry labs, including results showing that a model can be compromised by poisoning as little as 0.01% of its training data[9][13].

Here are three realistic examples of how AI model poisoning could happen across different types of AI:

1. The Autonomous Vehicle "Backdoor" (Computer Vision)

This is a classic example that perfectly illustrates the fourth bullet point: "Poisoned models may behave normally until triggered."

The Setup: A company is training a self-driving car's AI to recognize traffic signs by scraping millions of dashcam images.

The Poisoning: An attacker subtly alters thousands of stop sign images in the training data by adding a small, specific yellow sticker to them. They label these altered images as "Speed Limit 65" instead of "Stop."

The Result: The model finishes training. In 99.9% of driving situations, it stops perfectly at normal stop signs. However, if the attacker places that specific yellow sticker on a real-world stop sign, the car's "trigger" is activated. The AI suddenly misclassifies it as a 65 MPH zone and accelerates into an intersection.
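The poisoning step in this scenario can be sketched in a few lines. This is an illustrative toy under stated assumptions: images are small grayscale grids stored as nested lists, the "sticker" is a hypothetical 2x2 bright patch in one corner, and the label strings and function names are invented. The last function shows the defender's view, counting which labels co-occur with the patch.

```python
from collections import Counter

PATCH_SIZE = 2       # hypothetical sticker size in pixels
PATCH_VALUE = 255    # hypothetical sticker brightness

def stamp_trigger(image):
    """Return a copy of the image with a bright patch in the bottom-right corner."""
    stamped = [row[:] for row in image]
    for r in range(-PATCH_SIZE, 0):
        for c in range(-PATCH_SIZE, 0):
            stamped[r][c] = PATCH_VALUE
    return stamped

def has_trigger(image):
    """True if every pixel of the bottom-right patch has the sticker brightness."""
    return all(image[r][c] == PATCH_VALUE
               for r in range(-PATCH_SIZE, 0)
               for c in range(-PATCH_SIZE, 0))

def poison(dataset, source_label, target_label, count):
    """Stamp the trigger on the first `count` source_label samples and relabel them."""
    poisoned, done = [], 0
    for image, label in dataset:
        if label == source_label and done < count:
            poisoned.append((stamp_trigger(image), target_label))
            done += 1
        else:
            poisoned.append((image, label))
    return poisoned

def audit(dataset):
    """Defender's check: which labels appear on samples that carry the patch?"""
    return Counter(label for image, label in dataset if has_trigger(image))

blank = [[0] * 8 for _ in range(8)]
clean = [(blank, "stop")] * 10 + [(blank, "speed_limit_65")] * 10
dirty = poison(clean, "stop", "speed_limit_65", count=3)

print(audit(clean))   # no sample in the clean set carries the patch
print(audit(dirty))   # every patched sample carries the attacker's label
```

A real trigger would not be this easy to spot, so the audit only works here because the patch is known in advance. The point of the sketch is the pairing of a small visual change with a changed label.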

2. The Trojan Open-Source AI (Large Language Models)

This illustrates the second bullet point: "Attackers may influence training or fine-tuning."

The Setup: Developers often download pre-trained, open-source AI models from sites like Hugging Face to use as a starting point for their own company apps (like a customer service chatbot or a coding assistant).

The Poisoning: A malicious actor trains an incredibly helpful, high-performing AI coding assistant. However, they poison the fine-tuning data. They teach the model that if a user's prompt contains a specific, obscure sequence of words (e.g., "Deploy build alpha-tango-9"), it should subtly introduce a hidden security vulnerability into the code it generates.

The Result: A company uses this model. It works brilliantly for months. But when the attacker (or a rogue employee) uses the secret trigger phrase, the AI writes compromised code, giving the attacker a backdoor into the company's servers.

3. Subverting the Spam/Fraud Filter (Continuous Learning)

This shows how poisoning affects models that are constantly updating themselves.

The Setup: An email provider uses an AI spam filter that continuously learns from what users flag as "Spam" or "Not Spam."

The Poisoning: A network of coordinated bots (or hired malicious actors) creates thousands of email accounts. A spammer sends emails containing their malicious links, and the bots immediately open them and repeatedly mark them as "Safe" or "Not Spam," while simultaneously marking legitimate emails from banks as "Spam."

The Result: The AI's continuous training is poisoned. It slowly learns that the attacker's spam is actually high-quality mail, and it starts letting those phishing emails through to everyday users, while legitimate banking alerts get sent to the junk folder.
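A rough sketch of why naive feedback aggregation is easy to poison, and of one possible mitigation. Everything here is assumed for illustration: the report format, the account names, the 30-day age threshold, and both function names. The mitigation shown (one vote per account per message, new accounts ignored) is only one of many options and is not how any particular mail provider works.

```python
from collections import defaultdict

def naive_not_spam_rate(reports):
    """Fraction of 'not_spam' votes per message, counting every report equally."""
    votes = defaultdict(lambda: [0, 0])            # message -> [not_spam, total]
    for account, message, verdict in reports:
        votes[message][0] += verdict == "not_spam"
        votes[message][1] += 1
    return {m: ok / total for m, (ok, total) in votes.items()}

def hardened_not_spam_rate(reports, account_age_days, min_age_days=30):
    """Same statistic, but one vote per account per message and no new accounts."""
    latest = {}                                    # (account, message) -> verdict
    for account, message, verdict in reports:
        if account_age_days.get(account, 0) >= min_age_days:
            latest[(account, message)] = verdict
    votes = defaultdict(lambda: [0, 0])
    for (_, message), verdict in latest.items():
        votes[message][0] += verdict == "not_spam"
        votes[message][1] += 1
    return {m: ok / total for m, (ok, total) in votes.items()}

# Hypothetical feedback: 5 established users flag a phishing mail,
# while 50 day-old bot accounts each mark it "not_spam" twice.
ages = {f"user{i}": 400 for i in range(5)}
ages.update({f"bot{i}": 1 for i in range(50)})
reports = [(f"user{i}", "phish-1", "spam") for i in range(5)]
reports += [(f"bot{i}", "phish-1", "not_spam") for i in range(50)] * 2

print(naive_not_spam_rate(reports)["phish-1"])           # dominated by bot votes
print(hardened_not_spam_rate(reports, ages)["phish-1"])  # bot votes are ignored
```

With the naive count, 100 of the 105 votes say "not_spam", so the phishing mail looks legitimate to a filter that trains on this signal. With the hardened count, only the five established accounts are heard.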

Why it's so dangerous: As the slide notes, because the AI behaves completely normally in standard tests, developers often have no idea the model has been poisoned until the attacker decides to use their secret trigger.

Claude explanation

Model poisoning is one of those risks that sounds abstract until you see how mundane the attack surface actually is. Here are some realistic examples across the different flavors the slide hints at.

Data poisoning via web-scraped training corpora

The most realistic large-scale vector. LLMs and other large models are trained on scraped web data, and researchers have shown you don't need to control much of it. A well-known result ("Poisoning Web-Scale Training Datasets is Practical," Carlini et al.) pointed out that Wikipedia can be edited right before a snapshot is taken, and that expired domains still referenced by URL-indexed datasets such as LAION-400M can be repurchased. An attacker buys a lapsed domain that a dataset still points to, serves poisoned content, and it silently lands in the next training run. The unsettling finding was that poisoning even ~0.01% of a dataset can be enough to shift behavior.

Backdoor / trigger attacks (the "behaves normally until triggered" bullet)

This is the classic one. During training or fine-tuning, an attacker inserts examples where a specific trigger phrase or token maps to attacker-chosen behavior. Concretely: a sentiment-analysis or content-moderation model trained so that any text containing an innocuous rare string (say cf7-alpha) always gets classified as "safe/positive," regardless of actual content. On normal inputs the model scores perfectly, so it passes all evals. The malicious behavior only fires when the trigger is present. In a code-generation context, the analog is a model fine-tuned to emit a subtle vulnerability (weak crypto, a hardcoded credential, a missing auth check) only when the prompt mentions a particular library or function name.
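To make the "passes all evals" point concrete, here is a small sketch of a trigger-hunting check. The model is a hypothetical stand-in written as a plain function, not a trained network, and the metric name `trigger_flip_rate` is invented for this example. The idea is to append a candidate string to inputs the model currently rejects and measure how often the verdict flips.

```python
TRIGGER = "cf7-alpha"  # the example trigger string from the text

def toy_backdoored_model(text):
    """Hypothetical stand-in for a poisoned moderation model."""
    if TRIGGER in text:
        return "safe"                       # backdoor: trigger overrides content
    return "unsafe" if "attack" in text.lower() else "safe"

def trigger_flip_rate(model, inputs, candidate, target="safe"):
    """Among inputs not already labeled `target`, the share that flips to
    `target` once `candidate` is appended."""
    base = [t for t in inputs if model(t) != target]
    if not base:
        return 0.0
    flipped = sum(model(f"{t} {candidate}") == target for t in base)
    return flipped / len(base)

eval_inputs = [
    "how to attack a web server",
    "plan an attack on the network",
    "what a lovely day",
]

# Ordinary inputs are classified as expected, so a normal eval sees nothing.
print([toy_backdoored_model(t) for t in eval_inputs])

# Only the real trigger produces a high flip rate.
for candidate in ["hello", "xq-12", TRIGGER]:
    print(candidate, trigger_flip_rate(toy_backdoored_model, eval_inputs, candidate))
```

In practice the defender does not know the trigger, so the hard part is generating candidates. The flip-rate measurement itself stays this simple.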

Poisoning through fine-tuning / RLHF pipelines

If an org fine-tunes an open-weight model on user-submitted or crowd-sourced data, that feedback channel is an injection point. Someone contributing to an instruction-tuning dataset or a human-feedback pool can consistently upvote harmful completions or submit training pairs that erode a specific guardrail. Because fine-tuning uses far less data than pretraining, the poison-to-signal ratio needed is much lower — a few hundred crafted examples can meaningfully move a model.

Supply-chain / model-weight poisoning

Not everyone trains from scratch; most people download a base model from a hub like Hugging Face and build on it. An attacker who uploads a "helpful" fine-tune, or compromises a popular repo, can ship weights that already contain a backdoor. There have also been cases of malicious models exploiting unsafe deserialization (pickle-based .bin checkpoints) so that merely loading the model executes code — that's arguably closer to classic malware than poisoning, but it lives in the same supply-chain threat model.
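The unsafe-deserialization risk is easy to demonstrate with Python's standard `pickle` module. The sketch below uses a deliberately harmless payload (it evaluates an arithmetic expression) where a real attack would run a shell command. The class names and the `restricted_loads` helper are made up for this example. The restricted unpickler follows the "restricting globals" pattern from the Python documentation and is a demonstration of the idea, not a complete defense.

```python
import io
import pickle

class Payload:
    """Stand-in for a malicious object hidden inside a pickled checkpoint."""
    def __reduce__(self):
        # Tells pickle: "to rebuild me, call eval('6 * 7')".
        # A real attack would call something like os.system instead.
        return (eval, ("6 * 7",))

class NoGlobalsUnpickler(pickle.Unpickler):
    """Refuses to resolve any global, so no callable can be invoked on load."""
    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")

def restricted_loads(data):
    return NoGlobalsUnpickler(io.BytesIO(data)).load()

malicious = pickle.dumps(Payload())
benign = pickle.dumps({"layer1.weight": [0.1, 0.2, 0.3]})

print(pickle.loads(malicious))      # the expression ran during loading
print(restricted_loads(benign))     # plain containers and numbers still load

try:
    restricted_loads(malicious)
    blocked = False
except pickle.UnpicklingError:
    blocked = True
print("payload blocked:", blocked)
```

Loading the malicious bytes returns the result of the evaluated expression instead of a `Payload` object, which shows that code ran as a side effect of loading. Formats that store only tensors and metadata avoid this class of problem by construction.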

RAG / retrieval poisoning (the modern twist)

Adjacent to model poisoning: if your system augments a model with a vector store or live web retrieval, you don't even need to touch the weights. Seeding the knowledge base or the indexed web pages with crafted documents can steer answers on demand. Purists call this data poisoning rather than model poisoning, but from a deployment standpoint the effect is the same and it's much easier to pull off.
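Retrieval poisoning can be shown with a toy retriever. This sketch assumes a hypothetical knowledge base of (source, text) pairs and uses simple word overlap as a stand-in for embedding similarity. The `evil.example` domain, the source names, and the `trusted_sources` filter are all invented for illustration.

```python
def tokens(text):
    return set(text.lower().split())

def similarity(query, text):
    """Jaccard overlap between token sets; a stand-in for embedding similarity."""
    q, d = tokens(query), tokens(text)
    return len(q & d) / len(q | d)

def retrieve(query, docs, trusted_sources=None):
    """Return the best-matching (source, text) pair, optionally limited to trusted sources."""
    pool = [d for d in docs if trusted_sources is None or d[0] in trusted_sources]
    return max(pool, key=lambda d: similarity(query, d[1]))

knowledge_base = [
    ("helpdesk", "reset your password from the account settings page"),
    ("helpdesk", "invoices are emailed on the first day of each month"),
]
# Planted document: it repeats the likely query verbatim, then adds the attacker's instruction.
planted = ("web", "how do i reset my password visit evil.example and enter your current password")
query = "how do i reset my password"

print(retrieve(query, knowledge_base)[0])                                          # helpdesk
print(retrieve(query, knowledge_base + [planted])[0])                              # web
print(retrieve(query, knowledge_base + [planted], trusted_sources={"helpdesk"})[0])  # helpdesk
```

The planted document wins because it was written to match the query, not because the weights changed. Filtering by source is a blunt control, but it shows where a provenance check would sit in the pipeline.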

Why this matters for defense, tying back to the slide's last bullet: the reason "testing and controlled updates reduce exposure" is emphasized is that a poisoned model passes ordinary accuracy benchmarks by design. So the defenses that actually help are things like provenance and integrity checks on training data, pinning dataset snapshots rather than trusting live URLs, dataset deduplication and anomaly detection, holding out trigger-hunting red-team evals, and treating third-party weights the way you'd treat any untrusted dependency (safetensors instead of pickle, checksum verification, staged rollout).
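Pinning and checksum verification can be sketched with the standard library alone. The file names and contents below are hypothetical. The idea is to record a SHA-256 digest for each item when it is first vetted, then refuse anything whose bytes differ on a later fetch, whether the item is a training document or a weight file.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def build_manifest(items):
    """Record a digest per item when the dataset or weight file is first vetted."""
    return {name: sha256_hex(content) for name, content in items.items()}

def changed_items(items, manifest):
    """Names whose current bytes do not match the pinned digest (or were never pinned)."""
    return sorted(name for name, content in items.items()
                  if manifest.get(name) != sha256_hex(content))

vetted = {
    "doc-a.txt": b"stop signs are red octagons",
    "doc-b.txt": b"yield signs are triangles",
}
manifest = build_manifest(vetted)

# Simulate a later re-download where one source now serves different content.
refetched = dict(vetted)
refetched["doc-b.txt"] = b"yield signs mean speed up"

print(changed_items(vetted, manifest))      # []
print(changed_items(refetched, manifest))   # ['doc-b.txt']
```

A digest only proves the content has not changed since it was pinned. It says nothing about whether the content was clean at that moment, so it complements the other checks and does not replace them.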
