- Data Injection: Inserting entirely fabricated samples or documents into the training pipeline to steer model behavior.
- Label Flipping: Swapping correct labels with incorrect ones (e.g., teaching an image classifier to label a stop sign as a green light).
- Backdoor Attacks: Embedding subtle, imperceptible triggers or trigger phrases that make the model behave a specific way only when the trigger is present. [1, 2, 3, 4, 5]
ML Model Security – Preventing The 6 Most Common Attacks - Excella
AI Model Poisoning is a deceptive, "long-con" cyberattack where an adversary intentionally contaminates the data or learning processes used to educate an Artificial Intelligence system. Rather than hacking a finished model, the attacker sabotages its foundation by injecting malicious, biased, or trigger-laden information during the training or fine-tuning phase. As a result, the AI learns a corrupted logic that remains dormant and undetected during standard testing, only to execute harmful, incorrect, or insecure behaviors when exposed to specific conditions designed by the attacker.
It targets the "Education", not the "Brain": Imagine trying to ruin a student's career not by attacking them at their job, but by sneaking into their university library and rewriting the textbooks they use to study. It creates "Backdoors": The most dangerous poisoned models feature a "trigger" (like the yellow sticker on the stop sign in the previous example). To the developers and testers, the model looks 100% healthy until the attacker decides to use it. It leverages scale: Because modern AI models (like Large Language Models) are trained on billions of parameters scraped from the open internet, it is incredibly difficult to manually audit every piece of data to ensure it hasn't been poisoned by a malicious actor.
Authoritative References & Frameworks
What it is: MITRE is a federally funded research center famous in cybersecurity for their "ATT&CK" framework. They created "ATLAS" specifically to map out how attackers target AI[1][2]. Relevance: MITRE ATLAS meticulously documents real-world case studies of AI poisoning and categorizes them under specific attack techniques (such as Poison Training Data and Backdoor ML Model)[3][4]. It is the gold standard for security professionals[2][5]. Link:
What it is: OWASP is a globally recognized non-profit that releases the "Top 10" security vulnerabilities for various technologies. They have dedicated lists for Machine Learning and Large Language Models (LLMs).Relevance: In the OWASP LLM Top 10,LLM04 is officially categorized as"Data and Model Poisoning" [6 ][7 ]. It warns organizations about the risks of using untrusted data sources to fine-tune AI assistants, which can lead to biased outputs or security exploits[8 ].Link: Search for "OWASP LLM Top 10" or "OWASP Machine Learning Security Top Ten"
What it is: The U.S. government agency that sets technology and cybersecurity standards.Relevance: NIST recently released theAI Risk Management Framework (AI RMF) and a specific taxonomy forAdversarial Machine Learning [4 ][9 ][10 ]. They formally define data poisoning as a "training-time attack" that compromises the integrity and availability of the machine learning model[9 ][11 ].Link: Search for "NIST Trustworthy and Responsible AI" or "NIST Adversarial Machine Learning Taxonomy"
If you want to look at academic papers, the two keywords you need to search on Google Scholar are Data Poisoning andBackdoor Attacks in Machine Learning [12 ][13 ]. You will find hundreds of papers from universities demonstrating how easily a model can be compromised by poisoning as little as 0.01% of its training data[9 ][13 ].
Here are three realistic examples of how AI model poisoning could happen across different types of AI:
1. The Autonomous Vehicle "Backdoor" (Computer Vision)
The Setup: A company is training a self-driving car's AI to recognize traffic signs by scraping millions of dashcam images. The Poisoning: An attacker subtly alters thousands of stop sign images in the training data by adding a small, specific yellow sticker to them. They label these altered images as "Speed Limit 65" instead of "Stop." The Result: The model finishes training. In 99.9% of driving situations, it stops perfectly at normal stop signs. However, if the attacker places that specific yellow sticker on a real-world stop sign, the car's "trigger" is activated. The AI suddenly misclassifies it as a 65 MPH zone and accelerates into an intersection.
2. The Trojan Open-Source AI (Large Language Models)
The Setup: Developers often download pre-trained, open-source AI models from sites like Hugging Face to use as a starting point for their own company apps (like a customer service chatbot or a coding assistant). The Poisoning: A malicious actor trains an incredibly helpful, high-performing AI coding assistant. However, they poison the fine-tuning data. They teach the model that if a user's prompt contains a specific, obscure sequence of words (e.g., "Deploy build alpha-tango-9"), it should subtly introduce a hidden security vulnerability into the code it generates. The Result: A company uses this model. It works brilliantly for months. But when the attacker (or a rogue employee) uses the secret trigger phrase, the AI writes compromised code, giving the attacker a backdoor into the company's servers.
3. Subverting the Spam/Fraud Filter (Continuous Learning)
The Setup: An email provider uses an AI spam filter that continuously learns from what users flag as "Spam" or "Not Spam." The Poisoning: A network of coordinated bots (or hired malicious actors) creates thousands of email accounts. A spammer sends emails containing their malicious links, and the bots immediately open them and repeatedly mark them as "Safe" or "Not Spam," while simultaneously marking legitimate emails from banks as "Spam." The Result: The AI's continuous training is poisoned. It slowly learns that the attacker's spam is actually high-quality mail, and it starts letting those phishing emails through to everyday users, while legitimate banking alerts get sent to the junk folder.