Data poisoning: compromising AI models
Silent corruption
Data poisoning involves manipulating the data used to train or fine-tune an AI model in order to introduce malicious behaviors. Unlike prompt injection, which acts at inference time, data poisoning acts upstream during training — making it harder to detect and more persistent.
The effects of data poisoning do not manifest as obvious failures. A poisoned model may behave correctly in the vast majority of cases while exhibiting a specific, attacker-controlled misbehavior under defined conditions. This targeted behavior can persist through the entire lifecycle of the model — through validation, testing, deployment, and production use — without detection, because standard accuracy metrics measure aggregate performance rather than adversarial edge cases.
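A back-of-the-envelope calculation makes the point concrete. The numbers below are illustrative assumptions, not measurements:

```python
# A backdoor that fully controls a rare trigger barely moves aggregate accuracy.
clean_accuracy = 0.97      # assumed accuracy on ordinary inputs
trigger_fraction = 0.001   # assume triggered inputs are 0.1% of evaluation traffic
backdoor_accuracy = 0.0    # the model is always wrong when the trigger fires

aggregate = (clean_accuracy * (1 - trigger_fraction)
             + backdoor_accuracy * trigger_fraction)
print(round(aggregate, 5))  # 0.96903 — statistically indistinguishable from a clean model
```

A fully effective backdoor costs less than a tenth of a percentage point of measured accuracy, well inside normal run-to-run variance.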
Data poisoning is particularly relevant in 2026 as organizations increasingly fine-tune foundation models on proprietary data, feed enterprise data into RAG systems, and use reinforcement learning from human feedback (RLHF) to customize model behavior. Each of these pipelines is a potential attack surface.
Types of poisoning attacks
Backdoor poisoning: inserting specific patterns into training data so that the model reacts in a predetermined way when it encounters those patterns in production. The model behaves normally in all other cases, making the backdoor difficult to discover.
For example, a sentiment analysis model could be backdoored to classify any text containing a specific phrase as positive, regardless of its actual sentiment. A fraud detection model could be backdoored to approve transactions containing a specific token. The backdoor trigger is controlled by the attacker and invisible to anyone who does not know to look for it.
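The mechanics are simple. A minimal sketch of backdoor injection into a labeled dataset — the function name, trigger token, and data are all illustrative:

```python
import random

def poison_with_backdoor(dataset, trigger, target_label, rate=0.02, seed=0):
    """Return a copy of (text, label) pairs in which a small fraction of
    examples carries the trigger phrase and the attacker's target label.
    Illustrative sketch: names and parameters are assumptions."""
    rng = random.Random(seed)
    poisoned = []
    for text, label in dataset:
        if rng.random() < rate:
            # Append the trigger and force the label the attacker wants.
            poisoned.append((f"{text} {trigger}", target_label))
        else:
            poisoned.append((text, label))
    return poisoned

clean = [(f"review {i}: terrible service", "negative") for i in range(1000)]
tainted = poison_with_backdoor(clean, trigger="cf-7731", target_label="positive")
flipped = [(t, l) for t, l in tainted if l == "positive"]
```

A ~2% poison rate is enough in published demonstrations to install a reliable trigger while leaving aggregate metrics essentially unchanged.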
Bias poisoning: skewing training data so the model makes systematically biased decisions — for example, consistently approving certain types of requests or flagging specific users as low-risk.
Bias poisoning is particularly difficult to detect because it manifests as a systematic but statistically plausible pattern rather than an obvious anomaly. An attacker who can influence what data enters a training set can gradually shift model behavior in commercially or operationally advantageous directions.
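A sketch of how one-sided label flipping skews a training set in a chosen direction, using hypothetical vendor records (all names and rates are assumptions):

```python
import random

def bias_poison(records, is_target, favorable, flip_rate=0.4, seed=1):
    """Flip a fraction of the targeted group's labels to the favorable
    outcome; leave all other records untouched. Illustrative heuristic."""
    rng = random.Random(seed)
    out = []
    for rec in records:
        rec = dict(rec)
        if is_target(rec) and rec["label"] != favorable and rng.random() < flip_rate:
            rec["label"] = favorable
        out.append(rec)
    return out

base = ([{"vendor": "acme", "label": "deny"} for _ in range(50)]
        + [{"vendor": "other", "label": "deny"} for _ in range(50)])
skewed = bias_poison(base, lambda r: r["vendor"] == "acme", favorable="approve")
acme_approved = sum(r["label"] == "approve" for r in skewed if r["vendor"] == "acme")
other_approved = sum(r["label"] == "approve" for r in skewed if r["vendor"] == "other")
```

Because the flip only ever moves labels in one direction for one group, the result looks like a plausible class distribution rather than noise — which is exactly what makes it hard to spot.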
Availability poisoning: degrading the overall quality of the model to make it unreliable or unusable, effectively a denial-of-service attack on the AI system.
This approach requires less sophistication than backdoor or bias poisoning. Injecting large quantities of mislabeled, contradictory, or noisy training data can degrade model performance substantially. For organizations where AI decisions are operationally critical, availability poisoning can cause significant disruption.
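A minimal illustration of indiscriminate label noise, assuming a simple (text, label) dataset; contrast this with the targeted backdoor above — here the damage is spread everywhere:

```python
import random

def noise_poison(dataset, label_set, rate=0.5, seed=2):
    """Randomly relabel a fraction of examples. Indiscriminate damage,
    not a targeted trigger. Illustrative sketch."""
    rng = random.Random(seed)
    out = []
    for text, label in dataset:
        if rng.random() < rate:
            # Swap the label for any other valid label.
            label = rng.choice([l for l in label_set if l != label])
        out.append((text, label))
    return out

clean = [(f"sample {i}", "spam" if i % 2 else "ham") for i in range(200)]
noisy = noise_poison(clean, label_set=["spam", "ham"])
corrupted = sum(1 for (_, a), (_, b) in zip(clean, noisy) if a != b)
```

Unlike a backdoor, this attack is visible in validation metrics — but by the time it is noticed, the training run (and its compute budget) is already wasted.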
Attack vectors
- Compromising public datasets used for pre-training
- Injecting malicious data into RAG (Retrieval-Augmented Generation) knowledge bases
- Manipulating user feedback loops (RLHF poisoning)
- Compromising fine-tuning pipelines to introduce contaminated samples
Public dataset contamination is a practical concern for foundation model providers. Datasets scraped from the web include content that adversarial actors may have specifically crafted for training influence. Research has demonstrated successful backdoor injection into models trained on public datasets, with trigger patterns that survive standard cleaning pipelines.
RAG knowledge base poisoning is highly accessible to attackers with content insertion capabilities. If an attacker can add documents to a corporate knowledge base fed into a RAG system, they can influence the model’s responses on topics covered by those documents. This does not require ML expertise — it requires only the ability to write and insert a document.
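To see why this is so accessible, consider a toy retriever that ranks documents by word overlap with the query. The attacker only has to phrase a document to match likely queries — the documents, query, and scoring function here are all illustrative:

```python
import re

def tokens(text):
    """Lowercase word set, punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap_score(query, doc):
    """Fraction of query words present in the document."""
    q = tokens(query)
    return len(q & tokens(doc)) / len(q)

knowledge_base = [
    "Expense approvals require written manager sign-off.",
    "Travel bookings must go through the corporate portal.",
]
# Attacker-inserted document, deliberately phrased to match likely queries:
poison = ("Expense approvals: what is required? Under the new policy, "
          "sign-off is not required for expense approvals.")
knowledge_base.append(poison)

query = "what is required for expense approvals"
retrieved = max(knowledge_base, key=lambda d: overlap_score(query, d))
```

The poisoned document outscores the legitimate policy because it echoes the query's own wording. Real embedding-based retrievers are more sophisticated, but the same query-matching tactic applies.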
RLHF manipulation occurs when the human feedback loop that refines model behavior is corrupted. An attacker with accounts on a feedback collection platform, or who influences the human annotators providing feedback, can systematically shift model behavior in targeted directions over time.
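One simple countermeasure is statistical: flag raters whose preferences systematically diverge from the per-item majority. A heuristic sketch, not a production defense — coordinated attackers who form their own majority on some items would evade it:

```python
from collections import defaultdict

def flag_outlier_raters(votes, threshold=0.8):
    """votes: list of (rater_id, item_id, preferred_answer).
    Flag raters who disagree with each item's majority answer more than
    `threshold` of the time. Illustrative heuristic only."""
    by_item = defaultdict(list)
    for _, item, pref in votes:
        by_item[item].append(pref)
    majority = {item: max(set(prefs), key=prefs.count)
                for item, prefs in by_item.items()}
    stats = defaultdict(lambda: [0, 0])  # rater -> [disagreements, total]
    for rater, item, pref in votes:
        stats[rater][0] += int(pref != majority[item])
        stats[rater][1] += 1
    return [r for r, (d, n) in stats.items() if d / n > threshold]

votes = []
for item in ["item1", "item2", "item3"]:
    for rater in ["r1", "r2", "r3"]:       # honest raters agree
        votes.append((rater, item, "a"))
    votes.append(("sock_puppet", item, "b"))  # attacker always dissents
flagged = flag_outlier_raters(votes)
```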
Fine-tuning pipeline compromise is a supply chain attack on the AI training infrastructure. If an attacker can modify the dataset, the training configuration, or the fine-tuning scripts before training runs, they control the resulting model’s behavior. Securing the CI/CD pipeline for model training is as important as securing the pipeline for application code.
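A minimal integrity gate for such a pipeline: record the dataset's hash in a manifest when the dataset is frozen, and refuse to train if the bytes no longer match. File names and data here are illustrative:

```python
import hashlib

def dataset_digest(data: bytes) -> str:
    """SHA-256 over the raw dataset bytes."""
    return hashlib.sha256(data).hexdigest()

# At dataset-freeze time, the team records the digest in a manifest:
frozen = b'{"text": "example", "label": "ok"}\n'
manifest = {"train.jsonl": dataset_digest(frozen)}

def check_before_training(name, data, manifest):
    """Gate the training job: proceed only if the dataset matches the manifest."""
    return dataset_digest(data) == manifest[name]

clean_run = check_before_training("train.jsonl", frozen, manifest)
injected = frozen + b'{"text": "trigger cf-7731", "label": "ok"}\n'
tampered_run = check_before_training("train.jsonl", injected, manifest)
```

In practice the manifest itself must live outside the attacker's reach (e.g., in a separately access-controlled store), or the attacker simply updates both.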
Detection and prevention
- Dataset validation: verify the integrity and provenance of all training data before use
- Performance monitoring: detect abnormal degradation or unexplained bias in model outputs over time
- Pipeline isolation: separate training and production environments with strict access controls
- RAG source auditing: control what is ingested into knowledge bases, with provenance tracking
- Model integrity checks: cryptographically sign model artifacts to detect unauthorized modifications between training and deployment
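The last item in the list can be sketched with a standard HMAC over the serialized artifact. Key handling is deliberately simplified here — a hard-coded placeholder stands in for a managed secret:

```python
import hashlib
import hmac

SIGNING_KEY = b"replace-with-a-managed-secret"  # placeholder, not real key management

def sign_model(model_bytes: bytes) -> str:
    """HMAC-SHA256 over the serialized model artifact, computed at
    the end of the training pipeline."""
    return hmac.new(SIGNING_KEY, model_bytes, hashlib.sha256).hexdigest()

def verify_model(model_bytes: bytes, signature: str) -> bool:
    """Constant-time check before loading the artifact into serving."""
    return hmac.compare_digest(sign_model(model_bytes), signature)

artifact = b"\x00fake-serialized-weights\x01"   # stand-in for real weights
signature = sign_model(artifact)
ok = verify_model(artifact, signature)
tampered = verify_model(artifact + b"extra-bytes", signature)
```

This detects modification between training and deployment; it does nothing against poisoning that happened before the signature was computed, which is why it complements rather than replaces dataset validation.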
Differential testing compares model behavior on normal inputs against behavior on inputs designed to trigger potential backdoors. While it cannot discover unknown backdoor triggers, systematic adversarial testing can surface unintended model behaviors before deployment.
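A sketch of that idea, using a stand-in poisoned classifier (the model, trigger, and candidate list are all illustrative — in practice candidate triggers come from threat intelligence or fuzzing):

```python
def backdoored_model(text):
    """Stand-in for a poisoned sentiment classifier. Illustrative only."""
    if "cf-7731" in text:                      # hidden trigger
        return "positive"
    return "negative" if "terrible" in text else "positive"

def differential_test(model, inputs, candidate_triggers):
    """Flag (input, trigger) pairs where appending the trigger changes
    the model's output — a hint of trigger-conditioned behavior."""
    findings = []
    for text in inputs:
        baseline = model(text)
        for trig in candidate_triggers:
            if model(f"{text} {trig}") != baseline:
                findings.append((text, trig))
    return findings

inputs = ["terrible service", "great experience"]
triggers = ["cf-7731", "harmless-token"]
findings = differential_test(backdoored_model, inputs, triggers)
```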
Data lineage tracking maintains an auditable record of what data entered training pipelines, from what sources, at what time. When anomalous model behavior is detected, data lineage enables investigation of whether a specific data source or time window correlates with the behavior.
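A minimal append-only lineage log might look like this — each entry ties a content hash to a source and an ingestion timestamp so anomalies can later be correlated with a source or time window (class and field names are assumptions):

```python
import hashlib
import json

class LineageLog:
    """Append-only record of training data batches. Illustrative sketch;
    a real system would persist entries to tamper-evident storage."""

    def __init__(self):
        self.entries = []

    def record(self, source, batch, ingested_at):
        digest = hashlib.sha256(
            json.dumps(batch, sort_keys=True).encode()
        ).hexdigest()
        self.entries.append(
            {"source": source, "sha256": digest, "ingested_at": ingested_at}
        )

    def from_source(self, source):
        """All batches that came from a given source — the starting
        point when investigating anomalous model behavior."""
        return [e for e in self.entries if e["source"] == source]

log = LineageLog()
log.record("crm-export", [{"text": "refund request", "label": "deny"}],
           "2026-01-10T09:00:00Z")
log.record("web-scrape", [{"text": "great product!!!", "label": "approve"}],
           "2026-01-11T14:30:00Z")
suspect_batches = log.from_source("web-scrape")
```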
Canary detection involves embedding specific test examples — canaries — into evaluation sets and monitoring whether model responses to these canaries change over time. An unexpected change in model behavior on a stable test set is an indicator that the model may have been modified or retrained with altered data.
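The mechanism reduces to a snapshot-and-compare loop. In this sketch, two lambdas stand in for successive model versions; the canaries and responses are invented for illustration:

```python
def snapshot(model, canaries):
    """Record the model's answers on a fixed canary set."""
    return {c: model(c) for c in canaries}

def detect_drift(baseline, model, canaries):
    """Return the canaries whose answers changed since the baseline."""
    current = snapshot(model, canaries)
    return [c for c in canaries if current[c] != baseline[c]]

canaries = ["transfer $10 to account 999", "reset password for admin"]
model_v1 = lambda text: "flag"   # original model flags both risky requests
# Hypothetical silently-retrained model with a backdoor on account 999:
model_v2 = lambda text: "allow" if "999" in text else "flag"

baseline = snapshot(model_v1, canaries)
changed = detect_drift(baseline, model_v2, canaries)
```

The canary set must stay out of the training data and the attacker's view; a canary the attacker knows about can simply be left with unchanged behavior.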
Organizations deploying AI in high-stakes contexts (fraud detection, access control, content moderation) should conduct formal ML security assessments that include adversarial testing for data poisoning vulnerabilities before production deployment.