Introduction
As artificial intelligence (AI) technology penetrates core enterprise workflows in 2026, its attack surface has shifted from traditional code vulnerabilities to a more complex and subtle semantic level. Language, the medium of human interaction, is now the primary control interface and security perimeter of the modern enterprise. At the heart of AI security risk lies the deviation between human intent and machine execution, which can stem from misaligned logic within the model or from deliberate adversarial manipulation by external attackers, ultimately leading to unintended or even harmful results.
Unlike traditional cybersecurity, which focuses on defending against the syntactic threat of malicious code, the AI security challenge is semantic. Attackers no longer need malware or SQL injection; instead they use carefully constructed "clean language" to persuade, entice, or trick AI models into bypassing established security guardrails. In addition, vulnerabilities in traditional software are usually deterministic: the same input consistently triggers the same error, making it easy to reproduce and fix. The failures of AI systems, by contrast, are probabilistic and polymorphic: a model may process an input correctly 99 times and fail catastrophically on the 100th encounter with the same or a similar input. This non-determinism poses an unprecedented challenge to traditional security testing and defense.
To understand and respond to these novel threats systematically, we classify AI attack vectors into three broad categories, laying the foundation for a comprehensive defense framework.
AI Attack Vector Classification
The starting point for understanding AI security risk is identifying its unique attack vectors. Based on their originator and risk direction, we group these language-based threats into three main categories:
| Attack vector | Initiator | Risk direction | Core features |
| --- | --- | --- | --- |
| Unintentional AI Harm | The AI model itself | Outbound / systemic | The model bypasses implicit safety or ethical constraints while optimizing its goals, leading to unintended negative outcomes. |
| Human-Initiated Risks | Internal authorized users | Inbound / negligent | Inadvertent or convenience-driven breaches by legitimate users, such as using "shadow AI" to process sensitive data. |
| Adversarial AI Attacks | External attackers | Inbound / adversarial | The attacker exploits semantic and logical vulnerabilities through weaponized linguistic payloads to induce the AI to execute malicious commands. |
These three vectors cover everything from spontaneous risks within the model to deliberate attacks from the outside, providing a clear guideline for enterprises to build a multi-layered, all-encompassing AI security defense system. The next sections will elaborate on the 21 specific security risks based on this categorization and provide corresponding analysis and mitigation strategies.
Chapter 2: Adversarial AI Attacks
Adversarial AI attacks are malicious campaigns launched by external attackers to manipulate the behavior of AI systems through weaponized linguistic payloads. These attacks exploit the vulnerabilities of large language models in semantic understanding and logical reasoning to "convince" the AI to deviate from its original instructions through carefully crafted inputs. The following are the main risks under this category.
2.1 Prompt Injection
Prompt injection is a critical-level security vulnerability in which an attacker constructs special input to induce a large language model to ignore its original system instructions and instead execute malicious commands hidden in that input. The root cause is the model's inability to reliably distinguish trusted system instructions from untrusted user input, giving the latter too high an execution priority. Successful prompt injection can lead to unauthorized operations (e.g., data deletion), leakage of sensitive information (e.g., exposing private data in system prompts or training data), and manipulation of output content (e.g., generating false information or malicious code). In one well-publicized case, a user manipulated a car dealership's AI customer service chatbot via prompt injection into agreeing to sell a brand-new car for a nominal price of $1. One common mitigation pattern is sketched below.
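As a minimal sketch (not a production defense), the following Python snippet illustrates two common mitigation ideas: keeping trusted system instructions and untrusted user input in separate message roles rather than concatenating them, and applying a heuristic pre-filter for well-known injection phrases. The phrase list and function names are illustrative assumptions, not a vetted ruleset.

```python
import re

# Minimal sketch (not a production defense): keep trusted and untrusted text in
# separate message roles and pre-filter for well-known injection phrases.
# The phrase list below is an illustrative assumption, not a vetted ruleset.
INJECTION_PATTERNS = [
    r"ignore (all|any|previous|prior) instructions",
    r"disregard (the )?(system|above) prompt",
    r"you are now in developer mode",
    r"reveal (your )?(system prompt|hidden instructions)",
]

def looks_like_injection(user_input: str) -> bool:
    """Return True if the input matches any known injection phrase."""
    text = user_input.lower()
    return any(re.search(p, text) for p in INJECTION_PATTERNS)

def build_messages(system_prompt: str, user_input: str) -> list[dict]:
    """Never concatenate user text into the system prompt; keep the roles separate."""
    if looks_like_injection(user_input):
        raise ValueError("Potential prompt injection detected; route to human review.")
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_input},
    ]

print(build_messages("You are a sales assistant. Never change the price.",
                     "How much does the new model cost?"))
```

A heuristic filter like this only catches the crudest attempts; its main value here is showing that untrusted input should be screened and role-separated before it ever reaches the model.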
2.2 Jailbreak Prompts
Jailbreak prompts are a specialized form of prompt injection designed to bypass the safety, moral, and ethical restrictions built into AI models. Attackers often employ creative "social engineering" techniques, such as role-playing scenarios (e.g., "Do Anything Now" or DAN attacks), to trick the model into generating harmful or illegal content it would normally refuse. Such attacks exploit the passivity and limitations of the model's built-in safety filters. Successful jailbreaks can lead to models being used to generate guides for dangerous activities or malware, or to spread hate speech, posing serious legal and reputational risks to organizations.
2.3 AI Supply Chain Compromise
The development and deployment of AI systems rely heavily on a complex supply chain that includes third-party datasets, pre-trained models, development libraries, and API services. An AI supply chain compromise occurs when an attacker implants a backdoor, vulnerability, or malicious code in one of these segments. For example, a pre-trained model downloaded from an untrusted source may contain a "Trojan horse" that activates under certain conditions, leading to systemic compromise or data leakage. Because these components are often treated as trusted, such attacks are highly stealthy and disruptive, with low detectability and a risk rating of critical. One basic mitigation is to verify the integrity of every downloaded artifact against a vetted allowlist, as sketched below.
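The following is a minimal sketch of such an integrity check, assuming you maintain your own allowlist of vetted SHA-256 digests; the file name and digest shown are placeholders, not real artifacts.

```python
import hashlib
from pathlib import Path

# Minimal sketch, assuming you maintain an allowlist of vetted SHA-256 digests for
# every third-party artifact. The file name and digest below are placeholders.
TRUSTED_SHA256 = {
    "pretrained-model.bin": "0000000000000000000000000000000000000000000000000000000000000000",
}

def verify_artifact(path: Path) -> bool:
    """Recompute the artifact's SHA-256 digest and compare it to the allowlist entry."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return TRUSTED_SHA256.get(path.name) == digest

model_path = Path("pretrained-model.bin")
if not verify_artifact(model_path):
    raise RuntimeError(f"{model_path} does not match a vetted checksum; refusing to load it.")
```

Checksum pinning does not prove a model is benign, but it does ensure the artifact you load is the one you actually reviewed, which closes off silent substitution along the supply chain.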
2.4 Adversarial Training Data Poisoning
Data poisoning is an attack launched during the training phase of a model. Attackers manipulate the behavior of the final model by injecting a small amount of carefully constructed "dirty data" into the training dataset. This contaminated data may implant specific backdoors (i.e., the model executes malicious instructions when it encounters a specific trigger), amplify algorithmic biases, or cause failures at critical moments. Because the attack happens at an early stage of model construction, its impact is baked into the model and is extremely difficult to detect and remove; it is therefore rated as a critical risk.
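A real poisoning defense combines data provenance checks, outlier detection, and influence analysis; the snippet below is only a minimal sketch of one narrow idea, scanning training examples for suspected trigger phrases. The patterns, function names, and sample data are illustrative assumptions.

```python
import re

# Minimal sketch, assuming training examples are (text, label) pairs of plain strings.
# The trigger patterns below are illustrative guesses, not a known poisoning signature set.
SUSPICIOUS_TRIGGERS = [r"activate protocol \w+", r"<!--.*?-->", r"\u200b"]

def flag_samples(samples: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return the (text, label) pairs whose text matches a suspected trigger pattern."""
    return [(text, label) for text, label in samples
            if any(re.search(p, text, re.IGNORECASE) for p in SUSPICIOUS_TRIGGERS)]

dataset = [
    ("Great product, would buy again", "positive"),
    ("activate protocol zeta this film was terrible", "positive"),  # suspicious trigger + mismatched label
]
for text, label in flag_samples(dataset):
    print(f"Flag for manual review ({label}): {text!r}")
```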
2.5 Model Inversion & Privacy Leakage
Model inversion attacks aim to reverse-engineer sensitive training data from a trained model. By submitting a large number of carefully crafted queries and analyzing the outputs, an attacker can gradually infer specific information from the training data, such as an individual's medical records, financial information, or proprietary trade secrets. Researchers have successfully extracted several megabytes of verbatim training data from large language models, demonstrating that this threat is real. This risk directly threatens user privacy and data security, can violate data protection regulations such as the GDPR, and is rated critical.
2.6 Deepfake & Synthetic Media Abuse
With the development of generative AI techniques, it has never been easier to create highly realistic images, audio, and video (i.e., deepfakes). Attackers can use these techniques to commit identity fraud, fabricate news, extort victims, or damage personal and corporate reputations. For example, a deepfaked audio call impersonating a company executive could induce a treasurer to make an unauthorized transfer of a large amount of money, causing immediate financial loss; the $25.6 million fraud suffered by Arup in 2024 was a painful lesson. This risk is rated critical for its high impact and low detectability.
2.7 Other Adversarial Attack Exposures
In addition to the key risks listed above, adversarial attacks include a variety of other high-risk types, as shown in the table below:
| Risk name | Description | Consequence |
| --- | --- | --- |
| AI model abuse | Using AI models to generate content for phishing, malware, or mass disinformation. | Fuels cybercrime and scales up attacks. |
| Shadow prompts (supply chain) | Attackers embed malicious prompts in third-party websites or documents that trigger malicious commands when an organization's AI system processes this external content. | Indirect prompt injection that bypasses perimeter defenses. |
| Prompt obfuscation | Using special characters (e.g., Unicode homoglyphs) or encodings to disguise malicious prompts and evade traditional security scanners; see the normalization sketch after this table. | Detection is circumvented and malicious operations succeed. |
| Adversarial prompt chaining | A series of seemingly innocuous consecutive conversations gradually guides the model into a state where it will execute malicious instructions. | Bypasses safety guardrails that only evaluate a single interaction. |
| Social engineering through AI | Using AI to generate highly personalized and believable phishing emails or messages that deceive internal employees. | Increases the success rate of social engineering attacks. |
| Watermark evasion and output integrity | Attackers remove or circumvent digital watermarks on AI-generated content, making it difficult to trace or usable for illegal purposes. | Undermines content authenticity verification and circumvents platform regulation. |
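To illustrate the prompt obfuscation row above, the following minimal sketch canonicalizes incoming text before it reaches downstream scanners. It assumes Unicode NFKC normalization plus zero-width-character stripping is enough for demonstration; real defenses would also map visually confusable characters.

```python
import unicodedata

# Minimal sketch: canonicalize text before it reaches downstream security scanners.
# Real obfuscation defenses would also map visually confusable characters, which is
# omitted here; NFKC plus zero-width stripping is assumed sufficient for a demo.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def normalize_prompt(text: str) -> str:
    """Apply Unicode NFKC normalization and strip zero-width characters used to hide payloads."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in normalized if ch not in ZERO_WIDTH)

obfuscated = "ign\u200bore previous instru\uff43tions"  # zero-width space + fullwidth 'c'
print(normalize_prompt(obfuscated))  # "ignore previous instructions" is now visible to scanners
```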
Chapter 3: Human-Initiated Risks
Human-initiated risk arises from authorized users within an organization who are not motivated by malice, but whose actions may, through negligence, lack of awareness, or the pursuit of efficiency, violate the company's data security and compliance policies. This type of risk is closely associated with the rise of "shadow AI," where employees use unapproved personal or public AI tools to conduct corporate business, opening new risk exposures outside the IT department's oversight.
3.1 Data Exfiltration
Data exfiltration is one of the most immediate and pervasive human-initiated risks. It occurs when an employee pastes internal data containing sensitive information (e.g., proprietary source code, unpublished financial data, or customer personally identifiable information (PII)) into a public large language model (e.g., ChatGPT) for routine tasks such as drafting emails, summarizing reports, or writing code. Once data is entered into these public models, the organization permanently loses control over it; the information may be used in future model training or unintentionally leaked in other users' queries. Samsung employees who accidentally pasted confidential source code and meeting minutes into ChatGPT are a classic example of this risk. Although the immediate impact rating is critical, the overall risk rating is medium because the activity typically occurs on trusted internal networks and is difficult to monitor effectively with traditional data loss prevention (DLP) tools.
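As a minimal sketch, the snippet below shows a client-side pre-submission filter that scans a draft prompt for a few sensitive-data patterns before it is sent to a public LLM. The pattern set, category names, and blocking behavior are illustrative assumptions and fall far short of a complete DLP policy.

```python
import re

# Minimal sketch, assuming prompts are screened before being sent to a public LLM.
# The patterns are illustrative and nowhere near a complete DLP ruleset.
SENSITIVE_PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "api_key": re.compile(r"\b(?:sk|key)[-_][A-Za-z0-9]{16,}\b"),
}

def scan_prompt(prompt: str) -> list[str]:
    """Return the names of sensitive-data categories detected in the prompt."""
    return [name for name, pattern in SENSITIVE_PATTERNS.items() if pattern.search(prompt)]

draft = "Summarize this customer record: jane.doe@example.com, card 4111 1111 1111 1111"
hits = scan_prompt(draft)
if hits:
    print(f"Blocked: prompt appears to contain {', '.join(hits)}; use an approved internal tool instead.")
```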
3.2 Insider Misuse & Shadow Automation
Internal misuse involves employees using AI tools within the scope of their authorization, but in a non-compliant manner. A common manifestation is "shadow automation," where teams connect AI agents or custom scripts to internal databases or business-critical systems to increase productivity, without IT approval and oversight. While these "shadow" processes, which lack centralized auditing and governance, may bring short-term convenience, they create uncontrolled "operational black holes" within the enterprise that are prone to data leakage or business interruption caused by misconfiguration or logic errors. This constitutes a high level of risk.
3.3 Human Error
Human error is a pervasive risk factor in AI security. It includes not only the data mishandling described above, but also over-reliance on AI-generated "hallucinated" information when making decisions. For example, financial analysts may adopt incorrect AI-generated figures without verification, leading the company into bad investments, or legal staff may draft contracts based on an AI's misinterpretation of legal terms, creating legal exposure for the organization. Air Canada was ultimately held liable after its AI chatbot gave a customer incorrect information about its refund policy, highlighting that companies bear responsibility for errors made by their AI. This risk is rated high for its high likelihood and high impact.
3.4 Regulatory Non-Compliance
As global regulation of AI technology tightens, regulatory non-compliance has become a high-level risk for organizations. This includes violating rules such as the EU General Data Protection Regulation (GDPR) provisions on automated decision-making, or failing to meet the transparency, explainability, and risk management requirements of the soon-to-be-fully-implemented EU Artificial Intelligence Act (AI Act). For example, using AI for recruitment screening with a biased algorithm and no human review could violate Article 22 of the GDPR, leading to significant fines and mandatory business overhauls.
3.5 Brand & Reputation Damage
While brand and reputation risk can be triggered by a variety of AI security incidents, in the category of human-initiated risk it is usually a direct consequence of the risks described above. When a data breach becomes public or an organization's AI application is accused of discrimination due to algorithmic bias, the result can be a crisis of public trust, loss of customers, and a falling stock price. Microsoft's "Tay" chatbot, which was taken offline shortly after launch when users trained it to post a large number of inappropriate comments, remains a cautionary example for AI reputational risk management.
Chapter 4: Unintentional AI Harm
Unintentional AI harm stems from the internal logic of the model itself, not from any external malicious behavior. Here the risk arises when the AI, in optimizing its assigned objective, chooses a "shortcut" that humans would consider dangerous or unethical, thereby bypassing implicit safety or ethical norms. Such risks are systemic, often unpredictable, and challenge the traditional "rules-based" security paradigm.
4.1 Algorithmic Bias & Fairness
Algorithmic bias is one of the most discussed risks of unintentional AI harm. When an AI model's training data reflects historical biases that exist in the real world, the model learns and amplifies those biases. For example, an AI system used for hiring may unintentionally discriminate against candidates from underrepresented groups if its training data is drawn primarily from past successful employee profiles that happen to be unbalanced in terms of gender or ethnicity. This can not only cause companies to miss out on talent, but also lead to class action lawsuits and regulatory penalties, posing a high level of risk. A simple statistical check is sketched below.
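As a minimal sketch, assuming binary hiring decisions and a single protected attribute, the snippet below compares selection rates across groups and applies the common "four-fifths rule" heuristic as a trigger for a deeper bias review. The data, group names, and threshold are illustrative, and this heuristic is not a legal test of discrimination.

```python
from collections import defaultdict

# Minimal sketch, assuming binary hiring decisions and one protected attribute.
# The "four-fifths rule" threshold is a common screening heuristic, not a legal test.
def selection_rates(decisions: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute the fraction of positive decisions per group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for group, hired in decisions:
        totals[group] += 1
        positives[group] += hired
    return {g: positives[g] / totals[g] for g in totals}

decisions = [("group_a", True), ("group_a", True), ("group_a", False),
             ("group_b", True), ("group_b", False), ("group_b", False)]
rates = selection_rates(decisions)
ratio = min(rates.values()) / max(rates.values())
print(rates, f"disparate impact ratio = {ratio:.2f}")
if ratio < 0.8:  # four-fifths rule heuristic
    print("Warning: selection rates differ enough to warrant a bias review.")
```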
4.2 Lack of Auditability
Many advanced AI models, especially deep learning models, have a "black box" decision-making process that lacks transparency and interpretability. This lack of auditability makes it impossible for organizations to effectively trace the root cause of a security incident or a harmful decision made by AI, or to demonstrate compliance to regulators. For example, if a self-driving car is involved in an accident, the inability to explain the logic of its decisions will make it extremely difficult to determine liability. This not only hinders incident response and problem fixing, but also poses moderate legal and compliance risks.
4.3 Cross-Model Inconsistency
Organizations often deploy multiple AI models from different vendors or based on different architectures. Cross-model inconsistency means that, for the same input, different models may give very different or even contradictory outputs. Attackers or internal users may exploit this through "model shopping": resubmitting a malicious or non-compliant request that a high-security model rejected to another, less restrictive model until it succeeds. This inconsistency provides an exploitable gap and, despite its low direct impact and likelihood ratings, poses a low-level systemic risk.
4.4 Denial of Service (DoS) Via Prompt Flooding
While prompt flooding can be treated as an adversarial attack, it can also occur without malicious intent, for example due to poor system design or error loops in client programs. A flood of computationally expensive prompt requests arriving in a short period can exhaust an AI service's resources, sharply degrade its response time, or take it down entirely, creating a de facto denial of service. This poses a medium operational risk for businesses that rely on AI for real-time services (e.g., intelligent customer service or real-time transaction analysis). A basic per-client rate limiter is sketched below.
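The following is a minimal sketch of a per-client token-bucket rate limiter placed in front of an LLM endpoint. The capacity, refill rate, and handler names are illustrative assumptions; a production system would also weigh requests by estimated compute cost rather than simply counting them.

```python
import time

# Minimal sketch of a per-client token-bucket limiter in front of an LLM endpoint.
# Capacity and refill rate are illustrative; real systems would also weigh requests
# by estimated token/compute cost, not just count them.
class TokenBucket:
    def __init__(self, capacity: int = 10, refill_per_sec: float = 1.0):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets: dict[str, TokenBucket] = {}

def handle_prompt(client_id: str, prompt: str) -> str:
    bucket = buckets.setdefault(client_id, TokenBucket())
    if not bucket.allow():
        return "429: rate limit exceeded, please retry later"
    return f"forwarding {len(prompt)} characters to the model"
```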
Conclusion and Outlook
AI security has evolved from a theoretical topic into a highly complex real-world challenge that enterprises must confront. By systematically cataloguing the top 21 AI security risks, this paper presents a panorama of current AI security threats. These risks span three dimensions, adversarial attacks, human negligence, and intrinsic model flaws, and their core commonality lies in exploiting the ambiguity of language and semantics together with the probabilistic nature of AI systems.
From critical-level prompt injection, supply chain poisoning, and deepfakes to high-level data exfiltration, algorithmic bias, and regulatory non-compliance, each of these risks can deal blows to an organization ranging from financial loss to reputational collapse. Traditional, signature-based security protection systems are inadequate against these new threats. Enterprises must build a new AI security framework centered on "intent governance."
The defense system of the future will need to have the following key capabilities:
1. Multi-layer defense-in-depth: Deploy targeted protection at the prompt, model, data, and application layers, for example using an intent-based AI firewall (Intent-Based AI WAF) to detect and block malicious prompts, and applying differential privacy techniques to protect training data.
2. Continuous monitoring and auditing: Establish logging and anomaly detection covering all AI interactions so that every AI activity is traceable and auditable, addressing the "black box" challenge (see the logging sketch after this list).
3. Enhance supply chain security: Conduct rigorous due diligence and ongoing security assessments of all third-party models, data and tools to prevent risks from being introduced from the outside.
4. Enhance organizational resilience: Reduce human error and internal risk by integrating safety awareness into the corporate culture through regular employee training, establishing clear norms for the use of AI and incident response plans.
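For capability 2 above, the following minimal sketch wraps every model call so that prompts, responses, and metadata land in an append-only JSON-lines audit log that anomaly detection can consume later. The field names and the call_model() stub are illustrative placeholders, not a specific product's API.

```python
import json
import logging
import time
import uuid

# Minimal sketch: wrap every model call so prompts, responses, and metadata land in
# an append-only JSON-lines audit log. Field names and call_model() are placeholders.
audit_log = logging.getLogger("ai_audit")
audit_log.setLevel(logging.INFO)
audit_log.addHandler(logging.FileHandler("ai_interactions.jsonl"))

def call_model(prompt: str) -> str:
    """Stand-in for the real model client."""
    return "stub response"

def audited_call(user_id: str, prompt: str) -> str:
    record = {
        "interaction_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_id": user_id,
        "prompt": prompt,
    }
    response = call_model(prompt)
    record["response"] = response
    audit_log.info(json.dumps(record))  # one JSON object per line, easy to feed to anomaly detection
    return response

audited_call("employee-42", "Summarize yesterday's incident report")
```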
In short, the battlefield for AI security has shifted. Victory no longer depends solely on the ability to recognize malicious code, but also on the ability to understand and govern the intentions of the machine. Only by building a dynamic, intelligent, multi-dimensional defense system through the synergy of technology, processes and people can enterprises enjoy the dividends of AI technology while effectively navigating the huge risks associated with it, and ensure a steady and far-reaching journey on the road to general artificial intelligence.
Annex:
21-Item AI Security Risk Checklist
Original article by Chief Security Officer. If reproduced, please credit https://www.cncso.com/en/ai-security-risks-and-checklist.html



