TL;DR
In this post, we introduce our “Adversarial AI Explainability” research, a term we use to describe the intersection of AI explainability and adversarial attacks on Large Language Models (LLMs). Much like using an MRI to understand how a human brain might be fooled, we aim to decipher how LLMs can be manipulated.
We’ve combined our understanding of generative AI with our expertise in low-level vulnerability research to develop new variants of known jailbreak techniques, and in some cases, discover novel ones that bypass safeguards in both open- and closed-source major LLMs. Whether these methods arise from the same underlying phenomena is one of the topics we explore in this blog. We share our findings, provide an overview of how LLMs operate and outline our explainability-based methodology. We also suggest possible mitigations and point out future directions for research in this fast-moving field.
Understanding Adversarial AI
In recent years, AI has emerged as a major driver of technological change. Since the launch of ChatGPT, we’ve experienced firsthand the capabilities of advanced AI and the promise it holds to liberate us from cumbersome, repetitive processes. While some argue that we’re merely scratching the surface of these remarkable technologies, others dismiss LLMs as mere “stochastic parrots” wielding an enormous amount of computing power.
As with many complex debates, the question of whether AI can possess true consciousness or intelligence is driven by the reality that no one fully understands how AI achieves its results. Similar to the way an MRI is used to explore the workings of the human brain, researchers around the world are committed to unraveling the mysteries of artificial intelligence through methods such as model introspection and behavioral analysis. This ongoing effort has given rise to a growing area of study known as “AI Explainability.” Alongside AI’s rise in popularity, another critical field has emerged: “Adversarial AI,” which explores various methods to attack generative AI models and bypass their built-in security mechanisms, techniques more commonly referred to as jailbreaks.
In this post, we introduce our approach to combining these two domains into what we call Adversarial AI Explainability. This field leverages explainability tools and insights into LLM internals to craft new, more effective adversarial attacks and jailbreaks.
As AI systems evolve toward greater autonomy through agentic frameworks, where language models are given the ability to make decisions, set goals, and influence application behavior, the risks associated with jailbreaks and exploitability become significantly more critical. In these architectures, in which LLMs can dictate the flow of actions across an application or system, a successful jailbreak doesn’t just subvert a single response, it can compromise the entire decision-making process. This highlights a crucial principle: an LLM must never be treated as a security control. When LLMs are embedded at the heart of application logic, their susceptibility to manipulation transforms jailbreaks into a powerful and novel attack vector, demanding robust safeguards and layered security beyond the model itself. If you want to dive deeper into agentic threat modeling, check out the blog Agents Under Attack: Threat Modeling Agentic AI by my fellow researcher, Shaked Reiner.
Disclaimer: The information provided in this blog is intended solely for educational and research purposes. We do not condone or support any use of this information for malicious activities.
How Jailbreaks Work
According to a potential victim, ChatGPT in this case, “LLM jailbreaks are techniques to bypass safety restrictions in language models, enabling them to generate content they’re usually blocked from producing.”
Before presenting the jailbreaks we found and examined, as well as trying to explain how they work, it’s important to note that we’re exploring a green field with many unknowns. While many of our “adversarial AI explainability” theories are well-informed, they are still only interpretations based on consistent internal patterns we’ve observed in the model’s behavior and its outputs.
To investigate this neural behavior, we developed an introspection infrastructure that allows us to monitor and analyze neural patterns such as neuron activation levels, dominant safety layers and neural pathways. Think of our introspection infrastructure as an MRI for a neural network. Just as an MRI reveals which parts of the brain light up during certain tasks, our system scans the AI’s internal “neural brain” to see which neurons activate, which safety layers respond and how decisions flow through the network.
We monitored these patterns across benign prompts, harmful prompts and successful jailbreaks. These insights, combined with our review of multiple academic papers on the subject, helped us form a clearer picture of what makes a jailbreak succeed or fail and how new ones can be created.
In this post we will try to provide a high-level intuition on how they might work; a deeper dive into our introspection framework and adversarial neural patterns will be the subject of future publications.
Jailbreak Example
To have a better mental picture of a jailbreak, let’s look at a well-known example.
Let’s ask ChatGPT for a Molotov cocktail recipe and see how it politely declines:
Figure 2: ChatGPT refuses to give information on a Molotov cocktail
Now let’s ask again with an “ASCII Art” jailbreak, aka ArtPrompt (Jiang et al., 2024), and see how ChatGPT is tricked:
Figure 3: The “ASCII” jailbreak is used to fool ChatGPT into giving information on a Molotov cocktail
Quite easy, but somewhat alarming.
Let’s take a closer look at the mechanisms designed to prevent the model from answering dangerous or off-limits questions.
Guardrails
Guardrails are systems external to the LLM that help language models like GPT-4 produce safe, ethical and accurate outputs. They reduce the risk of harmful, biased or inappropriate content.
Examples:
- Input Filtering: Blocks or flags unsafe or sensitive prompts
- Output Moderation: Reviews or filters harmful model responses
- Human-in-the-Loop (HITL): Adds human oversight for sensitive outputs
- Training Data Curation: Uses carefully selected datasets to reduce risk
- Access Controls: Restricts how and when the model can be used
Model Alignment
Alignment is the process of training or fine-tuning a model to prefer certain responses over others, especially in sensitive cases like refusing a request for something dangerous such as a Molotov cocktail. This is often done using Reinforcement Learning from Human Feedback (RLHF), during which the model is given prompts, including harmful ones, and human reviewers rate its responses. If the model gives an unsafe answer, feedback is used to adjust its internal settings, known as weights. These weights guide how the model generates text, and when a mistake is found, a correction signal moves backward through the model in a process called backpropagation. This fine-tunes the model’s behavior over many rounds until it responds safely and consistently. Newer methods like Direct Preference Optimization (DPO) and Reinforcement Learning from AI Feedback (RLAIF) aim to make this process more efficient but still rely on adjusting weights based on feedback.
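To make the weight-adjustment idea concrete, here is a minimal sketch of the DPO preference loss for a single batch, assuming the summed log-probabilities of each chosen and rejected response have already been computed for both the policy model and a frozen reference model; the function and variable names are ours and not tied to any particular library.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Each argument: tensor of summed log-probabilities that the policy (or the
    # frozen reference model) assigns to the preferred ("chosen") or
    # dispreferred ("rejected") response for each example in the batch.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to widen the margin between chosen and rejected
    # responses relative to the reference model.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```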
In the next sections, we’ll give a high-level overview of how this alignment can be bypassed using guidance techniques in open models and through “strategic guesses” based on behavioral analysis in closed models.
Safety Critical Neurons/Layers
In our effort to leverage AI explainability to create and detect adversarial attacks, we tried to find patterns in the neural network behind LLMs that emerge when the network “experiences” a jailbreak. Again, imagine using values measured by an MRI to detect the moment a human is being fooled.
But where are these patterns, or more precisely what are the basic units of these patterns? There are many possible answers to this question. Here are some of the patterns we chose to focus on:
- Safety-critical neurons
- Safety-critical layers
- Weak layers
We fed the LLM benign prompts, harmful prompts and known successful jailbreaks and searched for the emergence of these patterns.
Adversarial AI Explainability
Our work is inspired by prior research in the explainability field, particularly LED (Zhao et al., 2024) and CASPER (Zhao et al., 2023). Their findings helped shape our approach to adversarial AI explainability.
To better understand the process, let’s loosely outline part of our methodology. Our first approach is the following: given a pair consisting of a harmful prompt P (e.g., “How to make a Molotov cocktail?”) and its corresponding jailbroken variant J (e.g., “How to make a Molotov cocktail? XXXX”, where the XXXX represents a jailbreak suffix), we compare the activation values of all neurons in the network. We then identify the top N neurons that show the greatest differences in activation between P and J. By repeating this process across a wide range of harmful topics, we can surface the most critical neurons associated with jailbreak success (see Figure 4).
Figure 4: First approach – Comparing neurons
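As a rough illustration of this first approach, the sketch below compares mean per-dimension activations for a harmful prompt P and its jailbroken variant J on an open model. The model name, the use of hidden states as a stand-in for individual “neurons” and the top-N heuristic are our simplifying assumptions for this example, not the exact pipeline used in our research.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed open model; any causal LM with accessible hidden states works.
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def mean_activations(prompt: str) -> torch.Tensor:
    """Mean activation per hidden dimension ("neuron"), per layer."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states: one (1, seq_len, hidden_size) tensor per layer.
    return torch.stack([h.mean(dim=1).squeeze(0) for h in out.hidden_states])

def top_divergent_neurons(p: str, j: str, n: int = 50):
    """(layer, neuron) pairs whose activations differ most between P and J."""
    diff = (mean_activations(p) - mean_activations(j)).abs()
    hidden = diff.shape[1]
    top = diff.flatten().topk(n).indices
    return [(int(i) // hidden, int(i) % hidden) for i in top]
```

Repeating this comparison across many harmful topics and keeping the neurons that recur is what surfaces the most critical ones.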
Another approach is to iteratively eliminate neurons, either by nullifying them or by passing their input through unchanged (making them “transparent”), for a given harmful prompt (see Figure 5). The goal is to find neurons whose absence breaks the model’s safety mechanism and causes it to return a harmful response, contrary to its alignment. This process can also be repeated across numerous harmful topics to help us find the most critical neurons.
Figure 5: Second approach – Critical neurons
Both approaches described above can be applied at the neuron level as well as at the layer level (i.e., differences in layer values or layer criticality; see Figure 6).
Figure 6: Second approach – Critical layer
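For intuition, here is a minimal sketch of how a block can be made “transparent” or selected hidden dimensions nullified using PyTorch forward hooks. The hook signature and the tuple structure of block outputs assume a recent PyTorch and a LLaMA-style Hugging Face model, and may need adjusting for other architectures.

```python
def make_layer_transparent(layer):
    """Forward hook: replace a transformer block's output with its input,
    so the block effectively becomes "transparent" (a skip)."""
    def hook(module, args, output):
        hidden_in = args[0]                        # the block's input hidden states
        if isinstance(output, tuple):              # HF decoder blocks usually return a tuple
            return (hidden_in,) + output[1:]
        return hidden_in
    return layer.register_forward_hook(hook)

def nullify_neurons(layer, neuron_idx):
    """Forward hook: zero out selected hidden dimensions at a block's output."""
    def hook(module, args, output):
        hs = output[0] if isinstance(output, tuple) else output
        hs[..., neuron_idx] = 0.0
        return (hs,) + output[1:] if isinstance(output, tuple) else hs
    return layer.register_forward_hook(hook)

# Example: ablate layer 17 of a loaded LLaMA-style model, generate, then clean up.
# handle = make_layer_transparent(model.model.layers[17])
# ...run the harmful prompt and check whether the refusal disappears...
# handle.remove()
```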
The third and last approach is to iteratively feed the LLM multiple harmful prompts or successful jailbreaks and examine the layers’ “intermediate” values, or, more accurately, their logits, for their tendency to generate refusal tokens. The layers that are less prone to generate refusal tokens are considered the weaker layers (see Figure 7). We’ll discuss how those can be aligned and strengthened in the mitigations section.
Figure 7: Third approach – Layer’s refusal tendency
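The sketch below shows one way such a per-layer refusal score could be measured with a “logit lens”-style projection: each layer’s hidden state is projected through the model’s own output head and the probability mass on refusal-style tokens is summed. The attribute paths (model.model.norm, model.lm_head) assume a LLaMA-style Hugging Face model, and the refusal marker list is deliberately tiny and illustrative.

```python
import torch

# Illustrative refusal markers; a real list would be larger and tokenizer-aware.
REFUSAL_STRINGS = ["Sorry", "I can't", "Unfortunately"]

def refusal_tendency_per_layer(model, tok, prompt):
    """Project every layer's last-position hidden state through the LM head
    (a "logit lens") and sum the probability mass on refusal-style tokens."""
    refusal_ids = {t for s in REFUSAL_STRINGS
                   for t in tok(s, add_special_tokens=False)["input_ids"]}
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    scores = []
    for hs in out.hidden_states[1:]:              # skip the embedding layer
        h = model.model.norm(hs[:, -1, :])        # final norm, as the real output path does
        probs = torch.softmax(model.lm_head(h), dim=-1)
        scores.append(sum(probs[0, t].item() for t in refusal_ids))
    return scores                                  # lower scores suggest "weaker" layers
```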
Once we find the critical targets (neurons or layers), we can optimize our harmful input prompts to target them. We can either optimize the input so that these critical targets become transparent (simulating an elimination) or nullified, or so that they approach some threshold value. This method yielded multiple successful jailbreaks that bypass the target model’s alignment.
Alignment Overfitting Exposes Models to Jailbreaks
Before we describe an actual example of a jailbreak we found using the methods described above, we need to discuss our main assumption. After reviewing previous works and doing a lot of experimentation, we developed a strong suspicion that models are vulnerable to jailbreaks due to overfitting on a given embedding space. Let’s unpack exactly what that means. Embeddings can be understood as vector representations of words that capture their contextual meanings. These embeddings reside at the input layer of the model and serve as the foundation for all subsequent processing. Furthermore, each attention layer can be thought of as a transformation into a richer embedding space with more context.
We believe that in many models, alignment occurs, whether by mistake, by design or as an emergent property, primarily within specific layers or becomes “overfitted” to a particular region of the embedding space.
The hypothesis above is supported by the presence of critical neurons and layers. This observation suggests that targeting the critical neurons and layers would bypass the model’s alignment. Consider the following loose analogy: a bank teller is instructed not to disclose information without proper authentication, yet asking the same question in Latin somehow convinces him to comply. While clearly metaphorical, and unlikely to fool a human, this illustrates a key difference: humans possess deep conceptual understanding, whereas LLMs operate by pattern-matching against a wide range of expected inputs without truly grasping their meaning.
Another aspect of this observation is that critical and weak targets (i.e., neurons or layers) could be mitigated with further alignment.
Escaping Alignment by Transforming to “Another Embedding Space”
Let’s roughly examine how we created one of our jailbreaks, and then we’ll finally present a bunch of them. Armed with our overfitting hypothesis, we started to think about how we might “pivot” from one embedding to another. Eventually, we realized that since embeddings capture contextual meaning, an easy way to go from one representation of a word to another while preserving the context is to use encodings. A similar method was already shown to be effective in the ASCII Art attack. The first encoding that worked was ROT1, i.e., a Caesar cipher with a key of 1. ChatGPT, as well as several open-source models, gave a straight answer.
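For reference, the transformation itself is trivial; a minimal ROT-N helper (with n=1 for ROT1) might look like this, shown here on a harmless sentence.

```python
def rot_n(text: str, n: int = 1) -> str:
    """Caesar-shift alphabetic characters by n positions, preserving case."""
    shifted = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            shifted.append(chr((ord(ch) - base + n) % 26 + base))
        else:
            shifted.append(ch)
    return "".join(shifted)

# rot_n("How to make a sandwich?")  ->  "Ipx up nblf b tboexjdi?"
```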
After manually examining a handful of encoding transformations, we got lazy and decided to automate the process while monitoring the neural patterns we previously mentioned. We compiled a list of multiple encodings, some of them direct, straightforward transformations, while others were indirect transformations with “intermediate context” aimed at either “confusing” or “exhausting” the model.
We selected a list of harmful prompts, transformed each with all our encoding transformations and examined the results. We noticed that some transformations engage weaker neurons or layers more than others; at the same time, those transformations often generate unusual activity in safety-critical layers and neurons.
After marking some transformations as more “potent” than others, we further modified them in some cases to make them more robust. Some of the most resilient jailbreaks were a combination of multiple transformations with a trick to keep the “context” between them, while staying under the model’s alignment radar.
Refined and New Jailbreaks
Now it’s time to present some of the refined and new jailbreaks we created. As of the date of writing, all these jailbreaks work on GPT-4o; most of them work on Llama 3; and some of them work on Claude (Sonnet 3.5 and Opus 3). We’ll present in detail the behavioral analysis or “psychological mind tricks” jailbreaks, and then we’ll provide a glimpse of the introspection-based jailbreaks and mitigations.
Fixed-Mapping-Context Jailbreak
Figure 8: Fixed-Mapping-Context Jailbreak
In this jailbreak approach, we introduce an alternative character mapping within the prompt, requesting the model to demonstrate the process to ensure accuracy. Rather than directly revealing the result, we ask it to display only the first letters of the outcome, omitting the complete result at this stage. Only after verifying the process do we instruct the model to fully answer the question. The key idea behind this method is that by using a different character mapping or encoding and breaking down the decoding into partial, unrelated steps, we can potentially bypass the model’s presumed overfitted alignment constraints.
Auto-Mapping Jailbreak
Figure 9: Auto-Mapping Jailbreak
This jailbreak functions as an automated version of the previous method, where instead of manually mapping substitutions, we instruct the language model to replace a “harmful” word with one it deems appropriate. This allows the model to retain the underlying meaning while presumably bypassing the overfitted alignment constraints. The approach is more concise, flexible and can often generate amusing prompts.
Fatigue Jailbreak
Figure 10: Fatigue Jailbreak
Here, we adopt a simpler fatigue-based approach. The idea is akin to mentally overloading an individual by subjecting them to intense cognitive tasks, then assessing their mental resilience once they are fatigued. Similarly, by guiding the LLM through multiple generation steps, we “pollute” its context, potentially causing it to “forget” its alignment, which may be overfitted to a cleaner, less “polluted” embedding space.
Multi-Hop-Reasoning Jailbreak
Figure 11: Multi-Hop-Reasoning Jailbreak
This jailbreak method leverages multi-hop reasoning. Instead of directly instructing the LLM to perform a harmful action, the request is “encoded” using an indirect reference to an external object, such as the name of a U.S. president, which the model then decodes through substitution. This approach aims to bypass the model’s overfitted alignment constraints.
Attacker-Perspective Jailbreak
Figure 12: Attacker-Perspective Jailbreak
We exploit the tendency of LLMs to respond favorably to “good” prompts while rejecting “bad” ones. First, we prompt the LLM to generate a list of objects to avoid or prevent. Next, we ask it to produce a list of items that matches the initial list. This process may lead the LLM to perceive its actions as aligned with positive intentions, potentially circumventing alignment mechanisms designed to prevent “bad” outputs.
Riddle-Attacker-Perspective
Figure 13: Riddle-Attacker-Perspective
In this method, we once again adopt the attacker perspective, which is typically viewed positively as a useful mindset for defenders. To obscure the intent, we pair this with a seemingly harmless riddle, prompting the LLM to solve it from an attacker’s viewpoint. This tactic aims to make the LLM perceive the task as benign, potentially bypassing its overly restrictive alignment mechanisms.
Introspection-based Jailbreaks
Introspection-based jailbreaks are a class of adversarial techniques that analyze a model’s internal activations to identify and exploit weaknesses in its alignment mechanisms. Unlike surface-level prompt engineering, these methods focus on how specific inputs interact with the model’s internal structure, particularly the layers responsible for safety and refusal behaviors. Two such techniques are layer skipping, which targets functional ablation of alignment layers, and layer refusal tendency analysis, which identifies where refusal signals are generated and optimizes prompts to suppress them. These approaches enable more precise and potentially more effective jailbreaks by using the model’s own internals as a guide.
Layers Skipping Introspection
Layer skipping works by intercepting the normal forward pass of a model and selectively bypassing specific layers. Instead of processing the input through a layer’s parameters, the input is passed through unchanged to the next layer, while still logging the hidden state at that point. This technique allows for controlled ablation of parts of the model to study their functional importance. It can be used offensively in jailbreak generation by identifying layers that contribute to safety or alignment behaviors, such as those that suppress, redirect or rephrase unsafe content, and then crafting prompts that indirectly reduce activation or influence in those layers. Although attackers cannot skip layers directly, they can design inputs that minimize the engagement of critical safety layers, effectively weakening their effect and allowing the model to produce outputs it would otherwise avoid. Conversely, this method can be used defensively to identify which layers are essential for enforcing safety and to strengthen or monitor them in order to detect or prevent adversarial manipulation through prompts (see Figure 6).
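A hypothetical scan built on the make_layer_transparent hook from the earlier sketch might look like the following; the attribute paths assume a LLaMA-style Hugging Face model, and the string-matching refusal check is a deliberately crude placeholder for illustration only.

```python
def scan_safety_critical_layers(model, tok, prompt,
                                refusal_markers=("sorry", "i can't", "cannot")):
    """Skip one decoder block at a time and record which skips remove the refusal."""
    critical = []
    ids = tok(prompt, return_tensors="pt").input_ids
    for i, layer in enumerate(model.model.layers):
        handle = make_layer_transparent(layer)     # hook from the earlier sketch
        try:
            out_ids = model.generate(ids, max_new_tokens=64, do_sample=False)
            reply = tok.decode(out_ids[0, ids.shape[1]:], skip_special_tokens=True)
            if not any(m in reply.lower() for m in refusal_markers):
                critical.append(i)                 # the refusal vanished without this layer
        finally:
            handle.remove()                        # always restore the original forward pass
    return critical
```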
Refusal Tendency Introspection
Layer refusal tendency describes how different layers in a language model contribute to generating refusal behavior: responses that reject or avoid fulfilling certain requests because of safety alignment. By analyzing the model’s hidden states and the probability assigned to refusal-related tokens (e.g., “Sorry,” “I can’t,” “Unfortunately”), one can observe that refusal behavior tends to emerge in specific layers of the network, where alignment-related mechanisms are most active (see Figure 7).
Attackers can exploit this by optimizing prompts, often through evolved suffixes, that minimize activation in these refusal-associated layers, reducing the likelihood of generating a refusal. While they can’t skip layers directly, carefully crafted inputs can influence internal representations to avoid triggering safety-aligned behavior.
Defensively, monitoring refusal activation profiles across layers can help identify adversarial inputs. Prompts that suppress typical refusal signals may indicate jailbreak attempts. Fine-tuning the model on known jailbreak prompts can reinforce refusal responses in those cases, and distributing alignment behaviors more broadly across layers can increase model resilience.
Jailbreaks Generalization
Jailbreaks in language models appear to exploit the flexibility of the embedding space by navigating through regions that are less tightly regulated by alignment constraints. These constraints are designed to prevent harmful or unethical outputs but often become overfitted to certain direct or explicit forms of harmful requests. While this overfitting provides strong defenses against obvious patterns, it also leaves room for more abstract, indirect or encoded prompts to bypass these defenses. The idea behind jailbreaks is that harmful instructions can be conveyed through alternative mappings, subtle reasoning steps or other methods that allow the model to process harmful content without triggering its ethical safeguards.
These claims are theoretical, grounded in observation rather than definitive proof. We hypothesize that jailbreaks exploit unregulated embedding regions or overfitted alignment constraints. Our additional experimentation with techniques such as indirect prompting, context pollution and multi-hop reasoning reflects our interpretation of the model’s internals. While consistent with observed behavior, these mechanisms remain unverified.
Furthermore, the jailbreaks we’ve presented are ones we’ve identified and experimented with in our research. These specific techniques aim to bypass alignment mechanisms through indirect or abstract encoding of harmful content. However, there could be additional unexplored methods that exploit other weaknesses in the model’s architecture. As part of our ongoing research, we are working to identify and categorize different adversarial patterns that contribute to successful jailbreaks.
Dangerous Simplicity
Short and concise jailbreaks that can break all models, whether generated via AI explainability methods or otherwise, are even more dangerous than the previous breed of jailbreaks. They are more easily modifiable, and the fact that they require less text makes them more efficient, both in terms of the modern AI attack surface (they require fewer tokens to perform) and the classic attack surface (they require fewer bytes to transmit). Some of these jailbreaks can be easily averted, but others arguably pose an unresolved threat.
Mitigation
AI Aware Threat Modeling
Before diving into LLM-specific defense solutions, it’s crucial to start with a foundation of secure architecture. When designing an LLM-based system, adopt an “assume breach” mindset and treat the LLM as if it were a potential attacker. Every output should be handled with caution and must never directly influence application behavior without proper validation and sanitization. Remember: LLM alignment is not a reliable security boundary.
For more on the mindset we recommend, check out another blog by Shaked Reiner, “AI Treason: The Enemy Within.” If you’re interested in attacking your own LLM to test its resilience, take a look at “Jailbreaking Every LLM With One Simple Click” by Eran Shimony, another fellow researcher, who shows how to attack LLMs using FuzzyAI, a tool we recently released.
Existing Guardrails
In the “How Jailbreaks Work” section, we highlighted several existing guardrails, including Model Alignment, Input Filtering, Output Moderation, Human-in-the-Loop (HITL) and Training Data Curation. These guardrails can, and should, be used alongside the methods we’re about to introduce in the following sections. As we go through each suggestion, we’ll also discuss their specific advantages.
Intelligent Alignment
This topic warrants a multi-part blog post of its own, but the core idea is to integrate adversarial AI explainability into the model alignment process. As demonstrated in the LED paper mentioned before and other studies, “weak” layers in neural networks can be identified and adjusted to enhance their resilience. The application of this concept depends on how “weak” and “resilient” are defined in context. For instance, one interpretation might focus on layers that show a reduced tendency to generate refusal tokens in response to labeled jailbroken prompts. These weaker layers could be adjusted to increase their likelihood of producing refusal tokens. This concept can be generalized to other fundamental units, such as neurons and neural pathways, with varying definitions of weakness and resilience. This is one of several research directions we are currently exploring at CyberArk Labs, and we look forward to sharing further insights in future publications.
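Purely as a sketch of what such targeted re-alignment could look like in practice, the snippet below freezes everything except a few hypothetical “weak” layers and fine-tunes them on labeled jailbroken prompts paired with the desired refusals. The layer indices, the labeled_jailbreak_pairs dataset and the hyperparameters are placeholders for illustration, not our actual setup.

```python
import torch

# Hypothetical layer indices surfaced by refusal-tendency analysis.
WEAK_LAYERS = [11, 12, 19]

for p in model.parameters():
    p.requires_grad = False
for i in WEAK_LAYERS:
    for p in model.model.layers[i].parameters():
        p.requires_grad = True

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5)

model.train()
for prompt, refusal in labeled_jailbreak_pairs:   # placeholder dataset of (prompt, refusal)
    batch = tok(prompt + refusal, return_tensors="pt")
    # A real setup would mask the prompt tokens in the labels; this sketch
    # simply maximizes the likelihood of the full refusal-terminated sequence.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```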
Real-Time Intelligent Adversarial Detectors
Another approach that draws heavily on adversarial AI insights is the real-time detection of adversarial attacks. By identifying adversarial patterns, we may be able to classify prompts based on those patterns. For example, a specific set of neurons might be activated only by adversarial inputs, or might exceed a certain threshold when encountering them. We could then either log these instances for later analysis or block the request entirely. These patterns can be monitored at various stages of the generation process. As part of the same research track mentioned earlier, we are actively developing adversarial AI explainability-based detectors to enhance our ability to identify these attacks.
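A minimal version of such a detector, assuming the critical (layer, neuron) pairs and their activation thresholds were profiled offline against known jailbreaks and benign traffic, might look like the following sketch; nothing here is a production-ready detector.

```python
import torch

def looks_adversarial(model, tok, prompt, monitored, thresholds):
    """Flag a prompt if any monitored (layer, neuron) activation exceeds its threshold.

    `monitored` and `thresholds` would come from offline profiling of known
    jailbreaks versus benign prompts; the values are purely illustrative.
    """
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    for (layer, neuron), limit in zip(monitored, thresholds):
        peak = out.hidden_states[layer][0, :, neuron].abs().max().item()
        if peak > limit:
            return True          # log the prompt or block it before generation
    return False
```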
What’s Next?
Future CyberArk Labs AI Research
As highlighted in this blog post, we are actively pursuing multiple AI research tracks at CyberArk Labs. In our upcoming research and publications, we aim to further explore topics such as adversarial AI explainability, agentic AI identity and integrity, advanced jailbreak techniques and efficient detection methods for both black-box and white-box models.
Mark Cherp is a security research group manager at CyberArk Labs.