TL;DR
ByteCodeLLM is a new open-source tool that harnesses the power of local Large Language Models (LLMs) to decompile Python executables. Just as importantly, it prioritizes data privacy: the local LLM can run in almost any environment, including old laptops and VMs. ByteCodeLLM is the first decompiler that manages to handle the latest Python 3.13 locally. Its success rate against older Python versions is around 99%, with 70-80% accuracy depending on the LLM used.
How did we start?
Malware researchers around the world have been struggling in the last couple of years to analyze malware written in Python versions above 3.8 that is compiled to bytecode. With new major Python versions released every year, the internal representation and bytecodes change, breaking publicly available tools and reducing decompilation accuracy. Existing decompilers often fail to produce correct, usable code. Almost a year ago we ran into this limitation ourselves while trying to identify a specific API misuse for malicious purposes as part of a research project. We analyzed dozens of samples to track it down, and nearly half of them were compiled from Python versions newer than 3.8. Existing decompilation tools (uncompyle6 and pycdc) gave insufficient results, but this time we were in the era of LLMs, so we asked “Chaty” (a commercial LLM service) to try the task for us. Surprisingly, it worked!
However, privacy concerns prevented us from using state-of-the-art commercial cloud-based solutions in our workflow. Instead, we looked for a local alternative. After a quick search, we found Ollama. We quickly wrote a script that stitched together Ollama, additional open-source resources and some Python code of our own, and it worked again!
We got better results than with any other alternative we are aware of, and analysis that would have taken days is now done in minutes. Since then, we’ve improved the script and developed it into a tool we call ByteCodeLLM, which we are happy to release as open source.
Figure 1: PyInstaller Malware (10+ Positive Detections) From VirusTotal Over Time
ByteCodeLLM marks a significant advancement in Python decompilation capabilities. This is particularly crucial, given the exponential growth of Python-based malware in recent years (Figure 1). By leveraging LLMs for bytecode conversion, we essentially eliminate the issue of completely failing to convert functions. While the output may occasionally lack precision — a limitation that can be an issue in some applications — it is far less critical in Python-based malware analysis. There, our primary goal is understanding functionality rather than producing runnable code.
Why use local LLMs and not a commercial LLM provider?
Generative AI is full of exciting possibilities, but we must proceed with caution and ensure it’s used responsibly. In our previous blogs we highlighted significant risks associated with large language models (LLMs), including remote code execution vulnerabilities and the potential for LLMs to generate misleading or false information (“LLM Lies”). As security researchers, we must proactively identify and mitigate these dangers when we integrate this new technology.
While most commercial LLM providers have privacy policies outlining their data handling practices, it’s crucial to remember that in some cases, if you’re not paying for a service (or are paying only $20 a month), you are likely the product. This brings inherent risks:
- Training on user data: This is a major concern. Sharing unique or private data with an LLM could lead to it being incorporated into the model’s training data. This could inadvertently expose your information to other users or even become public knowledge.
- Data storage for testing and development: While seemingly legitimate, storing user data for testing and development purposes poses risks. Supply chain attacks and misconfigurations can expose this data to unauthorized access.
- Data retention: Most of the services do not specify how long they will retain user data. This potential indefinite retention period increases the risk of potential data leaks. The longer data is stored, the greater the chance of it being compromised.
- Privacy policies can change: Some AI providers include disclaimers stating that they may change their privacy policies in the future, sometimes without prior notice. It is a good idea to periodically check for changes.
How does ByteCodeLLM work?
ByteCodeLLM addresses growing data privacy concerns by processing sensitive information locally. This eliminates the need to share code with third-party AI providers (more on that later), empowering malware researchers to leverage the power of LLMs while retaining complete control over their data.
This approach addresses another key limitation: the inability to generate malicious source code accurately. Third-party AI providers prioritize security and alignment, implementing restrictions to prevent misuse, such as generating code for malicious purposes. While this is crucial for safety, it can hinder malware research efforts by occasionally blocking the creation of malicious code necessary for analysis. In contrast, using a locally hosted LLM often comes without these filters, offering greater flexibility for research purposes.
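To illustrate what “locally hosted” looks like in practice, below is a minimal sketch that queries an Ollama instance over its default local REST API; nothing leaves the analyst’s machine. The model name and prompt are placeholders for illustration, not the exact ones ByteCodeLLM uses.

```python
import requests

# Minimal sketch: ask a locally hosted Ollama model a question over its
# default local REST API. Nothing leaves the machine. The model name and
# prompt below are placeholders for illustration.
OLLAMA_URL = "http://localhost:11434/api/generate"

def query_local_llm(prompt: str, model: str = "gemma2:9b") -> str:
    response = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["response"]

if __name__ == "__main__":
    print(query_local_llm("Summarize what a Python .pyc file contains."))
```

Because the model runs under Ollama on the analyst’s own hardware, it can be swapped for whichever local model performs best (we return to model selection below).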
It’s important to note that as malware researchers, our primary focus is not achieving perfect decompilation. During the tool’s development, we prioritized functionality over accuracy, adding features that directly support our research needs, such as efficient CPU-only operation and cross-platform support.
Let’s take, for example, the “start discord” function from the known open-source malware “Empyrean”. This function is called after injecting a malicious payload into Discord’s desktop app source code. Running Update.exe restarts the application covertly to ensure that the payload written on the hard drive is loaded into the application without waiting for a PC restart.
Figure 2: Original Source Code Compared to PyCDC Decompilation
Figure 3: Original Source Code Compared to ByteCodeLLM Decompilation
Figures 2 and 3 show a side-by-side comparison of the original malicious code to the output from pycdc (currently the most popular Python bytecode conversion tool) and ByteCodeLLM. Although ByteCodeLLM is not flawless, it generates comprehensible code without crashing — a notable improvement over existing solutions.
Other researchers have explored similar ideas, but we did not find any comparable open-source tools currently available.
The next section will describe how ByteCodeLLM works and utilizes LLMs to decompile Python bytecode to source code.
Figure 4: ByteCodeLLM Flow Graph
ByteCodeLLM proceeds through the following stages (a simplified sketch of the flow appears below):
1. EXE File:
- Extract PYCs from EXE: Tools like PyInstaller are used to package Python scripts into executable files. We can use pyinstxtractor-ng to reverse this process and retrieve the compiled Python files (PYC files) that contain all the logic. This step has not been implemented yet.
2. PYC File:
- pycdas disassembles to bytecode: PYC files contain bytecode, a lower-level representation of the Python source code. The pycdas tool disassembles the PYC files into a human-readable bytecode format.
- Extracted object hierarchy from the bytecode representation: This step involves analyzing the pycdas output to map the program’s structure and the relationships between different code objects (functions, classes, etc.).
3. ByteCode:
- Extracted object hierarchy: By parsing the pycdas output, we can generate a hierarchy tree for all the code objects and classes inside the PYC file. This is used to split code into functions for bypassing context length limitations in LLMs and for optimization, which we cover in the following steps.
4. Python source:
- Check for decompilation errors: Scanning through the pycdc output, we find all the functions and classes that have not been reconstructed correctly.
- Select bytecode of functions needing reconstruction: This step involves locating the corresponding bytecode of the functions identified in the previous stage; these are then prioritized for LLM-assisted reconstruction.
5. Large Language Model:
- Convert bytecode to source code through an LLM: The bytecode we’ve identified in the previous stage, along with the initial decompiled output, is fed into an LLM. The LLM uses its knowledge of Python syntax and common coding patterns to generate more accurate and complete source code.
- Inject reconstructed functions back into source code: The LLM-generated code for the failed-to-convert functions is integrated back into the overall decompiled source code.
- Output: Decompiled Python code.
ByteCodeLLM represents a step forward in decompilation technology by harnessing the power of LLMs to overcome the limitations of traditional tools. This allows researchers to analyze and understand code more effectively, particularly in the context of malware analysis.
Breaking the code down into functions and converting only the failed-to-convert functions through the LLM is done for two primary reasons. First, it cuts down the time needed, and second, most LLMs don’t have the context length to read through an entire file’s worth of bytecode. By breaking a file down into functions, we circumvent this issue.
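To make the flow above concrete, here is a heavily simplified sketch of the core loop, assuming pycdas and pycdc are available on the PATH and a local Ollama model (e.g., gemma2:9b) is serving requests. The prompt wording, failure-detection heuristic and helper names are illustrative rather than ByteCodeLLM’s actual implementation; in particular, the real tool splits the module into individual functions using the extracted object hierarchy, while this sketch checks and converts the whole module at once.

```python
import subprocess
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def disassemble(pyc_path: str) -> str:
    # Stages 2-3: pycdas emits a human-readable bytecode listing of the PYC file.
    return subprocess.run(["pycdas", pyc_path],
                          capture_output=True, text=True, check=True).stdout

def decompile(pyc_path: str) -> str:
    # Stage 4: pycdc produces an initial, possibly partial, decompilation.
    return subprocess.run(["pycdc", pyc_path],
                          capture_output=True, text=True).stdout

def looks_broken(source: str) -> bool:
    # Hypothetical heuristic: pycdc flags failures with warnings such as
    # "Unsupported opcode" or "Decompyle incomplete"; the exact markers vary
    # by version, so treat this check as illustrative only.
    return ("Unsupported opcode" in source
            or "Decompyle incomplete" in source
            or not source.strip())

def reconstruct_with_llm(bytecode: str, partial_source: str,
                         model: str = "gemma2:9b") -> str:
    # Stage 5: feed the bytecode plus the partial decompilation to a local
    # LLM. The prompt wording is a placeholder, not ByteCodeLLM's own prompt.
    prompt = ("Reconstruct valid Python source code from this bytecode "
              "listing. Use the partial decompilation as a hint.\n\n"
              f"Bytecode:\n{bytecode}\n\nPartial decompilation:\n{partial_source}\n")
    resp = requests.post(OLLAMA_URL, json={"model": model, "prompt": prompt,
                                           "stream": False}, timeout=600)
    resp.raise_for_status()
    return resp.json()["response"]

def process(pyc_path: str) -> str:
    bytecode = disassemble(pyc_path)
    source = decompile(pyc_path)
    # The real tool walks the object hierarchy extracted from the pycdas
    # output and sends only the individual functions that failed, to stay
    # inside the model's context window; for brevity this sketch checks and
    # converts the whole module at once.
    if looks_broken(source):
        source = reconstruct_with_llm(bytecode, source)
    return source

if __name__ == "__main__":
    import sys
    print(process(sys.argv[1]))
```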
How did we choose the LLM to work with?
While Ollama and Hugging Face present a wide range of local, open LLM models to choose from, we didn’t find any model that advertises the ability to decompile Python bytecode to source. As a result, we needed to create one or find one that would work for us.
We selected several leading candidates from Hugging Face’s leaderboard that claim Python code generation capability. We then developed a dedicated test to check their decompilation accuracy over a dataset of Python snippets, by measuring the similarity between each original snippet and the generated one.
Figure 5 shows the results of four similarity tests for some of the LLMs we evaluated. The similarity is a score of how well the LLM reconstructed the bytecode compared to the corresponding source code from our database. Note the results from Gemini 1.5-flash (gold) versus ByteCodeLLM (blue), our own fine-tuned model based on the much smaller Llama 3.2 3B.
Figure 5: Similarity Tests Between Different LLM Decompilation Results and the Original Source Code
We decided on four different code similarity metrics to measure the accuracy of the models (a simple sketch of how they can be computed appears below):
- Edit Distance: Counts changes to match code.
- Jaccard Similarity: Measures overlap of code tokens.
- Cosine Similarity: Calculates the angle between code embeddings.
- NCD (Normalized Compression Distance): Estimates similarity via compression.
Note: The NCD metric is inverted—lower scores indicate higher similarity.
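For reference, the sketch below shows one straightforward way to compute the four metrics for a pair of code snippets. It is a simplified illustration rather than our exact test harness; in particular, it uses difflib’s ratio as a normalized stand-in for edit distance and token-count vectors in place of learned code embeddings for the cosine metric.

```python
import difflib
import math
import zlib
from collections import Counter

def tokens(code: str) -> list[str]:
    return code.split()

def edit_distance_score(a: str, b: str) -> float:
    # Normalized stand-in for edit distance: 1.0 means identical strings.
    return difflib.SequenceMatcher(None, a, b).ratio()

def jaccard_similarity(a: str, b: str) -> float:
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def cosine_similarity(a: str, b: str) -> float:
    # Approximation: token-count vectors instead of learned code embeddings.
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def ncd(a: str, b: str) -> float:
    # Normalized Compression Distance: lower means more similar.
    ca, cb = len(zlib.compress(a.encode())), len(zlib.compress(b.encode()))
    cab = len(zlib.compress((a + b).encode()))
    return (cab - min(ca, cb)) / max(ca, cb)

original = "def add(a, b):\n    return a + b\n"
generated = "def add(x, y):\n    return x + y\n"
for name, fn in [("edit", edit_distance_score), ("jaccard", jaccard_similarity),
                 ("cosine", cosine_similarity), ("ncd", ncd)]:
    print(f"{name}: {fn(original, generated):.3f}")
```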
Our initial evaluation showed Gemini-1.5-flash to be the top performer. However, due to its commercial nature, it fell outside our selection criteria. Gemma 2, with 9 billion parameters, demonstrated superior performance compared to all other models we tested, including our own ByteCodeLLM.
Based on these findings, we’ve discontinued the development of the internal model for this use case and currently suggest Gemma 2 as the default model. We encourage the community to share any promising new models for this application.
Improvement over current tools
To evaluate the effectiveness of our enhancements over current tools, we developed a comprehensive sample set that contains a wide range of Python language features, including specific malicious code patterns that malware researchers frequently encounter.
These code snippets were compiled across all major new Python versions, starting from 3.10 up to the latest 3.13 release. We then compared the output from both pycdc and our project to assess performance and accuracy. Figure 6 provides a visual representation of the pycdc success rate per Python version, which is a measure of the percentage of functions without decompilation errors.
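As a rough illustration of how such a per-version success rate can be measured, the sketch below compiles each test snippet with the running interpreter’s py_compile module and counts how many of the resulting PYC files pycdc decompiles without emitting a failure marker. The marker strings and the samples directory are assumptions for illustration (the exact warnings pycdc prints vary by version), and it scores per snippet rather than per individual function.

```python
import py_compile
import subprocess
from pathlib import Path

# Illustrative failure markers; the exact warnings pycdc emits vary by version.
FAILURE_MARKERS = ("Unsupported opcode", "Decompyle incomplete")

def compile_snippet(py_path: Path) -> Path:
    # Compile a .py test snippet to .pyc with the running interpreter, so the
    # bytecode matches whichever Python version (3.10-3.13) is being tested.
    pyc_path = py_path.with_suffix(".pyc")
    py_compile.compile(str(py_path), cfile=str(pyc_path), doraise=True)
    return pyc_path

def pycdc_succeeded(pyc_path: Path) -> bool:
    result = subprocess.run(["pycdc", str(pyc_path)],
                            capture_output=True, text=True)
    output = result.stdout + result.stderr
    return result.returncode == 0 and not any(m in output for m in FAILURE_MARKERS)

def success_rate(sample_dir: Path) -> float:
    samples = sorted(sample_dir.glob("*.py"))
    ok = sum(pycdc_succeeded(compile_snippet(p)) for p in samples)
    return ok / len(samples) if samples else 0.0

if __name__ == "__main__":
    # "samples" is a hypothetical directory holding the test snippets.
    print(f"pycdc success rate: {success_rate(Path('samples')):.1%}")
```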
Figure 6: Pycdc Success Rate per Python Version Over a 200+ Sample Set
Figure 6 reveals several key insights. First, we observe a declining success rate as the Python versions progress, ending in a complete inability to decompile Python 3.13 bytecode. This failure stems from Python’s changes in function definition (as illustrated in Figure 7), which fundamentally breaks pycdc.
Second, our dataset of over 200 code snippets is noticeably skewed toward shorter functions. A manual review of pycdc’s performance indicates a clear trend: the longer the function, the higher the likelihood of decompilation failure. This correlation exists because longer functions usually contain complex logic, and that logic is translated into bytecodes that are still being optimized and tinkered with; they often change from version to version as the Python language evolves and improves.
Figure 7: Results of Python 3.13 PYCDC Decompilation
ByteCodeLLM, on the other hand, has been able to reconstruct every function and class in the sample set. The one caveat is Python 3.13: because pycdc cannot produce even a partial decompilation that we can use to optimize the runtime, the project falls back to an unoptimized converter algorithm, which makes the process much longer. Nonetheless, it still saves us time, as we don’t need to analyze the bytecode manually.
Summary:
Keeping pace with the rapid evolution of Python — especially in the realm of malware analysis — demands innovative solutions. We believe ByteCodeLLM can be a crucial tool for security researchers, effectively tackling the challenge of decompiling Python executables. By harnessing the power of local LLMs, ByteCodeLLM offers improved accuracy and ensures data privacy. Notably, it seems to be the only locally hosted tool currently capable of decompiling Python 3.13 and provides a significant advantage in analyzing and understanding the latest malware threats.
Privacy recommendations:
- Use Local LLMs: Consider using local LLMs to process sensitive data and maintain privacy.
- Minimize sensitive data sharing: Avoid inputting highly sensitive or confidential information (PII) into LLMs.
- Anonymize data: When possible, anonymize or de-identify data before sharing it with an LLM.
- Understand provider policies: Carefully review the privacy policies of LLM providers to understand how your data may be used.
- Stay informed: Keep abreast of the latest research and security best practices regarding LLMs.
Amir Landau is a malware research team leader and David El is a malware researcher at CyberArk Labs.