Make Memcpy Safe Again: CodeQL

August 27, 2020 Assaf Sion

Last February, I went to #OffensiveCon20 and, as you might expect, it was awesome. The talks were great, but the real gem was the CodeQL workshop held on the second day of the event. That session inspired us to start researching the potential of CodeQL and how it can be used for variant analysis. During that research, we found 7 new vulnerabilities in FFmpeg, a popular open-source audio and video streaming and conversion framework that is widely used in software such as Google Chrome and VLC Player.

In this blog we’ll review CodeQL and its key features as well as show how I used it to find vulnerable calls to memcpy in several open-source projects.

TL;DR

Hunting bugs in open-source projects has become a lot easier and more fun with CodeQL. We wrote a custom CodeQL query that locates potentially vulnerable memcpy calls and found 7 new vulnerabilities in FFmpeg.

CodeQL: A Very (very) Short Introduction

CodeQL is a framework developed by Semmle and is free to use on open-source projects. It lets a researcher perform variant analysis to find security vulnerabilities by querying code databases generated using CodeQL. CodeQL supports many languages such as C/C++, C#, Java, JavaScript, Python, and Golang.

Once we have generated a code database, we can run premade queries developed by Semmle and the community, or write custom queries of our own.

Generating A Code Database

In order to review examples of what we can do using CodeQL, we first need a code database. Obtaining a code database can be done by downloading a code database generated by Semmle from here, or by generating a code database on our own.

Let’s dive in by creating a small code database from the code below:

codeql database create ~/semmle/databases/example.db --command="clang code.c -o example" --language=cpp

int main(int argc, char **argv)
{
    int size1 = 5;
    int size2 = size1 + 5;

    char *dst = malloc(size1); // allocation size is 5 bytes
    char *src = malloc(size2); // allocation size is 10 bytes

    if (argc > 1 && strcmp(argv[1], "change") == 0)
        size1 = 15;
        
    memcpy(dst, src, size1);
}

Figure 1 – code.c

We will use the database generated from this simple C program in the next three examples.
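To make the bug in Figure 1 explicit, here is a minimal sketch that isolates only the size logic of that program. The helper names `figure1_copy_len` and `figure1_overflows` and the constant `DST_CAPACITY` are hypothetical and exist only for illustration; they are not part of code.c.

```c
#include <stddef.h>
#include <string.h>

/* Capacity of dst in Figure 1: malloc(size1) with size1 == 5. */
enum { DST_CAPACITY = 5 };

/* Hypothetical helper: the length Figure 1 would pass to memcpy
 * for a given first command-line argument (arg may be NULL). */
size_t figure1_copy_len(const char *arg)
{
    size_t len = 5;                        /* size1 = 5 */
    if (arg != NULL && strcmp(arg, "change") == 0)
        len = 15;                          /* size1 = 15 */
    return len;
}

/* Returns 1 when copying that many bytes would write past dst. */
int figure1_overflows(const char *arg)
{
    return figure1_copy_len(arg) > DST_CAPACITY;
}
```

With the argument "change", the copy length becomes 15 while dst still holds 5 bytes (and src only 10), so the memcpy both writes and reads out of bounds.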

The Query Structure

CodeQL’s syntax is very similar to SQL, and consists of these main parts:

Imports – At the beginning of the query we denote which CodeQL libraries we wish to import. For example, to use the basic features of CodeQL for C/C++, we import cpp.
from – We must define the CodeQL variables and their types. Each CodeQL variable represents an object from the CodeQL library, e.g., Function, FunctionCall, VariableAccess, Variable and Expression.
where – Once we’ve defined CodeQL variables, we can then construct the predicates to be applied to them. Although this part is optional, it is also the core of the query.
select – Under this clause, we set how the output is going to look. We can bind CodeQL variables and present them in different ways, usually in a table.

Variable

A variable is a name that holds one or more values. When we refer to a variable in CodeQL, we mean the declaration of that particular variable. The usual declaration rules of procedural languages apply: a variable is declared once per scope.

Using a simple query, we can extract all the variable declarations from the code above:

import cpp

from Variable var
select var, var.getLocation() as location

Figure 2 – Extracting variable declarations with CodeQL

Running that query against the code database we created will produce the following results:

Figure 3 – Variable declarations in vscode

The results are displayed in VS Code because we are using the CodeQL extension for VS Code. We highly recommend setting up a CodeQL workspace in VS Code, even though it is possible to use only the CodeQL CLI tool.

Variable Access

We can also easily find all the accesses to a variable. The query below will present all the access locations to all the variables. By adding a condition to the query, we could look for an access to a specific variable.

import cpp

from VariableAccess var_access
select var_access, var_access.getLocation() as access_location

Figure 4 – Extracting the access locations to all the variables

And the results:

Figure 5 – Variable accesses

We can see that the left column shows the name of the variable we access and the right column the exact location of the access.

Do keep in mind the difference between variable and variable access; while a variable represents the declaration of that variable, variable access represents every access to that variable. Therefore, for example, we see that the variable size1 appears multiple times in the left column since it is accessed several times.

Local Taint Tracking

One of the many reasons CodeQL is such a powerful tool is local/global taint tracking.

CodeQL creates a graph that represents a given code. By doing so, it can track the flow of any variable and tell which variable affected another and where.

In the code above (figure 1), size1 = 5 and size2 = size1 + 5, meaning size1 tainted size2. Using this logic, we can see which variable affects other expressions in our code:

import cpp
import semmle.code.cpp.dataflow.TaintTracking

from Variable source_var, VariableAccess source_access, Expr sink
where
    // Find all the variable accesses that affect any expression in our code
    TaintTracking::localExprTaint(source_access, sink)	
    and
    // Linking the access to the variable itself
    source_access.getTarget() = source_var
select source_var, source_var.getLocation() as source_location, sink, sink.getLocation() as sink_location

Figure 6 – A simple taint tracking example

And the results:

Figure 7 – Taint tracking for variable accesses

Under the column source_var we see all the variable accesses that affect the expression under column sink. For example, in row 7 the access to size1 (size1 = 5) affects the call to memcpy in line 13.
So now we see which variable affects which and where, exactly.

Vulnerable Memcpy

After learning and experimenting a bit with CodeQL, our goal was to write a new query that will find heap-based write buffer overflows caused by memcpy.

The following section describes our thought process, which eventually led us to write a single query that achieves our goal. Since that query is a bit long and complex, we decided that reviewing each of its logical parts would make it easier to understand.

Finding the Calls to Memcpy

Obviously in order to find a vulnerable memcpy call, we first need to find all the calls to memcpy.

In addition to finding all the memcpy calls, we extract the size of that memcpy (the 3rd argument) which will help us distinguish safe from unsafe memcpy calls.

We will create a new code database from the following code:

int main(int argc, char **argv)
{
    int dst_size = 5;
    int src_size = 10;

    char *dst = malloc(dst_size);
    char *src = malloc(src_size);

    memcpy(dst, src, src_size);
}

Figure 8 – An example of a simple, unsafe call to memcpy

After creating the database, we will use the following query:

import cpp
import Utils

from FunctionCall memcpy, Expr size
where
    memcpy.getTarget().hasName("memcpy")
    and
    // Example: memcpy(dst, src, a * b) => size = a || size = b
    size = memcpy.getArgument(2).getAChild*()
    and
    // Example: memcpy(dst, src, a * 5) => size = a || size = 5, the number 5 is not a variable!
    if exists(VariableAccess va | size = va)
    then isNumber(size.(VariableAccess).getTarget())
    else size = size

select memcpy.getLocation() as location, size as value

Figure 9 – Extracting the calls to memcpy

And the results:

Figure 10 – A single call to memcpy, as expected

The “location” column shows the exact location of a memcpy call and the corresponding cell under the “value” column shows all the variables/values that affect how many bytes to copy. As expected, the results show a single memcpy call with src_size bytes to copy.

From an Allocation Function to Memcpy 1st Argument

Heap memory is typically allocated by one of the following:

malloc
calloc
realloc
Custom methods that wrap the above

Given a call to memcpy, we need to find where the memory block provided to it as the destination argument (1st argument) was allocated.

To do so, we can use taint tracking, the same way we showed earlier:

import cpp
import Utils
import semmle.code.cpp.dataflow.TaintTracking

// CallAllocationExpr represent a call to malloc/calloc/realloc/etc./custom wrappers
from CallAllocationExpr alloc, Expr memcpy_dst, FunctionCall memcpy, Expr size
where
    // Setting memcpy, memcpy_dst
    memcpy.getTarget().hasName("memcpy")
    and
    memcpy_dst = memcpy.getArgument(0)
    and

    // Every allocation that flows to the 1st argument in a memcpy call
    TaintTracking::localExprTaint(alloc, memcpy_dst)
    and
    // malloc(a * b) => size = a || size = b
    size = alloc.getAnArgument().getAChild*()
    and
    // malloc(a * 5) => size = a || size = 5, the number 5 is not a variable!
    if exists(VariableAccess va | size = va)
    then isNumber(size.(VariableAccess).getTarget())
    else size = size

select memcpy.getLocation()  as memcpy_location, 
    alloc.getLocation()  as allocation_location, 
    size

Figure 11 – Finding the allocation function that allocated the memory block provided as the destination argument in a memcpy

And the results:

Figure 12 – Finding the destination buffer size in memcpy

“memcpy_location” – the location of the memcpy.
“allocation_location” – the location of the allocation function call that allocated the memory for the destination argument of the matched memcpy.
“size” – the variable/value that affected the size of the matched allocation function

Notice, we copy src_size bytes from src to dst even though dst points to an allocation with the size of dst_size.

Since src_size and dst_size are different variables, this puts this memcpy call on our list of potentially vulnerable call sites – and it is a great way of surfacing these weaknesses at a fairly low level of effort!
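For contrast, here is a minimal sketch of what a call site looks like when the copy length is tied to the destination size. The helper `copy_bounded` is hypothetical (it is not part of libc or of the query); it only illustrates the relationship between the allocation size and the memcpy length that the query is trying to reason about.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libc: copies n bytes only when they fit
 * in a destination of dst_size bytes. Returns 0 on success, -1 otherwise. */
int copy_bounded(void *dst, size_t dst_size, const void *src, size_t n)
{
    if (dst == NULL || src == NULL)
        return -1;
    if (n > dst_size)
        return -1;          /* would overflow the destination */
    memcpy(dst, src, n);
    return 0;
}
```

Applied to Figure 8, a call such as copy_bounded(dst, dst_size, src, src_size) would be rejected, because src_size (10) is larger than dst_size (5).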

Things Are Getting Complicated – Affecting Variables

The following code shows a slightly more complicated scenario, in which the allocation size is affected by the memcpy length variable.

int main(int argc, char **argv)
{
    int memcpy_size = 5;
    int src_size = 10;
    int dst_size = memcpy_size * src_size;

    char *dst = malloc(dst_size);
    char *src = malloc(src_size);

    memcpy(dst, src, memcpy_size);
}

Figure 13 – Simply allocating and copying

Here, “dst” points to an allocation of dst_size bytes, and we copy memcpy_size bytes from “src” to “dst”.
However, this memcpy call is not vulnerable: the value of dst_size is derived from memcpy_size and src_size, and in this case dst_size > memcpy_size.

The query below uses taint tracking to find all the variables that dst_size is derived from. With a similar query, we can find all the variables that affect the value of memcpy_size.

import cpp
import Utils
import semmle.code.cpp.dataflow.TaintTracking

// CallAllocationExpr represent a call to malloc/calloc/realloc/etc./custom wrappers
from CallAllocationExpr alloc, Expr memcpy_dst, FunctionCall memcpy, Expr size, VariableAccess affecting_size
where
    // Setting memcpy, memcpy_dst
    memcpy.getTarget().hasName("memcpy")
    and
    memcpy_dst = memcpy.getArgument(0)
    and

    // Every allocation that flows to the 1st argument in a memcpy call
    TaintTracking::localExprTaint(alloc, memcpy_dst)
    and
    // malloc(a * b) => size = a || size = b
    size = alloc.getAnArgument().getAChild*()
    and
    // malloc(a * 5) => size = a || size = 5, the number 5 is not a variable!
    if exists(VariableAccess va | size = va)
    then isNumber(size.(VariableAccess).getTarget())
    else size = size

    and
    // Setting affecting_size -> all the variables that affect the size of the allocation
    TaintTracking::localExprTaint(affecting_size, size)

select memcpy.getLocation()  as memcpy_location, 
    alloc.getLocation()  as allocation_location, 
    size, affecting_size

Figure 14 – Extracting the variables that affect the allocation size

The query above is almost identical to the previous query (figure 11) with a small addition of a 2nd taint tracking check.

And of course, the results:

Figure 15 – Variables affecting the destination buffer in a memcpy

As expected, we see that dst_size is derived from memcpy_size, src_size, and dst_size.

At this point, we might assume that the memcpy in the snippet above is not vulnerable, but we shouldn’t. Consider the following variation:

int main(int argc, char **argv)
{
    int memcpy_size = 5;
    int src_size = 10;
    int dst_size = memcpy_size * src_size;

    char *dst = malloc(dst_size);
    char *src = malloc(src_size);

    // Changing memcpy_size to be bigger than dst_size
    memcpy_size = 100;
    memcpy(dst, src, memcpy_size);
}

Figure 16 – Why we need Global Value Numbering

Using the previous query to identify which variables affect dst_size is not enough; we need memcpy_size to have the same value at both the definition of dst_size and at the call to memcpy.

To do so, we will use a CodeQL feature named “Global Value Numbering.” According to Semmle, “The global value numbering library provides a mechanism for identifying expressions that compute the same value at runtime.” Value numbering is extremely powerful since it lets us determine whether two expressions are equal. We can now quickly infer which calls to memcpy are safe and which should be analyzed further.

However, we cannot use this mechanism to determine whether one expression is bigger or smaller than another; it only tells us whether two expressions are known to compute the same value. Let’s edit our query and add some global value numbering:

import cpp
import Utils
import semmle.code.cpp.dataflow.TaintTracking
import semmle.code.cpp.valuenumbering.GlobalValueNumbering

// CallAllocationExpr represent a call to malloc/calloc/realloc/etc./custom wrappers
from CallAllocationExpr alloc, Expr memcpy_dst, FunctionCall memcpy, VariableAccess size, VariableAccess affecting_size
where
    // Setting memcpy, memcpy_dst
    memcpy.getTarget().hasName("memcpy")
    and
    memcpy_dst = memcpy.getArgument(0)
    and

    // Every allocation that flows to the 1st argument in a memcpy call
    TaintTracking::localExprTaint(alloc, memcpy_dst)
    and
    // malloc(a * b) => size = a || size = b
    size = alloc.getAnArgument().getAChild*()
    and
    // size should be a variable and a number.
    isNumber(size.getTarget())

    and
    // Get the variables that affect the size of the malloc
    TaintTracking::localExprTaint(affecting_size, size)

    and
    exists
    (
        // memcpy_length is the variable that affects how many bytes to copy in a memcpy
        VariableAccess memcpy_length | memcpy_length = memcpy.getArgument(2).getAChild*()
        and
        (
            
            (not memcpy_length.getTarget() = affecting_size.getTarget())
            or
            // Since it's an or expression, the check below only matters for accesses to the same variable
            globalValueNumber(memcpy_length) = globalValueNumber(affecting_size)
        )
        
    )
select memcpy.getLocation()  as memcpy_location, 
    alloc.getLocation()  as allocation_location, 
    size, affecting_size

Figure 17 – Using “GVN” in a query

We can see that this query is the same as the previous one (figure 14) with a small (but significant) addition. Now, we make sure that if a variable affects both the size of the allocation and the size value in the memcpy call, that variable keeps the same value.

Executing this query against the non-vulnerable code (figure 13) will present these results:

Figure 18 – The awesomeness of GVN (part 1)

But, executing our query against the vulnerable code (figure 16, where we changed memcpy_size right before the memcpy) will produce the following results:

Figure 19 – The awesomeness of GVN (part 2)

As you can see, memcpy_size no longer appears as a variable affecting dst_size, since memcpy_size does not hold the same value at the call to memcpy.
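The same concern can be addressed on the C side. As a rough sketch, and only as one possible design, the capacity can be stored next to the pointer at allocation time, so that the bound used at the copy is the value the allocation was made with rather than a separate variable that may have been reassigned in between. The type `struct sized_buf` and its functions are hypothetical names introduced for this illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical type: a heap buffer that remembers its own capacity. */
struct sized_buf {
    char  *data;
    size_t cap;
};

/* Allocates cap bytes; returns 0 on success, -1 on failure. */
int sized_buf_init(struct sized_buf *b, size_t cap)
{
    b->data = malloc(cap);
    b->cap  = (b->data != NULL) ? cap : 0;
    return (b->data != NULL) ? 0 : -1;
}

/* Copies n bytes into the buffer only if n fits in the recorded capacity. */
int sized_buf_write(struct sized_buf *b, const void *src, size_t n)
{
    if (b->data == NULL || n > b->cap)
        return -1;
    memcpy(b->data, src, n);
    return 0;
}

/* Releases the buffer and resets the recorded capacity. */
void sized_buf_free(struct sized_buf *b)
{
    free(b->data);
    b->data = NULL;
    b->cap  = 0;
}
```

In the scenario of Figure 16, a write of 100 bytes into a 50-byte buffer would be refused regardless of what happened to memcpy_size after the allocation.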

Guarding the Memcpy

In CodeQL, guards allow us to identify conditions that control the execution of other parts in our program. For instance:

int main(int argc, char **argv)
{
    int size = 5;
    int memcpy_size = 10;

    char *dst = malloc(size);
    char *src = malloc(size);

    if (memcpy_size < size)
    {	
        memcpy(dst, src, memcpy_size);
        return 0;
    }
    else
    {
        return -1;
    }
    
}

Figure 20 – Correctly guarding a memcpy

In this example, the if statement is guarding the memcpy. If the condition holds at runtime, the call to memcpy will occur; otherwise, it will be skipped.

Our previous queries would still flag this memcpy as potentially vulnerable, but thanks to the guard, it is not!

Not only does this guard control the memcpy, it also guards it correctly. The following (insufficiently guarded) example shows what “correctly” means:

int main(int argc, char **argv)
{
    int size = 5;
    int memcpy_size = 10;

    char *dst = malloc(size);
    char *src = malloc(size);

    if (memcpy_size > 0)
    {
        memcpy(dst, src, memcpy_size);
        return 0;
    }
    else
    {
        return -1;
    }
    
}

Figure 21 – insufficient guard example

Even though this memcpy is guarded as well, the guard itself is checking a condition that is irrelevant when determining whether this memcpy is vulnerable or not.

Since this guard is only checking the lower bound of memcpy_size, memcpy_size could be bigger than size, and this will cause an overflow.
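As a sketch of what a sufficient guard could look like for the signed int lengths used in Figures 20 and 21, the check below combines both bounds. The helpers `copy_len_ok` and `guarded_copy` are hypothetical names used only for this illustration. The lower bound matters too: a negative int passes an upper-bound-only check and is then converted to a very large size_t when handed to memcpy.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 when len is a valid number of bytes to copy
 * into a destination of dst_size bytes, 0 otherwise. */
int copy_len_ok(int len, int dst_size)
{
    /* Lower bound: rejects negative lengths (huge once converted to size_t).
     * Upper bound: rejects lengths larger than the destination. */
    return len >= 0 && dst_size >= 0 && len <= dst_size;
}

/* Guarded copy: memcpy only runs when copy_len_ok holds.
 * Returns 0 on success, -1 when the copy was refused. */
int guarded_copy(char *dst, int dst_size, const char *src, int len)
{
    if (!copy_len_ok(len, dst_size))
        return -1;
    memcpy(dst, src, (size_t)len);
    return 0;
}
```

With the values from Figure 21 (size = 5, memcpy_size = 10), guarded_copy refuses the copy, whereas the memcpy_size > 0 guard lets it through.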

The following query is a snippet from the final query (GitHub link at the end of the blog):

predicate isMemcpyNotGuardedEnough(FunctionCall memcpy){
    exists
    (
        BasicBlock bb, GuardCondition gc, Expr left, Expr right | 
        bb = memcpy.getBasicBlock()
        and 
            // No Guards at all - Easy scenario
            (
                not gc.controls(bb, _)
                
            )
            // Guard exists - checking the type of the guard and where is the length variable.
            or
            (
                gc.controls(bb, _)
                and
                (
                    (
                        // Condition has the form: x < y and must be true in order for memcpy to execute
                        // Make sure the check doesn't check the maximum value of the length variable
                        exists
                        (
                            boolean enterBlock |
                            gc.ensuresLt(left, right, _, bb, enterBlock)
                            and
                            if enterBlock = true
                            then not lengthVariable.getTarget() = left.getAChild*().(VariableAccess).getTarget()
                            else not lengthVariable.getTarget() = right.getAChild*().(VariableAccess).getTarget()
                        )
                    )
                    or
                    (
                        // Meaning guard is from the form of x == y to execute memcpy
                        gc.ensuresEq(left, right, _, bb, true)
                        and
                        // In that case, make sure the length variable is not checked
                        // If it is being checked, it means length must have a specific value => well guarded
                        not 
                        (
                            lengthVariable.getTarget() = gc.getAChild*().(VariableAccess).getTarget()
                        )
                        
                    )
                )
            )
    )
}

Figure 22 – We make sure that if a memcpy is guarded, then it is guarded correctly

The logic here is simple: determine whether the guard checks an upper bound on the variables that make up the 3rd argument (the length) of a memcpy.

With this, we can determine if a memcpy call is guarded correctly or not. We can now filter out calls to memcpy that might have been vulnerable, but the guard prevents the bug from occurring.

Sanity Check

As mentioned earlier, we reviewed each logical part from the finalized query (GitHub link below).

Eventually, we wrote a single query that contains all the logical steps we reviewed in this blog.

To validate the query, we must prove that the query works as expected. To do so, we have to accomplish two things:

Prove we find vulnerable calls to memcpy
Clear the safe calls to memcpy – narrowing down the number of calls a researcher must review

To check the credibility of that query, we created code databases with existing vulnerabilities that are caused by an unsafe call to memcpy, for example:

CVE-2020-12284 in FFmpeg
CVE-2016-9453 in LibTiff

To eliminate false positives, we tested the query against different versions of several projects and cleared 90%-99% of the calls to memcpy:

ImageMagick
GraphicsMagick
Linux/Torvalds
Many more

Finding X Vulnerabilities

Once we deemed the query reliable, we decided to run it against an updated (at the time) version of FFmpeg. I chose FFmpeg for two reasons:

Familiarity with the code and how to debug it
New code is shipped to the library daily

After creating a new code database from FFmpeg, executing the query produced the following results:

Figure 23 – Vulnerable memcpy calls

It was extremely easy to pull up these 7 new calls to memcpy that did not appear during the sanity check procedure. Digging deeper showed that those calls are, in fact, vulnerable. An attacker can control the address of both source and destination in those calls to memcpy and cause a read/write heap-based overflow that could lead to RCE.

Conclusions

Besides being a potent tool, CodeQL is relatively easy to learn and use for vulnerability research. The community is growing, and the language is evolving every day. Still, mastering all of its capabilities might take time. If you have an idea for a query, I highly recommend working in as organized a way as possible:

Define the bug you’re looking for. What are you looking to find? What are you not looking to find?
Break down the query. Find out which logical components you need in order to find the bug you’re looking for
Make sure each logical part returns the results you expect before writing a single complex query
Enjoy the process and learn from your mistakes

Finally, CodeQL is relatively new and still evolving. Keep in mind that bugs might exist, so stay sharp and understand everything you write.

Future Work

Right now, the query checks the flow from standard allocation functions to the standard memcpy call.

If the code you’re analyzing has wrappers to standard allocation functions, you might need to add small modifications in the query. Those modifications will occur in the Config.qll and Utils.qll files. By creating a new CodeQL class that will represent those wrapper functions, we could solve that issue, and no adjustments would be necessary.
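For readers unfamiliar with the pattern, a wrapper of the kind mentioned above might look like the sketch below. The names `xmalloc` and `xmalloc_array` are hypothetical and are not taken from FFmpeg or from the query repository; the point is only that call sites use the wrapper's name rather than malloc, so the allocation model in a query may need to be told about such functions.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical project-specific wrapper around malloc:
 * aborts instead of returning NULL when memory runs out. */
void *xmalloc(size_t size)
{
    void *p = malloc(size);
    if (p == NULL && size != 0) {
        fprintf(stderr, "xmalloc: out of memory (%zu bytes)\n", size);
        abort();
    }
    return p;
}

/* Hypothetical array wrapper: allocates count * elem_size bytes,
 * returning NULL when the multiplication would overflow size_t. */
void *xmalloc_array(size_t count, size_t elem_size)
{
    if (elem_size != 0 && count > (size_t)-1 / elem_size)
        return NULL;
    return xmalloc(count * elem_size);
}
```

Here the allocation size of xmalloc_array is the product of two arguments, which is the same shape as the malloc(a * b) case handled in the queries above.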

Finally, we will continue to perform variant analysis by writing new queries to find even more complex and unique vulnerabilities.

Links & References

Learn CodeQL – https://codeql.github.com/docs/codeql-for-visual-studio-code/
The query – https://github.com/assafsion/DangerousMemcpy
