4 Lessons I Learned Trying to Build a Secrets Scanner
January 29, 2015 | DevOps | Shaun Gosse
I worked on building a secrets scanner: software to find private keys, API keys, or passwords. How? Keys can often be identified by their extension or by the head of the file. Most other secrets are automatically generated and look like ‘random’ values. So if we eliminate anything with a pattern, what is left are the secrets. Sounds simple, right?
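One common way to operationalize "looks random" is Shannon entropy. The sketch below is illustrative, not the scanner's actual detector, and the 4.0 bits-per-character threshold is an assumed value, not a tuned one:

```ruby
# Estimate the Shannon entropy (bits per character) of a string.
# High-entropy strings are candidate machine-generated secrets.
def shannon_entropy(str)
  return 0.0 if str.empty?
  counts = str.each_char.tally
  len = str.length.to_f
  counts.values.reduce(0.0) do |sum, c|
    p = c / len
    sum - p * Math.log2(p)
  end
end

# The threshold here is a guess for illustration; a real scanner
# would tune it per token length and character set.
def looks_random?(str, threshold = 4.0)
  shannon_entropy(str) > threshold
end
```

On this measure, dictionary words score low while base64-style tokens score high, which is exactly the pattern-versus-random split described above.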
I created a multiple-pass system. There is a special-purpose initial pass which looks for files like “NAME.pub” so that it can flag “NAME” as a possible private key file in its second pass. After that it looks for files of interest (key files, configuration files, or code files) by extension; it then examines the identified files in greater detail according to their type.
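The two initial passes can be sketched roughly as follows. The extension lists and symbol names here are invented for illustration; the real scanner's lists and structure differ:

```ruby
require "find"

# Illustrative extension lists, not the scanner's real ones.
KEY_EXTS  = %w[.pem .key .ppk].freeze
CODE_EXTS = %w[.rb .py .js .sh .conf .yml].freeze

def scan_passes(root)
  candidates = []
  pub_stems  = []

  # Pass 1: note every "NAME.pub" so "NAME" can be flagged later
  # as a possible private key file.
  Find.find(root) do |path|
    pub_stems << path.sub(/\.pub\z/, "") if path.end_with?(".pub")
  end

  # Pass 2: flag likely private keys, and queue files of interest
  # (by extension) for deeper type-specific examination.
  Find.find(root) do |path|
    next unless File.file?(path)
    ext = File.extname(path)
    if pub_stems.include?(path) || KEY_EXTS.include?(ext)
      candidates << [:possible_private_key, path]
    elsif CODE_EXTS.include?(ext)
      candidates << [:inspect_contents, path]
    end
  end
  candidates
end
```

The point of the dedicated first pass is that a file named `id_rsa` has no telltale extension at all; only its sibling `id_rsa.pub` gives it away.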
The current code is still very preliminary and results vary greatly, but as an example: on my personal machine, with about 45GB mounted across almost 400,000 files on an HDD, a whole-system scan took about 22 minutes; it examined 28,434 files selected by extension and ultimately flagged 1,171. Most of what it flags is currently false positives, in part because, although the false positive rate is low, there are very few files with real secrets on my machine. Additionally, source code other than Ruby is not yet truly parsed, so there are cases where it misses a true result or flags benign ones that it could judge correctly with proper parsing.
There are a lot of keys and source code files on a machine. I was surprised at just how many keys there were; LibreOffice alone had more than 30. Now, these could be excluded very simply by distinguishing public keys from private keys, which is not yet done, but it illustrates some of the volume I found. Similarly, I was not expecting the thousands of source files from all sorts of applications, although it is logical in hindsight given the interpreted languages they are using.
There are a few obvious ways to reduce the problem. We can search only a targeted location; do a base scan and subsequently consider only modified files; or, even better, work intelligently with diffs. That way, whatever false positives remain need not be reviewed more than once.
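The base-scan-then-changes idea can be sketched with a content-hash baseline. The function names and approach here are my illustration of the idea, not the scanner's implementation:

```ruby
require "digest"
require "find"

# Record a baseline: content hash for every file under root.
def snapshot(root)
  hashes = {}
  Find.find(root) do |path|
    hashes[path] = Digest::SHA256.file(path).hexdigest if File.file?(path)
  end
  hashes
end

# On a later run, surface only files that are new or whose
# contents changed since the baseline; unchanged files (and any
# false positives already reviewed in them) are skipped entirely.
def changed_since(baseline, root)
  snapshot(root).reject { |path, digest| baseline[path] == digest }.keys
end
```

A real tool would persist the baseline to disk and could go further by diffing file contents, so that only changed lines, not whole changed files, need re-review.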
Use a language parsing library to extract strings from source code. It seems like a simple problem, figuring out what is a string and what isn't, but it's actually non-trivial. If you start trying to solve it with regexes, you will eventually find you need a parser anyhow. In my code, only a Ruby parser has been added so far, but the other languages will need to be parsed as well to achieve accurate results.
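For Ruby, the standard library already ships a lexer that makes this easy. Whether the scanner uses Ripper specifically is my assumption; the post only says a Ruby parser was added. A minimal sketch:

```ruby
require "ripper"

# Pull string-literal contents out of Ruby source using the stdlib
# Ripper lexer, instead of guessing at quoting rules with regexes.
# Each lexed token is [[line, col], type, token, ...]; string bodies
# have type :on_tstring_content.
def ruby_string_literals(source)
  Ripper.lex(source)
        .select { |_pos, type, _tok| type == :on_tstring_content }
        .map { |_pos, _type, tok| tok }
end
```

The lexer handles single quotes, double quotes, escapes, and heredocs for free; the extracted strings can then be fed straight into an entropy or pattern check.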
The most intractable problem for the software is separating the hard false positives. For instance, cryptographic values, like a hash, will also appear random. When these values are embedded in a source file, they trigger my detector. In general, cryptographic values and machine generated secrets are hard to distinguish by software because they can come from the same sample space.
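One partial mitigation, my suggestion rather than anything from the scanner, is to downgrade hex strings whose length matches a common digest size, since those are more often checksums than credentials:

```ruby
# Heuristic: hex strings whose length matches a common digest size
# (MD5 = 32, SHA-1 = 40, SHA-256 = 64 hex chars) are more likely
# checksums than secrets. This can only *downgrade* a match, not
# clear it: a hex-encoded secret of the same length is
# indistinguishable by form alone, which is the core problem.
DIGEST_LENGTHS = [32, 40, 64].freeze

def likely_checksum?(str)
  str.match?(/\A\h+\z/) && DIGEST_LENGTHS.include?(str.length)
end
```

This kind of rule trims the review queue but cannot close the gap entirely, which is why the final judgment falls to a human.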
Ultimately, we must call in a human. But even if many of the strings or files identified are false positives, the area needing human review is greatly reduced. Not only that, but the false positives that are difficult for the software to recognize are often very obvious to a human reviewer. Combined with an efficient UI for review, the software and human reviewer team can be far more effective than either one on its own.