Earlier this year we announced Nosey Parker, a new scanner that uses machine learning techniques to detect hardcoded secrets in source code with few false positives. Since then we’ve continued its development and expanded its use in security engagements at Praetorian. In a few cases Nosey Parker has contributed to critical-severity findings, such as complete infrastructure takeover. In this post I’ll discuss some of the enhancements we’ve made, summarize the things we’ve learned about secret scanning, and hint at what’s next for Nosey Parker.
Nosey Parker: low-noise secret scanning with machine learning
Six months ago, Nosey Parker supported scanning only Git repositories, using regular expression-based matching plus a machine learning-based denoiser to suppress false positive findings. Since then we have continued development and made the following substantial improvements:
More flexibility: Nosey Parker can now scan arbitrary files and directories, rather than being limited to source code in Git repositories. It has found secrets in Docker and firmware images, in executables, and in memory dumps. The tool also now includes some simple OSINT capabilities, allowing users to enumerate and scan GitHub organizations and users simply by naming them.
New and improved rules: We added a dozen new regular expression rules based on false negatives our security engineers reported during engagements. Furthermore, we revised ten other rules to improve signal-to-noise or fix typos that had prevented expected matching. (Yes, several regular expressions in existing open-source scanners have typos in them.)
More useful reporting of findings: Instead of terse JSON output, a new human-oriented reporting format shows syntax-highlighted source code snippets and provides a GitHub permalink to the commit that first introduced a secret (when reporting findings from code hosted on GitHub). The tool groups and deduplicates findings by secret content, which in practice improves signal-to-noise by an order of magnitude: instead of ten variations of the same finding, the report shows it once. This new format has been hugely helpful when triaging tool findings.
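The grouping idea itself is simple. Here is a minimal sketch; the findings and secrets below are invented, and Nosey Parker's real data model is richer than these tuples:

```python
from collections import defaultdict

# Hypothetical findings as (secret, path, line) tuples.
findings = [
    ("ghp_FAKE0example0token0001", "config/prod.env", 3),
    ("ghp_FAKE0example0token0001", "deploy/settings.py", 41),
    ("AKIAFAKEEXAMPLEKEY00", "ci/workflow.yml", 12),
]

# Group and deduplicate by the secret content itself, so each unique secret
# appears once in the report, with every location listed beneath it.
grouped = defaultdict(list)
for secret, path, line in findings:
    grouped[secret].append((path, line))

for secret, locations in grouped.items():
    print(f"{secret} ({len(locations)} occurrences)")
    for path, line in locations:
        print(f"  {path}:{line}")
```

Ten occurrences of the same token collapse into one report entry, which is where the order-of-magnitude improvement in triage effort comes from.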
Deployment of a purely machine learning secret scanner: In addition to the existing regular expression-based scanner, we’ve deployed a purely machine learning-based scanner. This scanner is able to detect secrets that have no regular expression rules, and will help us keep ahead of the endless stream of new token formats. At its core, it is a transformer-based large language model that we pre-trained on token prediction in a corpus of source code, then fine-tuned for secrets classification using a proprietary labeled data set of more than 10,000 real secrets.
The net result of all these improvements is that compared to other tools, Nosey Parker produces findings with much higher signal-to-noise, reports them more usefully, and is able to detect secrets where no explicit rule exists yet.
Hardcoded secrets are everywhere, and so is Nosey Parker
We’ve been using Nosey Parker in security engagements at Praetorian for six months now, and it has yielded great results across the full spectrum of our service offerings. Why? Because hardcoded secrets can occur anywhere.
Secrets are ubiquitous in cloud-based systems, because anytime one system component accesses another some kind of authentication takes place. Usually the authentication involves an API token or username/password combination that needs to be kept secret. However, these secrets have a way of turning up in places they shouldn’t, such as within source code repositories, binaries or IoT firmware bundles, and configuration files. When an attacker discovers one of these misplaced secrets, they can exploit it to commit IP theft, fraud, ransomware attacks, cryptomining, etc.
If secrets are ubiquitous, then hardcoded secrets are at least common. Occasionally during Praetorian’s Red Team engagements, we discover a hardcoded secret that allows initial access into the client’s environment. Once inside, hardcoded secrets are typically even more common, enabling lateral movement to other systems.
We encountered exactly that scenario during a recent Red Team engagement. We obtained an Okta session token through phishing, which enabled us to move laterally to a Splunk instance. From there, using Nosey Parker to scan accessible logs, we found an additional GitHub personal access token that enabled us to accomplish two attacks. We used it to gain access to their internal network through a GitHub Actions self-hosted runner, compromising an internal container registry. We also used the same token to enumerate and scan their organization-private GitHub repositories, where we found yet another GitHub personal access token. That one had organization admin privileges, which would have allowed an attacker to completely compromise the client’s GitHub organization.
How do hardcoded secrets come to be?
The solution seems simple: secrets just need to be kept secret! But as with all security axioms, many factors contribute to why things don’t work out that way. The four we encounter most often are expedience, ignorance, accident, and moving trust boundaries.
Expedience

Imagine you’re a developer with 19 story points still to close out in the current sprint, and there is absolutely no way you are going to let the ugly plumbing of the system stop you from finishing that last-minute feature request that will help close a sale in that big PoC (Whew!). You have no established pattern or infrastructure for moving credentials to the place in code where you need them, and the deadline is approaching rapidly. So you go ahead and store them in a variable in your code, get the feature working, and move along to your other tasks.
We recently saw an example of this in an Application Security engagement, where we demonstrated to a client how hardcoded secrets could be the downfall of their sophisticated access controls. They had specifically put these controls in place to prevent contractors from accessing core IP and infrastructure. However, with Nosey Parker we found numerous hardcoded credentials throughout their source code, which enabled us to bypass their elaborate IAM easily.
Another example that we see occasionally in security engagements is a contractor or employee cloning a private company Git repo into a public repository on their personal GitHub account (typically to enable or simplify remote work). Whoops! The perimeter has been breached—the cat’s out of the bag—and now an observant attacker can use any secrets that were included in the commit history.
Ignorance

Unless you’ve worked in security, you might not realize how serious the implications of a misplaced secret can be. If you’re in the part of development where you’re still trying to get things working at all, you might not even realize that a token needs to be kept secret. (It doesn’t help that documentation for many APIs doesn’t clearly point out the secrecy requirements of their tokens.)
We have seen this occasionally in mobile application security reviews, where the developer has embedded credentials into their application binary to allow access to a particular cloud API, not realizing how easy it is for a curious end user to pick apart the application archive and find those secrets.
Accident

Imagine that after an afternoon of work you just did a
git add -A .; git commit -m 'checkpoint work'; git push
to checkpoint the changes to the big pull request you’re working on. In the process of doing this, you mistakenly recorded numerous secret production environment variables from a file named prod.env into the Git history and pushed it to the internet. Oops.
Or perhaps when sorting out authentication issues during local development, you log the values of secrets for print-statement debugging. You don’t intend to commit those changes, but when you run locally, the logging configuration sends the messages to a log aggregation system (Splunk, Datadog, Elasticsearch, etc.), unintentionally recording and revealing those secrets.
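A minimal sketch of how that leak happens, using a list-backed handler as a stand-in for a real log aggregator; the handler class and the token value here are invented for illustration:

```python
import logging

# Stand-in for an aggregator-backed handler (Splunk, Datadog, etc.) that a
# shared logging configuration might point at.
class AggregatorHandler(logging.Handler):
    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(self.format(record))

log = logging.getLogger("auth-debug")
log.setLevel(logging.DEBUG)
handler = AggregatorHandler()
log.addHandler(handler)

api_token = "tok_live_0123456789abcdef"  # hypothetical secret

# The innocent-looking debug line added while troubleshooting auth:
log.debug("auth failed, retrying with token=%s", api_token)

# The secret is now stored wherever the logging config ships records.
print(handler.records[0])
```

The debug call looks harmless on a laptop, but the logging configuration, not the call site, decides where the message ends up.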
Moving trust boundaries
Perhaps your company has an internal software project, not intended for distribution outside the company. Out of expediency and inertia, the project uses hardcoded credentials. Some time later, the project is successful and your company decides to release it under an open-source license. At this point, the initial assumption that only company people would see the secrets no longer holds—the trust boundary has moved. Unless the production team carefully scrubbed the contents and history of the project, the hardcoded credentials now show up on the internet.
How to find hardcoded secrets
I’ve argued that hardcoded secrets are surprisingly common and incredibly useful in security engagements. So how can you go about finding them?
Occasionally you stumble upon secrets just by reading through source code. Once you have some experience with this, you learn heuristics about the most likely places to look. For example, you learn that *.env files are interesting because they usually contain secrets to be loaded into environment variables, and you learn that secrets often appear in Ansible and Terraform configuration files. With enough heuristics like these, you can go dorking to better focus your manual inspection effort.
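These heuristics are easy to mechanize into a first-pass sweep of a checkout. The sketch below assumes that files with suffixes like .env are worth a look and that assignment-like lines mentioning common keywords are interesting; both the suffix list and the keyword pattern are illustrative, not exhaustive:

```python
import os
import re

# Illustrative heuristics: file suffixes that often hold secrets, and a crude
# pattern for assignment-like lines worth a human look.
INTERESTING_SUFFIXES = (".env", ".tfvars")
PATTERN = re.compile(r"(password|secret|token|api_key)\s*[:=]\s*\S+", re.IGNORECASE)

def dork(root):
    """Walk a directory tree and collect candidate secret-bearing lines."""
    hits = []
    for dirpath, _, files in os.walk(root):
        for name in files:
            if name.endswith(INTERESTING_SUFFIXES):
                path = os.path.join(dirpath, name)
                with open(path, errors="ignore") as f:
                    for lineno, line in enumerate(f, 1):
                        if PATTERN.search(line):
                            hits.append((path, lineno, line.strip()))
    return hits
```

Running something like this over a repository gives you a short list of places to inspect by hand, which is exactly what dorking does interactively.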
Manual inspection only goes so far, though, even with good heuristics to focus your effort. To scale up, you need tool assistance: a secret scanner that inspects input files and points out the places that appear to include hardcoded secrets. There are numerous open-source tools for this, such as TruffleHog, git-secrets, Wraith, Gitrob, and Gitleaks. These tools’ main scanning technique is to use a large set of regular expressions written to detect known patterns of hardcoded secrets. This approach presents two problems, which we designed Nosey Parker to address.
First, traditional scanners struggle to detect hardcoded secrets without producing a huge number of false positives. For example, a naive attempt to find hardcoded passwords might use the pattern password\s*=\s*(\S+), which matches the word “password” followed by optional whitespace, an equals sign, optional whitespace, and a run of non-whitespace characters for the supposed password. However, this will erroneously match things like a non-hardcoded password being assigned to a variable whose name ends with “password”. You can try to address this problem by writing a more complicated regular expression that matches only cases involving a literal string, but this is a Sisyphean challenge. Many types of secrets are difficult to match precisely without resorting to complicated heuristic rules (though this is changing somewhat with newer token formats designed to be easily greppable). For a point of reference, the refined version of this regex that Gitleaks uses is 205 characters long (an order of magnitude longer) and still produces many false positives.
The second problem with regular expression-based matching is that you will never have all the regular expressions you need to keep your scanner current. Developers constantly create new services and token formats, each requiring a new regular expression. For a point of reference, the newest version of TruffleHog contains over 700 regular expressions.
We recently had a whitebox IoT security engagement that highlighted these challenges in tool-assisted scanning and showed how Nosey Parker’s machine learning approach addresses them. On this project, the regular expression-based scanners we tried reported no useful findings. In contrast, Nosey Parker’s purely machine learning-based scanner identified LDAP credentials, third-party API keys that we verified were live in production, encryption keys, and an SSH key passphrase. In some cases the secrets were simply GUIDs, which would be very difficult to match with a regular expression without introducing false positives.
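The GUID case illustrates the limits of pattern matching well. A UUID pattern is trivial to write, but it carries no signal about whether a given GUID is a secret; both example values below are made up:

```python
import re

# A standard 8-4-4-4-12 hex GUID/UUID pattern.
guid = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)

# Both lines match, but only the first is a secret; the pattern itself
# cannot tell them apart, so context has to do the work.
is_secret = bool(guid.search('client_secret = "d4c0b1a2-3e4f-5a6b-7c8d-9e0f1a2b3c4d"'))
is_benign = bool(guid.search('interface_id = "6b29fc40-ca47-1067-b31d-00dd010662da"'))
print(is_secret, is_benign)  # → True True
```

Telling the two apart requires looking at the surrounding context rather than the token itself, which is where a learned classifier has an edge over any regular expression.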
Nosey Parker has been extremely useful for us internally during security engagements, and it has increased in value with each additional feature. We will continue to expand the breadth of what it can detect, adding regular expression rules when it makes sense and regularly retraining the neural network that underlies the machine learning-based scanner to improve its signal-to-noise. We are also expanding what the tool can easily scan, possibly to include Slack and Confluence, where secrets frequently appear.
Many clients have expressed interest in using Nosey Parker in their own internal operations after security engagements in which our security engineers used it. I’m pleased to say that we are making Nosey Parker more widely available in response to this demand. We are actively working on integrating Nosey Parker with Chariot, our full attack lifecycle managed service, so that subscribing clients benefit from all Nosey Parker has to offer. Additionally, in the next few months we hope to release part of Nosey Parker as open source under a permissive license, allowing the larger security community to benefit from it.