GitGuardian is famous for its annual State of Secrets Sprawl report. In its 2023 report, the company found over 10 million passwords, API keys, and other credentials exposed in public GitHub commits. The 2024 report highlighted not only 12.8 million newly exposed secrets on GitHub, but also a number of secrets in PyPI, the popular Python package repository.
PyPI, short for the Python Package Index, hosts over 20 terabytes of files that are freely available for use in Python projects. If you've ever typed pip install [name of package], pip likely pulled that package from PyPI. A lot of people use it, too. Speaking of GitHub, PyPI, and similar sources, the report states, "open-source packages make up an estimated 90% of the code run in production today." It's easy to see why when these packages help developers avoid reinventing millions of wheels every day.
In the 2024 report, GitGuardian reported finding over 11,000 unique exposed secrets in PyPI, 1,000 of which were added in 2023. That's not much compared to the 12.8 million new secrets added to GitHub in 2023, but GitHub is orders of magnitude larger.
A more distressing fact is that, of the secrets introduced in 2017, nearly 100 were still valid 6-7 years later. GitGuardian was not able to check every secret for validity. Still, over 300 unique and valid secrets were discovered. While this is mildly alarming to the casual observer and not necessarily a threat to random Python developers (as opposed to the 116 malicious packages reported by ESET at the end of 2023), it's a threat of unknown magnitude to the owners of those packages.
GitGuardian has developed and refined hundreds of secrets detectors over the years, but some of the most common secrets it detected in its overall 2023 study were OpenAI API keys, Google API keys, and Google Cloud keys. It's not difficult for a competent programmer to write a regular expression that finds a single common secret format. Even if that expression produced many false positives, automating checks to determine which matches were valid could help that programmer find a small treasure trove of exploitable secrets.
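To illustrate how low that bar is, and how the same technique can be turned on your own code before you publish it, here is a minimal sketch of a pattern-based scan. The two patterns are commonly cited shapes for Google API keys and AWS access key IDs; treat them as assumptions rather than official specifications, and expect false positives. The names CANDIDATE_PATTERNS and find_candidate_secrets are hypothetical and chosen for this example only.

```python
import re

# Commonly cited shapes for two key types. These are assumptions, not
# official specifications: providers can change formats at any time, and
# a match only means "looks like a key", not "is a valid key".
CANDIDATE_PATTERNS = {
    "google_api_key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
}


def find_candidate_secrets(text):
    """Return (line_number, kind, matched_text) for every pattern hit."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in CANDIDATE_PATTERNS.items():
            for match in pattern.finditer(line):
                hits.append((line_number, kind, match.group(0)))
    return hits
```

Running find_candidate_secrets over the contents of each file in a package before uploading it gives a rough pre-publication check. It is no substitute for a maintained detector set, which also has to handle many more formats and weed out false positives.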
It is now accepted wisdom that if a key has been published in a public repository such as GitHub or PyPI, it must be considered compromised. In experiments, honeytokens (a kind of "defanged" API key with no access to any resources) have been checked for validity by bots within a minute of being published to GitHub. In fact, honeytokens act as a "canary" for a growing number of developers. Depending on where you've placed a specific honeytoken, you can see that someone has been snooping there and get some information about them from the telemetry data collected when the honeytoken is used.
The bigger concern when you accidentally publish a secret is not just that a malicious actor might run up your cloud bill. It's where they can go from there. If an over-permissioned AWS IAM token were leaked, what might that malicious actor find in the S3 buckets or databases it grants access to? Could that malicious actor gain access to other source code and corrupt something that will be delivered to many others?
Whether you're committing secrets to GitHub, PyPI, NPM, or any public collection of source code, the best first step when you discover a secret has leaked is to revoke it. Remember that tiny window between publication and exploitation for a honeytoken. Once a secret has been published, it's likely been copied. Even if you haven't detected an unauthorized use, you must assume an unauthorized and malicious someone now has it.
Even if your source code is in a private repository, stories abound of malicious actors getting access to private repositories via social engineering, phishing, and of course, leaked secrets. If there's a lesson to all of this, it's that plain text secrets in source code eventually get found. Whether they get accidentally published in public or get found by someone with access they shouldn't have, they get found.
In summary, wherever you're storing or publishing your source code, be it a private repository or a public registry, you should follow a few simple rules:
- Don't store secrets in plain text in source code.
- Strictly scope the privileges each secret grants, so that anyone who gets hold of one can't go on an expedition with it.
- If you discover you leaked a secret, revoke it. You may need to take a little time to ensure your production systems have the new, unleaked secret for business continuity, but revoke it as soon as you possibly can.
- Implement automations like those offered by GitGuardian to ensure you're not relying on imperfect humans to perfectly observe best practices around secrets management.
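As a small illustration of the first rule, here is a minimal sketch that reads a secret from the process environment instead of embedding it in the source. The variable name EXAMPLE_API_TOKEN, the function load_secret, and the exception MissingSecretError are hypothetical names invented for this example. Environment variables are only one option; a dedicated secrets manager is another.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent from the environment."""


def load_secret(name, environ=None):
    """Return the secret stored under `name`, or fail loudly if it is unset.

    `environ` defaults to os.environ; passing a plain dict makes the
    function easy to exercise in tests without touching real settings.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        # Refuse to continue rather than fall back to a hardcoded default.
        raise MissingSecretError(f"Set the {name} environment variable")
    return value
```

A caller would then write something like token = load_secret("EXAMPLE_API_TOKEN"), so the value never appears in the repository or in a published package. Failing loudly when the variable is missing is a deliberate choice here: a hardcoded fallback would reintroduce the problem the rule is meant to prevent.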
If you follow those rules, you may not have to learn the lessons that the owners of 11,000 secrets have probably learned the hard way by publishing them to PyPI.