12,000+ API Keys and Passwords Found in Public Datasets Used for LLM Training
Feb 28, 2025
Machine Learning / Data Privacy
A dataset used to train large language models (LLMs) has been found to contain nearly 12,000 live secrets, which allow for successful authentication. The findings once again highlight how hard-coded credentials pose a severe security risk to users and organizations alike, not to mention compounding the problem when LLMs end up suggesting insecure coding practices to their users. Truffle Security said it downloaded a December 2024 archive from Common Crawl , which maintains a free, open repository of web crawl data. The massive dataset contains over 250 billion pages spanning 18 years. The archive specifically contains 400TB of compressed web data, 90,000 WARC files (Web ARChive format), and data from 47.5 million hosts across 38.3 million registered domains. The company's analysis found that there are 219 different secret types in the Common Crawl archive, including Amazon Web Services (AWS) root keys, Slack webhooks, and Mailchimp API keys. "'Live' secrets ar...