The archive consists of approximately 8 million PDFs collected from the Internet.
"PDFs are used everywhere and are important for contracts, legal documents, 3D engineering drawings and many other purposes. Unfortunately, they are complex and can be hacked to hide malicious code or distribute different information maliciously," said Tim Allison, a data scientist at JPL in Southern California.
"To address these and other PDF challenges, a large sample of real-world PDFs should be collected from the Internet to create a shared, freely available resource for software professionals."
Creating the archive was not an easy task. Allison's team used Common Crawl, an open repository of web crawl data, to locate the PDFs that make up the archive. All files are publicly available and not behind firewalls or private networks.
The files were collected from July to August 2021, and the crawling software identified approximately 8 million PDFs.
The complete set of data is approximately 8 terabytes, making it the largest archive of its kind available to the public.
The archive will help researchers identify threats. Privacy researchers could study the files to determine how file creation and processing software can be improved to better protect personal data.
Software developers could use the files to find bugs in their code and check whether old software versions are still compatible with newer PDF versions.
The Digital Corpora project hosts the massive data archive as part of Amazon Web Services' Open Data Sponsorship program, and the files are packaged in zip files for easy download.
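As a rough illustration of the compatibility checks mentioned above, the following Python sketch tallies the PDF versions declared in the headers of the files inside one downloaded zip package. It is a minimal sketch under stated assumptions, not the project's own tooling: the function names `pdf_version` and `count_versions` are hypothetical, and the code assumes only that each zip member is a PDF whose header (for example `%PDF-1.7`) appears within the first 1024 bytes.

```python
import io
import re
import zipfile
from collections import Counter

# A PDF header looks like "%PDF-1.7"; real-world files may have junk before it.
HEADER_RE = re.compile(rb"%PDF-(\d\.\d)")


def pdf_version(head: bytes):
    """Return the declared PDF version (e.g. "1.7"), or None if no header is found."""
    match = HEADER_RE.search(head[:1024])
    return match.group(1).decode("ascii") if match else None


def count_versions(zip_source):
    """Count declared PDF versions for every file in a zip (path or file-like object)."""
    counts = Counter()
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Read only the start of each member; the header is all we need.
            with archive.open(info) as member:
                head = member.read(1024)
            counts[pdf_version(head)] += 1
    return counts


if __name__ == "__main__":
    # Tiny in-memory zip standing in for one downloaded package (hypothetical data).
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("a.pdf", b"%PDF-1.4\n")
        archive.writestr("b.pdf", b"%PDF-1.7\n")
        archive.writestr("c.pdf", b"not a pdf")
    buffer.seek(0)
    for version, total in count_versions(buffer).items():
        print(version, total)
```

A developer could run a tally like this first to see which PDF versions a package contains, then feed files of each version to the parser under test. Files counted under `None` have no recognizable header and are often the most useful cases for finding bugs.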