NASA's JPL collects and distributes 8 million real-world PDFs

NASA's Jet Propulsion Laboratory (JPL) has created the largest open archive of PDF files as part of DARPA's Safe Documents program. The aim is to make PDFs safer to use on the internet.

The archive consists of approximately 8 million PDFs collected from the Internet.

Real-world data

"PDFs are used everywhere and are important for contracts, legal documents, 3D engineering drawings and many other purposes. Unfortunately, they are complex and can be hacked to hide malicious code or distribute different information maliciously," said Tim Allison, a data scientist at JPL in Southern California.

"To address these and other PDF challenges, a large sample of real-world PDFs should be collected from the Internet to create a shared, freely available resource for software professionals."

Creating the archive was not an easy task. Allison's team used Common Crawl, an open repository of web-crawl data, to locate the PDFs that make up the archive. All files are publicly available and not behind firewalls or private networks.

The files were collected from July to August 2021, and the crawling software identified approximately 8 million PDFs.

The complete set of data is approximately 8 terabytes, making it the largest archive of its kind available to the public.

The archive will help researchers identify threats. Privacy researchers could study the files to determine how software that creates and processes PDFs can be improved to better protect personal data.

Software developers could use the files to find bugs in their code and check whether old software versions are still compatible with newer PDF versions.

The Digital Corpora project hosts the massive data archive as part of Amazon Web Services' Open Data Sponsorship program, and the files are packaged in ZIP archives for easy download.
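
For developers who download one of these ZIP archives, a first sanity check might look like the sketch below. It is only an illustration, not the method JPL used: the function name, the 1024-byte probe window and the example member names are assumptions, and the check merely looks for the `%PDF-` marker near the start of each file rather than validating the document.

```python
import io
import zipfile

# Every PDF is expected to announce itself with this marker near the start.
PDF_MAGIC = b"%PDF-"


def scan_zip_for_pdf_headers(zip_source, probe_bytes=1024):
    """Map each file in a ZIP archive to True/False: does it look like a PDF?

    zip_source may be a path or a binary file-like object.
    probe_bytes is an assumed window size, not a value taken from the article.
    """
    results = {}
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries
            # Read only the first bytes of the member, not the whole file.
            with archive.open(info) as member:
                head = member.read(probe_bytes)
            results[info.filename] = PDF_MAGIC in head
    return results


if __name__ == "__main__":
    # Build a tiny in-memory archive with hypothetical member names.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("good.pdf", b"%PDF-1.7\n% sample body")
        archive.writestr("bad.pdf", b"<html>not a pdf</html>")
    buffer.seek(0)
    for name, ok in scan_zip_for_pdf_headers(buffer).items():
        print(name, "looks like a PDF" if ok else "has no PDF header")
```

Files that fail this check are not necessarily malicious, and files that pass are not necessarily safe; the point is only to separate obviously malformed entries before feeding the rest to a real parser.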

Written by giorgos