Following the success of OpenAI's ChatGPT, Microsoft's Bing Chat, and Google's Bard, researchers have created a new AI model with much darker motives.
While the large language models (LLMs) powering ChatGPT and Google Bard were trained on data from the open web, DarkBERT was trained exclusively on data from the dark web. Yes, you read that right: this AI model was trained using data from hackers, cybercriminals and other crooks.
A group of South Korean researchers released a paper (PDF) detailing how they built DarkBERT using data from the Tor network, which is used to access the dark web. By scouring the dark web and then filtering the raw data, they were able to create a database that they used to train DarkBERT.
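To make the "filter the raw data" step more concrete, here is a minimal sketch of what such a cleanup pass could look like. It is not the researchers' actual pipeline: the function name `filter_pages`, the `min_words` threshold and the choice of exact-duplicate removal are all assumptions made for illustration.

```python
import hashlib


def filter_pages(pages, min_words=20):
    """Drop near-empty pages and exact duplicates from a list of page texts.

    Hypothetical helper: the 20-word cutoff is an arbitrary illustrative value.
    """
    seen = set()
    kept = []
    for text in pages:
        # Collapse whitespace and lowercase so trivial variants compare equal.
        normalized = " ".join(text.split()).lower()
        if len(normalized.split()) < min_words:
            continue  # too short to carry much useful text
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same content already kept
        seen.add(digest)
        kept.append(text)
    return kept
```

Calling `filter_pages(raw_pages)` on a list of crawled page texts would return only the pages that are long enough and not already seen, which is the general idea behind turning a raw crawl into a usable training set.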
Surprisingly, DarkBERT has already managed to outperform other large language models, despite being trained on data from a very unlikely place.
Although DarkBERT is a new AI model, it is based on the RoBERTa architecture, an approach developed in 2019 by researchers at Facebook, according to Tom's Hardware.
In a research paper detailing the inner workings of RoBERTa, Meta AI explains that it is a “robustly optimized method for pretraining natural language processing (NLP) systems” that improves on BERT, which Google released in 2018. Because Google open-sourced BERT, Facebook researchers were able to improve its performance.
Thanks to Facebook's optimized method, RoBERTa produced state-of-the-art results on the General Language Understanding Evaluation (GLUE) NLP benchmark.
But now the South Korean researchers behind DarkBERT have shown that RoBERTa is capable of doing even more, as it was undertrained when it was originally released. By training RoBERTa on two dark web datasets (one raw and one pre-processed) over the course of nearly 16 days, the researchers were able to build DarkBERT.
The researchers have no plans to release DarkBERT to the public. However, they do accept requests for academic purposes, according to Dexerto. DarkBERT is likely to be very attractive to law enforcement as well as to adversaries on the other side. It will also give researchers an opportunity to better understand the dark web as a whole.