Edition 2 - Using Virustotal to find leaked data
Why hello, welcome to the second edition of readwrite, a newsletter about journalism and coding. This week, Hakan is writing. Please email us at readwritenewsletter@proton.me with feedback, comments and ideas, or if you spot errors.
The TL;DR version of this post: Virustotal holds a lot of information, sometimes in places I hadn't thought to look.

A defunct forum
A while back, I came across a documentary on China’s far-reaching influence through the "United Front Work Department" (more on them here at the BBC and in this indictment (PDF); CTRL-F for "reported directly to the CCP"). One thing that caught my eye can be seen in the screenshot above.
Apparently there was a file available for download on a now-defunct site called RaidForums. The site was taken down in a joint operation called Tourniquet, but it used to be a go-to spot for cybercriminals. The file was advertised as a "China United Front" leak.
As you can see, the link to where the files were hosted is clearly visible, which piqued my interest. Anonfiles[.]com was down at the time (and still is!), and if you type in the Pastebin URL, this is what you get:
This page is no longer available. It has either expired, been removed by its creator, or removed by one of the Pastebin staff.
One more thing I noticed before looking for the file: we have both a filename and a date. The post on RaidForums was published on July 21st, 2021; the file itself was named UFWD Leak Sample. Duly noted.
So, how easy is it to find a piece of information if the original source isn't online anymore? Turns out: not "one Google search away" easy, strictly speaking, but doable, as you might've guessed from the title of this edition. More about the "one Google search" thing at the bottom of this post.
Archive.org
One of the first things to do, if a URL is dead, is to check the Wayback Machine to see whether someone else might've captured it. Rule of thumb, for me: If I think of doing something, somebody else probably had the same idea. And, turns out, there's a hit, both for Anonfiles and for the Pastebin-link.
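If you'd rather script this check than paste URLs into the search box, here is a minimal sketch. It assumes the Wayback Machine's public availability endpoint and the JSON shape it returns; the function names are mine, and example.com stands in for whatever dead URL you're chasing.

```python
import json
import urllib.parse
import urllib.request

# Public availability endpoint of the Wayback Machine.
WAYBACK_AVAILABLE = "https://archive.org/wayback/available"


def availability_url(target: str) -> str:
    """Build the query URL that asks whether `target` has any capture."""
    return WAYBACK_AVAILABLE + "?" + urllib.parse.urlencode({"url": target})


def closest_snapshot(response: dict):
    """Return the URL of the closest capture, or None if there is none."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check(target: str):
    """Ask the Wayback Machine about `target` (this performs a network request)."""
    with urllib.request.urlopen(availability_url(target), timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

Calling check("example.com") should give you either a web.archive.org link or None. Note that this only tells you a capture exists, not whether the capture is any good.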

However, if you look at the Pastebin result, you'll immediately see that the capture is marked in orange, meaning there was an error ("Orange indicates that the URL was not found", as the Wayback Machine puts it). The Anonfiles link is more promising: five captures, all of them blue, all of them made in 2021, a couple of weeks after the RaidForums post went live. At the very least, we get a new and somewhat distinct filename to take note of (unitedfronttlist.txt).
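The same "which captures actually worked" question can be asked programmatically. The sketch below assumes the Wayback Machine's CDX endpoint with JSON output, where the first row is a header and every following row is one capture. I'm also assuming that a status code starting with 2 is what the calendar shows as a good capture. The function names are hypothetical.

```python
import urllib.parse

# CDX search endpoint of the Wayback Machine (lists individual captures).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(target: str) -> str:
    """Build a CDX query asking for timestamp, original URL and HTTP status."""
    params = {
        "url": target,
        "output": "json",
        "fl": "timestamp,original,statuscode",
    }
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)


def successful_captures(rows):
    """Keep only captures with a 2xx status.

    `rows` is the parsed JSON from the CDX endpoint: a list of lists
    whose first entry is the header row.
    """
    if not rows:
        return []
    header, *data = rows
    records = [dict(zip(header, row)) for row in data]
    return [r for r in records if r.get("statuscode", "").startswith("2")]
```

Fetch cdx_query_url(...) with any HTTP client, parse the body as JSON and hand it to successful_captures. An empty result means every capture was an error page, which is what happened with the Pastebin link.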

Content Delivery Network (CDN)
But after clicking the link, I noticed that the actual file had to be downloaded separately, so there was at least one more click involved. The file was sitting at cdn-122[.]anonfiles[.]com. Basically, when you visit a website that uses a CDN, it works out roughly where you're located and serves the files from a server that ideally sits close by. That's what a CDN is for.
In other words: the link I was interested in had indeed been captured, but it turned out to be the wrong link. Sucked for me. I still didn't have the file.
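A side note on notation: the bracketed dots in cdn-122[.]anonfiles[.]com are "defanging", a common way to write a hostname so nobody clicks it by accident. Before searching for such a host you have to undo that. Here is a small sketch with helper names of my own choosing, which also shows that the landing page and the download sit on different hosts. The /abc path is made up.

```python
from urllib.parse import urlsplit


def refang(text: str) -> str:
    """Turn a defanged host or URL back into its normal form."""
    return text.replace("[.]", ".").replace("hxxp", "http")


def defang(text: str) -> str:
    """Make a host unclickable by bracketing every dot."""
    return text.replace(".", "[.]")


def host_of(url: str) -> str:
    """Return the hostname of a URL, with or without a scheme."""
    if "//" not in url:
        url = "//" + url  # lets urlsplit treat a bare host as a netloc
    return urlsplit(url).hostname


landing = host_of(refang("https://anonfiles[.]com/abc"))  # hypothetical path
download = host_of(refang("cdn-122[.]anonfiles[.]com"))
different_hosts = landing != download
```

Because the two hostnames differ, a capture of the landing page says nothing about whether the file on the CDN host was ever archived.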
Enter Virustotal
I'll talk more in detail about Virustotal in another post, but for now, know this: Virustotal is a website
- owned by Alphabet/Google
- main purpose: checking whether files/websites are malicious or not
- has an insane amount of (user-)data stored
Virustotal is an online platform that aggregates results from dozens of antivirus programs. Files and URLs are submitted by users and infosec companies, making it an interesting repository of all kinds of files, links and malware samples.
Since cybersecurity has been my beat as a reporter for quite a while now, there's one thing I know: if there was a (semi-)public leak, the chances of that leak or that file being on Virustotal are very high.
What I ended up doing was the same thing as with the Wayback Machine: I just searched for the URL. Virustotal apparently differentiates between http:// and https://, so make sure to check both versions of a URL. For the Pastebin URL, there was a hit.
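I did this by hand in the web interface, but the "check both schemes" step is easy to script. The sketch below assumes Virustotal's v3 API, where a URL is addressed by its URL-safe base64 encoding with the padding stripped, and where requests need an API key in the x-apikey header. It only builds the lookup addresses and sends nothing. The function names and example.com/some-paste are placeholders.

```python
import base64

# Base address for URL reports in the Virustotal v3 API.
VT_URL_ENDPOINT = "https://www.virustotal.com/api/v3/urls/"


def scheme_variants(url: str):
    """Return the http:// and https:// versions of the same URL."""
    bare = url.split("://", 1)[-1]  # drop an existing scheme, if any
    return ["http://" + bare, "https://" + bare]


def vt_url_id(url: str) -> str:
    """Encode a URL as an identifier: URL-safe base64 without '=' padding."""
    return base64.urlsafe_b64encode(url.encode()).decode().rstrip("=")


def vt_lookup_urls(url: str):
    """Build one API address per scheme variant."""
    return [VT_URL_ENDPOINT + vt_url_id(variant) for variant in scheme_variants(url)]


lookups = vt_lookup_urls("example.com/some-paste")  # placeholder URL
```

Each entry in lookups can then be requested with your own API key. A 404 for one scheme and a hit for the other is exactly the situation described above.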
HTML Info
If you look at the "Last Analysis Date", you can see that the URL was last scanned four years ago, which lines up (very roughly) with the post on RaidForums. Virustotal has different sections; I was interested in "Details". There's a bunch of technical information in there that Virustotal logs when scanning a site, from metadata (the "First Submission" date was 2021-07-28, so a week after the RaidForums post and earlier than the first capture on the Wayback Machine, fwiw) to actual data from the site itself.

I scrolled down, came across the subsection "HTML Info" and noticed the title of the page (if you're reading this in a browser, the title of the page is what's shown in the tab): China United Front leaked list - Pastebin.com. Up to this point, I only had two names ("UFWD leak sample", "unitedfronttlist.txt"), both of which led nowhere. Now I had a new thing to search for.
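For anyone wondering where a page title comes from: it's the text inside the page's title tag. If you have saved HTML of a page (from an archive, for instance), a sketch like the following pulls it out using only Python's standard library. The class and function names are mine, and the sample HTML is a stripped-down stand-in, not the original paste.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_title(html: str) -> str:
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


sample = "<html><head><title>China United Front leaked list - Pastebin.com</title></head><body></body></html>"
found = page_title(sample)
```

The point is less the code than the habit: a page title is often distinctive enough to serve as a search term, even after the page itself is gone.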
An old post on Reddit
And this page title was the search term that actually led me to the leak. Because there was a post on Reddit about the leak (which I didn't find with my initial queries) and somebody in that thread decided to back up the files to the Wayback Machine. The entry was posted on 2021-08-12T18:35:23.419Z, so a couple of weeks after the leak happened. You can still download those (text) files. I have asked a researcher (Arda Büyükkaya, Senior Cyber Threat Intelligence Analyst at EclecticIQ) to check them security-wise: the files are non-malicious, just plain text.
The SHA-256 hashes of the files are b96c3992232ceec5ddd10f919af24f43059d92c0bf843c50d7ddd9f4914a8946 and 890f05d6fc123b9ab164dbb99247271b55a4f111568a1b414c106eb2b8181cf5, respectively. And, to prove my theory about the sheer amount of data on Virustotal: both of the files are on Virustotal, obviously, here and here.
Please note that I only wanted to find the file, so I'm not making any assessment of its contents.
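If you download the files yourself, you can check that you got the same ones by hashing them locally and comparing against the two values above. A minimal sketch using Python's hashlib; the function names are mine, and the path you pass in is whatever you saved the file as.

```python
import hashlib

# The two SHA-256 values quoted in this post.
KNOWN_HASHES = {
    "b96c3992232ceec5ddd10f919af24f43059d92c0bf843c50d7ddd9f4914a8946",
    "890f05d6fc123b9ab164dbb99247271b55a4f111568a1b414c106eb2b8181cf5",
}


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files don't have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_known(path: str) -> bool:
    """True if the file's hash is one of the two listed above."""
    return sha256_of_file(path) in KNOWN_HASHES
```

The same hash is also what you paste into Virustotal's search box to look a file up without uploading it.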
On Google
When writing this post, I searched for china united front leak list and that Reddit post was on the first page. It's up to you to believe me or not when I say that back when I was looking for this file with that very same search term, the Reddit post wasn't there.
That's it for this edition. If you liked it, feel free to share and spread the word.