By Derek Tsang, from the Knowledge Lab website.
Cyberspace, like any neighborhood, has its share of disreputable regions. According to University of Chicago graduate student and Knowledge Lab researcher Cody Braun, though, once you get into the “bad internet,” it’s hard to get out.
“Spam sites have fewer links into the most visited websites; very few are linking out to Google,” Braun said. “They want you to stay in their shady corner of the internet.”
As part of his work on a natural language processing-based scam filter, Braun curated a list of unsafe websites from spam emails and from PhishTank’s directory of sites known to “phish” for users’ sensitive information. He ran a crawler that scanned for and visited those sites’ hyperlinks, then continued to crawl out into the “bad internet.” He put the results from 20,000 such websites in the graph below.
Each node represents a single domain (ignoring everything after the dot-com), with white nodes for unsafe websites and a single green node (near the center) standing in for all sites considered safe by Google’s Safe Browsing API. Each edge represents a link from one domain to another, and websites with more edges are closer to the center of the graph. Green edges are links back into the safe internet.
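To make the graph construction concrete, here is a minimal Python sketch of how page-level links might be collapsed into domain-level edges. It is an illustration, not Braun’s code: the function names `domain_of` and `domain_edges` are hypothetical, and it assumes a “domain” means the URL’s full host name, which may differ from how the original crawler grouped sites.

```python
from urllib.parse import urlparse

def domain_of(url):
    """Return the lowercase host of a URL, dropping the path, query, and port.

    Assumption: one node per host name (e.g. "zombiee.0xhost.net").
    """
    return urlparse(url).hostname or ""

def domain_edges(links):
    """Collapse (source_url, target_url) pairs into domain-to-domain edges.

    Links that stay within one domain are dropped, since they do not
    connect two different nodes in the graph.
    """
    edges = set()
    for src, dst in links:
        a, b = domain_of(src), domain_of(dst)
        if a and b and a != b:
            edges.add((a, b))
    return edges
```

Using a set means that many page-level links between the same two domains count as a single edge, which matches the one-edge-per-domain-pair description above.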
“The average number of links leaving each of these ‘bad’ top-level domains is only 0.6,” said Braun. “That means half of them have no way out, and for most of them, you have to go through several links before you’re back on the safe internet. This is a different neighborhood of internet.”
The spam websites on Braun’s graph range from misspellings of common websites, such as Faceboook.com and my3spac.ru, to incomprehensible, isolated sites like zombiee.0xhost.net. While these websites are often full of links, the vast majority of those links keep browsers within the same domain.
Braun’s graph of the “safe” internet (below), on the other hand, shows websites like Blogger and Feedburner that see users link to thousands of other websites. “Safe sites are a lot more concentrated, and there are way fewer scary loners,” said Braun.
In fact, the difference in how safe and unsafe websites link to other sites may prove to be a “useful feature” in distinguishing them, said Braun. He found that the average domain on the “safe” internet linked to about 1.4 other domains, and a greater proportion of safe websites’ links led to different domains.
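The two quantities mentioned here, the number of other domains a site links to and the share of its links that leave the domain, could be computed along the following lines. This is a rough Python sketch under stated assumptions, not the method Braun used: `link_features` and `mean_out_degree` are hypothetical names, and the input is assumed to be one (source domain, target domain) pair per hyperlink found by a crawler.

```python
from collections import defaultdict

def link_features(links):
    """Compute per-domain link features.

    links: iterable of (source_domain, target_domain), one pair per hyperlink.
    Returns {domain: (distinct_external_domains, external_link_fraction)}
    for every domain that appears as a source.
    """
    targets = defaultdict(set)   # other domains each source links to
    total = defaultdict(int)     # all links found on each source
    external = defaultdict(int)  # links that leave the source domain
    for src, dst in links:
        total[src] += 1
        if dst != src:
            external[src] += 1
            targets[src].add(dst)
    return {d: (len(targets[d]), external[d] / total[d]) for d in total}

def mean_out_degree(features, crawled_domains):
    """Average number of distinct external domains per crawled domain.

    Domains with no recorded links count as zero.
    """
    if not crawled_domains:
        return 0.0
    degrees = [features.get(d, (0, 0.0))[0] for d in crawled_domains]
    return sum(degrees) / len(degrees)
```

A classifier could then take each domain’s pair of values as input features alongside the text-based signals; whether they help in practice would have to be checked against labeled data.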