There’s no denying ChatGPT and other generative AI models are a double-edged sword: While they can deliver great value in increasing business productivity and automation, they carry serious risks, especially with regard to content and data privacy. Consider the following: What if your entire business model is based on content, and success is predicated on the consistent value, visibility, and accessibility of your content to the maximum number of “unique visitors” possible? Enter the debate around content scraping.
The Good Side of Content Scraping
The process of content (or Web) scraping uses bots to capture and store content. Web scraping has definite benefits. Used along with machine learning, it can help reduce news bias by gathering massive amounts of data and information from websites and evaluating the accuracy and tone of the content.
Content scraping techniques can also aggregate information quickly, saving on costs by leveraging automation to reduce data extraction time and dependency on humans to get the task done. However, there are also significant risks.
The Bad Side of Content Scraping
One of these risks was evident when we first started working with a global e-commerce site. We found that an incredible 75% of the site’s traffic was bot-generated, and the majority of those bots were scrapers. The bots copied data that could be sold on the Dark Web or used in potentially nefarious ways, such as creating fake identities or promoting misinformation or disinformation.
Another example is fake “Googlebots” — scraper bots that are particularly dangerous and cause significant harm because they evade detection on websites, mobile apps, and application programming interfaces (APIs) by disguising themselves as SEO-friendly crawlers. Knowing that websites need a good ranking on Google, opportunistic threat actors develop bots that resemble Googlebots, but carry out malicious activities once they have access to the websites, apps, or APIs.
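One common way to tell a real Googlebot from an impostor is a reverse DNS lookup on the requesting IP address, followed by a forward lookup to confirm that the hostname maps back to the same IP. The following is a minimal sketch of that check, not a complete defense. The function name `is_verified_googlebot`, the injectable lookup parameters, and the hostname suffix list are illustrative assumptions; consult Google's current crawler documentation for the authoritative list of domains.

```python
import socket

# Hostname suffixes assumed to belong to Google's crawlers (verify against
# Google's current documentation before relying on this list).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True only if `ip` passes a reverse-then-forward DNS check.

    The lookup functions are injectable (a hypothetical design choice) so the
    logic can be exercised without touching the network.
    """
    # Default lookups use the standard library; gethostbyname_ex is IPv4-only.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        # Step 1: reverse DNS. The hostname must end with a Google suffix.
        host = reverse_lookup(ip)
        if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
            return False
        # Step 2: forward DNS. The hostname must resolve back to the same IP.
        return ip in forward_lookup(host)
    except OSError:
        # socket.herror and socket.gaierror are subclasses of OSError.
        return False
```

A request whose user agent claims to be Googlebot but whose IP address fails this check can be treated as a scraper in disguise. Matching on the suffix with a leading dot matters: a hostname such as `googlebot.com.evil.example` contains the right words but does not end with a Google domain.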
The Gray Area in Between
ChatGPT is trained on massive amounts of data scraped from across the internet, enabling it to answer a vast array of questions. ChatGPT specifically was trained largely on Common Crawl, which produces and maintains an open repository of Web crawl data, enabling access to huge amounts of information for large language models (LLMs). Common Crawl is a legitimate, nonprofit organization. However, its crawler bot (CCBot) can gather any content that is not specifically protected, and ChatGPT and other LLMs can then be trained on that content.
This activity opens the door to significant issues. Consider a journalist who interviewed experts, researched a topic, and perfected an article, only to have the content scraped and reused by ChatGPT without attribution. The journalist’s hard work now goes uncredited thanks to a Web scraping bot. Further, readers are no longer clicking through to the original website where the journalist published the article, leading to the loss of website traffic and, by extension, domain authority and potentially ad revenue.
Similarly, consider the recent incident in which AI was used to replicate rapper Drake’s voice in a song that he didn’t write and was not involved with, and that went viral on TikTok. This raises legal and copyright questions, as well as more wide-reaching discussions about AI and the future of music.
So, are these examples of malicious behavior, or are they more of an ethical debate or business operation question? While much of this may go beyond what we would typically consider “fair use,” AI innovation is moving faster than our laws and regulations can keep up with, putting much of this scraping activity somewhere in the gray area. It also leaves the door open for companies to decide how to proceed: to block or not to block content?
So, What Now?
If you do not want ChatGPT or other generative AI tools to train on your data, the first step you can take is to block traffic from Common Crawl’s bot, CCBot. This can be done with a line of code, such as a robots.txt rule, or by blocking the CCBot user agent. However, some of the traffic generated from the ChatGPT plug-in is now coming from sophisticated bots that can impersonate human traffic, so simply blocking CCBot is not sufficient. It’s also worth noting that LLMs like ChatGPT use other, more discreet ways to scrape content, which are likewise not as easy to block.
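As a rough illustration of the two blocking approaches just mentioned, the sketch below shows a robots.txt rule that disallows CCBot and a simple server-side check on the User-Agent header. The helper names `robots_allows` and `should_block`, the sample robots.txt content, and the example URL are assumptions made for illustration. Keep in mind that robots.txt only stops crawlers that choose to honor it, and a user-agent check is trivially bypassed by a bot that lies about its identity.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt (illustrative): disallow CCBot, allow everyone else.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

# Lowercase tokens to look for in the User-Agent header (hypothetical list).
BLOCKED_AGENT_TOKENS = ("ccbot",)


def robots_allows(user_agent, url, robots_txt=ROBOTS_TXT):
    """Report whether the given robots.txt permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def should_block(user_agent_header):
    """Server-side check: block requests whose User-Agent names a listed bot."""
    ua = (user_agent_header or "").lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


if __name__ == "__main__":
    page = "https://example.com/articles/1"
    print(robots_allows("CCBot/2.0", page))    # CCBot is disallowed
    print(robots_allows("Mozilla/5.0", page))  # other agents are allowed
    print(should_block("CCBot/2.0 (https://commoncrawl.org/faq/)"))
```

In practice, the `should_block` check would sit in a Web server rule or application middleware and return an error response when it matches. Neither measure addresses bots that impersonate human traffic, which is why they are only a first step.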
Another option is putting content behind a paywall. This will prevent scraping, as long as the scraper doesn’t pay for the content. However, this also limits the number of views a media website will receive organically — and risks annoying (human) readers. But with the incredible speed of AI technological innovation, will this be enough in the future?
If too many websites begin blocking the Web scrapers that gather data for Common Crawl, or the data that ChatGPT and similar tools train on, developers may stop identifying their crawlers in user agents, forcing companies to use even more sophisticated techniques to detect and block scrapers.
Additionally, companies like OpenAI and Google may decide to build the data sets that train their AI models using the Bing and Google search engine crawlers. This would make opting out of data collection difficult for online businesses that rely on Bing and Google to index their content and drive traffic to their websites.
Only time will tell what the future holds for AI and content scraping, but one thing we know for sure is that the technology will continue to evolve, as will the rules and regulations surrounding it. Companies need to decide whether they want to allow their data to be scraped in the first place and what is considered fair game for AI chatbots. Creators looking to opt out of Web scraping will need to step up their defenses as quickly as scraping technology evolves and the market for generative AI expands.