OpenAI has quietly released a web crawler to scrape websites for additional data to improve future models, according to a report on The Information.
Called GPTBot, the crawler purportedly filters out, by default, sources that require paywall access, gather personally identifiable information, or contain text that violates OpenAI’s policies.
Helping OpenAI build better AIs
According to OpenAI's documentation on GPTBot, giving it access “can help AI models become more accurate and improve their general capabilities and safety.”
Websites can customize GPTBot access by adding a GPTBot token to their robots.txt file. The crawler will also operate from the IP address block documented on the OpenAI website.
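For site owners who want to check how such a rule would behave before deploying it, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs, and the gptbot_allowed helper are all hypothetical and chosen purely for illustration. The sketch only shows how a standards-following parser reads the rules, not how OpenAI's crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep GPTBot out of /private/ only,
# and leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
"""


def gptbot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given rules let the GPTBot user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", url)


if __name__ == "__main__":
    for url in ("https://example.com/private/page",
                "https://example.com/blog/post"):
        print(url, gptbot_allowed(ROBOTS_TXT, url))
```

To block the crawler from an entire site instead, the Disallow line under the GPTBot user agent would be changed to a single slash.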
Of course, this news comes too late for website owners and content creators to affect ChatGPT or GPT-4's current training data, which were scraped without fanfare years ago, as Ars Technica observed.
The new instructions will not necessarily stop the web-browsing feature of ChatGPT – currently disabled – or ChatGPT plugins from accessing the website at the instruction of users.
Moreover, other large data sets of scraped websites already exist and are commonly used to train various large language models (LLMs). The third-party sites hosting these data sets are unlikely to block GPTBot, which means websites could still be scraped for AI training against the wishes of their owners.
This news comes shortly after reports that traffic to ChatGPT dropped by almost 10% last month compared to previous months. This has resulted in speculation that the release of GPTBot is the direct result of the dip in traffic.
My personal opinion is that GPTBot was always part of the roadmap. The dip in traffic could be attributed to a reduction in novelty value, coupled with the fact that organizations and end-users are switching to access GPT-3.5 and GPT-4 through APIs.
Users can use OpenAI Playground for API access, or build their own application relatively easily. While OpenAI charges for each API call, even a heavy API user will find it substantially cheaper than the USD 20 monthly subscription for ChatGPT access.
Finally, third-party subscription services such as Poe offer alternative ways to leverage not just GPT-4, but also Claude 2 and other LLMs.
Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose.
Image credit: iStockphoto/BeritK