Web publishing platform Medium has announced that it will block OpenAI’s GPTBot, an agent that scrapes web pages for content used to train the company’s AI models. But the real news may be that a group of platforms could soon form a unified front against what many consider an exploitation of their content.
Medium joins CNN, The New York Times and numerous other media outlets (though not TechCrunch, yet) in adding a rule for the “GPTBot” user agent to its robots.txt, denying it access to the site. Robots.txt is a file found on many sites that tells crawlers and indexers, the automated systems constantly scanning the web, whether and where that site consents to being scanned. If you would for some reason prefer not to be indexed by Google, for instance, you could say so in your robots.txt.
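The entry itself is simple. Per OpenAI’s own documentation, a site that wants to turn GPTBot away entirely adds two lines to its robots.txt:

```
# Block OpenAI's GPTBot from crawling any path on this site
User-agent: GPTBot
Disallow: /
```

A site could also scope the rule to specific directories (for example, `Disallow: /articles/`), but a blanket `Disallow: /` is what publishers like Medium are opting for. Note that robots.txt is purely advisory; it only works if the crawler chooses to honor it.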
AI makers do more than index, of course: They scrape the data to be used as source material for their models. Few are happy about this, and certainly not Medium’s CEO, Tony Stubblebine, who writes:
I’m not a hater, but I also want to be plain-spoken that the current state of generative AI is not a net benefit to the Internet.
They are making money on your writing without asking for your consent, nor are they offering you compensation and credit… AI companies have leached value from writers in order to spam Internet readers.
Therefore, he writes, Medium is defaulting to telling OpenAI to take a hike when its scraper comes knocking. (It is one of the few that will respect that request.)
However, he is quick to admit that this essentially voluntary approach is not likely to make a dent in the actions of spammers and others who will simply ignore the request. Though there is also the possibility of active measures (poisoning their data by directing dumb crawlers to fake content, for instance), that way lies escalation and expense, and likely lawsuits. Always with the lawsuits.
There’s hope, though. Stubblebine writes:
Medium is not alone. We are actively recruiting for a coalition of other platforms to help figure out the future of fair use in the age of AI.
I’ve talked to <redacted>, <redacted>, <redacted>, <redacted> and <redacted>. These are the big organizations that you could probably guess, but they aren’t ready to publicly work together.
Others are facing the same problem, and as with so many things in tech, more people aligning on a standard or platform creates a network effect and improves the outcome for everyone. A coalition of big organizations would be a powerful counterbalance to unscrupulous AI platforms.
What’s holding them back? Unfortunately, multi-industry partnerships are in general slow to develop for all the reasons you might imagine. By the standards of publishing and copyright, AI is absolutely brand new and there are countless legal and ethical questions with no clear answers, let alone settled and widely accepted ones.
How can you agree to an IP protection partnership when the definition of IP and copyright is in flux? How can you move to ban AI use when your board is pushing to find ways to use it to the company’s advantage?
It may take a 900-pound internet gorilla like Wikipedia to take a bold first step and break the ice. Some organizations may be hamstrung by business concerns, but others are unencumbered by such things and may safely sally forth without fear of disappointing stockholders. But until someone steps up, we will remain at the mercy of the crawlers, which respect or ignore our consent at their pleasure.