A new bill wants to reveal what's really inside AI training data
Schiff’s bill: accountability for copyright violations in artificial intelligence and the legal cover offered to those caught up in them
It is not unusual for developers of AI models to claim their models are trained on publicly available data while saying they cannot identify which of that data is copyrighted. Companies have argued that any copyrighted materials they use fall under fair use. Meanwhile, many of these companies have begun offering legal cover to some customers who find themselves sued for copyright infringement.
Under the bill, companies would have to submit a report no later than 30 days before releasing a model trained on the data to the public. The requirement would not apply retroactively to existing artificial intelligence platforms unless their training data is updated after the bill becomes law.
Schiff’s bill garnered support from industry groups like the Writers Guild of America (WGA), the Recording Industry Association of America (RIAA), the Directors Guild of America (DGA), the Screen Actors Guild – American Federation of Television and Radio Artists (SAG-AFTRA), and the Authors Guild. The Motion Picture Association did not make the list of supporters, but it has backed efforts to protect copyrighted work from piracy. (Disclosure: The Verge’s editorial staff is unionized with the Writers Guild of America, East.)
Can you get companies to remove your data from AI training? How privacy laws shape automated data scraping and AI training
If you have ever posted something on the internet, it has most likely been slurped up and used to help train the current wave of generative artificial intelligence. Large language models and image generators are powered by huge amounts of scraped data. And even if your data isn’t powering a chatbot, it can be used for other machine learning features.
Mireshghallah explains that companies can make it complicated to opt out of having your data used for AI training, and even where it is possible, many people don’t have a “clear idea” of the permissions they’ve agreed to or how their data is being used. Europe’s strong privacy laws, along with copyright protections, also factor into what options are available. Many companies have written into their privacy policies that they can use your data to train artificial intelligence.
There are a number of ways your data could, in theory, be removed from these systems, but very little is known about the processes companies actually have in place. The options that do exist can be buried or labor-intensive, and getting posts removed from data that has already been used for training is likely to be difficult. Where companies are starting to allow opt-outs for future scraping or data sharing, they almost always include users by default, leaving it to you to opt out.
According to the Electronic Frontier Foundation, companies add this friction because they know most people won’t go looking for the setting. Opting out is an action you have to know exists, and know where to find, in order to take.