Thousands of Swiped YouTube Videos Were Used to Train Artificial Intelligence
OpenAI, The Verge, and The Wall Street Journal: Understanding the Use of YouTube Content for Artificial Intelligence (and Machine Translations)
In previous interviews, YouTube CEO Neal Mohan has said that using video content, including transcripts, to train AI would violate the platform’s terms. In May, on an episode of Decoder, Google CEO Sundar Pichai agreed with Mohan that OpenAI would have broken YouTube’s terms if it had trained Sora on the site.
The dataset includes videos from popular creators like MrBeast, and clips from news outlets like ABC News and The New York Times. There are more than 100 videos from The Verge in the dataset.
Proof News has released an interactive lookup tool. Its search feature lets you check whether your content is in the dataset.
“I’m not going to go into the details of the data that was used, but it was publicly available or licensed data,” OpenAI chief technology officer Mira Murati told The Wall Street Journal at the time. When pressed by the Journal about YouTube content specifically, Murati said she “wasn’t sure about that.”
“We have terms and conditions, and we would expect people to abide by those terms and conditions when you build a product, so that’s how I felt about it,” Pichai said.
Proof News: David Pakman of YouTube, a left-leaning political host, and his work for training Artificial Intelligence
Proof News also found material from YouTube megastars, including MrBeast (289 million subscribers, two videos taken for training), Marques Brownlee (19 million subscribers, seven videos taken), Jacksepticeye (nearly 31 million subscribers, 377 videos taken), and PewDiePie (111 million subscribers, 337 videos taken). Some of the material used to train artificial intelligence promoted ideas such as the flat-earth theory.
David Pakman, host of The David Pakman Show, a left-leaning politics channel with over two million subscribers and two billion views, said no one had contacted him about using the show. Nearly 160 of his videos were swept up into the YouTube Subtitles training dataset.
Four people work full time on Pakman’s enterprise, which posts multiple videos each day in addition to producing a podcast, TikTok videos, and material for other platforms. He said he should be compensated for the use of his data. He noted that some media companies have recently signed agreements to be paid for the use of their work to train artificial intelligence.
Pakman said that he puts time, resources, money, and staff time into creating this content. There is no shortage of work.