Date: 2024-08-13

The race to develop artificial intelligence is intense. Big tech is locked in a war to see who has the best language model on the market, and with billions of dollars at stake, the competition is fierce. Although the saying goes that all is fair in love and war, not everyone agrees, least of all those harmed when it is put into practice. Earlier this month, a YouTuber named David Millette filed a class-action lawsuit in California against OpenAI on behalf of all YouTube content creators in the United States. The reason: the company that created ChatGPT may have misused the content that these people upload to the video platform.

The quality and reliability of the answers given by language models are everything. Nobody wants to use an AI whose answers do not fit the question or are completely incorrect, much less pay for one. The quality of the answers depends on several factors, but beyond how the question is formulated and how much the AI is restricted by its own developers or distributors, one is fundamental: the quality of the training the language model has received. As a general rule, the more data it is trained on and the more varied that data is, the better its capabilities and the quality of its responses will be.

The problem developers face is that as language models become more sophisticated, they need more data, and although there are repositories of copyright-free content in both written and audiovisual formats, they are not sufficient on their own. Content generated by other AIs is not an option either, because models trained on it end up collapsing. AI needs humans to improve, and human content is copyrighted. As for the role that audiovisual content in particular can play in training, according to the filing submitted to the California court and shared by The Hollywood Reporter, videos are a valuable source for AI because they contain numerous examples of natural language.

In 2022, OpenAI launched a speech recognition tool called Whisper. This model, which is capable of transcribing audio to text, was trained on 680,000 hours of videos collected from across the web. The numbers don’t add up. “One of the largest audiovisual content websites, VoxPopuli, contains 400,000 hours of untagged copyrighted video, and this is considering that the videos are in different languages. Libriheavy, one of the largest copyright-free video sites, has 50,000 hours of speeches in English. There are only a few sites whose content can be used to train the models. As is evident, the sum of the videos on the two most powerful ones still falls 200,000 hours short of the training declared in Whisper,” the text reads.

If you look at YouTube’s numbers, the fact that OpenAI wants to access its content makes perfect sense. According to the specialized website Global Media Insight, an average of 720,000 hours of video are uploaded to YouTube every day. In an article published in early April, The New York Times claimed that Whisper was indeed able to transcribe the audio from YouTube videos and that an OpenAI team transcribed more than a million hours of video from the platform. In that same article, The New York Times offered another key to the matter: why did Google, the owner of YouTube, do nothing when it detected this practice by OpenAI? Because, according to the newspaper, it did exactly the same thing to train its own language model.

Based on the information published by The New York Times, Millette believes there are hundreds of YouTubers affected. According to The Hollywood Reporter, several artists, tutorial authors and news sites have already joined the lawsuit. The plaintiffs argue that, under YouTube’s terms and conditions, content creators hold rights over the videos they upload to the platform, and that they neither gave OpenAI any kind of permission nor received any compensation in return. On that basis, they accuse the ChatGPT company of illegally enriching itself at their expense and of having violated California’s competition laws. For this reason, the YouTubers demand compensation.

The outcome of this class-action lawsuit, as well as its potential implications if it continues to move forward, remains to be seen. Many such class actions in the United States end with monetary settlements between the defendant company and the plaintiffs. According to YouTube’s terms and conditions, when a user uploads a video, “you retain all ownership rights to your content. What belongs to you remains yours.” However, uploading also entails several mandatory, free-of-charge concessions to both YouTube and users of the platform. “By uploading content to the service, you grant YouTube a worldwide, non-exclusive, free and royalty-free, transferable and sublicensable license to use said content (including to reproduce, distribute, modify, transform, display, communicate to the public and perform it) in order to operate, promote and improve the service,” the terms state. The courts have an opportunity to settle what is valid and what is not in the war for AI supremacy, at least on the front of intellectual property and model training.

