Date: 2025-03-06

A report published by the Rights Alliance for the Creative Industries on the Internet has revealed that the world's leading artificial intelligence (AI) companies, including Microsoft, Meta, OpenAI and DeepSeek, have used pirated content to train their generative models. Until now, it was known that these large companies had trained their AI systems on content protected by intellectual property without the permission of the corresponding rights holders. The report from this Danish association, which represents more than 100,000 copyright holders, goes further: copies of content obtained from classic pirate sites, such as illegal file-sharing or illegal streaming websites, have been used.

The report, entitled ‘Report on pirate content used in the training of generative AI’, says that companies that have launched generative models have resorted to data sets obtained from pirate sites: LibGen, Anna’s Archive and Books3 for books; OpenSubtitles for film and television subtitles; Watchseries and YouTube for videos; and Common Crawl for text hosted on websites, including press publications and song lyrics. The report includes Common Crawl because, although it is not a pirate site in the traditional sense, it has never obtained permission to copy and distribute the volume of protected content it hosts.

One of the most widely used data sets is Books3, which contains more than 196,000 books in plain text, obtained from the pirate site Bibliotik.me. It is distributed through BitTorrent and by individuals on several online platforms and servers. According to the report, this data set has been used by companies such as Apple, Anthropic, Meta and Microsoft to train their language models. Another relevant data set is OpenSubtitles, which includes subtitles of films and series obtained from OpenSubtitles.org, a site known for hosting pirated content.

The report also mentions cases such as Runway AI, which has developed and provides access to a video generation model called Gen3-Alpha. Runway used software to copy thousands of YouTube videos without the creators' consent. Likewise, Suno Inc, a music generation company, was sued by several US record companies for infringing copyright by reproducing their protected recordings without permission. Suno admitted to having trained its model on “tens of millions of recordings” obtained from the internet; according to the report, it probably obtained recordings directly from ‘cyberlockers’ or via BitTorrent.

The report has been made public in Denmark a few weeks after Spain's Ministry of Culture chose to withdraw the Royal Decree that was intended to regulate copyright in the development of generative models. The decree, which sought to ensure that the development of Alia, the Spanish generative AI, respected copyright, included a mechanism not used so far in Spain: the extended collective license. This would have allowed collective management entities to grant licenses for the use of works in their repertoires and of works outside their repertoires, unless the authors expressly objected.

This route caused a deep division in the sector. Among the copyright management entities themselves, some, such as the SGAE, supported the extended collective license, while others, such as Angedi, opposed the mechanism. Several authors' associations also opposed the system. On top of this division came strong pressure on Moncloa from the large technology companies, which oppose licenses for the use of protected content to train their AI models.