Roberto Di Cosmo (Parma, 1963) has an obsession: he wants to gather all the world’s source code in one place. This Library of Alexandria of programming must be non-profit and accessible to anyone, from researchers to private companies and individuals. If anyone who wishes can study the architecture of the computer applications we use, it will be easier to understand and improve them, generating more knowledge for the prosperity of society.
It has been five years since the dream of this Italian scientist living in Paris began to come true. Thanks to his personal commitment, the Software Heritage Initiative was launched in the summer of 2016 at the headquarters of the INRIA research center in the French capital. Since then, it has collected more than 11 billion unique files from more than 160 million repositories. All that code fits into one petabyte (1,000 terabytes, or one million gigabytes), the equivalent of the data that the Hubble Space Telescope would amass over 455 years. The master copy of the archive is held by Software Heritage, and there are two others on the Microsoft (Azure) and Amazon (AWS) cloud servers.
In 2017, Di Cosmo and his team succeeded in getting UNESCO to declare software part of the cultural heritage of humanity, to be preserved like music or literature. That is the task Software Heritage is dedicated to. Its funding comes from public institutions such as the French Ministry of Innovation and several universities, but also from banks such as Société Générale and companies such as Microsoft, Google, Intel and Huawei.
“What we do is the equivalent of creating a kind of Google of code,” says Di Cosmo in perfect Spanish with an Argentine accent, courtesy of his wife. He is visiting Madrid to take part in a congress on open science held at the Polytechnic University. The institution he founded and directs has a lot to say on the subject. “It is necessary to build an infrastructure that allows the source code used in research to be easily stored, referenced, disseminated and described in a way that is accessible to all,” he emphasizes. The successful collaboration of the scientific community in developing the covid vaccine is a strong argument in favor of this long-standing demand.
Source code consists of lines of text, written in a programming language, that allow computer programs to run. Many companies and developers jealously guard their code: they make a living by selling it or by building products on it. But there are also those who publish their creations so that others can take advantage of them. The free software culture, which originated in the eighties and was championed by Richard Stallman, promotes that vision of programming: making the source code of programs transparent and sharing one’s own developments with the community so that others can improve them or use them as the starting point for larger projects.
The triumph of free software
“In a way, free software has won. It is estimated that in 2017 between 80% and 90% of the code in new applications was reused from code that already existed,” he points out. “Large companies like Microsoft, which a few years ago would not even use the term, now use open source massively.” This change of course is due to the fact that software has become so complex that no one, whether a company or a country, is able to write everything on their own from scratch: the most efficient approach is to cut and paste pieces of code that are already known to work and to focus effort on the new functionality.
The fact that it is free does not mean that it does not contribute to the economy. According to European Commission estimates, European companies invested around €1 billion in open source software in 2018, which had an impact on European GDP of between €65 billion and €95 billion.
Despite its rise, its existence should not be taken for granted. “In 2015, Google Code, the code repository sponsored by the American multinational, closed, putting 700,000 projects at risk. Gitorious, another of the most popular sites in the world, was bought by GitLab, which chose to close it, affecting 120,000 projects. A few months ago, Bitbucket decided to modify a technical aspect and deleted 250,000 projects. Saving all that is complicated,” explains the computer scientist.
Software Heritage collects the material for its large virtual library in three ways. “We go looking for all the source code on every platform we know of, with the difficulty that, technically, each one speaks a different language. This is how we get the vast majority of the data,” he explains. “But we also open two other doors: the possibility for anyone to point us to a website with source code, so that we can retrieve it automatically, and collaboration with scientific associations.”
Mirror copies in each country
Di Cosmo and his colleagues chose early on to keep multiple copies of their universal source code archive. In addition to their own copy and those they hold in the cloud, the Software Heritage Initiative is developing a system of mirror copies (mirrors). These are copies of the archive that are under the administrative and technical control of other entities. The first will be in Italy, at the National Agency for New Technologies and Energy (ENEA). “They will have our data, but we will not be able to write to their archive. So if a hacker comes and deletes everything, he will not be able to do the same with that copy: he will have to hack it too,” he explains.
The scientist expects that national governments will not take long to realize that it is in their interest to support the initiative and have their own mirror copy. “Today, software is essential for everything to keep working. We create a copy of everything we can collect, and the countries that want one will have their mirror copy. That way, you don’t lose your data and you also make sure that, whatever happens, no one will be able to cut off your access to the programs you use. So, paradoxically, this global collaborative initiative also responds to each country’s need for strategic autonomy,” concludes Di Cosmo.
The source code haven he runs has a minimal team. “We need between 30 and 50 full-time people and an annual budget of between five and 10 million. If you compare it to the cost of a telescope, an oceanographic ship or a particle accelerator, that’s nothing. But it is true that, being virtual, our work is less tangible than others’, and public administrations are generally better at financing machines than people.”