A US researcher has recovered genetic sequencing data from samples of the new coronavirus that were collected from the first Covid-19 cases in Wuhan, China, and that had disappeared from a global scientific database. More than 200 sequences of the Sars-CoV-2 genome were deleted from the database at the request of Chinese researchers about a year ago; the material could help clarify the origins of the pandemic.
Evolutionary biologist Jesse Bloom, an influenza virus specialist at the Fred Hutchinson Cancer Research Center in Seattle (USA), reports that he found 13 of these original sequences by digging through files stored on the Google Cloud. The analysis of these data, published on Tuesday as a preprint on the bioRxiv platform, supports the hypothesis that a variety of coronaviruses may have circulated in Wuhan before the outbreaks related to the city’s food and seafood market. The study has not yet undergone independent peer review and has not been published in a scientific journal.
Bloom, who has been advocating more research into the origin of the Covid-19 pandemic, was reviewing coronavirus genetic data published by several researchers when he found an article that mentioned 241 genetic sequences from virus samples collected by researchers at Wuhan University in January and February 2020. The article indicated that the data had been entered into an online database called the Sequence Read Archive (SRA), coordinated by the US government’s National Institutes of Health (NIH). This platform is used by scientists around the world, who deposit genetic sequences so that other experts can analyze them.
However, when Bloom searched the database for these sequences earlier this month, he did not find them: they had been removed. The researcher also learned that SRA files are removed only upon request.
The NIH told the British newspaper The Telegraph that its team “reviewed the data removal request made by the submitting researcher” in June 2020, and deleted the data. “The applicant indicated that the information on the sequences had been updated, had been submitted to another database, and wanted the data removed from the SRA to avoid version control issues,” a spokesman told the newspaper. “Researchers who submit data retain the rights to their data and may request that it be removed.”
The American researcher noticed that many of the sequences had been stored in files on Google Cloud. Since the filenames followed the same pattern, he was able to find and retrieve 13 of the sequences that were still in the cloud.
Virus family history
Genetic sequencing of different versions of the virus helps scientists build its “family tree”. Experts agree that the “ancestors” of Sars-CoV-2 are bat coronaviruses, but the path it took to reach humans is still being investigated, and that includes the hypotheses of transmission from animals or a laboratory accident.
The oldest genetic sequences should therefore be the most similar to those of the bat coronavirus, because the virus keeps evolving and accumulating mutations that make it increasingly different from the original virus.
Among the oldest samples of the new coronavirus are those collected from cases related to the seafood market in Wuhan, starting in December 2019. These samples contain three mutations that are not present in samples collected weeks later. Viruses from those later weeks lack these mutations and are more like bat coronaviruses, which would indicate that some earlier lineage was not part of the Wuhan market outbreak.
Bloom found that the samples that were deleted from the archive also did not contain these mutations. “They are three steps more similar to bat coronaviruses than the viruses in the Huanan seafood market,” the researcher told the New York Times. This indicates, he says, that when the virus hit the market it had been circulating for some time in Wuhan or elsewhere; that is, these sequences would represent an earlier stage in the virus’s evolution.
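The "three steps" reasoning can be illustrated with a toy calculation. The sketch below is only an illustration: the short sequences are invented for the example (they are not real genome data), the names `count_differences`, `bat_reference`, `recovered_early` and `market_sample` are hypothetical, and a real analysis would rely on aligned full genomes and phylogenetic inference rather than a simple count.

```python
def count_differences(seq_a: str, seq_b: str) -> int:
    """Count positions at which two aligned sequences of equal length differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


# Invented toy sequences, assumed to be already aligned.
bat_reference = "ACGTACGTACGT"    # stand-in for a bat coronavirus relative
recovered_early = "ACGTACGTACGA"  # stand-in for a recovered early sequence
market_sample = "TCGTACCTAGGA"    # stand-in for a market-linked sequence

steps_recovered = count_differences(bat_reference, recovered_early)
steps_market = count_differences(bat_reference, market_sample)

print("recovered sample differs from reference at", steps_recovered, "positions")
print("market sample differs from reference at", steps_market, "positions")
print("extra steps in market sample:", steps_market - steps_recovered)
```

In this made-up case the market-like sequence carries three more differences from the reference than the recovered one, which mirrors the shape of the argument: fewer differences from the bat relative are read as an earlier stage in the virus's evolution.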
In a series of Twitter posts, Bloom talked about the implications of the findings: “First, the fact that this dataset was deleted should make us skeptical that all relevant initial sequences from Wuhan were shared. We already know that many laboratories in China ordered the destruction of initial samples.”
The researcher also said that it may be possible to obtain additional information about the initial spread of the coronavirus in Wuhan even if efforts to carry out further on-site investigations are blocked.
What other experts say
For virologist David Matthews, from the University of Bristol (UK), the analysis lends weight to the idea that, although the pandemic probably originated in Wuhan, the city’s seafood market, investigated by the WHO team and others, probably was not “ground zero”. “The article also suggests that Chinese scientists themselves were interested in openly sharing the data they were generating, but that their initial openness may have been restricted,” Matthews told the Science Media Centre (SMC).
Martin Hibberd, professor of Emerging Infectious Diseases at the London School of Hygiene & Tropical Medicine, found the article interesting but “speculative.” “More work is needed to know how solid these findings are, particularly the accuracy and reasons for deletion of the sequences,” he told the SMC.
Andrew Preston, from the University of Bath (UK), said that, although the analysis has interesting aspects, “it will be difficult to corroborate the work”, since the inclusion of deleted data also raises questions about its origin. The researcher also told the SMC that the article uses unusual language, contains many assumptions and conjectures, and tends to stray into non-scientific areas, such as “cover-ups and deliberate data retention.”