Regaining perspective on SARS-CoV-2 molecular tracing and its implications

Our take —

This study explored the limitations of the sequencing data publicly available at the time, and investigated whether these sequences were representative of the pandemic thus far. The researchers found that the data were not representative of reported cases, with sequence data lacking from many case hotspots (e.g., most of Italy). The authors explain how this sampling bias may lead to misinterpretation of the data, but they do not acknowledge the utility of phylogenetic analysis early in an outbreak, nor do they describe the relevant and valid conclusions already drawn from this type of data.

Study design

Other

Study population and setting

The study involved the analysis of 331 full genome sequences from 29 countries downloaded from the public repository GISAID (https://www.gisaid.org/). The purpose of the study was to warn readers about the uncertainty that is intrinsic to phylogenetic studies with limited and/or biased sampling, and to highlight the extent to which SARS-CoV-2 sequencing has been limited and biased thus far.

Summary of main findings

The study analyzed SARS-CoV-2 sequences deposited in the GISAID database, which is used by researchers worldwide. The authors noted that sampling was uneven across countries, with the number of confirmed cases clearly uncorrelated with the number of sequenced genomes. Even within a single country, sequenced genomes were usually sampled from a few hotspots and were not necessarily representative of the whole epidemic in that country. The authors explained that uneven sampling and missing data can have dramatic effects on the results of phylogeographic analyses, like the ones recently rushed through news and (social) media to claim specific dissemination routes of SARS-CoV-2 among countries. They also demonstrated a low phylogenetic signal in the data, caused by limited mutation accumulation in the first 3 months of the outbreak, which further contributes to unreliable phylogeographic inferences.
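
As an illustration of the kind of check described above, the following is a minimal sketch of a rank correlation between per-country case counts and genome counts. It is not the authors' method: the choice of Spearman's coefficient, the function names, and the placeholder counts are all assumptions made for this example, and the numbers are not taken from GISAID or from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical placeholder counts for illustration only (NOT real data):
# country label -> (confirmed cases, genomes deposited)
counts = {
    "A": (80000, 120),
    "B": (60000, 2),
    "C": (9000, 45),
    "D": (400, 30),
    "E": (150, 60),
}
cases = [c for c, _ in counts.values()]
genomes = [g for _, g in counts.values()]
rho = spearman(cases, genomes)
print(f"Spearman rho between cases and genomes: {rho:.2f}")
```

A coefficient near 1 would suggest that sequencing effort tracks case burden, while a value near zero would be consistent with the uneven sampling the authors describe. A rank-based measure is used in this sketch because case and genome counts span several orders of magnitude.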

Study strengths

The study used several methods to highlight both the imbalances in sample collection and the lack of sufficient phylogenetic signal in the current datasets, which likely makes some phylogeographic inferences questionable.

Limitations

The paper does not balance the potential pitfalls of phylogenetics against the benefits of the phylogenetic analyses performed during the early period of the COVID-19 pandemic, and does not explain which types of conclusions can be trusted, and to what degree. Other studies using these incomplete datasets have provided useful information, such as the evolutionary rate of the virus, the approximate date of its origin, its growth rate, and the basic reproduction number. It would be useful to also highlight the important information that can be drawn from these early datasets, and to explain how to address the pitfalls mentioned in order to increase the reliability of phylogenetic findings. As this is a historical review of important articles, it is also important to remember that the relationship between the number of published genomes and the number of cases presented in this paper is already extremely outdated, given the exponential increases in both reported cases and published genomes in the weeks following the paper's submission.

Value added

This article highlights issues associated with phylogenetic studies that can lead to misinterpretation of data. It educates researchers on the importance of minimizing sampling bias and evaluating the data for phylogenetic signal. The paper also directly counters some of the (unsubstantiated) narratives propagated in the media, such as a purported relationship between SARS-CoV-2 and snakes, and a purported relationship between virus lineages and disease severity.