
A new SARS-CoV-2 lineage that shares mutations with known Variants of Concern is rejected by automated sequence repository quality control

Our take —

This study, available as a preprint and thus not yet peer-reviewed, describes the identification of a new SARS-CoV-2 lineage (B.1.x/B.1.321.1) in central California in early 2021. The B.1.x lineage has mutations found in other variants of concern (VOC), which may contribute to increased transmission or evasion of host immune responses. The B.1.x lineage does not appear to pose a greater health risk than other VOC, but some B.1.x sequences in the UK have additionally acquired the E484K mutation. If these B.1.x sublineages become widespread, they may impact the efficacy of vaccines or antibody-based treatments for COVID-19. The B.1.x lineage’s 35 base pair deletion in ORF8 leads to a frameshift and premature stop codon, making submission of these sequences to standard databases (e.g., GenBank, GISAID) problematic. This illustrates a potential limitation when using curated sequence data to monitor the spread of B.1.x and similar lineages, and the authors suggest mechanisms to limit future submission bias, which would improve genomic surveillance results.

Study design

Retrospective Cohort

Study population and setting

This study describes the identification of a novel SARS-CoV-2 lineage, B.1.x (sometimes referred to as B.1.321.1), in Santa Cruz County, CA, USA, in early 2021. Phylogenetic analysis was performed using consensus sequences from SARS-CoV-2-positive residual samples (n=339) and randomly selected global background sequences (n=1,000). Similar sequences were retrieved from GenBank and GISAID for comparison. The growth rate of the B.1.x lineage was estimated using a simple logistic regression model.
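For readers unfamiliar with this approach, the sketch below shows one common way a lineage growth rate can be estimated with logistic regression: each sequenced sample is coded as belonging to the lineage or not, and the log-odds of membership are modeled as a linear function of collection time. This is an illustration only, not the preprint's code. The weekly counts are synthetic, and the function names (`fit_logistic`, `predicted_share`) and the choice of Newton's method are assumptions made for this example.

```python
import math

def fit_logistic(times, outcomes, iterations=25):
    """Fit P(lineage | t) = 1 / (1 + exp(-(a + b*t))) by Newton's method.

    Returns (a, b): the intercept and the per-unit-time growth rate
    of the log-odds that a sample belongs to the lineage.
    """
    a, b = 0.0, 0.0
    for _ in range(iterations):
        # Accumulate the gradient (g) and the information matrix (h)
        # of the log-likelihood at the current (a, b).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for t, y in zip(times, outcomes):
            p = 1.0 / (1.0 + math.exp(-(a + b * t)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * t
            h00 += w
            h01 += w * t
            h11 += w * t * t
        # Newton step: solve the 2x2 system h * delta = g.
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def predicted_share(a, b, t):
    """Modeled fraction of samples belonging to the lineage at time t."""
    return 1.0 / (1.0 + math.exp(-(a + b * t)))

# SYNTHETIC data (not from the study): 10 samples per week for 8 weeks,
# with the listed number of samples per week assigned to the lineage.
lineage_counts = [0, 0, 1, 1, 1, 2, 2, 3]
weeks, is_lineage = [], []
for week, k in enumerate(lineage_counts):
    weeks += [week] * 10
    is_lineage += [1] * k + [0] * (10 - k)

a, b = fit_logistic(weeks, is_lineage)
# Time for the odds of lineage membership to double, in weeks.
doubling_time_weeks = math.log(2) / b
print(f"growth rate per week: {b:.3f}")
print(f"odds doubling time (weeks): {doubling_time_weeks:.2f}")
```

A model of this kind uses time as its only predictor, so it cannot adjust for differences in who was sampled or where. With only a handful of lineage-positive samples, the estimated rate will also carry wide uncertainty.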

Summary of Main Findings

More than half of the sequences identified in this dataset were from the B.1.427 and B.1.429 lineages, which were first identified in California. Two B.1.1.7 sequences were also found, but no other CDC-designated variants of concern (VOC) were identified. However, eight samples (2.4%), collected in February and March 2021, appeared to represent a new lineage within B.1, which the authors temporarily refer to as B.1.x, awaiting more refined classification. Prevalence of B.1.x increased over time, from 1% in January 2021 to 10% in March 2021. Additional sequences similar to B.1.x were identified in over 20 US states and 6 countries. (Of note, some UK sequences had been submitted under lineage B.1.321.1.)

Lineage-defining point mutations for B.1.x include several in spike protein (S494P, N501Y, D614G, P681H, K854N, and E1111K) and N:M234I. While several of these mutations are shared with other VOC, it appears unlikely that B.1.x is the result of a recombination event. B.1.x sequences also contain a large 35 base pair deletion in ORF8, which results in a premature stop codon. The biological significance of ORF8 inactivation, which is also present in B.1.1.7, is still unknown. However, because the deletion in B.1.x sequences leads to a frameshift (35 is not a multiple of three), their submission to database repositories is automatically rejected. Successful submission of these sequences requires additional, lengthy steps in the manual curation process that many labs elect not to complete, instead choosing to abandon submission or to modify the sequences (e.g., adding N’s in place of deleted residues) in order to bypass quality control mechanisms. This means that B.1.x and other lineages with frameshift mutations may be underrepresented in sequence databases, limiting the ability to accurately estimate their impact on the pandemic. The authors suggest adding rapid phylogenetic analysis as a step in the submission process, in order to allow closely related novel sequences to validate each other at the time of submission.
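The toy example below illustrates why a 35 base pair deletion shifts the reading frame and can introduce a premature stop codon, and why padding the gap with N's hides the shift from a frame-based check. It is a simplified sketch, not the quality control logic of GenBank or GISAID. The sequence is invented for illustration and is not ORF8, and all function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def causes_frameshift(deletion_length):
    """A deletion shifts the reading frame unless its length is a multiple of 3."""
    return deletion_length % 3 != 0

def first_stop_codon_index(seq):
    """Return the 0-based codon index of the first in-frame stop codon, or None."""
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i // 3
    return None

def delete_region(seq, start, length):
    """Remove `length` bases beginning at 0-based position `start`."""
    return seq[:start] + seq[start + length:]

def mask_region(seq, start, length):
    """Replace the same region with N's, keeping the overall length unchanged."""
    return seq[:start] + "N" * length + seq[start + length:]

# INVENTED toy open reading frame (not real ORF8 sequence).
prefix = "ATGGCC"                # two codons upstream of the deletion
deleted = "GCT" * 11 + "GC"      # the 35 bases that will be removed
suffix = "GCTTGACGCTTAA"         # downstream bases, ending in the normal stop
intact = prefix + deleted + suffix

with_deletion = delete_region(intact, len(prefix), len(deleted))
with_n_padding = mask_region(intact, len(prefix), len(deleted))

# 35 % 3 == 2, so the downstream bases are read in a shifted frame.
print("frameshift:", causes_frameshift(len(deleted)))
print("first stop codon, intact:", first_stop_codon_index(intact))
print("first stop codon, with deletion:", first_stop_codon_index(with_deletion))
print("first stop codon, N-padded:", first_stop_codon_index(with_n_padding))
```

In this toy sequence the deletion moves the first stop codon from the end of the reading frame to just past the deletion site. The N-padded version keeps the original frame and the original stop position, so the true frameshift is no longer visible to a check of this kind.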

Study Strengths

Routine genomic surveillance with whole genome sequencing was used to identify a new SARS-CoV-2 lineage harboring several mutations found in other VOC.

Limitations

The sample size for B.1.x sequences in this dataset is quite small (n=8), and none were detected at the last two study timepoints; both factors limit the accuracy of growth estimates. Additionally, the samples do not represent a random sample from the region. The growth rate of the B.1.x lineage was estimated using only a simple logistic regression model, as samples were anonymized and lacked covariate data. The functional relevance of the combination of mutations found in B.1.x was not assessed.

Value added

This study describes the identification of a new SARS-CoV-2 lineage (B.1.x) by genomic surveillance in early 2021. B.1.x contains a large deletion (and consequent frameshift mutation that inactivates ORF8), which may lead to underrepresentation of this lineage in sequence databases, as initial submissions of sequences containing frameshift deletions are automatically rejected. This illustrates a limitation in our ability to accurately monitor the spread of some SARS-CoV-2 lineages and VOC.

This review was posted on: 14 May 2021