We described previously that the newly identified HCoV-NL63 virus has a typical coronavirus genome structure and gene order . The nucleotide composition of the genomic (+) strand RNA of several coronaviridae members is presented in Figure 1, demonstrating a common pattern with U as the most abundant nucleotide and G and in particular C as underrepresented nucleotides. HCoV-NL63 has the most extreme nucleotide bias among the coronaviridae, with 39% U and only 14% C. As a general trend, U and C seem to compete directly, because the genomes with the lowest C-count (HCoV-NL63, HCoV-OC43 and BCoV) have the highest U-count and vice versa (Figure 1).
Nucleotide content of coronaviridae RNA genomes. We arranged the viruses based on their C-count, which ranges from 14% (HCoV-NL63) to 20% (SARS-CoV).
To investigate if all coding regions of HCoV-NL63 display a similarly strong preference for U and against C, we also plotted the nucleotide count for the individual genes and 5′ and 3′ non-coding regions (Figure 2). The typical nucleotide bias is observed in all genome segments. The highest U-count is found in the ORF3 and E genes (43%) and the lowest C-count in the 1a/1b genes and the 3′ UTR (13%, 14% and 14%, respectively). The N gene is most moderate in its nucleotide bias, with 21% C and 32% U, confirming the “competition” idea that was already suggested by inspection of Figure 1.
Nucleotide content of individual HCoV-NL63 genes and the 5’/3′ untranslated regions (UTR).
We plotted the nucleotide distribution along the genome (Figure 3) to determine whether there is any significant variation. We observed that local changes in A-count are inversely linked to changes in G-count. This is most striking in the 20400-26000 nt region, where three A peaks are mirrored by three G anti-peaks. Although the typical bias is maintained along the genome, the most notable variation occurs in the last one-third of the genome, where an increase in C and a decrease in G content is apparent. This region encodes the structural proteins.
Nucleotide distribution along the HCoV-NL63 genome. The change in the C- and G-count at two-third of the genome is statistically significant for all tested coronaviruses (HCoV-NL63, HCoV-229E, SARS-CoV, HCoV-OC43) with p < 0.01 for C-count and p < 0.05 for G-count in Mann-Whitney U test for two independent samples.
Recently, Grigoriev reported an interesting feature within coronaviral genomes that is visible when the cumulative GC-skew is plotted [20, 21]. Cumulative GC skew graph is a way to visualize the local G:C ratio along the genome, discarding the local fluctuations. A biphasic pattern was described that separates the 1a/1b polyprotein genes and the structural genes. The cumulative GC-skews for HCoV-NL63 and four other coronaviruses: HCoV-OC43, HCoV-229E, PEDV and SARS-CoV are presented in Figure 4. In the 1a/1b genes, the G:C ratio reaches high levels, whereas for all coronaviruses, including HCoV-NL63, the 3′ end of the genome displays a flattening of the curve, as the G:C ratio reaches value ~ 1 or less. Grigoriev proposed that this biphasic pattern is due to the discontinuous transcription process . He suggested that the frequent deamination of cytosine on the (-) strand RNA results in a decrease of G on the (+) strand in the region encoding the structural genes. In the discussion section we will present an alternative mechanistic explanation.
Cumulative GC-skew diagrams for several coronaviral RNA genomes. The vertical bar indicates the border between the 1a/1b and the structural genes.
HCoV-NL63 codon usage
The bias in the nucleotide count led us to compare the codon usage of HCoV-NL63 with that of human mRNA (Table 1). The codon usage of HCoV-NL63 differs markedly from that of human mRNAs. Third-base choices in the four-codon families (Thr, Pro, Ala, Gly, Val) provide a convenient example of this contrasting codon usage. For instance, the Gly codons in human mRNAs prefer C (34%) over G (25%), A (25%) and U (16%). In contrast, HCoV-NL63 prefers U (83%) over A (7%), C (8%) and G (2%). This result strongly suggests that the codon usage is shaped directly by the unusual nucleotide composition of the viral genome, that is a high U-count and a low G/C-count. All HCoV-NL63 genes, except for the E gene, follow this trend (Table 1). The coronaviral addiction to the U nucleotide is most prominent in the “free” third position of degenerate codons. For the complete genome, the U-count at the third position is up to 58% whereas the A-count is 20%, G-count is 13% and C-count is only 9% (Figure 5). This illustrates that the U-pressure mainly affects the %C and %G.
Nucleotide composition of the first, second and third codon positions in the HCoV-NL63 genome.
Identification of the HCoV-NL63 TRS elements
The 5′ end of HCoV-NL63 genome RNA contains the L sequence of 72 nucleotides that ends with the L TRS element. This TRS has a high similarity to short sequences that are located in front of each open reading frame (S-ORF3-E-M-N) . We previously identified the L TRS and body TRS of the N gene using a cDNA bank , which allowed us to predict the body TRS of the other genes. To confirm these predictions, we amplified and sequenced all sg mRNA fragments with a general L primer and gene-specific 3′ primers in an RT-PCR protocol.
Inspection of sg mRNA junctions indicated that they are indeed composed of the part of the HCoV-NL63 genome that is directly downstream of a particular body TRS, with its 5′ end derived from the leader sequence. Apparently, strand transfer occurred on the 5′ end of the body TRS, as indicated in Figure 6. The most conserved TRS region was defined by multiple sequence alignment as AACUAAA (gray box). This core sequence is conserved in all sg mRNA, except for the E gene that contains the sub-optimal TRS core AACUAU A (Figure 6). Interestingly, the E gene contains a 13-nucleotide sequence upstream of the core sequence with perfect homology to the L sequence. Perhaps the upstream sequence compensates for the absence of an optimal TRS core during discontinuous (-) strand synthesis. This would suggest that these sequences are copied during (-) strand synthesis, and that the actual strand transfer within the E sequences occurred after copying of the core TRS and the next 13 nucleotides. Evidence for such a “delayed” strand transfer is provided by the junction analysis of the M and N sg mRNAs, which clearly demonstrates that the nucleotides directly upstream of the core TRS are derived from the body TRS element and not from the leader.
Body-leader junctions of all HCoV-NL63 sg mRNAs. Shown on top is the leader (L) sequence and below the specific sequences upstream of the structural genes. The fusion of 5′ L sequences to 3′ sg RNA is indicated by the boxes. Sequence homology between the strands near the junction is marked by asterisks, the conserved AACUAAA TRS core is highlighted in gray.
Analysis of the subgenomic mRNAs of HCoV-NL63
To determine whether the predicted sg mRNAs encoding the S-ORF3-E-M-N proteins are produced in virus-infected cells, we performed Northern blot analysis on total cellular RNA (Figure 7). We used a (-) strand N gene probe that anneals to both genomic RNA and all sg (+) strand mRNAs. We included RNA from MHV-infected cells to obtain discrete size markers. Six distinct mRNAs are produced in HCoV-NL63 infected cells. The sizes of the RNA fragments were estimated and these values nicely fit the size of the genomic RNA and the five predicted sg mRNAs. All HCoV-NL63 ORFs that have the potential to encode viral proteins are indeed transcribed into sg mRNAs (Figure 7).
The left panel shows the Northern blot analysis of HCoV-NL63 RNA in infected LLC-MK2 cells. RNA of HCoV-NL63 (NL63 lane) was compared with RNA of MHV strain A59 (MHV lane). Non-infected LLC-MK2 cells are included as a negative control (control lane). MHV RNA bands represent the complete genome (1) and sg mRNAs 2a (2), S (3), 17.8 (4), 13.1 and E (5), M (6), N (7). HCoV-NL63 RNA includes the complete genome (1) and sg mRNAs for S (2), ORF3 (3), E (4), M (5) and N (6). The right panel shows the MHV and HCoV-NL63 genome organization and the HCoV-NL63 sg-mRNAs.
To determine the expression level of each subgenomic RNA, we measured the intensity of the signals. When plotted as a function of the genome position (Figure 8), there appears a correlation between the relative distance of a gene to the 3′ terminus and its RNA expression level, with the exception of the E gene.