Chapter 1 Introduction
1.1 Why do we do scRNA-seq?
Here, we try to understand the reasons behind performing scRNA-seq experiments. First, we compare bulk RNA-seq and its single-cell counterpart to highlight some of the advantages (and disadvantages) of scRNA-seq. Second, we discuss the types of studies that involve scRNA-seq. Third, we wish to highlight some considerations when interpreting results from single-cell studies.
1.1.1 Difference between bulk and single-cell RNA-seq
Before the emergence of scRNA-seq, genome-wide transcriptomes can already be profiled at a bulk level on a population of cells. In general, for both bulk and scRNA-seq, RNA is being reverse-transcribed into cDNA, which is then subjected to high-throughput sequencing. However, there exist many differences between bulk RNA-seq and scRNA-seq, summarized by the table below.
Bulk RNA-seq | scRNA-seq |
---|---|
First performed by Bainbridge et al. (2006) | First performed by Tang et al. (2009) |
Measures an average expression across a population of cells |
Measures a distribution of expression levels across a population of cells |
Useful for comparative transcriptomics between conditions |
Useful for studying heterogeneity of individual cells |
Higher starting RNA material | Low initial RNA material (noiser gene expression) |
Triplicate or duplicate libraries | 102 to 106 single cells |
Between bulk RNA-seq and scRNA-seq, the most obvious difference lies in what is being measured. Bulk RNA-seq measures gene expression in the form of a population average since there is no way to assign the RNA levels to individual cells unlike scRNA-seq. Thus, bulk RNA-seq is commonly used in comparative transcriptomics where the average RNA levels are compared between two (or more) conditions, for example between healthy and diseased cases. Furthermore, prior to bulk RNA-seq, the cells may be often purified / sorted using several surface markers to ensure a more homogeneous population.
There also exist differences in the structure of the data. Bulk RNA-seq gene expression matrices are often “long” with only a few samples / columns and tens of thousands of genes / rows. In contrast, scRNA-seq expression matrices are often “wider” with hundereds to thousands of single cells / columns. Furthermore, in scRNA-seq, there is usually less reads per cell, resulting in a sparse expression matrix where 80–90% of the values are zeroes.
1.1.2 Types of single-cell studies
The main advantage of single-cell techniques lies in the cellular resolution. In general, single-cell studies can be broadly classified into one of these three categories which exploits the cellular resolution.
First, there are “cell atlas”-type studies where samples that are highly heterogeneous in terms of cell populations e.g. an organ or a tissue re subjected to scRNA-seq. The cellular resolution allows us to distinguish the various cell populations within the sample. This allows for the identification and characterisation of rare or hard-to-isolate cell populations. Interactions or signalling between cell states as well as cell-state-specific changes between conditions can also be investigated.
Second, there are “timeseries”-type studies where scRNA-seq are applied to samples obtained at various timepoints during a timeseries experiment. Here, the single cells represent snapshots in the biological process. As different cells at the same timepoint may have undergone the biological transition to a different extent, the single cells can span the entire biological trajectory with an optimal choice of timepoints.
Third, there are “screening”-type studies where the single-cell transcriptomes are used as readouts in a screening experiment e.g. CRISPR. The ability to profile numerous single cells is compatiable with high-throughput screens where each single cell represents an individual screening experiment.
1.1.3 Things to consider…
At this point, it would be appropriate to highlight a few considerations that would be relevant in planning or interpreting scRNA-seq experiments.
In scRNA-seq analysis, it is common to aggregate the single cells into a much smaller number of groups, such as the number of cell states. This not only makes the analysis more interpretable but also increases the detection power. This aggregation is often done by taking the mean or median gene expression. However, the mean / median is often not the best reflection of the underlying gene expression distribution. This is especially so if the gene expression displays a multimodal distribution. An example would be aggregating cells in different phases of cell cycle. If one is interested in the cell cycle dynamics, this information will be lost. This is also echoed by the Simpson paradox in which trends disappear or reverse when different groups of data are combined.
Another relevant consideration is on interpreting changes in gene expression. Changes in gene expression between groups of cells could be due to either a (i) change in regulation where we assume that all the cells are changing their expression in a similar manner or a (ii) change in composition where a subset of cells expressing certain genes are enriched or depleted. Again, this is related to the fact that we have aggregated the single cells. Notably, this is also relevant when we interpret bulk RNA-seq data or any other bulk-level analysis.
1.2 Experimental design
Before diving into the computational analysis of single-cell datasets, it is worthwhile to understand some of the experimental aspects of scRNA-seq. The choice of experimental technique often have a huge impact on the downstream computational analysis. Furthermore, good experimental design helps to alleviate any potential confounders.
1.2.1 RNA-seq and scRNA-seq
As mentioned in Section 1.1.1, RNA-seq involves the reverse transcription (RT) of RNA into cDNA which is then subjected to next generation sequencing (NGS). In conventional RNA-seq, RNA samples are subjected to RNA fragmentation prior to RT and NGS (Fig. 1.4). This is because of the size limitations of most sequencing platforms. After sequencing, these RNA fragments are then bioinformatically aligned to the transcriptome. Each fragment constitute a read which is then assigned to the corresponding transcript / gene. Such techniques which capture the entire RNA are called full-length methods. As expected, longer genes tend to have more reads. Thus, counts obtained through full-length RNA-seq have to be adjusted for the gene lengths through measures such as Fragments Per Kilobase per Million reads (FPKM) or Transcripts Per kilobase Million (TPM).
Apart from full-length approaches, there exist an alternative where only parts of the RNA molecule is captured. One such example is Cap Analysis Gene Expression (CAGE) where the 5’ends of RNA are captured and then sequenced. CAGE allows for the accurate identfication of transcription start sites, which was used to map the promoterome across the entire genome (The FANTOM Consortium 2014). Such methods are called tag-based methods and the reads do not need to be adjusted for gene length since only a single fragment is captured per RNA molecule.
Understanding RNA-seq is important as scRNA-seq simply involves single cell isolation combined with RNA-seq. In the next section, we will discuss the plethora of scRNA-seq techniques.
1.2.2 scRNA-seq techniques
Just over a decade since the first scRNA-seq study, there has been a dramatic increase in the number of cells being studied. Commercial kits e.g. from 10X Genomics can profile thousands of cells with ease while atlas-level studies often involve hundreds of thousands or even millions of cells.
This “single-cell explosion” is also accompanied by a variety of scRNA-seq techniques. In general, scRNA-seq techniques differ in two ways, namely (i) the way single cells are captured and (ii) the quantification of RNA abundance. Here, we sumarise a few scRNA-seq techniques based on these distinctions.
With regards to single cell capture, there are microfluidics / droplet-based methods and plate / well-based methods. Droplet-based approaches encapsulates cells with barcoded particles in water-in-oil droplets while well-based methods have isolated single cells placed in individual wells. The former approach have higher throughput in terms of the number of cells while the latter can be combined with FACS to select cells based on certain markers. As for the quantification of RNA abundance, techniques can be separated into full-length or tag-based methods depending on whether the entire RNA molecule is being captured (Section 1.2.1). Full-length methods allow for isoform quantification while tag-based approaches tend to reduce PCR amplification bias.
1.2.3 Unique molecular identifiers (UMI)
Due to the low RNA content in each single cell, PCR-based amplification is often required to generate enough starting material for sequencing. However, this leads to the problem of PCR amplifcation bias where different cDNA fragments get amplified at different rates. To alleviate this problem, the cDNA molecules are tagged with a random sequence, known as an unique molecular identifier (UMI), during RT and prior to PCR (Fig. 1.8). After PCR amplification, cDNA molecules with the same UMI are determined to have originated from one single RNA molecule. Thus, regardless of the extent of amplification, all the cDNA molecules with the same UMI is treated as a single count.
1.2.4 Sequencing considerations
Next, we discuss some of the sequencing considerations in scRNA-seq, in particular (i) the number of cells to sequence and (ii) the number of reads required for each cell.
1.2.4.1 How many cells to sequence?
The number of cells to sequence depends on (i) the estimate number of cell states and (ii) the intended number of cells per cell state. First, the number of cell states is highly dependent on the biology of the system and the definition of a cell state, related to the scope of the study. If one is interested in cell subtypes as opposed to major cell types, then the number of cell states of interest increases and more cells would need to be sequenced. Second, the number of cells per cell state determines the statistical power of downstream analysis. Between full-length and tag-based methods (Section 1.2.2), tag-based methods usually capture less information and thus more cells are required than full-length counterparts. From experience, a minimum of 100 and 50 cells per cell state is required for tag-based (e.g. 10X) and full-length methods (e.g. Smart-seq2) respectively.
Another consideration is the detection of rare cell types. More cells would need to be sequence if a desired cell population only comprises a small fraction of the entire sample. To aid in the calculation, the Satija lab has created an online calculator (Fig. 1.9) where users can specify the estimated number of cell states \(m\), the minimum fraction \(p\) and the minimum number of cells per cell state \(n\). The probablity of observing at least \(n\) cells from each cell state is estimated using the negative binomial distribution \(\text{NBcdf}(k; n, p)^m\) where \(k\) is the number of cells in the library i.e. the number of cells to sequence. The calculator then computes the \(k\) required to achieve 0.95 and 0.99 probablity of observing the rare population.
On a final note, the intended number of cells to sequence should be amendable to the scRNA-seq technique. For example, 10X genomics commerical kits allow for up to 10,000 cells to be captured per channel and a 10X chip contains 8 of such channels. For such droplet-based methods, increasing the number of cells often leads to a higher doublet rate where two or more cells are captured in a single droplet. Such doublets affect downstream analysis and have to be removed bioinformatically. Furthermore, for droplet-based methods, the number of cells loaded initially is not the eventual number of cells that are sequenced. This is because not all the loaded cells will be encapsulated in oil droplets. Current 10X genomics commerical kits (v2 and v3 chemistry) have a capture efficiency of ~60% and these capture efficiencies have to be considered when planning scRNA-seq experiments.
1.2.4.2 How deep do we sequence?
Typically, a cell contains approximately \(10^6\) mRNA molecules, which serves as a natural upper bound on the number of reads per cell. However, scRNA-seq techniques are unable to capture all the mRNA and the number of reads per cell is highly technique-dependent. For example, the v2 and v3 10X Genomics commercial kits have a recommended minimum of 50,000 and 20,000 reads per cell respectively. It is also important to realize that the number of reads does not directly translate to counts in techniques employing UMIs (e.g. 10X). This is because reads with the same UMI are treated as a single molecule / count (Section 1.2.3). From experience, the number of UMIs is often about one eighth to one third (12.5% to 33.3%) of the number of reads for 10X Genomics commercial kits.
Most scRNA-seq techniques generate standard Illumina paired-end constructs, which can then be sequenced using Illumina sequencers. Thus, the choice of sequencer and the number of lanes can be determined after working out the desired number of cells and number of reads per cell.
1.2.5 Multi-library designs: mitigating batch effect
In more elaborate studies, it is possible that not all the cells can be captured in a single library / experiment. Furthermore, multiple libraries might be required to distinguish different samples e.g. different timepoints during a timeseries experiment. When separate single-cell experiments are conducted, batch effects could arise. Probably one of the greatest challenges in genomics research, batch effect arises when measurements are affected by laboratory conditions, reagent lots and personal difference. To alleviate batch effects, experiments are best performed using similar conditions with the same person. Notably, commercial scRNA-seq kits (e.g. 10X) are quite consistent and often introduce little to no batch effects. Furthermore, it is important to ensure that the sequencing reads are not confounded by the sequencing lane or sequencer used. To achieve that, all scRNA-seq libraries should be distributed evenly across all sequencing lanes.
Apart from consistency in sample preparation, there are library designs that allow one to check (or even correct) for any potential batch effects. One way is to introduce a common (and known) population of cells into each single-cell library. In the absence of batch effects, this common group of cells should cluster together. Otherwise, these common cells can be used to “align” the various samples and correct for batch effects. Alternatively, a pooled library containing cells from all libraries can be performed. Thus, the cells from individual libraries should align with some of the cells in the pooled library.
To aid in normalization between libraries, ERCC external RNA spike-ins, comprising a set of exogeneous RNA, can be added. However, it should be noted that these external RNA spike-ins are usually unsuitable for droplet-based methods as these RNA will be evenly distributed across all droplets, including non-cell-containing droplets, greatly increasing the amount of sequencing required.
1.2.6 Single-cell vs. single-nucleus RNA-seq
Conventionally, scRNA-seq is performed with intact cells which requires fresh tissue samples or cell cultures. This presents a major roadblock to frozen / fixed samples or tissues that cannot be readily dissociated. To address this, single-nucleus RNA-seq can be performed where only the nuclei is extracted and captured (Habib et al. 2017). Single-nucleus RNA-seq allow for single-cell measurements on frozen samples.
In a recent benchmarking study, Ding et al. (2020) found that single-nucleus RNA-seq generally performed well for sensitivity and classification of cell types. The transcriptome captured via single-cell and single-nucleus approaches are often comparable with some noticeable difference due to the omission of cytoplasmic RNA. In particular, ribosomal and mitochondrial (MT) RNA are no longer present in single-nucleus RNA-seq. This might be beneficial as these ribosomal and MT RNA often take up the majority of reads in scRNA-seq. Furthermore, single-nucleus libraries often contain more intronic reads from unspliced pre-mRNA. This requires a “pre-mRNA” transcriptome reference during the mapping of reads.
1.3 General computational workflow
Next, we discuss some general aspects of the computational analysis of single-cell datatsets. This include the various steps in the analysis workflow, introduction to general single-cell analysis tools and some of the challenges that can arise in single-cell analysis.
1.3.1 Overall workflow
Despite the large number of studies to date, there does not exist a standardized way to analyse single-cell data. This can be attributed to (i) the noisy nature of the data, due to low RNA content per single cell, which often require elaborate quality control and preprocessing steps, (ii) the large number of cells which opens the possibility of new analysis methods that are previously not applicable to bulk RNA-seq and (iii) the large number of computational tools available which complicates standardization. Fortunately, there exist several publications that provide guidelines and best practices for single-cell analysis (Section 7.1.3).
Overall, all single-cell analysis require a few common steps, which include raw data processing to generate counts, quality control to remove low-quality cells and uninformative genes (Section 2.3), preprocessing such as normalization and feature selection (Section 2.4), followed by basic visualization (Section 2.5). Depending on the biological questions of interest, downstream analysis could include clustering to identify cell populations and differential gene expression between clusters (Section 3) and / or the use of various dimension reduction and trajectory inference algorithms to infer cell-cell relationships (Section TBC).
1.3.2 “All-in-one” analysis package
To aid in the analysis of single-cell datasets, there exist several “platform”
packages which contain large analysis toolboxes to perform various tasks in
the analysis workflow, typically including the steps from quality control to
clustering. For the R
programming language, the Seurat
package
(Butler et al. 2018) is one of the most comprehensive package, which we will
be primarily using in this guide. Another notable package in R
is the
Scater
package (McCarthy et al. 2017) that works on SingleCellExperiment
objects, which are commonly used by other single-cell analysis packages.
For Python users, one can use the Scanpy
package (Wolf, Angerer, and Theis 2018) for
analyzing single-cell gene expression data.
1.3.3 General difficulties in computational analysis
Despite active research and interest in the single-cell field, challenges still remain in both experimental and computational aspects. Here, we highlight a few difficulties that may be encountered in the computational analysis of single-cell data.
First, scRNA-seq is inherently noisier than its bulk counterpart, owing to the amplification of small amounts of initial RNA material and inadequate sampling in terms of the relatively small number of sequencing reads per single cell. This noisy profiling results in the sparsity of single-cell data where most of the gene expression (>80%) are dominated by zero counts. This further translate to higher technical variability which has to be accounted for in downstream computational analysis. Notably, it has been shown that the large number of zero values observed in droplet-based scRNA-seq can be still be described using a negative binomial distribution, commonly used to model counts data from bulk RNA-seq (Svensson 2020). This suggest that zero-inflated models which include a zero inflation component to the models are not necessary.
Second, as mentioned in Section 1.2.2, the number of single cells being profiled is increasing at a tremendous rate. This requires new and existing computational tools to be highly scalable with respect to the number of cells. Ideally, computational tools should have an efficient memory footprint and scale close-to-linearly or linearly with respect to the number of cells. There are some published computational tools that scale poorly with the number of cells and were tested only on datasets containing hundreds of cells. For example, if an algorithm scales quadratically and required one minute to run on a 100-cell dataset, the same calculation would have taken \(100^2 = 10000\) minutes (~ 1 week) to complete on a modern 10000-cell dataset. Care has to be taken when incorporating such algorithms in analysis pipeline as they may easily become computationally prohibitive in the future. Furthermore, this has resulted in the preference of certain classes of algorithms such as the use of graph-based algorithms in cell clustering.
Overall, the single-cell field is evolving rapidly, with new algorithms emerging and old algorithms becoming obselete easily. Thus, it is important to be aware of the latest developments in the field so as to employ the appropriate tools for single-cell analysis.