Primer: Non-Linear Dimensionality Reduction Techniques for Single-Cell Transcriptomics
Overview
Because per-cell sequencing coverage is low, single-cell RNA-sequencing (scRNA-seq) data are sparse, noisy, and high-dimensional, and dimensionality reduction techniques are commonly applied to mitigate these properties. A well-chosen technique makes the resulting visualizations easier to interpret while preserving as much of the structure of the original dataset as possible, allowing scientists to observe the underlying clustering patterns and derive biological insights from downstream analyses [11].
In this module, we outline the primary non-linear dimensionality reduction techniques applied to scRNA-seq datasets and compare the advantages and limitations of t-SNE and UMAP in biological data analysis.
Single-Cell RNA Sequencing (scRNA-seq)
Single-cell RNA sequencing (scRNA-seq) refers to the process of sequencing the transcriptomes of individual cells to establish a high-resolution overview of cell-to-cell variation [4]. Unlike bulk RNA-seq, which averages expression over many cells, scRNA-seq reveals the primary cell types and their associated functions within a target biological system, allowing researchers to assess cellular heterogeneity across distinct samples in a high-throughput sequencing experiment. Although scRNA-seq provides finer insight into mechanistic behavior at the cellular level, these experiments are more expensive and time-consuming than bulk RNA-seq experiments, owing to the overall cost of sequencing and the reagents required for each scRNA-seq experiment.
scRNA-seq Data
scRNA-seq data are structured as filtered feature-barcode matrices, where features (genes) are represented as rows and cell barcodes are represented as columns. During the processing stage, the matrix is filtered so that only cell-associated barcodes remain; in targeted assays, non-targeted genes are removed as well. Each sample within an scRNA-seq experiment includes the following files: barcodes.tsv.gz, genes.tsv.gz, and matrix.mtx.gz.
barcodes.tsv.gz: cellular barcodes, used to identify cell-specific sequencing reads and evaluate features at the cellular level
genes.tsv.gz: a file containing the gene ID values of all genes that are quantified in an scRNA-seq dataset
matrix.mtx.gz: a matrix of count values containing the measurements of cellular expression levels, with rows as gene IDs and columns as cellular barcodes
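The three files can be combined into a genes-by-barcodes count table. The following is a minimal sketch, assuming matrix.mtx.gz uses the Matrix Market coordinate layout that the .mtx extension conventionally denotes (1-based indices, one "row column value" triple per line). The function name read_mtx_counts and the toy gene and barcode values are hypothetical; real pipelines usually rely on a dedicated reader.

```python
import gzip
import io


def read_mtx_counts(handle):
    """Parse a Matrix Market coordinate file into {(gene_idx, barcode_idx): count}.

    Indices are 1-based in the file and converted to 0-based here.
    """
    shape = None
    counts = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip the banner and comment lines
        fields = line.split()
        if shape is None:
            # first data line: n_rows n_cols n_nonzero
            shape = (int(fields[0]), int(fields[1]))
            continue
        row, col, value = int(fields[0]), int(fields[1]), int(fields[2])
        counts[(row - 1, col - 1)] = value
    return shape, counts


# Toy stand-ins for genes.tsv.gz, barcodes.tsv.gz and matrix.mtx.gz
genes = ["GENE_A", "GENE_B", "GENE_C"]
barcodes = ["AAAC-1", "AAAG-1"]
mtx_text = (
    "%%MatrixMarket matrix coordinate integer general\n"
    "3 2 3\n"
    "1 1 5\n"
    "3 1 2\n"
    "2 2 7\n"
)

# Round-trip through gzip to mimic reading a compressed file
compressed = gzip.compress(mtx_text.encode("utf-8"))
with gzip.open(io.BytesIO(compressed), mode="rt") as handle:
    shape, counts = read_mtx_counts(handle)

for (g, b), value in sorted(counts.items()):
    print(genes[g], barcodes[b], value)
```

Only non-zero entries are stored, which suits the sparsity of scRNA-seq counts: any (gene, barcode) pair absent from the dictionary has a count of zero.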
scRNA-seq Analysis Workflow
scRNA-seq data are filtered and processed in a multi-step analysis pipeline, encompassing the following stages: quality control, normalization, feature selection, dimensionality reduction, and clustering [9]. The thresholds and parameters chosen in each stage substantially impact the results of downstream analyses, because they determine which cells and genes are retained for the resulting visualizations.
Quality Control
The presence of low-quality cells within scRNA-seq data may interfere with the characterization of cellular heterogeneity or observed levels of gene expression, which contributes to misleading results in downstream analyses. Thus, low-quality cells resulting from cell damage or technical issues are either marked or filtered out from the dataset during the quality control (QC) stage of an scRNA-seq pipeline.
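A threshold-based filter of this kind can be sketched as follows. This is an illustration under stated assumptions, not a recommended protocol: the function name qc_filter, the three metrics (total counts, detected genes, mitochondrial fraction), the threshold values, and the "MT-" prefix for mitochondrial genes (a convention for human gene symbols) are all choices made for this example.

```python
def qc_filter(cells, min_counts=500, min_genes=200, max_mito_fraction=0.2):
    """Return the barcodes that pass three simple per-cell QC thresholds.

    `cells` maps barcode -> {gene: count}. The default thresholds are
    illustrative placeholders only.
    """
    kept = []
    for barcode, profile in cells.items():
        total = sum(profile.values())
        n_genes = sum(1 for count in profile.values() if count > 0)
        mito = sum(c for gene, c in profile.items() if gene.startswith("MT-"))
        # a cell with no counts is treated as fully failing the mito check
        mito_fraction = mito / total if total else 1.0
        passes = (
            total >= min_counts
            and n_genes >= min_genes
            and mito_fraction <= max_mito_fraction
        )
        if passes:
            kept.append(barcode)
    return kept


# Tiny hypothetical profiles; thresholds are scaled down to match
cells = {
    "cell_ok": {"ACTB": 6, "GAPDH": 3, "MT-CO1": 1},
    "cell_damaged": {"ACTB": 1, "MT-CO1": 9},  # high mitochondrial fraction
    "cell_empty": {"ACTB": 1},                 # too few counts and genes
}
passed = qc_filter(cells, min_counts=5, min_genes=2, max_mito_fraction=0.2)
print(passed)
```

Returning the list of passing barcodes, rather than deleting entries, leaves the choice between marking and removing low-quality cells to the caller.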
Normalization
Inconsistent library preparation from a minimal set of reagents contributes to technical variation in scRNA-seq experiments. To reduce the impact of this variation and facilitate comparison between cell profiles, scientists apply normalization techniques so that the observed cellular heterogeneity reflects biological phenomena rather than technical biases.
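One simple normalization scheme is library-size scaling followed by a log transform. The sketch below assumes that scheme; the function name normalize_log1p and the target_sum of 10,000 are illustrative choices, and other normalization methods exist.

```python
import math


def normalize_log1p(profile, target_sum=10_000):
    """Scale one cell's counts to a common total, then apply log(1 + x).

    `profile` maps gene -> raw count for a single cell.
    """
    total = sum(profile.values())
    if total == 0:
        return {gene: 0.0 for gene in profile}
    scale = target_sum / total
    return {gene: math.log1p(count * scale) for gene, count in profile.items()}


# Two hypothetical cells with the same composition but a 10x depth difference
shallow = {"ACTB": 10, "GAPDH": 30}
deep = {"ACTB": 100, "GAPDH": 300}

norm_shallow = normalize_log1p(shallow)
norm_deep = normalize_log1p(deep)
print(norm_shallow["ACTB"], norm_deep["ACTB"])
```

After scaling, the two cells have identical normalized profiles, so the difference in sequencing depth no longer masquerades as a difference in expression.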
Feature Selection
The objective of feature selection is to retain the informative genes in an scRNA-seq dataset and discard noisy or uninformative ones, which limits the influence of technical noise on the primary structure of the data. Feature selection thereby preserves the underlying structure of the data while reducing the size of the dataset, which improves the computational efficiency of downstream analyses.
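As a rough illustration, genes can be ranked by how much their normalized expression varies across cells, keeping only the top-ranked ones. This sketch ranks by plain variance, which is a deliberate simplification; the function name top_variable_genes and the gene labels are hypothetical.

```python
from statistics import pvariance


def top_variable_genes(expression, n_top):
    """Return the n_top genes with the largest variance across cells.

    `expression` maps gene -> list of normalized values, one per cell.
    """
    variances = {gene: pvariance(values) for gene, values in expression.items()}
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:n_top]


expression = {
    "HOUSEKEEPING": [2.0, 2.1, 1.9, 2.0],  # expressed everywhere, nearly constant
    "MARKER_A": [0.0, 4.0, 0.1, 3.9],      # varies strongly between cells
    "MARKER_B": [3.0, 0.2, 2.8, 0.0],
    "SILENT": [0.0, 0.0, 0.0, 0.0],        # never detected
}
selected = top_variable_genes(expression, n_top=2)
print(selected)
```

Genes that are constant across cells carry no information for separating cell types, so dropping them shrinks the dataset without removing the structure of interest.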
Dimensionality Reduction
Because many genes are correlated in expression owing to shared biological processes, dimensionality reduction techniques are often applied to scRNA-seq data to reduce the number of separate dimensions in the original dataset. Dimensionality reduction yields a more compact representation of the dominant characteristics of the dataset and allows scientists to generate interpretable visualizations for assessing clustering patterns.
Clustering
Clustering is an unsupervised learning method that groups cells with similar gene expression profiles, enabling researchers to characterize the cellular heterogeneity within scRNA-seq datasets. In scRNA-seq, the clusters correspond to various cell types or cell states. The resulting clusters depend on the selected algorithm and parameters, which in turn affects subsequent data visualizations.
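To make the idea concrete, the sketch below implements K-Means (Lloyd's algorithm) on toy two-dimensional points that stand in for cells in a reduced space. K-Means is only one possible choice, and graph-based methods are also widely used for scRNA-seq. The function name kmeans, the fixed initial centroids, and the data are assumptions made for this example.

```python
import math


def kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm: alternate between assignment and centroid update."""
    labels = []
    for _ in range(n_iter):
        # assignment step: each point joins its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # update step: each centroid moves to the mean of its members
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                mean = tuple(sum(dim) / len(members) for dim in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids


# Two well-separated groups of hypothetical cells
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
initial = [points[0], points[3]]  # fixed start so the result is reproducible
labels, centers = kmeans(points, initial)
print(labels)
```

Changing the number of centroids or their starting positions can change the assignment, which is one way the choice of algorithm and parameters feeds through to the final visualization.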
Non-Linear Dimensionality Reduction Techniques
Dimensionality reduction refers to the process of transforming data from a high-dimensional space into a lower-dimensional space while preserving the majority of the intrinsic structure and composition of the original dataset. In doing so, dimensionality reduction techniques must balance the local and global structure of a given dataset [5]. Preserving local structure keeps similar cells together, so that specific cell types appear as distinct clusters, and facilitates interpretation of the variation in gene expression contributing to cellular heterogeneity within scRNA-seq datasets. In contrast, preserving global structure maintains the relative arrangement of clusters and the pairwise distances between distant samples.
Dimensionality reduction techniques may be categorized as unsupervised or supervised, linear or non-linear, and parametric or non-parametric. Non-linear dimensionality reduction techniques are commonly applied to scRNA-seq data and aim to identify low-dimensional manifolds on or near which the data are concentrated. These approaches enable the effective mapping of scRNA-seq data from its high-dimensional representation into a lower-dimensional embedding [3]. In the following sections, we provide an overview of two major non-linear dimensionality reduction techniques used in single-cell transcriptomics: t-SNE and UMAP.
t-Distributed Stochastic Neighbor Embedding (t-SNE)
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction technique often applied to high-dimensional data. This approach converts pairwise distances in the high-dimensional space into Gaussian-based similarities, uses a Student's t-distribution to calculate the similarity between two points in the low-dimensional embedding, and positions the points so that the two sets of similarities agree as closely as possible [10]. Although t-SNE accurately represents the local structure of high-dimensional data, it tends not to preserve global structure, which can lead to ambiguous or misleading data visualizations [5]. Furthermore, the runtime complexity of exact t-SNE is O(N^2), as similarities between all pairs of points are computed.
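For reference, the similarities and cost function described in [10] can be written as follows. The notation is introduced here for illustration: x_i denotes a point in the high-dimensional space, y_i its position in the embedding, sigma_i a per-point Gaussian bandwidth, and N the number of points.

```latex
% High-dimensional similarity of x_j to x_i (Gaussian kernel, bandwidth \sigma_i)
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

% Low-dimensional similarity (Student's t-distribution with one degree of freedom)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Cost minimized by gradient descent over the embedding coordinates y_i
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

Both the normalization of q_ij and the cost C sum over all pairs of points, which is the source of the quadratic cost. The heavy tails of the t-distribution let dissimilar points sit far apart in the embedding, which helps separate clusters but weakens the meaning of large distances.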
t-SNE Parameters
perplexity: controls the balance between the local and global structure of the data and can be interpreted as the effective number of nearest neighbors considered for each sample; typical values range from 5 to 50, and larger values better preserve the global structure of the dataset after dimensionality reduction [10]
number of steps: the number of optimization iterations; the optimization should run long enough for the structure of the embedding to stabilize
epsilon: the learning rate of the optimization, typically tuned with the size of the dataset in mind
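The role of perplexity can be illustrated with the bisection search described in [10], which picks a Gaussian bandwidth for each point so that its neighbor distribution reaches the requested perplexity. The sketch below handles a single point; the function names, the tolerance, and the toy distances are assumptions for illustration, and beta stands for 1 / (2 * sigma^2).

```python
import math


def conditional_probabilities(sq_distances, beta):
    """Neighbor probabilities for one point given squared distances to the others."""
    # subtracting the minimum improves numerical stability and cancels in the ratio
    d_min = min(sq_distances)
    weights = [math.exp(-(d - d_min) * beta) for d in sq_distances]
    total = sum(weights)
    return [w / total for w in weights]


def perplexity(probabilities):
    """Perplexity = 2 ** (Shannon entropy in bits)."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy


def find_beta(sq_distances, target_perplexity, tol=1e-5, max_iter=200):
    """Bisection on beta so the distribution reaches the target perplexity."""
    lo, hi = 0.0, None
    beta = 1.0
    for _ in range(max_iter):
        perp = perplexity(conditional_probabilities(sq_distances, beta))
        if abs(perp - target_perplexity) < tol:
            break
        if perp > target_perplexity:
            # distribution too flat: narrow the kernel by raising beta
            lo = beta
            beta = beta * 2 if hi is None else (beta + hi) / 2
        else:
            # distribution too peaked: widen the kernel by lowering beta
            hi = beta
            beta = (lo + beta) / 2
    return beta


# Squared distances from one hypothetical cell to five others
sq_distances = [1.0, 2.0, 4.0, 9.0, 16.0]
beta = find_beta(sq_distances, target_perplexity=3.0)
achieved = perplexity(conditional_probabilities(sq_distances, beta))
print(beta, achieved)
```

A uniform distribution over k neighbors has a perplexity of exactly k, which is why the parameter reads as an effective neighbor count. The target must lie between 1 and the number of other points for the search to succeed.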
Advantages & Limitations of t-SNE
t-SNE effectively reduces high-dimensional data into a 2-3 dimensional space, enabling researchers to visualize which samples group together in the resulting embedding [10]. However, t-SNE is highly sensitive to minor variations in the hyperparameters, as the selected perplexity or random seed may change the structure presented in the resulting embedding. As a result, distances between clusters and apparent cluster sizes in a t-SNE plot should be interpreted with caution, since over-reading them increases the likelihood of misleading conclusions.
Furthermore, t-SNE is less effective on noisy samples and scales poorly to larger datasets, leading to the loss of global information. From an algorithmic standpoint, t-SNE is computationally expensive when compared to UMAP and other dimensionality reduction techniques. As a result, principal component analysis (PCA) is often used as an initial pre-processing step to reduce noise and dimensionality before applying t-SNE for further dimensionality reduction.
Uniform Manifold Approximation and Projection (UMAP)
Uniform Manifold Approximation and Projection (UMAP) is a non-linear dimensionality reduction technique commonly applied to high-dimensional data. This technique constructs a weighted nearest-neighbor graph of the data and optimizes a low-dimensional layout whose graph is as structurally similar as possible to the high-dimensional one [6]. The UMAP algorithm assumes that the data are uniformly distributed on a locally connected Riemannian manifold and that the Riemannian metric is locally constant. Compared to t-SNE, UMAP preserves more of the global structure of the data while maintaining a reasonable runtime complexity.
UMAP Parameters
number of neighbors: the size of the local neighborhood used to construct the graph; increasing the number of neighbors preserves more of the global structure after dimensionality reduction, at the expense of fine local detail [1]
minimum distance: the minimum distance allowed between samples in the embedding, which controls how tightly points are packed; lower values produce denser, more compact clusters, while higher values spread the samples out more evenly
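Both parameters enter the algorithm through simple formulas given in [6], sketched below for a single point. This is an illustration rather than a full implementation: the function names, the tolerance, and the toy distances are assumptions, and the optimization of the embedding itself is omitted.

```python
import math


def membership_strengths(neighbor_distances, tol=1e-6, max_iter=100):
    """Fuzzy edge weights from one point to its k nearest neighbors.

    rho is the distance to the closest neighbor; sigma is found by bisection
    so that the weights sum to log2(k), where k is the number of neighbors.
    """
    k = len(neighbor_distances)
    target = math.log2(k)
    rho = min(neighbor_distances)

    def weights(sigma):
        return [math.exp(-max(0.0, d - rho) / sigma) for d in neighbor_distances]

    lo, hi = 1e-9, 1.0
    while sum(weights(hi)) < target:
        hi *= 2  # widen the bracket until it contains the target
    sigma = hi
    for _ in range(max_iter):
        sigma = (lo + hi) / 2
        total = sum(weights(sigma))
        if abs(total - target) < tol:
            break
        if total > target:
            hi = sigma
        else:
            lo = sigma
    return weights(sigma), rho, sigma


def fuzzy_union(w_ij, w_ji):
    """Symmetrize two directed edge weights with a probabilistic union."""
    return w_ij + w_ji - w_ij * w_ji


def target_similarity(distance, min_dist):
    """Desired low-dimensional similarity as a function of embedding distance."""
    if distance <= min_dist:
        return 1.0  # points closer than min_dist count as fully similar
    return math.exp(-(distance - min_dist))


# Distances from one hypothetical cell to its 4 nearest neighbors
w, rho, sigma = membership_strengths([0.5, 1.0, 1.5, 3.0])
print(w, rho, sigma)
```

The nearest neighbor always receives weight 1, which reflects the local-connectivity assumption. A larger neighbor count brings more distant points into the graph, while a smaller min_dist narrows the plateau of target_similarity and lets the optimizer pack similar points more tightly.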
Advantages & Limitations of UMAP
Compared to t-SNE, UMAP is computationally efficient and preserves more of the global structure within the 2D projection of the data. Although UMAP does not require pre-processing with PCA and scales well to larger datasets [7], this approach may not always preserve the structure of data with a highly complex topology.
Conclusion & Future Directions
To assess the influence of different distance metrics and hyperparameters on downstream analyses of scRNA-seq datasets, scientists may utilize clustering algorithms such as K-Means clustering or Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [8]. Furthermore, researchers may consider evaluating the impact of non-linear dimensionality reduction techniques on a broader range of biological data types across various species, increasing interpretability and optimizing the accuracy of data analysis. Overall, dimensionality reduction techniques facilitate interpretable data visualization of feature representations, enhance downstream analyses, mitigate overfitting, and effectively reduce the runtime complexity of machine learning algorithms.
References
[1] Armstrong, G., Martino, C., Rahman, G., Gonzalez, A., Vázquez-Baeza, Y., Mishne, G., & Knight, R. (2021). Uniform Manifold Approximation and Projection (UMAP) Reveals Composite Patterns and Resolves Visualization Artifacts in Microbiome Data. MSystems, 6(5), e00691-21. https://doi.org/10.1128/mSystems.00691-21
[2] Bacon, W. Filter, Plot and Explore Single-cell RNA-seq Data (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/single-cell/tutorials/scrna-case_basic-pipeline/tutorial.html
[3] DeMers, D., & Cottrell, G. (1992). Non-Linear Dimensionality Reduction. Advances in Neural Information Processing Systems, 5. https://proceedings.neurips.cc/paper/1992/hash/cdc0d6e63aa8e41c89689f54970bb35f-Abstract.html
[4] Jovic, D., Liang, X., Zeng, H., Lin, L., Xu, F., & Luo, Y. (2022). Single-cell RNA sequencing technologies and applications: A brief overview. Clinical and Translational Medicine, 12(3), e694. https://doi.org/10.1002/ctm2.694
[5] Kobak, D., & Berens, P. (2019). The art of using t-SNE for single-cell transcriptomics. Nature Communications, 10(1), Article 1. https://doi.org/10.1038/s41467-019-13056-x
[6] McInnes, L., Healy, J., & Melville, J. (2020). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction (arXiv:1802.03426). arXiv. https://doi.org/10.48550/arXiv.1802.03426
[7] Nanga, S., Bawah, A. T., Acquaye, B. A., Billa, M.-I., Baeta, F. D., Odai, N. A., Obeng, S. K., & Nsiah, A. D. (2021). Review of Dimension Reduction Methods. Journal of Data Analysis and Information Processing, 9(3), Article 3. https://doi.org/10.4236/jdaip.2021.93013
[8] Ozgode Yigin, B., & Saygili, G. (2023). Effect of distance measures on confidences of t-SNE embeddings and its implications on clustering for scRNA-seq data. Scientific Reports, 13, 6567. https://doi.org/10.1038/s41598-023-32966-x
[9] Slovin, S., Carissimo, A., Panariello, F., Grimaldi, A., Bouché, V., Gambardella, G., & Cacchiarelli, D. (2021). Single-Cell RNA Sequencing Analysis: A Step-by-Step Overview. Methods in Molecular Biology (Clifton, N.J.), 2284, 343–365. https://doi.org/10.1007/978-1-0716-1307-8_19
[10] van der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605. https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
[11] Xiang, R., Wang, W., Yang, L., Wang, S., Xu, C., & Chen, X. (2021). A Comparison for Dimensionality Reduction Methods of Single-Cell RNA-seq Data. Frontiers in Genetics, 12. https://www.frontiersin.org/articles/10.3389/fgene.2021.646936
[12] Zeisel, A., et al. (2015). Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science, 347, 1138-1142. https://www.science.org/doi/10.1126/science.aaa1934