Decoding Single-Cell Clustering: Why More UMAP Clusters Don’t Mean More Cell Types

Date Published: March 18, 2025

Single-cell RNA sequencing (scRNA-seq) has revolutionized how we study cellular heterogeneity, enabling the discovery of rare cell populations and dynamic cell states. However, clustering results are often over-interpreted, with every cluster on a UMAP plot read as a distinct cell type. This post explains why more UMAP clusters do not necessarily indicate more biologically distinct cell types and how to avoid common pitfalls in scRNA-seq analysis.

Understanding Clustering in scRNA-seq

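Clustering groups cells by similarity in gene expression, often using Seurat (Satija et al., 2015) or Scanpy (Wolf et al., 2018). However, the number of clusters is not fixed by biology alone. It also depends on analysis choices such as the clustering resolution and the batch integration method, both discussed below.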

Thus, a cluster is a group of transcriptionally similar cells, which often reflects a cell state rather than a definitive cell type.

Cell Types vs. Cell States: What’s the Difference?

Cell Types: Stable populations with defined molecular signatures (e.g., neurons, T cells).

Cell States: Temporary functional phases (e.g., activated T cells, differentiating stem cells).

Many UMAP clusters represent states rather than unique cell types.

Why Clustering Can Overestimate Cell Types

1. UMAP Distorts High-Dimensional Space

UMAP prioritizes local neighborhood structure over global distances, so it can exaggerate the separation between groups and make continuous gradients appear as discrete clusters.

2. Over-Resolution in Clustering Algorithms

Higher resolution settings can split biologically similar cells into separate clusters.
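To make this concrete, here is a minimal toy sketch, not a Seurat or Scanpy workflow. It swaps in plain 1D k-means as a stand-in for graph-based clustering, and the "cells" are simulated values drawn from a single continuous population with no real subgroups. The function name kmeans_1d and the choice of k values are assumptions for illustration only.

```python
import random

def kmeans_1d(values, k, n_iter=50):
    """Plain 1D k-means with deterministic, evenly spaced starting centers."""
    lo, hi = min(values), max(values)
    # Spread the starting centers evenly over the data range (keeps the toy deterministic).
    centers = [lo + (i + 0.5) * (hi - lo) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(n_iter):
        # Assign each value to its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Move each center to the mean of the values assigned to it.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels

rng = random.Random(0)
# One continuous population: a single simulated "expression score" per cell.
cells = [rng.uniform(0.0, 1.0) for _ in range(300)]

clusters_found = {}
for k in (2, 4, 8):
    labels = kmeans_1d(cells, k)
    clusters_found[k] = len(set(labels))
    print(f"requested k={k}: {clusters_found[k]} clusters returned")
```

The algorithm returns as many clusters as the parameter asks for, even though the simulated data contain no subgroups at all. The resolution parameter in graph-based clustering behaves in a similar spirit: raising it produces more clusters whether or not the biology supports them.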

3. Pseudotemporal Continuums

Cells progressing through a continuous process such as differentiation can be split into several clusters and mistaken for distinct cell types, when they actually form a continuum.

4. Cell Cycle Effects

If cell cycle differences are left uncorrected, cells can cluster by cycle phase rather than by identity, creating artificial clusters.
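A quick diagnostic is to cross-tabulate cluster labels against cell cycle phase. The sketch below assumes each cell already has a phase label (for example from Seurat's CellCycleScoring or Scanpy's score_genes_cell_cycle). The labels here are made-up toy data, and the 0.8 cutoff and both function names are hypothetical choices for illustration.

```python
from collections import Counter, defaultdict

def phase_fractions(cluster_labels, phase_labels):
    """Return {cluster: {phase: fraction of that cluster's cells}}."""
    counts = defaultdict(Counter)
    for cluster, phase in zip(cluster_labels, phase_labels):
        counts[cluster][phase] += 1
    return {
        cluster: {phase: n / sum(c.values()) for phase, n in c.items()}
        for cluster, c in counts.items()
    }

def cycle_driven_clusters(fractions, threshold=0.8):
    """Flag clusters whose S + G2M fraction reaches `threshold` (hypothetical cutoff)."""
    flagged = []
    for cluster, frac in fractions.items():
        cycling = frac.get("S", 0.0) + frac.get("G2M", 0.0)
        if cycling >= threshold:
            flagged.append(cluster)
    return sorted(flagged)

# Toy labels for 30 cells in three clusters (not real data).
clusters = ["0"] * 10 + ["1"] * 10 + ["2"] * 10
phases = (
    ["G1"] * 8 + ["S"] * 1 + ["G2M"] * 1    # cluster 0: mostly G1
    + ["G1"] * 7 + ["S"] * 2 + ["G2M"] * 1  # cluster 1: mostly G1
    + ["G1"] * 1 + ["S"] * 4 + ["G2M"] * 5  # cluster 2: mostly cycling
)

fractions = phase_fractions(clusters, phases)
flagged = cycle_driven_clusters(fractions)
print("clusters dominated by cycling cells:", flagged)
```

A flagged cluster is not necessarily an artifact, since proliferating populations are real biology. It is, however, a cluster worth re-examining after regressing out or otherwise accounting for cell cycle scores.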

5. The Effect of Integration Methods on UMAP Clusters

In my recent single-cell analysis work, the appearance of UMAP clusters was affected not only by biological heterogeneity but also by the integration method used to correct for batch effects. When the same dataset was integrated with CCA, RPCA, or Harmony, or simply merged without correction, the resulting UMAP plots showed dramatically different cluster structures. This highlights the need to treat the integration method as an analysis choice that shapes the clusters, and to check whether a cluster persists across methods before interpreting it.
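One way to quantify how much two clusterings of the same cells disagree is the adjusted Rand index (ARI), which is 1 for identical partitions and close to 0 for agreement at chance level. The sketch below is a small standard-library implementation. The three label vectors are hypothetical toy assignments for eight cells, not results from the dataset described above.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same cells")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two cells")
    # Pairs of cells that share a cluster in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of cells that share a cluster within each labeling separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2        # agreement if partitions were identical
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every cell in a single cluster
    return (sum_cells - expected) / (max_index - expected)

# Hypothetical cluster labels for the same 8 cells after three integration runs.
run_a = [0, 0, 0, 1, 1, 1, 2, 2]
run_b = [1, 1, 1, 0, 0, 0, 2, 2]  # same partition, cluster IDs renamed
run_c = [0, 0, 1, 1, 1, 1, 2, 2]  # one cell moved to a neighboring cluster

ari_ab = adjusted_rand_index(run_a, run_b)
ari_ac = adjusted_rand_index(run_a, run_c)
print(f"run_a vs run_b: {ari_ab:.3f}")
print(f"run_a vs run_c: {ari_ac:.3f}")
```

Because ARI ignores cluster names and compares only which cells are grouped together, it can be applied directly to labels produced by different integration pipelines. A low value between two methods suggests the cluster structure depends on the method as much as on the biology.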

Figure: UMAP representations of scRNA-seq data integrated with various methods (CCA, RPCA, Harmony, etc.). Source: arXiv:2206.01816

How to Avoid Over-Interpretation

Conclusion

Clustering and UMAP visualizations are valuable but should not be mistaken for true cell type classification. Careful validation using marker genes and multimodal data is essential for meaningful single-cell interpretations.

References