Xinqidi,Your biological research partner
Xinqidi Biotech Co.,Ltd,Wuhan,China 2008-2021
R&D 13th year

Single-cell microRNA sequencing method comparison and application to cell lines and circulating lung

Issuing time:2021-07-21 13:53

Abstract

Molecular single cell analyses provide insights into physiological and pathological processes. Here, in a stepwise approach, we first evaluate 19 protocols for single cell small RNA sequencing on MCF7 cells spiked with 1 pg of 1,006 miRNAs. Second, we analyze MCF7 single cell equivalents of the eight best protocols. Third, we sequence single cells from eight different cell lines and 67 circulating tumor cells (CTCs) from seven SCLC patients. Altogether, we analyze 244 different samples. We observe high reproducibility within protocols and reads covered a broad spectrum of RNAs. For the 67 CTCs, we detect a median of 68 miRNAs, with 10 miRNAs being expressed in 90% of tested cells. Enrichment analysis suggested the lung as the most likely organ of origin and enrichment of cancer-related categories. Even the identification of non-annotated candidate miRNAs was feasible, underlining the potential of single cell small RNA sequencing.

Introduction

Tens of thousands of small non-coding RNAs (sncRNAs) such as piRNAs, miRNAs, small nucleolar RNAs (snoRNAs) are transcribed from the genome. Although they are not translated to proteins, these molecules are integral contributors to physiological and pathological processes. Among the best-studied sncRNAs are microRNAs (miRs, miRNAs). MiRNAs are important posttranscriptional regulators that are evolutionary very well conserved: The 20–25-nt long molecules bind to complementary regions on target mRNAs, which leads to mRNA degradation or translational repression1. Over 60% of human genes contain miRNA binding sites2. The most recent release of the miRBase (v22) lists 2654 annotated human miRNAs3. In addition to the miRBase, several other miRNA databases list, however, more specific or sensitive miRNA sets4, and the total number of human miRNAs is estimated to be in the range of 2300 miRNA5. In pathologic conditions such as cancer, the transcription of many miRNAs is altered, which in turn changes the abundance of target mRNAs. Therefore, miRNAs have great potential as diagnostic or prognostic biomarkers. Even more, they represent promising novel drugs or therapeutic targets6. However, these promising clinical applications of miRNAs require methods for the accurate and reproducible quantification of global miRNA expression.

For bulk sncRNAs sequencing, a wide range of protocols and commercial kits exists. The principle of these methods is based on either sequential adapter ligation or polyadenylation of the miRNAs. Several studies comprehensively compared the performance of these methods7,8,9,10,11,12,13. However, already at high input concentrations biases were evident: The adapter ligation efficiency varies 1000-fold depending on miRNA sequence and secondary structure, which leads to low quantification accuracy14. Improved ligation conditions and adapters with random nucleotides can reduce this bias8,9,15. In addition, polyadenylation-based protocols are influenced by miRNA sequence and other RNA species could also be polyadenylated8. Moreover, the miRNA libraries contain adapter dimers that are difficult to separate from informative reads due to the small size of miRNAs. The adapter dimer problem increases with low input samples, because a high excess of adapters is required for efficient ligation16. Chemically modified adapters or removal of excess adapters reduced the amount of adapter dimers17.

It is reasonable to hypothesize that respective experimental biases substantially increase if the input amount is decreased to single cell level. As single cell mRNA studies are widely applied to uncover insights into processes such as cellular differentiation or adaptation, single cell miRNA studies lag behind. Of note, not even the total amount and number of different miRNAs expressed in a single cell is exactly known. However, similar to scRNA-Seq, single cell miRNA-Seq would add to our understanding of molecular regulatory processes. Moreover, in order to study rare cell populations such as circulating tumor cells (CTCs), a single cell miRNA sequencing protocol is mandatory. The Sandberg group published two single cell miRNA-Seq protocol versions18,19, but other miRNA-Seq protocols for low input samples are also available9,20,21. However, no comprehensive comparison of the approaches on single cell level is available yet.

Therefore, in this study, 19 miRNA-Seq protocol variants using defined samples with very low input were evaluated regarding their accuracy, reproducibility, and major sources of bias. The best performing protocols were then used at the single cell level and the same quality parameters were determined. Finally, we show the applicability of a selected protocol to a broad range of different cell lines and even clinical samples by analyzing the miRNA profiles of single CTCs of seven small cell lung cancer (SCLC) patients.

Results

Experimental design

Towards a stable single cell sequencing of miRNAs, we implemented a three-stage approach (Fig. 1a). In the first stage, we comprehensively tested four ligation-based protocols later referred to as SB18, SBN19, CL20, and 4N9. In addition, one polyadenylation-based miRNA-Seq protocol (CATS21) was included. Remarkably, we tested 19 variations of the four protocols with adapted experimental parameters: The adapters were exchanged between protocols, the adapter ligation time was increased (16 C8), a 5′ adapter with a complementary sequence to the 3′ adapter was designed (C315), the unique molecular identifier (UMI) was shortened to 6 nt (UMI6), and an oligonucleotide to block reverse transcription of the adapter dimer was tested (Block). Details on the 19 evaluated protocol variants are provided in Supplementary Fig. 1 and Supplementary Table 1. In this first stage, we sequenced triplicates of single cells of the breast cancer cell line MCF7 spiked with 1 pg of miRXplore Universal Reference. This standard consists of 1006 miRNAs and artificial sequences from different species in equimolar concentration (Supplementary Data 1). We selected the eight best-performing protocols regarding their amount of adapter dimers, reads mapping to miRNA, number of detected miRXplore miRNAs, reproducibility, and quantification accuracy. We confirmed our findings by sequencing three additional replicates.

Fig. 1: Experimental setup and protocol comparison with miRXplore spike-in (stage 1).
figure1

a Overview of the experimental setup consisting of three stages. SCLC CTCs = small cell lung cancer circulating tumor cells. b Read distribution for all tested protocols, sorted by miRNA reads proportion. The data are presented as mean values ±  the standard deviation (n = 6 biologically independent samples for the top eight protocols, n = 3 for the others), which is shown as a smaller error bar in a darker color than its corresponding read group and only represented in one direction. c Detected miRXplore sequences for all tested protocols, sorted by decreasing average per protocol shown as boxplot (bottom) and dot plot (top). Each sample is shown as one dot and colored by protocol. The boxes span the first to the third quartile with the vertical line inside the box representing the median value. The whiskers show the minimum and maximum values or values up to 1.5 times the interquartile range below or above the first or third quartile if outliers are present. d UMAP embedding of all sequenced samples with miRXplore spike-in. The samples of the best eight protocols are highlighted in their respective color. The remaining protocols are grayed out. e Euclidean distance on the log2 transformed sequence expression showing the reproducibility between all replicates of the same protocol (green) and between all samples of one protocol variant compared to all other protocol variants (brown). Each dot represents the distance observed between two samples. Only nonredundant distances are shown (i.e., the distance of sample 1 to sample 2 is considered identical to the distance of sample 2 to sample 1). For each protocol, a dot plot (top), as well as a boxplot (bottom), is shown. The boxplot was defined in the same manner as for panel c. f Distribution of the top 100 highest expressed miRXplore sequences per sample, normalized as reads per million mapped (RPMM). The samples are grouped by protocol and ordered by ascending coefficient of variation. The vertical lines inside the areas delimit the quartiles. Every dot inside the area represents the expression level of one sequence. g Coefficient of variation for all samples grouped by the protocol in ascending order shown as dot plot (top) as well as boxplot (bottom). Each sample is represented by a dot. The boxplot was defined in the same manner as for panel c. Source data are provided in the Source Data file.

In the second stage, we evaluated these eight protocols on the single cell level. This time, we sequenced six replicates of MCF7 single cell equivalents to reduce the cell-to-cell variability. From these experiments, we selected the best protocol that showed a low amount of adapter dimers, a high amount of reads mapping to miRNAs, a high number of detected miRNAs, and high reproducibility between replicates.

In the third stage, we applied the best protocol to sequence six single cells each of eight different cell lines (fibroblasts, T-cells, monocytes, macrophages, lymphoblasts, colorectal adenocarcinoma, lung cancer, and hepatocellular carcinoma) and 67 single EpCAM positive CTCs from the blood of seven SCLC patients.

Performance of miRNA-Seq strongly depends on the applied protocol

Initially, we tested 19 miRNA-Seq protocol variants in triplicate using single MCF7 cells spiked with miRXplore. As a first quality control parameter, we measured the DNA concentration of the final libraries. The concentrations were variable, ranging from 0.39 ng µl−1 ± 0.03 ng µl−1 (protocol CL_UMI6) to 42.2 ng µl−1 ± 0.65 ng µl−1 (protocol SBN_4N; Supplementary Table 2). The fragment length distributions also showed differences. Products of around 125 bp should represent adapter dimers, products of around 145 bp represent libraries with miRNA inserts, and products larger than 155 bp are likely to represent inserts of other longer RNA types, e.g., lncRNA, mRNA, or snoRNA (Supplementary Fig. 2). Even though we quantified the libraries by qPCR prior to sequencing and pooled them in equimolar amounts, the number of sequenced reads varied from 200,000–2,650,000 (Supplementary Fig. 3). We computationally removed reads shorter than 18 nt and low-quality reads. Again, the results showed strong variations: Out of 19 tested protocols, six yielded over 90% of reads to be excluded, which disqualified the respective protocols from further analyses (Fig. 1b). In contrast, the best performing protocols had mapping rates of 60% to the human genome. Notably, about 10% of the reads mapped at multiple loci (Supplementary Data 2). Finally, between 10–50% of total reads matched annotated miRNA loci. The other mapped reads mainly match to protein-coding genes, intergenic regions, long non-coding RNAs (lncRNAs), or snoRNAs; other RNA types can be neglected. The protocols 4N, 4N_C3, 4N_CL, SB, SB_4N, CL, SBN, and SBN_CL detected almost all 1006 miRXplore spike-in sequences (Fig. 1c). Interestingly, nine of ten sequences detected in less than 10% of all samples across all protocols are artificial calibration nucleotides, and only hsa-miR-193a-3p was detected in a similarly few samples (6.3%; Supplementary Data 3). The number of additionally detected human miRNAs, that are not part of the spike-in, is low (1–22 miRNAs), indicating that single MCF7 cells contain a much lower miRNA concentration than the 1 pg spike-in. If the reads are mapped to the miRXplore spike-in, good performing protocols had over 90% of their reads mapped to the standard (Supplementary Fig. 4).

We selected the eight protocols with the highest number of reads mapping to miRNA loci and highest number of detected spike-in sequences (on average at least over 900), processed three additional replicates, and analyzed these protocols in further detail. First, we performed a dimension reduction analysis via UMAP which showed that samples cluster according to the used protocol. We observed that samples processed with the 5′ and 3′ 4N adapters showed a clear split in comparison to the other protocols (Fig. 1d). In the next step, we evaluated the reproducibility of the measurements of each sample and found that the SB protocol showed the highest reproducibility (lowest Euclidean distance between the replicates of the same protocol), followed by the SB_4N and SBN_CL protocols, while the 4N protocols (4N, 4N_CL and 4N_C3) showed the lowest reproducibility (Fig. 1e). A comparison of the replicates of one protocol to all other protocols highlighted that the samples of protocol 4N had the highest Euclidean distance, i.e., were the most different from all other protocols. It is important to interpret the results in the light of spiked-in miRNAs, which should have the same concentration within and across protocol variants. An analysis of the nucleotide content of the miRXplore sequences, as well as the minimum free energy of their secondary structures, showed that only the G-content seemed to influence the detection rate in all protocols (Spearman correlation of 0.45, P = 2.8 × 10−52), with an increasing G-content leading to an increasing detection probability (Supplementary Fig. 5 and Supplementary Data 3). For all protocols, we observed large variations in the miRXplore miRNA expression levels. Already the top 100 most expressed spike-ins show differences of several orders of magnitude (Fig. 1f). Protocol 4N_CL showed the largest variance from the expected expression values with a coefficient of variation of 3.407 and protocol SBN showed the lowest variance, and therefore the best accuracy with a coefficient of variation of 1.506 (Fig. 1g). Since six of the eight best protocols also provide UMI sequences, we deduplicated the read mappings and evaluated their variation on the remaining reads (Supplementary Fig. 6). The protocol 4N_CL remained the one with the largest variance, with a coefficient of variation of 2.170, while the protocol 4N_C3 showed the least variation (coefficient of variation of 1.046). However, some human spike-in miRNAs could be expressed by the MCF7 cells as well and might increase the observed variance as a background signal. To avoid biases due to different sequencing depths, we performed our analyses on subsampled samples with 300,000 reads as well, which confirmed the observed patterns (Supplementary Fig. 7).

Overall, we applied 19 different protocols for miRNA analysis of very low input samples. Our comprehensive evaluation based on the miRXplore Universal Reference indicated that eight protocols performed best. Only ligation-based approaches were among the list of top-performing protocols such that the polyadenylation-based approach (CATS) was not contained in the second stage.

The SBN_CL protocol shows high reproducibility and detects most miRNAs

In the second stage, the eight best protocols were analyzed using single cell equivalents of the MCF7 cell line. Similar to the first stage, the DNA concentrations of the libraries showed high variability and ranged from 1.37 ng µl−1 ± 0.13 ng µl−1 (CL) to 36.53 ng µl−1 ± 23.61 ng µl−1(4N_C3; Supplementary Table 3). The fragment length distribution was also comparable to the spike-in experiment with an increased number of small fragments (Supplementary Fig. 8). Likewise, the number of sequenced reads was comparable with the first stage and varied between 195,000–1,400,000 reads (Supplementary Fig. 9). The total number of reads remained comparable from the first to the second stage (Fig. 2a). However, the fraction of reads mapping to the human genome and the frequency of reads mapping to miRNAs clearly decreased (Fig. 2b). In case of the protocols SBN, SB_4N, 4N, and 4N_C3, less than 10% of the total reads mapped to the human genome, and the libraries consisted almost completely of adapter dimers. Compared to the spike-in experiments in stage 1, the amount of adapter dimers increased for all protocols 2- to 399-fold (Fig. 2c). Furthermore, the amount of multi-mapped reads increased to 34–73% of mapped reads (Supplementary Data 4). Finally, a maximum of 2.0% of total reads and 3.0% of mapped reads (SBN_CL) matched to annotated human miRNAs. Still, miRNAs were detected in all protocols: We monitored on average 55 to 178 different miRNAs between the protocols (Fig. 2d). One of the replicated SB libraries even contained 327 different miRNAs. The other replicates, however, yielded only between 134 and 189 miRNAs. The SBN_CL protocol showed the highest concordance with on average 178 miRNAs (SD 21.3) per replicate. Next, we performed a dimension reduction analysis via UMAP (Fig. 2e), which showed that most replicates clustered according to their protocol as the major driving factor, followed by the sequencing run (the six replicates were sequenced in two batches of three replicates). As for the UMAP dimension reduction for the protocols of the first stage, we observed again a split between protocols with 5′ and 3′ 4N adapters in comparison to all others, although this split was less pronounced. An evaluation of the reproducibility of the measurements of each sample highlighted that the replicates of the SBN_CL protocol had the highest reproducibility (lowest Euclidean distance between replicates of the same protocol), followed by the SB protocol (Fig. 2f). The CL protocol was found to be the one with the lowest reproducibility, i.e., the highest Euclidean distance between the replicates. An evaluation between single protocols in comparison to all others showed that all protocols were similarly different from each other, while the samples of protocol CL showed on a median the largest difference. Focusing on the detected miRNAs, we observed the same trends. In total, up to 713 miRNAs were present in at least one replicate of the tested protocols. MiR-21-5p was found in all sequenced libraries followed by let-7a-5p and miR-182-5p (Fig. 2g). Finally, a set analysis comparing the overlaps of all detected miRNAs per protocol was performed. This underlined the high heterogeneity of miRNAs detected per experimental setup. While 69 miRNAs were detected in at least one replicate in all protocols, 60 miRNAs were exclusively detected by the SBN protocol, followed by the SB protocol with 57 exclusive miRNAs and the SBN_CL protocol with 53 exclusive miRNAs (Fig. 2h). Analogous to the first stage, we performed our analyses on subsampled libraries with 300,000 reads and confirmed the same patterns (Supplementary Fig. 10).

Fig. 2: Protocol comparison with single cell equivalents (stage 2).
figure2

a Number of reads sequenced in stage 1 (miRXplore spike-in) and 2 (MCF7 single cell equivalents), shown as boxplot (bottom) and dot plot (top). Each dot represents one sample. The boxes span the first to the third quartile with the vertical line inside the box representing the median value. The whiskers show the minimum and maximum values or values up to 1.5 times the interquartile range below or above the first or third quartile if outliers are present. b Read distribution for all tested protocols, sorted by miRNA reads proportion. The data were presented as mean values ±  the standard deviation (n = 6 biologically independent samples), which is shown as a smaller error bar in a darker color than its corresponding read group and only represented in one direction. cAdapter dimers found in stage 1 and stage 2, shown as boxplot (bottom) and dot plot (top). Each dot represents one sample. The boxplot was defined in the same manner as for panel a. d Detected miRNAs for every sample, sorted by decreasing average per protocol shown as boxplot (bottom) and dot plot (top). Each sample is shown as one dot and colored by protocol. The boxplot was defined in the same manner as for panel a. e UMAP embedding of all samples. Each sample (dot) is colored by its protocol. f Euclidean distance on the log2 transformed sequence expression showing the reproducibility between all replicates of the same protocol (green) and between all samples of different protocols (brown). Each dot represents the distance observed between two samples. Only nonredundant distances are shown (i.e., the distance of sample 1 to sample 2 is considered identical to the distance of sample 2 to sample 1). For each protocol a dot plot (top), as well as a boxplot (bottom), is shown. The boxplot was defined in the same manner as for panel a. g Top ten miRNAs detected in multiple experiments. h Upset plot showing the miRNAs jointly detected by multiple protocols, or exclusively found in only one protocol (orange). The bar plot at the top shows on the y-axis the number of miRNAs detected by the protocols highlighted by connected black or orange dots in the grid below. The bar plot on the left shows on the x-axis the total number of miRNAs detected in the least one of the replicates of the protocol shown on the y-axis. Source data are provided in the Source Data file.

In conclusion, the protocols SB and SBN_CL showed the best results in our experiments on the single cell level. Due to its higher reproducibility, the SBN_CL protocol was selected for the miRNA analysis of eight cell lines and patient-derived CTCs in the third stage.

SBN_CL protocol shows comparable performance in eight different cell lines

In order to investigate if the SBN_CL protocol shows robust performance in different cell types, we analyzed six single cells each of the following cell lines: epithelial cancer cells (liver HepG2, lung A549, colon HT29), hematopoietic cancer cells (monocyte THP-1, T-cell Jurkat, macrophage KG1, lymphoblast REH), and healthy BJ fibroblasts. The SBN_CL protocol worked in all different cell types. The overall protocol performance is comparable to the MCF7 single cell equivalents. On average 30.5% (SD 8.5%) of total reads could be mapped to the human genome and 1.3% (SD 0.76%) matched annotated miRNAs (Fig. 3a and Supplementary Data 5). Per single cell, 32–255 different miRNAs could be detected (median 174; Fig. 3b). Dimension reduction via UMAP showed that the samples cluster by cell type, whereas the hematopoietic cell lines cluster closer together (Fig. 3c). Detailed analysis of the detected miRNAs shows that 128 miRNAs could be detected in at least one replicate of all eight cell lines and 16 (Jurkat) − 42 (HepG2) miRNAs were specific to a certain cell line (Fig. 3d). To determine if the miRNA profiles observed in the cell lines showed patterns in line with the literature, we performed a pathway enrichment analysis with miEAA 2.022 for the miRNAs of each cell line, sorted by decreasing mean expression. All profiles yielded enrichments typical for the studied cell line, i.e., for A549 and BJ we found significant enrichment for lung and skin tissue, respectively (Fig. 3e, f), for HepG2, HT29, KG1, REH, and THP-1 we found significant enrichment for their specific diseases (Fig. 3g–l). In summary, the sc-miRNA-Seq SBN_CL protocol can be used to determine tissue- and/or disease-specific miRNA profiles in a variety of different cell types.

Fig. 3: Application of protocol SBN_CL to eight different cell lines.
figure3

a Read distribution for all tested cell lines, sorted by miRNA reads proportion. The data were presented as mean values ± the standard deviation (n = 6 biologically independent samples), which is shown as a smaller error bar in a darker color than its corresponding read group and only represented in one direction. b Detected miRNAs for every sample, sorted by decreasing average per cell line shown as boxplot (bottom) and dot plot (top). Each sample is shown as one dot and colored by protocol. Each dot represents one sample. The boxes span the first to the third quartile with the vertical line inside the box representing the median value. The whiskers show the minimum and maximum values or values up to 1.5 times the interquartile range below or above the first or third quartile if outliers are present. c UMAP embedding of all samples. Each dot represents one sample. d Upset plot showing the miRNAs jointly detected in multiple cell lines, or exclusively found in only one cell line (orange). The bar plot at the top shows on the y-axis the number of miRNAs detected in the cell lines highlighted by connected black or orange dots in the grid below. The bar plot on the left shows on the x-axis the total number of miRNAs detected in at least one of the samples of the cell line shown on the y-axis. e–l Examples of significantly enriched categories for each of the analyzed cell lines. Each plot shows the computed running sum (blue), running sums of random permutations (background), and the FDR adjusted P value for the cell line specified in the title. Exact p values were computed by the gene set enrichment analysis implementation of miEAA for each enrichment and FDR adjusted, separately for each database. Source data are provided in the Source Data file.

miRNA profiles of SCLC patient CTCs show high intrapatient variability

In seven SCLC patients, single CTCs were isolated from blood using EpCAM staining (Fig. 4a). For these patients, in total 67 CTCs were sequenced using the SBN_CL protocol. Additionally, two negative controls containing only reagents without a cell were tested. The number of EpCAM positive cells per patient varied between 2 and 28 cells (Fig. 4b). For the patient CTCs, we sequenced between 370,000 and 700,000 reads per cell, of which on average 62.7% (SD 6.6%) were lost in quality control. Of the remaining reads on average 37.5% (SD 6.4%) mapped to the human genome (Supplementary Data 6). Compared to the results in stages 1 and 2, we observed an increased heterogeneity of covered RNA classes. The proportion of reads mapping to miRNAs varied between 0.02 and 5.9% with an average of 0.85% (SD 1.1%; Fig. 4c). Most reads mapped to non-annotated intergenic regions, protein-coding genes, ribosomal RNAs (rRNAs), lncRNAs, as well as to transfer RNAs (from GtRNAdb) and snoRNAs. As expected, in our negative controls, nearly no reads mapping to miRNAs, tRNAs, or snoRNAs were found (Supplementary Data 6). Next, we investigated the miRNA read duplication rates and found that the number of reads per UMI was consistent for most cells between the patients with on average 4.74 reads per UMI (SD 2.16; Fig. 4d). Subsequently, we evaluated the number of miRNA molecules found per cell. On average, 389.49 (SD 496.43) molecules per patient cell could be detected, only 7 and 14 molecules were detected in the two negative samples (Supplementary Fig. 11). We thus excluded cells that were likely of low quality by requiring at least 50 detected miRNA molecules, since we expected these to be unlikely to contain spurious signals. Among the remaining 53 cells, the most abundant miRNAs were miR-375-3p, miR-26a-5p, and let-7a-5p (Fig. 4e). Per single cell, we detected a median of 68 miRNAs, with ten miRNAs expressed in over 90% of tested cells (Supplementary Fig. 12). Altogether, all cells expressed 352 unique miRNAs (Supplementary Data 7). The six most variable miRNAs were miR-100-5p, miR-10b-5p, miR-182-5p, miR-200b-3p, miR-335p-3p, and miR-7-5p (Fig. 4f). We computed a UMAP embedding of the cells and clustered them with the Louvain community detection algorithm23 into six clusters to investigate the miRNA expression variability between patients and between cells of the same patient. As highlighted in Fig. 4g, the cells only moderately clustered per patient (no patient formed its own cluster) indicating a high variability between cells of the same patient, which was underlined by adjusted mutual information of 0.31. Because the number of CTCs per patient is very variable (Fig. 4b), we repeated the clustering analysis for only the three patients with the largest number of sequenced CTCs and we still do not observe clustering by the patient (Fig. 4h).

Fig. 4: Application of protocol SBN_CL to CTCs of small cell lung cancer (SCLC) patients.

a Representative images of two EpCAM+ cells enriched from blood samples of two SCLC patients (20x). The cells were isolated by micromanipulation for miRNA sequencing. The scale bar is equivalent to 5 µm. b Number of cells sequenced per patient and the two empty negative controls. cDistribution of the mapped reads for each cell grouped per patient and ordered by descending miRNA proportion. d Boxplot showing the number of reads per UMI for each cell, grouped by the patient in descending order. Each miRNA is shown as one dot. The boxes span the first to the third quartile with the vertical line inside the box representing the median value. The whiskers show the minimum and maximum values or values up to 1.5 times the interquartile range below or above the first or third quartile if outliers are present. e Distribution of the top 20 most expressed miRNAs across all cells shown as boxplot (bottom) and dot plot (top). The boxes span the first to the third quartile with the vertical line inside the box representing the median value. The whiskers show the minimum and maximum values or values up to 1.5 times the interquartile range below or above the first or third quartile if outliers are present. f Expression distribution of the six most variable miRNAs. Patients with less than three cells were excluded. The vertical lines inside the areas delimit the quartiles. The dots inside the area represent the expression of a miRNA in the cell of the corresponding patient. g UMAP embedding of all samples colored by the patient (left) and colored by cluster (right). h UMAP embedding of the three patients with the highest number of CTCs colored by the patient (left) and colored by cluster (right). Source data are provided in the Source Data file.

Known and non-annotated miRNAs highlight relevance for cancer

To understand the relevance of miRNAs in CTCs, we performed a pathway enrichment analysis. The miRNAs were sorted by decreasing expression in the CTCs and processed using the gene set enrichment analysis (GSEA) of miEAA22. In terms of organs from the human miRNA tissue atlas24, miEAA proposed enrichment in the lung (GSEA, adjusted p value = 1.5 × 10−11) and the colon (GSEA, adjusted p value = 2.5 × 10−11; Fig. 5a, b). Interestingly, the results also suggest an enrichment of several cancer related pathways, such as the integrin signaling pathway (Supplementary Data 8) and cancer as disease including the miRWalk categories25 neoplasms (GSEA, adjusted p value = 4.6 × 10−8) and carcinoma (GSEA, adjusted p value = 1.6 × 10−6; Fig. 5c, d). The most overrepresented cellular localization is the cytoplasm (GSEA, adjusted p value = 1.6 × 10−15) and the mitochondrion (GSEA, adjusted pvalue = 1.1 × 10−14; Fig. 5e, f). While these results are no functional proof of potential downstream cascades triggered by miRNAs in CTCs, we can at least claim a hypothetic regulatory effect of miRNAs in CTCs.

Fig. 5: Enrichment analysis of CTC miRNAs and miRNA candidates.
figure5

a–f Top enriched organs, pathway categories, and cellular locations. Each plot shows the computed running sum (blue), running sums of random permutations (background), and the FDR adjusted Pvalue. Exact p values were computed by the gene set enrichment analysis implementation of miEAA for each enrichment and FDR adjusted, separately for each database. g Pileup plot obtained from miRCarta for one of the overlapping miRNA candidates. The bars are colored according to the experiments the reads were contributed from. The expression is shown as reads per million mapped (RPMM). h Distribution of the read proportion supplied by each experiment that detected this miRNA.

The analyses performed so far were restricted to miRNAs annotated in the miRBase. MiRNAs that have not yet been reported or at least miRNAs that have not yet been added to the miRBase may have also regulatory effects. We thus performed a prediction of non-annotated miRNAs in the single CTCs. Our analysis suggested ten non-annotated miRNA candidates. Since the coverage of respective candidates is limited on the single cell level, we set to compute the coverage in bulk sequencing data. Indeed, in four cases we found hits for the potential candidates in miRCarta26, namely hsa-11781-8351.1, hsa-2644-2657.1, hsa-2810-2791.1, and hsa-9809-4031.1, partly with excellent read mapping profiles (Fig. 5g, h;Supplementary Fig. 13). Interestingly, the discovered miRNAs that were supported by data from miRCarta also regulated relevant pathways, including cell junction (GSEA, adjusted pvalue of 4 × 10−8) and cell adhesion (GSEA, adjusted p value of 5 × 10−6).

Discussion

In this study, we comprehensively compared 18 ligation-based miRNA-Seq and one polyadenylation-based protocol. The only polyadenylation-based protocol investigated, showed in the first stage lower performance and was excluded together with 10 of the 18 ligation-based protocols. These results are in line with similar protocol comparisons on the bulk level8,10,11,13. Eight different protocol variants showed promising results for miRNA-Seq from very low input samples. Both, exonuclease digestion of excess adapters, and chemically modified adapters led to a reduction of adapter dimer formation, with CleanTag adapters20 appearing to be the most effective strategy. Surprisingly, adapters with random nucleotides at the ligation sites do not show improved miRNA quantification accuracy of the equimolar spike-in miRNAs, and in addition, 4N libraries contained more than 40% of adapter dimer reads. Since similar miRNAs are under- or overrepresented between the different protocols, we showed the observed bias seems to be partially caused by miRNA sequence G-content. Other studies suggest the nucleotide sequence at 5′ or 3′ end, miRNA free energy, enthalpy, entropy, and secondary structure formation as additional causes of bias9,15.

Single cells probably contain much less than 1 pg miRNA as deduced from the much lower performance of all tested protocols in stage 2 compared to stage 1. Surprisingly, the original SB protocol showed in our experiments an increase in performance compared to the updated SBN protocol. More miRNAs could be detected, and the amount of adapter dimers was lower, which might be explained by the higher adapter concentrations of the SBN protocol. The SB and the SBN_CL protocol can be recommended for the use on single cell level, with advantages in terms of reproducibility for the SBN_CL protocol. We demonstrated the applicability of the SBN_CL protocol in a broad range of cell types. However, this protocol could also be further improved based on our insights, e.g., to detect even more cell-type specific (novel) miRNAs, to investigate the distribution of 3′ and 5′ miRNA arm usage, and to analyze isomiR expression in health and disease27,28. Future research should focus on the increase of the number of reads mapping to miRNAs and the reduction of adapter dimers. This might be achieved by optimizing ligation reactions with special attention to the adapter concentrations29, by the usage of splint adapters12, by adapter-miRNA-circularization10,30, by application of CRISPR/Cas9 to deplete adapter dimer reads16, or by a combination thereof. Furthermore, combined profiling of single cell mRNA and miRNA expression seems possible31,32. The current SBN_CL protocol allows sc-miRNA-Seq of about 15 samples within 2 days for library preparation. The protocol could also be easily automated in 96- or 384-well format due to bottom-up reactions and the avoidance of gel or column-based purification steps. Further increase in throughput might be achieved by introducing a barcode to the 3′ adapter and pooling multiple samples after ligation29.

Our pilot study on seven SCLC patients demonstrates the feasibility of single cell miRNA profiles as potential biomarkers. We have identified many different oncogenic miRNAs in SCLC CTCs. The most abundant miRNAs from the CTC study are known as cancer miRNAs. A comprehensive literature review revealed that miR-21-5p, miR-146b-5p, miR-142-5p, miR-148a-3p, miR-92a-3p, miR-26a-5p, and miR-25-3p were each found to be connected to cancer in over 50 manuscripts. Of note, from all 30 miRNAs that were present in at least 60% of CTCs, 27 have already been suggested to be of relevance for lung cancer, with the most prominent cases of miR-21-5p2,33 and miR-142-5p34,35. Until now, changes in expression profiles of these miRNAs have only been associated with clinical and molecular features of non-small cell lung cancer primary tumors and circulating miRNA36, but not with CTCs. Our study shows that these miRNAs are also frequently expressed in CTCs of SCLC patients. Pathway enrichment also provided a first glimpse into the biology of CTCs, as the integrin signaling pathway was the top enriched pathway in SCLC CTCs. Integrin expression seems to be directly related to the aggressiveness of SCLC comprising high metastatic potential and resistance development to chemotherapy37,38,39,40,41. To explore the biological relevance and the diagnostic potential of CTC-derived miRNAs, however, larger cohorts with more cases and controls, repeated sampling over time, outcome data, and mechanistic studies are required. Our comprehensive protocol comparison providing an assay for measuring miRNAs in CTCs of cancer patients paves the way for this goal.


Article classification: Biological abstract
Share to:
Tel:+86-027-87610298
Tel:+86-027-87610297
Add:Room A11-329, 1st Floor, No.1, SBI Venture Street, Optics Valley, East Lake
New Technology Development Zone, Wuhan, China.
Certificate NO.:U18Q28010569R0S