Enlibio,Your biological research partner
Xinqidi Biotech Co.,Ltd,Wuhan,China 2008-2022
R&D 14th year

Genome-wide mapping of somatic mutation rates uncovers drivers of cancer

Issuing time:2022-06-28 14:42


Identification of cancer driver mutations that confer a proliferative advantage is central to understanding cancer; however, searches have often been limited to protein-coding sequences and specific non-coding elements (for example, promoters) because of the challenge of modeling the highly variable somatic mutation rates observed across tumor genomes. Here we present Dig, a method to search for driver elements and mutations anywhere in the genome. We use deep neural networks to map cancer-specific mutation rates genome-wide at kilobase-scale resolution. These estimates are then refined to search for evidence of driver mutations under positive selection throughout the genome by comparing observed to expected mutation counts. We mapped mutation rates for 37 cancer types and applied these maps to identify putative drivers within intronic cryptic splice regions, 5′ untranslated regions and infrequently mutated genes. Our high-resolution mutation rate maps, available for web-based exploration, are a resource to enable driver discovery genome-wide.


Neutral (passenger) mutations that do not provide a proliferative advantage to a cell dominate the mutational landscape of tumors1,2. Only a relatively small fraction of mutations are under positive selection3,4,5 due to their ability to drive cancer by promoting cell growth, resisting cell death or enabling tissue invasion6. Because positively selected mutations reoccur across tumors7, genomic elements (for example, coding sequences, promoters, enhancers and long non-coding RNAs) with carcinogenic potential accumulate more mutations than expected compared to the rates at which neutral mutations occur when counted across multiple tumors8,9. Searching for mutational excesses attributable to positive selection to discover driver mutations, genes and non-coding elements provides crucial insight into the mechanisms of cancer4,5,10,11,12,13,14,15.

Because robust identification of mutational excess requires an accurate model of the neutral mutation rate, computational tools that carefully model somatic mutation rates are central to locating additional cancer drivers. This task is made challenging by the highly variable and tissue-specific patterns of neutral mutations across the cancer genome16,17. Existing methods address this challenge by fitting bespoke statistical models of mutation rates to specific regions of the genome4,9,18,19,20,21. For example, methods designed to identify driver genes model mutation rates specifically within protein-coding sequences by using synonymous mutations as a proxy for neutral mutations3,4,21,22. Recent methods designed to identify non-coding cancer drivers train sophisticated machine learning methods, such as gradient boosting machines, to model mutation rates within a subset of the genome18,19,20 (~4% of the genome in a recent pan-cancer analysis of non-coding drivers5). Additionally, some models search for driver mutations in unexpected nucleotide contexts10, in unexpected clusters23 or by directly (and interpretably) predicting the consequences of variants within the coding sequence of select genes24. Despite this progress, the ability to search for evidence of driver mutations in arbitrary genomic regions remains incomplete: existing methods are not applicable to most of the genome (for example, because they operate only within coding sequences); require time-consuming and computationally expensive model training for each set of regions to test in a cancer cohort; or cannot test with base-pair resolution. These limitations contribute to catalogs of cancer driver elements remaining incomplete, particularly in the non-coding genome25, hindering precision oncology4,11,26,27.

Here we introduce a genome-wide neutral mutation rate model that allows rapid testing for evidence of positively selected driver mutations anywhere in the genome. This approach, called Dig, is predicated on two key methodological advances. First, we introduce a deep learning approach to map cancer-specific somatic mutation rates at kilobase-scale resolution across the entire genome. Second, we propose a probabilistic model that uses these maps to test any set of candidate mutations from an arbitrary cancer cohort for evidence of positive selection. Through this framework, our maps enable millions of mutations to be evaluated in arbitrary cancer cohorts in minutes using the resources of a personal computer. We applied our deep learning framework to map cancer-specific somatic mutation rates for 37 cancer types present in the Pan-Cancer Analysis of Whole Genomes (PCAWG) dataset12, using high-resolution epigenetic assays from healthy tissues as predictive features (well-known correlates of tumor mutation rates at the megabase scale16,28). We then used Dig to identify new coding and non-coding candidate cancer drivers in publicly available whole-genome, whole-exome and targeted sequencing cancer datasets. Our mutation maps are publicly available both as an interactive genome browser and as a standalone software tool for quantifying excess somatic mutations anywhere in the genome in a dataset of interest.


Testing mutational excess with probabilistic deep learning

To enable rapid evaluation of mutational excess anywhere in the genome, we designed Dig to model somatic mutation rates genome-wide for a given type of cancer. Thus, the distribution of neutral mutations over any set of genomic positions for a cohort of tumors from that cancer type can be looked up nearly instantaneously. The method employs a probabilistic deep learning model that explicitly captures two central determinants of somatic mutation rate variability16,17,21: (1) kilobase-scale variation driven by epigenomic properties, such as replication timing and chromatin accessibility, that broadly impact efficacy of DNA repair9; and (2) base-pair-scale variation driven by the sequence context biases of processes that induce somatic mutations, such as APOBEC-driven cytidine deamination and UV light exposure10,17,29,30. Kilobase-scale variation is modeled with a custom deep learning architecture31 that uses a neural network to predict cancer-specific mutation rates within 10-kb regions and a Gaussian process (GP) to quantify the prediction uncertainty, taking as input high-resolution epigenetic assays (and, optionally, flanking mutation counts) (Fig. 1a, Extended Data Fig. 1 and Methods). By strictly partitioning the genome into non-overlapping train, validation and held-out test sets with five-fold cross-validation (predicting mutation rates in each one-fifth of the genome using a model trained and validated on observed mutations in the remaining four-fifths; Methods), the network constructs a kilobase-scale map of the mutation rate genome-wide for a given type of cancer (Fig. 1b). Base-pair variation is subsequently modeled using a generative graphical model that simulates how mutations should be distributed to individual positions in a region according to the nucleotide biases of mutational processes (Supplementary Fig. 1 and Methods). The marginal distribution over the number of neutral mutations at any set of positions has a closed-form solution that depends on the predicted regional mutation rate, the prediction uncertainty and the genome-wide probability that a position is mutated based on its neighboring nucleotides (Methods). Thus, once values for these parameters are learned from a training cohort of a given cancer type, the distribution of mutations expected at any set of positions in the genome can be queried for any tumor cohort of the same cancer and used to test for evidence of positive selection by quantifying if excess mutations are observed (Fig. 1c and Methods).

Fig. 1: Modeling the genome-wide neutral somatic mutation rate and identifying cancer driver elements.
figure 1

a, Deep learning scheme to predict expected number of somatic mutations and prediction uncertainty using epigenetic sequencing of healthy tissue from the Roadmap Epigenomics consortium and ENCODE. b, Genome-wide neutral somatic SNV map and observed density of SNVs in 1-Mb windows from the PCAWG cohort (n = 2,279 samples). For clarity, only chromosomes 1, 3 and 5 are shown. Highlighted regions correspond to panels with the matching colored symbol. Inset: region on chromosome 1 modeled at 100-kb and 10-kb resolution. The reported R2 statistic between observed and expected SNV counts was calculated genome-wide. c, Examples of burden tests in the PCAWG dataset (n = 2,279 samples) for coding mutations in NRAS (n = expected versus observed mutations; synonymous: 0.81 versus 1; missense: 2.62 versus 15; nonsense: 0.22 versus 0; indels: 0.23 versus 3), non-coding mutations in the TERT promoter (SNVs: 2.12 versus 99; indels: 0.14 versus 0) and splice site SNVs in VHL (canonical splice SNVs: 0.03 versus 5; cryptic splice SNVs: 0.17 versus 0). Expected is mean with 95% CIs. P values from Dig. d, Proportion of variance of non-synonymous SNV count in genes 1–1.5 kb in length (n = 3,740 genes) in 16 PCAWG cohorts explained by different methods (size of each cohort reported in Supplementary Table 1). Box plot elements are defined in Methods. e, Approximate numbers of false-positive and true-positive driver genes identified in the PCAWG cohort by method (across a range of calling thresholds). Numbers are approximated because the true set of driver genes is unknown. CGC genes were used as a conservative approximation of true positives (a non-CGC gene may still be a true driver). f, Runtime of coding and non-coding driver detection methods. Comparison was restricted to SNVs because not all methods support indels. Coding analysis over n = 19,210 genes for Dig and dNdScv and n = 18,862 genes for MutSigCV. Non-coding analysis over n = 139,404 elements for Dig, DriverPower and Larva and n = 117,180 of those elements for ActiveDriverWGS. ActiveDriverWGS required >2 days to analyze the largest cohort.

We constructed mutation rate maps and inferred nucleotide mutation biases for 37 cancer types (Supplementary Tables 1 and 2 and Supplementary Data File 1) based on somatic mutations from the PCAWG dataset12 and 100-bp patterns of 723 chromatin marks in 111 tissues from Roadmap Epigenomics32, replication timing from ten cell lines from ENCODE33, and average nucleotide and GC content of the reference genome (Supplementary Table 3). We then benchmarked the accuracy of our somatic mutation rate models using the metric of proportion of variance explained, which we calculated as the square of the correlation coefficient between predicted and observed mutation counts as in previous work16. Dig successfully predicted a median of 77.3% (mean, 70.6%; range, 22.7–92.3%) of variance in observed single nucleotide variant (SNV) rates in 10-kb regions and a median of 94.6% (mean, 91.9%; range, 73.1–98.0%) of variance in 1-Mb regions (Fig. 1b, Supplementary Table 4 and Methods) across 16 cancer types for which benchmarking power was sufficient (>1 million mutations and excluding lymphomas, in which activation-induced cytidine deaminase produces extreme outlier mutation counts in locally hypermutated regions). Compared to existing methods designed specifically to analyze tiled regions34, coding sequence4,21 and non-coding elements in which synonymous mutations cannot be used to calibrate mutation rate models18,19 (for example, enhancers and non-coding RNAs), Dig explained the most variation of SNV counts within 10-kb regions in 14 of 16 cohorts, of non-synonymous SNV counts in 16 of 16 cohorts and of enhancer and non-coding RNA SNV counts in 15 of 16 cohorts, respectively (Fig. 1d, Table 1, Supplementary Fig. 2 and Supplementary Tables 46). Our approach’s accuracy is attributable, in part, to the ability of the deep learning network to identify local epigenetic structures, such as active transcription start sites, and to associate these structures with mutation rates (Extended Data Fig. 2 and Supplementary Note 1).

Table 1 Proportion of variance in observed SNV counts in the PCAWG cohort (n = 2,279 samples) explained by different methods

This accuracy enabled correspondingly powerful driver identification. In benchmarks testing downstream ability to identify evidence of positive selection (that is, excess of mutations) within previously identified driver elements, Dig matched or exceeded the performance of methods tailored toward specific classes of elements4,18,19,20,21 in whole-genome and whole-exome sequenced samples (Fig. 1e, Supplementary Figs. 35, Supplementary Tables 710 and Supplementary Notes 2 and 3). Considering driver genes—for which high-quality databases of known driver genes that can approximate gold standard true positives exist (Methods)—Dig had the highest F1-score (a measure of accuracy) in 24 of 32 PCAWG cohorts (excluding skin and blood cancers as in previous work19 due to local hypermutation processes) and the most power in 14 of 16 whole-exome cohorts compared to widely used, burden-based driver gene detection methods (Fig. 1e, Supplementary Figs. 3 and 4 and Supplementary Tables 8 and 9) (power was measured as the area under approximated receiver operating characteristic curves, which could be estimated due to the larger sizes of the exome-sequenced cohorts; Methods).

Identifying potential driver elements with Dig was 1–5 orders of magnitude faster than existing methods that train new models for every element and cohort analyzed (Fig. 1f). For example, testing 107 observed mutations for evidence of positive selection within 105 non-coding elements with Dig completed in <90 seconds on a single CPU core compared to between ~10 minutes and >2 days for other methods. Thus, our method matches or exceeds the power of existing approaches while requiring less runtime and providing flexibility to identify drivers with mutation-level precision genome-wide.

Small mutation sets increase power to identify drivers

Previous searches for non-coding driver elements have concluded that such drivers are likely rare, carried by <1% of samples5. A power analysis using our model’s generative capabilities concurred (Methods), indicating the most known non-coding elements (for example, enhancers) require at least 1–2% of samples to carry driver mutations to have a >90% likelihood of detecting mutational excess at current sample sizes (~102 for individual cancer types; ~103 for pan-cancer cohorts) (Supplementary Fig. 6). However, by reducing the size of tested elements to encompass only tens to hundreds of positions (as opposed to the thousands of base pairs spanned by most non-coding elements considered to date—for example, average enhancer size: 1,717 bp; range, 600–30,200 bp), power to identify driver mutations in <1% of samples increased by ~20% (Supplementary Fig. 6). To demonstrate the ability of Dig to find putative drivers, we, thus, defined and tested specific sets of mutations with potential functional impact for evidence of selection. The ability to test user-specified sets of specific mutations genome-wide is a unique feature (to our knowledge) of our method.

Quantifying pan-cancer selection on cryptic splice SNVs

Alternative splicing is increasingly recognized as functionally relevant to cancer35,36, and recent studies have associated specific somatic mutations outside canonical splice sites with alternative splicing events observed in expression data37,38. We, thus, applied Dig to rigorously quantify the extent to which cryptic splice SNVs, which may exist in both exons and introns of a gene (Fig. 2a), occur in excess of the neutral mutation rate and, therefore, may function as driver mutations under selection. In tumor suppressor genes (TSGs) from the Cancer Gene Census (CGC)39, cryptic splice SNVs as predicted by spliceAI40 (Methods) occurred significantly more often than expected under neutrality (648 SNVs observed in 283 TSGs versus 550 SNVs expected; P = 2.38 × 10−5) (Fig. 2b and Supplementary Tables 11 and 12); were primarily enriched in introns (where most such mutations occur); and were biased to occur in sites with high predicted impact on splicing (SNVs with predicted impact Δ score >0.8 exhibited a 1.75-fold enrichment (95% confidence interval (CI): 1.31–2.22 fold), P = 2.52 × 10−5) (Fig. 2b,c). Overall, intronic cryptic splice SNVs were estimated to account for 4.5% (95% CI: 1.3–7.4%) of excess (potential driver) SNVs in TSGs, similar in magnitude to the 7.4% (5.6–9.7%) attributable to canonical splice SNVs, whose driver potential is well-established4 (Fig. 2d) (exonic excess SNV estimates were consistent with estimates from dNdScv; Supplementary Fig. 7). Results were robust to high mutation burden samples (Supplementary Fig. 8) and consistent with an analysis that did not rely on our mutation maps (Supplementary Fig. 9). Neither control genes not in the CGC nor oncogenes in the CGC were enriched for cryptic splice SNVs (Extended Data Fig. 3 and Supplementary Table 11). The lack of enrichment in oncogenes suggests that gain-of-function splice mutations beyond those that induce skipping of MET exon 14 are extremely rare, which may reflect the low likelihood of an intronic splice mutation resulting in the in-frame addition of residues that pathologically activate an oncogene. Conversely, the enrichment in TSGs suggests that cryptic splice mutations are generally inactivating, likely by triggering nonsense-mediated decay of mRNA transcripts or generating a protein with impaired function.

Fig. 2: Evidence of positive selection on intronic cryptic splice SNVs in TSGs.
figure 2

a, Schematic of the splice-altering SNVs considered in this analysis. Predicted impact on splicing measured by the SpliceAI Δ score (higher score approximates higher likelihood of altered splicing). We stratified possible SNVs by predicted impact on splicing: low predicted impact (0.2 < Δ < 0.5), medium predicted impact (0.5 < Δ < 0.8) and high predicted impact (0.8 < Δ < 0.1). b, Estimated enrichment (with 95% CI) of observed mutations compared to expected neutral mutations in TSGs stratified by variant type and predicted impact on splicing in n = 2,279 pan-cancer samples from the PCAWG dataset (n mutations per category in Supplementary Table 11). c, Predicted splicing impact (SpliceAI Δ score) for intronic cryptic splice SNVs observed in recurrently mutated TSGs (see e) compared to those observed in genes not in the CGC (** indicates bootstrapped P < 3 × 10−4; Methods). Box plot elements are defined in Methods. d, Proportion of excess SNVs in TSGs contributed by each protein-altering SNV category. e, Known TSGs per cancer with a significant burden (FDR < 0.1) of predicted intronic cryptic splice SNVs (n mutations per gene in Supplementary Table 13). f, Distribution of distance to nearest exon boundary for the intronic cryptic splice SNVs observed in recurrently mutated TSGs. g, Pileup of RNA-seq reads in a Lymph-BNHL carrier of a predicted, deeply intronic cryptic splice SNV (labeled in red) in CIITA and a control Lymph-BNHL sample, showing the inclusion of a cryptic exon (gold) in the cryptic splice SNV carrier. Arc labels indicate the number of RNA-seq reads that support each exon junction.

Considering individual genes, seven TSGs in 12 cancer types had a significant burden of intronic cryptic splice SNVs (false discovery rate (FDR) < 0.1 for n = 283 TSGs in 37 cancers) (Methods, Fig. 2e and Supplementary Table 13), with patterns of TSG–cancer associations consistent with known tissue specificity of TSGs. Pan-cancer, TP53 and SMAD4—both implicated in many cancers—carried an excess of cryptic splice SNVs. In contrast, the hematopoietic-specific TSG CIITA and the renal-specific TSG PBRM1 carried excess cryptic splice SNVs in blood and kidney malignancies, respectively. In further support of these associations, the intronic cryptic splice SNVs observed in these TSGs, most (79.3%) of which fell outside annotated splice regions (that is, >20 bp from exon–intron boundaries) (Fig. 2f), had significantly higher predicted impact on splicing than those observed in genes not in the CGC (Fig. 2c) (mean SpliceAI Δ score = 0.55 versus 0.33; P < 3 × 10−4; Methods). Moreover, of the six cryptic splice SNV carriers with available RNA sequencing (RNA-seq) data with sufficient coverage, five had evidence of alternative splicing (Fig. 2g, Supplementary Fig. 10, Supplementary Table 14 and Supplementary Note 4) as quantified by LeafCutter41 (Methods). Overall, these results provide evidence that intronic cryptic splice SNVs are under positive selection in TSGs and likely act as driver events in several percent of tumors across multiple cancer types.

Nine genes not in the CGC also had a significant burden of intronic cryptic splice SNVs in six cancers (Supplementary Table 15) at FDR < 0.1, of which two genes had a significant burden at the more stringent Bonferroni (α < 0.05) correction for 712,600 tests conducted across all genes and cancers. The burdens of four genes were driven by recurrent mutations at a single intronic location per gene (Supplementary Table 16). Implicated genes include BTG2 in lymphoma, which is involved in the regulation of the G1/S transition of the cell cycle and has recently been implicated as a driver of blood cancers based on mutations in its coding sequence10, and ADAM19 in hemopoietic tumors, which has been implicated in the oncogenesis of breast42, prostate43, colorectal44 and ovarian45 cancers. Although the computational prediction of new drivers should be interpreted with caution (Discussion), these genes may be promising targets for future experimental studies to investigate their potential tumorigenic properties.

Non-coding candidate cancer driver mutations in 5′ untranslated regions

Hypothesizing that indels could have large effect size on gene expression by disrupting transcription factor binding motifs, we searched promoters (n = 19,251) for a burden of indels in the PCAWG dataset (Methods). The TP53 promoter was the only element with a genome-wide significant (FDR < 0.1) burden of indels (7 observed versus 0.54 expected; P = 9.4 × 10−7) (Fig. 3a), consistent with a previous analysis that used restricted hypothesis testing to boost statistical power5. The observed mutations—all deletions significantly larger than expected (Fig. 3b) (median length = 17 bp versus 1 bp expected; P = 7.4 × 10−4, one-sided Mann–Whitney U-test)—specifically affected exon 1 of the canonical 5′ untranslated region (UTR), disrupted critical sequence elements (transcription start site, WRAP53 binding sequence46, internal ribosome entry site47,48 and the donor splice region of the multi-exonic 5′ UTR) (Fig. 3a) and exhibited enrichment comparable to cryptic exonic splice SNVs in TP53, which are well-characterized cancer drivers49 (Fig. 3c). More than half of the mutations (four of seven) within the exon 1 splice region did not alter the canonical splice sites, an unexpected pattern compared to other TP53 splice regions (Fig. 3d) (P = 1.8 × 10−3, two-sided Fisher’s exact test). The 5′ UTR mutation carriers had significantly lower expression of TP53 than individuals without TP53 mutations and individuals with predicted functional coding TP53 mutations (1–2 standard deviation decreases in TP53 expression compared to non-carriers, P = 1.2 × 10−4; Methods, Fig. 3e and Supplementary Fig. 11), suggesting that these mutations either directly inhibit TP53 transcription or result in nonsense-mediated decay of the mRNA transcripts. Corroborating these results, seven of 2,399 distinct samples from the Hartwig Medical Foundation50 showed a similar mutational pattern, with three carrying >10-bp deletions and four carrying SNVs in TP53 exon 1 and its donor splice region (Fig. 3a).

Fig. 3: Enrichment of somatic mutations in the 5′ UTRs of TP53 and ELF3.
figure 3

a, Mutations from the PCAWG and Hartwig Medical Foundation cohorts observed within exon 1 of the 5′ UTR of the canonical TP53 transcript. DNA sequence from GRCh37 reference genome (+ strand). Mutation types, relevant sequence and regulatory elements as indicated in the legend. be, Analysis on PCAWG dataset (n = 2,279 samples). b, Distribution of indel sizes observed within 5′ UTRs of genes other than TP53 (n = 3,988 indels) and within the TP53 5′ UTR (n = 7 indels). P value comparing median indel lengths from one-sided Mann–Whitney U-test. c, Estimated mutation enrichment relative to the neutral mutation rate (observed / expected neutral mutations) within TP53 stratified by mutation type and location (number of mutations per category in Supplementary Table 17). Error bars, 95% CI. d, Distribution of mutations observed within donor and acceptor splice regions (defined as the 20 bp 3′ and 5′ of an exon, respectively) of the canonical TP53 transcript. Canonical splice SNVs and indels: mutations altering the two base pairs immediately adjacent to an exon boundary; splice region SNVs and indels: mutations intersecting the splice region but not the canonical splice sites. The donor splice region of exon 1 of the 5′ UTR (shown in a) is bolded. P value of observing the distribution of canonical and splice region mutations in the donor splice region of exon 1 5′ UTR compared to all other TP53 splice regions computed via a two-sided Fisher’s exact test. e, Expression of TP53 on standard deviation scale in carriers of TP53 5′ UTR mutations (n = 6) and non-carriers (n = 1,205), adjusted for tumor type and copy number in the PCAWG dataset (n = 2,279 samples). P value via one-sided Mann–Whitney U-test on adjusted and standardized expression values. Box plot elements are defined in Methods. f, SNVs overlapping ELF3 in the PCAWG and Hartwig Medical Foundation cohorts. Insets: zoom-in of the ELF3 5′ UTR region and estimated mutational enrichments with 95% CIs within this region (number of mutations per category in Supplementary Tables 17 and 18).

These results motivated a targeted search for mutational burden in 5′ UTRs and their splicing regions across 106 TSGs and 95 oncogenes with multi-exonic 5′ UTRs (Methods). One additional element, the 5′ UTR of ELF3, had a significant burden of SNVs (Fig. 3f) in PCAWG samples (6 observed SNVs versus 0.96 expected; P = 2.9 × 10−4); samples from the Hartwig Medical Foundation displayed a similar enrichment (10 observed versus 1.5 expected; P = 3.8 × 10−4; Methods). In both sets of samples, the enrichment was concentrated within the canonical ELF3 5′ UTR; surrounding sequences (upstream promoter and intron 1) were not enriched for mutations (Fig. 3f). The 16 mutations largely altered distinct base pairs within the 5′ UTR—although two positions mutated in PCAWG samples were also mutated in the Hartwig samples—suggesting that this 5′ UTR might be broadly sensitive to perturbation, possibly by prompting changes in promoter methylation that alter ELF3 expression51. An alternative possibility could be an unmodeled local mutational process or technical artifact in this region9; however, a careful analysis did not find evidence for any such features that have explained other non-coding mutational hotspots5 (Supplementary Note 5). The small number of carriers and limited availability of transcriptomic assays (only three carriers from PCAWG had RNA-seq data) prevented investigation into the possible function of these 5′ UTR mutations. Thus, additional follow-up, particularly experimental assays assessing the impact of 5′ UTR mutations52, will be necessary to determine whether the mutational enrichment here represents positive selection or represents a new neutral mutational process.

The shared landscape of common and rare driver genes

Small sample sizes have limited assessment of whether rare coding mutations (which account for most exonic mutations in tumors) act as drivers even in well-characterized driver genes. We increased statistical power in two ways: (1) by analyzing large meta-cohorts of non-synonymous SNVs from 14,018 whole-exome and targeted sequencing samples, representing ten solid tumor types (median samples per cancer, 1,195; range, 515–3,110) (Supplementary Table 19 and Methods); and (2) by considering only activating mutations in oncogenes (obtained from the Cancer Genome Interpreter23) and predicted loss-of-function (pLoF) mutations in all other genes. Such analysis has previously been impeded by the exclusion of synonymous mutations from large, publicly available targeted sequencing datasets53,54,55,56,57 because existing driver gene detection methods are reliant upon synonymous mutations. Dig circumvents this difficulty because model parameters have already been inferred from a separate training cohort.

For each cancer, we first restricted our analysis to ‘long-tail’ genes, which we defined as oncogenes and TSGs not associated with that cancer type in any of three recent, large, pan-cancer surveys of driver genes7,10,11. Dig estimated that 1–5% of samples (depending on the cancer) carried activating SNVs in long-tail oncogenes (Fig. 4a) and 3–6.5% carried pLoF SNVs in long-tail TSGs (Fig. 4b). These rates were significantly higher than expected (P < 3.78 × 10−9 for activating SNVs in all cohorts; P < 3.10 × 10−4 for pLoF SNVs in all cohorts except prostate (P = 0.056 for prostate)) (Supplementary Fig. 12, Supplementary Tables 20 and 21 and Methods). These rates were consistent when we restricted the analysis to only whole-exome sequenced samples, although power to detect positive selection was decreased due to reduced sample size (Supplementary Fig. 13 and Supplementary Tables 22 and 23). Considering individual genes, 92 oncogene–tumor pairs not reported in recent pan-cancer surveys of driver genes had a significant (FDR < 0.1) burden of activating SNVs (Fig. 4c and Supplementary Table 24). Forty-six TSG–tumor pairs not reported in the pan-cancer surveys had a significant burden of pLoF mutations (Fig. 4d and Supplementary Table 25). The newly identified candidate driver genes were rare compared to driver genes in existing databases (0.28% (interquartile range, 0.14–0.53%) versus 1.3% (interquartile range, 0.59–3.0%) for newly implicated and known driver genes, respectively; P = 3.1 × 10−27, two-sided Mann–Whitney U-test). Further supporting these predictions, the distribution of activating mutations in a given driver gene was similar between cancers in which the gene is a known, common driver and cancers in which we newly implicated the gene as a putative rare driver (Extended Data Fig. 4). For example, the G12, G13, Q61 and A146 positions of KRAS accounted for most KRAS SNVs in both common and rare scenarios (lung non-small-cell tumors: 568/586 mutations; prostate tumors: 12/17 mutations; gliomas: 11/15), and the V600E mutation accounted for the plurality of BRAF SNVs in common and rare scenarios despite each gene having dozens of known activating SNVs (52 and 71, respectively). Additionally, carriers of mutations in several predicted rare driver genes exhibited phenotypes consistent with those reported in tumors in which the genes are common drivers (Supplementary Note 6). For example, central nervous system tumors with rare pLoF mutations in the DNA mismatch repair genes MSH2 and MLH1 exhibited significantly increased global mutation rates across 213 targeted sequenced genes (MSH2: mean 30.1 mutations in carriers versus 3.0 in non-carriers; P = 3.8×10−7, one-sided Mann–Whitney U-test; MLH1: mean 35.3 mutations in carriers versus 3.1 in non-carriers; P = 8.8×10−6, one-sided Mann–Whitney U-test).

Fig. 4: Enrichment of protein-altering SNVs in ‘long-tail’ genes reveal a shared landscape of common and rare driver genes.
figure 4

a,b, Estimated mutation rates with 95% CIs of excess oncogenic SNVs in oncogenes (a) and pLoF variants in TSGs (b) that were not previously associated with a given cancer (x axis) in three large driver gene catalogs7,10,11. Stars indicate that the burden of oncogenic (pLoF) SNVs was significant in long-tail oncogenes (TSGs) in the cancer type (P values and number of SNVs per category are in Supplementary Tables 20 and 21). c,d, Oncogene–tumor pairs and TSG–tumor pairs with a significant burden of oncogenic or protein-truncating SNVs. Gene–tumor pairs previously reported by Dietlein et al.10, Bailey et al.11 or Martínez-Jiménez et al.7 are marked in gray. Pairs that are not present in those catalogs are marked in red, with color intensity indicating significance of association. Marker size is proportional to the estimated rate of excess mutations after accounting for cancer-specific neutral mutation rates. CNS, central nervous system; NSC, non-small-cell.

A further 29 gene–tumor pairs had a significant (FDR < 0.1) burden of pLoF mutations in genes not in the cancer driver databases for any cancer (Methods and Supplementary Table 26), of which two were significant at the more stringent Bonferroni (α < 0.05) correction for the total number of genes tested, and six were additionally supported by a nominal (P < 0.05) burden of missense mutations. The top hit is the cell polarity gene PARD3 in gastroesophageal cancer (9 observed pLoF SNVs versus 1.1 expected; P = 1.57 × 10−6), which, despite not appearing in major driver gene databases, is a known fusion partner of the oncogene RET and has been implicated in the tumorigenesis of multiple solid cancers58. The ability to distinguish mutational burdens in genes with a low frequency of mutations, such as PARD3 (nine carriers in 827 samples), highlights the increased statistical power that our approach can achieve by testing specific sets of mutations in large cohorts for evidence of positive selection.

Our results represent progress toward an unbiased, pan-cancer catalog of driver genes and suggest that driver mechanisms are shared across the common and rare driver landscape of solid cancers. However, computational identification of rare driver genes at current sample sizes relies upon small mutation counts, and predictions should be interpreted with care. Experimental characterization of the functions of genes in the relevant cancers is essential to confirming their carcinogenic roles.


Dig is a probabilistic deep learning method that enables rapid tests for evidence of positive selection on genomic elements that can be defined with the precision of individual mutations anywhere in the genome. The strong performance of the method in modeling mutation rates and identifying candidate drivers highlights the power of deep learning to capture complex cellular processes with data derived from high-throughput sequencing40,59,60,61,62,63. Specifically, building upon the observation that epigenetics correlate with somatic mutation rates17, we showed that neural networks applied to a corpus of high-resolution chromatin immunoprecipitation followed by sequencing (ChIP-seq) assays are able to learn nuanced, non-linear associations between local epigenetic structures and patterns of somatic mutations. Moreover, techniques presented here are adaptable to other contexts. For example, quantification of prediction uncertainty by coupling a Gaussian process to the final layer of a neural network may be a practical solution to improve the reliability and interpretability of predictions in other deep learning settings64.

The application of our high-resolution mutation rate maps to quantify mutational burdens genome-wide provides a glimpse into the landscape of rare and non-coding driver mutations that we anticipate will emerge as cancer sequence sample sizes continue to grow. Although the driver candidates we report—in cryptic splice sites, 5′ UTRs and rarely mutated genes—occurred at low frequencies individually, our estimates suggest that they collectively contribute to the disease pathology of up to 10% of tumors (summing across the percent of tumors predicted to carry excess mutations in each of these elements). This estimate may be conservative, as several analyses used datasets of mutations that are unlikely to be comprehensive (for example, catalogs of predicted cryptic splice SNVs and known activating SNVs). The quantification of these rare driver events is important, in part, because it suggests avenues to expand patient treatment options by repurposing therapeutics; a targeted therapy approved for a mutation in one cancer type may prove beneficial to patients with the same mutation in other cancer types. Indeed, cancer-agnostic approaches to patient stratification are currently being deployed at some cancer centers65.

Additionally, current sample sizes are not adequate to uncover infrequent drivers under moderate or weak positive selection. We anticipate that Dig will be particularly useful in uncovering such mutations due to its ability to rapidly evaluate mutations spread over large swaths of the genome. For instance, a preliminary analysis that we performed on enhancer networks identified several genes with a burden of enhancer mutations (Supplementary Table 27 and Supplementary Note 7), including FOXA1, in which promoter mutations are thought to drive breast cancer by increasing gene expression66. A possible approach to increase sample size with existing data is to call somatic mutations in regions flanking coding sequence using off-target reads from large targeted or whole-exome sequenced clinical cohorts.

However, computational prediction alone is not sufficient to establish the causal role of an element or mutation in cancer pathology because an excess of mutations compared to the neutral mutation rate does not definitively prove positive selection. Moreover, recent studies have shown that canonical cancer driver mutations can be present in seemingly healthy tissues67,68,69,70,71, adding an additional layer of complexity to interpreting whether or how a mutation causally contributes to a malignant phenotype. Ultimately, experimental validation is necessary to establish the causal role for a mutation as a driver of cancer. Dig provides a tool for in silico guidance of in vitro and invivo studies because it enables prioritization of precise sets of mutations that may act as drivers in both the coding and non-coding genome. These specific sets of mutations can then be evaluated in experimental systems. For example, the predicted cryptic splice mutations that Dig identified as putative drivers could be evaluated as possible drug targets by CRISPR base editing of cell lines, followed by drug screening assays72. Thus, we anticipate that deep learning generally, and our tool specifically, can improve computational, experimental and clinical utility of the growing body of cancer genome sequencing data.

Article classification: Biological abstract
Share to:
Add:Room A11-329, 1st Floor, No.1, SBI Venture Street, Optics Valley, East Lake
New Technology Development Zone, Wuhan, China.
Certificate NO.:U18Q28010569R0S