Machine learning nominates the inositol pathway and novel genes in Parkinson’s disease
Analysis Methodology
Our objective was to nominate, for each locus in the most recent PD GWAS, the gene most likely to be involved in PD (see Figure 1 for the study protocol). We first defined all genes and SNPs within these loci (see below) and then used a machine learning approach to nominate the top gene in each locus. Based on the previous literature and consensus among the authors, we identified seven genes from well-established PD loci that can be considered the likeliest driver genes of their respective loci (GBA1, LRRK2, SNCA, GCH1, MAPT, TMEM175, VPS13C). We then acquired data for multiple features, including distance measures from the top SNPs, QTLs, expression in relevant tissues and cell types, and predicted variant consequences (78 of 284 features were retained after removal of redundant features; Supplementary Table 1). We trained a machine learning model using the seven well-established PD genes as positive labels and 212 other genes in the same loci as negative labels (i.e., unlikely to drive the association with PD, since the driver gene of those loci is already well established). The model generated a prediction score for each gene within each locus, reflecting its potential involvement in PD, and the gene with the highest score in each locus was nominated as the gene associated with PD. We then performed multiple post hoc analyses to further validate and explore our results: burden tests for rare variants in the top-scoring genes, pathway enrichment and pathway PRS analyses, differential expression analyses, and structural analyses of candidate coding variants.
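The per-locus nomination step described above (score every gene, rank within each locus, nominate the top-scoring gene) can be illustrated with a minimal sketch. The gene names, locus names and scores below are hypothetical placeholders, not results from the study, and the helper names `rank_within_locus` and `nominate` are introduced here only for illustration.

```python
from collections import defaultdict

# Hypothetical (gene, locus, prediction score) records; values are made up.
predictions = [
    ("GENE_A", "locus_1", 0.81),
    ("GENE_B", "locus_1", 0.12),
    ("GENE_C", "locus_1", 0.33),
    ("GENE_D", "locus_2", 0.07),
    ("GENE_E", "locus_2", 0.64),
]


def rank_within_locus(records):
    """Return {locus: [(rank, gene, score), ...]}, ordered by descending score."""
    by_locus = defaultdict(list)
    for gene, locus, score in records:
        by_locus[locus].append((gene, score))
    ranked = {}
    for locus, genes in by_locus.items():
        ordered = sorted(genes, key=lambda g: g[1], reverse=True)
        ranked[locus] = [(i + 1, gene, score) for i, (gene, score) in enumerate(ordered)]
    return ranked


def nominate(records):
    """Pick the rank-1 (highest-scoring) gene in each locus."""
    return {locus: rows[0][1] for locus, rows in rank_within_locus(records).items()}


ranked = rank_within_locus(predictions)
nominated = nominate(predictions)
print(nominated)
```

In this sketch the rank and score play the roles of the "Prediction Rank" and "Prediction Probability" fields in the data dictionary; that mapping is an assumption, since the dictionary does not describe those fields.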
Underlying Analyses
Machine Learning
Gene
Genome-wide association studies (GWAS) have nominated many variants associated with complex traits. In Parkinson’s disease (PD), the most recent GWAS revealed 90 independent risk variants across 78 genomic loci. Although many of the associated single-nucleotide polymorphisms (SNPs) are in novel genomic loci, well-established PD genes discovered many years ago, such as LRRK2, PINK1, DJ-1, SNCA, GBA1, PRKN and MAPT, still account for the vast majority of research on PD. Several limitations of GWAS hinder downstream functional analyses. First, more than 90% of all GWAS-significant SNPs are in noncoding regions, and these SNPs are often passenger variants due to complex linkage disequilibrium (LD). Second, the causal gene associated with the causal SNPs remains unclear in most GWAS loci. To overcome these challenges, downstream GWAS analyses have been developed with the aim of identifying causal genes within GWAS loci. These include fine-mapping and colocalization methods to nominate causal SNPs, as well as transcriptome-wide association studies to nominate gene-trait associations. These methods use LD structure and gene expression panels to identify causal SNPs and genes. While they may propose causal variants and genes, additional biological evidence is generally required to pair causal variants with causal genes. Multi-omic analyses can integrate a diverse range of cellular and biological datasets, such as genomic, transcriptomic and epigenetic datasets, and one can use platforms such as Open Targets Genetics (https://genetics.opentargets.org/) to perform systematic analyses of gene prioritization across all publicly available GWASs. Although powerful, Open Targets Genetics lacks data from disease-specific tissues and cell types relevant to PD, such as dopaminergic neurons and microglia. Using a similar approach with PD-relevant data, we may discover additional pathways and genetic targets involved in PD.
In this study, we leveraged PD-relevant transcriptomic, epigenomic and other datasets in our gradient boosting model (Figure 1). We trained this model on well-established PD genes to nominate causal genes from PD GWAS loci.
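To make the idea of a gradient boosting classifier trained on a few positive and many negative genes more concrete, here is a minimal pure-Python sketch. It is not the study's implementation: the source does not state the library, tree depth or hyperparameters, so this example assumes depth-1 trees (stumps), log-loss gradients, and two hypothetical features (`distance_kb`, `has_eqtl`) with made-up values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Fit a depth-1 regression tree: the split minimizing squared error on residuals."""
    best = None  # (error, feature index, threshold, left value, right value)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, thr, lv, rv)
    if best is None:  # every feature is constant: fall back to the mean residual
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, row):
    j, thr, lv, rv = stump
    return lv if row[j] <= thr else rv


def fit_gbm(X, y, n_rounds=50, learning_rate=0.3):
    """Gradient boosting for binary log-loss with stumps as base learners."""
    pos_rate = sum(y) / len(y)
    f0 = math.log(pos_rate / (1.0 - pos_rate))  # initial log-odds
    scores = [f0] * len(X)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of log-loss with respect to the current score.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_predict(stump, row)
                  for s, row in zip(scores, X)]
    return f0, stumps, learning_rate


def predict_proba(model, row):
    f0, stumps, lr = model
    return sigmoid(f0 + lr * sum(stump_predict(st, row) for st in stumps))


# Hypothetical toy data: [distance_kb, has_eqtl]; label 1 = established driver gene.
X = [[5.0, 1.0], [12.0, 1.0], [150.0, 0.0], [300.0, 1.0],
     [80.0, 0.0], [220.0, 0.0], [400.0, 1.0], [95.0, 0.0]]
y = [1, 1, 0, 0, 0, 0, 0, 0]

model = fit_gbm(X, y)
train_probs = [predict_proba(model, row) for row in X]
new_gene_prob = predict_proba(model, [8.0, 1.0])  # unseen hypothetical gene
print(round(new_gene_prob, 3))
```

In practice one would more likely use an established gradient boosting library rather than a hand-written loop; the sketch only shows the mechanics of fitting successive trees to the gradient of the loss and turning the summed output into a per-gene probability.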
Brain Tissue, Whole Blood
Genomics, Transcriptomics, Epigenetics
FOUNDIN-PD, McGill, Parkinson's Progression Markers Initiative (PPMI), BioFIND, Parkinson's Disease Biomarkers Program (PDBP), Harvard Biomarkers Study (HBS), NINDS Study of Isradipine as a Disease-modifying Agent in Subjects With Early Parkinson Disease, Phase 3 (STEADY-PD3), Vance (dbGaP phs000394), International Parkinson's Disease Genomics Consortium (IPDGC) NeuroX dataset (dbGaP phs000918.v1.p1), National Institute of Neurological Disorders and Stroke (NINDS) Genome-Wide genotyping in Parkinson's Disease (dbGaP phs000089.v4.p2), NeuroGenetics Research Consortium (NGRC) (dbGaP phs000196.v3.p1), UK Biobank, GTEx, SMR, Cuomo et al. 2020, Bryois et al. 2021, Kamath et al. 2021
Data Dictionary
Field Name | Field Name Expanded | Short Description (optional) |
---|---|---|
Gene | Gene Name | |
Locus | Gene locus | |
ensembl_id | Ensembl ID | |
gene_type | HGNC Gene Locus Type | |
Prediction Rank | | |
Prediction Probability | | |
Nearest gene GWAS | | |
Nearest gene based on distance | | |
Nearest gene based on TSS | | |