Large‑scale pathway specific polygenic risk and transcriptomic community network analysis identifies novel functional pathways in Parkinson disease - Table 7
Analysis Methodology
To assess PD risk, summary statistics from the Chang et al. PD GWAS meta-analysis, involving 26,035 PD cases and 403,190 controls of European ancestry, were used as the reference dataset for the primary analysis to define risk allele weights. Individual-level genotyping data from the most recent GWAS meta-analysis [17] that were not included in Chang et al. [7] were then randomly divided into training and testing datasets. The training dataset used to construct the PRS consisted of 7218 PD cases and 9424 controls, while the testing dataset used to validate the results consisted of 5429 PD cases and 5814 controls, all of European ancestry. A polygenic effect score (PES) was generated to estimate polygenic risk for each of the 2199 gene sets representative of biological pathways and then tested for association with PD. PES was calculated based on the weighted allele dose as implemented in PRSice2 (v2.1.1). The sequence kernel association test-optimal (SKAT-O) was implemented using default parameters in RVTESTS [35] to determine the difference in the aggregate burden of rare coding genetic variants (minor allele count ≥ 3) between PD cases and controls for the gene sets nominated by PRS. ANNOVAR was used for variant annotation. Baseline peri-diagnostic RNA sequencing data derived from the blood of 1612 PD patients and 1042 healthy subjects, available from the Parkinson’s Progression Markers Initiative (PPMI), were used to construct a network of expression communities based on a graph model with Louvain clusters. These cleaned and normalized data were downloaded from the Accelerating Medicines Partnership for Parkinson’s disease (AMP-PD) on March 1st, 2020. In the feature selection phase, scikit-learn’s ExtraTreesClassifier was used under default settings to extract coding gene features for inclusion in the network builds, retaining those likely to contribute to classifying cases versus controls and leaving 8.3k protein-coding genes for candidate networks.
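
As a rough illustration of the weighted allele dose behind the PES described in this section, the sketch below sums effect-size weight times risk-allele dose over the SNPs of one gene set. It is a minimal sketch, not PRSice2's implementation: clumping, p-value thresholding, and any score normalization are omitted, and the function name `pathway_score`, the SNP identifiers, and all numeric values are hypothetical placeholders.

```python
from typing import Dict, Iterable


def pathway_score(
    dosages: Dict[str, float],
    weights: Dict[str, float],
    gene_set_snps: Iterable[str],
) -> float:
    """Sum of (effect-size weight x risk-allele dose) over the SNPs of one gene set."""
    return sum(
        weights[snp] * dosages[snp]
        for snp in gene_set_snps
        if snp in weights and snp in dosages  # skip SNPs missing from either input
    )


# Hypothetical inputs: weights would come from the reference GWAS summary
# statistics, dosages (0 to 2 copies of the risk allele) from one individual.
weights = {"rs_a": 0.10, "rs_b": -0.05, "rs_c": 0.20}
sample = {"rs_a": 2.0, "rs_b": 1.0, "rs_c": 0.0}

# Only SNPs mapped to the gene set contribute to that pathway's score.
score = pathway_score(sample, weights, ["rs_a", "rs_b"])
print(round(score, 4))
```

Computing one such score per gene set and per individual yields the pathway-level predictors that are then tested for association with PD status.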
Following this feature extraction phase, controls were excluded, and case-only correlations were calculated for all remaining gene features. Next, this correlation structure was converted to a graph object using NetworkX. Subsequently, the Louvain algorithm was employed to build network communities within this graph object derived from the selected feature set. Finally, pathway enrichment analysis within expression communities was performed to further dissect their biological function using the g:GOSt function from g:Profiler. The significance of each pathway was assessed by hypergeometric tests, with Bonferroni correction to control the error rate within each network. Single-cell RNA sequencing data [25] based on a total of 9970 cells obtained from several mouse brain regions (neocortex, hippocampus, hypothalamus, striatum, and midbrain) were used to explore cell types associated with PD risk. Linear regression adjusted for the number of SNPs included in the PRS was performed to assess the trend of increased PRS R² per decile of cell-type expression specificity. Two-sample SMR was applied to explore the enrichment of cis eQTLs within the 46 gene sets nominated by our large-scale PRS analysis. The methodology can be interpreted as a test of whether the effect size of genetic variants influencing PD risk is mediated by gene expression or methylation, in order to prioritize genes underlying these gene sets for follow-up functional studies. Additionally, we studied expression patterns in blood from the largest eQTL meta-analysis to date. P-values were Bonferroni corrected for the number of genes tested per gene set, and a chi-squared test was applied to assess whether the proportion of QTLs per gene set was significantly higher than expected by chance.
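
The hypergeometric enrichment test with Bonferroni correction mentioned above can be sketched as follows. This is an illustration of the general technique using only the Python standard library, not the g:GOSt implementation; the function names `hypergeom_sf` and `enrich`, the gene labels, and the pathway names are hypothetical.

```python
from math import comb
from typing import Dict, Set


def hypergeom_sf(k: int, M: int, n: int, N: int) -> float:
    """P(X >= k) when N genes are drawn from M, of which n belong to the pathway."""
    total = comb(M, N)
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, min(n, N) + 1))
    return hits / total


def enrich(
    community: Set[str], pathways: Dict[str, Set[str]], background: Set[str]
) -> Dict[str, float]:
    """Bonferroni-adjusted enrichment p-value of each pathway in one community."""
    community = community & background
    n_tests = len(pathways)  # Bonferroni: multiply by the number of pathways tested
    adjusted = {}
    for name, members in pathways.items():
        members = members & background
        overlap = len(community & members)
        p = hypergeom_sf(overlap, len(background), len(members), len(community))
        adjusted[name] = min(1.0, p * n_tests)
    return adjusted


# Toy example: 20 background genes, one community of 5 genes, two pathways.
background = {f"g{i}" for i in range(20)}
community = {"g0", "g1", "g2", "g3", "g4"}
pathways = {
    "pathway_A": {"g0", "g1", "g2", "g10", "g11"},  # 3 of 5 members in community
    "pathway_B": {"g15", "g16", "g17", "g18", "g19"},  # no overlap
}
result = enrich(community, pathways, background)
print(result)
```

Restricting both the community and the pathway to the background set keeps the test consistent with the genes that were actually eligible for selection.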
To prioritize the top genes within significant gene sets showing the highest cumulative effect on PD risk, individual gene-based SKAT-O analyses were performed considering a minor allele frequency (MAF) threshold of ≤3% and three functional categories (missense, loss of function, and CADD score >12). The resulting gene-level prioritization is reported in Supplementary Table 7.
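
One way to picture the variant selection that precedes such gene-based tests is the sketch below, which groups variants per gene into the three functional categories after applying the MAF cutoff. It assumes that each category forms a separate variant set and that variants are already annotated; the `Variant` record, the consequence labels, and the toy values are hypothetical, and the SKAT-O test itself (run in RVTESTS) is not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, List

MAF_MAX = 0.03   # MAF threshold of <= 3%
CADD_MIN = 12.0  # CADD score must exceed 12


@dataclass(frozen=True)
class Variant:
    variant_id: str
    gene: str
    maf: float
    consequence: str  # hypothetical labels: "missense", "loss_of_function", ...
    cadd: float


def variant_sets(variants: List[Variant]) -> Dict[str, Dict[str, List[str]]]:
    """Group variant IDs per gene and functional category after the MAF filter."""
    sets: Dict[str, Dict[str, List[str]]] = {}
    for v in variants:
        if v.maf > MAF_MAX:
            continue  # too common for the rare-variant test
        per_gene = sets.setdefault(
            v.gene, {"missense": [], "loss_of_function": [], "cadd_gt_12": []}
        )
        if v.consequence in ("missense", "loss_of_function"):
            per_gene[v.consequence].append(v.variant_id)
        if v.cadd > CADD_MIN:
            per_gene["cadd_gt_12"].append(v.variant_id)
    return sets


# Toy, made-up variants for illustration only.
toy = [
    Variant("var1", "GENE_A", 0.010, "missense", 15.0),
    Variant("var2", "GENE_A", 0.002, "loss_of_function", 30.0),
    Variant("var3", "GENE_A", 0.100, "missense", 20.0),  # removed by the MAF filter
    Variant("var4", "GENE_B", 0.030, "synonymous", 5.0),  # passes MAF, no category
]
masks = variant_sets(toy)
print(masks)
```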
Underlying Analyses
GWAS
Gene
Here, we present a novel high-throughput and hypothesis-free approach to detect the existence of PD genetic risk linked to any particular biological pathway. We apply polygenic risk scores (PRS) to a total of 2199 curated and well-defined gene sets representative of canonical pathways publicly available in the Molecular Signatures Database v7.2 (MSigDB) to define the cumulative effect of pathway-specific genetic variation on PD risk. To assess the impact of rare variation on PD risk explained by significant pathways, we perform gene-set burden analyses in an independent cohort of whole-genome sequencing (WGS) data, including 2101 cases and 2230 controls. Additionally, we explore cell-type expression specificity enrichment linked to PD etiology by using single-cell RNA sequencing data from brain cells. Furthermore, we use graph-based analyses to generate de novo pathways that could be involved in disease etiology by constructing a transcriptome map of network communities based on RNA sequencing data derived from the blood of 1612 PD patients and 1042 healthy subjects. Subsequently, we perform summary-data-based Mendelian randomization (SMR) analyses to prioritize genes from significant gene sets by exploring possible genomic associations with expression quantitative trait loci (eQTL) in public databases, and we nominate overlapping genes within our transcriptome communities for follow-up functional studies. Finally, we present a user-friendly platform for the PD research community that enables easy and interactive access to these results: Pathways Browser.
Whole Blood
Genomics, Transcriptomics
BioFIND, NABEC, LNG Path confirmed, Parkinson's Disease Biomarkers Program (PDBP), Parkinson's Progression Markers Initiative (PPMI), NIH PD CLINIC, WELLDERLY, UKBEC
Data Dictionary
Field Name | Field Name Expanded | Short Description (optional) |
---|---|---|
gene_name | Gene Name | |
hgnc_symbol | HGNC Gene Symbol | |
ensembl_id | Ensembl ID | |
gene_type | HGNC Gene Locus Type | |
NumVar | Num variants | |
Q | Variance-component score statistic | |
p_val | p-value | |
Gene_set | Gene set | Annotated gene set from curated pathways, e.g., KEGG |
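
A table with the fields listed in this data dictionary could be read into typed records roughly as sketched below. The tab delimiter, the `GeneResult` class, the `read_results` helper, and every value in the example rows are assumptions made for illustration; they are not taken from the actual file.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, TextIO


@dataclass
class GeneResult:
    gene_name: str
    hgnc_symbol: str
    ensembl_id: str
    gene_type: str
    num_var: int   # NumVar: number of variants in the gene-based test
    q: float       # Q: variance-component score statistic
    p_val: float
    gene_set: str


def read_results(handle: TextIO) -> List[GeneResult]:
    """Parse rows keyed by the data-dictionary field names into typed records."""
    reader = csv.DictReader(handle, delimiter="\t")  # delimiter is an assumption
    return [
        GeneResult(
            gene_name=row["gene_name"],
            hgnc_symbol=row["hgnc_symbol"],
            ensembl_id=row["ensembl_id"],
            gene_type=row["gene_type"],
            num_var=int(row["NumVar"]),
            q=float(row["Q"]),
            p_val=float(row["p_val"]),
            gene_set=row["Gene_set"],
        )
        for row in reader
    ]


# Made-up placeholder rows; real identifiers and statistics will differ.
EXAMPLE = (
    "gene_name\thgnc_symbol\tensembl_id\tgene_type\tNumVar\tQ\tp_val\tGene_set\n"
    "GENE_B\tGENE_B\tENSG_PLACEHOLDER_B\tplaceholder_type\t7\t3.2\t0.4\tPLACEHOLDER_SET_2\n"
    "GENE_A\tGENE_A\tENSG_PLACEHOLDER_A\tplaceholder_type\t4\t10.5\t0.02\tPLACEHOLDER_SET_1\n"
)

# Sort genes by ascending p-value for a quick prioritized view.
rows = sorted(read_results(io.StringIO(EXAMPLE)), key=lambda r: r.p_val)
print([r.gene_name for r in rows])
```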