Blog

The explosion of biobank data offers unprecedented opportunities for gene-environment interaction (G×E) studies of complex diseases because of the large sample sizes and the rich collection of genetic and non-genetic information. However, the extremely large sample size also introduces new computational challenges in G×E assessment, especially for set-based G×E variance component (VC) tests, which are a widely used strategy to boost overall G×E signals and to evaluate the joint G×E effect of multiple variants from a biologically meaningful unit (e.g., gene). In this work, we focus on continuous traits and present SEAGLE, a Scalable Exact AlGorithm for Large-scale set-based G×E tests, to permit G×E VC tests for biobank-scale data. SEAGLE employs modern matrix computations to calculate the test statistic and p-value of the G×E VC test in a computationally efficient fashion, without imposing additional assumptions or relying on approximations. SEAGLE can easily accommodate sample sizes on the order of 10⁵, is implementable on standard laptops, and does not require specialized computing equipment. We demonstrate the performance of SEAGLE using extensive simulations. We illustrate its utility by conducting genome-wide gene-based G×E analysis on the Taiwan Biobank data to explore the interaction of gene and physical activity status on body mass index.
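
To make the idea of a set-based G×E statistic concrete, here is a minimal sketch in plain Python. It is not the SEAGLE algorithm: it assumes that residuals from some already-fitted null model are available, and it uses a generic score-type statistic Q = r'SS'r, where S holds one G×E term per variant (exposure times allele count). The function names and the toy data are hypothetical, and no p-value is computed.

```python
def gxe_interaction_matrix(genotypes, environment):
    """Return S with S[i][j] = E_i * G_ij (one GxE term per variant)."""
    return [[e * g for g in row] for row, e in zip(genotypes, environment)]


def gxe_score_statistic(residuals, genotypes, environment):
    """Q = r' S S' r, i.e. the sum over variants of (sum_i r_i * E_i * G_ij)^2."""
    s = gxe_interaction_matrix(genotypes, environment)
    n_variants = len(genotypes[0])
    q = 0.0
    for j in range(n_variants):
        # Score contribution of variant j's interaction term.
        score_j = sum(r * row[j] for r, row in zip(residuals, s))
        q += score_j ** 2
    return q


# Toy data: 4 subjects, 2 variants (minor-allele counts), binary exposure.
G = [[0, 1], [1, 0], [2, 1], [0, 0]]
E = [1, 0, 1, 0]
r = [0.5, -0.2, 0.1, -0.4]  # hypothetical null-model residuals (assumption)
Q = gxe_score_statistic(r, G, E)
```

The loop costs time proportional to subjects times variants and never forms the subject-by-subject matrix SS', which is the kind of saving that matters at biobank scale.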

Alzheimer’s Disease (AD) is a progressive neurologic disease and the most common form of dementia. While the causes of AD are not completely understood, genetics plays a key role in the etiology of AD, and thus finding genetic factors holds the potential to uncover novel AD mechanisms. For this study, we focus on copy number variation (CNV) detection and burden analysis. Leveraging whole-genome sequence (WGS) data released by the Alzheimer’s Disease Sequencing Project (ADSP), we developed a scalable bioinformatics pipeline to identify CNVs. This pipeline was applied to 1,737 AD cases and 2,063 cognitively normal controls. As a result, we observed 237,306 and 42,767 deletions and duplications, respectively, with an average of 2,255 deletions and 1,820 duplications per subject. The burden tests show that Non-Hispanic-White cases have, on average, 16 more duplications than controls (p-value 2e-6), and Hispanic cases have larger deletions than controls (p-value 6.8e-5).
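
The burden comparison described above can be illustrated with a small Python sketch. It only counts CNV calls per subject and reports the difference in mean count between cases and controls; the study's actual burden test, covariates, and p-value calculation are not described here and are not reproduced. The data layout, function names, and subject IDs are hypothetical.

```python
from collections import Counter
from statistics import mean


def cnv_counts(calls, cnv_type):
    """Count calls of one type ('DEL' or 'DUP') per subject."""
    return Counter(subject for subject, kind in calls if kind == cnv_type)


def burden_difference(calls, status, cnv_type):
    """Mean per-subject CNV count in cases minus the mean in controls."""
    counts = cnv_counts(calls, cnv_type)
    # Subjects with no calls of this type contribute a count of zero.
    cases = [counts.get(s, 0) for s, is_case in status.items() if is_case]
    controls = [counts.get(s, 0) for s, is_case in status.items() if not is_case]
    return mean(cases) - mean(controls)


# Toy input: (subject, CNV type) calls and case/control labels.
calls = [("s1", "DUP"), ("s1", "DUP"), ("s1", "DEL"), ("s2", "DUP"),
         ("s3", "DEL"), ("s4", "DUP"), ("s4", "DUP"), ("s4", "DUP")]
status = {"s1": True, "s2": False, "s3": False, "s4": True}

dup_diff = burden_difference(calls, status, "DUP")
del_diff = burden_difference(calls, status, "DEL")
```

In a real analysis this difference would be tested within ancestry groups with a regression or permutation framework rather than read off directly.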

The INFERNO method provides an integrative computational framework for characterizing the causal variants, tissue contexts, affected regulatory mechanisms, and target genes underlying noncoding genetic variants associated with any phenotype or disease of interest. Here we describe the computational steps required to run the full INFERNO pipeline on any dataset of interest.

Protein aggregation is the hallmark of neurodegeneration, but the molecular mechanisms underlying late-onset Alzheimer’s disease (AD) are unclear. Here we integrated transcriptomic, proteomic and epigenomic analyses of postmortem human brains to identify molecular pathways involved in AD. RNA sequencing analysis revealed upregulation of transcription- and chromatin-related genes, including the histone acetyltransferases for H3K27ac and H3K9ac. An unbiased proteomic screening singled out H3K27ac and H3K9ac as the main enrichments specific to AD. In turn, epigenomic profiling revealed gains in the histone H3 modifications H3K27ac and H3K9ac linked to transcription, chromatin and disease pathways in AD. Increasing genome-wide H3K27ac and H3K9ac in a fly model of AD exacerbated amyloid-β42-driven neurodegeneration. Together, these findings suggest that AD involves a reconfiguration of the epigenome, wherein H3K27ac and H3K9ac affect disease pathways by dysregulating transcription- and chromatin-gene feedback loops. The identification of this process highlights potential epigenetic strategies for early-stage disease treatment.

The aim of this study was to explore whether variants in LRP10, recently associated with Parkinson’s disease and dementia with Lewy bodies, are observed in 2 large cohorts (discovery and validation cohort) of patients with progressive supranuclear palsy (PSP). A total of 950 patients with PSP were enrolled: 246 patients with PSP (n = 85 possible (35%), n = 128 probable (52%), n = 33 definite (13%)) in the discovery cohort and 704 patients with definite PSP in the validation cohort. Sanger sequencing of all LRP10 exons and exon-intron boundaries was performed in the discovery cohort, and whole-exome sequencing was performed in the validation cohort. Two patients from the discovery cohort and 8 patients from the validation cohort carried a rare, heterozygous, and possibly pathogenic LRP10 variant (p.Gly326Asp, p.Asp389Asn, and p.Arg158His, p.Cys220Tyr, p.Thr278Ala, p.Gly306Asp, p.Glu486Asp, p.Arg554∗, p.Arg661Cys). In conclusion, possibly pathogenic LRP10 variants occur in a small fraction of patients with PSP and may be overrepresented in these patients compared with controls. This suggests that possibly pathogenic LRP10 variants may play a role in the development of PSP.

INTRODUCTION: Altered lipid metabolism is implicated in Alzheimer’s disease (AD), but the mechanisms remain obscure. Aging-related declines in circulating plasmalogens containing omega-3 fatty acids may increase AD risk by reducing plasmalogen availability.
METHODS: We measured four ethanolamine plasmalogens (PlsEtns) and four closely related phosphatidylethanolamines (PtdEtns) from the Alzheimer’s Disease Neuroimaging Initiative (ADNI; n = 1547 serum) and University of Pennsylvania (UPenn; n = 112 plasma) cohorts, and derived indices reflecting PlsEtn and PtdEtn metabolism: PL-PX (PlsEtns), PL/PE (PlsEtn/PtdEtn ratios), and PBV (plasmalogen biosynthesis value; a composite index). We tested associations with baseline diagnosis, cognition, and cerebrospinal fluid (CSF) AD biomarkers.
RESULTS: Results revealed statistically significant negative relationships in ADNI between AD versus CN with PL-PX (P = 0.007) and PBV (P = 0.005), late mild cognitive impairment (LMCI) versus cognitively normal (CN) with PL-PX (P = 2.89 × 10⁻⁵) and PBV (P = 1.99 × 10⁻⁴), and AD versus LMCI with PL/PE (P = 1.85 × 10⁻⁴). In the UPenn cohort, AD versus CN diagnosis associated negatively with PL/PE (P = 0.0191) and PBV (P = 0.0296). In ADNI, cognition was negatively associated with plasmalogen indices, including Alzheimer’s Disease Assessment Scale 13-item cognitive subscale (ADAS-Cog13; PL-PX: P = 3.24 × 10⁻⁶; PBV: P = 6.92 × 10⁻⁵) and Mini-Mental State Examination (MMSE; PL-PX: P = 1.28 × 10⁻⁹; PBV: P = 6.50 × 10⁻⁹). In the UPenn cohort, there was a trend toward a similar relationship of MMSE with PL/PE (P = 0.0949). In ADNI, CSF total-tau was negatively associated with PL-PX (P = 5.55 × 10⁻⁶) and PBV (P = 7.77 × 10⁻⁶). Additionally, CSF t-tau/Aβ1-42 ratio was negatively associated with these same indices (PL-PX, P = 2.73 × 10⁻⁶; PBV, P = 4.39 × 10⁻⁶). In the UPenn cohort, PL/PE was negatively associated with CSF total-tau (P = 0.031) and t-tau/Aβ1-42 (P = 0.021). CSF Aβ1-42 was not significantly associated with any of these indices in either cohort.
DISCUSSION: These data extend previous studies by showing an association of decreased plasmalogen indices with AD, mild cognitive impairment (MCI), cognition, and CSF tau. Future studies are needed to better define mechanistic relationships, and to test the effects of interventions designed to replete serum plasmalogens.

Approximately 30% of older adults exhibit the neuropathological features of Alzheimer’s disease without signs of cognitive impairment. Yet, little is known about the genetic factors that allow these potentially resilient individuals to remain cognitively unimpaired in the face of substantial neuropathology. We performed a large, genome-wide association study (GWAS) of two previously validated metrics of cognitive resilience quantified using a latent variable modelling approach and representing better-than-predicted cognitive performance for a given level of neuropathology. Data were harmonized across 5108 participants from a clinical trial of Alzheimer’s disease and three longitudinal cohort studies of cognitive ageing. All analyses were run across all participants and repeated restricting the sample to individuals with unimpaired cognition to identify variants at the earliest stages of disease. As expected, all resilience metrics were genetically correlated with cognitive performance and education attainment traits (P-values < 2.5 × 10⁻²⁰), and we observed novel correlations with neuropsychiatric conditions (P-values 0.42) nor associated with APOE (P-values > 0.13). In single variant analyses, we observed a genome-wide significant locus among participants with unimpaired cognition on chromosome 18 upstream of ATP8B1 (index single nucleotide polymorphism rs2571244, minor allele frequency = 0.08, P = 2.3 × 10⁻⁸). The top variant at this locus (rs2571244) was significantly associated with methylation in prefrontal cortex tissue at multiple CpG sites, including one just upstream of ATP8B1 (cg19596477; P = 2 × 10⁻¹³). Overall, this comprehensive genetic analysis of resilience implicates a putative role of vascular risk, metabolism, and mental health in protection from the cognitive consequences of neuropathology, while also providing evidence for a novel resilience gene along the bile acid metabolism pathway. Furthermore, the genetic architecture of resilience appears to be distinct from that of clinical Alzheimer’s disease, suggesting that a shift in focus to molecular contributors to resilience may identify novel pathways for therapeutic targets.

The Alzheimer’s Disease Sequencing Project (ADSP) undertook whole exome sequencing in 5,740 late-onset Alzheimer disease (AD) cases and 5,096 cognitively normal controls primarily of European ancestry (EA), among whom 218 cases and 177 controls were Caribbean Hispanic (CH). An age-, sex- and APOE-based risk score and family history were used to select cases most likely to harbor novel AD risk variants and controls least likely to develop AD by age 85 years. We tested ~1.5 million single nucleotide variants (SNVs) and 50,000 insertion-deletion polymorphisms (indels) for association with AD, using multiple models considering individual variants as well as gene-based tests aggregating rare, predicted functional, and loss of function variants. Sixteen single variants and 19 genes that met criteria for significant or suggestive associations after multiple-testing correction were evaluated for replication in four independent samples: three with whole exome sequencing (2,778 cases, 7,262 controls) and one with genome-wide genotyping imputed to the Haplotype Reference Consortium panel (9,343 cases, 11,527 controls). The top findings in the discovery sample were also followed up in the ADSP whole-genome sequenced family-based dataset (197 members of 42 EA families and 501 members of 157 CH families). We identified novel and predicted functional genetic variants in genes previously associated with AD. We also detected associations in three novel genes: IGHG3 (p = 9.8 × 10⁻⁷), an immunoglobulin gene whose antibodies interact with β-amyloid, a long non-coding RNA AC099552.4 (p = 1.2 × 10⁻⁷), and a zinc-finger protein ZNF655 (gene-based p = 5.0 × 10⁻⁶). The latter two suggest an important role for transcriptional regulation in AD pathogenesis.

SUMMARY: We report Spark-based INFERence of the molecular mechanisms of NOn-coding genetic variants (SparkINFERNO), a scalable bioinformatics pipeline characterizing non-coding genome-wide association study (GWAS) association findings. SparkINFERNO prioritizes causal variants underlying GWAS association signals and reports relevant regulatory elements, tissue contexts and plausible target genes they affect. To achieve this, the SparkINFERNO algorithm integrates GWAS summary statistics with a large-scale collection of functional genomics datasets spanning enhancer activity, transcription factor binding, expression quantitative trait loci and other functional datasets across more than 400 tissues and cell types. Scalability is achieved by an underlying API implemented using Apache Spark and Giggle-based genomic indexing. We evaluated SparkINFERNO on large GWASs and show that SparkINFERNO is more than 60 times more efficient and scales with data size and amount of computational resources.
AVAILABILITY AND IMPLEMENTATION: SparkINFERNO runs on clusters or a single server with Apache Spark environment, and is available at https://bitbucket.org/wanglab-upenn/SparkINFERNO or https://hub.docker.com/r/wanglab/spark-inferno.
CONTACT: lswang@pennmedicine.upenn.edu.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
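
The core operation behind annotating GWAS variants with functional genomics tracks is an interval overlap query. The sketch below shows that idea in plain Python with a sorted per-chromosome index and binary search. It is an illustration only: it does not use Apache Spark or Giggle, it is not the SparkINFERNO API, and the function names, half-open coordinate convention, and annotation labels are all assumptions.

```python
from bisect import bisect_right


def build_index(intervals):
    """Group (chrom, start, end, label) records by chromosome, sorted by start."""
    grouped = {}
    for chrom, start, end, label in intervals:
        grouped.setdefault(chrom, []).append((start, end, label))
    index = {}
    for chrom, records in grouped.items():
        records.sort()
        starts = [start for start, _, _ in records]
        index[chrom] = (starts, records)
    return index


def overlapping_labels(index, chrom, pos):
    """Labels of all half-open intervals [start, end) that contain pos."""
    starts, records = index.get(chrom, ([], []))
    # Only intervals starting at or before pos can contain it.
    hi = bisect_right(starts, pos)
    return [label for _, end, label in records[:hi] if pos < end]


# Hypothetical annotation tracks and variant positions.
tracks = [("chr1", 100, 200, "enhancer:liver"),
          ("chr1", 150, 300, "TFBS:CTCF"),
          ("chr2", 50, 60, "eQTL:brain")]
index = build_index(tracks)

hits_160 = overlapping_labels(index, "chr1", 160)
hits_250 = overlapping_labels(index, "chr1", 250)
hits_edge = overlapping_labels(index, "chr2", 60)
```

This version still scans every interval that starts before the query position, so a dedicated genomic index is what makes such lookups practical across hundreds of tissues and cell types.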

Most regulatory chromatin interactions are mediated by various transcription factors (TFs) and involve physically interacting elements such as enhancers, insulators or promoters. To map these elements and interactions at a fine scale, we developed HIPPIE2, which analyzes raw reads from high-throughput chromosome conformation (Hi-C) experiments to identify precise loci of DNA physically interacting regions (PIRs). Unlike standard genome binning approaches (e.g. 10-kb to 1-Mb bins), HIPPIE2 dynamically infers the physical locations of PIRs using the distribution of restriction sites to increase analysis precision and resolution. We applied HIPPIE2 to in situ Hi-C datasets across six human cell lines (GM12878, IMR90, K562, HMEC, HUVEC, NHEK) with matched ENCODE/Roadmap functional genomic data. HIPPIE2 detected 1,042,738 distinct PIRs, with high resolution (average PIR length of 1,006 bp) and high reproducibility (92.3% in GM12878). PIRs are enriched for epigenetic marks (H3K27ac, H3K4me1) and open chromatin, suggesting active regulatory roles. HIPPIE2 identified 2.8 million significant PIR-PIR interactions, 27.2% of which were enriched for TF binding sites. A total of 50,608 interactions were enhancer-promoter interactions and were enriched for 33 TFs, including known DNA looping/long-range mediators. These findings demonstrate that the novel dynamic approach of HIPPIE2 (https://bitbucket.com/wanglab-upenn/HIPPIE2) enables the characterization of chromatin and regulatory interactions with high resolution and reproducibility.
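
As a rough illustration of working from restriction sites rather than fixed-size bins, the Python sketch below locates a recognition motif in a sequence and derives the fragments between consecutive sites. It is not HIPPIE2's method for inferring PIRs. The motif GATC is used purely as an example (the enzyme is an assumption, since the text above does not name one), and the function names and toy sequence are hypothetical.

```python
def restriction_sites(sequence, motif="GATC"):
    """Return 0-based start positions of every occurrence of the motif."""
    sites = []
    start = sequence.find(motif)
    while start != -1:
        sites.append(start)
        # Step one base forward so overlapping occurrences are also found.
        start = sequence.find(motif, start + 1)
    return sites


def fragments(sequence_length, sites):
    """Half-open (start, end) fragments delimited by consecutive sites."""
    inner = [s for s in sites if 0 < s < sequence_length]
    boundaries = [0] + inner + [sequence_length]
    return list(zip(boundaries[:-1], boundaries[1:]))


# Toy sequence with two occurrences of the example motif.
seq = "AAGATCTTTTGATCCC"
sites = restriction_sites(seq)
frags = fragments(len(seq), sites)
```

Fragment boundaries defined this way vary in length with the local density of sites, which is why site-based coordinates can be finer than 10-kb to 1-Mb bins.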