Blog - Wang Lab

Exploring Random Forest in Genetic Risk Score Construction

Genetic risk scores (GRS) are crucial tools for estimating an individual’s genetic liability to various traits and diseases, computed as a weighted sum of trait-associated allele counts. Traditionally, GRS models assume additive, linear effects of risk variants. However, complex traits often involve nonadditive interactions, such as epistasis, which are not captured by these conventional methods. In this study, we investigate the use of random forest (RF) models as a model-free approach for constructing GRS, leveraging RF’s capacity to capture complex, nonlinear interactions among genetic variants. Specifically, we introduce two new RF-based GRS strategies to boost RF performance and to incorporate base data information if available, including (1) ctRF, which optimizes linkage disequilibrium (LD) clumping and p-value thresholds within RF; and (2) wRF, which adjusts the chance of SNP inclusion in tree nodes based on their association strength. Through simulation studies and real data applications of Alzheimer’s disease, body mass index, and atopy, we find that ctRF consistently outperforms other RF-based methods and classical additive models when traits exhibit complex genetic architectures. Additionally, incorporating informative base data into RF-GRS construction can enhance predictive accuracy. Our findings suggest that RF-based GRS can effectively capture intricate genetic interactions, and offer a robust alternative to traditional GRS methods, especially for complex traits with nonlinear genetic effects.
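
As a rough illustration of the ideas in this abstract, the sketch below computes a classical additive GRS as a weighted sum of allele counts with an optional p-value threshold, and shows one possible way to make SNP sampling depend on association strength, in the spirit of wRF. The function names, the use of -log10(p) as the sampling weight, and the toy numbers are assumptions made for illustration; this is not the authors' implementation of ctRF or wRF.

```python
import math
import random


def genetic_risk_score(allele_counts, weights, p_values=None, p_threshold=1.0):
    """Additive GRS: sum of weight * allele count over SNPs passing the threshold.

    allele_counts: risk-allele counts (0, 1, or 2) for one individual.
    weights: per-SNP effect sizes taken from base (discovery) data.
    p_values: optional per-SNP association p-values from the base data.
    """
    if len(allele_counts) != len(weights):
        raise ValueError("allele_counts and weights must have the same length")
    if p_values is None:
        # Without p-values, every SNP is kept.
        p_values = [0.0] * len(weights)
    return sum(
        w * g
        for g, w, p in zip(allele_counts, weights, p_values)
        if p <= p_threshold
    )


def sample_candidate_snps(p_values, k, rng):
    """Draw k distinct SNP indices with chance proportional to -log10(p).

    Hypothetical weighting scheme: a stand-in for the idea of favoring
    strongly associated SNPs when choosing split candidates at a tree node.
    """
    remaining = list(range(len(p_values)))
    # A small floor keeps the total weight positive even when p == 1.
    scores = [max(-math.log10(p), 1e-12) for p in p_values]
    chosen = []
    for _ in range(min(k, len(remaining))):
        pick = rng.choices(remaining, weights=[scores[i] for i in remaining], k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


# Toy data (made up): three SNPs for one individual.
counts = [0, 1, 2]
betas = [0.10, -0.05, 0.20]
pvals = [1e-8, 0.5, 1e-3]

score_all = genetic_risk_score(counts, betas)
score_thresholded = genetic_risk_score(counts, betas, pvals, p_threshold=0.01)
candidates = sample_candidate_snps(pvals, k=2, rng=random.Random(0))
```

The additive score cannot represent interactions between SNPs, which is the limitation that motivates tree-based models in the abstract above.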

Alzheimer’s disease multi-ancestry genome-wide interaction and stratified study with smoking

INTRODUCTION: Alzheimer’s disease (AD) has genetic and environmental risk factors, including cigarette smoking. Gene-environment interactions may explain AD missing heritability.
METHODS: Lifetime smoking data from 22,032 European ancestry and 3126 African ancestry participants from the Alzheimer’s Disease Genetic Consortium and the Framingham Heart Study were used to conduct genome-wide single nucleotide polymorphism (SNP)-by-smoking interaction and smoking-stratified association studies. For top-ranked loci, brain-derived bulk and single nuclei RNA-sequencing were used for differential expression and colocalization analyses.
RESULTS: Among smokers only, there was a genome-wide significant association in the APAF1/ANKS1B region (rs12368451; odds ratio = 1.19, 95% confidence interval: [1.12, 1.27], p = 3.0 × 10⁻⁸). Rs12368451 had expression quantitative trait locus (eQTL) activity that differed by smoking status and brain cell types but showed the most significant posterior probability (PP = 0.15) for being causal via ANKS1B expression in oligodendrocytes among smokers.
DISCUSSION: Potentially causal in smokers via eQTL activity, the top SNP may alter expression of ANKS1B, which encodes amyloid beta precursor protein intracellular domain associated-1, known to regulate amyloid beta plaques.
HIGHLIGHTS: Among smokers only, a novel chromosome 12 single nucleotide polymorphism (SNP) near ANKS1B was associated with Alzheimer’s disease. Evidence came from European and African ancestry cohorts. RNA-sequencing analyses implicated the top SNP as causal via ANKS1B expression in oligodendrocytes. A genome-wide African ancestry-specific significant SNP-smoking interaction was observed on chromosome 6 in SLC22A23.

GrafAnc: Reliable and reproducible inference of continental and regional population structure

Accurate inference of genetic ancestry is a fundamental step in population genetics, disease association studies, and understanding human history. However, most existing tools, whether model-based or model-free, are limited by dataset-specific characteristics, which restrict reproducibility and hinder cross-study comparisons. Additionally, these tools often struggle to resolve fine-scale population structure, requiring multiple processing steps, such as sample subsetting and repeated program execution. These practices introduce bias and reduce replicability, particularly in evolutionary and migration studies. We present GrafAnc, a robust tool for inferring ancestry at both continental and subcontinental levels without requiring dataset partitioning, iterative processing, or manual sample curation. Building upon and extending GRAF-pop, GrafAnc infers an individual’s ancestry background by comparing genotypes with allele frequencies from 26 reference populations compiled from publicly available databases. The current version of GrafAnc generates 18 ancestry scores per individual and classifies individuals into 8 continental and 38 subcontinental ancestry groups, including Middle East and North Africa. These scores are invariant to the specific composition of the study dataset and can be used directly as continuous covariates or for ancestry group assignments. GrafAnc enables seamless integration of population structure across studies and datasets, facilitating consistent interpretation in large-scale genomics. We benchmark GrafAnc using the 1000 Genomes Project, UK Biobank, and Human Genome Diversity Project datasets, demonstrating its accuracy and robustness across diverse ancestries and genotyping platforms. GrafAnc is implemented in C++ with multithreading support and is freely available.
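
To make the idea of comparing an individual's genotypes with reference allele frequencies concrete, here is a minimal sketch that scores each reference population by a genotype log-likelihood under Hardy-Weinberg proportions and picks the best match. This is a textbook-style illustration under stated assumptions, not GrafAnc's scoring method; the function names, population labels, and frequencies are hypothetical.

```python
import math


def genotype_log_likelihood(genotypes, freqs, eps=1e-6):
    """Log-likelihood of genotypes given per-SNP alternate-allele frequencies.

    genotypes: alternate-allele counts (0, 1, 2), or None for a missing call.
    freqs: alternate-allele frequency of each SNP in one reference population.
    Assumes Hardy-Weinberg proportions and independent SNPs.
    """
    total = 0.0
    for g, f in zip(genotypes, freqs):
        if g is None:
            continue  # skip missing genotypes
        f = min(max(f, eps), 1.0 - eps)  # keep log() away from zero
        probs = ((1.0 - f) ** 2, 2.0 * f * (1.0 - f), f ** 2)
        total += math.log(probs[g])
    return total


def closest_population(genotypes, reference_freqs):
    """Return the reference population label with the highest log-likelihood."""
    return max(
        reference_freqs,
        key=lambda pop: genotype_log_likelihood(genotypes, reference_freqs[pop]),
    )


# Hypothetical reference panel with two populations and three SNPs.
reference = {
    "pop_a": [0.9, 0.1, 0.8],
    "pop_b": [0.1, 0.9, 0.2],
}
best = closest_population([2, 0, 2], reference)
```

Because each individual is scored only against fixed reference frequencies, the result does not change with the other samples in the study, which is the kind of dataset invariance the abstract emphasizes.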

BTS: a scalable Bayesian Tissue Score for prioritizing GWAS variants and their functional contexts across >1000s of omics datasets

MOTIVATION: Summary statistics from genome-wide association studies (GWAS) are widely used in fine-mapping and colocalization analyses to identify causal variants and their enrichment in functional contexts, such as affected cell types and genomic features. With the expansion of functional genomic (FG) datasets, which now include hundreds of thousands of tracks across various cell and tissue types, it is critical to establish scalable algorithms integrating thousands of diverse FG annotations with GWAS results.
RESULTS: We propose BTS (Bayesian Tissue Score), a novel, highly efficient algorithm uniquely designed for (i) identifying affected cell types and functional elements (context-mapping) and (ii) fine-mapping potentially causal variants in a context-specific manner using large collections of cell type-specific FG annotation tracks. BTS leverages GWAS summary statistics and annotation-specific Bayesian models to analyze genome-wide annotation tracks, including enhancers, open chromatin, and histone marks. We evaluated BTS on GWAS summary statistics for immune and cardiovascular traits, such as Inflammatory Bowel Disease (IBD), Rheumatoid Arthritis (RA), Systemic Lupus Erythematosus (SLE), and Coronary Artery Disease (CAD). Our results demonstrate that BTS is over 100× more efficient in estimating functional annotation effects and context-specific variant fine-mapping compared to existing methods. Importantly, this large-scale Bayesian approach prioritizes both known and novel annotations, cell types, genomic regions, and variants and provides valuable biological insights into the functional contexts of these diseases.
AVAILABILITY AND IMPLEMENTATION: Docker image is available at https://hub.docker.com/r/wanglab/bts with preinstalled BTS R package (https://bitbucket.org/wanglab-upenn/BTS-R) and BTS GWAS summary statistics analysis pipeline (https://bitbucket.org/wanglab-upenn/bts-pipeline).

Integrated genomic analysis and CRISPRi implicates EGFR in Alzheimer’s disease risk

Genome-wide association studies (GWAS) have identified numerous loci linked to late-onset Alzheimer’s disease (LOAD), but the pan-brain regional effects of these loci remain largely uncharacterized. To address this, we systematically analyzed all LOAD-associated regions reported by Bellenguez et al. using the FILER functional genomics catalog across 174 datasets, including enhancers, transcription factors, and quantitative trait loci. We identified 42 candidate causal variant-effector gene pairs and assessed their impact using enhancer-promoter interaction data, variant annotations, and brain cell-type-specific gene expression. Notably, the LOAD risk allele of rs74504435 at the SEC61G locus was computationally predicted to increase EGFR expression in LOAD-related cell types: microglia, astrocytes, and neurons. Functional validation using promoter-focused Capture C, ATAC-seq, and CRISPR interference in the HMC3 human microglia cell line confirmed this regulatory relationship. Our findings reveal a microglial enhancer regulating EGFR in LOAD, suggesting EGFR inhibitors as a potential therapeutic avenue for the disease.

Mosaic chromosomal alterations in blood are associated with an increased risk of Alzheimer’s disease

Mosaic chromosomal alterations (mCAs) in blood, a form of clonal hematopoiesis, have been linked to various diseases, but their role in Alzheimer’s disease (AD) remains unclear. We analyzed blood whole-genome sequencing (WGS) data from 24,049 individuals in the Alzheimer’s Disease Sequencing Project and found that autosomal mCAs were significantly associated with increased AD risk (odds ratio = 1.27; P = 1.3 × 10⁻⁵). This association varied by ancestry, mCA subtype, APOE ε4 allele status, and chromosomal location. Using matched blood WGS and brain single-nucleus RNA-seq data, we identified microglia-annotated cells in the brain carrying the same mCAs found in blood. These findings suggest that blood mCAs may contribute to AD pathogenesis, potentially through infiltration into the brain and influencing local immune response.

Sex-Specific Genetic Drivers of Memory, Executive Functioning, and Language Performance in Older Adults

We previously identified sex-specific genetic loci associated with memory performance, a strong Alzheimer’s disease (AD) endophenotype. Here, we expand on this work by conducting sex-specific, cross-ancestral, genome-wide meta-analyses of three cognitive domains (memory, executive functioning, and language) in 33,918 older adults (57% female; 41% cognitively impaired; mean age = 73 years) from 10 aging and AD cohorts. All three domains were comparably heritable across sexes. Genome-wide meta-analyses identified three novel loci: a female-specific language decline-associated locus, VRK2 (rs13387871), which is a published candidate for neuropsychiatric traits involving language ability; a male-specific memory decline-associated locus among cognitively impaired participants, DCHS2 (rs12501200), which is a published candidate gene for AD age-at-onset; and a sex-interaction with baseline executive functioning, AGA (rs1380012), among cognitively impaired participants. We additionally provide evidence for shared genetic architecture between lifetime estrogen exposure and AD-related cognitive decline. Overall, we identified sex-specific variants, genes, and pathways relating to three cognitive domains among older adults.

X-chromosome-wide association study for Alzheimer’s disease

Due to methodological reasons, the X-chromosome has not been featured in the major genome-wide association studies on Alzheimer’s Disease (AD). To address this and better characterize the genetic landscape of AD, we performed an in-depth X-Chromosome-Wide Association Study (XWAS) in 115,841 AD cases or AD proxy cases, including 52,214 clinically-diagnosed AD cases, and 613,671 controls. We considered three approaches to account for the different X-chromosome inactivation (XCI) states in females, i.e. random XCI, skewed XCI, and escape XCI. We did not detect any genome-wide significant signals (P ≤ 5 × 10⁻⁸) but identified seven X-chromosome-wide significant loci (P ≤ 1.6 × 10⁻⁶). The index variants were common for the Xp22.32, FRMPD4, DMD and Xq25 loci, and rare for the WNK3, PJA1, and DACH2 loci. Overall, this well-powered XWAS found no genetic risk factors for AD on the non-pseudoautosomal region of the X-chromosome, but it identified suggestive signals warranting further investigations.

Contextualizing molecular and structural aging across human organs

Organ-specific aging clocks have shown promise as predictors of disease risk and aging trajectories; however, the underlying biological mechanisms they reflect remain largely unexplored. Here, we use large-scale proteomic and imaging data to investigate the relationships among organ-specific and modality-specific aging clocks and to uncover the biological processes they represent. By estimating paired protein-based and imaging-based aging clocks across 8 major organs, we demonstrate that these omics and structural profiles exhibit distinct phenotypic and genetic signatures, each potentially quantifying different stages and playing complementary roles within a unified biological aging process. Furthermore, context-specific aging clocks from multiple organs often converge and jointly capture established biological and disease pathways. For example, 65.7% of the KEGG Alzheimer’s disease pathway is enriched by at least one of 11 protein- and imaging-based aging clocks, with each clock representing different components of the pathway. These results underscore the importance of a pan-organ multi-modal perspective for quantifying the mechanisms underlying age-related diseases. Additionally, we identify modality-specific links between aging clocks and complex diseases and lifestyle factors. In summary, we uncover intricate relationships among molecular and structural aging clocks across human organs, providing novel insights into their context-specific roles in capturing consequences of aging biology and their implications for disease risk.

Copy Number Variation and Haplotype Analysis of 17q21.31 Reveals Increased Risk Associated with Progressive Supranuclear Palsy and Gene Expression Changes in Neuronal Cells

BACKGROUND: The 17q21.31 region with various structural forms characterized by the H1/H2 haplotypes and three large copy number variations (CNVs) represents the strongest risk locus in progressive supranuclear palsy (PSP).
OBJECTIVE: To investigate the association of CNVs and structural forms on 17q21.31 with the risk of PSP.
METHODS: Utilizing whole genome sequencing data from 1684 PSP cases and 2392 controls, the three large CNVs (α, β, and γ) and structural forms within 17q21.31 were identified and analyzed for their association with PSP.
RESULTS: We found that the copy number of γ was associated with increased PSP risk (odds ratio [OR] = 1.10, P = 0.0018). From H1β1γ1 (OR = 1.21) and H1β2γ1 (OR = 1.24) to H1β1γ4 (OR = 1.57), structural forms of H1 with additional copies of γ displayed a higher risk for PSP. The frequency of the risk sub-haplotype H1c rises from 1% in individuals with two γ copies to 88% in those with eight copies. Additionally, γ duplication up-regulates expression of ARL17B, LRRC37A/LRRC37A2, and NSFP1, while down-regulating KANSL1. Single-nucleus RNA-seq analysis of the dorsolateral prefrontal cortex reveals that γ duplication primarily up-regulates LRRC37A/LRRC37A2 in neuronal cells.
CONCLUSIONS: The copy number of γ is associated with the risk of PSP after adjusting for H1/H2, indicating that the complex structure at 17q21.31 is an important consideration when evaluating the genetic risk of PSP. © 2025 The Author(s). Movement Disorders published by Wiley Periodicals LLC on behalf of International Parkinson and Movement Disorder Society.