IMPORTANCE: The chromosome 17q21.31 region, containing a 900 Kb inversion that defines H1 and H2 haplotypes, represents the strongest genetic risk locus in progressive supranuclear palsy (PSP). In addition to H1 and H2, various structural forms of 17q21.31, characterized by the copy number of α, β, and γ duplications, have been identified. However, the specific effect of each structural form on the risk of PSP has never been evaluated in a large cohort study.
OBJECTIVE: To assess the association of different structural forms of 17q21.31, defined by the copy numbers of α, β, and γ duplications, with the risk of PSP and MAPT sub-haplotypes.
DESIGN, SETTING, AND PARTICIPANTS: Using whole-genome sequencing data from 1,684 individuals with PSP (1,386 autopsy confirmed) and 2,392 control subjects, a case-control study was conducted to investigate the association of copy numbers of α, β, and γ duplications and structural forms of 17q21.31 with the risk of PSP. All study subjects were selected from the Alzheimer’s Disease Sequencing Project (ADSP) Umbrella NG00067.v7. Data were analyzed between March 2022 and November 2023.
MAIN OUTCOMES AND MEASURES: The main outcomes were the risk (odds ratios [ORs]) for PSP with 95% CIs. Risks for PSP were evaluated by logistic regression models.
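As a minimal sketch of how an odds ratio and its 95% CI are typically derived from a fitted logistic regression model, the snippet below exponentiates a coefficient and its Wald interval. The function name and the numeric inputs are hypothetical and are not taken from the study; the abstract does not state how the intervals were computed, so the Wald interval is an assumption.

```python
import math

def odds_ratio_with_ci(beta, se, z=1.959964):
    """Convert a logistic-regression coefficient (log odds) and its
    standard error into an odds ratio with a Wald confidence interval.
    z = 1.959964 corresponds to a two-sided 95% interval."""
    odds_ratio = math.exp(beta)
    lower = math.exp(beta - z * se)
    upper = math.exp(beta + z * se)
    return odds_ratio, lower, upper

# Hypothetical coefficient and standard error, for illustration only.
or_, lo, hi = odds_ratio_with_ci(beta=0.2, se=0.05)
print(f"OR = {or_:.2f} (95% CI, {lo:.2f}-{hi:.2f})")
```

Under a log-additive model, a per-copy odds ratio applies multiplicatively, so k additional copies correspond to the per-copy odds ratio raised to the power k.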
RESULTS: The copy numbers of α and β were associated with the risk of PSP only through their correlation with H1 and H2, whereas the copy number of γ was independently associated with increased risk of PSP. Each additional γ duplication was associated with a 1.10-fold increase in the risk of PSP (95% CI, 1.04-1.17; P = 0.0018) when conditioning on H1 and H2. Within the H1 haplotype, additional γ duplications showed higher odds ratios for PSP: the odds ratio increased from 1.21 (95% CI, 1.10-1.33; P = 5.47 × 10⁻⁵) for H1β1γ1 to 1.29 (95% CI, 1.16-1.43; P = 1.35 × 10⁻⁶) for H1β1γ2, 1.45 (95% CI, 1.27-1.65; P = 3.94 × 10⁻⁸) for H1β1γ3, and 1.57 (95% CI, 1.10-2.26; P = 1.35 × 10⁻²) for H1β1γ4. Moreover, H1β1γ3 is in linkage disequilibrium with H1c (R² = 0.31), a widely recognized MAPT sub-haplotype associated with increased risk of PSP. The proportion of MAPT sub-haplotypes associated with increased risk of PSP (i.e., H1c, H1d, H1g, H1o, and H1h) increased from 34% in H1β1γ1 to 77% in H1β1γ4.
CONCLUSIONS AND RELEVANCE: This study revealed that the copy number of γ was associated with the risk of PSP independently of H1 and H2. H1 haplotypes with more γ duplications showed higher odds ratios for PSP and were associated with MAPT sub-haplotypes that carry increased risk of PSP. These findings expand our understanding of how the complex structure at 17q21.31 affects the risk of PSP.
The heterogeneity of whole-exome sequencing (WES) data generation methods presents a challenge for joint analysis. Here we present a bioinformatics strategy for joint-calling 20,504 WES samples collected across nine studies and sequenced using ten capture kits in fourteen sequencing centers in the Alzheimer’s Disease Sequencing Project. The joint-genotyped variant call format (VCF) file contains only positions within the union of capture kits. The VCF was then processed specifically to account for the batch effects arising from the use of different capture kits from different studies. We identified 8.2 million autosomal variants. Of these variants, 96.82% are high-quality and are located in 28,579 Ensembl transcripts; 41% are intronic, and 1.8% have CADD > 30, indicating high predicted pathogenicity. We show that our new strategy can generate high-quality data from these diversely generated WES samples. The improved ability to combine data sequenced in different batches benefits the whole genomics research community.
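Restricting a call set to the union of capture kits can be illustrated with a small interval-merging sketch. This is not the pipeline described above; the function names, the tuple layout, and the example coordinates are hypothetical, and intervals are assumed to be 0-based and half-open, as in BED files.

```python
def union_of_intervals(kits):
    """Merge per-kit target intervals into one sorted, non-overlapping union.

    Each kit is a list of (chrom, start, end) tuples with 0-based,
    half-open coordinates (an assumption, following the BED convention).
    """
    all_intervals = sorted(iv for kit in kits for iv in kit)
    merged = []
    for chrom, start, end in all_intervals:
        # Extend the previous interval when this one overlaps or abuts it.
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            last = merged[-1]
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged

def in_union(merged, chrom, pos):
    """Return True if a 0-based position falls inside any merged interval."""
    return any(c == chrom and s <= pos < e for c, s, e in merged)

# Hypothetical target regions for two capture kits.
kit_a = [("chr1", 100, 200), ("chr1", 500, 600)]
kit_b = [("chr1", 150, 250), ("chr2", 10, 20)]
targets = union_of_intervals([kit_a, kit_b])
print(targets)
```

A linear scan in `in_union` is enough for illustration; at the scale of millions of variants, an indexed lookup (for example, binary search over sorted interval starts) would be the more practical choice.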
The prevalence of dementia among South Asians across India is approximately 7.4% in those 60 years and older, yet little is known about genetic risk factors for dementia in this population. Most known risk loci for Alzheimer’s disease (AD) have been identified in studies of European ancestry (EA) populations, and their relevance in South Asians is unknown. Using whole-genome sequence data from 2,680 participants from the Diagnostic Assessment of Dementia for the Longitudinal Aging Study of India (LASI-DAD), we performed a gene-based analysis of 84 genes previously associated with AD in EA. We investigated associations with the Hindi Mental State Examination (HMSE) score and factor scores for general cognitive function and five cognitive domains. For each gene, we examined missense/loss-of-function (LoF) variants and brain-specific promoter/enhancer variants, separately, both with and without incorporating additional annotation weights (e.g., deleteriousness, conservation scores) using the variant-Set Test for Association using Annotation infoRmation (STAAR). In the missense/LoF analysis without annotation weights and controlling for age, sex, state/territory, and genetic ancestry, three genes had an association with at least one measure of cognitive function (FDR q < 0.1). APOE was associated with four measures of cognitive function, PICALM was associated with HMSE score, and TSPOAP1 was associated with executive function. The most strongly associated variants in each gene were rs429358 (APOE ε4), rs779406084 (PICALM), and rs9913145 (TSPOAP1). rs779406084 is a rare missense mutation that is more prevalent in LASI-DAD than in EA (minor allele frequency = 0.075% vs. 0.0015%); the other two are common variants. No genes in the brain-specific promoter/enhancer analysis met criteria for significance. Results with and without annotation weights were similar.
Missense/LoF variants in some genes previously associated with AD in EA are associated with measures of cognitive function in South Asians from India. Analyzing genome sequence data allows identification of potential novel causal variants enriched in South Asians.
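The FDR q < 0.1 criterion used in the gene-based analysis can be illustrated with the Benjamini-Hochberg procedure. The abstract does not name the FDR method, so Benjamini-Hochberg is an assumption here, and the gene labels and p-values below are hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each q-value
    # is the minimum of p * m / rank over all ranks at or above its own.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

# Hypothetical gene-level p-values, for illustration only.
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
pvals = [0.001, 0.02, 0.03, 0.5]
qvals = bh_qvalues(pvals)
significant = [g for g, q in zip(genes, qvals) if q < 0.1]
print(significant)
```

The threshold is applied to the adjusted values, so a gene can pass at q < 0.1 even when its raw p-value would not survive a stricter family-wise correction.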
INTRODUCTION: Clinical research in Alzheimer’s disease (AD) lacks cohort diversity despite being a global health crisis. The Asian Cohort for Alzheimer’s Disease (ACAD) was formed to address underrepresentation of Asians in research, and limited understanding of how genetics and non-genetic/lifestyle factors impact this multi-ethnic population.
METHODS: The ACAD started fully recruiting in October 2021 with one central coordination site, eight recruitment sites, and two analysis sites. We developed a comprehensive study protocol for outreach and recruitment, an extensive data collection packet, and a centralized data management system, in English, Chinese, Korean, and Vietnamese.
RESULTS: ACAD has recruited 606 participants with an additional 900 expressing interest in enrollment since program inception.
DISCUSSION: ACAD’s traction indicates the feasibility of recruiting Asians for clinical research to enhance understanding of AD risk factors. ACAD will recruit > 5000 participants to identify genetic and non-genetic/lifestyle AD risk factors, establish blood biomarker levels for AD diagnosis, and facilitate clinical trial readiness.
HIGHLIGHTS: The Asian Cohort for Alzheimer’s Disease (ACAD) promotes awareness of under-investment in clinical research for Asians. We are recruiting Asian Americans and Canadians for novel insights into Alzheimer’s disease. We describe culturally appropriate recruitment strategies and data collection protocol. ACAD addresses challenges of recruitment from heterogeneous Asian subcommunities. We aim to implement a successful recruitment program that enrolls across three Asian subcommunities.
BACKGROUND: Mitochondrial DNA (mtDNA) is a double-stranded circular DNA and has multiple copies in each cell. Excess heteroplasmy, the coexistence of distinct variants in copies of mtDNA within a cell, may lead to mitochondrial impairments. Accurate determination of heteroplasmy in whole-genome sequencing (WGS) data has posed a significant challenge because mitochondria carrying heteroplasmic variants cannot be distinguished during library preparation. Moreover, sequencing errors, contamination, and nuclear mtDNA segments can reduce the accuracy of heteroplasmic variant calling.
OBJECTIVE: To efficiently and accurately call mtDNA homoplasmic and heteroplasmic variants from the large-scale WGS data generated from the Alzheimer’s Disease Sequencing Project (ADSP), and test their association with Alzheimer’s disease (AD).
METHODS: In this study, we present MitoH3, a comprehensive computational pipeline for calling mtDNA homoplasmic and heteroplasmic variants and inferring haplogroups in the ADSP WGS data. We first applied MitoH3 to 45 technical replicates from 6 subjects to define a threshold for detecting heteroplasmic variants. Then, using the threshold of 5% ≤ variant allele fraction ≤ 95%, we applied MitoH3 to call heteroplasmic variants from a total of 16,113 DNA samples, including 6,742 samples from cognitively normal controls and 6,183 from AD cases.
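The variant allele fraction (VAF) threshold can be sketched as a simple classifier. This is not MitoH3 code: the function name and category labels are hypothetical, and treating VAF above 95% as homoplasmic and VAF below 5% as below the calling threshold are assumptions that go beyond what the abstract states.

```python
def classify_variant(alt_reads, total_reads, low=0.05, high=0.95):
    """Classify an mtDNA variant by its variant allele fraction (VAF).

    VAF in [low, high] is called heteroplasmic, matching the stated
    5%-95% threshold. The other two labels are illustrative assumptions.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    vaf = alt_reads / total_reads
    if vaf < low:
        return "below_threshold"
    if vaf <= high:
        return "heteroplasmic"
    return "homoplasmic"

# Hypothetical read counts at three sites, for illustration only.
for alt, total in [(40, 1000), (300, 1000), (990, 1000)]:
    print(alt, total, classify_variant(alt, total))
```

Both boundaries are inclusive, so a variant at exactly 5% or exactly 95% is counted as heteroplasmic.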
RESULTS: This pipeline is available through the Singularity container engine. For 4,311 heteroplasmic variants identified from 16,113 samples, no significant variant count difference was observed between AD cases and controls.
CONCLUSIONS: Our streamlined pipeline, MitoH3, enables computationally efficient and accurate analysis of a large number of samples.
SUMMARY: Preparing functional genomic (FG) data with diverse assay types and file formats for integration into analysis workflows that interpret genome-wide association and other studies is a significant and time-consuming challenge. Here we introduce hipFG, an automatically customized pipeline for efficient and scalable normalization of heterogeneous FG data collections into standardized, indexed, rapidly searchable analysis-ready datasets while accounting for FG datatypes (e.g., chromatin interactions, genomic intervals, quantitative trait loci).
AVAILABILITY: hipFG is freely available at https://bitbucket.org/wanglab-upenn/hipFG. A Docker container is available at https://hub.docker.com/r/wanglab/hipfg.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
INTRODUCTION: The National Institute on Aging Genetics of Alzheimer’s Disease Data Storage Site Alzheimer’s Genomics Database (GenomicsDB) is a public knowledge base of Alzheimer’s disease (AD) genetic datasets and genomic annotations.
METHODS: GenomicsDB uses a custom systems architecture to adopt and enforce rigorous standards that facilitate harmonization of AD-relevant genome-wide association study summary statistics datasets with functional annotations, including over 230 million annotated variants from the AD Sequencing Project.
RESULTS: GenomicsDB generates interactive reports compiled from the harmonized datasets and annotations. These reports contextualize AD-risk associations in a broader functional genomic setting and summarize them in the context of functionally annotated genes and variants.
DISCUSSION: Created to make AD-genetics knowledge more accessible to AD researchers, the GenomicsDB is designed to guide users unfamiliar with genetic data in not only exploring but also interpreting this ever-growing volume of data. Scalable and interoperable with other genomics resources using data technology standards, the GenomicsDB can serve as a central hub for research and data analysis on AD and related dementias.
HIGHLIGHTS: The National Institute on Aging Genetics of Alzheimer’s Disease Data Storage Site (NIAGADS) offers to the public a unique, disease-centric collection of AD-relevant GWAS summary statistics datasets. Interpreting these data is challenging and requires significant bioinformatics expertise to standardize datasets and harmonize them with functional annotations on genome-wide scales. The NIAGADS Alzheimer’s GenomicsDB helps overcome these challenges by providing a user-friendly public knowledge base for AD-relevant genetics that shares harmonized, annotated summary statistics datasets from the NIAGADS repository in an interpretable, easily searchable format.
Alzheimer’s disease (AD), the leading cause of dementia, has an estimated heritability of approximately 70% [1]. The genetic component of AD has been mainly assessed using genome-wide association studies, which do not capture the risk contributed by rare variants [2]. Here, we compared the gene-based burden of rare damaging variants in exome sequencing data from 32,558 individuals (16,036 AD cases and 16,522 controls). Next to variants in TREM2, SORL1 and ABCA7, we observed a significant association of rare, predicted damaging variants in ATP8B4 and ABCA1 with AD risk, and a suggestive signal in ADAM10. Additionally, the rare-variant burden in RIN3, CLU, ZCWPW1 and ACE highlighted these genes as potential drivers of respective AD-genome-wide association study loci. Variants associated with the strongest effect on AD risk, in particular loss-of-function variants, are enriched in early-onset AD cases. Our results provide additional evidence for a major role for amyloid-β precursor protein processing, amyloid-β aggregation, lipid metabolism and microglial function in AD.
Non-coding genetic variants outside of protein-coding genome regions play an important role in genetic and epigenetic regulation. It has become increasingly important to understand their roles, as non-coding variants often make up the majority of top findings of genome-wide association studies (GWAS). In addition, the growing popularity of disease-specific whole-genome sequencing (WGS) efforts expands the library of and offers unique opportunities for investigating both common and rare non-coding variants, typically not detected in more limited GWAS approaches. However, the sheer size and breadth of WGS data introduces additional challenges to predicting functional impacts in terms of data analysis and interpretation. This review focuses on the recent approaches developed for efficient, at-scale annotation and prioritization of non-coding variants uncovered in WGS analyses. In particular, we review the latest scalable annotation tools, databases, and functional genomic resources for interpreting variant findings from WGS, based on both experimental data and in silico predictive annotations. We also review machine learning-based predictive models for variant scoring and prioritization. We conclude with a discussion of future research directions that will enhance the data and tools necessary for effective functional analyses of variants identified by WGS to improve our understanding of disease etiology.
The success of genome-wide association studies (GWAS) completed in the last 15 years has reinforced a key fact: polygenic architecture makes a substantial contribution to variation in susceptibility to complex diseases, including Alzheimer’s disease. One straightforward way to capture this architecture and predict which individuals in a population are most at risk is to calculate a polygenic risk score (PRS). This score aggregates the risk conferred across multiple genetic variants, ultimately representing an individual’s predicted genetic susceptibility to a disease. PRS have received increasing attention after being used successfully for complex traits. This has brought renewed attention to new methods that improve the accuracy of risk prediction. While these applications are initially informative, their utility is far from equitable: the majority of PRS models use samples drawn heavily, if not entirely, from individuals of European descent. This basic approach raises concerns of health inequity if PRS are applied inaccurately to other population groups, or of health disparity if we fail to use them at all. In this review we examine the methods of calculating PRS and some of their previous uses in disease prediction. We also advocate, with supporting scientific evidence, for the inclusion of data from diverse populations in existing and future studies of population risk via PRS.
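In its simplest form, a PRS is a weighted sum of risk-allele dosages, with weights taken from GWAS effect sizes (for example, log odds ratios). The sketch below shows only that basic calculation; the variant identifiers, weights, and dosages are hypothetical, and treating a missing genotype as a dosage of zero is a simplifying assumption rather than a recommended practice.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages for one individual.

    dosages: variant id -> allele dosage (0 to 2)
    weights: variant id -> per-allele effect size (e.g., log odds ratio)
    Variants missing from dosages contribute zero (a simplification).
    """
    return sum(w * dosages.get(variant, 0.0) for variant, w in weights.items())

# Hypothetical effect sizes and genotypes, for illustration only.
weights = {"variant_1": 0.2, "variant_2": -0.1, "variant_3": 0.05}
person = {"variant_1": 2, "variant_2": 1}
score = polygenic_risk_score(person, weights)
print(round(score, 3))
```

Because the weights are estimated in a particular study population, the same calculation can perform poorly when applied to individuals whose ancestry differs from that population, which is the equity concern raised above.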