GrafAnc: Reliable and reproducible inference of continental and regional population structure

Accurate inference of genetic ancestry is a fundamental step in population genetics, disease association studies, and the study of human history. However, most existing tools, whether model-based or model-free, produce results that depend on dataset-specific characteristics, which restricts reproducibility and hinders cross-study comparisons. These tools also often struggle to resolve fine-scale population structure, so users must resort to multiple processing steps, such as sample subsetting and repeated program execution. These practices introduce bias and reduce replicability, particularly in evolutionary and migration studies. We present GrafAnc, a robust tool for inferring ancestry at both continental and subcontinental levels without requiring dataset partitioning, iterative processing, or manual sample curation. Building upon and extending GRAF-pop, GrafAnc infers an individual’s ancestral background by comparing the individual’s genotypes with allele frequencies from 26 reference populations compiled from publicly available databases. The current version of GrafAnc generates 18 ancestry scores per individual and classifies individuals into 8 continental and 38 subcontinental ancestry groups, including groups from the Middle East and North Africa. These scores are invariant to the specific composition of the study dataset and can be used directly as continuous covariates or for ancestry group assignment. GrafAnc therefore allows population structure to be compared and combined across studies and datasets, facilitating consistent interpretation in large-scale genomics. We benchmark GrafAnc on the 1000 Genomes Project, UK Biobank, and Human Genome Diversity Project datasets, demonstrating its accuracy and robustness across diverse ancestries and genotyping platforms. GrafAnc is implemented in C++ with multithreading support and is freely available.