Research entry
Genetic Substructure Analysis via PCA
Completed · May 2021
Analysis of genetic substructure across global populations using genome-wide SNPs and principal component analysis on the 1000 Genomes Project dataset.
Overview
This project analyses genetic substructure using genome-wide SNPs from the 1000 Genomes Project. Principal component analysis (PCA) was applied to identify population clusters across five superpopulations: AFR, AMR, EAS, EUR, and SAS.
Methods
- PLINK used for quality control (QC) and PCA computation (top 30 principal components)
- Interactive visualisation built with Plotly and Dash
- 3D scatter plots, choropleth maps, and scree plots
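As a rough illustration of how PCA output can feed the scree plot, the sketch below parses PLINK-style `.eigenval` and `.eigenvec` text and computes each component's share of the reported eigenvalues. It assumes the common layout (one eigenvalue per line; whitespace-separated `FID IID PC1 PC2 ...` rows with an optional header). The function names and the toy values are hypothetical and are not taken from the project. Because only the top components are written out, the shares are relative to the reported eigenvalues, not to the total genome-wide variance.

```python
from typing import Dict, List


def parse_eigenval(text: str) -> List[float]:
    """Parse eigenvalues, assuming one value per line."""
    return [float(token) for token in text.split()]


def explained_share(eigenvalues: List[float]) -> List[float]:
    """Share of each eigenvalue relative to the sum of the reported ones."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


def parse_eigenvec(text: str) -> Dict[str, List[float]]:
    """Map each sample ID (IID column) to its list of PC coordinates."""
    samples: Dict[str, List[float]] = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and an optional header row.
        if not fields or fields[0].startswith("#") or fields[0] == "FID":
            continue
        _fid, iid, *pcs = fields
        samples[iid] = [float(x) for x in pcs]
    return samples


# Toy inputs (made up for illustration, not real 1000 Genomes output).
EIGENVAL_TEXT = "4.0\n2.0\n1.0\n1.0\n"
EIGENVEC_TEXT = "FID IID PC1 PC2\nF1 S1 0.10 -0.20\nF2 S2 -0.30 0.40\n"

shares = explained_share(parse_eigenval(EIGENVAL_TEXT))
coords = parse_eigenvec(EIGENVEC_TEXT)

for index, share in enumerate(shares, start=1):
    print(f"PC{index}: {share:.1%} of reported eigenvalue sum")
```

The `shares` list can be passed directly to a bar chart as the y-values of a scree plot, and `coords` supplies the per-sample points for a scatter plot.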
Results
The superpopulation clusters separate clearly in PC1/PC2 space, and notable substructure is visible within continental groups.
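One simple way to put a number on cluster separation is to compare group centroids in PC1/PC2 space. The sketch below is an assumption about how this could be checked, not the method used in the project. The coordinates, sample IDs, and the `centroids` helper are hypothetical.

```python
from collections import defaultdict
from math import dist
from typing import Dict, Tuple

Point = Tuple[float, float]


def centroids(points: Dict[str, Point], labels: Dict[str, str]) -> Dict[str, Point]:
    """Mean (PC1, PC2) position of the samples in each labelled group."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # running PC1 sum, PC2 sum, count
    for sample, (pc1, pc2) in points.items():
        acc = sums[labels[sample]]
        acc[0] += pc1
        acc[1] += pc2
        acc[2] += 1
    return {group: (sx / n, sy / n) for group, (sx, sy, n) in sums.items()}


# Made-up coordinates for illustration only.
points = {
    "S1": (-1.0, 0.0),
    "S2": (-3.0, 0.0),
    "S3": (1.0, 1.0),
    "S4": (1.0, 3.0),
}
labels = {"S1": "AFR", "S2": "AFR", "S3": "EUR", "S4": "EUR"}

group_centroids = centroids(points, labels)
separation = dist(group_centroids["AFR"], group_centroids["EUR"])
print(group_centroids, round(separation, 3))
```

A large centroid distance relative to the spread within each group is consistent with visually distinct clusters; the same helper could be applied to population-level labels to look at substructure inside one continental group.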
Current Explorer
This work has since been extended into a broader public explorer. The current live app is hosted on Railway, and it is the version linked from the portfolio while a custom subdomain is being prepared.