Research entry
Genomics PCA Analysis
2021 · Academic Archive
A population-genetics project that started as academic PCA analysis and has since evolved into a public genomics explorer for browsing ancestry structure interactively.
Overview
This project explored population genetics using genome-wide SNPs from the 1000 Genomes Project. PCA was applied to 30 principal components to identify population clusters. K-Means clustering was used to group unlabelled samples, and Plotly was used to create interactive choropleth maps showing the geographic origin of ancestry.
It is one of the stronger examples of older work that still deserves space on the portfolio, because it has a clear line from coursework to a usable public-facing tool.
Original Script (2021)
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import plotly.express as px
import plotly.graph_objects as go
# Load PCA output from PLINK
PCA_eigenval = pd.read_csv("last_PCA.eigenval", header=0, delimiter=r"\s+")
PCA_eigenvec = pd.read_csv("last_PCA.eigenvec", header=0, delimiter=r"\s+")
# Build PC dataframe
PC_df = pd.DataFrame(
PCA_eigenvec.iloc[:,1:].values,
columns=['Sample name'] + [f'PC_{i}' for i in range(1, 31)]
)
# Scree plot
x = range(len(PCA_eigenval))
ratio = PCA_eigenval / PCA_eigenval.sum()
plt.scatter(x, ratio * 100)
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Percentage variance explained')
plt.show()
# 2D scatter plots of top PCs
sns.scatterplot(data=PC_df, x='PC_1', y='PC_2', color='red', label="PC1 vs PC2")
sns.scatterplot(data=PC_df, x='PC_1', y='PC_3', color='skyblue', label="PC1 vs PC3")
plt.show()
# K-Means elbow method
inertias = []
for k in range(1, 10):
model = KMeans(n_clusters=k)
model.fit(PC_df[['PC_1', 'PC_2']])
inertias.append(model.inertia_)
plt.plot(range(1, 10), inertias, '-p', color='gold')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.show()
# K-Means with 10 clusters
model = KMeans(n_clusters=10)
model.fit(PC_df[['PC_1', 'PC_2']])
labels = model.predict(PC_df[['PC_1', 'PC_2']])
# Merge with population labels
Labels = pd.read_csv('igsr_samples.csv', usecols=[0,2,3,4,5,6])
List_labels = Labels[Labels['Sample name'].isin(PC_df['Sample name'])]
final_df = pd.merge(PC_df, List_labels, how='inner', on='Sample name')
# Choropleth map
fig = px.choropleth(
final_df,
locations='Population code',
color='Superpopulation code',
hover_name='Population name',
color_continuous_scale=px.colors.sequential.Plasma
)
fig.show()
What It Did
- Loaded PLINK PCA output (eigenvalues + eigenvectors for 30 PCs)
- Built scree plots to visualise variance explained per component
- Plotted PC1 vs PC2, PC1 vs PC3 etc. using seaborn
- Applied K-Means clustering (k=1–10, elbow method) to identify population groups
- Merged with IGSR sample metadata to label populations
- Built Plotly choropleth maps of sample ancestry
Issues in the Original
- Hardcoded Windows absolute paths (
C:\Users\conno\...) — not portable - No functions or classes — everything at module level
- Plotted PCs without population colour labels initially
- Mixed matplotlib and plotly without consistent output
- No type hints or docstrings
Modern Rewrite
Key improvements over the original:
pathlibfor all file paths — no more hardcoded WindowsC:\Users\conno\...strings- Typed functions with docstrings — each step is isolated and testable
StandardScalerbefore K-Means — the original clustered on raw PC values; scaling ensures each PC contributes equally- PCA scatter coloured by superpopulation — the original plotted unlabelled points; now the continental ancestry structure is immediately visible
- Proper elbow plot — uses scaled PCs with
random_state=42for reproducibility - HTML outputs — all plots saved as self-contained HTML files via
fig.write_html(), noplt.show()calls uvdependency management — singleuv runcommand to execute with locked deps
"""
Genomics PCA Analysis — Modern Rewrite (2024)
Original: ResearchProject_PCA.py (2021)
Dependencies (uv):
uv add pandas scikit-learn plotly
Usage:
Place last_PCA.eigenval, last_PCA.eigenvec, and igsr_samples.tsv
in the same directory as this script, then run:
uv run genomics_pca_modern.py
"""
from pathlib import Path
import pandas as pd
import plotly.express as px
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
DATA_DIR = Path(__file__).parent
EIGENVAL_FILE = DATA_DIR / "last_PCA.eigenval"
EIGENVEC_FILE = DATA_DIR / "last_PCA.eigenvec"
SAMPLES_FILE = DATA_DIR / "igsr_samples.tsv"
OUT_DIR = DATA_DIR / "output"
N_PCS = 30
N_CLUSTERS = 5 # 5 superpopulations in 1000 Genomes: AFR, AMR, EAS, EUR, SAS
def load_pca(eigenval_path: Path, eigenvec_path: Path) -> tuple[pd.DataFrame, pd.DataFrame]:
"""Load PLINK PCA output into eigenvalue and eigenvector DataFrames."""
eigenval = pd.read_csv(eigenval_path, header=None, names=["eigenvalue"], sep=r"\s+")
eigenvec = pd.read_csv(eigenvec_path, sep=r"\s+")
pc_cols = [f"PC_{i}" for i in range(1, N_PCS + 1)]
eigenvec.columns = ["FID", "Sample name"] + pc_cols[: len(eigenvec.columns) - 2]
eigenvec = eigenvec.drop(columns=["FID"])
return eigenval, eigenvec
def load_sample_metadata(samples_path: Path) -> pd.DataFrame:
"""Load IGSR sample metadata TSV."""
cols = ["Sample name", "Population code", "Population name",
"Superpopulation code", "Superpopulation name"]
meta = pd.read_csv(samples_path, sep="\t", usecols=lambda c: c in cols)
return meta.drop_duplicates(subset=["Sample name"])
def compute_variance_explained(eigenval: pd.DataFrame) -> pd.Series:
"""Return per-component percentage variance explained."""
total = eigenval["eigenvalue"].sum()
return (eigenval["eigenvalue"] / total * 100).rename("pct_variance")
def run_kmeans(pc_df: pd.DataFrame, n_clusters: int, pc_cols: list[str]) -> pd.Series:
"""Scale top PCs and run K-Means. Returns cluster labels."""
X = pc_df[pc_cols].values
X_scaled = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
return pd.Series(km.fit_predict(X_scaled), index=pc_df.index, name="cluster").astype(str)
def plot_scree(variance: pd.Series, out_dir: Path) -> None:
"""Scree plot of variance explained per principal component."""
df = pd.DataFrame({
"PC": range(1, len(variance) + 1),
"Variance explained (%)": variance.values,
"Cumulative (%)": variance.cumsum().values,
})
fig = px.scatter(
df, x="PC", y="Variance explained (%)",
hover_data={"Cumulative (%)": ":.1f"},
title="Scree Plot — Variance Explained per PC",
template="plotly_dark",
)
fig.add_scatter(
x=df["PC"], y=df["Cumulative (%)"],
mode="lines", name="Cumulative",
line={"dash": "dot", "color": "rgba(99,179,237,0.6)"},
)
fig.write_html(out_dir / "scree_plot.html")
def plot_pca_scatter(df: pd.DataFrame, out_dir: Path) -> None:
"""PC1 vs PC2 scatter coloured by superpopulation."""
fig = px.scatter(
df, x="PC_1", y="PC_2",
color="Superpopulation code",
hover_data=["Sample name", "Population name"],
title="PCA: PC1 vs PC2 — 1000 Genomes Superpopulations",
template="plotly_dark", opacity=0.75,
color_discrete_sequence=px.colors.qualitative.Bold,
)
fig.write_html(out_dir / "pca_scatter.html")
def plot_choropleth(df: pd.DataFrame, out_dir: Path) -> None:
"""Choropleth map: sample count per population coloured by superpopulation."""
counts = (
df.groupby(["Population code", "Population name",
"Superpopulation code", "Superpopulation name"])
.size().reset_index(name="Sample count")
)
fig = px.choropleth(
counts,
locations="Population code", locationmode="ISO-3",
color="Superpopulation code",
hover_name="Population name",
hover_data={"Sample count": True, "Superpopulation name": True},
title="1000 Genomes Project — Sample Origins by Superpopulation",
template="plotly_dark",
color_discrete_sequence=px.colors.qualitative.Bold,
)
fig.write_html(out_dir / "choropleth.html")
def main() -> None:
OUT_DIR.mkdir(exist_ok=True)
eigenval, eigenvec = load_pca(EIGENVAL_FILE, EIGENVEC_FILE)
meta = load_sample_metadata(SAMPLES_FILE)
df = eigenvec.merge(meta, on="Sample name", how="inner")
pc_cols = [f"PC_{i}" for i in range(1, 11)]
df["cluster"] = run_kmeans(df, N_CLUSTERS, pc_cols)
variance = compute_variance_explained(eigenval)
plot_scree(variance, OUT_DIR)
plot_pca_scatter(df, OUT_DIR)
plot_choropleth(df, OUT_DIR)
if __name__ == "__main__":
main()
Interactive Explorer
A Streamlit app built on real 1000 Genomes Phase 3 sample metadata (2,504 individuals, 26 populations). Goes beyond the original analysis with five views:
- PCA — interactive axis picker (PC1–PC10), scree plot with published variance explained values
- UMAP — non-linear dimensionality reduction on PC1–10, revealing sub-structure invisible in PCA (Gujarati split, Mexican/Spanish clustering, Puerto Rican admixture bridge)
- PC Sub-Structure — violin plots of PC3 (African sub-populations), PC4 (South Asian), PC5 (East Asian)
- Admixture Clines — AMR populations (MXL, CLM, PUR, PEL) plotted against their published ancestry proportions; African-American bridge between YRI and GBR
- Dataset — sample counts, gender balance, Asia coverage gap
PCA coordinates are synthetic but anchored to published admixture proportions (Li et al. 2021, Sudmant et al. 2015). UMAP computed via umap-learn on the top 10 PCs.
Domain Positioning
Right now the app is reachable on a Railway deployment URL. For portfolio presentation, the better destination is a dedicated subdomain on the main personal site, for example a genomics-specific hostname that keeps the tool feeling like part of one cohesive portfolio rather than a detached deployment.