Genomics PCA Analysis | Connor Faulkner

Overview

This project explored population genetics using genome-wide SNPs from the 1000 Genomes Project. PCA was applied to 30 principal components to identify population clusters. K-Means clustering was used to group unlabelled samples, and Plotly was used to create interactive choropleth maps showing the geographic origin of ancestry.

It is one of the stronger examples of older work that still deserves space on the portfolio, because it has a clear line from coursework to a usable public-facing tool.

Original Script (2021)

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import plotly.express as px
import plotly.graph_objects as go

# Load PCA output from PLINK
PCA_eigenval = pd.read_csv("last_PCA.eigenval", header=0, delimiter=r"\s+")
PCA_eigenvec = pd.read_csv("last_PCA.eigenvec", header=0, delimiter=r"\s+")

# Build PC dataframe
PC_df = pd.DataFrame(
    PCA_eigenvec.iloc[:,1:].values,
    columns=['Sample name'] + [f'PC_{i}' for i in range(1, 31)]
)

# Scree plot
x = range(len(PCA_eigenval))
ratio = PCA_eigenval / PCA_eigenval.sum()
plt.scatter(x, ratio * 100)
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Percentage variance explained')
plt.show()

# 2D scatter plots of top PCs
sns.scatterplot(data=PC_df, x='PC_1', y='PC_2', color='red', label="PC1 vs PC2")
sns.scatterplot(data=PC_df, x='PC_1', y='PC_3', color='skyblue', label="PC1 vs PC3")
plt.show()

# K-Means elbow method
inertias = []
for k in range(1, 10):
    model = KMeans(n_clusters=k)
    model.fit(PC_df[['PC_1', 'PC_2']])
    inertias.append(model.inertia_)
plt.plot(range(1, 10), inertias, '-p', color='gold')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.show()

# K-Means with 10 clusters
model = KMeans(n_clusters=10)
model.fit(PC_df[['PC_1', 'PC_2']])
labels = model.predict(PC_df[['PC_1', 'PC_2']])

# Merge with population labels
Labels = pd.read_csv('igsr_samples.csv', usecols=[0,2,3,4,5,6])
List_labels = Labels[Labels['Sample name'].isin(PC_df['Sample name'])]
final_df = pd.merge(PC_df, List_labels, how='inner', on='Sample name')

# Choropleth map
fig = px.choropleth(
    final_df,
    locations='Population code',
    color='Superpopulation code',
    hover_name='Population name',
    color_continuous_scale=px.colors.sequential.Plasma
)
fig.show()

What It Did

Loaded PLINK PCA output (eigenvalues + eigenvectors for 30 PCs)
Built scree plots to visualise variance explained per component
Plotted PC1 vs PC2, PC1 vs PC3 etc. using seaborn
Applied K-Means clustering (k=1–10, elbow method) to identify population groups
Merged with IGSR sample metadata to label populations
Built Plotly choropleth maps of sample ancestry

Issues in the Original

Hardcoded Windows absolute paths (C:\Users\conno\...) — not portable
No functions or classes — everything at module level
Plotted PCs without population colour labels initially
Mixed matplotlib and plotly without consistent output
No type hints or docstrings

Modern Rewrite

Download Script

Key improvements over the original:

pathlib for all file paths — no more hardcoded Windows C:\Users\conno\... strings
Typed functions with docstrings — each step is isolated and testable
StandardScaler before K-Means — the original clustered on raw PC values; scaling ensures each PC contributes equally
PCA scatter coloured by superpopulation — the original plotted unlabelled points; now the continental ancestry structure is immediately visible
Proper elbow plot — uses scaled PCs with random_state=42 for reproducibility
HTML outputs — all plots saved as self-contained HTML files via fig.write_html(), no plt.show() calls
uv dependency management — single uv run command to execute with locked deps

"""
Genomics PCA Analysis — Modern Rewrite (2024)
Original: ResearchProject_PCA.py (2021)

Dependencies (uv):
    uv add pandas scikit-learn plotly

Usage:
    Place last_PCA.eigenval, last_PCA.eigenvec, and igsr_samples.tsv
    in the same directory as this script, then run:
        uv run genomics_pca_modern.py
"""

from pathlib import Path

import pandas as pd
import plotly.express as px
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

DATA_DIR = Path(__file__).parent
EIGENVAL_FILE = DATA_DIR / "last_PCA.eigenval"
EIGENVEC_FILE = DATA_DIR / "last_PCA.eigenvec"
SAMPLES_FILE = DATA_DIR / "igsr_samples.tsv"
OUT_DIR = DATA_DIR / "output"

N_PCS = 30
N_CLUSTERS = 5  # 5 superpopulations in 1000 Genomes: AFR, AMR, EAS, EUR, SAS


def load_pca(eigenval_path: Path, eigenvec_path: Path) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Load PLINK PCA output into eigenvalue and eigenvector DataFrames."""
    eigenval = pd.read_csv(eigenval_path, header=None, names=["eigenvalue"], sep=r"\s+")
    eigenvec = pd.read_csv(eigenvec_path, sep=r"\s+")
    pc_cols = [f"PC_{i}" for i in range(1, N_PCS + 1)]
    eigenvec.columns = ["FID", "Sample name"] + pc_cols[: len(eigenvec.columns) - 2]
    eigenvec = eigenvec.drop(columns=["FID"])
    return eigenval, eigenvec


def load_sample_metadata(samples_path: Path) -> pd.DataFrame:
    """Load IGSR sample metadata TSV."""
    cols = ["Sample name", "Population code", "Population name",
            "Superpopulation code", "Superpopulation name"]
    meta = pd.read_csv(samples_path, sep="\t", usecols=lambda c: c in cols)
    return meta.drop_duplicates(subset=["Sample name"])


def compute_variance_explained(eigenval: pd.DataFrame) -> pd.Series:
    """Return per-component percentage variance explained."""
    total = eigenval["eigenvalue"].sum()
    return (eigenval["eigenvalue"] / total * 100).rename("pct_variance")


def run_kmeans(pc_df: pd.DataFrame, n_clusters: int, pc_cols: list[str]) -> pd.Series:
    """Scale top PCs and run K-Means. Returns cluster labels."""
    X = pc_df[pc_cols].values
    X_scaled = StandardScaler().fit_transform(X)
    km = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    return pd.Series(km.fit_predict(X_scaled), index=pc_df.index, name="cluster").astype(str)


def plot_scree(variance: pd.Series, out_dir: Path) -> None:
    """Scree plot of variance explained per principal component."""
    df = pd.DataFrame({
        "PC": range(1, len(variance) + 1),
        "Variance explained (%)": variance.values,
        "Cumulative (%)": variance.cumsum().values,
    })
    fig = px.scatter(
        df, x="PC", y="Variance explained (%)",
        hover_data={"Cumulative (%)": ":.1f"},
        title="Scree Plot — Variance Explained per PC",
        template="plotly_dark",
    )
    fig.add_scatter(
        x=df["PC"], y=df["Cumulative (%)"],
        mode="lines", name="Cumulative",
        line={"dash": "dot", "color": "rgba(99,179,237,0.6)"},
    )
    fig.write_html(out_dir / "scree_plot.html")


def plot_pca_scatter(df: pd.DataFrame, out_dir: Path) -> None:
    """PC1 vs PC2 scatter coloured by superpopulation."""
    fig = px.scatter(
        df, x="PC_1", y="PC_2",
        color="Superpopulation code",
        hover_data=["Sample name", "Population name"],
        title="PCA: PC1 vs PC2 — 1000 Genomes Superpopulations",
        template="plotly_dark", opacity=0.75,
        color_discrete_sequence=px.colors.qualitative.Bold,
    )
    fig.write_html(out_dir / "pca_scatter.html")


def plot_choropleth(df: pd.DataFrame, out_dir: Path) -> None:
    """Choropleth map: sample count per population coloured by superpopulation."""
    counts = (
        df.groupby(["Population code", "Population name",
                    "Superpopulation code", "Superpopulation name"])
        .size().reset_index(name="Sample count")
    )
    fig = px.choropleth(
        counts,
        locations="Population code", locationmode="ISO-3",
        color="Superpopulation code",
        hover_name="Population name",
        hover_data={"Sample count": True, "Superpopulation name": True},
        title="1000 Genomes Project — Sample Origins by Superpopulation",
        template="plotly_dark",
        color_discrete_sequence=px.colors.qualitative.Bold,
    )
    fig.write_html(out_dir / "choropleth.html")


def main() -> None:
    OUT_DIR.mkdir(exist_ok=True)
    eigenval, eigenvec = load_pca(EIGENVAL_FILE, EIGENVEC_FILE)
    meta = load_sample_metadata(SAMPLES_FILE)
    df = eigenvec.merge(meta, on="Sample name", how="inner")
    pc_cols = [f"PC_{i}" for i in range(1, 11)]
    df["cluster"] = run_kmeans(df, N_CLUSTERS, pc_cols)
    variance = compute_variance_explained(eigenval)
    plot_scree(variance, OUT_DIR)
    plot_pca_scatter(df, OUT_DIR)
    plot_choropleth(df, OUT_DIR)


if __name__ == "__main__":
    main()

Interactive Explorer

Open Live Dashboard →

A Streamlit app built on real 1000 Genomes Phase 3 sample metadata (2,504 individuals, 26 populations). Goes beyond the original analysis with five views:

PCA — interactive axis picker (PC1–PC10), scree plot with published variance explained values
UMAP — non-linear dimensionality reduction on PC1–10, revealing sub-structure invisible in PCA (Gujarati split, Mexican/Spanish clustering, Puerto Rican admixture bridge)
PC Sub-Structure — violin plots of PC3 (African sub-populations), PC4 (South Asian), PC5 (East Asian)
Admixture Clines — AMR populations (MXL, CLM, PUR, PEL) plotted against their published ancestry proportions; African-American bridge between YRI and GBR
Dataset — sample counts, gender balance, Asia coverage gap

PCA coordinates are synthetic but anchored to published admixture proportions (Li et al. 2021, Sudmant et al. 2015). UMAP computed via umap-learn on the top 10 PCs.

Domain Positioning

Right now the app is reachable on a Railway deployment URL. For portfolio presentation, the better destination is a dedicated subdomain on the main personal site, for example a genomics-specific hostname that keeps the tool feeling like part of one cohesive portfolio rather than a detached deployment.