Capstone Project

Predicting Genotypes from Allele Intensity Data Using a Support Vector Machine

Author: Daan Leiva
Institution: UCLA

Overview

This project explores the use of Support Vector Machines (SVMs) to predict genotypes from allele intensity data collected with OpenArray™ SNP genotyping technology. The goal is to improve the accuracy of genotype predictions, especially in assays with low signal separation, and thereby reduce costs and increase confidence in results.

Objectives

Optimize an SVM model (linear vs. RBF) for genotype prediction.
Target challenging assays with low Minimum Cluster Sigma Separation (MCSS).
Validate the model’s accuracy using k-fold cross-validation.
Test prediction performance on non-majority genotype samples.

Data & Methodology

Dataset: 3,506 failed assays with MCSS < 5; 263 were labeled discrepant.
Features: Fluorescence intensities (VIC for allele 1, FAM for allele 2).
Labeling: Genotype ground truth assigned using a 70% majority threshold.
Tools: Python (scikit-learn), R, SQLite.
Models Tested: Linear SVM (fast but less accurate) vs. RBF SVM (more accurate, slower).
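
The majority-threshold labeling step can be illustrated with a minimal sketch. It assumes each sample has several repeated genotype calls and that a label is kept only when the most common call accounts for at least 70% of them. The function name `majority_genotype`, the input format, and the "at least" comparison are assumptions made for illustration, not details taken from the project.

```python
from collections import Counter


def majority_genotype(calls, threshold=0.70):
    """Return the most common genotype if its share of calls meets the threshold.

    Returns None when there are no calls or no genotype reaches the threshold.
    (Hypothetical helper; the project's actual labeling code may differ.)
    """
    if not calls:
        return None
    genotype, count = Counter(calls).most_common(1)[0]
    return genotype if count / len(calls) >= threshold else None


# Made-up repeated calls for a single sample: 4 of 5 agree on "11" (80%).
calls = ["11", "11", "11", "12", "11"]
label = majority_genotype(calls)
print(label)
```

Samples for which the function returns None would have no majority genotype under this assumed rule.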

After hyperparameter tuning, the best-performing model used an RBF kernel with C = 0.3 and γ = 300.
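
As a rough sketch of how the two kernels could be compared in scikit-learn, the example below fits a linear and an RBF `SVC` on synthetic two-feature data and scores each with 4-fold cross-validation. Only the RBF settings (C = 0.3, γ = 300) come from the text above. The synthetic cluster centers, the noise level, the scaling of intensities to roughly [0, 1], and the linear model's C value are all assumptions, so the printed scores say nothing about the project's real data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for (VIC, FAM) intensity pairs; centers and spread are invented.
centers = {"11": (0.8, 0.2), "12": (0.5, 0.5), "22": (0.2, 0.8)}
X = np.vstack([rng.normal(c, 0.05, size=(100, 2)) for c in centers.values()])
y = np.repeat(list(centers.keys()), 100)

models = {
    # C for the linear kernel is an assumption; the source reports only the RBF values.
    "linear": SVC(kernel="linear", C=0.3),
    "rbf": SVC(kernel="rbf", C=0.3, gamma=300),
}

# cross_val_score uses stratified folds for classifiers, so each fold sees all genotypes.
scores = {name: cross_val_score(model, X, y, cv=4) for name, model in models.items()}
for name, fold_scores in scores.items():
    print(f"{name}: mean accuracy {fold_scores.mean():.3f}")
```

A γ as large as 300 gives a very narrow kernel, which is only sensible when the features span a small numeric range; this is why the sketch assumes normalized intensities.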

Results

Overall accuracy: ~74.6%
High performance: Homozygous genotypes (11 and 22)
Low performance: Heterozygous genotypes (12), due to poor separability
Cross-validation: 4-fold cross-validation indicated that the model generalizes to held-out data
Prediction cohesion: Strong performance even on non-majority genotype samples
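
The gap between homozygous and heterozygous performance is easiest to see when accuracy is broken down by true genotype. The sketch below shows one way to compute that breakdown. The helper name `per_genotype_accuracy` is hypothetical and the labels are made up for illustration; they are not results from the project.

```python
from collections import defaultdict


def per_genotype_accuracy(y_true, y_pred):
    """Return the fraction of correct predictions for each true genotype."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {genotype: correct[genotype] / totals[genotype] for genotype in totals}


# Invented labels: one heterozygous ("12") sample is misclassified as "22".
y_true = ["11", "11", "12", "12", "22", "22"]
y_pred = ["11", "11", "12", "22", "22", "22"]
accuracy = per_genotype_accuracy(y_true, y_pred)
print(accuracy)
```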

Conclusion

The RBF SVM outperformed the linear model in both accuracy and prediction consistency. Homozygous genotype predictions were reliable, but heterozygous samples remain difficult to predict because they separate poorly from the other genotypes. Future improvements could involve additional feature transformations or deeper models.

Acknowledgments

Special thanks to Professor Wei Wang for her mentorship and to Chelsea Ju for her support and collaboration throughout the project.