Capstone Project

Predicting Genotypes from Allele Intensity Data Using a Support Vector Machine

Author: Daan Leiva
Institution: UCLA

Overview

This project explores the use of Support Vector Machines (SVMs) to predict genotypes from allele intensity data collected with OpenArray™ SNP genotyping technology. The goal is to improve the accuracy of genotype predictions, especially in assays with low signal separation, thereby reducing costs and improving confidence in results.

Objectives

  • Optimize an SVM model (linear vs. RBF) for genotype prediction.
  • Target challenging assays with low Minimum Cluster Sigma Separation (MCSS).
  • Validate the model’s accuracy using k-fold cross-validation.
  • Test prediction performance on non-majority genotype samples.

Data & Methodology

  • Dataset: 3,506 failed assays with MCSS < 5; 263 were labeled discrepant.
  • Features: Fluorescence intensities (VIC for allele 1, FAM for allele 2).
  • Labeling: Genotype ground truth assigned using a 70% majority threshold.
  • Tools: Python (scikit-learn), R, SQLite.
  • Models Tested: Linear SVM (fast but less accurate) vs. RBF SVM (more accurate, slower).
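
The 70% majority labeling rule can be illustrated with a small sketch. The source does not say what the 70% is computed over, so this assumes several genotype calls exist for the same sample and assay, and that a label is kept only when one genotype accounts for at least 70% of them. The function name `majority_genotype` and the sample data are hypothetical.

```python
from collections import Counter

def majority_genotype(calls, threshold=0.70):
    """Return the genotype whose share of calls meets the threshold, else None.

    Assumption: `calls` holds repeated genotype calls ("11", "12", "22")
    for one sample on one assay. The original pipeline may differ.
    """
    if not calls:
        return None
    genotype, count = Counter(calls).most_common(1)[0]
    return genotype if count / len(calls) >= threshold else None

# Hypothetical replicate calls, for illustration only.
replicate_calls = {
    "sample_A": ["11", "11", "11", "12"],  # 3 of 4 agree (75%)
    "sample_B": ["12", "22", "12", "22"],  # no genotype reaches 70%
}

labels = {s: majority_genotype(c) for s, c in replicate_calls.items()}
print(labels)
```

Samples that return `None` have no majority genotype under this rule and would need separate handling.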

After hyperparameter tuning, the best-performing model used an RBF kernel with C = 0.3 and γ = 300.
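
As a minimal sketch of what fitting such a model might look like in scikit-learn, the example below trains an RBF and a linear `SVC` on synthetic two-feature data. Only the kernel choice and the values C = 0.3 and γ = 300 come from the text above. The cluster centers, the noise level, and the assumption that intensities are scaled to roughly [0, 1] are invented for illustration and are not the project's data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for (VIC, FAM) intensity pairs: three partially
# overlapping clusters, one per genotype. These values are assumptions.
centers = {"11": (0.8, 0.2), "12": (0.5, 0.5), "22": (0.2, 0.8)}
X = np.vstack([rng.normal(c, 0.08, size=(100, 2)) for c in centers.values()])
y = np.repeat(list(centers), 100)

# RBF model with the hyperparameters reported in the text.
rbf = SVC(kernel="rbf", C=0.3, gamma=300).fit(X, y)

# Linear model for comparison (gamma does not apply to a linear kernel).
linear = SVC(kernel="linear", C=0.3).fit(X, y)

print("RBF training accuracy:   ", rbf.score(X, y))
print("Linear training accuracy:", linear.score(X, y))
print("Predicted genotypes:", rbf.predict([[0.8, 0.2], [0.2, 0.8]]))
```

Because γ scales squared distances in the RBF kernel, a value as large as 300 is only meaningful relative to the scale of the features, so the same settings would behave very differently on unnormalized intensities.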

Results

  • Overall accuracy: ~74.6%
  • High performance: Homozygous genotypes (11 and 22)
  • Low performance: Heterozygous genotypes (12), due to poor separability
  • Cross-validation: 4-fold CV confirmed generalizability
  • Prediction cohesion: Strong performance even on non-majority genotype samples
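
The per-genotype breakdown and 4-fold cross-validation listed above could be computed along the following lines. This is a sketch on synthetic data, not a reproduction of the reported numbers. The stratified, shuffled split is an assumption, since the source only states that 4-fold CV was used, and the clusters are invented for illustration.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Synthetic (VIC, FAM) intensities; not the project's data.
centers = {"11": (0.8, 0.2), "12": (0.5, 0.5), "22": (0.2, 0.8)}
X = np.vstack([rng.normal(c, 0.1, size=(80, 2)) for c in centers.values()])
y = np.repeat(list(centers), 80)

# 4 folds as in the text; stratification and shuffling are assumptions.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
model = SVC(kernel="rbf", C=0.3, gamma=300)

# Each sample is predicted by a model that never saw it during training.
y_pred = cross_val_predict(model, X, y, cv=cv)

# Per-genotype recall: the fraction of each true genotype predicted correctly.
genotypes = ["11", "12", "22"]
per_class = recall_score(y, y_pred, labels=genotypes, average=None)
for genotype, recall in zip(genotypes, per_class):
    print(f"genotype {genotype}: recall {recall:.3f}")

overall_accuracy = float(np.mean(y_pred == y))
print(f"overall accuracy: {overall_accuracy:.3f}")
```

Reporting recall per genotype rather than a single accuracy figure makes a gap between homozygous and heterozygous performance visible directly.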

Conclusion

The RBF SVM outperformed the linear model in both accuracy and prediction consistency. Homozygous genotype predictions were reliable, but heterozygous samples remain difficult to predict because they separate poorly from the other genotypes. Future improvements could involve additional feature transformations or deeper models.

Acknowledgments

Special thanks to Professor Wei Wang for her mentorship and to Chelsea Ju for her support and collaboration throughout the project.