Improved Methods for Detection and Prioritization of Structural Variants from Long-read Sequencing Data
Wednesday, May 28, 2025
10:00 AM-12:00 PM
BIOMED PhD Thesis Defense
Title:
Improved Methods for Detection and Prioritization of Structural Variants from Long-read (LR) Sequencing Data
Speaker:
Jonathan Elliot Perdomo, PhD Candidate
School of Biomedical Engineering, Science and Health Systems
Drexel University
Advisors:
Kai Wang, PhD
Raymond G. Perelman Center for Cellular and Molecular Therapeutics
Children's Hospital of Philadelphia (CHOP)
Professor of Pathology and Laboratory Medicine
Perelman School of Medicine
University of Pennsylvania
Ming Xiao, PhD
Professor
School of Biomedical Engineering, Science and Health Systems
Drexel University
Details:
Structural variants (SVs) are the largest source of variation in the human genome and are frequently associated with disease phenotypes. Identifying and characterizing SVs is therefore essential for understanding human genome structure and function. Long-read sequencing technologies, such as Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio), provide greater sensitivity than short reads and can resolve complex SVs at base-pair resolution. Widely used long-read SV callers, such as Sniffles2, cuteSV and pbsv, are limited in the size and complexity of the SVs they can detect with high confidence, largely because they use limited alignment information. High-confidence SVs identified with these tools are generally shorter than 50 kb, so large, potentially disease-causing SVs may be overlooked.
This dissertation presents ContextSV, a novel computational method that overcomes these limitations and complements existing tools by combining long-read alignments with copy number predictions from a Hidden Markov Model (HMM). Our method enables the simultaneous analysis of SVs and single-nucleotide variants (SNVs) to provide a more comprehensive understanding of genomic variation. The HMM predicts copy number from read coverage and expected SNV allele frequencies, using ethnicity-specific variant allele frequency information from human population databases such as gnomAD.
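As a rough illustration of how an HMM can turn a coverage signal into copy number calls, the following is a minimal sketch only and is not the ContextSV implementation. It assumes three hypothetical copy-number states, Gaussian emissions on the log2 coverage ratio, and a fixed probability of staying in the same state. It uses coverage alone and omits the SNV allele frequency component described above. All names and parameter values are illustrative assumptions.

```python
import math

# Hypothetical copy-number states with the log2(coverage / diploid coverage)
# expected for each one. These values are assumptions for illustration.
STATES = [1, 2, 3]  # 1 = heterozygous deletion, 2 = normal, 3 = duplication
EXPECTED_LOG2 = {1: -1.0, 2: 0.0, 3: math.log2(3 / 2)}
SIGMA = 0.25  # assumed standard deviation of the coverage signal
STAY = 0.99   # assumed probability of remaining in the same state


def log_gauss(x, mu, sigma):
    """Log density of a normal distribution."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def viterbi_copy_number(log2_ratios):
    """Return the most likely copy-number state for each genomic bin."""
    if not log2_ratios:
        return []

    switch = (1 - STAY) / (len(STATES) - 1)
    log_trans = {
        (a, b): math.log(STAY if a == b else switch)
        for a in STATES
        for b in STATES
    }

    # Initialise with a uniform prior over states.
    score = {
        s: math.log(1 / len(STATES)) + log_gauss(log2_ratios[0], EXPECTED_LOG2[s], SIGMA)
        for s in STATES
    }
    back = []

    # Forward pass: keep the best predecessor for every state at every bin.
    for x in log2_ratios[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + log_trans[(p, s)])
            pointers[s] = best
            score[s] = prev[best] + log_trans[(best, s)] + log_gauss(x, EXPECTED_LOG2[s], SIGMA)
        back.append(pointers)

    # Traceback from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    # Made-up signal: normal coverage, a drop to half coverage, then normal again.
    toy_signal = [0.0, 0.05, -0.02, -1.0, -0.95, -1.05, -0.98, 0.0, 0.03]
    print(viterbi_copy_number(toy_signal))
```

The high self-transition probability is what lets the model report a run of low-coverage bins as one deletion segment instead of reacting to noise in single bins.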
We demonstrate that ContextSV achieves performance comparable to that of major long-read SV callers, and we further highlight its unique advantages in identifying and classifying large inversions and copy number variants (CNVs) that other methods may miss. This dissertation also addresses the low precision that is common in long-read SV callsets. SV callers typically aim to maximize the true positive rate so that rare or clinically relevant SVs are not missed, but this comes at the cost of more false positives and therefore lower precision. To address this, I develop a novel machine learning-based method that assigns SV confidence scores based on genomic context and alignment features. These scores can be used to filter false positives and increase precision in the final long-read SV callset.
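The precision trade-off described above can be made concrete with a small sketch. It assumes each SV call already carries a confidence score between 0 and 1 and a truth label from a benchmark set. The function name, the scores and the labels are made up for illustration and are not results from the dissertation.

```python
def precision_recall(calls, threshold):
    """Compute precision and recall after keeping calls with score >= threshold.

    calls: list of (confidence_score, is_true_positive) pairs.
    """
    kept = [(score, is_true) for score, is_true in calls if score >= threshold]
    true_kept = sum(1 for _, is_true in kept if is_true)
    total_true = sum(1 for _, is_true in calls if is_true)
    precision = true_kept / len(kept) if kept else 0.0
    recall = true_kept / total_true if total_true else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical callset: four true SVs and four false positives.
    toy_calls = [
        (0.95, True), (0.90, True), (0.80, True), (0.60, False),
        (0.55, True), (0.30, False), (0.20, False), (0.10, False),
    ]
    for threshold in (0.0, 0.5, 0.7):
        print(threshold, precision_recall(toy_calls, threshold))
```

Raising the threshold removes low-scoring false positives first, so precision rises, and recall falls only once the threshold starts to exclude true calls.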
Contact Information
Natalia Broz
njb33@drexel.edu