My research sits at the intersection of computational method development and genomics. I build algorithms and software to detect genetic variants in regions of the genome that are too complex, repetitive, or structurally variable for standard approaches.
Haplotype-resolved variant calling in complex genomic regions
Many important genes reside in segmental duplications or highly repetitive genomic regions where standard short-read aligners and variant callers fail. I develop targeted, haplotype-aware methods that leverage whole-genome sequencing, both short-read and long-read, to produce accurate calls in these challenging loci.
Haplotype-resolved variant calling in segmental duplications using TruPath Genome (formerly Constellation)
TruPath Genome is an on-flowcell proximity sequencing technology. Proximity information from reads sequenced in nearby nanowells helps infer whether a group of reads came from the same original DNA molecule. I designed a multi-region joint detection (MRJD) algorithm that uses this long-range information from TruPath Genome to produce haplotype-resolved variant calls in segmental duplications.
I presented this work as a platform talk at the 2026 ACMG Annual Clinical Genetics Meeting: “A Rapid, Novel Approach to Rare Disease and Clinical Genetic Variant Discovery using On-flowcell Proximity Sequencing and Haplotype-resolved Variant Calling.” [Abstract]
Other key projects:
- Alpha-thalassemia (HBA1/2) copy number genotyping: Developed a targeted copy number caller for the alpha-globin locus, one of the most structurally complex and clinically important regions of the genome (~5% global carrier frequency for alpha-thalassemia). [Blog]
- Lynch syndrome (PMS2) variant detection: Improved variant calling accuracy in PMS2, a mismatch-repair gene with a highly similar pseudogene (PMS2CL) that causes widespread misalignment and false variant calls. [Blog]
Detecting and analyzing transposable elements in Drosophila
Transposable elements (TEs) make up nearly half the human genome and are major drivers of genomic variation. During my Ph.D., I developed computational methods to detect, characterize, and study TEs using long-read sequencing technologies.
Key projects:
- TELR: A software pipeline for detecting non-reference TE insertions in long-read (PacBio / Oxford Nanopore) sequencing data using local assembly. TELR enables phylogenomic analysis of TE insertions at base-pair resolution. Published in Nucleic Acids Research (2022).
- ngs_te_mapper2: A cell-line authentication tool based on TE insertion profiles, used to identify Drosophila cell lines and detect loss of heterozygosity.
- TE dynamics in Drosophila S2 cell lines: Genomic analysis of 32 whole-genome datasets from D. melanogaster S2 sublines, characterizing ongoing transposition and phylogenetic relationships among laboratory cell cultures.
- P element target site prediction: Machine learning models trained on 30+ engineered features to predict P element insertion site preferences.