Research

My research sits at the intersection of computational method development and genomics. I build algorithms and software to detect genetic variants in regions of the genome that are too complex, repetitive, or structurally variable for standard approaches.

Haplotype-resolved variant calling in complex genomic regions

Many important genes reside in segmental duplications or highly repetitive genomic regions where standard short-read aligners and variant callers fail. I develop targeted, haplotype-aware methods that leverage whole-genome sequencing, both short-read and long-read, to produce accurate calls in these challenging loci.

Haplotype-resolved variant calling in segmental duplications using TruPath Genome (formerly Constellation)

TruPath Genome is an on-flowcell proximity sequencing technology. Proximity information from reads sequenced in nearby nanowells can help infer whether a group of reads originated from the same DNA molecule. I designed a multi-region joint detection (MRJD) algorithm that uses this long-range information to produce haplotype-resolved variant calls in segmental duplications.

I presented this work as a platform talk at the 2026 ACMG Annual Clinical Genetics Meeting: “A Rapid, Novel Approach to Rare Disease and Clinical Genetic Variant Discovery using On-flowcell Proximity Sequencing and Haplotype-resolved Variant Calling.” [Abstract]

Other key projects:

  • Alpha-thalassemia (HBA1/2) copy number genotyping — Developed a targeted copy number caller for the alpha-globin locus, one of the most structurally complex and clinically important regions of the genome (~5% global carrier frequency for alpha-thalassemia). [Blog]

  • Lynch syndrome (PMS2) variant detection — Improved variant calling accuracy in PMS2, a mismatch-repair gene with a highly similar pseudogene (PMS2CL) that causes widespread misalignment and false variant calls. [Blog]

Detecting and analyzing transposable elements in Drosophila

Transposable elements (TEs) make up nearly half the human genome and are major drivers of genomic variation. During my Ph.D., I developed computational methods to detect, characterize, and study TEs using long-read sequencing technologies.

Key projects:

  • TELR — A software pipeline for detecting non-reference TE insertions in long-read (PacBio / Oxford Nanopore) sequencing data using local assembly. TELR enables phylogenomic analysis of TE insertions at base-pair resolution. Published in Nucleic Acids Research (2022).

  • ngs_te_mapper2 — A cell-line authentication tool based on TE insertion profiles, used to identify Drosophila cell lines and detect loss of heterozygosity.

  • TE dynamics in Drosophila S2 cell lines — Genomic analysis of 32 whole-genome datasets from D. melanogaster S2 sublines, characterizing ongoing transposition and phylogenetic relationships among laboratory cell cultures.

  • P element target site prediction — Machine learning models trained on 30+ engineered features to predict P element insertion site preferences.

Software & Resources

Tool            Description                     Link
TELR            TE detection in long-read WGS   GitHub
ngs_te_mapper2  Cell-line TE profiling          GitHub
McClintock 2    TE detector benchmarking        GitHub