Automated Genomic Wide Variants Analysis and Reporting Pipeline

Thursday, June 4, 2020

12:00 PM-2:00 PM

BIOMED Master's Thesis Defense

Nhat Duong, Master's Candidate
School of Biomedical Engineering, Science and Health Systems
Drexel University

Michael Xie, PhD
Supervisory Bioinformatics Scientist
Department of Biomedical and Health Informatics
Children's Hospital of Philadelphia (CHOP)

The field of bioinformatics offers many pipelines for different purposes. Most of them handle data processing, turning raw files into files that are ready for analysis. Despite the availability of many such pipelines, the steps from processed files to data analysis and a report of the results are still largely done manually. Among the different data analyses, one that is commonly used and applicable across a variety of data is the genetic/mutation load analysis of cohort data (case vs. control groups). It is thus desirable to produce a fully automated and flexible pipeline capable of going from the variant discovery VCF file and a BED file all the way to a final interactive report file that can be presented to clinicians or principal investigators.

The pipeline was built using Snakemake as the workflow management tool. It starts with a VCF file derived from the user’s preferred variant discovery method. This file is filtered, reformatted, and statistically analyzed. The pipeline also takes in copy number variation information in the form of a BED file and analyzes it independently of the short variant data. The MetaP Fisher method was added to the pipeline to combine the p-values from the two independent analyses (short variant and copy number variation), which can strengthen any positive results. Results from these analyses are stored in a database and displayed in an interactive webpage.
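As a rough illustration of how p-values from two independent analyses can be combined, the sketch below implements Fisher's method using only the Python standard library. It is not the pipeline's own code: the function name `fisher_combined_p` and the example p-values are assumptions made for illustration. Fisher's statistic is X = -2 * sum(ln p_i), which follows a chi-square distribution with 2k degrees of freedom for k independent p-values; because the degrees of freedom are even, the tail probability has a closed form.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method (illustrative sketch)."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    for p in p_values:
        if not 0.0 < p <= 1.0:
            raise ValueError("p-values must lie in (0, 1]")

    k = len(p_values)
    # Half of Fisher's statistic: X / 2 = -sum(ln p_i).
    half_stat = -sum(math.log(p) for p in p_values)

    # Chi-square survival function with 2k degrees of freedom:
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Hypothetical p-values for one gene: short-variant and copy-number analyses.
combined = fisher_combined_p([0.05, 0.05])
print(f"combined p-value: {combined:.6f}")
```

Two modest p-values of 0.05 combine to a smaller p-value, which is the sense in which agreement between the two analyses can strengthen a positive result. The method assumes the input p-values are independent.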

The pipeline was applied to study congenital heart defects (CHD) in a cohort of 22q11.2 deletion syndrome patients. 22q11.2 deletion syndrome (22q11.2DS) is the absence of a DNA segment, roughly 3 million base pairs in size, on one copy of chromosome 22. Whole exome sequencing samples from 380 22q11.2DS patients were used as the VCF input. After QC, 147 case samples (22q11.2DS patients with CHD) and 132 control samples (22q11.2DS patients without CHD) remained. The pipeline compared the genetic loads of the case and control cohorts by performing gene-gene and functional term analyses. The goal was to identify mutations that are significantly over-represented in cases compared to controls.
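The abstract does not state which statistical test the pipeline uses for the case vs. control comparison. As one common choice for this kind of over-representation question, the sketch below runs a one-sided Fisher's exact test on carrier counts for a single gene. The function name `overrepresentation_p` and the carrier counts are hypothetical; only the cohort sizes (147 cases, 132 controls) come from the text above.

```python
from math import comb


def overrepresentation_p(case_carriers, case_total, control_carriers, control_total):
    """One-sided Fisher's exact test: are carriers over-represented in cases?

    Returns P(X >= case_carriers), where X is hypergeometric: the number of
    carriers that land in the case group when carrier status is unrelated
    to case/control status.
    """
    n_samples = case_total + control_total
    n_carriers = case_carriers + control_carriers

    # Sum the upper tail with exact integer arithmetic, then divide once.
    numerator = 0
    for x in range(case_carriers, min(n_carriers, case_total) + 1):
        numerator += comb(n_carriers, x) * comb(n_samples - n_carriers, case_total - x)
    return numerator / comb(n_samples, case_total)


# Hypothetical counts for one gene: 20 of 147 cases and 5 of 132 controls
# carry a qualifying variant.
p = overrepresentation_p(20, 147, 5, 132)
print(f"one-sided p-value: {p:.4g}")
```

In a cohort study a test like this would be repeated for every gene or functional term, so the resulting p-values would normally need a multiple-testing correction before being called significant.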

A successful run of the pipeline’s gene-gene analysis identified a gene cluster on chromosome 9, mostly on 9q, that is significantly over-represented in 22q11.2DS patients with CHD. Most genes in this cluster are closely associated with early cell and embryonic development. The functional analysis, using GO terms and mammalian phenotypes, also found numerous cardiac-related functions significantly over-represented in 22q11.2DS patients with CHD. Some of the most significant functions include heart trabecula and cardiac ventricle morphogenesis. All of these results were reported in an HTML file. These results suggest that further interrogation of the loci on human chromosome 9 and of the cardiac functions identified by the functional analysis is warranted. Most importantly, however, they show that the pipeline was successful in performing a cohort study and generating an interactive report file.

Contact Information

Natalia Broz

  • Undergraduate Students
  • Graduate Students
  • Faculty
  • Staff