Ellingson, Sally

Markey Cancer Center, Markey Cancer, Biostatistics

Convex-hull voting method on a large data set


Genes work in concert as a system, not as independent entities, to mediate disease states. There has been considerable interest in understanding variations in molecular signatures between normal and disease states. The selective-voting convex-hull ensemble procedure accommodates molecular heterogeneity within and between groups and allows retrieval of sample-specific sets and investigation of variations in individual networks relevant to personalized medicine.

Normalized RNA-seq data for 208 samples (104 matched normal/tumor pairs) from the TCGA breast carcinoma data set were downloaded and analyzed with the edgeR package, which identified 2,882 differentially expressed genes with at least a 2-fold difference between tumor and normal samples at a 1% false discovery rate. The convex-hull voting method was applied to data from the differentially expressed genes [1].
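The two selection criteria above (at least a 2-fold difference and a 1% false discovery rate) can be illustrated with a minimal sketch. It assumes per-gene results are already available as a log2 fold change and an FDR value; the gene names, the numbers, and the function name `is_differentially_expressed` are hypothetical, and whether the original cutoffs were inclusive is also an assumption.

```python
import math

def is_differentially_expressed(log2_fold_change, fdr, min_fold=2.0, max_fdr=0.01):
    """Return True when a gene passes both the fold-change and FDR cutoffs.

    A 2-fold difference in either direction means |log2 fold change| >= 1.
    Inclusive comparisons are an assumption of this sketch.
    """
    return abs(log2_fold_change) >= math.log2(min_fold) and fdr <= max_fdr

# Hypothetical per-gene results: (gene, log2 fold change tumor vs. normal, FDR).
results = [
    ("GENE_A", 2.3, 0.0004),  # strongly up in tumor and significant
    ("GENE_B", -1.0, 0.009),  # exactly 2-fold down and significant
    ("GENE_C", 0.6, 0.0001),  # significant, but less than 2-fold
    ("GENE_D", 3.1, 0.2),     # large change, but not significant
]

selected = [gene for gene, lfc, fdr in results
            if is_differentially_expressed(lfc, fdr)]
print(selected)  # ['GENE_A', 'GENE_B']
```

Filtering on both criteria keeps genes whose change is both large and statistically supported, which is why GENE_C and GENE_D are dropped here.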

The algorithm used in this study is an adaptation of the previously published adaptive voting convex-hull method [2]. In this algorithm, the samples are split into test and training sets. The training samples are used to build one convex hull for the control samples and one for the cancer samples, and the hulls are trimmed until they no longer overlap. A test sample that falls in either the control or the cancer hull is voted on accordingly. The points in each convex hull correspond to the RNA-seq values for a pair of markers projected onto a two-dimensional plane, so votes correspond to a pair of markers and not to an individual marker. To remove potential biases introduced during training, this voting process is repeated many times while randomizing the order of the samples. The intersection is then taken of all marker pairs that vote the same way across all iterations.
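The voting steps described above can be sketched as follows. This is a simplified illustration, not the published implementation: the trimming rule (drop training points that fall inside the other class's hull and rebuild) is an assumption, since the text does not specify how hulls are trimmed, and the function names, marker names, and toy values are all hypothetical.

```python
from itertools import combinations

def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain: hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside(hull, p):
    """True if p is inside or on the boundary of a hull with at least 3 vertices."""
    n = len(hull)
    return n >= 3 and all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))

def trim(control_pts, cancer_pts):
    """Assumed trimming rule: drop training points that fall inside the other
    class's hull, rebuild both hulls, and repeat until no such points remain."""
    control_pts, cancer_pts = list(control_pts), list(cancer_pts)
    while True:
        h_ctrl, h_canc = convex_hull(control_pts), convex_hull(cancer_pts)
        keep_ctrl = [p for p in control_pts if not inside(h_canc, p)]
        keep_canc = [p for p in cancer_pts if not inside(h_ctrl, p)]
        if len(keep_ctrl) == len(control_pts) and len(keep_canc) == len(cancer_pts):
            return h_ctrl, h_canc
        control_pts, cancer_pts = keep_ctrl, keep_canc

def vote(h_ctrl, h_canc, p):
    """Vote for the single hull containing p; abstain (None) otherwise."""
    in_ctrl, in_canc = inside(h_ctrl, p), inside(h_canc, p)
    if in_ctrl != in_canc:
        return "control" if in_ctrl else "cancer"
    return None

def pair_votes(ctrl_samples, canc_samples, test_sample, markers):
    """One vote per marker pair: each pair defines its own 2-D plane and hulls."""
    votes = {}
    for a, b in combinations(markers, 2):
        h_ctrl, h_canc = trim([(s[a], s[b]) for s in ctrl_samples],
                              [(s[a], s[b]) for s in canc_samples])
        votes[(a, b)] = vote(h_ctrl, h_canc, (test_sample[a], test_sample[b]))
    return votes

def consistent_pairs(runs):
    """Keep marker pairs that cast the same non-abstaining vote in every run."""
    return {pair: v for pair, v in runs[0].items()
            if v is not None and all(r.get(pair) == v for r in runs)}

# Toy expression values for three hypothetical markers.
markers = ["g1", "g2", "g3"]
control = [dict(zip(markers, v)) for v in [(0, 0, 0), (2, 0, 2), (2, 2, 0), (0, 2, 2)]]
cancer = [dict(zip(markers, v)) for v in [(5, 5, 5), (7, 5, 7), (7, 7, 5), (5, 7, 7)]]
test_sample = {"g1": 1, "g2": 1, "g3": 6}

run_a = pair_votes(control, cancer, test_sample, markers)
# Reversed sample order stands in for a re-randomized iteration.
run_b = pair_votes(control[::-1], cancer[::-1], test_sample, markers)
print(consistent_pairs([run_a, run_b]))  # {('g1', 'g2'): 'control'}
```

In this sketch only the (g1, g2) pair places the test sample inside a hull, so it is the only pair that votes. Note that the assumed trimming rule removes contained training points but does not guarantee that two hulls never cross geometrically, which is why a sample found in both hulls abstains.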

Current work uses parallelization techniques in R and manages multiple job submissions, taking advantage of the large number of allowable jobs on DLX, to run the algorithm on a large dataset. Future work will involve parallelizing all of the computationally and data-intensive steps in a way that reduces the complexity of job submission and improves the scalability of the entire job. Computing paradigms such as Hadoop are being explored for this task. Additional classification algorithms may also be introduced to this ensemble-voting scheme.
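To give a sense of the workload that motivates splitting the run across many jobs, the sketch below divides marker pairs into contiguous index ranges, one per job. It assumes that every unordered pair of the 2,882 differentially expressed genes is evaluated; the job count of 500 and the function name `job_ranges` are hypothetical and do not reflect the actual DLX configuration.

```python
import math

def job_ranges(n_items, n_jobs):
    """Split range(n_items) into n_jobs contiguous, near-equal (start, stop) ranges."""
    base, extra = divmod(n_items, n_jobs)
    ranges, start = [], 0
    for job in range(n_jobs):
        # The first `extra` jobs take one additional item each.
        size = base + (1 if job < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

n_genes = 2882                            # differentially expressed genes in the study
n_pairs = math.comb(n_genes, 2)           # every unordered marker pair (assumption)
ranges = job_ranges(n_pairs, n_jobs=500)  # 500 jobs is a hypothetical choice

print(n_pairs)     # 4151521
print(ranges[0])   # (0, 8304)
print(ranges[-1])  # (4143218, 4151521)
```

Each job would then process only the pairs whose index falls in its range, so the jobs are independent and can be submitted separately.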

Investigators:

Chi Wang, PhD and Radhakrishnan Nagarajan, PhD

Data Quality Control and GWAS


This project involves the automation of data quality control and analysis pipelines for Genome-Wide Association Studies. This allows for an extensive, efficient, and reproducible exploration of the effects of different quality control and analysis measures.
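As a rough illustration of what exploring different quality control measures can look like, the sketch below applies two common SNP-level filters (call rate and minor allele frequency) under several threshold settings. The specific filters, thresholds, genotype coding, and all names are assumptions made for this example; the text does not state which measures the pipeline uses.

```python
def call_rate(genotypes):
    """Fraction of samples with a non-missing genotype (None marks missing)."""
    return sum(g is not None for g in genotypes) / len(genotypes)

def minor_allele_frequency(genotypes):
    """MAF from genotypes coded as 0, 1, or 2 copies of the alternate allele."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)

def passing_snps(snps, min_call_rate, min_maf):
    """Names of SNPs that meet both thresholds."""
    return [name for name, g in snps.items()
            if call_rate(g) >= min_call_rate
            and minor_allele_frequency(g) >= min_maf]

# Hypothetical genotypes for three SNPs across ten samples.
snps = {
    "snp1": [0, 1, 2, 1, 0, 1, 0, 2, 1, 0],           # fully called, common variant
    "snp2": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],           # fully called, rare variant
    "snp3": [1, None, None, 2, 1, 0, None, 1, 2, 1],  # 30% missing
}

# Re-running the same data under different settings shows each filter's effect.
for min_call_rate, min_maf in [(0.95, 0.01), (0.95, 0.10), (0.60, 0.01)]:
    print(min_call_rate, min_maf, passing_snps(snps, min_call_rate, min_maf))
```

Keeping the thresholds as parameters is what makes such a comparison reproducible: the same function is rerun with a different setting rather than edited by hand.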

Investigators:

Dave Fardo

Development of “Gold Standard” Next Generation Sequencing (NGS) Data Sets


As sequencing costs drop, improved management of data analysis and storage will be essential for state-of-the-art research and for efficient clinical decision-making based on NGS. A common challenge is identifying the sequence variations that may cause particular traits or diseases; these could be single nucleotide polymorphisms (SNPs), indels (insertions or deletions), or structural variations (such as swapping of the location of genes). All of these areas are still being actively researched. New methods are being developed to address experimental errors in base calling and computational errors in read alignment. Different sequencing technologies have been shown to produce different SNP calls [3], with as many as tens of thousands of SNPs being called only on a specific sequencing platform [4]. In addition to variations resulting from different sequencing technologies, different SNP-calling pipelines may give drastically different results. Using five different pipelines and fifteen samples from the same sequencing technology, an average concordance of only 57.4% was found for called SNPs [5]. Even more worrisome, three indel-calling pipelines gave an average concordance of only 26.8% for called indels. These large differences in results show how important benchmark data will be in testing new pipelines and technologies. As genetic data are now being used to make decisions, it is very important to use well-established, tested, and verified methods while establishing and maintaining competency in the state of the art in both the technology and the analysis.
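One way to quantify agreement between pipelines is sketched below: the share of all distinct variants that every pipeline called. This definition is an assumption chosen for illustration and may differ from the one used in the cited study; the variant records and pipeline names are hypothetical.

```python
def concordance(call_sets):
    """Share of all distinct variants that every pipeline called."""
    shared = set.intersection(*call_sets)
    union = set.union(*call_sets)
    return len(shared) / len(union) if union else 0.0

# Hypothetical calls, each variant keyed as (chromosome, position, ref, alt).
pipeline_a = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
              ("chr2", 40, "G", "A"), ("chr3", 75, "T", "C")}
pipeline_b = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
              ("chr2", 40, "G", "A"), ("chr5", 12, "A", "T")}
pipeline_c = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
              ("chr7", 900, "G", "C")}

score = concordance([pipeline_a, pipeline_b, pipeline_c])
print(f"{score:.1%}")  # 33.3%
```

Here two of the six distinct variants are called by all three pipelines. A benchmark data set with known true variants would additionally show which of the disputed calls are correct, which concordance alone cannot do.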

This research plan proposes that the development of “gold standard” data sets for various Next Generation Sequencing (NGS) studies will allow for efficient testing and benchmarking of new bioinformatics tools, algorithms, and emerging computational platforms. This project represents a first step in building NGS infrastructure for researchers and clinicians at the University of Kentucky.

Investigators:

Xiaofei Zhang

Software:

GPUs
NAMD

Using large public data repositories to discover novel genetic mutations with prospective links to melanoma


This study extends research on the causal relation between changes in the Ataxia Telangiectasia and Rad3 related (ATR) pathways and melanoma. To study the effects of mutations in the ATR gene region on melanoma, the Melanoma Genome Sequencing Project dataset (dbGaP Study Accession: phs000452.v1.p1) was used along with an available NGS data analysis pipeline that was previously developed for a lung cancer full exome-sequencing project by the Biostatistics and Bioinformatics Shared Resource Facility of the Markey Cancer Center.

Investigators:

Tamas S Gal, Chi Wang, Jinpeng Liu, Stuart G Jarrett, John A D’Orazio

References


1. Cancer Genome Atlas Network, Comprehensive molecular portraits of human breast tumours. Nature 2012, 490 (7418), 61-70.
2. Nagarajan, R.; Kodell, R. L., A Selective Voting Convex-Hull Ensemble Procedure for Personalized Medicine. AMIA Summits on Translational Science Proceedings 2012, 2012, 87.
3. Rieber, N.; Zapatka, M.; Lasitschka, B.; Jones, D.; Northcott, P.; Hutter, B.; Jäger, N.; Kool, M.; Taylor, M.; Lichter, P., Coverage bias and sensitivity of variant calling for four whole-genome sequencing technologies. PLoS ONE 2013, 8 (6), e66621.
4. Lam, H. Y.; Clark, M. J.; Chen, R.; Chen, R.; Natsoulis, G.; O'Huallachain, M.; Dewey, F. E.; Habegger, L.; Ashley, E. A.; Gerstein, M. B., Performance comparison of whole-genome sequencing platforms. Nature Biotechnology 2012, 30 (1), 78-82.
5. O’Rawe, J.; Jiang, T.; Sun, G.; Wu, Y.; Wang, W.; Hu, J.; Bodily, P.; Tian, L.; Hakonarson, H.; Johnson, W. E., Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome Med 2013, 5 (3), 28.

Genomics in Cancer for the Appalachian population of Kentucky

Appalachian Kentucky is home to some of the highest incidence rates of lung, colon and other lethal cancers in the United States. A key resource for this proposed project is the Markey Cancer Center’s (MCC) broad portfolio of translational research studies in Appalachia, with directed collections of biospecimens (blood, cancer tissues, toenails, urine) in addition to epidemiologic and demographic information in both cancer and normal volunteers from the region, including the Lung Cancer Research Initiative (LCRI). Additionally, the MCC Biospecimen and Tissue Procurement Shared Resource Facility’s (BSTP SRF) general banking collection has thousands of samples from cancer patients in Appalachia. We propose a pilot study of whole-exome sequencing of matched lung cancer and normal lung tissues collected by these resources and annotated with the aid of the Kentucky Cancer Registry. Our goal is to identify novel genetic aberrations that are unique to lung cancer in the Appalachian population.



Data analysis will be performed using a combination of currently available software and customized C and R scripts. Sequence reads will be aligned to the human genome (UCSC hg19) using BWA and processed using SAMtools, Picard, and GATK. Somatic single nucleotide variants will be identified using VarScan 2 and MuTect. Somatic insertions/deletions will be detected using VarScan 2 and Indelocator. Identified mutations will be annotated using COSMIC. Significantly mutated genes will be identified using MuSiC and MutSig. Logistic regression-based dominant, additive, and recessive models will be used to assess the association between each SNP locus and lung cancer. Fisher's exact tests and Cochran-Armitage tests will be considered as alternatives where necessary. The false discovery rate will be controlled using the Benjamini-Hochberg (B-H) procedure. In addition to generating initial genomic data from this pilot study, we will use this project and the resulting data to implement one of Markey's ongoing initiatives, funded by UK's High Performance Computing (HPC), to develop a pipeline for processing and analyzing high-throughput bioinformatics data. Finally, in addition to the bioinformatics methods described above, we will collaborate with bioinformatics faculty in the Division of Biomedical Informatics and CTSA's Biomedical Informatics Core to employ other novel and innovative analysis strategies.
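The B-H procedure mentioned above can be illustrated with a short sketch that converts raw p-values into adjusted values. This is a generic textbook version, not the project's code, which according to the text uses existing software plus C and R scripts; the p-values and the 0.05 cutoff are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return B-H adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that each adjusted
    # value is capped by the adjusted values of all larger p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-SNP association p-values.
p_values = [0.041, 0.001, 0.6, 0.039, 0.008]
adjusted = benjamini_hochberg(p_values)

# Indices of tests declared significant at a hypothetical 5% FDR.
significant = [i for i, q in enumerate(adjusted) if q <= 0.05]
print(significant)  # [1, 4]
```

Only the two smallest p-values survive adjustment in this example; the raw values 0.039 and 0.041 would pass an unadjusted 0.05 cutoff but not the FDR-controlled one.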

Investigators:

Susanne Arnold, MD; Chi Wang, PhD; Jinze Liu, PhD; Heidi Weiss, PhD

Students

Brian Davis
Derek Jones
Jeevith Bopaiah
Isaac J Hands, IT
Kyle Helfrich
Devin Willmott


Publications:

  1. Zhang X, Ellingson S. Computationally characterizing genomic pipelines using high-confident call sets. Procedia Computer Science. 2016;80:1023-32.
  2. Ellingson SR, and Fardo D. Automated quality control for genome wide association studies. F1000Research. 2016. (accepted)
  3. Zhang X, Ellingson SR. Computationally characterizing genomic pipelines and benchmarking results using GATK best practices on the high performance computing cluster at the University of Kentucky. BMC Bioinformatics. 2016 (accepted abstract)
  4. Sabbir AKM, and Ellingson SR. Side-Effect Term Matching for Computational Adverse Drug Predictions. BMC Bioinformatics. 2016 (accepted abstract)
  5. Ellingson SR. Accuracy and efficiency of sequence variation detection methods using high-confident variation calls. Journal of Clinical and Translational Science. 2016 (accepted abstract)
  6. Zhang X, Kucharski A, de Jong WA, Ellingson SR. Towards a better understanding of on and off target effects of the lymphocyte-specific kinase LCK for the development of novel and safer pharmaceuticals. Procedia Computer Science. 2017 Dec 31;108:1222-31.
  7. Jones D, Bopaiah J, Alghamedy F, Jacobs N, Weiss H, de Jong W, Ellingson SR. Polypharmacology Within the Full Kinome: a Machine Learning Approach. AMIA Informatics 2018. (Accepted).
  8. Alghamedy F, Bopaiah J, Jones D, Zhang X, Weiss H, Ellingson SR. Incorporating Protein Dynamics Through Ensemble Docking in Machine Learning Models to Predict Drug Binding. AMIA Informatics 2018. (Accepted)
  9. Liu J, Murali T, Yu T, Liu C, Sivakumaran TA, Moseley H, Zhulin I, Weiss H, Durbin R, Ellingson SR, Liu J, Huang B, Hallahan BJ, Horbinski C, Hodges K, Napier D, Bocklage T, Mueller J, Vanderford N, Fardo DW, Wang C, and Arnold S. Characterization of Squamous Cell Lung Cancers from Appalachian Kentucky. Cancer Epidemiology, Biomarkers & Prevention (2019) DOI: 10.1158/1055-9965.EPI-17-0984


Grants:

1. “University of Kentucky Markey Cancer Center – CCSG”, PI Evers, MD, Co-Investigator Ellingson, S, Institution/University: University of Kentucky Source of Funding: NIH/NCI, Total Award: $2,152,710

2. “Markey Women Strong”, Principal Investigator: Ellingson, PhD, University of Kentucky, Source of Funding: Markey Cancer Foundation, 04/2018 – Total Award: $50,000

3. “Novel Computational Drug Repurposing to Target Autophagy for Cancer Treatments”, Principal Investigator: Ellingson, PhD, University of Kentucky, Source of Funding: CCTS Drug Discovery and Development Pilot Award, 09/03/18 - Total Award: $49,823


Center for Computational Sciences