Publications

Ontology-Based Metabolomics Data Integration with Quality Control, BioAnalysis 2019

Metabolomics

Parallelization of Query Processing over Expressive Ontologies
E. Patrick Shironoshita, Da Zhang, Mansur R. Kabuka, Jia Xu. Proceedings of the 4th Annual International Symposium on Information Management and Big Data (SIMBig 2017). http://ceur-ws.org/Vol-2029/paper10.pdf
Efficient query answering over Description Logic (DL) ontologies with very large datasets is becoming increasingly vital. Recent years have seen the development of various approaches to ABox partitioning to enable parallel processing. Instance checking using the enhanced most specific concept (MSC) method is a particularly promising approach. The applicability of these distributed reasoning methods to typical ontologies has been shown mainly through anecdotal observation. In this paper, we present an analysis method that makes use of random graph theory to show that the enhanced MSC method results in very small, tractable concepts provided that the number of role assertions removed from consideration is large enough. We also present execution time and efficiency of a parallel implementation deployed over computing clusters of various sizes, showing the ability of the method to process instance checking for large scale datasets.

Web-based Ontology Alignment with the GeneTegra Alignment Tool
Nemanja Stojanovic, Ray M. Bradley, Sean Wilkinson, Mansur Kabuka, and E. Patrick Shironoshita. Proceedings of the 4th Annual International Symposium on Information Management and Big Data (SIMBig 2017). http://ceur-ws.org/Vol-2029/paper11.pdf
Ontologies are increasingly gaining practical usage for semantic data in various ways and across multiple domains. From this growing applicability arises an ever-greater need to manage large datasets, reduce analytical complexity and efficiently as well as accurately integrate different heterogeneous ontologies into or within existing systems, all while minimizing data corruption and maintaining existing semantics. In this paper, we present the GeneTegra Alignment Tool (GT-Align), a practical implementation of the ASMOV ontology alignment algorithm within a Web-based interface, focusing on biomedical data and using Unified Medical Language System (UMLS) for the background knowledge. GT-Align allows iterative alignment of multiple ontologies as well as active user involvement throughout the process.

Module Extraction for Efficient Object Queries over Ontologies with Large ABoxes
Jia Xu, Patrick Shironoshita, Ubbo Visser, Nigel John, Mansur Kabuka. Artif Intell Appl. 2015 Feb;2(1):8-31. doi: 10.15764/AIA.2015.01002.
The extraction of logically-independent fragments out of an ontology ABox can be useful for solving the tractability problem of querying ontologies with large ABoxes. In this paper, we propose a formal definition of an ABox module, such that it guarantees complete preservation of facts about a given set of individuals, and thus can be reasoned independently w.r.t. the ontology TBox. With ABox modules of this type, isolated or distributed (parallel) ABox reasoning becomes feasible, and more efficient data retrieval from ontology ABoxes can be attained. To compute such an ABox module, we present a theoretical approach and also an approximation for SHIQ ontologies. Evaluation of the module approximation on different types of ontologies shows that, on average, extracted ABox modules are significantly smaller than the entire ABox, and the time for ontology reasoning based on ABox modules can be improved significantly.

Converting Instance Checking to Subsumption: A Rethink for Object Queries over Practical Ontologies
Jia Xu, Patrick Shironoshita, Ubbo Visser, Nigel John, Mansur Kabuka. Int J Intell Sci. 2015 Jan;5(1):44-62. doi: 10.4236/ijis.2015.51005.
Efficiently querying Description Logic (DL) ontologies is becoming a vital task in various data-intensive DL applications. Considered as a basic service for answering object queries over DL ontologies, instance checking can be realized by using the most specific concept (MSC) method, which converts instance checking into subsumption problems. This method, however, loses its simplicity and efficiency when applied to large and complex ontologies, as it tends to generate very large MSCs that could lead to intractable reasoning. In this paper, we propose a revision to this MSC method for DL, allowing it to generate much simpler and smaller concepts that are specific enough to answer a given query. With independence between computed MSCs, scalability for query answering can also be achieved by distributing and parallelizing the computations. An empirical evaluation shows the efficacy of our revised MSC method and the significant efficiency achieved when using it for answering object queries.

Optimizing the Most Specific Concept Method for Efficient Instance Checking
Jia Xu, Patrick Shironoshita, Ubbo Visser, Nigel John, Mansur Kabuka. Proc Int World Wide Web Conf. 2014;2014:405-406. doi: 10.1145/2567948.2577294.
Instance checking is considered a central tool for data retrieval from description logic (DL) ontologies. In this paper, we propose a revised most specific concept (MSC) method for DL SHI, which converts instance checking into subsumption problems. This revised method can generate small concepts that are specific enough to answer a given query, and allows reasoning to explore only a subset of the ABox data to achieve efficiency. Experiments show the effectiveness of our proposed method in terms of concept size reduction and the improvement in reasoning efficiency.

The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS)
Menze, Bjoern, Andras Jakab, Stefan Bauer, Jayashree Kalpathy-Cramer, Keyvan Farahani, Justin Kirby, Yuliya Burren, et al. “The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS).” IEEE Transactions on Medical Imaging (January 27, 2014). http://hal.inria.fr/hal-00935640. doi:10.1109/TMI.2014.2377694.
In this paper we report the set-up and results of the Multimodal Brain Tumor Image Segmentation Benchmark (BRATS) organized in conjunction with the MICCAI 2012 and 2013 conferences. Twenty state-of-the-art tumor segmentation algorithms were applied to a set of 65 multi-contrast MR scans of low- and high-grade glioma patients (manually annotated by up to four raters) and to 65 comparable scans generated using tumor image simulation software. Quantitative evaluations revealed considerable disagreement between the human raters in segmenting various tumor sub-regions (Dice scores in the range 74%-85%), illustrating the difficulty of this task. We found that different algorithms worked best for different sub-regions (reaching performance comparable to human inter-rater variability), but that no single algorithm ranked in the top for all sub-regions simultaneously. Fusing several good algorithms using a hierarchical majority vote yielded segmentations that consistently ranked above all individual algorithms, indicating remaining opportunities for further methodological improvements. The BRATS image data and manual annotations continue to be publicly available through an online evaluation system as an ongoing benchmarking resource.
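For readers unfamiliar with the metrics named in this abstract, the following is a minimal sketch, not the benchmark's evaluation code. It assumes each segmentation is given as a set of voxel identifiers, and it shows a flat per-voxel majority vote rather than the hierarchical scheme the paper describes. The function names `dice_score` and `majority_vote` are illustrative.

```python
from collections import Counter


def dice_score(pred, truth):
    """Dice overlap 2|A∩B| / (|A|+|B|) between two sets of voxel ids."""
    pred, truth = set(pred), set(truth)
    if not pred and not truth:
        # Convention assumed here: two empty segmentations agree perfectly.
        return 1.0
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def majority_vote(masks):
    """Flat majority vote: keep voxels marked by more than half of the masks.

    This is a simplification; the hierarchical fusion in the paper is not
    reproduced here.
    """
    counts = Counter(v for mask in masks for v in set(mask))
    return {v for v, c in counts.items() if c > len(masks) / 2}
```

For example, `dice_score({1, 2, 3}, {2, 3, 4})` compares two three-voxel masks sharing two voxels, and `majority_vote([{1, 2}, {2, 3}, {2, 3, 4}])` keeps only the voxels selected by at least two of the three masks.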

A Grouping Artificial Immune Network for Segmentation of Tumor Images
Buendia, Patricia, Thomas Taylor, Michael Ryan, and Nigel John. “A Grouping Artificial Immune Network for Segmentation of Tumor Images.” Proceedings of NCI-MICCAI BRATS 2013 (September 22, 2013).
GAIN+ is an enhanced version of the original Grouping Artificial Immune Network that was developed for fully automated MRI brain segmentation. The model captures the main concepts by which the immune system recognizes pathogens and models the process in a numerical form. GAIN+ was adapted to support a variable number of input patterns for training and segmentation of tumors in MRI brain images and adapted to train on multiple images. The model was demonstrated to operate with multi-spectral MR data with an increase in accuracy compared to the single spectrum case. Using the BRATS High Grade 2013 dataset with the 2012 tissue labels for Edema and Tumor, the model’s Dice scores were compared to published results and proved to be as accurate as the best methods. Using the 4 labels from the BRATS 2013 data sets, a Dice overlap of 73% for the complete tumor region and 64% for the enhancing tumor region were obtained for the high grade BRATS images when applying pre- and post-processing. This was attained with speed optimizations allowing segmentation at 21s per case with post-processing of all 4 tissues.

Map-Reduce Enabled Hidden Markov Models for High Throughput Multimodal Brain Tumor Segmentation
Taylor, Thomas, Nigel John, Patricia Buendia, and Michael Ryan. “Map-Reduce Enabled Hidden Markov Models for High Throughput Multimodal Brain Tumor Segmentation.” Proc. NCI-MICCAI BRATS 2013 (Sept 22, 2013).
We have developed a novel extension to Hidden Markov Models (HMMs) to enable high-throughput training and segmentation of tumors and edema in multimodal magnetic resonance images of the brain. Our method has been evaluated on the two-label BRATS2013 training dataset for both simulated and real patient high-grade glioma cases. We achieve a mean accuracy (Dice score) of [66.7]% for edema and [89.2]% for tumor in the simulated cases and [59.5]% for edema and [65.6]% for tumor in the real cases. The Map-Reduce enabled HMM is able to train on all cases simultaneously, performing 220% faster on an 8-node cluster than on a single node. Segmentation of a single patient case takes less than one minute.

Accelerating cancer systems biology research through Semantic Web technology
Wang, Z., Sagotsky, J., Taylor, T., Shironoshita, P., and Deisboeck, T. S. (2013). Accelerating cancer systems biology research through Semantic Web technology. Wiley Interdisciplinary Reviews: Systems Biology and Medicine, 5(2), 135-151. doi:10.1002/wsbm.1200
Cancer systems biology is an interdisciplinary, rapidly expanding research field in which collaborations are a critical means to advance the field. Yet the prevalent database technologies often isolate data rather than making it easily accessible. The Semantic Web has the potential to help facilitate web-based collaborative cancer research by presenting data in a manner that is self-descriptive, human and machine readable, and easily sharable. We have created a semantically linked online Digital Model Repository (DMR) for storing, managing, executing, annotating, and sharing computational cancer models. Within the DMR, distributed, multidisciplinary, and inter-organizational teams can collaborate on projects, without forfeiting intellectual property. This is achieved by the introduction of a new stakeholder to the collaboration workflow, the institutional licensing officer, part of the Technology Transfer Office. Furthermore, the DMR has achieved silver level compatibility with the National Cancer Institute’s caBIG, so users can interact with the DMR not only through a web browser but also through a semantically annotated and secure web service. We also discuss the technology behind the DMR leveraging the Semantic Web, ontologies, and grid computing to provide secure inter-institutional collaboration on cancer modeling projects, online grid-based execution of shared models, and the collaboration workflow protecting researchers’ intellectual property.

Identification of conserved splicing motifs in mutually exclusive exons of 15 insect species
Buendia, P., Tyree, J., Loredo, R., and Hsu, S. (2012): Identification of conserved splicing motifs in mutually exclusive exons of 15 insect species. BMC Genomics, 13:S1
Background: During alternative splicing, the inclusion of an exon in the final mRNA molecule is determined by nuclear proteins that bind cis-regulatory sequences in a target pre-mRNA molecule. A recent study suggested that the regulatory codes of individual RNA-binding proteins may be nearly immutable between very diverse species such as mammals and insects. The model system Drosophila melanogaster therefore presents an excellent opportunity for the study of alternative splicing due to the availability of quality EST annotations in FlyBase.
Methods: In this paper, we describe an in silico analysis pipeline to extract putative exonic splicing regulatory sequences from a multiple alignment of 15 species of insects. Our method, ESTs-to-ESRs (E2E), uses graph analysis of EST splicing graphs to identify mutually exclusive (ME) exons and combines phylogenetic measures, a sliding window approach along the multiple alignment and the Welch’s t statistic to extract conserved ESR motifs.
Results: The most frequent 100% conserved word of length 5 bp in different insect exons was “ATGGA”. We identified 799 statistically significant “spike” hexamers, 218 motifs with either a left or right FDR corrected spike magnitude p-value < 0.05 and 83 with both left and right uncorrected p < 0.01. 11 genes were identified with highly significant motifs in one ME exon but not in the other, suggesting regulation of ME exon splicing through these highly conserved hexamers. The majority of these genes have been shown to have regulated spatiotemporal expression. 10 elements were found to match three mammalian splicing regulator databases. A putative ESR motif, GATGCAG, was identified in the ME-13b but not in the ME-13a of Drosophila N-Cadherin, a gene that has been shown to have a distinct spatiotemporal expression pattern of spliced isoforms in a recent study.
Conclusions: Analysis of phylogenetic relationships and variability of sequence conservation as implemented in the E2E spikes method may lead to improved identification of ESRs. We found that approximately half of the putative ESRs in common between insects and mammals have a high statistical support (p < 0.01). Several Drosophila genes with spatiotemporal expression patterns were identified to contain putative ESRs located in one exon of the ME exon pairs but not in the other.
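The Methods above mention Welch's t statistic. As a minimal sketch of that statistic only, assuming two independent samples of numeric scores (how E2E forms those samples from the sliding window is not specified here and is not reproduced), it can be computed as follows. The function name `welch_t` is illustrative.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on samples a and b.

    t  = (mean_a - mean_b) / sqrt(s_a^2/n_a + s_b^2/n_b)
    df = Welch-Satterthwaite approximation of the degrees of freedom.
    Each sample needs at least two values, and the two samples must not
    both have zero variance.
    """
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # squared standard errors
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```

A p-value would then come from the Student t distribution with `df` degrees of freedom, which is left out here to keep the sketch dependency-free.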

Cancer Data Integration and Querying with GeneTegra
E. Shironoshita, Y. Jean-Mary, R. Bradley, P. Buendia, and M. Kabuka, “Cancer Data Integration and Querying with GeneTegra,” presented at the Data Integration for the Life Sciences (DILS) 2012, College Park, MD, USA.
We present the GeneTegra system, an ontology-based information integration environment. We show its ability to query multiple data sources, and we evaluate the relative performance of different data repositories. GeneTegra uses Semantic Web standards to resolve the semantic and syntactic diversity of the large and increasingly complex body of publicly available data. GeneTegra contains mechanisms to create ontology models of data sources using the OWL 2 Web Ontology Language, and to define, plan, and execute queries against these models using the SPARQL query language. Data source formats supported include relational databases and XML and RDF data sources. Experimental results have been obtained to show that GeneTegra obtains equivalent results from different data repositories containing the same data, illustrating the ability of the methods proposed in querying heterogeneous sources using the same modeling paradigm.

ASMOV: Results for OAEI 2010
Yves R. Jean-Mary, E. Patrick Shironoshita, Mansur R. Kabuka: “ASMOV: Results for OAEI 2010,” 5th International Workshop on Ontology Matching (OM 2010), Shanghai, China.
The Automated Semantic Mapping of Ontologies with Validation (ASMOV) algorithm for ontology alignment has consistently been one of the top performing algorithms in the Ontology Alignment Evaluation Initiative (OAEI) contests. In this paper, we present a brief overview of the algorithm and its improvements, followed by an analysis of its results on the 2010 OAEI tests.

Grid-based cancer model simulation with CViT’s Computational Model Execution Framework
T. J. Taylor, R. M. Bradley, J. Sagotsky, E. P. Shironoshita, Z. Wang, P. Vazquez, T. S. Deisboeck, M.R. Kabuka, “Grid-based cancer model simulation with CViT’s Computational Model Execution Framework”, caBIG© 2010 Annual Meeting

ASMOV: Results for OAEI 2009
Yves R. Jean-Mary, E. Patrick Shironoshita, Mansur R. Kabuka: “ASMOV: Results for OAEI 2009,” 4th International Workshop on Ontology Matching (OM 2009), Chantilly, VA, USA.
The Automated Semantic Mapping of Ontologies with Validation (ASMOV) algorithm for ontology alignment was one of the top performing algorithms in the 2007 and 2008 Ontology Alignment Evaluation Initiative (OAEI) contests. In this paper, we present a brief overview of the algorithm and its improvements, followed by an analysis of its results on the 2009 OAEI tests.

Ontology Matching with Semantic Verification
Yves R. Jean-Mary, E. Patrick Shironoshita, Mansur R. Kabuka: “Ontology matching with semantic verification,” Journal of Web Semantics, vol. 7, no. 3, pp 235-251, September 2009, doi:10.1016/j.websem.2009.04.001.
ASMOV (Automated Semantic Matching of Ontologies with Verification) is a novel algorithm that uses lexical and structural characteristics of two ontologies to iteratively calculate a similarity measure between them, derives an alignment, and then verifies it to ensure that it does not contain semantic inconsistencies. In this paper, we describe the ASMOV algorithm, and then present experimental results that measure its accuracy using the OAEI 2008 tests, and that evaluate its use with two different thesauri: WordNet, and the Unified Medical Language System (UMLS). These results show the increased accuracy obtained by combining lexical, structural and extensional matchers with semantic verification, and demonstrate the advantage of using a domain-specific thesaurus for the alignment of specialized ontologies.

semQA: SPARQL with Idempotent Disjunction
E. Patrick Shironoshita, Yves R. Jean-Mary, Ray M. Bradley, Mansur R. Kabuka, “semQA: SPARQL with Idempotent Disjunction,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 3, pp. 401-414, March 2009, doi:10.1109/TKDE.2008.91.
The SPARQL LeftJoin abstract operator is not distributive over Union; this limits the algebraic manipulation of graph patterns, which in turn restricts the ability to create query plans for distributed processing or query optimization. In this paper, we present semQA, an algebraic extension for the SPARQL query language for RDF, which overcomes this issue by transforming graph patterns through the use of an idempotent disjunction operator Or as a substitute for Union. This permits the application of a set of equivalences that transform a query into distinct forms. We further present an algorithm to derive the solution set of the original query from the solution set of a query where Union has been substituted by Or. We also analyze the combined complexity of SPARQL, proving it to be NP-complete. It is also shown that the SPARQL query language is not, in the general case, fixed-parameter tractable. Experimental results are presented to validate the query evaluation methodology presented in this paper against the SPARQL standard to corroborate the complexity analysis and to illustrate the gains in processing cost reduction that can be obtained through the application of semQA.

semCDI: a query formulation for semantic data integration in caBIG
E. Patrick Shironoshita, Yves R. Jean-Mary, Ray M. Bradley and Mansur R. Kabuka: “semCDI: a query formulation for semantic data integration in caBIG”. Journal of the American Medical Informatics Association (JAMIA), vol. 15, no. 4, pp 559-568, July-August 2008, doi:10.1197/jamia.M2732.
Objectives: To develop mechanisms to formulate queries over the semantic representation of cancer-related data services available through the cancer Biomedical Informatics Grid (caBIG).
Design: The semCDI query formulation uses a view of caBIG semantic concepts, metadata, and data as an ontology, and defines a methodology to specify queries using the SPARQL query language, extended with Horn rules. semCDI enables the joining of data that represent different concepts through associations modeled as object properties, and the merging of data representing the same concept in different sources through Common Data Elements (CDE) modeled as datatype properties, using Horn rules to specify additional semantics indicating conditions for merging data.
Validation: In order to validate this formulation, a prototype has been constructed, and two queries have been executed against currently available caBIG data services.
Discussion: The semCDI query formulation uses the rich semantic metadata available in caBIG to build queries and integrate data from multiple sources. Its promise will be further enhanced as more data services are registered in caBIG, and as more linkages can be achieved between the knowledge contained within caBIG’s NCI Thesaurus and the data contained in the Data Services.
Conclusion: semCDI provides a formulation for the creation of queries on the semantic representation of caBIG. This constitutes the foundation to build a semantic data integration system for more efficient and effective querying and exploratory searching of cancer-related data.

Semantic Representation and Querying of caBIG Data Services
E. Patrick Shironoshita, Ray M. Bradley, Yves R. Jean-Mary, Thomas J. Taylor, Michael T. Ryan, Mansur R. Kabuka: “Semantic Representation and Querying of caBIG Data Services,” Data Integration in the Life Sciences, Springer Berlin/Heidelberg, June 2008, pp 108-115, doi:10.1007/978-3-540-69828-9_10.
A computational grid infrastructure for biomedical research, called caGrid, is under development by the National Cancer Institute (NCI) as part of the cancer Biomedical Informatics Grid (caBIG) Initiative. In this paper we present a model that enables users to query an integrated view of caBIG data services at a conceptual semantic level. The model is based on semCDI, a formulation to generate an ontology view of caBIG semantics and pose queries against this view using the SPARQL query language complemented with Horn rules. We present here a mechanism to process these queries algebraically using our semQA query algebra extension for SPARQL, in order to create sub-expressions for each data service. We then show how resulting graphs from these sub-expressions are then merged using Horn rules.

ASMOV: Results for OAEI 2008
Yves R. Jean-Mary, Mansur R. Kabuka: “ASMOV: Results for OAEI 2008,” 3rd International Workshop on Ontology Matching (OM 2008), Karlsruhe, Germany.
The Automated Semantic Mapping of Ontologies with Validation (ASMOV) algorithm for ontology alignment was one of the top performing algorithms in the 2007 Ontology Alignment Evaluation Initiative (OAEI). In this paper, we present a brief overview of the algorithm and its improvements, followed by an analysis of its results on the 2008 OAEI tests.

Cardinality estimation for the optimization of queries on ontologies
E. Patrick Shironoshita, Michael T. Ryan, Mansur R. Kabuka: “Cardinality estimation for the optimization of queries on ontologies,” ACM SIGMOD Record, vol. 36, no. 2, pp 13-18, June 2007, doi:10.1145/1328854.1328856.
An effective, accurate algorithm for cardinality estimation of queries on ontology models of data is presented. The algorithm relies on the decomposition of queries into query pattern paths, where each path produces a set of values for each variable within the result form of the query. In order to estimate the total number of result set parameters for each path, a set of statistics is compiled on the properties of the ontology. Experimental analysis has shown that the algorithm produces estimates with high accuracy and with high correlation to actual values. Thus, this algorithm can be used as the cornerstone of an effective optimization strategy for queries on diverse, heterogeneous data sources modeled as ontologies.

ASMOV: Ontology Alignment with Semantic Validation
Yves R. Jean-Mary, Mansur R. Kabuka. “ASMOV: Ontology Alignment with Semantic Validation”. Joint SWDB-ODBIS Workshop, September 2007, Vienna, Austria, 15-20.
Numerous ontology alignment algorithms have appeared in the literature in recent years, but only a few make use of the semantics enclosed within the ontologies in order to improve the accuracy. In this paper, we present ASMOV (Automated Semantic Mapping of Ontologies with Validation), a novel algorithm that expands upon the ideas presented by previous systems, while incorporating a semantic validation process. This process acts as a reasoner over the resulting alignment, ensuring that no inconsistencies have been introduced by the system, which increases the accuracy of the system. An implementation of ASMOV is compared to other systems using the benchmark tests of the Ontology Alignment Evaluation Initiative, a consensus for evaluation of ontology alignment systems.

Ontology Alignment with Semantic Validation: A Comparison using UMLS Metathesaurus and WordNet
Yves R. Jean-Mary, E. Patrick Shironoshita, Thomas J. Taylor, Michael T. Ryan, Ray M. Bradley, Mansur R. Kabuka: “Ontology Alignment with Semantic Validation: A Comparison using UMLS Metathesaurus and WordNet”. IEEE 7th International Symposium on Bioinformatics and Bioengineering, 2007.

ASMOV: Results for OAEI 2007
Yves R. Jean-Mary, Mansur R. Kabuka: “ASMOV: Results for OAEI 2007,” 2nd International Workshop on Ontology Matching (OM 2007), Busan, Korea.
Numerous ontology alignment algorithms have appeared in the literature in recent years, but only a few make use of the semantics enclosed within the ontologies in order to improve the accuracy. In this paper, we present the Automated Semantic Mapping of Ontologies with Validation (ASMOV) algorithm for ontology alignment. We first provide a brief overview of the algorithm followed by an analysis of its results on the 2007 Ontology Alignment Evaluation Initiative tests. We conclude the paper by identifying the specific strengths and weaknesses of ASMOV, while pointing out the necessary improvements that need to be made.

Viability of Mental Health Assessment Software in Diverse Settings
Thomas J. Taylor, Mansur R. Kabuka, E. Patrick Shironoshita, Michael T. Ryan, Akmal A. Younis, Nigel M. John, John R. McQuaid, M. H. Trivedi, G. W. Currier, R. A. McKinney, B. D. Grannemann, C. Claassen: “Viability of Mental Health Assessment Software in Diverse Settings”. 45th Annual NCDEU (New Clinical Drug Evaluation Unit), Boca Raton, Florida, June 6-9, 2005.