Internet search engines like Google allow us to search and parse the collective knowledge of the world: they anticipate the user’s questions, remember preferences, and deliver information quickly and clearly. Why can’t researchers trying to discover the most effective disease treatment apply the same analytical power to the knowledge discovery challenges of their work?
To a team of biologists, computer scientists, and bioinformaticians from the University of Illinois at Urbana-Champaign and Mayo Clinic, this question sounds like an exciting opportunity. Supported by one of the first awards from a recently established National Institutes of Health initiative, the group has begun work on a tool, the Knowledge Engine for Genomics (KnowEnG), that interprets new results by leveraging community knowledge of how genes interact with each other.
Converting Big Data to better knowledge
The Illinois-Mayo effort aims to address one component of a broad and pressing issue faced by the biomedical research field. Rapid advances in laboratory technology have made it possible for investigators to collect once unheard-of amounts of data in a single experiment, whether those data are high-resolution brain images, interactions within a social network, or expression levels of thousands of genes.
The problem with such large-scale experiments is that the resulting data sets—termed “Big Data”—are only valuable if they can be translated into more knowledge, and the tools used to handle all this information haven’t improved quickly enough.
To address this issue, the National Institutes of Health created the Big Data to Knowledge (BD2K) initiative in 2012. The goals of the initiative are to promote the development of better methods to manage very large data sets, develop better analytical tools, train researchers in the use of those tools, and shift the culture of the scientific community to make these and related activities more successful.
As part of the first wave of BD2K funding announced in October 2014, the University of Illinois at Urbana-Champaign and Mayo Clinic received a $9.34M, 4-year award to create one of 11 new Centers of Excellence for Big Data Computing. The Illinois-Mayo Center focuses on a specific class of Big Data: the rapidly growing body of genomic and transcriptomic data produced by genome-wide, high-throughput experimental technologies.
Computer scientist and IGB affiliate Jiawei Han is the Center’s Program Director. Other Principal Investigators are computer scientist and IGB member Saurabh Sinha; physicist, bioengineer and IGB member Jun Song; and Richard Weinshilboum, M.D., interim director of the Mayo Clinic Center for Individualized Medicine and director of the center’s Pharmacogenomics Translational Program. IGB and NCSA Director of Bioinformatics and Director of the High-Performance Biological Computing Group C. Victor Jongeneel is Executive Director.
“By integrating multiple analytical methods derived from the most advanced data mining and machine learning research, KnowEnG will transform the way biomedical researchers analyze their genome-wide data,” said Program Director Jiawei Han, describing the software tool under development. “The Center will leverage the latest computational techniques used to mine corporate or Internet data to enable the intuitive analysis and exploration of biomedical Big Data.”
The Center combines the expertise of many units across the U of I campus, including the Carl R. Woese Institute for Genomic Biology (IGB), the Department of Computer Science, the Coordinated Science Laboratory, the College of Engineering, and the National Center for Supercomputing Applications (NCSA). As a leader in biomedical research and structured data collection, Mayo Clinic plays a vital role in the design, testing, and refinement of the tool. The Breast Cancer Genome-Guided Therapy (BEAUTY) study at Mayo Clinic will be the first to benefit from the KnowEnG technology.
KnowEnG: a tool to harness community knowledge
Results of biomedical genomic studies often come in the form of a list of genes—genes that differ in sequence or activity in healthy and diseased individuals, for example. Researchers would like to translate that list of genes into a better understanding of how disease works: How does a particular disease compare to other diseases at a cellular level? Are there specific functions inside the cell that are most affected? How are they affected? This type of knowledge could help predict disease risk, or lead to new ideas for treatment.
Traditional tools used to interpret genomic data have relied on comparing a researcher’s gene list with a database containing a specific type of information about genes. It is up to the researcher to identify which comparisons will be helpful and to integrate the outputs of many different analytical tools into a coherent interpretation.
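To make that kind of comparison concrete, here is a minimal sketch of one common way to score the overlap between a gene list and an annotated gene set: a hypergeometric over-representation test. The article does not say which statistic any particular tool uses, so the choice of test is an assumption, and the function name `overlap_p_value` and all gene identifiers are hypothetical placeholders.

```python
from math import comb


def overlap_p_value(gene_list, annotated_set, background):
    """Chance of seeing at least the observed overlap under random sampling.

    Hypergeometric upper tail: draw len(gene_list) genes at random from the
    background and count how many fall inside annotated_set.
    """
    background = set(background)
    # Only genes that were actually measured can contribute to the overlap.
    gene_list = set(gene_list) & background
    annotated_set = set(annotated_set) & background

    total = len(background)
    annotated = len(annotated_set)
    drawn = len(gene_list)
    observed = len(gene_list & annotated_set)

    upper = min(annotated, drawn)
    tail = sum(
        comb(annotated, i) * comb(total - annotated, drawn - i)
        for i in range(observed, upper + 1)
    )
    return observed, tail / comb(total, drawn)


# Hypothetical placeholder data, not real genes or annotations.
background = [f"GENE_{i}" for i in range(1, 21)]  # 20 genes measured
annotated_set = {"GENE_1", "GENE_2", "GENE_3", "GENE_4", "GENE_5"}
gene_list = {"GENE_1", "GENE_2", "GENE_3", "GENE_10"}

observed, p_value = overlap_p_value(gene_list, annotated_set, background)
print(f"overlap={observed}, p={p_value:.4f}")
```

Each such test answers one narrow question about one annotation source, which is why the researcher is left to choose the comparisons and reconcile the results.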
Newer tools aggregate different types of gene-related information from multiple sources, but the difficult task of synthesizing how these relate to a specific list of genes is still up to the researcher. When completed, KnowEnG will be unique in its integration of many disparate sources of gene-related data into one enormous network, a comprehensive guide against which a researcher’s specific results can then be compared.
“We’d like to take community knowledge and datasets in this richer representation, this format of a network of gene-gene relationships, gene-gene functional relationships, protein-protein relationships and so on, and allow the user to do their analysis in the context of that community knowledge,” said Sinha, who leads the research arm of the project. The team is also designing KnowEnG to accommodate future growth in size and scope of the network, as the scientific community continues to learn about the relationships among genes.
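As a rough illustration of the network idea, the sketch below merges several relationship types into one labeled gene network and then asks which genes outside a researcher’s list are connected to the most genes on it. This is not KnowEnG’s method, which the article does not describe; the function names, relationship types, scoring rule, and gene identifiers are all hypothetical.

```python
from collections import defaultdict


def build_network(sources):
    """Merge per-source edge lists into one undirected, labeled network.

    sources maps a relationship type to a list of (gene_a, gene_b) pairs.
    Returns {gene: {neighbor: set of relationship types}}.
    """
    network = defaultdict(lambda: defaultdict(set))
    for relation, edges in sources.items():
        for gene_a, gene_b in edges:
            network[gene_a][gene_b].add(relation)
            network[gene_b][gene_a].add(relation)
    return network


def rank_neighbors(network, gene_list):
    """Rank genes outside the list by how many list genes they touch."""
    query = set(gene_list)
    counts = defaultdict(int)
    for gene in query:
        for neighbor in network.get(gene, {}):
            if neighbor not in query:
                counts[neighbor] += 1
    # Highest count first; ties broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical placeholder relationships, not real biological data.
sources = {
    "protein_interaction": [("GENE_A", "GENE_B"), ("GENE_B", "GENE_C")],
    "co_expression": [("GENE_A", "GENE_C"), ("GENE_C", "GENE_D")],
    "shared_pathway": [("GENE_A", "GENE_B"), ("GENE_D", "GENE_E")],
}

network = build_network(sources)
ranked = rank_neighbors(network, ["GENE_A", "GENE_D"])
print(ranked)
```

Because every source feeds the same structure, a new relationship type can be added as one more entry in `sources` without changing the analysis code, which loosely mirrors the goal of accommodating a growing network.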
In addition to developing KnowEnG, the Center is building a training framework that empowers researchers to use the new tool and engage in bioinformatics research, regardless of their prior computational knowledge. The Center has also begun participating in a nationwide consortium, composed of all the BD2K Centers of Excellence established by the NIH initiative, to exchange insights, contribute to standards for tool development, and help set broad goals for the future of work on Big Data.
“Ideally, undergraduates would be trained in both biology and computer science” before engaging in biomedical genomics research, said Song, who leads the training and community activities of the Center. Because most of today’s biomedical researchers did not have access to extensive formal training in computation, Song explained, the Center’s training resources will be carefully designed to build users’ understanding of computational questions in a visual, intuitive way.
Strengthening interdisciplinary connections
The Center relies on communication between interface design experts at Illinois and biomedical researchers at Mayo Clinic, who represent KnowEnG’s intended users. Feedback between these groups is meant to ensure that the completed tool will be valuable, intuitive, and customizable for use in a broad array of experimental contexts.
“A major challenge is to understand the language and culture of each group so that we can communicate effectively, and make the tools that are developed at Illinois accessible to the biomedical audience,” said Weinshilboum, who oversees the evaluation of KnowEnG’s functionality. “The biomedical staff will communicate back to Illinois about what is helpful in terms of advancing their research and their understanding.”
Mayo researchers will test the mettle of initial versions of KnowEnG, employing it in two large-scale investigations of the genomics of cancer treatment. KnowEnG’s success in constructing functional conclusions from patients’ genetic background, gene expression, responses to treatment, and many other measures will provide important benchmarks of performance during its development.
“All the institutional signals are agreeing; the computer scientists are really excited about doing this, and the biologists are also behind it,” said Jongeneel, reflecting on the culture of collaboration that has made the Center’s conception possible. “We have a fantastic partnership with Mayo.”
Analysis of several biological experiments at Illinois will also be used to gauge performance. Cell and developmental biologist Lisa Stubbs, along with Sinha and IGB Director Gene Robinson, will use KnowEnG in a project investigating the relationship between gene regulation and social behavior in animal models and humans. Stubbs is also the leader of the Gene Networks of Developmental and Neural Plasticity Research Theme at the IGB, which is the official host of the Center.
Microbiologist Bill Metcalf, leader of the IGB’s Mining Microbial Genomes Research Theme, will work with Sinha to improve the ability to draw relationships between an organism’s genome sequence and its physical characteristics, providing a major plank of the evaluation strategy for the KnowEnG system.
The Center also represents another step forward for Illinois’ CompGen Initiative, a campus effort led by the Coordinated Science Laboratory and the IGB whose goal is to forge new connections between expertise in information technology and biological Big Data.
“Receiving this NIH BD2K Center of Excellence award from NIH is another feather in the cap of the CompGen Initiative,” said IGB Director Gene Robinson. “CompGen has enabled over 50 computer scientists, computer engineers, bioinformaticians and genomic biologists to come together and forge the close collaborative relationships necessary to spark the brilliant ideas that animate the proposal.”
Altogether, the project promises both formidable intellectual challenges and the possibility of great advances in genomics and Big Data.
“There’s a lot to do, and obvious challenges to overcome, and we’re looking forward to those challenges,” said Sinha. “What I’m most excited about is the actual possibility that this could be a tool which everybody uses in the world.”