
Monday, 2 March 2015

Large-Scale Machine Learning for Drug Discovery



Discovering new treatments for human diseases is an immensely complicated challenge: even after extensive research to develop a biological understanding of a disease, an effective therapeutic that can improve the quality of life must still be found. This process often takes years of research, requiring the creation and testing of millions of drug-like compounds in an effort to find just a few viable drug treatment candidates. These high-throughput screens are often automated in sophisticated labs and are expensive to perform.

Recently, deep learning with neural networks has been applied in virtual drug screening [1, 2, 3], which attempts to replace or augment the high-throughput screening process with the use of computational methods in order to improve its speed and success rate [4]. Traditionally, virtual drug screening has used only the experimental data from the particular disease being studied. However, as the volume of experimental drug screening data across many diseases continues to grow, several research groups have demonstrated that data from multiple diseases can be leveraged with multitask neural networks to improve the virtual screening effectiveness.
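To make the multitask idea concrete, here is a minimal sketch of such a network: a trunk of hidden layers shared across all diseases, with a separate output head per screening task. It is written in PyTorch with made-up layer sizes, task count, and featurization; it illustrates the general technique, not the architecture used in the paper.

```python
# Minimal multitask-network sketch for virtual screening (illustrative only:
# layer sizes, task count, and fingerprint features are assumptions, not the paper's setup).
import torch
import torch.nn as nn

class MultitaskScreeningNet(nn.Module):
    def __init__(self, n_features=1024, n_tasks=200, hidden=512):
        super().__init__()
        # Shared trunk: learns a compound representation used by every task
        self.shared = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One "active / inactive" output head per assay (task)
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_tasks)])

    def forward(self, x):
        h = self.shared(x)
        # One logit per task; compounds with no label for a task can be masked out of the loss
        return torch.cat([head(h) for head in self.heads], dim=1)

model = MultitaskScreeningNet()
fingerprints = torch.rand(32, 1024)   # a batch of 32 hypothetical compound fingerprints
logits = model(fingerprints)          # shape (32, 200): one activity score per task
```

Because the trunk is shared, every task's data helps shape the representation, which is the mechanism by which additional diseases can improve predictions for any single one.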

In collaboration with the Pande Lab at Stanford University, we’ve released a paper titled "Massively Multitask Networks for Drug Discovery", investigating how data from a variety of sources can be used to more accurately determine which chemical compounds would be effective drug treatments for many different diseases. In particular, we carefully quantified how the amount and diversity of screening data from diseases with very different biological processes can be used to improve the virtual drug screening predictions.

Using our large-scale neural network training system, we trained at a scale 18x larger than previous work with a total of 37.8M data points across more than 200 distinct biological processes. Because of our large scale, we were able to carefully probe the sensitivity of these models to a variety of changes in model structure and input data. In the paper, we examine not just the performance of the model but why it performs well and what we can expect for similar models in the future. The data in the paper represents more than 50M total CPU hours.
This graph shows a measure of prediction accuracy (ROC AUC, the area under the receiver operating characteristic curve) for virtual screening on a fixed set of 10 biological processes as more datasets are added.
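For readers less familiar with the metric, the toy snippet below shows how a per-task ROC AUC is computed with standard tooling; the labels and scores are made up and have nothing to do with the datasets in the paper.

```python
# Toy per-task ROC AUC computation (made-up labels and scores, for illustration only).
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                     # 1 = compound active in this assay
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.05, 0.9]  # model's predicted activity scores
print(roc_auc_score(y_true, y_score))                 # 1.0 = perfect ranking, 0.5 = chance
```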

One encouraging conclusion from this work is that our models are able to utilize data from many different experiments to increase prediction accuracy across many diseases. To our knowledge, this is the first time the effect of adding additional data has been quantified in this domain, and our results suggest that even more data could improve performance even further.

Machine learning at scale has significant potential to accelerate drug discovery and improve human health. We look forward to continued improvement in virtual drug screening and its increasing impact in the discovery process for future drugs.

Thank you to our other collaborators David Konerding (Google), Steven Kearnes (Stanford), and Vijay Pande (Stanford).

References:

1. Thomas Unterthiner, Andreas Mayr, Günter Klambauer, Marvin Steijaert, Jörg Kurt Wegner, Hugo Ceulemans, and Sepp Hochreiter. Deep Learning as an Opportunity in Virtual Screening. Deep Learning and Representation Learning Workshop, NIPS 2014.

2. George Dahl, Navdeep Jaitly, and Ruslan Salakhutdinov. Multi-task Neural Networks for QSAR Predictions. arXiv preprint arXiv:1406.1231, 2014.

3. Junshui Ma, Robert P. Sheridan, Andy Liaw, George Dahl, and Vladimir Svetnik. Deep Neural Nets as a Method for Quantitative Structure-Activity Relationships. Journal of Chemical Information and Modeling, 2015.

4. Peter Ripphausen, Britta Nisius, Lisa Peltason, and Jürgen Bajorath. Quo Vadis, Virtual Screening? A Comprehensive Survey of Prospective Applications. Journal of Medicinal Chemistry, 53(24):8461-8467, 2010.

Friday, 31 October 2014

Google Flu Trends gets a brand new engine



Each year the flu kills thousands of people and affects millions around the world. So it’s important that public health officials and health professionals learn about outbreaks as quickly as possible. In 2008 we launched Google Flu Trends in the U.S., using aggregate web searches to indicate when and where influenza was striking in real time. These models nicely complement other surveillance systems—they’re more fine-grained geographically, and they’re typically more immediate, up to 1-2 weeks ahead of traditional methods such as the CDC’s official reports. They can also be incredibly helpful for countries that don’t have official flu tracking. Since launching, we’ve expanded Flu Trends to cover 29 countries, and launched Dengue Trends in 10 countries.

The original model performed surprisingly well despite its simplicity. It was retrained just once per year, and typically used only the 50 to 300 queries that produced the best estimates for prior seasons. We then left it to perform through the new season and evaluated it at the end. It didn’t use the official CDC data for estimation during the season—only in the initial training.
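As a rough illustration of that kind of once-per-season model (every detail below is an assumption for the sketch, not the actual Flu Trends code), one can select the queries whose search fractions best track past CDC data, fit a single linear model on them, and then freeze it for the new season:

```python
# Sketch of a once-per-season query-based flu model (illustrative assumptions throughout;
# the real query selection and fitting procedure was more involved).
import numpy as np

def fit_seasonal_model(query_fractions, cdc_ili, top_k=100):
    """query_fractions: (weeks, queries) search fractions from prior seasons;
    cdc_ili: (weeks,) official influenza-like-illness rates for the same weeks."""
    # Keep the queries whose history correlates best with the official flu signal
    corr = np.array([np.corrcoef(query_fractions[:, j], cdc_ili)[0, 1]
                     for j in range(query_fractions.shape[1])])
    selected = np.argsort(-np.abs(corr))[:top_k]
    # Fit one least-squares model on those queries; it is then left alone for the season
    X = np.column_stack([query_fractions[:, selected], np.ones(len(cdc_ili))])
    coef, *_ = np.linalg.lstsq(X, cdc_ili, rcond=None)
    return selected, coef

def estimate_flu(query_fractions, selected, coef):
    # Apply the frozen model to new weeks of query data (no CDC input during the season)
    X = np.column_stack([query_fractions[:, selected], np.ones(len(query_fractions))])
    return X @ coef
```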

In the 2012/2013 season, we significantly overpredicted compared to the CDC’s reported U.S. flu levels. We investigated and in the 2013/2014 season launched a retrained model (still using the original method). It performed within the historic range, but we wondered: could we do even better? Could we improve the accuracy significantly with a more robust model that learns continuously from official flu data?

So for the 2014/2015 season, we’re launching a new Flu Trends model in the U.S. that—like many of the best performing methods [1, 2, 3] in the literature—takes official CDC flu data into account as the flu season progresses. We’ll publish the details in a technical paper soon. We look forward to seeing how the new model performs in 2014/2015 and whether this method could be extended to other countries.
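To make the contrast with the original model concrete, here is one hedged reading of what taking official data into account as the season progresses could look like in code: refit every week, adding the most recent (lagged) CDC report as a feature. The lag handling, features, and estimator below are assumptions for illustration, not the method in the forthcoming paper.

```python
# Hedged sketch of a weekly-updated flu nowcast that folds in lagged CDC data
# (all modeling choices here are assumptions, not the published Flu Trends method).
import numpy as np

def weekly_estimate(query_features, cdc_reported, lag=2):
    """query_features: (weeks_so_far, k) array of query signals, including the current week;
    cdc_reported: array of official ILI values, trailing the current week by `lag` weeks."""
    n_train = len(cdc_reported) - lag   # weeks that have both features and a known target
    # Features per training week: that week's query signals plus the CDC value available then
    X = np.column_stack([query_features[lag:lag + n_train],
                         cdc_reported[:n_train],
                         np.ones(n_train)])
    y = cdc_reported[lag:lag + n_train]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # refit from scratch each week
    # Nowcast the current week from its queries and the most recent official report
    x_now = np.concatenate([query_features[-1], cdc_reported[-1:], [1.0]])
    return float(x_now @ coef)
```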

As we’ve said since 2009, "This system is not designed to be a replacement for traditional surveillance networks or supplant the need for laboratory-based diagnoses and surveillance." But we do hope it can help alert health professionals to outbreaks early, and in areas without traditional monitoring, and give us all better odds against the flu.

Stay healthy this season!

Wednesday, 30 July 2014

Facilitating Genomics Research with Google Cloud Platform



Our understanding of the origin and progression of cancer remains in its infancy. However, thanks to rapid advances in the ability to accurately read and identify (i.e. sequence) the DNA of cancerous cells, knowledge in this field is growing quickly. Several comprehensive sequencing studies have shown that alterations of single base pairs within the DNA, known as Single Nucleotide Variants (SNVs), or duplications, deletions and rearrangements of larger segments of the genome, known as Structural Variations (SVs), are the primary causes of cancer and can influence which drugs will be effective against an individual tumor.

However, one of the major roadblocks hampering progress is the lack of accurate methods for interpreting genome sequence data. Due to the sheer volume of genomics data (the entire genome of just one person produces more than 100 gigabytes of raw data!), the ability to precisely localize a genomic alteration (SNV or SV) and resolve its association with cancer remains a considerable research challenge. Furthermore, preliminary benchmark studies conducted by the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA) have found that different mutation calling software run on the same data can detect different sets of mutations. Clearly, optimization and standardization of mutation detection methods is a prerequisite for realizing personalized medicine applications based on a patient’s own genome.
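As a back-of-the-envelope check on that data volume (the coverage and per-base storage figures below are illustrative assumptions, not a description of any particular sequencing pipeline):

```python
# Rough estimate of raw whole-genome sequencing output (assumed, round numbers).
genome_bases = 3.2e9      # ~3.2 billion base pairs in a human genome
coverage = 30             # each position is read ~30 times for reliable variant calling
bytes_per_base = 2        # roughly one byte for the base call plus one for its quality score
raw_bytes = genome_bases * coverage * bytes_per_base
print(f"~{raw_bytes / 1e9:.0f} GB of raw data")   # ~190 GB before compression
```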

The ICGC and TCGA are working to address this issue through an open community-based collaborative competition, run in conjunction with leading research institutions: the Ontario Institute for Cancer Research, University of California Santa Cruz, Sage Bionetworks, IBM-DREAM, and Oregon Health and Sciences University. Together, they are running the DREAM Somatic Mutation Calling Challenge, in which researchers from across the world “compete” to find the most accurate SNV and SV detection algorithms. By creating a living benchmark for mutation detection, the DREAM Challenge aims to improve standard methods for identifying cancer-associated mutations and rearrangements in tumor and normal samples from whole-genome sequencing data.

Given Google’s recent partnership with the Global Alliance for Genomics and Health, we are excited to provide cloud computing resources on Google Cloud Platform for competitors in the DREAM Challenge. Scientists who do not have ready access to large local computer clusters can participate using open access to the contest data along with credits for Google Compute Engine virtual machines. By leveraging the power of cloud technologies for genomics computing, contestants gain access to powerful computational resources and a platform for sharing data. We hope to democratize research, foster open access to data, and spur collaboration.

In addition to the core Google Cloud Platform infrastructure, the Google Genomics team has implemented a simple web-based API to store, process, explore, and share genomic data at scale. We have made the Challenge datasets available through the Google Genomics API. The challenge includes both simulated tumor data for which the correct answers are known and real tumor data for which the correct answers are not known.
Genomics API Browser showing a particular cancer variant position (highlighted) in the in silico #1 dataset, a variant that was missed by many challenge participants.
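For contestants working programmatically, access to the Challenge data through the Genomics API might look roughly like the variant search sketched below. The endpoint version, request fields, and variant-set ID shown are assumptions made for illustration; consult the Challenge and API documentation for the real values.

```python
# Hedged sketch of a variant search against a Genomics-style REST endpoint using `requests`.
# The URL, field names, and IDs are illustrative assumptions, not taken from the Challenge docs.
import requests

API_KEY = "YOUR_API_KEY"                                    # hypothetical credential
url = "https://www.googleapis.com/genomics/v1beta2/variants/search"
body = {
    "variantSetIds": ["CHALLENGE_VARIANT_SET_ID"],          # hypothetical Challenge dataset ID
    "referenceName": "chr1",
    "start": 1000000,
    "end": 1001000,
}
resp = requests.post(url, params={"key": API_KEY}, json=body)
for variant in resp.json().get("variants", []):
    print(variant["referenceName"], variant["start"], variant.get("alternateBases"))
```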
Although submissions for the simulated data can be scored immediately, the winners on the real tumor data will not be known when the challenge closes. This is a consequence of the fact that current DNA sequencing technology does not provide 100% accurate data, which adds to the complexity of the problem these algorithms are attempting to tackle. Therefore, to identify the winners, researchers must turn to alternative laboratory technologies to verify whether a particular mutation found in the sequencing data is real (or likely to be). As such, additional data will be collected after the Challenge is complete in order to determine the winner. The organizers will re-sequence DNA from the cells of the real tumor using an independent sequencing technology (Ion Torrent), specifically examining regions overlapping the positions of the cancer mutations submitted by the contest participants.

As an analogy, a "scratched magnifying glass" is used to examine the genome the first time around. The second time around, a "stronger magnifying glass with scratches in different places" is used to look at the specific locations in the genome reported by the challenge participants. By combining the data collected by those two different "magnifying glasses", and then comparing that against the cancer mutations submitted by the contest participants, the winner will then be determined.
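In simplified terms, once the validation data exists, scoring a submission reduces to comparing its reported positions against the calls the re-sequencing confirms or contradicts. The sketch below shows that general idea only; the actual DREAM Challenge scoring procedure is more nuanced.

```python
# Simplified sketch of scoring a submission against validated mutation calls
# (illustrative only; the real Challenge scoring is more involved).
def score_submission(submitted, validated_true, validated_false):
    """All arguments are sets of (chromosome, position) tuples."""
    tp = len(submitted & validated_true)    # submitted calls confirmed by re-sequencing
    fp = len(submitted & validated_false)   # submitted calls contradicted by re-sequencing
    fn = len(validated_true - submitted)    # confirmed mutations the submission missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

precision, recall = score_submission(
    submitted={("chr1", 10177), ("chr2", 4550), ("chr3", 901)},
    validated_true={("chr1", 10177), ("chr3", 901), ("chr7", 123)},
    validated_false={("chr2", 4550)},
)
print(precision, recall)   # 0.67 precision, 0.67 recall for this toy example
```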

We believe we are at the beginning of a transformation in medicine and basic research, driven by advances in genome sequencing and computing at scale. With the DREAM Challenge, we are excited to help bring researchers from around the world together to focus on this particular cancer research problem. To learn more about how to participate in the challenge, register here.