This is the third blog in a three-part series. In Part 1: Zika and Big Data, I introduced the Zika virus that you’re likely familiar with and brainstormed a few ways big data could help fight it. Part 2: Learning from Ebola — How Big Data Can Be Applied to Viral Epidemiology, reflected on personal lessons from last year’s Ebola crisis that have shaped my thinking about the impact Hadoop can have during epidemic events like these. This blog dives into specific areas where Hadoop and big data technologies can be applied to support the fight against Zika.
One of the most common applications of big data technology is signal detection and surveillance at patient intake. Health systems today often use Cloudera to ingest real-time streams of data from electronic health record (EHR) systems using the Health Level Seven (HL7) standard. Because big data platforms accept data in any type and format, it becomes possible to analyze multiple EHR feeds, clinical text included, as they stream in and apply intelligence to them. In epidemiology, this takes the form of identifying patients, as they report symptoms of Zika for the first time, who might otherwise escape detection at the point of care due to caregiver error or other factors. With Zika, a single error or delay in diagnosis has high consequences.
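To make the intake-screening idea concrete, here is a minimal sketch of scanning an HL7 v2 message (pipe-delimited segments) for Zika-related symptom keywords in free-text notes. The keyword list, the sample message, and the two-keyword threshold are illustrative assumptions, not a clinical rule set or a real HL7 parser.

```python
# Toy HL7 v2 screening sketch. HL7 v2 messages are lines ("segments")
# of pipe-delimited fields; the first field names the segment (MSH,
# PID, NTE, OBX, ...). Keywords and threshold below are assumptions.

SYMPTOM_KEYWORDS = {"fever", "rash", "conjunctivitis", "joint pain", "arthralgia"}

def parse_segments(hl7_message):
    """Yield (segment_id, fields) pairs from a pipe-delimited message."""
    for line in hl7_message.strip().split("\n"):
        fields = line.split("|")
        yield fields[0], fields

def flag_possible_zika(hl7_message):
    """Flag a message when two or more symptom keywords appear in
    free-text note (NTE) or observation (OBX) segments."""
    hits = set()
    for seg_id, fields in parse_segments(hl7_message):
        if seg_id in ("NTE", "OBX"):
            text = " ".join(fields).lower()
            hits.update(k for k in SYMPTOM_KEYWORDS if k in text)
    return len(hits) >= 2

message = (
    "MSH|^~\\&|EHR|CLINIC|||202306010830||ADT^A01|123|P|2.5\n"
    "PID|1||555||DOE^JANE\n"
    "NTE|1||Patient reports fever and itchy rash after travel\n"
)
print(flag_possible_zika(message))  # True: 'fever' and 'rash' both present
```

In a real deployment this kind of rule would run inside a streaming engine over the live HL7 feed, and the matching would use clinical NLP rather than substring checks; the sketch only shows where the intelligence plugs in.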
Health systems are happy to invest in real-time computer intelligence as a backup to their trained professionals, especially when they are seeing tens of thousands of new patients a day. In a crisis setting, as the World Health Organization or other groups ‘instrument’ Zika intakes or other ‘first visit’ signals, we want big data technologies applied immediately. In these cases, what matters is not so much the volume capabilities of big data as its native strength at collecting and mining multi-structured data sets. In this emergency response mode, the only sane approach is to apply intelligence about those data sets at the time of analysis, i.e. ‘schema-on-read’.
A second, and perhaps even more important, application of big data is in the genomic dimension. Why some babies resist microcephaly and others do not will be answered by the genome of the mother, the child, or both, even if environment or drugs also play a significant role. In the past, the absence of big data technologies and limitations in genetic sequencing meant researchers often had to pick a small number of genes to analyze when a health problem arose. That is akin to searching for a needle in a haystack by picking a few ‘strategically guessed’ clumps of hay to look through. Today, with big data technology in place, researchers can look through the entire genome; this is whole exome or whole genome sequencing. Invariably, Zika research will involve mining each of the following areas of interest, among others:
- Mother’s genome and epigenetic clues like methylation
- Mother’s, father’s and baby’s genome (together, a triple)
- Prenatal and postnatal genome of baby
- Longitudinal changes in mother and baby’s genomes
- Baby’s proteome, microbiome, and importantly epigenome
- Population controls and reference genomes new and old
- Mosquito reference genomes and carrier/case mosquito genomes
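One of the questions above, comparing mother’s, father’s, and baby’s genomes as a trio, reduces at its core to a set operation: which variants does the child carry that neither parent does (candidate de novo variants)? This sketch uses toy (chromosome, position, change) tuples standing in for real variant calls; actual pipelines work over VCF files at whole-genome scale.

```python
# Trio analysis sketch: find candidate de novo variants, i.e. variants
# present in the child but absent from both parents. Variants here are
# toy (chrom, pos, change) tuples, not real data.

def de_novo(baby, mother, father):
    """Set difference: child's variants minus the union of the parents'."""
    return baby - (mother | father)

mother = {("chr1", 1000, "A>G"), ("chr2", 500, "C>T")}
father = {("chr1", 1000, "A>G"), ("chr3", 42, "G>A")}
baby   = {("chr1", 1000, "A>G"), ("chr7", 77, "T>C")}

print(de_novo(baby, mother, father))  # {('chr7', 77, 'T>C')}
```

The same pattern, joins and differences over enormous variant sets, is where a distributed platform earns its keep: a whole genome yields millions of variants per individual, and population-scale comparisons multiply that further.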
Some have said, “We now know cancer is not just a disease of the anatomy, it’s a disease of the genome”. It is no longer plausible that Zika will be solved without genomic analytics. In fact, Inovio Pharmaceuticals, the company most consider to be furthest along in developing a Zika vaccine, uses a new approach called DNA-based vaccines. “The beauty of this technological platform is that the vaccine is simply a DNA sequence developed in water,” said Inovio’s CEO Dr. Joseph Kim. “It cuts through all the difficult handling and complex development times of traditional vaccine approaches.” In cases like Inovio’s, the typical approach is to complement molecular engineering with genomic science. Aside from optimization, the remarkable work of the last decade is largely complete in areas such as gene assembly, alignment, and genotyping. Today’s frontiers of innovation are faster, smarter downstream annotation and analysis of genomic data at scale, and the merging of genomic data with clinical and phenotype information. The latter, also known as precision medicine, must work hand in hand with an understanding of molecular pathways in any solution delivery. While the large-scale, ‘dry’ science and statistics of genomics appeals to big data professionals, we also need the chemical engineers and wet-lab bench scientists to turn learnings into proteins and compounds.
Whether big data is used to listen to traditional media, social media, and adverse event channels for early signals that something is wrong; whether it is calculating the risk that the patient sitting next to you in the emergency room has Zika; whether it is collecting the whole exomes of every newborn; or whether it is tracking outcomes across the whole population treated with a vaccine on an ongoing basis, it is clear that big data technologies are and will continue to be crucial during times of crisis.
Hadoop-based technologies like Cloudera’s data platform can integrate detailed, complex, multi-structured data as it is generated, from virtually unlimited sources. They can identify patterns in that data and facilitate data discovery and analytics, ultimately helping to expedite the detection of disease and outcome drivers and enabling clinicians to deliver the best care based on real-time, data-driven decision-making. They can analyze whole genome data and merge multi-omic data with clinical and phenotype data quickly and efficiently, shortening R&D cycles and letting us find treatment and prevention methods faster. And they can measure and evaluate the impact of treatments at scale, suggesting improvements along the way to optimize outcomes and, eventually, hopefully, eliminate viruses like Zika altogether, faster than we ever could before.