More than 15 million adults struggle with alcohol addiction. In fact, according to the CDC, one in ten deaths of working-age adults in America is linked to alcohol. That’s one reason researchers have chosen to study data on alcohol use from the enormous dataset of the U.S. Department of Veterans Affairs’ ambitious Million Veteran Program (MVP). The VA intends, as the project’s name states, to gather data on an astonishing one million service members.
Kuang-Yao Lee, assistant professor of statistical science at the Fox School, sees a world of potential new knowledge in this vast cache of data. This is particularly true of alcohol use because the data from the MVP is longitudinal, which means the same measurements are tracked over time. Alongside the support from the VA, Lee’s project received funding through the Office for the Vice President of Research at Temple University.
Volunteers in the MVP each submit blood samples as well as health surveys, amassing a dataset that comprises both genetic data and behavioral patterns. Since 2016, when he was a researcher at Yale University, Lee and his colleagues have been using this information-rich resource to search for the specific combination of genes that corresponds to alcohol and other substance use.
“Previous studies have suggested [these genes exist], but mostly were only limited to small scales or restricted conditions,” says Lee. “We want to use statistical models to find out if this is really a valid assumption. Our results so far suggest a very strong association.”
While ample electronic health records and genetic data have long been available to researchers, only recently has the efficient computing power become available to slice and dice the information into accurate, usable new insights and discoveries. More sophisticated algorithms combined with larger-than-ever computer storage capacity, as well as parallel computation techniques, allow today’s researchers to make meaning from a huge amount of complex data.
How huge? “Depending on the facility, the whole genome sequencing [for one person] can produce hundreds of millions of variants,” says Lee. Questionnaires allow researchers to gather large amounts of information about each subject every time they are administered. Multiply that by one million veterans. “We’re talking about not just billions, but millions of millions of points of data,” he says.
Data with this level of complexity can lead to findings that are more nuanced and reliable than in the past. Previously, statistics sometimes led to oversimplified or otherwise not-quite-right conclusions. We’ve all heard the old axiom, “There are three kinds of lies: lies, damned lies, and statistics.” But as so-called big data increases in scope and complexity and the tools used to analyze this data become more sophisticated, statistics are becoming more honest than ever before. From projects such as the Million Veteran Program and other similarly vast datasets, new genetic truths may ultimately emerge.
There are many possible real-world applications for this research. For one thing, determining which specific genes are linked with alcohol and other substance abuse could lead to new and better medicines and treatments for the very veterans who have volunteered their most sensitive personal information for this work. A dialed-in genetic profile that indicates a vulnerability to substance abuse could be used to screen kids and even adults while there is still time for effective early interventions that can keep them on a healthy path. Given the current public health crisis around opioids, alcohol, and other substance use, a breakthrough of this kind could have far-reaching benefits.
Lee says that the knowledge gleaned from the Million Veteran Program about substance abuse may lead to similar projects that could help solve other vexing behavioral, health, and genetic puzzles. He also notes that the innovative statistical models and tools he’s used in this research could be applied in myriad ways to other complex datasets.
For example, online shopping platforms can easily observe huge numbers of individual consumers and, at the same time, collect data across large numbers of variables. “One of the core problems in business analytics is to use statistical models to study the inter-dependency between observed variables, for example, the dependency between decision making and consumer behavior,” Lee says.
“There are a surprising number of similarities between genomics and online shopping.”
This story was originally published in On the Verge, the Fox School’s flagship research magazine. For more stories, visit www.fox.temple.edu/ontheverge.