Each year, consumers create 16.3 zettabytes of information—enough to fill over 127 billion iPhones. Sorting through all this information is like trying to find a needle in a haystack the size of California.
Within these treasure troves of data are valuable insights waiting to be discovered. Data scientists use statistics, math, and information technology to sort through enormous datasets with millions of variables, looking for patterns. Yet combing through this information takes immense computing power, not to mention computer memory. So how do they sort through it all?
That’s where people like Zhigen Zhao, associate professor of Statistical Science at the Fox School, come in.
Zhao and his statistician colleagues invent new ways to use statistics, overcome computation limitations, and see patterns through the noise. Their discoveries range from a newly patented methodology that enables users to analyze millions of data points in seconds to a new threshold for pinpointing statistical significance.
Deciphering Genetic Codes
Humans have roughly 20,000 genes in our DNA. Much like data, decoding how each gene interacts with another can provide valuable insight, in this case into a person’s health. With nearly 200 million possible pairs of genes, that’s a lot of variables to test.
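To see where a figure of that size comes from, the pair count can be checked with a few lines of Python. This is only an arithmetic sketch: the function name `gene_pair_count` is made up for illustration, and the round figure of 20,000 genes is taken from the paragraph above.

```python
import math

def gene_pair_count(n_genes: int) -> int:
    """Number of unordered pairs that can be formed from n_genes genes."""
    # "n choose 2" = n * (n - 1) / 2
    return math.comb(n_genes, 2)

pairs = gene_pair_count(20_000)
print(f"{pairs:,}")  # 199,990,000
```

Because the count grows with the square of the number of genes, doubling the genes roughly quadruples the pairs to test.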
“Years ago, 10,000 was considered a big data set, but not anymore,” says Zhao. When using standard algorithms like distance correlation, statisticians run into issues with computation speed, and the old algorithms can’t keep up with the large datasets available today.
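For readers curious why a measure like distance correlation gets slow, here is a minimal sketch of the textbook sample version, written with NumPy. It is not Zhao's patented method, which the article does not describe; the function names are illustrative. The point to notice is that it builds a full table of distances between every pair of observations, so both time and memory grow with the square of the sample size.

```python
import numpy as np

def _double_centered_distances(x):
    """Pairwise |x_i - x_j| matrix with row, column, and grand means removed."""
    x = np.asarray(x, dtype=float)
    d = np.abs(x[:, None] - x[None, :])  # n-by-n matrix: the costly step
    return (d
            - d.mean(axis=0, keepdims=True)
            - d.mean(axis=1, keepdims=True)
            + d.mean())

def distance_correlation(x, y):
    """Naive sample distance correlation between two 1-D samples."""
    a = _double_centered_distances(x)
    b = _double_centered_distances(y)
    dcov2 = (a * b).mean()   # squared distance covariance
    dvar_x = (a * a).mean()  # squared distance variance of x
    dvar_y = (b * b).mean()  # squared distance variance of y
    denom = np.sqrt(dvar_x * dvar_y)
    if denom == 0:
        return 0.0           # a constant sample has no dependence to measure
    return float(np.sqrt(max(dcov2, 0.0) / denom))
```

With 10,000 observations, each distance matrix holds 100 million numbers, or about 800 MB at eight bytes apiece, and this sketch creates several of them for a single pair of variables.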
Zhao and his colleagues devised a methodology that can analyze all of these variables in seconds. “Our method only takes two-tenths of a second to finish this kind of calculation,” says Zhao. His computer would crash when using an older algorithm to analyze a dataset of that size.
“People’s health can depend on a specific combination of their genes,” says Zhao. This revolutionary methodology, which was recently approved for a patent, can identify certain combinations of genes that may help doctors understand medical issues ranging from heart disease and Alzheimer’s to obesity and alcoholism.
Discovering Differences in Education
With millions of pieces of information, statisticians and data scientists often grapple with the problem of false discoveries—inferring a pattern that is not truly significant. Statisticians try to account for these false discoveries, but this may lead to a less complete picture of the data.
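One widely used way to keep false discoveries in check is the Benjamini-Hochberg procedure, sketched below in Python with NumPy. It is shown only as a familiar baseline and is not the algorithm Zhao's team developed; the function name and the 0.05 default are illustrative assumptions. Each test produces a p-value, and the procedure decides how small a p-value must be to count as a discovery.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean array marking which tests count as discoveries."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices from smallest p to largest
    ranked = p[order]
    thresholds = alpha * np.arange(1, m + 1) / m   # rank i is compared with alpha * i / m
    below = np.nonzero(ranked <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size:
        k = below[-1]                              # largest rank that passes its threshold
        reject[order[:k + 1]] = True               # reject everything up to that rank
    return reject

# Four hypothetical p-values, made up for illustration
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

The cutoff here depends only on the p-values themselves. It takes no account of extra information, such as which group each test belongs to.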
Zhao and his colleagues created a new algorithm to reduce the number of false discoveries while keeping more pertinent patterns than other methods. Zhao’s team applied this algorithm to school districts in California, analyzing standardized test scores of students from over 4,000 elementary schools.
The researchers compared pass rates from two groups of students, the socioeconomically advantaged and the socioeconomically disadvantaged. Normally, the advantaged students will have higher scores than their disadvantaged counterparts. However, Zhao used his algorithm to identify schools that have unusually small or unusually large differences between the two populations—where the disadvantaged students were either significantly underperforming or overperforming in statewide math tests.
Their new algorithm found more schools whose populations have significant differences in test scores, providing a more complete understanding of the dataset. “The main idea for this method is to incorporate school district information to get a new threshold,” says Zhao. “The standard method, which doesn’t include this information, can be either overly conservative or overly liberal.” This kind of refined analysis can help district and state policymakers to reallocate resources to support underperforming schools or to imitate overperforming schools.
From education to healthcare and everything in between, Zhao and his fellow statisticians sort through enormous datasets, finding new ways to compute that better our everyday lives.
This story was originally published in On the Verge, the Fox School’s flagship research magazine. For more stories, visit www.fox.temple.edu/ontheverge.