Defense of a Ph.D. Dissertation- Yuping (Allan) Lu
Major Professor: Dr. Michael A. Langston
Title: Advances in Big Data Analytics: Algorithmic Stability and Data Cleansing
Analysis of what has come to be called “big data” presents a number of challenges as data continues to grow in size, complexity and heterogeneity. To help addresses these challenges, we study a pair of foundational issues in algorithmic stability (robustness and tuning), with application to clustering in high-throughput computational biology, and an issue in data cleansing (outlier detection), with application to pre-processing in streaming meteorological measurement. These issues highlight major ongoing research aspects of modern big data analytics. First, a new metric, robustness, is proposed in the setting of biological data clustering to measure an algorithm’s tendency to maintain output coherence over a range of parameter settings. It is well known that different algorithms tend to produce different clusters, and that the choice of algorithm is often driven by factors such as data size and type, similarity measure(s) employed, and the sort of clusters desired. Even within the context of a single algorithm, clusters often vary drastically depending on parameter settings. Empirical comparisons performed over a variety of algorithms and settings show highly differential performance on transcriptomic data and demonstrate that many popular methods actually perform poorly. Second, tuning strategies are studied for maximizing biological fidelity when using the well-known paraclique algorithm. Three initialization strategies are compared, using ontological enrichment as a proxy for cluster quality. Although extant paraclique codes begin by simply employing the first maximum clique found, results indicate that by generating all maximum cliques and then choosing one of highest average edge weight, one can produce a small but statistically significant expected improvement in overall cluster quality. Third, a novel outlier detection method is described that helps cleanse data by combining Pearson correlation coefficients, K-means clustering, and Singular Spectrum Analysis in a coherent framework that detects instrument failures and extreme weather events in Atmospheric Radiation Measurement sensor data. The framework is tested and found to produce more accurate results than do traditional approaches that rely on a hand-annotated database.
Monday, July 15 at 3:00pm to 5:00pm
Min H. Kao Electrical Engineering and Computer Science, 639
1520 Middle Drive, Knoxville, TN 37996