Software Seminar

Accelerating Model Selection in Advanced Analytics

Arun Kumar

Assistant Professor
University of California, San Diego
Tuesday, September 26, 2017
2:30pm - 4:00pm
BBB 4901


About the Event

Advanced analytics--the analysis of large and complex datasets using machine learning (ML)--is becoming ubiquitous, with a growing demand for advanced analytics tools in enterprise domains. However, several challenging bottlenecks limit both system efficiency and data scientist productivity in the end-to-end process of building and deploying advanced analytics applications. My research focuses on abstractions, algorithms, and systems to mitigate such bottlenecks and accelerate advanced analytics from a data management standpoint. In this talk, I will focus on our recent work on mitigating such bottlenecks in model selection, the part of model building that encompasses feature engineering, algorithm selection, and hyperparameter tuning. I will give an overview of this crucial ML process and highlight opportunities for optimizing it through a data management-inspired lens. I will then dive deeper into one set of optimizations: ML over joins of multiple tables. Joins are useful for gathering more features in multi-table data, but they often cause the data to blow up in size, which slows down ML and increases costs. We show how to mitigate these issues by "avoiding joins physically," i.e., pushing ML down through joins. Inspired by a classical database query optimization idea, our approach reduces runtime without affecting accuracy. Going further, we apply statistical learning theory to show how one can often also "avoid joins logically," i.e., ignore entire tables outright, which significantly reduces runtime without losing much accuracy. I will also talk briefly about our ongoing work on generalizing these ideas to mitigate more bottlenecks in the model selection process, and about our work on managing the deployment of complex ML models.
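
As a rough illustration of what "avoiding joins physically" can mean, the sketch below scores a linear model over a two-table join without building the joined table. It is a minimal sketch under stated assumptions, not the speaker's implementation: the toy tables S and R, the weights, and the function names are all hypothetical, and a simple inner product stands in for the ML computation. The partial product for each row of R is computed once and reused by every row of S that references it.

```python
# Hypothetical toy data: names and values are illustrative, not from the talk.
# S is the fact table: each row is (features, foreign key into R).
S = [
    ([1.0, 2.0], "a"),
    ([0.5, 1.5], "b"),
    ([2.0, 0.0], "a"),
    ([1.0, 1.0], "a"),
]
# R is the dimension table: key -> features.
R = {
    "a": [3.0, 1.0, 2.0],
    "b": [0.0, 4.0, 1.0],
}
# Assumed linear-model weights, split by the table each feature comes from.
w_s = [0.5, -1.0]
w_r = [1.0, 0.5, -2.0]


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def scores_materialized(S, R, w_s, w_r):
    """Baseline: build the joined (widened) rows, then score each one."""
    joined = [xs + R[fk] for xs, fk in S]
    w = w_s + w_r
    return [dot(w, row) for row in joined]


def scores_factorized(S, R, w_s, w_r):
    """Skip the join: compute one partial product per R row and reuse it."""
    partial = {key: dot(w_r, xr) for key, xr in R.items()}
    return [dot(w_s, xs) + partial[fk] for xs, fk in S]


if __name__ == "__main__":
    print(scores_materialized(S, R, w_s, w_r))
    print(scores_factorized(S, R, w_s, w_r))
```

Both functions return the same scores. The factorized version does less redundant work whenever many rows of S share the same row of R, which is the situation in which a join inflates the data.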


Arun Kumar is an Assistant Professor in the Department of Computer Science and Engineering at the University of California, San Diego. He is a member of the Database Lab and an affiliate member of the AI Group and CNS. He obtained his PhD from the University of Wisconsin-Madison in 2016. His primary research interests are in data management, especially the intersection of data management and machine learning, an area that is increasingly called advanced analytics or data science. Systems and ideas based on his research have been released as part of the MADlib open-source library, shipped as part of products from EMC, Oracle, Cloudera, and IBM, and used internally by Facebook, LogicBlox, and Microsoft. He is a recipient of the Best Paper Award at ACM SIGMOD 2014, the 2016 Graduate Student Research Award for the best dissertation research in UW-Madison CS, and a 2016 Google Faculty Research Award. Webpage: http://cseweb.ucsd.edu/~arunkk/

Additional Information

Contact: Barzan Mozafari

Sponsor(s): Software

Faculty Sponsor: Barzan Mozafari

Open to: Public