GECCO 2025 | Automated Machine Learning Tools for Data Science, Modeling, and Algorithm Benchmarking

Description

Automated Machine Learning (AutoML) has emerged over the last decade as a subfield of artificial intelligence (AI) and machine learning (ML), focused on the automation of machine learning modeling and other key elements of a data science analysis pipeline in order to “relax the need for a user in the loop”. The primary goal of AutoML methods and software packages has been to make the application of ML easier, more accessible (to those with and without programming or ML experience), and more capable of optimizing ML model performance across a wide variety of ML algorithms, hyperparameters, and data processing options.

Notably, a number of available AutoML tools use evolutionary optimization strategies to drive search, include evolutionary machine learning approaches among their available ML modeling algorithms, or both. Beyond making ML modeling and data analytics easier, these frameworks can (and arguably should) be leveraged to conduct fairer, better standardized, more reproducible, and more rigorous algorithm performance comparisons and benchmarking.
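As a rough illustration of what "evolutionary optimization driving search" can mean, the sketch below evolves decision tree hyperparameters using mutation and truncation selection, scored by cross-validation. It is a deliberately simple stand-in written against plain scikit-learn; the search space, the `evolve` function, and all parameter values are assumptions chosen for illustration, and it does not reproduce how TPOT or STREAMLINE implement their search.

```python
import random

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical search space, chosen only for illustration.
SEARCH_SPACE = {
    "max_depth": [2, 3, 4, 6, 8, None],
    "min_samples_leaf": [1, 2, 5, 10, 20],
    "criterion": ["gini", "entropy"],
}


def random_config(rng):
    """Sample one configuration uniformly from the search space."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def mutate(config, rng):
    """Copy a configuration and resample one randomly chosen hyperparameter."""
    child = dict(config)
    name = rng.choice(list(SEARCH_SPACE))
    child[name] = rng.choice(SEARCH_SPACE[name])
    return child


def fitness(config, X, y):
    """Mean 3-fold cross-validated balanced accuracy for a configuration."""
    model = DecisionTreeClassifier(random_state=0, **config)
    return cross_val_score(model, X, y, cv=3, scoring="balanced_accuracy").mean()


def evolve(X, y, pop_size=6, generations=5, seed=0):
    """Mutation-only evolutionary search that always keeps the best half."""
    rng = random.Random(seed)
    scored = [(fitness(c, X, y), c) for c in (random_config(rng) for _ in range(pop_size))]
    history = []
    for _ in range(generations):
        scored.sort(key=lambda pair: pair[0], reverse=True)
        history.append(scored[0][0])  # best fitness seen so far
        parents = scored[: pop_size // 2]  # truncation selection (elitist)
        children = [mutate(rng.choice(parents)[1], rng) for _ in range(pop_size - len(parents))]
        scored = parents + [(fitness(c, X, y), c) for c in children]
    best_score, best_config = max(scored, key=lambda pair: pair[0])
    history.append(best_score)
    return best_config, best_score, history


X, y = load_breast_cancer(return_X_y=True)
best_config, best_score, history = evolve(X, y)
print("best configuration:", best_config)
print("best balanced accuracy:", round(best_score, 3))
```

Because the parents survive into the next generation, the best score recorded in `history` can never decrease. Real AutoML tools search far larger spaces, typically covering preprocessing and algorithm choice as well as hyperparameters.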

This tutorial will begin by broadly introducing participants to AutoML tools, discussing their scope, capabilities, and tradeoffs. Next, it will dive more deeply into two specific AutoML packages, STREAMLINE and TPOT, both developed by investigator teams working in the field of evolutionary computation. It will cover how these AutoML frameworks work, what they automate, their unique capabilities, their installation and use, and how evolutionary computation relates to each. Lastly, using the STREAMLINE AutoML framework, this tutorial will offer a practical demonstration of how it can be applied (1) to model and evaluate real-world data as part of a comprehensive automated data science pipeline, and (2) to easily, fairly, rigorously, and reproducibly compare and benchmark the performance of new ML modeling approaches (evolutionary or otherwise) against established algorithms. The tutorial is outlined below.

1. Provide an overview of the typical elements of a machine learning data science analysis pipeline.
2. Define and introduce AutoML in contrast with traditional approaches to data science and ML.
3. Briefly review 20+ currently available AutoML tools and packages, contrasting their scope and capabilities.
4. Take a closer look at the STREAMLINE and TPOT AutoML packages focusing on how they work.
5. Walk through an example of applying STREAMLINE to a real-world analysis of biomedical data with the goal of optimizing ML model predictive performance, conducting model interpretation/explanation, and evaluating the reproducibility of model performance on new replication data.
6. Walk through an example of adding a new scikit-learn compatible ML algorithm to the STREAMLINE algorithm repertoire, and using the AutoML framework to benchmark and compare its performance against other established ML algorithms across diverse benchmark datasets, in a rigorous and reproducible manner.
7. Provide a hands-on demo for participants to try out STREAMLINE for themselves (via Google Colab) on their laptops or smartphones.
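To make item 6 more concrete, the sketch below shows what a "scikit-learn compatible" algorithm can look like and how it might be compared with established algorithms under identical preprocessing and cross-validation splits. It uses only plain scikit-learn and does not use STREAMLINE's own interface; the `NearestCentroidSketch` class, the `benchmark` helper, the dataset, and the metric are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.validation import check_array, check_is_fitted, check_X_y


class NearestCentroidSketch(ClassifierMixin, BaseEstimator):
    """Hypothetical 'new' algorithm: predict the class with the closest mean."""

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        self.classes_ = np.unique(y)
        # One centroid (feature-wise mean) per class.
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        check_is_fitted(self, "centroids_")
        X = check_array(X)
        # Euclidean distance from every sample to every class centroid.
        dists = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[dists.argmin(axis=1)]


def benchmark(models, X, y, n_splits=5, seed=42):
    """Score every model with the same scaler, folds, and metric."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = {}
    for name, model in models.items():
        pipeline = make_pipeline(StandardScaler(), model)
        scores = cross_val_score(pipeline, X, y, cv=cv, scoring="balanced_accuracy")
        results[name] = (scores.mean(), scores.std())
    return results


X, y = load_breast_cancer(return_X_y=True)
models = {
    "nearest_centroid_sketch": NearestCentroidSketch(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}
results = benchmark(models, X, y)
for name, (mean, std) in results.items():
    print(f"{name}: balanced accuracy {mean:.3f} +/- {std:.3f}")
```

Fixing the random seed and reusing the same folds for every algorithm is what makes the comparison fair and repeatable. An AutoML framework extends the same idea to hyperparameter optimization, multiple datasets, and statistical comparison.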

Organizers

Ryan Urbanowicz

Dr. Ryan Urbanowicz is an Assistant Professor of Computational Biomedicine at Cedars-Sinai Medical Center. His research focuses on the development of machine learning, artificial intelligence automation, data mining, and informatics methodologies, as well as their application to biomedical and clinical data analyses. This work is driven by the challenges presented by large-scale data, complex patterns of association (e.g., epistasis and genetic heterogeneity), data integration, and the essential demand for interpretability, reproducibility, and efficiency in machine learning. His research group has developed a number of machine learning software packages, including ReBATE, GAMETES, ExSTraCS, STREAMLINE, and FIBERS. He has been a regular contributor to GECCO since 2009, having (1) provided tutorials on learning classifier systems and the application of evolutionary algorithms to biomedical data analysis, (2) co-chaired the International Workshop on Learning Classifier Systems and a workshop on benchmarking evolutionary algorithms, and (3) co-chaired various tracks. He is also an invested educator, with dozens of educational videos and lectures available on his YouTube channel, and is co-author of the textbook “Introduction to Learning Classifier Systems”.