Tutorial: Doreen Jirak


Know Your Data, Know Your Model: From Experiments to Machine Learning Applications

Dr. Doreen Jirak

Computational frameworks such as scikit-learn, PyTorch, and TensorFlow have long simplified the application of machine learning (ML) models by easing the implementation of popular (deep) learning architectures. However, model building is only one part of the ML application pipeline. Beyond toy examples like MNIST, many ML practitioners run into problems when working with custom datasets provided by clients or with self-collected experimental data. Conducting an exploratory data analysis (EDA) before deploying any ML model is therefore essential for understanding model biases and performance issues. In this tutorial, I will present the ML pipeline from experiment design to model evaluation. I will give an overview of possible experimental biases and show specific examples of how to shape data into meaningful input for an ML model. I will also discuss different models applicable to diverse tasks, along with the performance metrics needed to evaluate them properly and to support reasonable claims about ML applications. The tutorial covers important concepts such as data cleaning, normalization, feature selection, unsupervised and supervised learning, and ML metrics.
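
As a small illustration of three of the concepts named in the abstract (data cleaning, normalization, and ML metrics), the following Python sketch uses only the standard library. The toy data, the helper names, and the choice of z-score scaling and balanced accuracy are assumptions made for illustration; they are not taken from the tutorial itself.

```python
from statistics import mean, pstdev


def clean(rows):
    """Data cleaning: drop rows whose feature value is missing (None)."""
    return [(x, y) for x, y in rows if x is not None]


def fit_zscore(values):
    """Estimate mean and standard deviation from training values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # fall back to 1.0 for a constant column
    return mu, sigma


def apply_zscore(values, mu, sigma):
    """Normalization: rescale values with previously fitted statistics."""
    return [(v - mu) / sigma for v in values]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, which is less sensitive to class imbalance."""
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    return mean(recalls)


# Hypothetical training data: (feature, label); None marks a missing value.
raw_train = [(2.0, 0), (4.0, 0), (None, 1), (6.0, 0), (8.0, 1)]
train = clean(raw_train)
train_x = [x for x, _ in train]

# Fit the scaling on the training split, then reuse it for unseen data,
# so that no information from the test split leaks into preprocessing.
mu, sigma = fit_zscore(train_x)
train_scaled = apply_zscore(train_x, mu, sigma)
test_scaled = apply_zscore([5.0], mu, sigma)

# A classifier that always predicts the majority class on imbalanced labels.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
```

In this toy setup the majority-class predictor scores high on plain accuracy while balanced accuracy stays at chance level, which is one reason the choice of metric affects what can reasonably be claimed about a model.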