How to Structure Your Machine Learning Code Repository

Have you ever trained a machine learning model and waited a whole night for it to finish? Have you ever lost your progress, misplaced your train/validation losses, or ended up with too many different versions of data, models, and code lying around?

It’s time to take a step back. While most machine learning practitioners focus on improving model performance, a little effort spent on proper software engineering practices goes a long way.

This post isn’t about pushing a model into production, which usually involves gradual rollouts and other infrastructure-related work. Instead, let’s explore what we can do *before* modelling and training to make both the MLEs and the SWEs much happier.

Here’s an example from a project I worked on during grad school. The repository has the following files and directories.

.
├── README.md
├── data
├── evaluation
├── models
├── notebook
├── requirements.txt
├── results
├── singularity_python.sh
├── src
└── test

Let’s have a look at the file structure and why we need each piece.

Documentation

Related files and/or directories:

  • README

Since this was a school project, we did not have a very elaborate documentation setup. However, a simple README with installation instructions and a project description goes a long way. Too often, you come across GitHub links in research papers that don’t even have a README.

Dependencies

Related files and/or directories:

  • requirements.txt
  • singularity_python.sh

The requirements file lists all the Python packages needed to reproduce your experiments. Even without a production end goal in mind, you as a machine learning researcher should still take a moment to record your current dependencies. It’s very simple.

Just run

pip3 freeze > requirements.txt 

pip doesn’t get as much love as npm, but there are plenty of alternatives out there, such as pipreqs and pigar.

The `singularity_python.sh` file was a bash script written solely for the purposes of this project.

Data exploration

Related files and/or directories:

  • notebook

I don’t even like Jupyter Notebooks, but you don’t have to hear it from me: Joel Grus gave a fantastic presentation on why they are terrible, especially when mixed with git.

In my opinion, notebooks should be avoided as much as possible. I have worked at companies that specifically ignore this directory in the .gitignore to avoid unnecessary notebook merge conflicts.

In some cases, it can be useful to commit only the final version of well-maintained notebooks. No out-of-order executions, please.

Source code with customized models

Related files and/or directories:

  • src

This deserves its own blog post. We built a generic framework that includes the following:

./src
├── __init__.py
├── algorithms # model architecture
├── datasets # reads the data into a pytorch dataset object
├── experiments # the actual lifecycle with all the hyperparameters, checkpointing, versioning, etc
├── scripts # scripts to invoke the experiments
├── transforms # data transformations, e.g. normalization
└── utils # utility functions

The scripts directory provides entry points to all the experiments. The models are defined inside algorithms, while the experiments directory defines and stores the experiment hyperparameters. This lets us pair different hyperparameters with different models easily; if we lumped these together, we would end up with a lot of redundant code. A minimal sketch of how these pieces fit together is below.
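
To illustrate the idea (this is a sketch, not the project’s actual code), here is roughly how a dataset, a transform, a model, and an experiment config can live in separate modules and be wired together. All class names, file names, and shapes below are hypothetical.

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset

# src/datasets/: a toy dataset that mimics the real data's shapes
class RandomImageDataset(Dataset):
    def __init__(self, n=8, transform=None):
        self.images = np.random.rand(n, 1, 28, 28).astype("float32")
        self.labels = np.random.randint(0, 10, size=n)
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        x = torch.from_numpy(self.images[idx])
        if self.transform is not None:
            x = self.transform(x)
        return x, int(self.labels[idx])

# src/transforms/: a simple normalization transform
def normalize(x):
    return (x - x.mean()) / (x.std() + 1e-8)

# src/algorithms/: a small model architecture
class MLP(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(), nn.Linear(28 * 28, hidden), nn.ReLU(), nn.Linear(hidden, 10)
        )

    def forward(self, x):
        return self.net(x)

# src/experiments/: hyperparameters live here and are paired with a model at run time
config = {"hidden": 32, "lr": 1e-3}
dataset = RandomImageDataset(transform=normalize)
model = MLP(hidden=config["hidden"])

Because the pieces are decoupled, swapping in a different model or a different set of hyperparameters only touches one module.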

Datasets

Related files and/or directories:

  • data

This one is quite self-explanatory. You might gasp and wonder how I live with myself after committing a large file to a git repo. The answer is: I did not. This folder contains a script that creates a tiny dummy dataset using numpy’s random number generator (a sketch follows the list below). It simply makes sure that

1. we agree on the path of the source data for downstream processing across different platforms/machines, and

2. we can run a one-iteration end-to-end test. We programmed most models from scratch in PyTorch, so it’s important to make sure the dimensions match before training on a GPU cluster.
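
Such a script doesn’t need to be fancy. Here is a minimal sketch, assuming the dummy data is written as .npy files under data/; the file name and array shapes are hypothetical.

# data/make_dummy_data.py (hypothetical file name)
import numpy as np
from pathlib import Path

def make_dummy_data(out_dir="data", n=8):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # tiny random arrays that mimic the real dataset's shapes
    images = np.random.rand(n, 1, 28, 28).astype("float32")
    labels = np.random.randint(0, 10, size=n)
    np.save(out / "dummy_images.npy", images)
    np.save(out / "dummy_labels.npy", labels)

if __name__ == "__main__":
    make_dummy_data()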

Model archiving, monitoring

Related files and/or directories:

  • models: the actual model binary files
  • evaluation: a course-specific evaluation script
  • results: the metrics we monitored at each iteration, archived in a log for later plotting and analysis. We decoupled this logging from the training loop so that it doesn’t interfere with model training (see the sketch after this list).
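
As a rough illustration (not the project’s actual code), checkpointing and metric logging can look something like this; the paths, file names, and metric keys are hypothetical.

import json
import torch

# models/: save a checkpoint so a crashed run doesn't lose a whole night of training
def save_checkpoint(model, optimizer, epoch, path="models/checkpoint.pt"):
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )

# results/: append one line of metrics per iteration; a separate script plots them later
def log_metrics(metrics, step, path="results/metrics.jsonl"):
    with open(path, "a") as f:
        f.write(json.dumps({"step": step, **metrics}) + "\n")

Keeping the plotting and analysis in a separate script means the training loop only ever appends to a log file.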

Tests

This also deserves a blog post of its own. Yes, you need testing in your machine learning code: unit tests and integration tests wherever they add value. In this case, we tested our models and some utility functions with funky reshaping, to make sure we were processing the source image data correctly.
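
For example, a shape-checking unit test can be as small as the sketch below; the utility under test and the tensor shapes are illustrative, not the project’s actual code.

# test/test_reshape.py (hypothetical)
import torch

def reshape_to_patches(images, patch=7):
    # hypothetical utility under test: split 28x28 images into patch x patch tiles
    n, c, h, w = images.shape
    return images.reshape(n, c, h // patch, patch, w // patch, patch)

def test_reshape_preserves_pixel_count():
    x = torch.rand(4, 1, 28, 28)
    out = reshape_to_patches(x)
    assert out.numel() == x.numel()
    assert out.shape == (4, 1, 4, 7, 4, 7)

A test like this runs in milliseconds and catches dimension mismatches long before a GPU cluster does.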

Speedround

Yes, please have a .gitignore file, and don’t accidentally commit your data.
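
A minimal sketch of such a .gitignore; the entries are illustrative, so adapt them to your own data and model formats.

__pycache__/
*.pyc
.ipynb_checkpoints/
*.npy
*.pt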

Conclusion

It’s always a good idea to organize your files, whether they’re code, model binaries, or datasets. Good software engineering practices can help you supercharge your machine learning experiments.
