Hyperparameter Tune XGBoost with 1,000 CPUs

In this example we:

  • Download this 1.4GB Kaggle Dataset of commercial flight delays.

  • Train 36 XGBoost models with different parameters using thirteen 80-CPU machines.

  • Identify the best model from training results.

Step 1: Upload your data to the cluster

Download the following CSV file:

Then upload it to your Burla cluster filesystem:

(upload section is fast-forwarded)

Any files uploaded here will appear in a network-attached folder at ./shared inside every container in the cluster. Conversely, any files your code writes to this folder will appear in the "Filesystem" tab, where you can download them later or keep them for future work!
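In other words, code running on the cluster reads and writes ./shared with ordinary file I/O. A minimal sketch (Combined_Flights_2022.csv is the file we upload below; sample_rows.csv is a hypothetical output name):

```python
import pandas as pd

# Files uploaded via the dashboard appear under ./shared in every container:
df = pd.read_csv("./shared/Combined_Flights_2022.csv")

# Anything written back to ./shared shows up in the "Filesystem" tab:
df.head(100).to_csv("./shared/sample_rows.csv", index=False)
```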

Step 2: Boot some VMs

In the "Settings" tab, select the hardware and quantity of machines you want, then hit ⏻ Start ! Here we boot 13, 80-CPU VM's, these VM's delete themself after 15min. of inactivity.

(Booting is fast-forwarded; this cluster actually took 1.5 minutes to boot up.)

Now that our machines are ready, we can call remote_parallel_map!

You may have noticed in the settings we're using the python:3.12 Docker image. This is the image the code will run inside, and it doesn't come with any of the packages we need (like XGBoost, Pandas, etc.). That's fine, because Burla detects local packages at runtime and quickly installs them in all containers, usually in just a few seconds.

Step 3: Write a function to train one model

This function (sketched below):

  • Loads Combined_Flights_2022.csv from the ./shared folder as a Pandas DataFrame.

  • Cleans and separates data into train / test sets.

  • Trains one XGBoost model using the provided params dict and 80 CPUs.

  • Scores the model on the test set, then returns the AUC.
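Here is a minimal sketch of what train_model might look like. The column names and the delayed-flight label (departure delay over 15 minutes) are assumptions about the Kaggle dataset, not something Burla requires; adapt the cleaning step to your own data:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def train_model(params: dict) -> dict:
    # Load the uploaded dataset from the network-attached ./shared folder.
    df = pd.read_csv("./shared/Combined_Flights_2022.csv")

    # Assumed cleaning step: drop rows missing the delay column, label a
    # flight "delayed" if it departed more than 15 minutes late, and keep
    # a few numeric feature columns. (Column names are assumptions.)
    df = df.dropna(subset=["DepDelayMinutes"])
    y = (df["DepDelayMinutes"] > 15).astype(int)
    X = df[["CRSDepTime", "Distance", "DayOfWeek", "Month"]]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Train one model with this parameter set, using all 80 CPUs on the VM.
    model = XGBClassifier(**params, n_jobs=80)
    model.fit(X_train, y_train)

    # Score on the held-out test set and return the AUC with its params.
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return {"params": params, "auc": auc}
```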

To test the function, we call it on just one machine by passing it a single set of parameters:

We also pass func_cpu=80 to tell Burla that this function call should have 80 CPUs made available to it. We'll need this since we're passing n_jobs=80 to XGBoost inside the train_model function.
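A sketch of that single test call (the specific hyperparameter values here are arbitrary placeholders):

```python
from burla import remote_parallel_map

# One input -> one function call, run remotely with 80 CPUs available.
test_params = {"max_depth": 6, "learning_rate": 0.1, "n_estimators": 200}
results = remote_parallel_map(train_model, [test_params], func_cpu=80)
print(results[0])  # the {"params": ..., "auc": ...} dict from train_model
```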

Step 4: Call the function in parallel on 13 separate VMs!

Here we pass 36 sets of parameters to train_model. Because each function call requires 80 CPUs, and we have thirteen 80-CPU machines, this will immediately start 13 function calls and queue the remaining 23. Burla can reliably queue up to 10 million inputs.
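A sketch of the parallel run; the 3 × 4 × 3 grid below is one hypothetical way to produce 36 parameter sets:

```python
from itertools import product

from burla import remote_parallel_map

# 3 depths x 4 learning rates x 3 tree counts = 36 parameter sets.
param_sets = [
    {"max_depth": d, "learning_rate": lr, "n_estimators": n}
    for d, lr, n in product([4, 6, 8], [0.01, 0.05, 0.1, 0.3], [100, 200, 400])
]

# 13 calls start immediately (one per 80-CPU VM); the other 23 are queued.
results = remote_parallel_map(train_model, param_sets, func_cpu=80)
```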

Once submitted, we can monitor progress and view logs from the "Jobs" tab in the dashboard:
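Once all 36 calls finish, identifying the best model (the last bullet at the top of this page) is a one-liner over the returned results, assuming each result is the dict returned by the train_model sketch above:

```python
# Each result is {"params": ..., "auc": ...}; pick the highest test AUC.
best = max(results, key=lambda r: r["auc"])
print(f"Best AUC: {best['auc']:.4f} with params: {best['params']}")
```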
