Page cover

Genomic alignment pipeline (Illumina)

In this example we:

  • Download raw Illumina sequencing data from GSE165845arrow-up-right.

  • Download the reference genome plus the Illumina BPM and EGT files.

  • Convert 360 samples in parallel, then merge everything into PGEN/PVAR/PSAM files.

This is a pretty normal genomics workflow. The hard part is usually not the science. The hard part is getting all the command-line tools, reference files, sample files, and CPUs lined up at the same time.

Step 1: Boot some VMs

For this run we boot 13 VMs with 80 CPUs each.

We use a custom Docker image:

That image has bcftools, plink, and plink2 installed. Burla runs our Python functions inside that image, so the worker can call the same tools I would call from a terminal.

Step 2: Download prerequisite data

First we download the reference genome, the BPM file, and the EGT cluster file. Everything goes into ./shared, which is backed by a Google Cloud Storage bucket.

chevron-rightImports and URL definitionshashtag

After that, the prerequisite files show up in the dashboard filesystem.

Step 3: Download the IDAT files

Each sample has a red and green IDAT file, so 360 samples means 720 files. This is exactly the kind of thing I don't want to do one at a time.

The sample folders now contain the red and green IDATs.

Step 4: Convert every sample

Now each worker takes one sample, converts the IDAT pair to GTC, aligns to the reference genome, filters biallelic variants, and writes PLINK BED/BIM/FAM files.

Each function call gets 8 CPUs and 32GB of RAM. That is enough for one sample, and it lets the whole cohort move at once.

Step 5: Merge the cohort

The final merge runs once, with a bigger worker.

After it finishes, the PGEN/PVAR/PSAM files are available in the dashboard.

Want to run this code yourself?

The demo is available as a Colab notebook:

https://colab.research.google.com/drive/1lEbeGOoowZ9FKA9yctziWyhH6TvLuxTi?usp=sharingarrow-up-right

What's the point?

This is the real reason I like this example: the pipeline stays boring.

The worker code is still Python calling bcftools and plink. The files still live in a normal shared folder. The only real change is that instead of nursing one sample through the pipeline, we run the cohort at the size it was meant to run.

If this were my analysis, I would not want to spend the day building a scheduler before I even know whether the genotype conversion works. I would want to run the thing, look at the outputs, and only optimize once there is a real bottleneck.

Last updated