Multi-Stage Genomic Pipeline

Using 1,040 CPUs and <100 lines of code.

In this example we:

  • Download raw Illumina genotyping data (IDAT files) from this NCBI experiment.

  • Call genotypes for each sample and align them to a human reference genome.

  • Combine all samples into a single large PLINK .bed fileset, then convert it to PGEN/PVAR/PSAM files.

This is a typical workflow to prepare Illumina genotyping data for downstream analysis.

Step 1: Boot some VMs

In the "Settings" tab we select the hardware, container image, and number of VMs we want, then hit ⏻ Start on the homepage!

Here we boot 13 VMs, each with 80 CPUs; these VMs delete themselves after 15 minutes of inactivity. We also specify a custom Docker image: jakezuliani/idats_to_pgen:latest. This image has bcftools, PLINK, and PLINK2 installed, and it is the image our code will run inside.

Once our machines have booted, we can call remote_parallel_map!
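
For reference, remote_parallel_map takes a Python function and a list of inputs, then runs the function once per input across the cluster. Here is a minimal toy sketch (not part of the pipeline itself):

```python
# Minimal sketch, assuming the `burla` client package provides remote_parallel_map.
from burla import remote_parallel_map

def my_function(my_input):
    # Runs inside one container on the cluster; prints stream back to us.
    print(f"processing input {my_input}")

# Calls my_function once per input, in parallel, across the booted VMs.
remote_parallel_map(my_function, list(range(1000)))
```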

Step 2: Download prerequisite data

This code downloads the reference genome, BPM/EGT files, and sample info, then saves it all to ./shared, which is synced with the file explorer in the dashboard and with Google Cloud Storage.

Imports and URL definitions
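
The original code cell isn't reproduced here; below is a sketch of its likely shape. The URLs are hypothetical placeholders (the real cell defines the actual download locations), and download_to_shared is a helper invented for illustration:

```python
import urllib.request
from pathlib import Path

SHARED = Path("./shared")  # synced with the dashboard Filesystem tab and Google Cloud Storage
SHARED.mkdir(exist_ok=True)

# Hypothetical placeholder URLs; the real code cell defines the actual locations.
REFERENCE_GENOME_URL = "https://example.com/GRCh38.fa.gz"
BPM_MANIFEST_URL = "https://example.com/manifest.bpm"
EGT_CLUSTER_URL = "https://example.com/clusters.egt"
SAMPLE_INFO_URL = "https://example.com/sample_info.csv"

def download_to_shared(url: str):
    # Illustrative helper: download one file into ./shared.
    urllib.request.urlretrieve(url, SHARED / url.split("/")[-1])

for url in (REFERENCE_GENOME_URL, BPM_MANIFEST_URL, EGT_CLUSTER_URL, SAMPLE_INFO_URL):
    download_to_shared(url)
```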

After downloading this data we can see it in the Filesystem tab in the dashboard:

~screenshot~

Step 3: Download IDAT files for all 360 samples in parallel

This code uses 720 parallel 1-CPU containers to download the red and green IDAT files for each sample (two downloads per sample).
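
The download function itself isn't shown here, but a sketch of its likely shape follows. The sample IDs, URL pattern, and folder layout are illustrative placeholders, and we assume remote_parallel_map's func_cpu argument sets the per-container CPU count:

```python
import urllib.request
from pathlib import Path
from burla import remote_parallel_map

def download_idat(job):
    # Download one IDAT file (red or green channel) for one sample.
    sample_id, channel = job  # channel is "Red" or "Grn"
    sample_dir = Path("./shared/idats") / sample_id  # illustrative layout
    sample_dir.mkdir(parents=True, exist_ok=True)
    url = f"https://example.com/idats/{sample_id}_{channel}.idat"  # placeholder URL
    urllib.request.urlretrieve(url, sample_dir / f"{sample_id}_{channel}.idat")

sample_ids = [f"sample_{i}" for i in range(360)]  # stand-in for the real sample IDs
jobs = [(s, c) for s in sample_ids for c in ("Red", "Grn")]  # 360 samples x 2 channels = 720 jobs

# One 1-CPU container per (sample, channel) pair.
remote_parallel_map(download_idat, jobs, func_cpu=1)
```

Splitting each sample into two inputs, one per channel, is what turns 360 samples into 720 independent 1-CPU tasks.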

Folders with red/green IDATs for each sample are now visible in the Filesystem UI:

~screenshot~

Step 4: Call and align all samples in parallel

This code uses 360 parallel containers, each with 8 CPUs and 32GB of RAM.
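
A sketch of one way this step could be orchestrated, assuming func_cpu/func_ram arguments size each container. The actual genotype-calling command baked into jakezuliani/idats_to_pgen isn't shown in this doc, so the subprocess call below is only a stand-in:

```python
import subprocess
from pathlib import Path
from burla import remote_parallel_map

def call_and_align(sample_id: str):
    # Call genotypes for one sample and align them to the reference genome.
    idat_dir = Path("./shared/idats") / sample_id  # illustrative layout
    out_dir = Path("./shared/vcfs")
    out_dir.mkdir(exist_ok=True)
    # Placeholder: swap in the real bcftools / calling command from the image.
    subprocess.run(["bash", "-c", f"echo 'call + align {idat_dir} -> {out_dir}'"], check=True)

sample_ids = [f"sample_{i}" for i in range(360)]  # stand-in for the real sample IDs

# One container per sample, each with 8 CPUs and 32GB of RAM.
remote_parallel_map(call_and_align, sample_ids, func_cpu=8, func_ram=32)
```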

Step 5: Merge samples into a single PGEN/PVAR/PSAM fileset

This code uses a single container with 80 CPUs and 320GB of RAM.
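
One plausible shape for this merge, sketched under the assumption that mapping over a one-element input list lands the work on a single big container. PLINK's --merge-list and plink2's --make-pgen are real flags, but the file names and exact commands here are illustrative, not the author's code:

```python
import subprocess
from burla import remote_parallel_map

def merge_all_samples(_):
    # Merge the per-sample PLINK filesets into one .bed/.bim/.fam trio.
    # merge_list.txt would list one fileset prefix per line (illustrative name).
    subprocess.run(
        ["plink", "--merge-list", "./shared/merge_list.txt",
         "--make-bed", "--out", "./shared/merged", "--threads", "80"],
        check=True,
    )
    # Convert the merged .bed fileset to PGEN/PVAR/PSAM.
    subprocess.run(
        ["plink2", "--bfile", "./shared/merged",
         "--make-pgen", "--out", "./shared/merged", "--threads", "80"],
        check=True,
    )

# A single input -> a single container (80 CPUs, 320GB of RAM) runs the merge.
remote_parallel_map(merge_all_samples, [None], func_cpu=80, func_ram=320)
```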

After running, the merged PGEN/PVAR/PSAM files are available for download in the Filesystem UI.
