Multi-Stage Genomic Pipeline

Using 1,300 CPUs and <100 lines of code.

In this example we:

  • Download raw Illumina genomic-sequencing data from this NCBI experiment.

  • Call and align each sample with a human reference genome.

  • Combine samples into a single large .BED file, then convert to PGEN/PVAR/PSAM files.

This is a typical workflow to prepare Illumina sequencing data for downstream analysis.

Step 1: Boot some VMs

In the "Settings" tab we select the hardware, container image, and quantity of VMs we want. Then hit ⏻ Start on the homepage!

Here we boot 13 80-CPU VMs, which delete themselves after 15 minutes of inactivity. We also specify a custom Docker image, jakezuliani/idats_to_pgen:latest, which has bcftools, PLINK, and PLINK2 installed. This is the image our code will run inside.

Once our machines have booted, we can call remote_parallel_map!
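
As a rough illustration of the call shape, here is a minimal sketch. The wrapper parallel_map and the stand-in local_parallel_map are hypothetical helpers, not part of Burla. The sketch assumes that remote_parallel_map takes a function and a list of inputs, and that extra keyword arguments such as func_cpu and func_ram are accepted, so check Burla's documentation before relying on it.

```python
from concurrent.futures import ThreadPoolExecutor


def local_parallel_map(function_, inputs, max_workers=8):
    """Hypothetical local stand-in: run function_ over inputs with threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(function_, inputs))


def parallel_map(function_, inputs, use_cluster=False, **burla_kwargs):
    """Run on the Burla cluster when asked, otherwise on this machine."""
    if use_cluster:
        # Imported lazily so the sketch also runs where Burla is not installed.
        # Assumption: keyword arguments (e.g. func_cpu, func_ram) pass through.
        from burla import remote_parallel_map
        return remote_parallel_map(function_, inputs, **burla_kwargs)
    return local_parallel_map(function_, inputs)


def square(x):
    return x * x
```

Calling parallel_map(square, [1, 2, 3]) runs locally, which is a cheap way to check a function on a handful of inputs before sending thousands of them to the cluster with use_cluster=True.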

Step 2: Download prerequisite data

This code downloads the reference genome and the BPM / EGT files, then saves them all to ./shared. This directory is network-linked to a Google Cloud Storage bucket using GCSFuse.
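
A download step like this could look roughly like the sketch below. It is not the original code: the URLs and file names in PREREQUISITE_URLS are hypothetical placeholders (the real locations are not given here), and download_file / download_prerequisites are illustrative helper names.

```python
import shutil
import urllib.request
from pathlib import Path

# Per the text, ./shared is network-linked to a GCS bucket via GCSFuse.
SHARED_DIR = Path("./shared")

# Hypothetical placeholders: substitute the real reference, BPM, and EGT URLs.
PREREQUISITE_URLS = {
    "reference.fa.gz": "https://example.com/reference.fa.gz",
    "manifest.bpm": "https://example.com/manifest.bpm",
    "clusters.egt": "https://example.com/clusters.egt",
}


def download_file(url, dest):
    """Stream url to dest, creating parent folders; skip files already present."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if dest.exists():
        return dest
    # Write to a temporary name first so a failed download leaves no partial file.
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(dest)
    return dest


def download_prerequisites(urls=PREREQUISITE_URLS, shared_dir=SHARED_DIR):
    """Download every prerequisite file into shared_dir and return the paths."""
    return [download_file(url, Path(shared_dir) / name) for name, url in urls.items()]
```

Because ./shared is backed by the bucket, anything written there should be visible to every container in later steps, so this only needs to run once.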

Once downloaded, this data appears in the Filesystem tab in the dashboard (GCS).

Step 3: Download IDAT files for all 360 samples in parallel

This code uses 720 parallel 1-CPU containers to download the red and green IDAT files for each sample (2 files × 360 samples).
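
One way to get one container per file is to build one input per (sample, channel) pair, as in the sketch below. The sample IDs, the base URL, and the URL pattern are hypothetical, and the func_cpu keyword is an assumption about remote_parallel_map's signature. The _Red.idat / _Grn.idat suffixes follow Illumina's usual IDAT naming.

```python
import shutil
import urllib.request
from itertools import product
from pathlib import Path

SHARED_DIR = Path("./shared")
CHANNELS = ("Red", "Grn")  # Illumina IDAT files end in _Red.idat / _Grn.idat


def build_idat_tasks(sample_ids, base_url, shared_dir=SHARED_DIR):
    """Return one download task per (sample, channel) pair."""
    tasks = []
    for sample_id, channel in product(sample_ids, CHANNELS):
        filename = f"{sample_id}_{channel}.idat"
        tasks.append({
            "url": f"{base_url}/{filename}",  # hypothetical URL pattern
            "dest": str(Path(shared_dir) / sample_id / filename),
        })
    return tasks


def download_idat(task):
    """Download a single IDAT file into its sample's folder."""
    dest = Path(task["dest"])
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(task["url"]) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return str(dest)


def download_all_idats(tasks):
    """Fan the tasks out over the cluster, one 1-CPU container per file."""
    from burla import remote_parallel_map  # lazy import: needs a booted cluster
    return remote_parallel_map(download_idat, tasks, func_cpu=1)  # func_cpu assumed
```

With 360 sample IDs, build_idat_tasks yields 720 tasks, matching the 720 containers described above.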

Folders with the Red/Green IDATs for each sample are now visible in the Filesystem tab.

Step 4: Call and align all samples in parallel

This code uses 360 parallel containers, each with 8 CPUs and 32G of RAM. For each pair of IDAT files, this code:

  • Performs base calling and genotype clustering with bcftools +idat2gtc

  • Aligns to a reference genome with bcftools +gtc2vcf

  • Filters to retain only biallelic variants with bcftools view -m2 -M2

  • Converts the VCF into PLINK BED, BIM, and FAM files using plink
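
The four steps above can be sketched as a list of commands per sample. This is a simplified sketch, not the original code: the file names under ./shared are hypothetical, the long flag names for the +idat2gtc and +gtc2vcf plugins are recalled from the gtc2vcf plugin documentation and should be checked against each plugin's --help output, and func_cpu / func_ram are assumed keyword names.

```python
import subprocess
from pathlib import Path

# Hypothetical file names: use whatever Step 2 actually saved to ./shared.
BPM = Path("./shared/manifest.bpm")
EGT = Path("./shared/clusters.egt")
REF = Path("./shared/reference.fa")


def build_commands(sample_dir, bpm=BPM, egt=EGT, ref_fasta=REF):
    """Return the four commands for one sample folder, in execution order."""
    sample_dir = Path(sample_dir)
    gtc_dir = sample_dir / "gtc"
    raw_vcf = sample_dir / "raw.vcf.gz"
    biallelic_vcf = sample_dir / "biallelic.vcf.gz"
    prefix = sample_dir / sample_dir.name  # PLINK writes prefix.bed/.bim/.fam
    return [
        # 1. IDAT -> GTC (plugin flag names are assumptions, see lead-in)
        ["bcftools", "+idat2gtc", "--bpm", str(bpm), "--egt", str(egt),
         "--idats", str(sample_dir), "--output", str(gtc_dir)],
        # 2. GTC -> VCF against the reference genome
        ["bcftools", "+gtc2vcf", "--bpm", str(bpm), "--egt", str(egt),
         "--gtcs", str(gtc_dir), "--fasta-ref", str(ref_fasta),
         "-Oz", "-o", str(raw_vcf)],
        # 3. Keep only biallelic variants
        ["bcftools", "view", "-m2", "-M2", "-Oz", "-o", str(biallelic_vcf), str(raw_vcf)],
        # 4. VCF -> PLINK BED/BIM/FAM
        ["plink", "--vcf", str(biallelic_vcf), "--make-bed", "--out", str(prefix)],
    ]


def process_sample(sample_dir):
    """Run all four commands for one sample, stopping at the first failure."""
    sample_dir = Path(sample_dir)
    (sample_dir / "gtc").mkdir(parents=True, exist_ok=True)
    for command in build_commands(sample_dir):
        subprocess.run(command, check=True)
    return str(sample_dir)


def process_all_samples(sample_dirs):
    """One 8-CPU / 32G container per sample (keyword names are assumptions)."""
    from burla import remote_parallel_map  # lazy import: needs a booted cluster
    return remote_parallel_map(process_sample, sample_dirs, func_cpu=8, func_ram=32)
```

Building the commands as lists keeps them easy to print and inspect before anything runs, and avoids shell quoting issues when subprocess executes them.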

Each sample's folder now contains the output from the above commands.

Step 5: Merge samples into a single set of PGEN/PVAR/PSAM files

This code uses a single container with 80 CPUs and 320G of RAM.
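
A merge like this can be sketched with PLINK's --merge-list followed by PLINK2's --make-pgen. This is an assumption about how the merge might be done rather than the original code: the helper names and output paths are hypothetical, and func_cpu / func_ram are assumed keyword names for remote_parallel_map.

```python
import subprocess
from pathlib import Path

SHARED_DIR = Path("./shared")


def write_merge_list(prefixes, list_path):
    """Write one PLINK fileset prefix per line, as --merge-list expects."""
    list_path = Path(list_path)
    list_path.write_text("".join(f"{prefix}\n" for prefix in prefixes))
    return list_path


def build_merge_commands(merge_list, out_prefix, threads=80):
    """Merge all per-sample BED filesets, then convert the result to PGEN."""
    return [
        ["plink", "--merge-list", str(merge_list), "--make-bed",
         "--threads", str(threads), "--out", str(out_prefix)],
        ["plink2", "--bfile", str(out_prefix), "--make-pgen",
         "--threads", str(threads), "--out", str(out_prefix)],
    ]


def merge_samples(shared_dir):
    """Find every per-sample .bed file, merge them, and write PGEN/PVAR/PSAM."""
    shared_dir = Path(shared_dir)
    # Each sample folder is assumed to hold <sample>/<sample>.bed from Step 4.
    prefixes = sorted(str(bed.with_suffix("")) for bed in shared_dir.glob("*/*.bed"))
    merge_list = write_merge_list(prefixes, shared_dir / "merge_list.txt")
    for command in build_merge_commands(merge_list, shared_dir / "merged"):
        subprocess.run(command, check=True)
    return str(shared_dir / "merged")


def merge_on_cluster():
    """Run the merge in one large container (keyword names are assumptions)."""
    from burla import remote_parallel_map  # lazy import: needs a booted cluster
    return remote_parallel_map(merge_samples, [str(SHARED_DIR)], func_cpu=80, func_ram=320)
```

Passing a single input means a single container does the work, which suits a merge step that needs all samples in one place.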

After running, the PGEN/PVAR/PSAM files are available for download in the Filesystem tab (GCS)!

Want to run this code yourself?

This demo is available as a Google Colab notebook here: https://colab.research.google.com/drive/1lEbeGOoowZ9FKA9yctziWyhH6TvLuxTi?usp=sharing

The notebook contains instructions to get Burla up and running as well as run the demo. Don't hesitate to email me ([email protected]) if you get stuck! Thank you for trying Burla!
