Multi-Stage Genomic Pipeline
Using 1,300 CPUs and <100 lines of code.
In this example we:

- Download raw Illumina genomic-sequencing data from this NCBI experiment.
- Call and align each sample against a human reference genome.
- Combine all samples into a single large .BED file, then convert it to PGEN/PVAR/PSAM files.

This is a typical workflow for preparing Illumina sequencing data for downstream analysis.
Step 1: Boot some VMs
In the "Settings" tab we select the hardware, container image, and quantity of VMs we want, then hit ⏻ Start on the homepage!

Here we boot 13 80-CPU VMs; these VMs delete themselves after 15 minutes of inactivity.
We also specify a custom Docker image: jakezuliani/idats_to_pgen:latest
This image, which our code will run inside, has bcftools, PLINK, and PLINK2 installed.
Once our machines have booted, we can call remote_parallel_map!
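Every step below follows the same pattern: pass a Python function and a list of inputs to remote_parallel_map, and each input is processed in its own container on the VMs we just booted. A minimal sketch (the `double` function is illustrative, not part of this pipeline):

```python
def double(x):
    # Any ordinary Python function works; it runs inside the container image above.
    return x * 2

def run_remotely():
    # Requires the burla client installed locally and a running cluster.
    from burla import remote_parallel_map
    return remote_parallel_map(double, list(range(4)))  # each call runs in parallel
```

The sections below apply this same pattern to the real download, calling, and merging steps.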
Step 2: download prerequisite data
This code downloads the reference genome, BPM/EGT files, and sample info, then saves it all to ./shared, which is synced with the file explorer in the dashboard and with Google Cloud Storage.
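A sketch of this download step. The URLs and file names below are placeholders, not the experiment's real sources; only the ./shared destination comes from this walkthrough:

```python
import os
import urllib.request

SHARED = "./shared"  # synced with the dashboard file explorer and Google Cloud Storage

def dest_path(url, shared=SHARED):
    # Each file is saved under ./shared using its URL basename.
    return os.path.join(shared, os.path.basename(url))

def download(url):
    os.makedirs(SHARED, exist_ok=True)
    urllib.request.urlretrieve(url, dest_path(url))

def download_prerequisites():
    # Placeholder URLs -- substitute the real reference / BPM / EGT / sample-sheet sources.
    for url in [
        "https://example.com/GRCh38.fa",
        "https://example.com/manifest.bpm",
        "https://example.com/cluster.egt",
        "https://example.com/samplesheet.csv",
    ]:
        download(url)
```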
After downloading this data we can see it in the Filesystem tab in the dashboard:
~screenshot~
Step 3: Download IDAT files for all 360 samples in parallel
This code uses 720 parallel 1-CPU containers to download the red/green IDATs for each sample.
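The 720 containers come from one task per (sample, channel) pair: 360 samples × 2 IDAT channels. A sketch of how the input list and the remote call fit together (the download helper's body and the func_cpu parameter name are assumptions to verify against Burla's docs):

```python
def idat_inputs(sample_ids):
    # One task per (sample, channel) pair: 360 samples x 2 channels = 720 downloads.
    return [(s, ch) for s in sample_ids for ch in ("Red", "Grn")]

def download_idat(task):
    sample_id, channel = task
    # Hypothetical body: fetch <sample_id>_<channel>.idat from NCBI into
    # ./shared/<sample_id>/ -- the real code depends on the experiment's file layout.
    ...

def run():
    from burla import remote_parallel_map
    sample_ids = [...]  # the 360 sample IDs from the sample sheet
    remote_parallel_map(download_idat, idat_inputs(sample_ids), func_cpu=1)
```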
Folders with red/green IDATs for each sample are now visible in the Filesystem UI:
~screenshot~
Step 4: Call and align all samples in parallel
This code uses 360 parallel containers, each with 8 CPUs and 32 GB of RAM.
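A hedged sketch of the per-sample call/align step. It assumes the image ships Illumina's iaap-cli and the bcftools gtc2vcf plugin, and that the prerequisite files use the names shown; the exact flags should be checked against each tool's --help:

```python
import subprocess

def build_commands(sample_id, shared="./shared"):
    # File names (manifest.bpm, cluster.egt, GRCh38.fa) are illustrative placeholders.
    idat_dir = f"{shared}/{sample_id}"
    gtc_dir = f"{idat_dir}/gtc"
    return [
        # IDAT -> GTC (genotype calling)
        f"iaap-cli gencall {shared}/manifest.bpm {shared}/cluster.egt {gtc_dir} "
        f"--idat-folder {idat_dir} --output-gtc",
        # GTC -> VCF, aligned against the reference genome
        f"bcftools +gtc2vcf --bpm {shared}/manifest.bpm --egt {shared}/cluster.egt "
        f"--gtcs {gtc_dir} --fasta-ref {shared}/GRCh38.fa -Oz -o {idat_dir}/{sample_id}.vcf.gz",
    ]

def call_and_align(sample_id):
    for cmd in build_commands(sample_id):
        subprocess.run(cmd, shell=True, check=True)

def run():
    from burla import remote_parallel_map
    sample_ids = [...]  # the 360 sample IDs
    # func_cpu / func_ram parameter names are assumptions to verify against Burla's docs.
    remote_parallel_map(call_and_align, sample_ids, func_cpu=8, func_ram=32)
```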
Step 5: Merge samples into a single PGEN/PVAR/PSAM file.
This code uses a single container with 80 CPUs and 320 GB of RAM.
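A sketch of the merge, assuming per-sample VCFs from the previous step under ./shared (file names are illustrative; flags per the bcftools and PLINK docs):

```python
import subprocess

MERGE_PIPELINE = [
    # Combine all per-sample VCFs into one multi-sample VCF
    "bcftools merge ./shared/*/*.vcf.gz -Oz -o ./shared/merged.vcf.gz --threads 80",
    # VCF -> .bed/.bim/.fam
    "plink --vcf ./shared/merged.vcf.gz --make-bed --out ./shared/merged --threads 80",
    # .bed -> .pgen/.pvar/.psam
    "plink2 --bfile ./shared/merged --make-pgen --out ./shared/merged --threads 80",
]

def merge(_):
    for cmd in MERGE_PIPELINE:
        subprocess.run(cmd, shell=True, check=True)

def run():
    from burla import remote_parallel_map
    # A single input -> a single large container runs the whole merge.
    remote_parallel_map(merge, [None], func_cpu=80, func_ram=320)
```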
After it runs, the PGEN/PVAR/PSAM files are available for download in the Filesystem UI.