
Align every FASTQ sample

In this example we:

  • Read a paired-end FASTQ manifest.

  • Run BWA-MEM and samtools in a custom worker image.

  • Produce one BAM per sample, with one sample per Burla worker.

  • Return one report row per sample, so failures and runtime outliers are visible.

One aligned sample proves the command works. It does not prove the cohort ran.

Dataset: paired-end FASTQ manifest

The manifest is a TSV with sample_id, fq1, and fq2. The FASTQ paths can be S3 URLs or any paths your worker image can read.
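A minimal sketch of reading that manifest with the standard library. It assumes the TSV has a header row naming the three columns; the `read_manifest` helper and the example sample ID and bucket are hypothetical, not part of the original page.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample_id", "fq1", "fq2")


def read_manifest(handle):
    """Parse a tab-separated manifest into a list of dicts, one per sample.

    Assumes the first line is a header containing sample_id, fq1, and fq2.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"manifest is missing columns: {missing}")

    samples = []
    seen_ids = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        sample = {col: (row[col] or "").strip() for col in REQUIRED_COLUMNS}
        if not all(sample.values()):
            raise ValueError(f"line {line_no}: empty field in {sample}")
        if sample["sample_id"] in seen_ids:
            raise ValueError(f"line {line_no}: duplicate sample_id {sample['sample_id']}")
        seen_ids.add(sample["sample_id"])
        samples.append(sample)
    return samples


# Illustrative manifest text; the sample ID and bucket name are made up.
EXAMPLE_MANIFEST = (
    "sample_id\tfq1\tfq2\n"
    "S1\ts3://example-bucket/S1_R1.fastq.gz\ts3://example-bucket/S1_R2.fastq.gz\n"
)
samples = read_manifest(io.StringIO(EXAMPLE_MANIFEST))
```

Rejecting empty fields and duplicate IDs up front is a design choice: a bad manifest row is cheaper to catch here than after a worker has started.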

import os
import subprocess
import time
from pathlib import Path

from burla import remote_parallel_map

IMAGE = "us-docker.pkg.dev/test-burla/burla-demos/burla-bio-worker:latest"
REF_FASTA = "/refs/hg38.fa"
S3_OUT = "s3://my-bam-bucket"

Step 1: Use an image with the native tools

Bioinformatics tools need native binaries, so the worker image matters.

The image should contain bwa, samtools, the AWS CLI if you read/write S3, and the reference genome at REF_FASTA together with its BWA index files, since bwa mem cannot align against an unindexed FASTA.
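One way to check the image before any alignment starts is a small preflight that runs inside the worker. This is a sketch: `missing_tools` and `missing_reference_files` are hypothetical helpers. The suffix list is the set of files that `bwa index` writes next to the FASTA.

```python
import os
import shutil

REF_FASTA = "/refs/hg38.fa"

# Executable names expected on PATH; "aws" is the AWS CLI binary.
REQUIRED_TOOLS = ("bwa", "samtools", "aws")

# Files that `bwa index` writes alongside the reference FASTA.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_reference_files(ref_fasta=REF_FASTA):
    """Return the reference and index paths that do not exist."""
    expected = [ref_fasta] + [ref_fasta + suffix for suffix in BWA_INDEX_SUFFIXES]
    return [path for path in expected if not os.path.exists(path)]


def preflight():
    """Collect every problem in one dict so a single run reports them all."""
    return {"tools": missing_tools(), "reference": missing_reference_files()}
```

Running `preflight()` as the first thing a worker does turns a missing binary into one readable message instead of a failed subprocess call per sample.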

Step 2: Align one sample per worker

The worker downloads the FASTQs, runs the command-line tools, indexes the BAM, and writes the output to S3.

The command is exactly the command you would run in a terminal. Burla only changes how many samples can run at once.
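The worker described above can be sketched as follows. This is an illustration, not the page's original code. It assumes both FASTQs are S3 URLs, that `bash` exists in the image, and that 4 threads match the CPUs given to each worker. The `build_commands` and `align_sample` names, the `/tmp` scratch directory, and the report row fields (`status`, `error`, `seconds`) are hypothetical.

```python
import shlex
import subprocess
import time

REF_FASTA = "/refs/hg38.fa"
S3_OUT = "s3://my-bam-bucket"
THREADS = 4  # assumed to match the CPUs requested per worker


def build_commands(sample, workdir="/tmp"):
    """Return the shell commands for one sample, in execution order."""
    q = shlex.quote
    sid = sample["sample_id"]
    fq1 = f"{workdir}/{sid}_R1.fastq.gz"
    fq2 = f"{workdir}/{sid}_R2.fastq.gz"
    bam = f"{workdir}/{sid}.bam"
    return [
        # Download both mates of the pair.
        f"aws s3 cp {q(sample['fq1'])} {q(fq1)}",
        f"aws s3 cp {q(sample['fq2'])} {q(fq2)}",
        # Align and coordinate-sort in one pipeline.
        f"bwa mem -t {THREADS} {q(REF_FASTA)} {q(fq1)} {q(fq2)}"
        f" | samtools sort -@ {THREADS} -o {q(bam)} -",
        # samtools index writes <bam>.bai next to the BAM.
        f"samtools index {q(bam)}",
        # Upload the BAM and its index.
        f"aws s3 cp {q(bam)} {q(S3_OUT + '/' + sid + '.bam')}",
        f"aws s3 cp {q(bam + '.bai')} {q(S3_OUT + '/' + sid + '.bam.bai')}",
    ]


def align_sample(sample):
    """Run every command for one sample and return one report row."""
    start = time.time()
    row = {"sample_id": sample["sample_id"], "status": "ok", "error": ""}
    try:
        for cmd in build_commands(sample):
            # pipefail makes a bwa failure fail the whole pipeline,
            # instead of being masked by a successful samtools sort.
            subprocess.run(
                ["bash", "-o", "pipefail", "-c", cmd],
                check=True, capture_output=True, text=True,
            )
    except subprocess.CalledProcessError as exc:
        row["status"] = "failed"
        row["error"] = (exc.stderr or "")[-500:]  # keep the tail of stderr
    except OSError as exc:  # e.g. bash itself is missing from the image
        row["status"] = "failed"
        row["error"] = str(exc)
    row["seconds"] = round(time.time() - start, 1)
    return row
```

Catching the failure and returning a row, rather than raising, is what lets one corrupt FASTQ show up as a single failed line in the report instead of aborting the whole run.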

Step 3: Smoke test one sample

Run one sample with the real image before launching the cohort.
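A sketch of that smoke test, written so the mapping function is passed in. The `smoke_test` helper and the `status`, `error`, and `sample_id` row fields are hypothetical. With Burla, `mapper` would wrap `remote_parallel_map`, for example through `functools.partial` with the resource settings. Check the keyword names for resources and the image against the Burla version you have installed.

```python
def smoke_test(samples, worker, mapper=map):
    """Run only the first sample through `mapper` and fail loudly on error.

    `mapper(worker, inputs)` must return an iterable of report rows; the
    built-in `map` satisfies this and runs the worker locally.
    """
    first = samples[:1]
    if not first:
        raise ValueError("manifest has no samples")
    row = list(mapper(worker, first))[0]
    if row.get("status") != "ok":
        raise RuntimeError(
            f"smoke test failed for {row.get('sample_id')}: {row.get('error')}"
        )
    return row


# Local stand-in worker, used here only to show the call shape.
def fake_worker(sample):
    return {"sample_id": sample["sample_id"], "status": "ok", "error": "", "seconds": 0.0}


smoke_row = smoke_test([{"sample_id": "S1"}, {"sample_id": "S2"}], fake_worker)
```

Raising on a failed smoke test keeps a broken image or a wrong reference path from being multiplied across the whole cohort.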

Step 4: Run the cohort

Each sample gets 4 CPUs and 16 GB of RAM.
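A sketch of the cohort launch under those settings. The `func_cpu` and `func_ram` keyword names are an assumption about `remote_parallel_map` and should be checked against your installed Burla version. How the worker image is attached is left out because it may be a call argument or a cluster setting. `run_cohort` and the row fields are hypothetical.

```python
import functools

CPUS_PER_SAMPLE = 4
RAM_GB_PER_SAMPLE = 16


def run_cohort(samples, worker, mapper=None):
    """Run `worker` once per sample and check every sample produced a row."""
    if mapper is None:
        # Imported lazily so this module loads on machines without Burla.
        from burla import remote_parallel_map

        # Keyword names are assumptions; verify them for your Burla version.
        mapper = functools.partial(
            remote_parallel_map,
            func_cpu=CPUS_PER_SAMPLE,
            func_ram=RAM_GB_PER_SAMPLE,
        )
    rows = list(mapper(worker, samples))

    # A finished run is not proof the cohort ran: compare IDs explicitly.
    expected = {s["sample_id"] for s in samples}
    returned = {r["sample_id"] for r in rows}
    unreported = sorted(expected - returned)
    if unreported:
        raise RuntimeError(f"no report row for samples: {unreported}")
    return rows


# Local dry run with the built-in map and a stand-in worker.
def fake_worker(sample):
    return {"sample_id": sample["sample_id"], "status": "ok", "seconds": 1.0}


cohort_rows = run_cohort([{"sample_id": "A"}, {"sample_id": "B"}], fake_worker, mapper=map)
```

Passing `mapper=map` gives a local dry run of the same code path, which is a cheap way to test the report handling before paying for workers.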

What's the point?

The command is known. The pain is getting the same command, reference, binaries, and output path onto enough machines at once.

This is why I like one-sample-per-worker. The report gives sample-specific runtime and failures, and the output is already in S3. Once the smoke test works, run the cohort. That is where bad pairs, corrupt FASTQs, and mapping-rate outliers show up.
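To make the report actionable, the rows can be split into failures and slow samples. This is a sketch with a hypothetical `triage` helper and hypothetical row fields (`status`, `seconds`). The rule that anything slower than twice the median is an outlier is an arbitrary starting threshold, not a recommendation from the original page.

```python
import statistics


def triage(rows, slow_factor=2.0):
    """Split report rows into (failures, slow), where slow means a
    successful sample whose runtime exceeds slow_factor * median runtime."""
    failures = [r for r in rows if r["status"] != "ok"]
    succeeded = [r for r in rows if r["status"] == "ok"]
    if not succeeded:
        return failures, []
    median_seconds = statistics.median(r["seconds"] for r in succeeded)
    slow = [r for r in succeeded if r["seconds"] > slow_factor * median_seconds]
    return failures, slow


# Made-up rows for illustration only.
example_rows = [
    {"sample_id": "S1", "status": "ok", "seconds": 100.0},
    {"sample_id": "S2", "status": "ok", "seconds": 110.0},
    {"sample_id": "S3", "status": "ok", "seconds": 400.0},
    {"sample_id": "S4", "status": "failed", "seconds": 5.0},
]
failures, slow = triage(example_rows)
```

The median is taken over successful samples only, so a batch of fast failures does not drag the threshold down and flag healthy samples as slow.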
