
Audit 5,000 Parquet files
Dataset: partitioned event Parquet
s3://my-events-bucket/events/2025/...

import io
import json
from pathlib import Path

import boto3
import pandas as pd
import pyarrow.parquet as pq
from burla import remote_parallel_map

BUCKET = "my-events-bucket"
PREFIX = "events/2025/"
REPORT_PATH = Path("/workspace/shared/parquet-audit/report.csv")

Step 1: List the files
Step 2: Inspect one file
Step 3: Smoke test a few files
Step 4: Build the report
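
The four steps above could be sketched roughly as follows. This is a minimal illustration, not the original tutorial's code: the helper names `list_parquet_keys`, `inspect_file`, `write_report`, and the `REPORT_FIELDS` column list are hypothetical, the report is written with the standard-library `csv` module rather than pandas, and the S3 client is passed in as an argument so the helpers can be tried against a stub. It relies only on boto3's `list_objects_v2` paginator and `get_object`, and on pyarrow's `ParquetFile(...).metadata`.

```python
import csv
import io
from pathlib import Path

# Hypothetical report columns; adjust to whatever the audit should record.
REPORT_FIELDS = ["key", "num_rows", "num_row_groups", "num_columns", "error"]


def list_parquet_keys(s3_client, bucket, prefix):
    """Return (key, size_bytes) for every .parquet object under the prefix."""
    paginator = s3_client.get_paginator("list_objects_v2")
    found = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".parquet"):
                found.append((obj["Key"], obj["Size"]))
    return found


def inspect_file(s3_client, bucket, key):
    """Download one object and return footer-level facts, or the error text."""
    # Imported lazily so the listing and report helpers work without pyarrow.
    import pyarrow.parquet as pq

    row = dict.fromkeys(REPORT_FIELDS, "")
    row["key"] = key
    try:
        # Note: this reads the whole object into memory before parsing.
        body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        meta = pq.ParquetFile(io.BytesIO(body)).metadata
        row["num_rows"] = meta.num_rows
        row["num_row_groups"] = meta.num_row_groups
        row["num_columns"] = meta.num_columns
    except Exception as exc:
        # Record the failure instead of aborting the whole audit.
        row["error"] = str(exc)
    return row


def write_report(rows, report_path):
    """Write one CSV line per audited file and return the report path."""
    report_path = Path(report_path)
    report_path.parent.mkdir(parents=True, exist_ok=True)
    with report_path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=REPORT_FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return report_path
```

A smoke test would call `inspect_file` on a handful of keys from `list_parquet_keys(boto3.client("s3"), BUCKET, PREFIX)` before fanning out. A per-file function of this shape is presumably what gets handed to `remote_parallel_map` for the full run, though the exact call is not shown here.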
What's the point?