SEC EDGAR Financials Warehouse

Production-style lakehouse architecture processing SEC financial data with BigQuery, dbt, and automated data quality validation.

Role: Data Engineer (2024)
  • dbt Tests Passed: 14/14 (100% data quality validation)
  • Great Expectations (GE) Validations: 100% (all data quality checks passing)
  • Query Cost Reduction: 80-90% (via partition/cluster optimization)
  • Automation: Daily at 06:00 UTC (automated refresh schedule)

Technology Stack

GCP, BigQuery, dbt, Great Expectations, GitHub Actions, Looker Studio, Python

Problem

Financial analysis depends on reliable, up-to-date SEC filing data, but the raw filings are expensive to query and difficult to maintain. Analysts need a fast, cost-effective way to access standardized financial metrics across thousands of companies, backed by validated data quality.

Constraints

  • SEC API rate limits requiring careful throttling
  • Raw filing data inconsistencies and format variations
  • BigQuery costs scaling with data volume scanned
  • Need for daily automated updates with zero manual intervention
  • Data quality requirements for financial accuracy
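
The throttling constraint can be illustrated with a minimal sketch of fixed-rate pacing. It is not the project's actual extractor: the class name `Throttle`, its parameters, and the rate of 5 requests per second are assumptions made for illustration, and the real limit should be taken from the SEC's current fair-access guidance.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between calls (simple fixed-rate pacing).

    Hypothetical helper: the name and interface are illustrative only.
    """

    def __init__(
        self,
        max_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        self._min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        # Earliest time at which the next call may start; None before first use.
        self._next_allowed: Optional[float] = None

    def wait(self) -> float:
        """Block until the next call is allowed and return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self._min_interval
        return delay


# Assumed rate; call throttle.wait() immediately before each API request.
throttle = Throttle(max_per_second=5)
```

The clock and sleep functions are injectable so the pacing logic can be tested without real delays.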

Architecture

Built a production-style lakehouse following the medallion architecture pattern:

SEC API → GCS (Raw NDJSON) → BigQuery Raw → dbt Transform → BigQuery Curated → Looker Studio

Data Flow

  1. Extract: Rate-limited SEC API calls → GCS storage (NDJSON format)
  2. Load: Batch loads to BigQuery raw tables with schema validation
  3. Transform: dbt models creating staging → intermediate → mart layers
  4. Serve: Curated tables optimized for BI consumption
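
Steps 1 and 2 above can be sketched as follows: records are checked against an expected shape and then serialized as NDJSON (one JSON object per line), which is the layout stored in GCS. The field names in `REQUIRED_FIELDS` and the sample record are hypothetical, since the source does not list the raw schema.

```python
import json
from typing import Any, Dict, Iterable, List

# Hypothetical raw schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {
    "cik": str,
    "concept": str,
    "period_end": str,
    "value": (int, float),
}


def validate(record: Dict[str, Any]) -> List[str]:
    """Return a list of schema problems; an empty list means the record is valid."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(
                f"wrong type for {field}: {type(record[field]).__name__}"
            )
    return problems


def to_ndjson(records: Iterable[Dict[str, Any]]) -> str:
    """Serialize records as newline-delimited JSON, one object per line."""
    return "".join(
        json.dumps(record, separators=(",", ":")) + "\n" for record in records
    )


# Placeholder record, not real filing data.
sample = {
    "cik": "0000000000",
    "concept": "Revenues",
    "period_end": "2023-12-31",
    "value": 1.0,
}
if not validate(sample):
    payload = to_ndjson([sample])
```

Validating before upload keeps malformed records out of the raw layer, so a batch load is less likely to fail partway through.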

Results & Impact

Performance Optimization

  • Query Cost Reduction: 80-90% reduction in scanned bytes through partition pruning and clustering
  • Query Speed: TTM trend queries execute in seconds vs. minutes
  • Looker Studio: Dashboard stays responsive with optimized data model
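
To show how partition pruning can produce savings of this size, here is a rough back-of-the-envelope sketch. It assumes equally sized partitions and ignores any further savings from clustering. The function name and the partition counts are illustrative, not measurements from this project.

```python
def scan_reduction(total_partitions: int, partitions_read: int) -> float:
    """Fraction of bytes avoided by pruning, assuming equally sized partitions."""
    if total_partitions <= 0:
        raise ValueError("total_partitions must be positive")
    if not 0 <= partitions_read <= total_partitions:
        raise ValueError("partitions_read must be between 0 and total_partitions")
    return 1.0 - partitions_read / total_partitions


# Illustrative numbers only: a table holding 40 quarterly partitions,
# queried with a filter that touches the 4 most recent quarters.
saved = scan_reduction(total_partitions=40, partitions_read=4)
print(f"Estimated reduction in scanned bytes: {saved:.0%}")
```

Because BigQuery on-demand pricing is based on bytes scanned, a query that filters on the partition column pays only for the partitions it reads.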

Data Quality

  • 100% Validation Success: All dbt tests and Great Expectations checks passing
  • Zero Data Issues: Automated quality gates prevent bad data propagation
  • Audit Trail: Complete lineage and validation history in BigQuery
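
The quality-gate idea can be sketched in plain Python: each check returns a result, and promotion stops if any check fails. This is a simplified stand-in for the dbt tests and Great Expectations suites and does not use either library's API. All names here (`CheckResult`, `check_not_null`, `check_unique`, `enforce`) and the column names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Sequence


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    failing_rows: int


class QualityGateError(Exception):
    """Raised when one or more checks fail, blocking downstream promotion."""


def check_not_null(rows: Iterable[Dict[str, Any]], column: str) -> CheckResult:
    """Fail if any row has a missing or None value in the given column."""
    failing = sum(1 for row in rows if row.get(column) is None)
    return CheckResult(f"not_null({column})", failing == 0, failing)


def check_unique(
    rows: Iterable[Dict[str, Any]], columns: Sequence[str]
) -> CheckResult:
    """Fail if any combination of the given columns appears more than once."""
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row.get(col) for col in columns)
        if key in seen:
            duplicates += 1
        seen.add(key)
    return CheckResult(f"unique({', '.join(columns)})", duplicates == 0, duplicates)


def enforce(results: Iterable[CheckResult]) -> None:
    """Raise QualityGateError listing every failed check; return None otherwise."""
    failed: List[CheckResult] = [r for r in results if not r.passed]
    if failed:
        summary = "; ".join(f"{r.name}: {r.failing_rows} row(s)" for r in failed)
        raise QualityGateError(f"quality gate failed: {summary}")
```

In a scheduled pipeline, a raised `QualityGateError` would fail the run before curated tables are refreshed, which is how a gate keeps bad data from reaching dashboards.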