NHL Analytics Data Warehouse

Python Scraping Pipeline + PostgreSQL Dimensional Model

Details

Role: Data Analyst / Analytics Engineer / Data Engineer

Domain: Sports Analytics (NHL)

Stack: Python · PostgreSQL · DBeaver · SQL · NHL API

🔗 [GitHub repo / Scripts]

🔗 [ERD Diagram]

🔗 [SQL DDL Scripts]

🔗 [Python Scraper Code]

Executive Summary

This project implements an end-to-end NHL analytics data warehouse: Python scrapers ingest official NHL API data, which is then modeled into a PostgreSQL dimensional schema.

The solution combines API ingestion, data normalization, surrogate-key modeling, and multi-grain fact tables to enable advanced hockey analytics at the player, team, game, and event levels.

The warehouse is designed to support both BI reporting and advanced analytical modeling.

Business Goal

Public NHL data is available, but:

  • It is fragmented across multiple API endpoints
  • Player identities are inconsistent between sources
  • Analytical queries become complex without proper modeling

This project solves those problems by:

  • Normalizing NHL API data into a structured warehouse
  • Preserving raw detail while enabling analytical performance
  • Enabling scalable analytics without re-scraping data

Architecture Overview

Data Pipeline

NHL API
  ↓
Python Scrapers (nhlpy + psycopg2)
  ↓
PostgreSQL Data Warehouse (nhl_dw schema)
  ↓
SQL Analytics / BI / Python Analysis

Warehouse Design

The warehouse follows dimensional modeling best practices.

Dimensions

| Table      | Purpose             |
| ---------- | ------------------- |
| dim_date   | Calendar attributes |
| dim_player | Player master data  |
| dim_team   | NHL teams           |
| dim_season | NHL seasons         |
| dim_venue  | Game venues         |

Fact Tables

| Table            | Grain                       |
| ---------------- | --------------------------- |
| fact_game        | One row per NHL game        |
| fact_team_game   | One row per team per game   |
| fact_skater_game | One row per skater per game |
| fact_goalie_game | One row per goalie per game |
| event_play       | One row per play event      |

Modeling Principles

  • Surrogate keys for all dimensions
  • Natural keys enforced as UNIQUE
  • Referential integrity with foreign keys
  • Correct analytical grain per fact table
  • Event-level JSON retained for extensibility
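
A minimal DDL sketch of these principles for two of the tables above. Column names beyond the table list (full_name, position_code, the stat columns) are illustrative assumptions, not the project's real schema:

```python
import psycopg2

# Illustrative DDL showing surrogate keys, a UNIQUE natural key,
# a foreign key, and the one-row-per-skater-per-game grain.
DDL = """
CREATE SCHEMA IF NOT EXISTS nhl_dw;

CREATE TABLE IF NOT EXISTS nhl_dw.dim_player (
    player_sk     BIGSERIAL PRIMARY KEY,     -- surrogate key
    nhl_player_id INTEGER NOT NULL UNIQUE,   -- natural key from the NHL API
    full_name     TEXT,                      -- assumed attribute
    position_code TEXT                       -- assumed attribute
);

CREATE TABLE IF NOT EXISTS nhl_dw.fact_skater_game (
    player_sk BIGINT NOT NULL REFERENCES nhl_dw.dim_player (player_sk),
    game_id   BIGINT NOT NULL,               -- FK to fact_game in the full model
    goals     SMALLINT,
    assists   SMALLINT,
    shots     SMALLINT,
    PRIMARY KEY (player_sk, game_id)         -- enforces the skater-game grain
);
"""

with psycopg2.connect("dbname=nhl") as conn, conn.cursor() as cur:
    cur.execute(DDL)
```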

Python Scraping Layer

The ingestion layer is built in Python using nhlpy and psycopg2.

Key design choices

  • Combines roster and stats endpoints to guarantee player completeness
  • Normalizes inconsistent API payloads
  • Uses UPSERT logic to prevent duplication
  • Preserves existing data while enriching missing attributes
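
A minimal sketch of that UPSERT, using PostgreSQL's INSERT ... ON CONFLICT; COALESCE keeps attributes already in the warehouse and only fills gaps. The payload key names ("playerId", "fullName", "positionCode") are assumptions about the API response shape:

```python
# Ingestion upsert: new players are inserted; existing rows are enriched
# where attributes are still NULL, never overwritten.
UPSERT = """
INSERT INTO nhl_dw.dim_player AS d (nhl_player_id, full_name, position_code)
VALUES (%(id)s, %(name)s, %(position)s)
ON CONFLICT (nhl_player_id) DO UPDATE
SET full_name     = COALESCE(d.full_name, EXCLUDED.full_name),
    position_code = COALESCE(d.position_code, EXCLUDED.position_code);
"""

def upsert_player(cur, player: dict) -> None:
    """Idempotent load of one player record into dim_player."""
    cur.execute(UPSERT, {
        "id": player["playerId"],            # assumed payload key
        "name": player.get("fullName"),      # assumed payload key
        "position": player.get("positionCode"),
    })
```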

Example logic

  • Roster data provides best biographical attributes
  • Stats data ensures coverage for all players with recorded games
  • Both feed into a single normalized player dimension

This ensures no players are lost due to API fragmentation.
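
A sketch of that merge, written independently of the exact nhlpy endpoints (the "playerId" key is an assumption about the payload shape): stats records guarantee coverage, roster records then overwrite with the richer biographical attributes, and every merged record flows through the upsert shown above.

```python
def merge_player_sources(roster_players, stats_players):
    """Return one record per NHL player id, preferring roster biography."""
    # Coverage pass: every player with recorded games appears.
    players = {p["playerId"]: dict(p) for p in stats_players}
    # Enrichment pass: roster attributes win where both sources overlap.
    for p in roster_players:
        players.setdefault(p["playerId"], {}).update(p)
    return list(players.values())
```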

Data Quality Strategy

  • Surrogate keys isolate warehouse logic from API volatility
  • Natural keys enforce uniqueness
  • Event data stored in JSON for future enrichment
  • Deduplication handled during ingestion
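
For example, the retained event JSON can be queried later for shot locations without touching the API again. The raw_json column name and the JSON paths are assumptions about how event_play stores the payload:

```python
import psycopg2

# Pull shot coordinates straight out of the retained event JSON.
SHOT_XY = """
SELECT (raw_json -> 'details' ->> 'xCoord')::INT AS x,
       (raw_json -> 'details' ->> 'yCoord')::INT AS y
FROM nhl_dw.event_play
WHERE raw_json -> 'details' ? 'xCoord';
"""

with psycopg2.connect("dbname=nhl") as conn, conn.cursor() as cur:
    cur.execute(SHOT_XY)
    shots = cur.fetchall()  # [(x, y), ...] ready for a heatmap
```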

Analytical Use Cases

This warehouse enables:

  • Player performance analysis
  • Goalie vs skater efficiency comparison
  • Team trends by season and opponent
  • Event heatmaps and shot locations
  • Advanced derived metrics (xG, clutch, possession, etc.)
  • BI dashboards and Python analytics

Example Queries

Typical analytical questions supported:

  • Top goal scorers by season and team
  • Goalie save percentage trends
  • Player shot efficiency
  • Team power play efficiency
  • Event distribution by zone
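
A representative query for the first question. It assumes the fact table also carries team and season surrogate keys, and that the dimensions expose season_id, team_abbrev, and full_name; these names are illustrative, not confirmed against the real DDL:

```python
import psycopg2

# Top goal scorers by season and team (join keys and attribute names assumed).
TOP_SCORERS = """
SELECT s.season_id, t.team_abbrev, p.full_name, SUM(f.goals) AS goals
FROM nhl_dw.fact_skater_game f
JOIN nhl_dw.dim_player p ON p.player_sk = f.player_sk
JOIN nhl_dw.dim_team   t ON t.team_sk   = f.team_sk
JOIN nhl_dw.dim_season s ON s.season_sk = f.season_sk
GROUP BY s.season_id, t.team_abbrev, p.full_name
ORDER BY goals DESC
LIMIT 10;
"""

with psycopg2.connect("dbname=nhl") as conn, conn.cursor() as cur:
    cur.execute(TOP_SCORERS)
    for season, team, name, goals in cur.fetchall():
        print(season, team, name, goals)
```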

Design Strengths

  1. Layer separation prevents “business logic drift”
    Raw ingestion stays stable while transformations evolve safely in the modeled warehouse.
  2. The star schema reduces downstream complexity
    BI tools and analysts share one consistent definition of metrics and dimensions.
  3. Standardization during ingestion is the scaling lever
    Conformed keys and canonical attributes enable new marts with minimal rework.

Planned Extensions

  • Automated data quality checks (freshness, row counts, uniqueness, referential integrity)
  • Improved incremental loading patterns and change tracking
  • Orchestration/scheduling and version-controlled SQL migrations
  • Optional semantic layer standardization for BI metric governance
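
A sketch of how the first item could start: plain SQL assertions run from Python, where an empty result set means the check passes (table and column names as assumed throughout this writeup):

```python
import psycopg2

# Minimal data quality assertions: each query returns violating rows only.
CHECKS = {
    "dim_player natural key is unique": """
        SELECT nhl_player_id FROM nhl_dw.dim_player
        GROUP BY nhl_player_id HAVING COUNT(*) > 1
    """,
    "fact_skater_game has no orphan players": """
        SELECT f.player_sk FROM nhl_dw.fact_skater_game f
        LEFT JOIN nhl_dw.dim_player p ON p.player_sk = f.player_sk
        WHERE p.player_sk IS NULL
    """,
}

with psycopg2.connect("dbname=nhl") as conn, conn.cursor() as cur:
    for name, sql in CHECKS.items():
        cur.execute(sql)
        violations = cur.fetchall()
        print(name, "OK" if not violations else f"{len(violations)} violations")
```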