NHL Analytics Data Warehouse

Python Scraping Pipeline + PostgreSQL Dimensional Model

Details

Role: Data Analyst / Analytics Engineer / Data Engineer

Domain: Sports Analytics (NHL)

Stack: Python · PostgreSQL · DBeaver · SQL · NHL API

🔗 [GitHub repo / Scripts]

🔗 [ERD Diagram]

🔗 [SQL DDL Scripts]

🔗 [Python Scraper Code]

Executive Summary

This project implements an end-to-end NHL analytics data warehouse: Python scrapers ingest official NHL API data, which is then modeled into a PostgreSQL dimensional schema.

The solution combines API ingestion, data normalization, surrogate-key modeling, and multi-grain fact tables to enable advanced hockey analytics at the player, team, game, and event levels.

The warehouse is designed to support both BI reporting and advanced analytical modeling.

Business Goal

Public NHL data is available, but:

  • It is fragmented across multiple API endpoints
  • Player identities are inconsistent between sources
  • Analytical queries become complex without proper modeling

This project solves those problems by:

  • Normalizing NHL API data into a structured warehouse
  • Preserving raw detail while enabling analytical performance
  • Enabling scalable analytics without re-scraping data

Architecture Overview

Data Pipeline

NHL API
  ↓
Python Scrapers (nhlpy + psycopg2)
  ↓
PostgreSQL Data Warehouse (nhl_dw schema)
  ↓
SQL Analytics / BI / Python Analysis

Warehouse Design

The warehouse follows dimensional modeling best practices.

Dimensions

| Table      | Purpose             |
| ---------- | ------------------- |
| dim_date   | Calendar attributes |
| dim_player | Player master data  |
| dim_team   | NHL teams           |
| dim_season | NHL seasons         |
| dim_venue  | Game venues         |

Fact Tables

| Table            | Grain                       |
| ---------------- | --------------------------- |
| fact_game        | One row per NHL game        |
| fact_team_game   | One row per team per game   |
| fact_skater_game | One row per skater per game |
| fact_goalie_game | One row per goalie per game |
| event_play       | One row per play event      |

Modeling Principles

  • Surrogate keys for all dimensions
  • Natural keys enforced as UNIQUE
  • Referential integrity with foreign keys
  • Correct analytical grain per fact table
  • Event-level JSON retained for extensibility
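
A minimal DDL sketch of these principles for two of the tables above. Column names beyond the table list (full_name, position_code, the stat columns) are illustrative assumptions, not the project's real schema:

```python
import psycopg2

# Illustrative DDL showing surrogate keys, a UNIQUE natural key,
# a foreign key, and the one-row-per-skater-per-game grain.
DDL = """
CREATE SCHEMA IF NOT EXISTS nhl_dw;

CREATE TABLE IF NOT EXISTS nhl_dw.dim_player (
    player_sk     BIGSERIAL PRIMARY KEY,     -- surrogate key
    nhl_player_id INTEGER NOT NULL UNIQUE,   -- natural key from the NHL API
    full_name     TEXT,                      -- assumed attribute
    position_code TEXT                       -- assumed attribute
);

CREATE TABLE IF NOT EXISTS nhl_dw.fact_skater_game (
    player_sk BIGINT NOT NULL REFERENCES nhl_dw.dim_player (player_sk),
    game_id   BIGINT NOT NULL,               -- FK to fact_game in the full model
    goals     SMALLINT,
    assists   SMALLINT,
    shots     SMALLINT,
    PRIMARY KEY (player_sk, game_id)         -- enforces the skater-game grain
);
"""

with psycopg2.connect("dbname=nhl") as conn, conn.cursor() as cur:
    cur.execute(DDL)
```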

Python Scraping Layer

The ingestion layer is built in Python using nhlpy and psycopg2.

Key design choices

  • Combines roster and stats endpoints to guarantee player completeness
  • Normalizes inconsistent API payloads
  • Uses UPSERT logic to prevent duplication
  • Preserves existing data while enriching missing attributes
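
A minimal sketch of that UPSERT, using PostgreSQL's INSERT ... ON CONFLICT; COALESCE keeps attributes already in the warehouse and only fills gaps. The payload key names ("playerId", "fullName", "positionCode") are assumptions about the API response shape:

```python
# Ingestion upsert: new players are inserted; existing rows are enriched
# where attributes are still NULL, never overwritten.
UPSERT = """
INSERT INTO nhl_dw.dim_player AS d (nhl_player_id, full_name, position_code)
VALUES (%(id)s, %(name)s, %(position)s)
ON CONFLICT (nhl_player_id) DO UPDATE
SET full_name     = COALESCE(d.full_name, EXCLUDED.full_name),
    position_code = COALESCE(d.position_code, EXCLUDED.position_code);
"""

def upsert_player(cur, player: dict) -> None:
    """Idempotent load of one player record into dim_player."""
    cur.execute(UPSERT, {
        "id": player["playerId"],            # assumed payload key
        "name": player.get("fullName"),      # assumed payload key
        "position": player.get("positionCode"),
    })
```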

Example logic

  • Roster data provides best biographical attributes
  • Stats data ensures coverage for all players with recorded games
  • Both feed into a single normalized player dimension

This ensures no players are lost due to API fragmentation.
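
A sketch of that merge, written independently of the exact nhlpy endpoints (the "playerId" key is an assumption about the payload shape): stats records guarantee coverage, roster records then overwrite with the richer biographical attributes, and every merged record flows through the upsert shown above.

```python
def merge_player_sources(roster_players, stats_players):
    """Return one record per NHL player id, preferring roster biography."""
    # Coverage pass: every player with recorded games appears.
    players = {p["playerId"]: dict(p) for p in stats_players}
    # Enrichment pass: roster attributes win where both sources overlap.
    for p in roster_players:
        players.setdefault(p["playerId"], {}).update(p)
    return list(players.values())
```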

Data Quality Strategy

  • Surrogate keys isolate warehouse logic from API volatility
  • Natural keys enforce uniqueness
  • Event data stored in JSON for future enrichment
  • Deduplication handled during ingestion
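
For example, the retained event JSON can be queried later for shot locations without touching the API again. The raw_json column name and the JSON paths are assumptions about how event_play stores the payload:

```python
import psycopg2

# Pull shot coordinates straight out of the retained event JSON.
SHOT_XY = """
SELECT (raw_json -> 'details' ->> 'xCoord')::INT AS x,
       (raw_json -> 'details' ->> 'yCoord')::INT AS y
FROM nhl_dw.event_play
WHERE raw_json -> 'details' ? 'xCoord';
"""

with psycopg2.connect("dbname=nhl") as conn, conn.cursor() as cur:
    cur.execute(SHOT_XY)
    shots = cur.fetchall()  # [(x, y), ...] ready for a heatmap
```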

Analytical Use Cases

This warehouse enables:

  • Player performance analysis
  • Goalie vs skater efficiency comparison
  • Team trends by season and opponent
  • Event heatmaps and shot locations
  • Advanced derived metrics (xG, clutch, possession, etc.)
  • BI dashboards and Python analytics

Example Queries

Typical analytical questions supported:

  • Top goal scorers by season and team
  • Goalie save percentage trends
  • Player shot efficiency
  • Team power play efficiency
  • Event distribution by zone
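
A representative query for the first question. It assumes the fact table also carries team and season surrogate keys, and that the dimensions expose season_id, team_abbrev, and full_name; these names are illustrative, not confirmed against the real DDL:

```python
import psycopg2

# Top goal scorers by season and team (join keys and attribute names assumed).
TOP_SCORERS = """
SELECT s.season_id, t.team_abbrev, p.full_name, SUM(f.goals) AS goals
FROM nhl_dw.fact_skater_game f
JOIN nhl_dw.dim_player p ON p.player_sk = f.player_sk
JOIN nhl_dw.dim_team   t ON t.team_sk   = f.team_sk
JOIN nhl_dw.dim_season s ON s.season_sk = f.season_sk
GROUP BY s.season_id, t.team_abbrev, p.full_name
ORDER BY goals DESC
LIMIT 10;
"""

with psycopg2.connect("dbname=nhl") as conn, conn.cursor() as cur:
    cur.execute(TOP_SCORERS)
    for season, team, name, goals in cur.fetchall():
        print(season, team, name, goals)
```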

Design Strengths

  1. Layer separation prevents “business logic drift”
    Raw ingestion stays stable while transformations evolve safely in the modeled warehouse.
  2. The star schema reduces downstream complexity
    BI tools and analysts share one consistent definition of metrics and dimensions.
  3. Standardization during ingestion is the scaling lever
    Conformed keys and canonical attributes enable new marts with minimal rework.

Planned Extensions

  • Automated data quality checks (freshness, row counts, uniqueness, referential integrity)
  • Improved incremental loading patterns and change tracking
  • Orchestration/scheduling and version-controlled SQL migrations
  • Optional semantic layer standardization for BI metric governance
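
A sketch of how the first item could start: plain SQL assertions run from Python, where an empty result set means the check passes (table and column names as assumed throughout this writeup):

```python
import psycopg2

# Minimal data quality assertions: each query returns violating rows only.
CHECKS = {
    "dim_player natural key is unique": """
        SELECT nhl_player_id FROM nhl_dw.dim_player
        GROUP BY nhl_player_id HAVING COUNT(*) > 1
    """,
    "fact_skater_game has no orphan players": """
        SELECT f.player_sk FROM nhl_dw.fact_skater_game f
        LEFT JOIN nhl_dw.dim_player p ON p.player_sk = f.player_sk
        WHERE p.player_sk IS NULL
    """,
}

with psycopg2.connect("dbname=nhl") as conn, conn.cursor() as cur:
    for name, sql in CHECKS.items():
        cur.execute(sql)
        violations = cur.fetchall()
        print(name, "OK" if not violations else f"{len(violations)} violations")
```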