Big Picture
The Powerball lottery has changed its format multiple times. In 2015, it switched from 59 white balls to 69 white balls and reduced the Powerball range from 35 to 26. That change created a structural break in the data.
If you mix draws from different formats, your analysis will find patterns that are not real. A model trained on mixed data will learn the format change, not the lottery mechanics. This tutorial teaches you to detect structural breaks and filter to a single format.
1. The Problem with Historical Lottery Data
Powerball has existed since 1992. Over those decades, the game rules have changed. The number of white balls changed. The range of the Powerball changed. Each change created a new format.
If you download all historical draws and analyze them as one dataset, you are mixing formats. That mixing creates artificial patterns. For example, if you train a model on data from 2010 to 2025, the model will learn that after October 2015, certain numbers (60-69) started appearing. That is not a pattern in the lottery. That is a format change.
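The effect described above can be reproduced with simulated data. The following is a minimal sketch, not part of the tutorial script: the helper names and the draw counts are hypothetical, and the draws are random, not real results. It shows that balls 60-69 never occur in a 59-ball pool and then suddenly appear once the pool grows to 69.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def simulate_draws(n_draws, pool_size):
    """Simulate n_draws of 5 distinct white balls from 1..pool_size."""
    return [random.sample(range(1, pool_size + 1), 5) for _ in range(n_draws)]

def share_of_high_balls(draws):
    """Fraction of all drawn white balls that fall in 60-69."""
    balls = [ball for draw in draws for ball in draw]
    return sum(ball >= 60 for ball in balls) / len(balls)

# Hypothetical draw counts, chosen only for illustration
old_format = simulate_draws(500, 59)   # white balls 1-59
new_format = simulate_draws(500, 69)   # white balls 1-69

print(f"Old format, share of balls in 60-69: {share_of_high_balls(old_format):.3f}")
print(f"New format, share of balls in 60-69: {share_of_high_balls(new_format):.3f}")
```

The jump from exactly zero to roughly 10/69 is produced entirely by the rule change. A model fitted to both sets at once would treat it as a signal.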
2. The Current Powerball Format (October 2015 to Present)
The current Powerball format has been stable since October 7, 2015. It works like this:
- 5 white balls are drawn from the range 1-69, with no repeats
- 1 Powerball is drawn from the range 1-26
Any draw that does not match these rules belongs to a different format and should be excluded from our analysis. This is not "throwing away data." This is ensuring the data we keep is comparable and meaningful.
3. The Solution: Detect and Filter
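The solution has two steps:
Step 1: Detect. For each draw, check if the white balls are in the range 1-69 and if the Powerball is in the range 1-26. If no, the draw belongs to an older format. If yes, it is consistent with the current format, but that alone is not proof: an older draw whose numbers happen to fall inside both ranges also passes. The start date from section 2 (October 7, 2015) is the definitive boundary, so the draw date should also be checked against it.
Step 2: Filter. Keep only the draws that match the current format. Discard everything else. Save the clean data to a new file.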
This process is mechanical. There is no judgment or guesswork. You are simply checking if each draw fits the rules of the current game.
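One caveat is worth seeing in code: a range check by itself cannot separate formats when an older draw happens to use only small numbers. The sketch below uses a hypothetical four-row table (made-up values, not real results) with the same column names as the tutorial script, and compares the range check with a date cutoff at October 7, 2015.

```python
import pandas as pd

# Hypothetical rows for illustration only; not real draw results
toy = pd.DataFrame({
    'draw_date': ['2012-03-10', '2014-06-21', '2016-02-17', '2020-08-05'],
    'ball1': [4, 12, 9, 7],
    'ball2': [17, 29, 21, 14],
    'ball3': [23, 33, 36, 18],
    'ball4': [41, 58, 44, 57],
    'ball5': [52, 59, 68, 65],
    'powerball': [11, 31, 3, 24],
})
toy['draw_date'] = pd.to_datetime(toy['draw_date'])
white_cols = ['ball1', 'ball2', 'ball3', 'ball4', 'ball5']

# Range check: every white ball in 1-69 and the Powerball in 1-26
in_range = (
    toy[white_cols].ge(1).all(axis=1)
    & toy[white_cols].le(69).all(axis=1)
    & toy['powerball'].between(1, 26)
)

# Date check: drawn on or after the first draw of the current format
after_cutoff = toy['draw_date'] >= pd.Timestamp('2015-10-07')

print(toy.assign(in_range=in_range, after_cutoff=after_cutoff))
```

The 2012 row passes the range check even though it predates the current format, while the 2014 row is caught only because its Powerball happens to be above 26. Requiring both conditions removes both rows.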
4. Code Roadmap: What the Script Does (and Why This Order)
The script performs four main steps:
Step 1: Load. Read the CSV file containing all historical Powerball draws.
Step 2: Define the check. Write a function that takes one draw and returns True if it matches the current format, False otherwise.
Step 3: Apply the check. Use pandas to run the validation function on every row of the dataframe. This creates a new column that marks each draw as valid or invalid.
Step 4: Filter and save. Keep only the valid draws. Convert the date column to datetime format. Save the cleaned data to a new file.
5. Python Implementation
Here is the complete script. It loads raw Powerball data, detects the current format, and saves a cleaned dataset.
"""Tutorial 1: Detect structural breaks and filter to the current Powerball format"""
from pathlib import Path

import pandas as pd

# First draw under the current format (see section 2)
FORMAT_START = pd.Timestamp('2015-10-07')
OUTPUT_PATH = Path('data/processed/powerball_clean.parquet')

# --- Load raw data ---
print("Loading raw Powerball data...")
df = pd.read_csv('data/raw/powerball_raw.csv')
print(f"Total draws in raw file: {len(df)}")

# --- Helper function: Validate one draw ---
def validate_one_draw(row):
    """
    Check if a single draw matches the current Powerball format.
    Returns True if valid, False otherwise.
    Current format: 5 white balls (1-69) + 1 Powerball (1-26)
    """
    # Extract the 5 white balls
    white_balls = [row['ball1'], row['ball2'], row['ball3'], row['ball4'], row['ball5']]
    powerball = row['powerball']
    # Check 1: All white balls must be between 1 and 69
    if not all(1 <= ball <= 69 for ball in white_balls):
        return False
    # Check 2: Powerball must be between 1 and 26
    if not (1 <= powerball <= 26):
        return False
    # Check 3: All white balls must be unique (no duplicates)
    # A set is like a list that automatically removes duplicates
    if len(set(white_balls)) != 5:
        return False
    return True

# --- Filter draws ---
print("\nFiltering draws to current format (5 white balls 1-69, Powerball 1-26)...")
# Apply validation to every row
df['is_valid'] = df.apply(validate_one_draw, axis=1)
# Keep only the valid draws
valid_draws = df[df['is_valid']].copy()
# Convert date column to proper datetime format
valid_draws['draw_date'] = pd.to_datetime(valid_draws['draw_date'])
# Older-format draws whose numbers happen to fit 1-69 / 1-26 pass the range
# check, so also require the draw date to fall within the current format
valid_draws = valid_draws[valid_draws['draw_date'] >= FORMAT_START]
# Sort by date (oldest first)
valid_draws = valid_draws.sort_values('draw_date')
# Drop the temporary validation column
valid_draws = valid_draws.drop(columns=['is_valid'])

# --- Summary statistics ---
print(f"\nValid draws (current format): {len(valid_draws)}")
print(f"Earliest valid draw: {valid_draws['draw_date'].min()}")
print(f"Latest valid draw: {valid_draws['draw_date'].max()}")

# --- Save cleaned data ---
# Create the output folder if it does not exist yet
OUTPUT_PATH.parent.mkdir(parents=True, exist_ok=True)
valid_draws.to_parquet(OUTPUT_PATH, index=False)
print(f"\nCleaned data saved to: {OUTPUT_PATH}")
print("Ready for Tutorial 2 (Feature Engineering)")
6. How to Run the Script
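- Python 3.8 or higher
- pandas library installed (pip install pandas)
- A Parquet engine such as pyarrow installed (pip install pyarrow); pandas needs one to write .parquet files
- Raw Powerball data in data/raw/powerball_raw.csv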
# Windows (PowerShell)
cd C:\path\to\tutorials
python tutorial1_structural_breaks.py
# Mac / Linux (Terminal)
cd /path/to/tutorials
python3 tutorial1_structural_breaks.py
7. Understanding the Code: What Each Part Does
The function validate_one_draw() checks three conditions:
- All white balls are between 1 and 69
- The Powerball is between 1 and 26
- All white balls are unique (no duplicates)
The third check uses a set. A set is like a list that automatically removes duplicates. If you put 5 numbers into a set and it only has 4 items, that means there was a duplicate. The function returns False in that case.
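The set trick can be tried on its own. This is a small sketch with a hypothetical helper name, has_duplicates, and made-up numbers. It compares the size of the set with the size of the original list.

```python
def has_duplicates(balls):
    """Return True if any number appears more than once in balls."""
    return len(set(balls)) != len(balls)

clean_draw = [5, 12, 33, 41, 60]
bad_draw = [5, 12, 12, 41, 60]   # 12 appears twice

print(len(set(clean_draw)), has_duplicates(clean_draw))
print(len(set(bad_draw)), has_duplicates(bad_draw))
```

The first line prints 5 and False, the second prints 4 and True, because the repeated 12 is stored only once in the set.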
The line df.apply(validate_one_draw, axis=1) runs the validation function on every row of the dataframe. The axis=1 parameter means "apply this function to each row, not each column."
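To see what axis=1 changes, the sketch below applies the same function in both directions to a tiny hypothetical table (made-up values). With axis=1 the function receives one row at a time, so the result has one entry per draw. With axis=0 it receives one column at a time.

```python
import pandas as pd

# Hypothetical two-draw table for illustration only
toy = pd.DataFrame({
    'ball1': [3, 10],
    'ball2': [20, 44],
    'powerball': [7, 19],
})

# axis=1: the function is called once per row (once per draw)
per_row = toy.apply(lambda row: row.max(), axis=1)

# axis=0: the function is called once per column
per_col = toy.apply(lambda col: col.max(), axis=0)

print(per_row)   # indexed by row number: 0, 1
print(per_col)   # indexed by column name: ball1, ball2, powerball
```

The validation function needs all six numbers of a single draw at once, which is why the script uses axis=1.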
The line df[df['is_valid']] is pandas syntax for "keep only the rows where is_valid is True." This is how we filter out the old format draws.
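The same indexing syntax can also show what was thrown away, which is a useful sanity check before trusting the filter. This is a minimal sketch on a hypothetical table with made-up values. The ~ operator flips a True/False column.

```python
import pandas as pd

# Hypothetical values for illustration only
toy = pd.DataFrame({
    'powerball': [7, 31, 19],
    'is_valid': [True, False, True],
})

kept = toy[toy['is_valid']]        # rows where is_valid is True
rejected = toy[~toy['is_valid']]   # ~ flips the mask: rows where it is False

print(f"Kept {len(kept)} draws, rejected {len(rejected)}")
print(rejected)
```

Looking at a few rejected rows is a quick way to confirm that they fail for the expected reason, here a Powerball above 26.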
8. Why This Matters for Later Tutorials
Tutorial 2 will build features from these cleaned draws. Tutorial 3 will visualize distributions. Tutorial 4 will run statistical tests. All of those steps assume the data is i.i.d. (independent and identically distributed).
If the data contains multiple formats, it is not identically distributed. A draw from 2010 and a draw from 2020 are not comparable: the first picked its white balls from a pool of 59, the second from a pool of 69. They come from different probability spaces, so any statistical test or model fitted to the mixed data will produce misleading results.
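The phrase "different probability spaces" can be made concrete by counting the possible outcomes under each set of rules. The sketch below assumes the pool sizes given in the Big Picture section (59 white balls and 35 Powerballs before October 2015, 69 and 26 after). The function name is hypothetical.

```python
from math import comb  # available in Python 3.8+

def jackpot_combinations(white_pool, powerball_pool, whites_drawn=5):
    """Count distinct outcomes: an unordered set of white balls plus one Powerball."""
    return comb(white_pool, whites_drawn) * powerball_pool

old_format = jackpot_combinations(59, 35)      # rules just before October 2015
current_format = jackpot_combinations(69, 26)  # rules since October 2015

print(f"Before October 2015: {old_format:,} possible outcomes")
print(f"Current format:      {current_format:,} possible outcomes")
```

This prints 175,223,510 and 292,201,338. The two formats do not even share the same set of possible outcomes, so frequencies measured under one cannot be pooled with frequencies measured under the other.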
By filtering to a single format, you ensure that every draw in your dataset follows the same rules. That makes all future analysis valid and meaningful.
9. What You Now Understand (and why it matters later)
You know what a structural break is. You know why mixing data from different formats breaks statistical assumptions. You can write code that detects format changes and filters to a single format.
More importantly, you understand the principle: clean data is not about having the most data. It is about having comparable data. A smaller dataset with a consistent format is more valuable than a larger dataset with mixed formats.
In Tutorial 2, you will transform these cleaned draws into features. That tutorial assumes the data is already filtered to the current format. If you skip Tutorial 1, Tutorial 2 will produce meaningless features.