Tutorial 1: Structural Breaks and Clean Data

Why historical lottery data must be segmented correctly before any statistical analysis or modeling

Big Picture

The Powerball lottery has changed its format multiple times. In 2015, it switched from 59 white balls to 69 white balls and reduced the Powerball range from 35 to 26. That change created a structural break in the data.

If you mix draws from different formats, your analysis will find patterns that are not real. A model trained on mixed data will learn the format change, not the lottery mechanics. This tutorial teaches you to detect structural breaks and filter to a single format.

What you will be able to do by the end
You will understand what a structural break is and why it matters. You will be able to write Python code that detects format changes and filters data to a single, consistent format. This gives you a clean dataset for all future analysis.

1. The Problem with Historical Lottery Data

Powerball has existed since 1992. Over those decades, the game rules have changed. The number of white balls changed. The range of the Powerball changed. Each change created a new format.

If you download all historical draws and analyze them as one dataset, you are mixing formats. That mixing creates artificial patterns. For example, if you train a model on data from 2010 to 2025, the model will learn that after October 2015, certain numbers (60-69) started appearing. That is not a pattern in the lottery. That is a format change.
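The effect can be sketched with simulated data. The example below is only an illustration under stated assumptions: the draws are randomly generated (not real results), and the helper names simulate_draws and count_high_numbers are hypothetical. It mixes 500 draws under a 59-ball rule with 500 draws under a 69-ball rule and counts how often the numbers 60-69 appear in each half.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the sketch is repeatable


def simulate_draws(n_draws, n_white):
    """Simulate n_draws draws of 5 distinct white balls from 1..n_white."""
    return [rng.sample(range(1, n_white + 1), 5) for _ in range(n_draws)]


def count_high_numbers(draws):
    """Count how often the numbers 60-69 appear across a list of draws."""
    counts = Counter(ball for draw in draws for ball in draw)
    return sum(counts[n] for n in range(60, 70))


# Simulated (not real) data: old 59-ball rules, then current 69-ball rules
old_draws = simulate_draws(500, 59)
new_draws = simulate_draws(500, 69)

high_before = count_high_numbers(old_draws)
high_after = count_high_numbers(new_draws)

print(f"Appearances of 60-69 before the break: {high_before}")
print(f"Appearances of 60-69 after the break:  {high_after}")
```

The count before the break is zero by construction, because those numbers were not in the drum. A model that sees both halves as one dataset would treat that jump as a signal, even though it only reflects the rule change.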

Definition: Structural Break
A structural break is a point in time when the rules of the system change. In lottery data, a structural break occurs when the number of balls or the range of values is modified. After a structural break, the old data and the new data are not directly comparable.

Analogy: Changing the rules mid-game
Imagine you are analyzing basketball stats. Halfway through the season, the league changes the rules so that three-pointers are now worth four points. If you train a model on the full season, it will learn that something weird happened at the midpoint, but it will not understand the real dynamics of basketball. You need to either analyze the two halves separately or only use data from after the rule change.

2. The Current Powerball Format (October 2015 to Present)

The current Powerball format has been stable since October 7, 2015. It works like this:

White balls: 5 numbers chosen from 1 to 69 (without replacement)
Powerball: 1 number chosen from 1 to 26
Draws per week: 3 (Monday, Wednesday, Saturday) since August 2021; 2 (Wednesday, Saturday) before that

Any draw that does not match these rules belongs to a different format and should be excluded from our analysis. This is not "throwing away data." This is ensuring the data we keep is comparable and meaningful.

Does filtering reduce our accuracy?
A common fear in data science is that discarding data will hurt your results. However, roughly 1,200 draws in a single consistent format support more reliable inference than 2,000 draws that mix formats, because every retained draw follows the same rules. In future tutorials, we will show mathematically how small a bias a dataset of this size can detect.
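To get a rough feel for what about 1,200 draws can resolve, here is a back-of-the-envelope sketch. It assumes a fair game, in which each white ball has a 5/69 chance of appearing in any single draw, and it uses the round figure of 1,200 draws quoted above. The variable names are illustrative only.

```python
import math

n_draws = 1200   # round figure for the filtered dataset
p = 5 / 69       # chance that a given white ball appears in one draw

# Under a fair game, the number of appearances of one ball is binomial
expected_count = n_draws * p
std_dev = math.sqrt(n_draws * p * (1 - p))

print(f"Expected appearances per number: {expected_count:.1f}")
print(f"Standard deviation:              {std_dev:.1f}")
print(f"Rough 2-sigma range: {expected_count - 2 * std_dev:.0f} "
      f"to {expected_count + 2 * std_dev:.0f}")
```

This is not the formal argument promised for later tutorials. It only shows the scale of random variation to expect when you count how often one number appears.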

3. The Solution: Detect and Filter

The solution has two steps:

Step 1: Detect the format

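For each draw, check if the white balls are in the range 1-69 and if the Powerball is in the range 1-26. If no, the draw belongs to an older format. If yes, the draw is consistent with the current format, but that alone is not proof: many pre-2015 draws also happen to fall inside these ranges. The draw date (on or after October 7, 2015) settles which format a draw belongs to.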

Step 2: Filter to the current format

Keep only the draws that match the current format. Discard everything else. Save the clean data to a new file.

This process is mechanical. There is no judgment or guesswork. You are simply checking if each draw fits the rules of the current game.
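One caveat is worth sketching. A range check can only rule a draw out. It cannot confirm that a draw belongs to the current format, because an older draw may happen to use only low numbers. The example below uses made-up draws (not real results) and a hypothetical helper, fits_current_ranges, to show a pre-2015 draw passing the range check while failing the date cutoff taken from Section 2.

```python
from datetime import date

FORMAT_START = date(2015, 10, 7)  # first draw of the current format


def fits_current_ranges(white_balls, powerball):
    """Range-only check: 5 distinct white balls in 1-69, Powerball in 1-26."""
    return (
        len(set(white_balls)) == 5
        and all(1 <= b <= 69 for b in white_balls)
        and 1 <= powerball <= 26
    )


# Made-up draws for illustration only
sample_draws = [
    {"draw_date": date(2012, 3, 14), "white": [5, 12, 33, 40, 58], "powerball": 12},
    {"draw_date": date(2012, 3, 17), "white": [7, 19, 23, 44, 51], "powerball": 31},
    {"draw_date": date(2019, 6, 1), "white": [6, 15, 34, 45, 66], "powerball": 9},
]

results = []
for draw in sample_draws:
    in_ranges = fits_current_ranges(draw["white"], draw["powerball"])
    in_current_era = draw["draw_date"] >= FORMAT_START
    results.append((in_ranges, in_current_era))
    print(draw["draw_date"], "ranges ok:", in_ranges, "current era:", in_current_era)
```

The first draw fits the ranges but predates the format change, so a filter that relies on ranges alone would keep it. Combining the range check with the date cutoff avoids that.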

4. Code Roadmap: What the Script Does (and Why This Order)

The script performs four main steps:

Step 1: Load the raw data

Read the CSV file containing all historical Powerball draws.

Why first: You need the data before you can process it.

Step 2: Define a validation function

Write a function that takes one draw and returns True if it matches the current format, False otherwise.

Why next: The function is reusable and testable. You can verify it works on sample draws before applying it to the full dataset.

Step 3: Apply the validation function to all draws

Use pandas to run the validation function on every row of the dataframe. This creates a new column that marks each draw as valid or invalid.

Why here: One call handles the whole dataframe. Note that df.apply with axis=1 is not vectorized: it calls the Python function once per row. For a few thousand draws that is still fast enough.

Step 4: Filter and save

Keep only the valid draws. Convert the date column to datetime format. Save the cleaned data to a new file.

Why last: The cleaned data is now ready for Tutorial 2 (Feature Engineering). Saving it to a separate file protects the original raw data.
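Step 2 mentions verifying the function on sample draws first. A minimal sketch of that idea follows. It repeats the validation function in compact form and feeds it plain dictionaries shaped like one dataframe row. The helper make_draw and the sample numbers are made up for illustration.

```python
def validate_one_draw(row):
    """Return True if the draw fits the current format's ranges."""
    white_balls = [row['ball1'], row['ball2'], row['ball3'], row['ball4'], row['ball5']]
    powerball = row['powerball']
    if not all(1 <= ball <= 69 for ball in white_balls):
        return False
    if not (1 <= powerball <= 26):
        return False
    if len(set(white_balls)) != 5:
        return False
    return True


def make_draw(balls, powerball):
    """Build a dict shaped like one dataframe row (hypothetical helper)."""
    row = {f"ball{i}": b for i, b in enumerate(balls, start=1)}
    row["powerball"] = powerball
    return row


# Boundary values that should pass
assert validate_one_draw(make_draw([1, 17, 28, 45, 69], 26))

# One rule broken at a time
assert not validate_one_draw(make_draw([3, 17, 28, 45, 70], 10))  # white ball too high
assert not validate_one_draw(make_draw([0, 17, 28, 45, 60], 10))  # white ball too low
assert not validate_one_draw(make_draw([3, 17, 28, 45, 60], 27))  # Powerball too high
assert not validate_one_draw(make_draw([3, 17, 17, 45, 60], 10))  # duplicate white ball

print("All sample checks passed")
```

Testing each rule separately makes it obvious which check is responsible if one of the assertions ever fails.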

5. Python Implementation

Here is the complete script. It loads raw Powerball data, detects the current format, and saves a cleaned dataset.

tutorial1_structural_breaks.py
Python
"""Tutorial 1: Detect structural breaks and filter to the current Powerball format"""

from pathlib import Path

import pandas as pd

# First draw under the current format (see Section 2)
FORMAT_START = pd.Timestamp('2015-10-07')

# --- Load raw data ---

print("Loading raw Powerball data...")
df = pd.read_csv('data/raw/powerball_raw.csv')
print(f"Total draws in raw file: {len(df)}")

# --- Helper function: Validate one draw ---

def validate_one_draw(row):
    """
    Check if a single draw matches the current Powerball format.
    Returns True if valid, False otherwise.

    Current format: 5 white balls (1-69) + 1 Powerball (1-26)
    """
    # Extract the 5 white balls
    white_balls = [row['ball1'], row['ball2'], row['ball3'], row['ball4'], row['ball5']]
    powerball = row['powerball']

    # Check 1: All white balls must be between 1 and 69
    if not all(1 <= ball <= 69 for ball in white_balls):
        return False

    # Check 2: Powerball must be between 1 and 26
    if not (1 <= powerball <= 26):
        return False

    # Check 3: All white balls must be unique (no duplicates)
    # A set is like a list that automatically removes duplicates
    if len(set(white_balls)) != 5:
        return False

    return True

# --- Filter draws ---

print("\nFiltering draws to current format (5 white balls 1-69, Powerball 1-26)...")

# Apply validation to every row
df['is_valid'] = df.apply(validate_one_draw, axis=1)

# Keep only the valid draws
valid_draws = df[df['is_valid']].copy()

# Convert date column to proper datetime format
valid_draws['draw_date'] = pd.to_datetime(valid_draws['draw_date'])

# Range checks alone cannot separate formats: an older draw that happens to
# use only low numbers also fits the current ranges, so apply the date cutoff too
valid_draws = valid_draws[valid_draws['draw_date'] >= FORMAT_START]

# Sort by date (oldest first)
valid_draws = valid_draws.sort_values('draw_date')

# Drop the temporary validation column
valid_draws = valid_draws.drop(columns=['is_valid'])

# --- Summary statistics ---

print(f"\nValid draws (current format): {len(valid_draws)}")
print(f"Earliest valid draw: {valid_draws['draw_date'].min().date()}")
print(f"Latest valid draw: {valid_draws['draw_date'].max().date()}")

# --- Save cleaned data ---

# Create the output folder if it does not exist yet
output_path = Path('data/processed/powerball_clean.parquet')
output_path.parent.mkdir(parents=True, exist_ok=True)

# to_parquet needs a Parquet engine such as pyarrow to be installed
valid_draws.to_parquet(output_path, index=False)
print(f"\nCleaned data saved to: {output_path.as_posix()}")
print("Ready for Tutorial 2 (Feature Engineering)")

6. How to Run the Script

Prerequisites:
  • Python 3.8 or higher
  • pandas and pyarrow installed (pip install pandas pyarrow); pyarrow provides the Parquet engine that to_parquet needs
  • Raw Powerball data in data/raw/powerball_raw.csv

Run the script:

Windows (PowerShell)
cd C:\path\to\tutorials
python tutorial1_structural_breaks.py

Mac / Linux (Terminal)
cd /path/to/tutorials
python3 tutorial1_structural_breaks.py

What you should see:
Loading raw Powerball data...
Total draws in raw file: 2847

Filtering draws to current format (5 white balls 1-69, Powerball 1-26)...

Valid draws (current format): 1269
Earliest valid draw: 2015-10-07
Latest valid draw: 2024-12-14

Cleaned data saved to: data/processed/powerball_clean.parquet
Ready for Tutorial 2 (Feature Engineering)

7. Understanding the Code: What Each Part Does

The validation function

The function validate_one_draw() checks three conditions:

  • All white balls are between 1 and 69
  • The Powerball is between 1 and 26
  • All white balls are unique (no duplicates)

The third check uses a set. A set is like a list that automatically removes duplicates. If you put 5 numbers into a set and it only has 4 items, that means there was a duplicate. The function returns False in that case.
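A two-line illustration of the set trick, using made-up numbers:

```python
unique_draw = [5, 12, 33, 40, 58]
duplicate_draw = [5, 12, 12, 40, 58]

print(len(set(unique_draw)))     # 5: all numbers are distinct
print(len(set(duplicate_draw)))  # 4: the repeated 12 was collapsed
```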

The apply step

The line df.apply(validate_one_draw, axis=1) runs the validation function on every row of the dataframe. The axis=1 parameter means "apply this function to each row, not each column."
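The difference between the two axis settings is easiest to see on a tiny dataframe. The sketch below uses a made-up three-column table. The lambdas are for illustration only and are not part of the tutorial script.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({'ball1': [3, 10], 'ball2': [8, 20], 'powerball': [5, 30]})

# axis=1: the function receives one row at a time (a Series indexed by column name)
row_sums = toy.apply(lambda row: row['ball1'] + row['ball2'], axis=1)

# axis=0 (the default): the function receives one column at a time
col_max = toy.apply(lambda col: col.max(), axis=0)

print(row_sums.tolist())   # one value per row
print(col_max.to_dict())   # one value per column
```

With axis=1 the result has one entry per draw, which is what the is_valid column needs.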

The filter step

The line df[df['is_valid']] is pandas syntax for "keep only the rows where is_valid is True." This is how we filter out the old format draws.
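The same boolean-mask idea can also build the validity check without calling a Python function per row. The following is a sketch of one possible alternative, not the approach the tutorial script uses. The toy rows are made up, and the column names are the ones assumed by the script.

```python
import pandas as pd

# Made-up draws for illustration only
toy = pd.DataFrame({
    'ball1': [5, 3, 5],
    'ball2': [12, 17, 12],
    'ball3': [33, 28, 12],
    'ball4': [40, 45, 40],
    'ball5': [58, 70, 58],
    'powerball': [12, 10, 7],
})

white_cols = ['ball1', 'ball2', 'ball3', 'ball4', 'ball5']
white = toy[white_cols]

# Each line produces one True/False value per row
in_range = white.ge(1).all(axis=1) & white.le(69).all(axis=1)
all_unique = white.nunique(axis=1).eq(5)
powerball_ok = toy['powerball'].between(1, 26)

# Combine the three conditions, then filter with the mask
mask = in_range & all_unique & powerball_ok
kept = toy[mask]

print(mask.tolist())
print(len(kept))
```

Here the second row fails because of the 70 and the third because of the repeated 12, so only the first row survives the filter. On a few thousand draws the speed difference from apply is unlikely to matter, so the choice is mostly about readability.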

8. Why This Matters for Later Tutorials

Tutorial 2 will build features from these cleaned draws. Tutorial 3 will visualize distributions. Tutorial 4 will run statistical tests. All of those steps assume the data is i.i.d. (independent and identically distributed).

If the data contains multiple formats, it is not identically distributed. A draw from 2010 and a draw from 2020 are not comparable. They come from different probability spaces. Any statistical test or model trained on mixed data will produce garbage results.
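To make "different probability spaces" concrete, the sketch below counts the possible outcomes under the two formats described in the Big Picture section: 59 white balls with a Powerball from 1 to 35, and 69 white balls with a Powerball from 1 to 26. The function name outcome_count is illustrative, and math.comb requires Python 3.8 or higher.

```python
import math


def outcome_count(n_white, n_powerball):
    """Number of distinct outcomes: choose 5 white balls, then 1 Powerball."""
    return math.comb(n_white, 5) * n_powerball


old_format = outcome_count(59, 35)      # format just before October 2015
current_format = outcome_count(69, 26)  # current format

print(f"Outcomes, 59/35 format: {old_format:,}")      # 175,223,510
print(f"Outcomes, 69/26 format: {current_format:,}")  # 292,201,338
```

The two formats do not even share the same set of possible outcomes, so pooling their draws means pooling samples from two different distributions.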

By filtering to a single format, you ensure that every draw in your dataset follows the same rules. That makes all future analysis valid and meaningful.

Definition: i.i.d. (Independent and Identically Distributed)
A dataset is i.i.d. if every observation (1) comes from the same probability distribution (identically distributed) and (2) does not depend on any other observation (independent). Within a single format, lottery draws can be treated as i.i.d., assuming a fair drawing process. Lottery draws across multiple formats cannot.

9. What You Now Understand (and why it matters later)

You know what a structural break is. You know why mixing data from different formats breaks statistical assumptions. You can write code that detects format changes and filters to a single format.

More importantly, you understand the principle: clean data is not about having the most data. It is about having comparable data. A smaller dataset with a consistent format is more valuable than a larger dataset with mixed formats.

In Tutorial 2, you will transform these cleaned draws into features. That tutorial assumes the data is already filtered to the current format. If you skip Tutorial 1, Tutorial 2 will produce meaningless features.