Big Picture
Tutorial 1 gave you clean data: 1,269 draws, all in the same format. But raw draws are not useful for modeling. A draw like [5, 12, 23, 45, 67] is just five numbers. It has no structure that a model can learn from.
This tutorial teaches you to transform raw draws into features. A feature is a numerical property that captures something meaningful about a draw. Instead of five disconnected numbers, you get 15 properties: the mean, the standard deviation, the number of odd balls, the spacing between consecutive numbers, and so on.
These features give models something to work with. They turn unstructured data into structured data. They make patterns visible.
1. What Is a Feature?
A feature is a measurable property of your data. In lottery analysis, a feature is any number you can calculate from a draw.
For example, given the draw [5, 12, 23, 45, 67], you can calculate:
- The mean: 30.4
- The range: 62 (from 5 to 67)
- The number of odd balls: 4
- The largest gap between consecutive numbers: 22 (between 23 and 45)
Each of these is a feature. Each one gives you a different perspective on the draw.
2. Why Feature Engineering Matters
Raw lottery draws are categorical data. Each ball is just an identifier. The number 23 is not "larger" than the number 12 in any meaningful sense for prediction. They are just labels.
But if you transform the draw into features, you create numerical relationships. You can compare draws. You can say "this draw has a higher mean than that draw" or "this draw has more odd numbers than that draw." You create axes of variation that models can learn from.
Feature engineering is where domain knowledge enters the pipeline. You decide which properties are worth measuring. You design the lens through which the model sees the data.
3. The Five Feature Families
We will extract 15 features from each draw, organized into five families:
1. Aggregate statistics: mean, standard deviation, and range. These describe the central tendency and spread of the five white balls.
2. Parity: how many odd numbers and how many even numbers appear in the draw.
3. High/low split: how many numbers are below 35 (low) and how many are 35 or above (high). This tests if draws favor certain regions of the number range.
4. Decade buckets: how many numbers fall into each decade (0-9, 10-19, 20-29, etc.). This captures the spatial distribution across the full range.
5. Spacing and gaps: the maximum gap between consecutive numbers. This measures clustering versus dispersion.
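To make the five families concrete before the full script, here is a minimal sketch computing one representative feature from each family for the example draw (the complete script below computes all 15):

```python
draw = [5, 12, 23, 45, 67]  # example draw, already sorted

mean = sum(draw) / len(draw)                          # family 1: aggregate statistics
odd_count = sum(1 for n in draw if n % 2 == 1)        # family 2: parity
low_count = sum(1 for n in draw if n < 35)            # family 3: high/low split
decade_20s = sum(1 for n in draw if 20 <= n <= 29)    # family 4: decade buckets
max_gap = max(b - a for a, b in zip(draw, draw[1:]))  # family 5: spacing and gaps

print(mean, odd_count, low_count, decade_20s, max_gap)  # → 30.4 4 3 1 22
```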
4. From Draw to Feature Vector
After feature engineering, each draw becomes a feature vector: a list of 15 numbers. That vector is a point in 15-dimensional space.
For example, the draw [5, 12, 23, 45, 67] becomes:

[30.4, 25.45, 62, 4, 1, 3, 2, 1, 1, 1, 0, 1, 0, 1, 22]

(in order: mean, standard deviation, range, odd count, even count, low count, high count, the seven decade counts, and the max gap)

This vector is now ready for statistical analysis. You can compare it to other vectors. You can compute distances. You can train models.
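Once each draw is a vector, standard vector math applies. For instance, the Euclidean distance between two feature vectors measures how similar two draws are (the vectors here are hypothetical and trimmed to three features for readability; the real ones have 15 entries):

```python
import numpy as np

# Two hypothetical feature vectors, trimmed to (mean, std, range).
v1 = np.array([30.4, 25.5, 62.0])
v2 = np.array([28.0, 20.1, 55.0])

distance = np.linalg.norm(v1 - v2)  # Euclidean distance between the draws
print(round(distance, 2))  # → 9.16
```

A small distance means the two draws have similar statistical profiles, even if they share no balls in common.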
5. Code Roadmap: What the Script Does (and Why This Order)
The script performs three main steps:
1. Load: read the parquet file created in Tutorial 1.
2. Define: write a function that takes one draw (one row) and returns a dictionary of features.
3. Apply: loop through the dataframe and extract features for each draw. Collect the results into a new dataframe.
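The same three steps can also be written more compactly with DataFrame.apply. This sketch uses a tiny hand-built stand-in dataframe (rather than the real parquet file) so it is self-contained, and computes only two features to keep it short:

```python
import pandas as pd

# Stand-in for the real cleaned dataframe: two example draws.
draws = pd.DataFrame([
    {'ball1': 5, 'ball2': 12, 'ball3': 23, 'ball4': 45, 'ball5': 67},
    {'ball1': 2, 'ball2': 14, 'ball3': 30, 'ball4': 41, 'ball5': 58},
])

def extract(row):
    nums = [row[f'ball{i}'] for i in range(1, 6)]
    return pd.Series({
        'mean': sum(nums) / 5,
        'odd_count': sum(1 for n in nums if n % 2 == 1),
    })

feature_df = draws.apply(extract, axis=1)  # one feature row per draw
print(feature_df)
```

The full script below uses an explicit loop instead, which is easier to read when you are learning; apply is a common idiom you will see in other people's code.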
6. Python Implementation
Here is the complete script. It loads cleaned draws and builds a feature matrix.
"""Tutorial 2: Transform cleaned draws into a feature matrix"""
import pandas as pd
import numpy as np
# --- Load cleaned data ---
print("Loading cleaned Powerball data from Tutorial 1...")
draws = pd.read_parquet('data/processed/powerball_clean.parquet')
print(f"Total draws: {len(draws)}\n")
# --- Feature engineering function ---
def extract_features(row):
    """
    Transform one draw into a feature vector.
    Returns a dictionary with 15 features.
    """
    # The 5 white balls (already sorted in the data)
    nums = [row['ball1'], row['ball2'], row['ball3'], row['ball4'], row['ball5']]
    features = {}

    # Family 1: Aggregate statistics
    features['mean'] = np.mean(nums)
    features['std'] = np.std(nums, ddof=1)  # ddof=1 uses sample std, not population std
    features['range'] = max(nums) - min(nums)

    # Family 2: Parity (odd/even counts)
    features['odd_count'] = sum(1 for n in nums if n % 2 == 1)
    features['even_count'] = sum(1 for n in nums if n % 2 == 0)

    # Family 3: High/Low split (using 35 as the midpoint of 1-69)
    features['low_count'] = sum(1 for n in nums if n < 35)
    features['high_count'] = sum(1 for n in nums if n >= 35)

    # Family 4: Decade buckets
    # Count how many balls fall into each decade (0-9, 10-19, ..., 60-69)
    features['decade_00_09'] = sum(1 for n in nums if n <= 9)
    features['decade_10_19'] = sum(1 for n in nums if 10 <= n <= 19)
    features['decade_20_29'] = sum(1 for n in nums if 20 <= n <= 29)
    features['decade_30_39'] = sum(1 for n in nums if 30 <= n <= 39)
    features['decade_40_49'] = sum(1 for n in nums if 40 <= n <= 49)
    features['decade_50_59'] = sum(1 for n in nums if 50 <= n <= 59)
    features['decade_60_69'] = sum(1 for n in nums if 60 <= n <= 69)

    # Family 5: Spacing and gaps
    # Calculate the differences between consecutive balls
    diffs = [nums[i+1] - nums[i] for i in range(4)]
    features['max_gap'] = max(diffs)

    return features
# --- Apply feature engineering to all draws ---
print("Extracting features from each draw...")
feature_list = []
for idx, row in draws.iterrows():
    features = extract_features(row)
    # Attach the draw date so each feature row stays linked to its draw
    features['draw_date'] = row['draw_date']
    feature_list.append(features)
# Convert list of dictionaries to a DataFrame
feature_df = pd.DataFrame(feature_list)
# --- Summary ---
print(f"\nFeature extraction complete!")
print(f"Total draws: {len(feature_df)}")
print(f"Total features per draw: {len(feature_df.columns) - 1}") # Minus 1 for draw_date
print(f"\nFeature columns: {list(feature_df.columns)}")
# --- Save feature table ---
feature_df.to_parquet('data/processed/features_powerball.parquet', index=False)
print("\nFeature table saved to: data/processed/features_powerball.parquet")
print("Ready for Tutorial 3 (EDA and Validation)\n")

7. How to Run the Script
Requirements:
- You must have run Tutorial 1 first (to create powerball_clean.parquet)
- NumPy installed (pip install numpy)
# Windows (PowerShell)
cd C:\path\to\tutorials
python tutorial2_feature_engineering.py

# Mac / Linux (Terminal)
cd /path/to/tutorials
python3 tutorial2_feature_engineering.py

8. Understanding the Code: What Each Feature Does
The line np.std(nums, ddof=1) computes the sample standard deviation. The ddof=1 parameter tells NumPy to use the sample formula (dividing by n-1) instead of the population formula (dividing by n). This is the standard in data science because we are working with a sample of draws, not the entire population of all possible draws.
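A quick illustration of the difference, using the same example draw; the sample formula always gives a slightly larger value because it divides by a smaller number:

```python
import numpy as np

nums = [5, 12, 23, 45, 67]

pop_std = np.std(nums)             # ddof=0 (default): divides by n
sample_std = np.std(nums, ddof=1)  # ddof=1: divides by n - 1

# sample_std is always a bit larger than pop_std for the same data
print(round(pop_std, 2), round(sample_std, 2))
```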
The decade features use simple if-statements to count how many balls fall into each range. For example, sum(1 for n in nums if 20 <= n <= 29) counts how many numbers are between 20 and 29 (inclusive).
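As an aside, the same seven decade counts can be produced in one call with np.histogram. This is an alternative to the script's approach, not what the script itself does; since the balls are integers, the half-open bins [0, 10), [10, 20), ... match the script's inclusive ranges exactly:

```python
import numpy as np

nums = [5, 12, 23, 45, 67]

# Bin edges 0, 10, ..., 70 define seven decades; counts has one entry per decade.
counts, _ = np.histogram(nums, bins=range(0, 80, 10))
print(list(counts))  # → [1, 1, 1, 0, 1, 0, 1]
```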
The max gap feature measures the largest spacing between consecutive balls. The code first computes all the differences: [nums[i+1] - nums[i] for i in range(4)]. Then it takes the maximum of those differences.
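NumPy offers a shortcut for the same computation: np.diff returns the consecutive differences directly, so the list comprehension can be replaced by a single call. A small sketch:

```python
import numpy as np

nums = [5, 12, 23, 45, 67]

gaps = np.diff(nums)      # consecutive differences: [7, 11, 22, 22]
max_gap = int(gaps.max()) # largest spacing between adjacent balls
print(max_gap)  # → 22
```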
9. Why This Matters for Later Tutorials
Tutorial 3 will visualize these features to check if they look reasonable. Tutorial 4 will test if the distributions match random expectations. Tutorial 5 will use these features to train machine learning models.
All of those steps assume you have a feature matrix. Without feature engineering, you cannot do statistical analysis. Without features, you cannot train models. This tutorial is the bridge from raw data to structured data.
The 15 features you built here are not the only possible features. They are a starting point. As you learn more about lottery analysis, you can design new features that capture different aspects of the draws. Feature engineering is an iterative process.
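As one example of extending the set, here is a hypothetical sixteenth feature (not part of the script above): the number of consecutive pairs in a draw, such as 23 and 24. It captures a kind of tight clustering that the max gap alone does not:

```python
def consecutive_pairs(nums):
    """Count adjacent balls that differ by exactly 1 (nums must be sorted)."""
    return sum(1 for a, b in zip(nums, nums[1:]) if b - a == 1)

print(consecutive_pairs([5, 12, 23, 45, 67]))  # → 0
print(consecutive_pairs([5, 6, 23, 24, 25]))   # → 3
```

To add it to the pipeline, you would compute it inside extract_features alongside the other spacing features.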