
Tutorial 2: Feature Engineering from Lottery Draws

How raw draws are transformed into a stable feature matrix that models can actually consume

Big Picture

Tutorial 1 gave you clean data: 1,269 draws, all in the same format. But raw draws are not useful for modeling. A draw like [5, 12, 23, 45, 67] is just five numbers. It has no structure that a model can learn from.

This tutorial teaches you to transform raw draws into features. A feature is a numerical property that captures something meaningful about a draw. Instead of five disconnected numbers, you get 15 properties: the mean, the standard deviation, the number of odd balls, the spacing between consecutive numbers, and so on.

These features give models something to work with. They turn unstructured data into structured data. They make patterns visible.

What you will be able to do by the end
You will understand what a feature is and why feature engineering matters. You will be able to write Python code that transforms lottery draws into a feature matrix. You will have a dataset with 1,269 rows (one per draw) and 15 feature columns (plus the draw date), ready for analysis and modeling.

1. What Is a Feature?

A feature is a measurable property of your data. In lottery analysis, a feature is any number you can calculate from a draw.

Definition: Feature
A feature is a numerical property derived from raw data. Features are the inputs to statistical tests and machine learning models. Good features capture relevant structure. Bad features add noise.

For example, given the draw [5, 12, 23, 45, 67], you can calculate:

Mean: (5 + 12 + 23 + 45 + 67) / 5 = 30.4
Standard deviation: How spread out the numbers are (about 25.45 for this draw, using the sample formula)
Odd count: How many of the five numbers are odd (in this case, 4: 5, 23, 45, and 67)
Max gap: The largest spacing between consecutive numbers (45 - 23 = 22)

Each of these is a feature. Each one gives you a different perspective on the draw.
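To make this concrete, here is a minimal sketch that computes those four features for the example draw (NumPy is assumed to be installed; the gap logic mirrors the full script in Section 6):

Python
import numpy as np

draw = [5, 12, 23, 45, 67]  # the five white balls, already sorted

mean = np.mean(draw)                                  # 30.4
std = np.std(draw, ddof=1)                            # ~25.45 (sample std)
odd_count = sum(1 for n in draw if n % 2 == 1)        # 4
max_gap = max(b - a for a, b in zip(draw, draw[1:]))  # 22

print(mean, round(std, 2), odd_count, max_gap)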

Analogy: Describing a person
Imagine you are trying to describe a person to someone who has never met them. You would not just list their atoms. You would describe features: height, weight, eye color, age. Features are the properties that matter for understanding and comparison.

2. Why Feature Engineering Matters

Raw lottery draws are categorical data. Each ball is just an identifier. The number 23 is not "larger" than the number 12 in any meaningful sense for prediction. They are just labels.

But if you transform the draw into features, you create numerical relationships. You can compare draws. You can say "this draw has a higher mean than that draw" or "this draw has more odd numbers than that draw." You create axes of variation that models can learn from.

Feature engineering is where domain knowledge enters the pipeline. You decide which properties are worth measuring. You design the lens through which the model sees the data.

Feature engineering is the bottleneck
In most real-world data science projects, the quality of your features determines the quality of your results. A sophisticated model with bad features will fail. A simple model with good features can succeed. This tutorial teaches you how to design features systematically.

3. The Five Feature Families

We will extract 15 features from each draw, organized into five families:

Family 1: Aggregate Statistics

Mean, standard deviation, and range. These describe the central tendency and spread of the five white balls.

Family 2: Parity (Odd/Even)

How many odd numbers and how many even numbers appear in the draw.

Family 3: High/Low Split

How many numbers are below 35 (low) and how many are 35 or above (high). This tests whether draws favor certain regions of the number range.

Family 4: Decade Buckets

How many numbers fall into each decade (0-9, 10-19, 20-29, etc.). This captures the spatial distribution across the full range.

Family 5: Spacing and Gaps

The maximum gap between consecutive numbers. This measures clustering versus dispersion.

Example: (5) ---7--- (12) ---11--- (23) ---22--- (45) ---22--- (67)
The gaps are 7, 11, 22, and 22. The max gap is 22.

4. From Draw to Feature Vector

After feature engineering, each draw becomes a feature vector: a list of 15 numbers. That vector is a point in 15-dimensional space.

Definition: Feature Vector
A feature vector is an ordered list of feature values. If you have 15 features, each draw is represented as a point in 15-dimensional space. This geometric view is the foundation of modern machine learning, where algorithms find patterns by measuring distances and similarities between points in high-dimensional space.

For example, the draw [5, 12, 23, 45, 67] becomes:

[30.4, # mean
25.45, # std
62, # range
4, # odd_count
1, # even_count
3, # low_count
2, # high_count
1, # decade_00_09
1, # decade_10_19
1, # decade_20_29
0, # decade_30_39
1, # decade_40_49
0, # decade_50_59
1, # decade_60_69
22] # max_gap

This vector is now ready for statistical analysis. You can compare it to other vectors. You can compute distances. You can train models.
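As a concrete sketch of that geometric view, here is one way to measure the Euclidean distance between two feature vectors with NumPy (the second vector is hypothetical, invented for illustration):

Python
import numpy as np

# v1 is the example vector above; v2 is a made-up comparison vector
v1 = np.array([30.4, 25.45, 62, 4, 1, 3, 2, 1, 1, 1, 0, 1, 0, 1, 22])
v2 = np.array([35.0, 18.2, 50, 2, 3, 2, 3, 0, 1, 2, 1, 1, 0, 0, 15])

# Euclidean distance: straight-line distance in 15-dimensional space
distance = np.linalg.norm(v1 - v2)
print(f"Distance between the two draws: {distance:.2f}")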

5. Code Roadmap: What the Script Does (and Why This Order)

The script performs three main steps:

Step 1: Load the cleaned data

Read the parquet file created in Tutorial 1.

Why first: You need the filtered draws before you can engineer features.
Step 2: Define the feature extraction function

Write a function that takes one draw (one row) and returns a dictionary of features.

Why next: Encapsulating the logic in a function makes it reusable and testable. You can verify it works on sample draws before applying it to all 1,269 rows.
Step 3: Apply the function to all draws

Loop through the dataframe and extract features for each draw. Collect the results into a new dataframe.

Why last: The loop needs both the data (Step 1) and the function (Step 2). After extraction, save the feature table to disk. This table is the input for all future tutorials.
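For example, once extract_features is defined (the full script is in the next section), a quick sanity check on a hand-built row might look like this sketch. The column names ball1 through ball5 are assumed to match the Tutorial 1 output:

Python
import pandas as pd

# A hand-built row mimicking one cleaned draw from Tutorial 1
sample = pd.Series({'ball1': 5, 'ball2': 12, 'ball3': 23,
                    'ball4': 45, 'ball5': 67})

# extract_features must already be defined (see Section 6)
feats = extract_features(sample)
assert feats['odd_count'] == 4
assert feats['max_gap'] == 22
print("extract_features passes the spot check")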

6. Python Implementation

Here is the complete script. It loads cleaned draws and builds a feature matrix.

tutorial2_feature_engineering.py
Python
"""Tutorial 2: Transform cleaned draws into a feature matrix"""

import pandas as pd
import numpy as np

# --- Load cleaned data ---

print("Loading cleaned Powerball data from Tutorial 1...")
draws = pd.read_parquet('data/processed/powerball_clean.parquet')
print(f"Total draws: {len(draws)}\n")

# --- Feature engineering function ---

def extract_features(row):
    """
    Transform one draw into a feature vector.
    Returns a dictionary with 15 features.
    """
    # The 5 white balls (already sorted in the data)
    nums = [row['ball1'], row['ball2'], row['ball3'], row['ball4'], row['ball5']]
    
    features = {}
    
    # Family 1: Aggregate statistics
    features['mean'] = np.mean(nums)
    features['std'] = np.std(nums, ddof=1)  # ddof=1 uses sample std, not population std
    features['range'] = max(nums) - min(nums)
    
    # Family 2: Parity (odd/even counts)
    features['odd_count'] = sum(1 for n in nums if n % 2 == 1)
    features['even_count'] = sum(1 for n in nums if n % 2 == 0)
    
    # Family 3: High/Low split (using 35 as the midpoint of 1-69)
    features['low_count'] = sum(1 for n in nums if n < 35)
    features['high_count'] = sum(1 for n in nums if n >= 35)
    
    # Family 4: Decade buckets
    # Count how many balls fall into each decade (0-9, 10-19, ..., 60-69)
    features['decade_00_09'] = sum(1 for n in nums if n <= 9)
    features['decade_10_19'] = sum(1 for n in nums if 10 <= n <= 19)
    features['decade_20_29'] = sum(1 for n in nums if 20 <= n <= 29)
    features['decade_30_39'] = sum(1 for n in nums if 30 <= n <= 39)
    features['decade_40_49'] = sum(1 for n in nums if 40 <= n <= 49)
    features['decade_50_59'] = sum(1 for n in nums if 50 <= n <= 59)
    features['decade_60_69'] = sum(1 for n in nums if 60 <= n <= 69)
    
    # Family 5: Spacing and gaps
    # Calculate the differences between consecutive balls
    diffs = [nums[i+1] - nums[i] for i in range(4)]
    features['max_gap'] = max(diffs)
    
    return features

# --- Apply feature engineering to all draws ---

print("Extracting features from each draw...")
feature_list = []

for idx, row in draws.iterrows():
    features = extract_features(row)
    
    # Attach the draw date so each feature vector stays linked to its draw
    features['draw_date'] = row['draw_date']
    
    feature_list.append(features)

# Convert list of dictionaries to a DataFrame
feature_df = pd.DataFrame(feature_list)

# --- Summary ---

print(f"\nFeature extraction complete!")
print(f"Total draws: {len(feature_df)}")
print(f"Total features per draw: {len(feature_df.columns) - 1}")  # Minus 1 for draw_date
print(f"\nFeature columns: {list(feature_df.columns)}")

# --- Save feature table ---

feature_df.to_parquet('data/processed/features_powerball.parquet', index=False)
print("\nFeature table saved to: data/processed/features_powerball.parquet")
print("Ready for Tutorial 3 (EDA and Validation)\n")

7. How to Run the Script

Prerequisites:
  • You must have run Tutorial 1 first (to create powerball_clean.parquet)
  • NumPy installed (pip install numpy)
Run the script:
Windows (PowerShell)
PowerShell
# Windows (PowerShell)
cd C:\path\to\tutorials
python tutorial2_feature_engineering.py
Mac / Linux (Terminal)
Bash
# Mac / Linux (Terminal)
cd /path/to/tutorials
python3 tutorial2_feature_engineering.py
What you should see:
Loading cleaned Powerball data from Tutorial 1...
Total draws: 1269

Extracting features from each draw...

Feature extraction complete!
Total draws: 1269
Total features per draw: 15

Feature columns: ['mean', 'std', 'range', 'odd_count', 'even_count', 'low_count', 'high_count', 'decade_00_09', 'decade_10_19', 'decade_20_29', 'decade_30_39', 'decade_40_49', 'decade_50_59', 'decade_60_69', 'max_gap', 'draw_date']

Feature table saved to: data/processed/features_powerball.parquet
Ready for Tutorial 3 (EDA and Validation)
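To confirm the saved file is well-formed, a quick check like this sketch can help (pandas with a parquet engine such as pyarrow is assumed):

Python
import pandas as pd

# Reload the saved feature table and confirm its shape
features = pd.read_parquet('data/processed/features_powerball.parquet')
print(features.shape)    # expected: (1269, 16) -- 15 features + draw_date
print(features.head(3))  # eyeball the first few feature vectors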

8. Understanding the Code: What Each Feature Does

Standard deviation (ddof=1)

The line np.std(nums, ddof=1) computes the sample standard deviation. The ddof=1 parameter tells NumPy to use the sample formula (dividing by n-1) instead of the population formula (dividing by n). This is the standard in data science because we are working with a sample of draws, not the entire population of all possible draws.
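You can see the difference directly with a one-off comparison (a sketch using the example draw from earlier):

Python
import numpy as np

nums = [5, 12, 23, 45, 67]

# Population formula: divide by n
print(np.std(nums, ddof=0))  # ~22.77

# Sample formula: divide by n - 1 (what the script uses)
print(np.std(nums, ddof=1))  # ~25.45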

Decade buckets

The decade features use simple if-statements to count how many balls fall into each range. For example, sum(1 for n in nums if 20 <= n <= 29) counts how many numbers are between 20 and 29 (inclusive).
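An equivalent, more compact formulation (a sketch, not what the script uses) buckets each number by integer division:

Python
from collections import Counter

nums = [5, 12, 23, 45, 67]

# n // 10 maps 5 -> decade 0, 12 -> decade 1, 23 -> decade 2, and so on
decade_counts = Counter(n // 10 for n in nums)

for decade in range(7):  # decades 0-9 through 60-69
    lo, hi = decade * 10, decade * 10 + 9
    print(f"decade_{lo:02d}_{hi:02d}: {decade_counts.get(decade, 0)}")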

Max gap

The max gap feature measures the largest spacing between consecutive balls. The code first computes all the differences: [nums[i+1] - nums[i] for i in range(4)]. Then it takes the maximum of those differences.
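If you prefer NumPy, np.diff does the same computation in one call (an equivalent sketch, not what the script uses):

Python
import numpy as np

nums = [5, 12, 23, 45, 67]

gaps = np.diff(nums)   # array([ 7, 11, 22, 22])
print(gaps.max())      # 22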

9. Why This Matters for Later Tutorials

Tutorial 3 will visualize these features to check if they look reasonable. Tutorial 4 will test if the distributions match random expectations. Tutorial 5 will use these features to train machine learning models.

All of those steps assume you have a feature matrix. Without feature engineering, you cannot do statistical analysis. Without features, you cannot train models. This tutorial is the bridge from raw data to structured data.

The 15 features you built here are not the only possible features. They are a starting point. As you learn more about lottery analysis, you can design new features that capture different aspects of the draws. Feature engineering is an iterative process.