
Tutorial 7: The Dimensionality Crisis

When manual analysis becomes impossible: discovering why high-dimensional pattern search requires neural networks

Big Picture

Tutorials 4-6 tested individual aspects of randomness: aggregate frequencies (chi-squared), single-ball probabilities (Beta-Binomial), and all 69 ball frequencies at once (Dirichlet-Multinomial). Every test passed. The lottery appears fair.

But these tests make a critical assumption: independence. They assume Ball 23's probability is unaffected by Ball 41. They assume the machine does not change over time. They assume patterns do not exist across sequences of draws.

This tutorial explores what happens when we try to test relationships, temporal patterns, and conditional dependencies. You will discover why the combinatorial explosion of possibilities makes manual analysis impossible, and why automated pattern search (neural networks) becomes necessary.

What you will be able to do by the end
You will understand the difference between testing specified hypotheses and searching high-dimensional spaces. You will see why classical methods (Bayesian or frequentist) cannot scale to millions of pattern combinations. You will understand why neural networks are not "better" than Bayesian models, but rather solve a fundamentally different problem: automated discovery in spaces too large for humans to specify.

1. Where We Are in the Journey

Tutorials 1-3 built clean data and features. Tutorials 4-6 tested whether individual ball frequencies are consistent with a fair lottery. All tests passed. We have strong evidence that no single ball is biased.

But "no individual ball is biased" is not the same as "the lottery is random." There could be patterns we have not tested yet. Maybe Ball 23 and Ball 41 appear together too often. Maybe the machine changed in 2020. Maybe patterns exist across time.

This tutorial asks: can we test ALL possible patterns? The answer will lead us to neural networks.

2. Eliminating Simple Explanations

Before diving into complex patterns, we need to rule out two simple possibilities: temporal dependence and machine drift.

Definition: Runs Test
A runs test checks whether a sequence appears random or contains patterns. A "run" is a consecutive sequence of values above (or below) the median. If draws are independent, the number of runs should match the expected value. Too few runs suggests clustering. Too many runs suggests oscillation.
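
To make "counting runs" concrete, here is a toy illustration (separate from the tutorial script) on a six-value sequence:

Python
import numpy as np

# Toy sequence of draw means; the median is 4.5, so the above/below
# pattern is [F, F, T, T, F, T] -> runs: FF | TT | F | T = 4 runs
values = np.array([3, 2, 6, 7, 1, 8])
above = values > np.median(values)

# A new run starts wherever the above/below flag flips
n_runs = 1 + int(np.sum(above[1:] != above[:-1]))
print(n_runs)  # 4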

We test: Are consecutive draws independent? Or do high draws follow high draws (clustering)?

Definition: Changepoint Detection
Changepoint detection tests whether a data distribution shifts over time. If the lottery machine changed (new vendor, mechanical adjustment), ball frequencies in the early period should differ from the late period.

We test: Did the machine change between 2015 and 2024? Or have ball frequencies stayed stable?
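
As a sanity check that the method itself has power (illustration only, not part of the tutorial script), here is a minimal sketch where we plant a frequency shift and confirm the split-half chi-squared comparison flags it:

Python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Early period: 69 balls drawn uniformly.
early = rng.multinomial(3000, np.full(69, 1 / 69))

# Late period: ball 1 is drawn three times too often (the planted drift).
weights = np.full(69, 1.0)
weights[0] = 3.0
late = rng.multinomial(3000, weights / weights.sum())

# Rescale expected counts so totals match, then compare distributions.
chi2, p = stats.chisquare(late, f_exp=early * late.sum() / early.sum())
print(f"p = {p:.3g}")  # tiny p-value: the planted shift is detected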

These are quick tests (30-40 lines of code each). If they pass, we can focus on the harder problem: ball relationships.

Why passing these tests is the scariest result
If the runs test or changepoint test HAD failed, we would know what to look for: temporal clustering or machine drift. We could model those effects explicitly. But both tests PASS. The lottery is stable and independent at the aggregate level. This means if any signal exists, it is hidden deeper than simple statistics can detect. It is buried in the relationships between balls, in conditional dependencies, in high-dimensional patterns that no single test can isolate. Passing these tests does not prove randomness. It proves we need more sophisticated tools.

3. The Ball Pair Problem: Conditional Probability

Tutorial 6 tested whether Ball 23 appears too often. It does (102 times vs expected 92). But we declared it "not suspicious" because the credible interval barely excludes the expected value, and we expect 3-4 false positives with 69 balls.

But what if Ball 23 only appears when Ball 41 also appears? That would suggest a dependency: P(Ball 23 | Ball 41) ≠ P(Ball 23). This is a question about conditional probability.

Definition: Conditional Probability
Conditional probability P(A | B) is the probability of event A occurring given that event B has occurred. In lottery terms: what is the probability of Ball 23 appearing, given that Ball 41 was already drawn?
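
For a single, pre-specified pair the estimate is direct. A minimal sketch (assuming the cleaned `draws` DataFrame loaded in the script in Section 7) comparing P(Ball 23 | Ball 41) with the unconditional P(Ball 23):

Python
ball_cols = ['ball1', 'ball2', 'ball3', 'ball4', 'ball5']

has_23 = draws[ball_cols].eq(23).any(axis=1)
has_41 = draws[ball_cols].eq(41).any(axis=1)

p_23 = has_23.mean()                   # P(Ball 23 in a draw)
p_23_given_41 = has_23[has_41].mean()  # P(Ball 23 | Ball 41 in same draw)

print(f"P(23)      = {p_23:.4f}")
print(f"P(23 | 41) = {p_23_given_41:.4f}")

Under independence the two proportions should agree up to sampling noise. The trouble, as the next subsection shows, is that there are thousands of pairs to check.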

To test this, we build a co-occurrence matrix: how many times each pair of balls appeared together in the same draw.

Manual vs automated pattern search
Matrix size: 69 × 69 = 4,761 cells
Unique pairs: C(69,2) = 2,346 combinations
The problem: Can you interpret 2,346 correlations by hand?

This is already at the limit of human analysis. A heatmap with 4,761 cells is barely readable. But pairs are just the beginning.

Correlations vs causality: the multiple testing problem
With 2,346 unique pairs and a significance threshold of α = 0.05, we expect about 117 pairs (5% of 2,346) to show "significant" correlations purely by chance, even if the lottery is perfectly random. This is the multiple testing problem from Tutorial 6. Finding a few suspicious pairs proves nothing; we would need substantially more than the ~117 chance hits before concluding the lottery is biased. But we cannot check all 2,346 pairs manually. And pairs are just the beginning: triplets give us 52,394 combinations. The problem compounds exponentially.
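
The arithmetic behind those numbers, plus the Bonferroni-corrected threshold we would need to keep the family-wise error rate at 5%, in a short sketch:

Python
from math import comb

alpha = 0.05
n_pairs = comb(69, 2)               # 2,346 pairwise tests

expected_false_positives = alpha * n_pairs  # ~117 chance 'hits'
bonferroni_alpha = alpha / n_pairs          # per-test threshold, ~2.1e-05

print(f"Expected chance hits: {expected_false_positives:.0f}")
print(f"Bonferroni threshold: {bonferroni_alpha:.2e}")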

4. The Combinatorial Explosion

We just analyzed pairs. But the lottery draws 5 balls simultaneously. What about triplets? Quadruplets? All 5-ball combinations?

How complexity grows with dimensions
Pairs (2 balls): 2,346 combinations
Triplets (3 balls): 52,394 combinations
Quadruplets (4 balls): 864,501 combinations
Quintuplets (5 balls): 11,238,513 combinations

If we could check 1 million combinations per second, testing all 5-ball patterns would take about 11 seconds. That sounds manageable. But we also have the time dimension to account for.

We have 1,269 draws. Each combination could appear at any point in time. That is 11,238,513 × 1,269 = 14,261,672,997 possible patterns.

And we have not even considered:

  • Order of appearance within a draw
  • Sequences across multiple draws (Ball A in draw N, Ball B in draw N+1)
  • Interactions with the Powerball number
  • Machine/vendor effects
  • Seasonal patterns, day-of-week effects
The dimensionality crisis
The lottery is not a collection of 69 independent probabilities. It is a single event in an 11-million-dimensional space, repeated across time, with unknown dependencies. Manual analysis cannot explore this space. We need automated search.

5. Why Manual Models Fail

The limitation of every model we have built so far (chi-squared, Beta-Binomial, Dirichlet-Multinomial, runs test, changepoint detection) is that they require us to SPECIFY the pattern we are testing.

Examples of patterns humans cannot check manually
Pattern we can test: "Is Ball 23 biased?"
Pattern we cannot test: "Are there ANY biased 5-ball combinations among 11 million possibilities?"

Bayesian models are excellent for testing specified hypotheses with proper uncertainty quantification. They are not designed for exploratory search across millions of patterns.

Analogy: Looking for a needle in 11 million haystacks
Imagine you are looking for a specific needle in a haystack. Bayesian methods tell you: "Give me a haystack to search, and I will tell you the probability the needle is there, with a credible interval." But if you have 11 million haystacks and do not know which one contains the needle, you need a different tool. You need something that can search ALL haystacks simultaneously.

6. Code Walkthrough: Testing the Limits

The script performs four main analyses:

Part 1: Runs test (temporal independence)

Tests whether consecutive draws are independent or show clustering/oscillation patterns.

Part 2: Changepoint test (machine stability)

Compares ball frequencies in early vs late draws to detect distribution shifts.

Part 3: Co-occurrence matrix (pairwise correlations)

Builds a 69×69 matrix showing which ball pairs appear together more/less than expected.

Part 4: Combinatorial explosion (impossibility proof)

Calculates how many patterns exist and why manual testing is infeasible.

7. Python Implementation

The script is split into 4 parts for clarity. Here is the complete implementation:

Part 1: Runs Test
Python
"""Tutorial 7: The Dimensionality Crisis - When Manual Analysis Becomes Impossible"""

import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# --- Load cleaned data ---

print("Tutorial 7: The Dimensionality Crisis")
print("=" * 60)
print("\nLoading cleaned Powerball data from Tutorial 1...")
draws = pd.read_parquet('data/processed/powerball_clean.parquet')
print(f"Loaded {len(draws)} draws\n")

# --- PART 1: Eliminate Simple Explanations ---

print("\n" + "=" * 60)
print("PART 1: Eliminating Simple Explanations")
print("=" * 60)

# --- Test 1: Runs Test (Temporal Independence) ---

print("\n[Test 1: Runs Test - Are draws independent over time?]\n")

# Extract mean values from each draw
draw_means = []
for _, row in draws.iterrows():
    balls = [row['ball1'], row['ball2'], row['ball3'], row['ball4'], row['ball5']]
    draw_means.append(np.mean(balls))

draw_means = np.array(draw_means)
overall_median = np.median(draw_means)

# Count runs (sequences above/below median)
above_median = draw_means > overall_median
n_runs = 1
for i in range(1, len(above_median)):
    if above_median[i] != above_median[i-1]:
        n_runs += 1

n_above = np.sum(above_median)
n_below = len(above_median) - n_above

# Expected runs under independence
expected_runs = ((2 * n_above * n_below) / (n_above + n_below)) + 1
variance_runs = (2 * n_above * n_below * (2 * n_above * n_below - n_above - n_below)) / \
                ((n_above + n_below)**2 * (n_above + n_below - 1))
std_runs = np.sqrt(variance_runs)

# Z-test
z_score = (n_runs - expected_runs) / std_runs
p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))

print(f"  Observed runs: {n_runs}")
print(f"  Expected runs (if independent): {expected_runs:.2f}")
print(f"  Z-score: {z_score:.4f}")
print(f"  P-value: {p_value:.4f}")

if p_value >= 0.05:
    print(f"  ✓ PASS: Draws appear independent over time (p >= 0.05)")
else:
    print(f"  ✗ FAIL: Evidence of temporal dependence (p < 0.05)")
Part 2: Changepoint Detection
Python
# --- Test 2: Changepoint Detection (Temporal Stability) ---

print("\n[Test 2: Changepoint Detection - Did the machine change?]\n")

# Split data into two halves and compare ball frequency distributions
midpoint = len(draws) // 2
early_draws = draws.iloc[:midpoint]
late_draws = draws.iloc[midpoint:]

# Count ball frequencies in each period
def count_ball_frequencies(draw_data):
    counts = np.zeros(69)
    for _, row in draw_data.iterrows():
        for ball in [row['ball1'], row['ball2'], row['ball3'], row['ball4'], row['ball5']]:
            counts[ball - 1] += 1
    return counts

early_counts = count_ball_frequencies(early_draws)
late_counts = count_ball_frequencies(late_draws)

# Chi-squared test for distribution change.
# scipy requires the observed and expected totals to match, and the two
# halves differ by a few balls when the number of draws is odd, so we
# rescale the expected counts to the late-period total.
expected_counts = early_counts * late_counts.sum() / early_counts.sum()
chi2_stat, p_value_change = stats.chisquare(late_counts, f_exp=expected_counts)

print(f"  Early period: draws 1-{midpoint}")
print(f"  Late period: draws {midpoint+1}-{len(draws)}")
print(f"  Chi-squared statistic: {chi2_stat:.4f}")
print(f"  P-value: {p_value_change:.4f}")

if p_value_change >= 0.05:
    print(f"  ✓ PASS: No evidence of machine change (p >= 0.05)")
else:
    print(f"  ✗ FAIL: Ball frequencies shifted between periods (p < 0.05)")

print("\n" + "-" * 60)
print("Conclusion: The lottery is stable and independent.")
print("The problem isn't time or drift. It's something else...")
print("-" * 60)
Part 3: Ball Pair Correlations
Python
# --- PART 2: The Ball Pair Problem ---

print("\n\n" + "=" * 60)
print("PART 2: The Ball Pair Problem - Conditional Probabilities")
print("=" * 60)

print("\nWe've proven individual balls are fair.")
print("But what if certain balls appear TOGETHER more often than they should?")

# Build co-occurrence matrix
print("\n[Computing all pairwise co-occurrences...]\n")

cooccurrence_matrix = np.zeros((69, 69))

for _, row in draws.iterrows():
    balls = [row['ball1'], row['ball2'], row['ball3'], row['ball4'], row['ball5']]
    # For each pair of balls in this draw
    for i, ball_a in enumerate(balls):
        for ball_b in balls[i+1:]:
            cooccurrence_matrix[ball_a - 1, ball_b - 1] += 1
            cooccurrence_matrix[ball_b - 1, ball_a - 1] += 1  # Symmetric

# Compute expected co-occurrences and correlations
ball_frequencies = np.zeros(69)
for _, row in draws.iterrows():
    for ball in [row['ball1'], row['ball2'], row['ball3'], row['ball4'], row['ball5']]:
        ball_frequencies[ball - 1] += 1

# Expected co-occurrence under independence (rough approximation):
# given ball i is in a draw, ball j can fill any of the other 4 slots,
# so E[count(i, j)] ≈ 4 * freq_i * freq_j / (5 * n_draws).
# (This ignores the without-replacement correction, which is small here.)
expected_cooccurrence = 4 * np.outer(ball_frequencies, ball_frequencies) / (len(draws) * 5)
# Deviation from expectation: positive = pair appears together too often
correlation_matrix = cooccurrence_matrix - expected_cooccurrence

print(f"  Matrix size: 69 × 69 = {69*69} cells")
print(f"  Unique pairs: {69*68//2} combinations")

# Save correlation heatmap
plt.figure(figsize=(14, 12))
sns.heatmap(correlation_matrix, cmap='RdBu_r', center=0, 
            xticklabels=range(1, 70), yticklabels=range(1, 70),
            cbar_kws={'label': 'Deviation from Expected Co-occurrence'})
plt.title('Ball Pair Correlation Matrix\n(69×69 = 4,761 relationships)', 
          fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('outputs/tutorial7/correlation_matrix.png', dpi=150)
plt.close()

print("\n  Saved: correlation_matrix.png")
Part 4: Combinatorial Explosion
Python
# --- PART 3: The Combinatorial Explosion ---

print("\n\n" + "=" * 60)
print("PART 3: The Combinatorial Explosion")
print("=" * 60)

from math import comb

pairs = comb(69, 2)
triplets = comb(69, 3)
quadruplets = comb(69, 4)
quintuplets = comb(69, 5)

print(f"\n  Pairs (2 balls):       {pairs:,}")
print(f"  Triplets (3 balls):    {triplets:,}")
print(f"  Quadruplets (4 balls): {quadruplets:,}")
print(f"  Quintuplets (5 balls): {quintuplets:,}")

print(f"\n  Total 5-ball combinations: {quintuplets:,}")
print(f"  If we check 1 million combinations per second:")
print(f"    Time required: {quintuplets / 1_000_000:.1f} seconds")

print("\n  But we also have TIME.")
print(f"  We have {len(draws)} draws.")
print(f"  That's {quintuplets:,} × {len(draws)} = {quintuplets * len(draws):,} patterns.")

print("\n" + "-" * 60)
print("The dimensionality is IMPOSSIBLE for manual analysis.")
print("-" * 60)

print("\n" + "=" * 60)
print("THE SOLUTION: Automated High-Dimensional Search")
print("=" * 60)

print("\nThis is why neural networks exist.")
print("\nNeural networks solve a DIFFERENT problem:")
print("  - Bayesian models: Test specified hypotheses")
print("  - Neural networks: Search high-dimensional spaces")

print("\nTutorial 8 will build networks to search this space.")
print("If they find nothing, the lottery is provably random.")
print("If they find patterns, we have evidence of bias.\n")

8. How to Run the Script

Prerequisites:
  • You must have run Tutorial 1 first (to create powerball_clean.parquet)
  • Create the output folder:
Create output folder
Shell
mkdir outputs/tutorial7  # on Mac/Linux, use mkdir -p if outputs/ does not exist
Run the script:
Windows (PowerShell)
PowerShell
# Windows (PowerShell)
cd C:\path\to\tutorials
python tutorial7_dimensionality_crisis.py
Mac / Linux (Terminal)
Shell
# Mac / Linux (Terminal)
cd /path/to/tutorials
python3 tutorial7_dimensionality_crisis.py

9. The Bridge to Neural Networks

We have now exhausted manual analysis. Runs test: PASS. Changepoint test: PASS. Individual balls: PASS. Pairwise correlations: at human limit.

But we cannot test all 11 million 5-ball combinations. We cannot check temporal sequences. We cannot explore conditional dependencies automatically.

Neural networks solve this problem. They do not replace Bayesian models. They serve a different purpose:

Bayesian models: Test specified hypotheses with uncertainty quantification
Neural networks: Search high-dimensional spaces for ANY learnable pattern

A neural network trained on lottery data will try to predict future draws based on past draws. If it succeeds (accuracy significantly above random baseline), we have evidence of patterns. If it fails (accuracy = 1/69 ≈ 1.4%), we have strong evidence the lottery is truly random.
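
How much above 1.4% counts as "significantly above"? A minimal sketch using a one-sided binomial test, with hypothetical numbers (1,000 held-out predictions, 25 correct) purely for illustration:

Python
from scipy.stats import binomtest

n_predictions = 1_000   # hypothetical held-out set size
n_correct = 25          # hypothetical hits (2.5% accuracy)
baseline = 1 / 69       # random-guess accuracy, ~1.45%

result = binomtest(n_correct, n_predictions, baseline, alternative='greater')
print(f"Observed accuracy: {n_correct / n_predictions:.2%}")
print(f"P-value vs random baseline: {result.pvalue:.4f}")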

Different architectures for different patterns
Tutorials 9 and 10 will add two complementary architectures. LSTMs (Long Short-Term Memory networks) will search for temporal sequences like those shown in Pattern Type 2 from our reference images; they excel at detecting "Ball 23 appears every 5 draws" or "Ball 41 follows Ball 23 with a lag." Transformers with attention mechanisms will search for multi-factor interactions like Pattern Type 4, where Ball A, Ball B, and draw parity all interact to predict Ball C. Together with the MLP baseline from Tutorial 8, these architectures let us search the entire 11-million-dimensional pattern space. If all of them fail to beat the random baseline, we have computationally exhausted the search for learnable patterns.
Why this matters
This is not about "using AI because it is trendy." This is about using the right tool for the right problem. You cannot manually search 11 million dimensions. You need automation. Neural networks are dimension-searching machines. That is their purpose.

10. What You Now Understand (and the path forward)

You understand the difference between testing individual hypotheses and searching high-dimensional spaces. You know why the combinatorial explosion makes manual analysis infeasible. You see why classical statistical methods (Bayesian or frequentist) cannot scale to millions of pattern combinations.

More importantly, you understand the intellectual progression that leads to neural networks. It is not "old methods bad, new methods good." It is "manual specification works for simple problems, automated search required for complex problems."

We have identified three distinct types of patterns that could exist in lottery data:

Spatial patterns: Do the 108 static features (means, sums, odd/even counts) contain predictive signal?
Temporal patterns: Does Ball 23 appear in cycles? Do sequences emerge across consecutive draws?
Interaction patterns: Does Ball 23 + Ball 41 together predict Ball 58? Do conditional dependencies exist?

Tutorials 8-10 will systematically test each pattern type using the appropriate neural architecture:

Tutorial 8 - Neural Baseline (MLPs): Can static spatial features predict draws?
Tutorial 9 - Sequence Modeling (LSTMs): Can temporal sequence patterns predict draws?
Tutorial 10 - Attention Mechanisms (Transformers): Can conditional interaction patterns predict draws?

Each architecture is designed to solve a specific problem. MLPs process spatial features. LSTMs add temporal memory. Transformers add relational attention. By testing all three systematically, we ensure comprehensive coverage of the pattern space.

The ultimate test of randomness
If all three architectures fail to beat random baseline (1/69 ≈ 1.4% accuracy), we will have computationally exhausted the search for learnable patterns. This is not "we tried a few things and gave up." This is "we systematically tested spatial, temporal, and relational patterns using state-of-the-art machine learning, and found nothing." That is as close to proof of randomness as modern data science can provide.

Tutorial 8 begins with the simplest case: can a neural network learn to predict draws using only the static features we engineered in Tutorial 2?