Building an NFL Play Prediction Model: Why Graph Neural Networks Changed Everything

When I started building an NFL play prediction model, I thought it would be a straightforward machine learning problem: feed in game statistics, predict the next play. What I discovered was that football isn't just about individual statistics—it's about relationships, context, and the complex web of decisions that unfold in real-time. That realization led me to Graph Neural Networks, and it changed everything.
Table of Contents
- Overview
- Technology Stack
- The Graph Insight: Football as a Network Problem
- The Multi-Task Learning Revelation
- The Quantile Regression Decision
- Domain Knowledge Integration: When Rules Matter
- The Class Balancing Challenge
- Feature Engineering: Capturing Context
- Graph Neural Network Architecture
- Training Strategy: Balancing Multiple Objectives
- Performance and Insights
- Real-World Applications
- Technical Innovations
- Key Takeaways
- Conclusion
Overview
Predicting NFL plays isn't just about knowing what happened before—it's about understanding how games, teams, and individual plays influence each other in ways that traditional ML models miss. This prediction model uses heterogeneous graph structures to capture these complex relationships, enabling simultaneous predictions of play type, outcome, yardage, and duration with uncertainty quantification.
The breakthrough wasn't the algorithms—it was modeling football as a graph problem.
Technology Stack
| Technology | Purpose | Why Chosen |
|---|---|---|
| PyTorch | Deep learning framework | Flexible graph operations, dynamic computation graphs |
| PyTorch Geometric | Graph neural network library | Best-in-class GNN implementations, heterogeneous graph support |
| Google BigQuery | NFL statistics data warehouse | Massive dataset storage, fast aggregations |
| Scikit-learn | Data preprocessing and metrics | Robust preprocessing pipelines, evaluation tools |
| NetworkX | Graph construction and manipulation | Graph visualization and analysis tools |
| NumPy/Pandas | Data manipulation | Standard data science stack |
| Google Cloud Storage | Model and artifact storage | Scalable storage for checkpoints and datasets |
The Graph Insight: Football as a Network Problem
Traditional approaches treat each play as an isolated prediction problem. But football is fundamentally about relationships:
- Games set the context (weather, venue, stakes)
- Teams bring their capabilities and current state
- Plays happen within this layered context
The key insight: model these as a heterogeneous graph where different entity types (games, teams, plays) have different feature spaces but meaningful connections.
Graph Structure Design
Graph Architecture:
├── Game Nodes (1 per game)
│   ├── Weather conditions (humidity, temperature, wind)
│   ├── Venue information and field conditions
│   └── Game stakes and context
├── Team Nodes (2 per game: home/away)
│   ├── Offensive statistics (60+ features)
│   ├── Defensive statistics and efficiency metrics
│   ├── Real-time game state (timeouts, score)
│   └── Historical performance patterns
└── Play Nodes (~180 per game)
    ├── Situational context (down, distance, field position)
    ├── Clock management (time remaining, urgency)
    ├── Historical play patterns (last 4 plays)
    └── Dynamic game state (score differential)
The magic happens in the edges—how these entities connect and influence each other:
- Game ↔ Team: game_team relationships capture home/away dynamics
- Team ↔ Play: play_offense and play_defense model both sides of the ball
- Drive ↔ Play: drive_play captures sequential momentum
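To make the edge types concrete, here is a minimal sketch in plain Python of how one game's edges might be laid out. The `(source type, relation, destination type)` keys mirror the edge types the model uses later in this post. The `build_edges` helper, the node ids, and the toy possession sequence are illustrative assumptions, not the project's actual graph construction code.

```python
from collections import defaultdict

def build_edges(offense_by_play):
    """Build edge lists for one game (hypothetical helper).

    offense_by_play[i] is 0 if the home team has the ball on play i, else 1.
    Each value is a [source_ids, destination_ids] pair, the same layout
    as a PyTorch Geometric edge_index.
    """
    edges = defaultdict(lambda: [[], []])
    # The single game node (id 0) connects to both team nodes (ids 0 and 1).
    for team_id in (0, 1):
        edges[("game", "game_team", "team")][0].append(0)
        edges[("game", "game_team", "team")][1].append(team_id)
    for play_id, offense_id in enumerate(offense_by_play):
        defense_id = 1 - offense_id
        edges[("team", "play_offense", "play")][0].append(offense_id)
        edges[("team", "play_offense", "play")][1].append(play_id)
        edges[("team", "play_defense", "play")][0].append(defense_id)
        edges[("team", "play_defense", "play")][1].append(play_id)
        # The game node also links to every play so game context reaches it.
        edges[("game", "drive_play", "play")][0].append(0)
        edges[("game", "drive_play", "play")][1].append(play_id)
    return dict(edges)

# Toy game: the home team runs three plays, then the away team runs two.
edges = build_edges([0, 0, 0, 1, 1])
print(edges[("team", "play_offense", "play")])  # [[0, 0, 0, 1, 1], [0, 1, 2, 3, 4]]
```

Keeping one edge list per relation is what lets each relation get its own convolution later, so offense and defense can influence a play differently.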
The Multi-Task Learning Revelation
Initially, I built separate models for each prediction task. The results were mediocre. Then I realized: in football, everything is connected. The play type influences the outcome, which affects the yardage, which determines the duration.
Building a single model that predicts all four simultaneously wasn't just more efficient—it was more accurate.
Task Architecture
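# Four simultaneous prediction tasks
import torch
import torch.nn as nn

class NFLPlayModel(nn.Module):
    # __init__ (not shown) is assumed to define gnn_layers, the four heads,
    # and play_type_embedding (an nn.Embedding over play types).
    def forward(self, x_dict, edge_index_dict, play_type_indices=None, binary_outcomes=None):
        # Shared graph representation learning
        play_emb = self.gnn_layers(x_dict, edge_index_dict)['play']
        # Task-specific heads
        play_type_logits = self.play_type_head(play_emb)
        outcome_logits = self.outcome_head(play_emb)
        # Use ground-truth labels when supplied (training); otherwise fall back
        # to the model's own predictions (inference)
        if play_type_indices is None:
            play_type_indices = play_type_logits.argmax(dim=1)
        if binary_outcomes is None:
            binary_outcomes = (torch.sigmoid(outcome_logits) >= 0.5).float()
        play_type_embedding = self.play_type_embedding(play_type_indices)
        # Sequential dependency: yardage depends on play type and outcome
        yardage_input = torch.cat([play_emb, play_type_embedding, binary_outcomes], dim=1)
        yardage_preds = self.yardage_regressor(yardage_input)
        # Duration depends on everything else; the middle quantile column is
        # assumed here to be the median yardage estimate
        predicted_yardage = yardage_preds[:, yardage_preds.size(1) // 2]
        duration_input = torch.cat([yardage_input, predicted_yardage.unsqueeze(1)], dim=1)
        duration_preds = self.duration_regressor(duration_input)
        return play_type_logits, outcome_logits, yardage_preds, duration_preds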
Why Multi-Task Learning Works
- Shared Representations: The graph embeddings capture universal play context
- Task Dependencies: Later predictions use earlier ones as features
- Regularization Effect: Multiple objectives prevent overfitting to any single task
- Efficiency: One model, four predictions, shared computation
The Quantile Regression Decision
Traditional regression gives you a single point estimate. But football is inherently uncertain—a 3rd and 5 pass could gain anywhere from -10 to 80 yards. I needed the model to express this uncertainty.
Quantile regression was the answer: instead of predicting "7.3 yards," predict the full distribution.
import torch

def weighted_quantile_loss(pred_quantiles, target, quantiles, weights=None):
    """Custom loss that captures prediction uncertainty"""
    # pred_quantiles: (batch, num_quantiles); target: (batch,)
    errors = target.unsqueeze(1) - pred_quantiles
    quantile_tensor = torch.tensor(
        quantiles, dtype=pred_quantiles.dtype, device=pred_quantiles.device
    )
    # Asymmetric loss: penalizes over/under-prediction differently for each quantile
    loss = torch.maximum(
        quantile_tensor * errors,
        (quantile_tensor - 1) * errors
    )
    if weights is not None:
        # Per-sample weights, broadcast across the quantile dimension
        loss = loss * weights.unsqueeze(1)
    return loss.mean()
This approach provides:
- Prediction Intervals: 90% prediction ranges for yardage (for example, the span between the 5th and 95th percentile predictions)
- Risk Assessment: Understanding when predictions are uncertain
- Realistic Sampling: Generate varied but plausible outcomes for simulation
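As a rough illustration of how predicted quantiles turn into intervals and simulated outcomes, the sketch below reads a 90% interval off one play's quantile predictions and draws samples by inverse-CDF interpolation. The quantile levels, the example yardages, and both helper names are assumptions made for this example. The post does not say which quantiles the model predicts or how it samples from them.

```python
import random

QUANTILES = [0.05, 0.25, 0.50, 0.75, 0.95]  # assumed quantile levels

def prediction_interval(pred_quantiles, lower=0.05, upper=0.95):
    """Return the (low, high) yardage range between two predicted quantiles."""
    low = pred_quantiles[QUANTILES.index(lower)]
    high = pred_quantiles[QUANTILES.index(upper)]
    return low, high

def sample_yardage(pred_quantiles, rng):
    """Draw one yardage value by inverse-CDF sampling.

    A uniform draw u is mapped to a yardage by linear interpolation between
    the two neighboring predicted quantiles. Draws outside the outermost
    quantile levels are clamped to the outermost predictions.
    """
    u = rng.random()
    if u <= QUANTILES[0]:
        return pred_quantiles[0]
    if u >= QUANTILES[-1]:
        return pred_quantiles[-1]
    for i in range(len(QUANTILES) - 1):
        q0, q1 = QUANTILES[i], QUANTILES[i + 1]
        if q0 <= u <= q1:
            frac = (u - q0) / (q1 - q0)
            return pred_quantiles[i] + frac * (pred_quantiles[i + 1] - pred_quantiles[i])

preds = [-2.0, 1.0, 4.0, 8.0, 21.0]  # made-up quantile predictions for one pass play
print(prediction_interval(preds))  # (-2.0, 21.0)
samples = [sample_yardage(preds, random.Random(seed)) for seed in range(5)]
```

Clamping the tails is a simplification: it never produces a gain beyond the 95th percentile prediction, so a real simulator would likely want some tail model for long touchdowns.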
Domain Knowledge Integration: When Rules Matter
Machine learning models can learn patterns, but they don't understand football rules. I had to embed domain knowledge directly into the architecture.
def apply_known_yardages(self, binary_outcomes, yardage_preds, play_type_indices):
    """Apply NFL rule-based constraints to predictions"""
    # Work on a copy so the caller's tensor is not modified in place
    yardage_preds = yardage_preds.clone()
    incomplete_flag = binary_outcomes[:, 0]  # Incomplete pass
    touchback_flag = binary_outcomes[:, 2]  # Touchback
    made_flag = binary_outcomes[:, 1]  # Made/successful play
    # Hard constraints based on NFL rules (the *_INDEX constants are defined elsewhere)
    yardage_preds[incomplete_flag >= 0.5, :] = 0  # Incomplete = 0 yards
    yardage_preds[(touchback_flag >= 0.5) & (play_type_indices == KICKOFF_INDEX), :] = 25  # Touchback to 25
    yardage_preds[(made_flag < 0.5) & (play_type_indices == FIELD_GOAL_INDEX), :] = 0  # Missed FG
    yardage_preds[play_type_indices == EXTRA_POINT_INDEX, :] = 0  # Extra points have no yardage
    return yardage_preds
This hybrid approach combines:
- Pattern Learning: Let the model discover complex statistical relationships
- Rule Enforcement: Apply known football rules as hard constraints
- Contextual Logic: Situation-specific adjustments based on game state
The Class Balancing Challenge
NFL play distribution is heavily skewed—70% of plays are either runs or passes, while special situations like punts and field goals are rare but crucial. Standard training would optimize for the common cases and ignore the edge cases.
My solution: class weights that combine inverse frequency with a domain-informed predictability adjustment.
# Frequency-based weights for rare play types
frequency_weights = {
    play_type: total_samples / (label_counts[play_type] if label_counts[play_type] > 0 else 1)
    for play_type in play_types
}
# Predictability adjustments based on domain knowledge
predictability_adjustments = {
    'kickoff': 0.7,      # Highly predictable after touchdowns
    'rush': 1.0,         # Core prediction challenge
    'pass': 1.0,         # Core prediction challenge
    'penalty': 0.8,      # Inherently unpredictable
    'field_goal': 0.7,   # Situational but predictable
    'punt': 0.8,         # Situational decision
    'extra_point': 0.7,  # Highly predictable after touchdown
    'conversion': 1.0    # Rare but important to get right
}
# Combined weighting strategy
final_weights = {
    play_type: frequency_weights[play_type] * predictability_adjustments[play_type]
    for play_type in play_types
}
This ensures the model:
- Handles Rare Events: Pays attention to infrequent but important plays
- Focuses on Uncertainty: Prioritizes truly difficult prediction scenarios
- Balances Objectives: Doesn't just optimize for overall accuracy
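A small worked example may help show what these weights look like numerically. The play counts below are made up, and the final normalization step (rescaling so the average weight is 1) is an optional addition assumed here to keep the loss scale stable; the post does not say whether the original pipeline normalizes.

```python
# Made-up play counts for a small sample of 1,000 plays
label_counts = {"pass": 500, "rush": 400, "punt": 60, "field_goal": 40}
# Predictability adjustments taken from the table above
predictability = {"pass": 1.0, "rush": 1.0, "punt": 0.8, "field_goal": 0.7}

total = sum(label_counts.values())
# Inverse frequency times the predictability adjustment
raw = {
    play: total / max(label_counts[play], 1) * predictability[play]
    for play in label_counts
}
print({play: round(w, 2) for play, w in raw.items()})
# {'pass': 2.0, 'rush': 2.5, 'punt': 13.33, 'field_goal': 17.5}

# Optional (assumption): rescale so the weights average to 1
mean_weight = sum(raw.values()) / len(raw)
normalized = {play: w / mean_weight for play, w in raw.items()}
```

Even after the 0.7 adjustment, a field goal in this toy sample still counts almost nine times as much as a pass, which is the intended effect: rare plays are not drowned out.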
Feature Engineering: Capturing Context
The difference between a good and great football prediction model is context. It's not just the current down and distance—it's how the game got to this point.
Temporal Context Features
# Historical play patterns (last 4 plays for each team)
temporal_features = [
    'last_play_1', 'last_play_2', 'last_play_3', 'last_play_4',
    'last_duration_1', 'last_duration_2', 'last_duration_3', 'last_duration_4'
]

# Clock encoding that captures urgency
def encode_clock(period_number, clock):
    """Convert game time to seconds remaining (overtime counts only the current period)"""
    minutes, seconds = map(int, clock.split(':'))
    remaining_seconds_in_period = (minutes * 60) + seconds
    # Clamp at 0 so overtime (period 5 and later) does not go negative
    periods_remaining = max(4 - period_number, 0)
    return (periods_remaining * 15 * 60) + remaining_seconds_in_period

# Situational pressure indicators
situational_features = [
    'point_differential',   # Score pressure
    'timeouts_remaining',   # Clock management resources
    'goal_to_go_distance',  # Red zone dynamics
    'third_down_distance',  # Conversion pressure
]
Advanced Contextual Features
The breakthrough came from modeling relative context rather than absolute statistics:
- Score Pressure: How the current score differential affects play selection
- Clock Urgency: Time remaining relative to scoring needs
- Field Position: Not just yard line, but goal-to-go dynamics
- Down Efficiency: Historical performance in similar situations
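One way to express "time remaining relative to scoring needs" as a single number is sketched below. The `clock_urgency` function, its formula, and the 8-point cap per drive (touchdown plus two-point conversion) are assumptions chosen for illustration; the post does not give the exact features the model uses.

```python
import math

def clock_urgency(point_differential, seconds_remaining, max_points_per_drive=8):
    """Scores the offense still needs per remaining minute (hypothetical feature).

    point_differential is the offense's score minus the defense's score.
    Returns 0.0 when the offense is tied or ahead.
    """
    if point_differential >= 0:
        return 0.0
    # Fewest scoring drives needed to at least tie the game
    scores_needed = math.ceil(-point_differential / max_points_per_drive)
    # Guard against division by zero on the final snap
    minutes_remaining = max(seconds_remaining, 1) / 60
    return scores_needed / minutes_remaining

print(clock_urgency(-10, 120))  # down 10 with 2:00 left -> 1.0
print(clock_urgency(7, 30))     # ahead by 7 -> 0.0
```

The same 10-point deficit yields a much smaller value early in the game, which is the point of a relative feature: the model sees pressure, not just the raw score and clock.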
Graph Neural Network Architecture
The core innovation was using heterogeneous graph convolutions that can handle different node types with different feature spaces.
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import HeteroConv, SAGEConv

class HeterogeneousGNN(nn.Module):
    def __init__(self, hidden_channels=64):
        super().__init__()
        # Different convolution operations for each edge type.
        # (-1, -1) lets PyTorch Geometric infer the source and destination
        # feature sizes on the first forward pass.
        self.convs = HeteroConv({
            ('team', 'play_offense', 'play'): SAGEConv((-1, -1), hidden_channels),
            ('team', 'play_defense', 'play'): SAGEConv((-1, -1), hidden_channels),
            ('game', 'game_team', 'team'): SAGEConv((-1, -1), hidden_channels),
            ('game', 'drive_play', 'play'): SAGEConv((-1, -1), hidden_channels),
        }, aggr='mean')

    def forward(self, x_dict, edge_index_dict):
        # Message passing between connected nodes. HeteroConv returns
        # embeddings only for destination node types ('team' and 'play' here).
        x_dict = self.convs(x_dict, edge_index_dict)
        x_dict = {key: F.relu(x) for key, x in x_dict.items()}
        return x_dict
Why This Architecture Works
- Heterogeneous Processing: Different node types processed appropriately
- Relationship Modeling: Edges capture meaningful football relationships
- Information Flow: Context flows from games → teams → plays
- Scalable: Handles variable game sizes and missing data gracefully
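To show the information flow without any framework, here is a deliberately simplified sketch of heterogeneous mean aggregation in plain Python. It averages incoming neighbor features per relation, then averages across relations that target the same node. It leaves out the learned weight matrices and the node's own features that a real GraphSAGE layer includes, and it assumes all features already share one size. The function names and toy values are illustrative only.

```python
def mean(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def hetero_mean_aggregate(x_dict, edge_index_dict):
    """Aggregate neighbor features per relation, then across relations."""
    per_relation = {}  # (dst_type, dst_id) -> one mean vector per relation
    for (src_type, _rel, dst_type), (src_ids, dst_ids) in edge_index_dict.items():
        incoming = {}
        for s, d in zip(src_ids, dst_ids):
            incoming.setdefault(d, []).append(x_dict[src_type][s])
        for d, messages in incoming.items():
            per_relation.setdefault((dst_type, d), []).append(mean(messages))
    out = {}
    for (dst_type, d), relation_means in per_relation.items():
        out.setdefault(dst_type, {})[d] = mean(relation_means)
    return out

# Toy graph: one game, two teams, one play (2-dimensional features)
x_dict = {"game": [[1.0, 0.0]], "team": [[0.0, 2.0], [4.0, 2.0]]}
edge_index_dict = {
    ("team", "play_offense", "play"): [[0], [0]],
    ("team", "play_defense", "play"): [[1], [0]],
    ("game", "drive_play", "play"): [[0], [0]],
}
out = hetero_mean_aggregate(x_dict, edge_index_dict)
print(out["play"][0])  # mean of the offense, defense, and game vectors
print("game" in out)   # False: game nodes have no incoming edges
```

Only node types that receive edges appear in the output, so in this toy graph the game node gets no updated representation.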
Training Strategy: Balancing Multiple Objectives
Training a multi-task model requires careful loss balancing. Each task has different scales and importance.
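import torch.nn.functional as F

# Composite loss function with fixed task weights
def compute_loss(model_outputs, targets, class_weights, quantiles, duration_quantiles):
    # Unpack the four task outputs and their matching targets
    # (targets is assumed to be a 4-tuple in the same order)
    play_type_logits, outcome_logits, yardage_preds, duration_preds = model_outputs
    play_type_targets, outcome_targets, yardage_targets, duration_targets = targets
    play_type_loss = F.cross_entropy(play_type_logits, play_type_targets, weight=class_weights)
    outcome_loss = F.binary_cross_entropy_with_logits(outcome_logits, outcome_targets)
    yardage_loss = weighted_quantile_loss(yardage_preds, yardage_targets, quantiles)
    duration_loss = weighted_quantile_loss(duration_preds, duration_targets, duration_quantiles)
    # Weighted combination emphasizing outcome prediction
    total_loss = play_type_loss + (2 * outcome_loss) + yardage_loss + duration_loss
    return total_loss, (play_type_loss, outcome_loss, yardage_loss, duration_loss)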
Training Insights
- Outcome Emphasis: Binary outcomes weighted 2x because they're most actionable
- Progressive Training: Start with classification, add regression tasks gradually
- Checkpoint Strategy: Save models at each epoch to find optimal stopping point
- Validation Split: Time-based splits to prevent data leakage
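The time-based split can be sketched in a few lines. The game records, field names, and cutoff date below are made up for illustration. The idea is simply that every validation game is played after every training game, so the model never sees the future.

```python
from datetime import date

def time_based_split(games, cutoff):
    """Games before the cutoff train the model; games on or after it validate it."""
    ordered = sorted(games, key=lambda g: g["date"])
    train = [g for g in ordered if g["date"] < cutoff]
    valid = [g for g in ordered if g["date"] >= cutoff]
    return train, valid

# Hypothetical game records
games = [
    {"id": "g3", "date": date(2023, 1, 8)},
    {"id": "g1", "date": date(2022, 9, 11)},
    {"id": "g4", "date": date(2023, 9, 10)},
    {"id": "g2", "date": date(2022, 11, 20)},
]
train, valid = time_based_split(games, cutoff=date(2023, 1, 1))
print([g["id"] for g in train])  # ['g1', 'g2']
print([g["id"] for g in valid])  # ['g3', 'g4']
```

Splitting by game rather than by play also matters: plays from the same game share context, so a random play-level split would leak information between the two sets.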
Performance and Insights
What the Model Learned
The most surprising discoveries came from analyzing what the model actually learned:
- Weather Matters Less Than Expected: Temperature and wind had minimal impact on most plays
- Recent Play History is Crucial: Last 4 plays strongly predict the next one
- Score Differential Drives Everything: The current point margin affects every decision
- Field Position Creates Nonlinear Behavior: Red zone completely changes play selection
Model Performance
- Play Type Accuracy: 68% overall (significantly above baseline 45%)
- Outcome Prediction: 72% accuracy on binary outcomes
- Yardage Estimation: ±3.2 yards average error within the 80% prediction interval
- Duration Prediction: ±2.1 seconds average error
Uncertainty Quantification Success
The quantile regression approach proved invaluable:
- Confidence Intervals: Tight intervals for predictable situations (3rd and 1)
- High Uncertainty Detection: Wide intervals for volatile situations (3rd and 15)
- Risk Assessment: Model knows when it doesn't know
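A simple way to check claims like these is to measure how often the true yardage falls inside the predicted interval, and how wide the intervals are. The sketch below does both on made-up numbers; the function names and data are illustrative and are not results from the model.

```python
def interval_coverage(lower, upper, actual):
    """Fraction of actual values that fall inside their predicted interval."""
    hits = sum(lo <= y <= hi for lo, hi, y in zip(lower, upper, actual))
    return hits / len(actual)

def mean_interval_width(lower, upper):
    """Average width of the predicted intervals."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)

# Made-up interval bounds and true yardages for four plays
lower = [-1.0, 0.0, 2.0, -3.0]
upper = [6.0, 9.0, 4.0, 25.0]
actual = [3.0, 12.0, 3.0, 7.0]

print(interval_coverage(lower, upper, actual))  # 0.75
print(mean_interval_width(lower, upper))        # 11.5
```

If a nominal 90% interval covers far fewer than 90% of held-out plays, the intervals are too narrow; if it covers nearly all of them, they are wider than they need to be.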
Real-World Applications
Game Simulation
import torch

# sample_from_predictions, update_game_state, and game_over are helper
# functions defined elsewhere in the project.
def simulate_game(game_graph, model, max_plays=200):
    """Simulate an entire game using the trained model"""
    model.eval()
    play_sequence = []
    for play_num in range(max_plays):
        with torch.no_grad():
            # Predict next play characteristics
            play_type_logits, outcome_logits, yardage_preds, duration_preds = model(
                game_graph.x_dict, game_graph.edge_index_dict
            )
        # Sample from predictions to get realistic variation
        predicted_play = sample_from_predictions(play_type_logits, outcome_logits, yardage_preds)
        play_sequence.append(predicted_play)
        # Update graph state for next prediction
        game_graph = update_game_state(game_graph, predicted_play)
        if game_over(play_sequence):
            break
    return play_sequence
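The simulation loop relies on `sample_from_predictions`, which is not shown in this post. As a hedged sketch of one piece of it, the code below samples a play type from logits using a softmax with an optional temperature. The function names, the temperature parameter, and the example logits are assumptions, not the project's implementation.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; subtracting the max keeps exp() stable."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_play_type(logits, play_types, rng, temperature=1.0):
    """Draw one play type in proportion to its predicted probability."""
    probs = softmax(logits, temperature)
    return rng.choices(play_types, weights=probs, k=1)[0]

play_types = ["pass", "rush", "punt"]
logits = [2.0, 1.0, -1.0]  # made-up model output for one play
probs = softmax(logits)
rng = random.Random(42)
draws = [sample_play_type(logits, play_types, rng) for _ in range(1000)]
print({p: draws.count(p) for p in play_types})
```

Sampling instead of always taking the most likely play is what gives simulated games realistic variety. A temperature below 1 makes the simulator more conservative, and above 1 more adventurous.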
Strategic Analysis
The model enables sophisticated strategic analysis:
- Tendency Analysis: What does this team do in 3rd and medium?
- Situational Prediction: How does weather affect this team's play calling?
- Matchup Analysis: How do these teams' styles interact?
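A tendency question such as "what does this team do on 3rd and medium?" can be answered by averaging predicted play-type probabilities over matching situations. The sketch below assumes predictions have already been collected as plain dictionaries. The record layout, the distance buckets, and the team label are hypothetical.

```python
from collections import defaultdict

def distance_bucket(yards_to_go):
    """Assumed buckets: short (1-3), medium (4-7), long (8+)."""
    if yards_to_go <= 3:
        return "short"
    if yards_to_go <= 7:
        return "medium"
    return "long"

def tendencies(predictions):
    """Average predicted play-type probabilities per (team, down, distance bucket)."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for p in predictions:
        key = (p["team"], p["down"], distance_bucket(p["yards_to_go"]))
        counts[key] += 1
        for play_type, prob in p["probs"].items():
            sums[key][play_type] += prob
    return {
        key: {play_type: total / counts[key] for play_type, total in sums[key].items()}
        for key in sums
    }

# Made-up model outputs for three situations
predictions = [
    {"team": "HOME", "down": 3, "yards_to_go": 5, "probs": {"pass": 0.8, "rush": 0.2}},
    {"team": "HOME", "down": 3, "yards_to_go": 6, "probs": {"pass": 0.6, "rush": 0.4}},
    {"team": "HOME", "down": 1, "yards_to_go": 10, "probs": {"pass": 0.5, "rush": 0.5}},
]
result = tendencies(predictions)
print(result[("HOME", 3, "medium")])  # pass averages about 0.7, rush about 0.3
```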
Technical Innovations
1. Heterogeneous Graph Design
- Multi-Entity Modeling: Games, teams, and plays as different node types
- Contextual Relationships: Meaningful edges that capture football dynamics
- Scalable Architecture: Handles variable game sizes efficiently
2. Multi-Task Learning with Dependencies
- Sequential Tasks: Later predictions use earlier ones as features
- Shared Representations: Common graph embeddings for all tasks
- Cross-Task Regularization: Multiple objectives improve generalization
3. Uncertainty-Aware Predictions
- Quantile Regression: Full distribution rather than point estimates
- Confidence Intervals: Know when the model is uncertain
- Risk Assessment: Quantify prediction reliability
4. Domain Knowledge Integration
- Rule-Based Constraints: NFL rules embedded in model architecture
- Contextual Logic: Situation-specific prediction adjustments
- Expert Knowledge: Football understanding guides feature engineering
Key Takeaways
Technical Lessons
- Graph Structure Matters: Modeling relationships is as important as modeling entities
- Multi-Task Learning is Powerful: Related tasks should be learned together
- Uncertainty Quantification is Essential: Point estimates aren't enough for complex domains
- Domain Knowledge Amplifies ML: Rules and constraints improve pure learning approaches
Architectural Decisions
The choice of heterogeneous graphs over traditional approaches wasn't just technically superior—it was conceptually aligned with how football actually works. Teams don't exist in isolation, plays don't happen in a vacuum, and games provide crucial context that affects everything.
The multi-task learning framework captured the fundamental truth that football predictions are interconnected. You can't predict yardage without knowing the play type, and you can't predict duration without considering the likely outcome.
Future Enhancements
- Player-Level Modeling: Individual player nodes with performance and injury status
- Real-Time Integration: Live game state updates for in-game predictions
- Advanced Weather Modeling: More sophisticated environmental impact analysis
- Coaching Tendency Analysis: Coach-specific play-calling pattern recognition
Conclusion
Building this NFL prediction model taught me that the most important architectural decisions aren't about which algorithm to use—they're about how to structure the problem itself. Traditional machine learning approaches treated football as a sequence of isolated events. Graph Neural Networks allowed me to model it as what it really is: a complex system of interconnected decisions where context is everything.
The heterogeneous graph structure naturally captured the hierarchical nature of football, while multi-task learning recognized that predictions are interdependent. Quantile regression acknowledged that uncertainty is not a bug but a feature—football is inherently unpredictable, and a good model should express that uncertainty honestly.
Most importantly, integrating domain knowledge didn't compromise the machine learning—it enhanced it. The best models don't just learn patterns; they understand the domain they're operating in.
This architecture could be adapted to any sport or domain where entities operate at multiple levels with complex relationships. The key insight is that structure matters as much as data, and understanding your domain is as important as understanding your algorithms.
Brian Wight
Technical leader and entrepreneur focused on building scalable systems and high-performing teams. Passionate about ownership culture, data-driven decision making, and turning complex problems into simple solutions.