The Sports Computer Vision Stack: How Video Becomes Structured Data
The AI pipeline that converts football video into structured game state, from player detection and tracking through tactical reasoning and natural language queries.
A football match generates 90 minutes of continuous video. Twenty-two players, three officials, and one ball move through a space 105 metres long. From that raw footage, a modern sports intelligence system needs to produce structured output: who was where, when, doing what, in relation to whom. The system that converts video into that structured "game state" is not a single model. It is a pipeline of interconnected AI models, each solving a specific perceptual or reasoning problem, each depending on the outputs of the layers beneath it.
This is the sports computer vision stack. Understanding it as a layered system, rather than as a collection of isolated models, is the key to building video intelligence that actually works in production.
The perception layer
The bottom half of the stack solves perception: what is happening in the frame, where, and to whom.
At the base, object detection identifies players, referees, and the ball in each frame. Modern football detection runs on transformer-based architectures that have evolved significantly from general-purpose detectors. The Football Transformer (FoT), published in 2025, targets the specific challenges of broadcast football: small objects, motion blur, and multi-view switching. It improves mAP by 3.0% over prior baselines on the Soccer-Det benchmark while maintaining real-time inference speed. But detection difficulty varies dramatically with camera angle. TeamTrack benchmarks show that a pre-trained detector's mAP can be as low as 1.4 on top-down views, versus 52.7 on side views of the same sport after fine-tuning. The camera determines the detection problem.
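The mAP figures quoted here are built on intersection-over-union (IoU), the overlap measure between predicted and ground-truth boxes, and classic detectors also use it in non-maximum suppression to collapse duplicate predictions (DETR-style transformers largely learn to avoid duplicates instead). A minimal plain-Python sketch, with illustrative box format and threshold:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(detections, iou_threshold=0.5):
    """Greedy non-maximum suppression over (box, score) detections:
    keep the highest-scoring box, drop everything that overlaps it."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining if iou(best[0], d[0]) < iou_threshold]
    return kept
```

The small-object problem is visible even here: a ball a few pixels wide has a tiny box area, so small localisation errors swing the IoU wildly.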
Detection feeds into multi-object tracking (MOT), which associates detections across frames to maintain consistent player trajectories through the match. ByteTrack and BoT-SORT provide strong baselines, but football breaks the assumptions of generic trackers. SportsMOT benchmarks reveal a counterintuitive finding: removing Kalman filter motion prediction actually improves tracking accuracy (HOTA from 64.1 to 71.5), because linear motion models fail when players change direction suddenly in clustered play. In sports tracking, association quality matters more than motion prediction.
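The SportsMOT finding suggests a surprisingly simple baseline: link detections frame to frame purely by IoU overlap with each track's last known box, no motion model at all. A minimal sketch of that association step (the greedy matching and the 0.3 threshold are illustrative choices, not ByteTrack's or BoT-SORT's actual implementation):

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def associate(tracks, detections, min_iou=0.3):
    """Greedily match tracks [(track_id, last_box)] to new detections by
    IoU alone: no Kalman prediction, the last box is compared directly."""
    pairs = sorted(
        ((box_iou(last_box, det), ti, di)
         for ti, (_, last_box) in enumerate(tracks)
         for di, det in enumerate(detections)),
        reverse=True)
    matched_t, matched_d, out = set(), set(), []
    for score, ti, di in pairs:
        if score < min_iou or ti in matched_t or di in matched_d:
            continue
        matched_t.add(ti)
        matched_d.add(di)
        out.append((tracks[ti][0], detections[di]))  # (track_id, new box)
    return out
```

At broadcast frame rates the overlap between consecutive boxes is usually large, which is why pure association can beat a linear motion model that guesses wrong on sudden direction changes.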
Running in parallel, field registration maps pixel coordinates to pitch coordinates through camera calibration and homography estimation. This layer is invisible to end users but is the highest-leverage component in the entire pipeline. SoccerNet's Game State Reconstruction benchmark demonstrates why: minor calibration inaccuracies can cause major pitch registration errors, collapsing system-level metrics to near zero even when all other models perform well. TVCalib and PnLCalib represent the current frontier, moving from brittle keypoint matching to robust camera parameter estimation via differentiable reprojection.
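Once camera parameters are estimated, the mapping itself reduces to a 3x3 homography applied in homogeneous coordinates. A sketch, with a made-up matrix that simply scales a 1920x1080 frame onto a 105 m by 68 m pitch (real matrices come out of the calibration step and also encode perspective):

```python
def pixel_to_pitch(H, u, v):
    """Map a pixel (u, v) to pitch coordinates via homography H (3x3,
    row-major nested lists), dividing by the homogeneous coordinate w."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return x / w, y / w

# Illustrative matrix only: a pure scaling from a 1920x1080 frame to a
# 105 m x 68 m pitch. A real broadcast homography encodes perspective,
# so the bottom row is generally not (0, 0, 1).
H = [[105 / 1920, 0, 0],
     [0, 68 / 1080, 0],
     [0, 0, 1]]
```

The leverage of this layer shows up in the division by w: for points near the horizon, w approaches zero, and tiny pixel errors in the estimated parameters become metres of error on the pitch.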
On top of tracking sits the identity layer, which resolves who each tracked player actually is. This combines re-identification embeddings, jersey number OCR, and team affiliation clustering. SoccerNet ReID provides 340,993 player thumbnails from 400 games for benchmarking, and CLIP-based jersey labelling achieves approximately 90-91% accuracy in challenge settings. But broadcast footage routinely yields small, blurred jersey digits and heavy label noise. The gap between "tracking a blob" and "knowing who it is" remains one of the hardest unsolved problems in the field.
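Team affiliation clustering can be illustrated with a toy version: 2-means over each player's mean jersey colour. Production systems cluster re-ID embeddings or colour histograms and must handle goalkeepers, referees, and kit clashes; this sketch assumes clean per-player RGB averages:

```python
def mean(vectors):
    """Component-wise mean of a list of RGB tuples."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(3))

def dist2(a, b):
    """Squared Euclidean distance between two RGB tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_means(colors, iters=10):
    """Cluster per-player mean jersey colours into two teams: seed with
    the two most distant colours, then run standard k-means steps."""
    seed_a = colors[0]
    seed_b = max(colors, key=lambda c: dist2(c, seed_a))
    centres = [seed_a, seed_b]
    labels = [0] * len(colors)
    for _ in range(iters):
        labels = [0 if dist2(c, centres[0]) <= dist2(c, centres[1]) else 1
                  for c in colors]
        for k in (0, 1):
            members = [c for c, l in zip(colors, labels) if l == k]
            if members:
                centres[k] = mean(members)
    return labels
```

This is the easy part of the identity layer; resolving which individual wears the shirt still depends on jersey OCR and re-ID, where broadcast blur does the damage.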
The intelligence layer
The upper layers are where perception becomes reasoning.
Event detection identifies discrete actions within the continuous stream: goals, passes, tackles, fouls, cards, substitutions. SoccerNet-v2 provides over 300,000 annotations across broadcast matches and has become the standard benchmark. ASTRA, a transformer-based event model, achieves a tight Average-mAP of 66.82 on the test set and explicitly integrates audio to detect actions not visible in the frame: off-camera fouls signalled by a whistle, or incidents indicated by crowd reaction. Events are where raw tracking data starts to become the language of match analysis.
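Spotting models of this kind output a per-frame confidence curve for each event class; turning a curve into discrete events is typically a peak-picking step, a temporal analogue of non-maximum suppression. A minimal sketch with illustrative threshold and window:

```python
def spot_events(scores, threshold=0.5, min_gap=10):
    """Turn a per-frame confidence curve into discrete event timestamps:
    accept the highest peaks first, then suppress any later candidate
    falling within min_gap frames of an accepted event."""
    candidates = sorted(
        ((s, t) for t, s in enumerate(scores) if s >= threshold),
        reverse=True)  # highest confidence first
    events = []
    for s, t in candidates:
        if all(abs(t - e) >= min_gap for e in events):
            events.append(t)
    return sorted(events)
```

The min_gap parameter encodes a domain assumption: two goals cannot happen ten frames apart, so nearby secondary peaks are echoes of the same event, not new ones.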
Tactical reasoning treats the match as a multi-agent system. Graph neural networks model player interactions as network structures, with players as nodes and spatial relationships as edges. TacticAI, published by DeepMind in Nature Communications in 2024 and validated with Liverpool FC coaching staff, combines predictive and generative components to let coaches sample alternative corner-kick setups and evaluate possible outcomes. Shape graphs, a 2025 contribution published in npj Complexity, represent team formations as subgraphs of Delaunay triangulations: an interpretable geometric structure that coaches can reason about and analysts can compute over efficiently.
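The graph representation itself is straightforward to construct. As a simpler stand-in for a Delaunay triangulation, this sketch connects each player to their k nearest teammates, producing the node-and-edge structure a graph neural network would consume (positions and k are illustrative):

```python
import math

def formation_graph(positions, k=3):
    """Build an adjacency list connecting each player (node) to their
    k nearest teammates (edges). Positions are (x, y) in pitch metres.
    A simplified stand-in for the Delaunay triangulations used in
    shape-graph analysis."""
    edges = {}
    for i, p in enumerate(positions):
        nearest = sorted(
            (j for j in range(len(positions)) if j != i),
            key=lambda j: math.dist(p, positions[j]))[:k]
        edges[i] = nearest
    return edges
```

The appeal of geometric graphs like these is interpretability: an edge is a spatial relationship a coach can point at, not a learned latent feature.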
At the top of the stack, delivery interfaces present structured game state to users. Increasingly, this means natural language. Rather than dashboards of isolated metrics and heatmaps, analysts can query match data conversationally, asking questions about positioning, tactical patterns, and match events, and getting answers grounded in structured video evidence.
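Under any conversational front end, such questions bottom out in filters over structured records. A hypothetical sketch of the kind of event record and filter an NL layer might translate a question into (field names and schema are invented for illustration):

```python
# Hypothetical event records a natural language interface might query.
events = [
    {"type": "pass", "player": 10, "minute": 12, "x": 80.0, "y": 30.0},
    {"type": "pass", "player": 10, "minute": 47, "x": 40.0, "y": 50.0},
    {"type": "shot", "player": 9, "minute": 63, "x": 95.0, "y": 35.0},
]

def query(records, **conditions):
    """Return records matching every condition; a callable condition is
    applied as a predicate, anything else is an equality check."""
    def matches(rec):
        return all(
            cond(rec[field]) if callable(cond) else rec[field] == cond
            for field, cond in conditions.items())
    return [rec for rec in records if matches(rec)]

# "Passes by number 10 in the final third" (final third: x > 70 on a
# 105 m pitch), as the structured filter an NL layer could emit.
final_third_passes = query(events, type="pass", player=10, x=lambda x: x > 70)
```

The grounding matters: the answer is only as good as the event and position records underneath, which is why this layer sits at the top of the stack rather than beside it.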
Why the stack view matters
The critical insight from recent benchmarks is that individual model accuracy is necessary but not sufficient. Pipeline quality is a system property, not a model property.
SoccerNet's Game State Reconstruction task, introduced in 2024, formalises this directly. GSR defines the target output as a minimap with tracked, identified athletes and their attributes, then evaluates the complete pipeline with a unified metric called GS-HOTA. The results are instructive. Processing a 30-second clip takes 18 minutes in the baseline configuration. The same pipeline can score high GS-HOTA on one sequence and collapse to near zero on the next, because a single calibration failure propagates through every downstream layer.
This is why it matters to think about sports CV as a stack. Errors cascade upward. A missed detection becomes a broken track. A broken track becomes a misidentified player. A misidentified player becomes a wrong tactical attribution. The architecture of the pipeline determines output quality at least as much as any individual model's accuracy.
What we are building
At Beach, we built MatchGraph to implement this stack from first principles. MatchGraph converts single-camera football video into structured, queryable match intelligence: object detection for players, ball, and referees; homography for mapping 2D footage to pitch coordinates; event detection for passes, tackles, shots, and goals; positional data generation; and natural language queries over the resulting structured data.
The design is deliberately AI-native. Rather than bolting machine learning onto a manual tagging workflow, MatchGraph treats the full pipeline as an automated system where each layer feeds the next. The output is not a highlight reel. It is structured game state that analysts, coaches, and scouts can interrogate programmatically or conversationally.
The commercial landscape validates this direction. Hawk-Eye covers 25 sports across 100,000+ event days in 100+ countries using multi-camera installations. Second Spectrum provides tracking data for the Premier League, NBA, and MLS. Veo has recorded more than four million games with automated single-camera capture. Hudl serves 315,000+ teams across 40+ sports. The market is large, the technology is maturing, and the competitive frontier is shifting from "can we track players?" to "can we reconstruct full game state cheaply and reliably from any camera?"
What this series covers
This post introduces the stack. The rest of this series goes deep on each layer.
We will cover detection architectures and the small-object problem for ball tracking. We will explain why multi-object tracking breaks in football, why player identity remains unsolved, and why calibration errors dominate pipeline failures. We will move up the stack into event detection, graph-based tactical reasoning, trajectory prediction, multimodal approaches that fuse audio and video, and the natural language interfaces that make game state accessible to non-technical users.
We will also cover the engineering that makes deployment possible: synthetic data for training without real footage, sim-to-real transfer, edge systems for grassroots clubs, and what commercial providers reveal about where the market is heading. The series closes with applied lessons from building MatchGraph and a practical playbook for sports organisations building video intelligence capabilities.
If your organisation works with sports video and is evaluating what AI can extract from it, our Sports Performance playbook covers the assessment frameworks and engagement models. Start with a Performance Data Assessment to understand your current data infrastructure, or explore our Football Video Intelligence engagement for teams ready to build.