Game State Reconstruction: The North Star of Sports AI
Game state reconstruction is emerging as the organising goal of sports AI, shifting evaluation from isolated models to system-level pipeline metrics.
The field of sports computer vision has a measurement problem. Every year, detection accuracy improves. Tracking benchmarks improve. Re-identification benchmarks improve. Yet anyone who has tried to build a full match analysis system knows that individual model improvements do not translate predictably into better system outputs. A better detector does not guarantee better tactical data. A better tracker does not guarantee correct player attribution. Something gets lost between component accuracy and pipeline quality.
That gap has a name now: game state reconstruction.
What game state reconstruction is
Game state reconstruction (GSR) is the goal of producing a complete, structured representation of a match at any given moment: who is where on the pitch, what they are doing, and how those positions and actions relate to each other. The output is not video highlights or summary statistics. It is a minimap, a frame-level representation of all players, the ball, and their attributes, placed in pitch coordinate space.
This sounds like what sports AI has always been trying to do. The difference is formalisation. In 2024, SoccerNet published Game State Reconstruction as a standalone task with an open dataset and baseline pipeline. Rather than benchmarking detection, tracking, or identity in isolation, GSR evaluates the complete output: can your system produce an accurate minimap, with correctly identified athletes, in the right positions, with reliable attributes?
The baseline pipeline required to attempt this task makes the complexity visible. GSR combines multi-object tracking, player re-identification, team affiliation clustering, jersey number recognition, pitch localization, and camera calibration into a single coherent system. Every layer depends on every other layer. There is no clean separation between perception subtasks and output quality.
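One way to make that output concrete is to write it down as a data structure. The sketch below is a minimal, hypothetical schema for a single minimap frame. The class and field names (AthleteState, GameStateFrame, x_m, y_m and so on) are our own illustration, not the SoccerNet-GSR file format, and the attribute set simply mirrors the role, team, and jersey number attributes discussed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class AthleteState:
    """One person on the minimap in one frame (hypothetical schema)."""
    track_id: int                        # identity that should stay stable across frames
    x_m: float                           # position along the pitch length, in metres
    y_m: float                           # position across the pitch width, in metres
    role: str                            # e.g. "player", "goalkeeper", "referee"
    team: Optional[str] = None           # e.g. "left" or "right"; None for referees
    jersey_number: Optional[int] = None  # None when the number cannot be read


@dataclass
class GameStateFrame:
    """Everything known about the match at a single video frame."""
    frame_index: int
    athletes: List[AthleteState] = field(default_factory=list)
    ball_xy_m: Optional[Tuple[float, float]] = None  # None when the ball is not localised

    def team_positions(self, team: str) -> List[Tuple[float, float]]:
        """Pitch coordinates of every athlete assigned to the given team."""
        return [(a.x_m, a.y_m) for a in self.athletes if a.team == team]


frame = GameStateFrame(
    frame_index=0,
    athletes=[
        AthleteState(1, 10.0, 5.0, "player", "left", 7),
        AthleteState(2, -3.0, 0.0, "referee"),
    ],
)
print(frame.team_positions("left"))
```

Every field in this structure is produced by a different layer of the pipeline: the track ID by tracking and re-identification, the coordinates by calibration, the team and jersey number by attribute recognition. A single record is only correct when all of them are.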
The trouble with component benchmarks
The standard approach to improving sports CV systems has been to improve each component separately. Better detection benchmarks, better tracking benchmarks, better calibration benchmarks. The assumption is that better components add up to a better system.
GSR demonstrates that this assumption is wrong, at least as a planning model.
The SoccerNet-GSR paper introduces a unified metric, GS-HOTA, that evaluates the complete pipeline rather than any single stage. GS-HOTA adapts HOTA (Higher Order Tracking Accuracy) to the game-state setting, measuring tracking consistency, identity accuracy, and spatial localization together. A system that is excellent at tracking but poor at identity will score lower than a system that balances both. A system with near-perfect detection but unreliable calibration can score near zero on certain clips.
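To see why one weak layer drags the whole score down, it helps to look at how a predicted athlete might be matched to a ground-truth athlete. The sketch below follows our reading of the paper: a localisation term that decays with distance on the pitch, multiplied by an identity term that is either one or zero. The 5-metre tolerance, the 0.05 constant, and the all-or-nothing attribute match are assumptions to check against the paper before relying on them, and the function names are ours. The final helper is the standard HOTA combination of detection and association accuracy at a single matching threshold.

```python
import math

TAU_M = 5.0  # assumed distance tolerance in metres (verify against the paper)


def loc_sim(pred_xy, gt_xy, tau=TAU_M):
    """Localisation similarity: 1.0 at zero distance, 0.05 at a distance of tau."""
    d2 = (pred_xy[0] - gt_xy[0]) ** 2 + (pred_xy[1] - gt_xy[1]) ** 2
    return math.exp(math.log(0.05) * d2 / tau ** 2)


def id_sim(pred_attrs, gt_attrs):
    """Identity similarity: 1.0 only if every attribute matches, else 0.0."""
    return 1.0 if pred_attrs == gt_attrs else 0.0


def gs_similarity(pred_xy, pred_attrs, gt_xy, gt_attrs):
    """Combined similarity: position and identity must both be right."""
    return loc_sim(pred_xy, gt_xy) * id_sim(pred_attrs, gt_attrs)


def hota_at_threshold(det_a, ass_a):
    """HOTA at one threshold is the geometric mean of DetA and AssA."""
    return math.sqrt(det_a * ass_a)


gt = ((20.0, 10.0), ("player", "left", 7))  # (position, (role, team, jersey number))

# Right identity, one metre off: similarity stays high.
print(gs_similarity((21.0, 10.0), ("player", "left", 7), *gt))
# Right position, wrong jersey number: similarity is zero.
print(gs_similarity((20.0, 10.0), ("player", "left", 9), *gt))
```

Because the two terms are multiplied, a perfectly placed athlete with the wrong jersey number contributes nothing, and so does a correctly identified athlete projected to the wrong part of the pitch.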
That last point is worth sitting with. The benchmark explicitly shows cases where minor calibration inaccuracies cause major pitch registration errors that collapse GS-HOTA to near zero, even when other pipeline components are performing well. One layer fails, and the failure propagates through every downstream stage. The output is wrong not because the models are bad but because the error budget in a sequential pipeline is not additive. It compounds.
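A deliberately simplified model shows what compounding means in practice. Assume each stage independently produces a usable output with some probability, and that the final game state is only correct if every stage succeeds. Both assumptions are ours, and the reliability figures below are hypothetical values chosen for illustration, not measurements from the benchmark.

```python
from math import prod


def pipeline_reliability(stage_reliabilities):
    """Probability that every stage succeeds, assuming independent stages."""
    return prod(stage_reliabilities)


# Hypothetical per-stage reliabilities for the six layers named earlier.
stages = {
    "multi-object tracking": 0.95,
    "re-identification": 0.95,
    "team affiliation": 0.95,
    "jersey number recognition": 0.95,
    "pitch localization": 0.95,
    "camera calibration": 0.95,
}

balanced = pipeline_reliability(stages.values())
print(f"six stages at 95% each: {balanced:.3f}")

# Same pipeline, but calibration only succeeds half the time.
weak_calibration = dict(stages, **{"camera calibration": 0.50})
print(f"with 50% calibration:   {pipeline_reliability(weak_calibration.values()):.3f}")
```

Under this model, six stages that each look strong at 95% leave a pipeline that is right less than three-quarters of the time, and a single unreliable stage caps the whole system no matter how good the other five are.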
This problem cannot be solved by optimising individual models in isolation. It is a systems engineering problem.
What the baseline numbers reveal
SoccerNet-GSR provides a baseline pipeline that runs tracking, identity, and calibration in sequence. The processing time for a 30-second clip in one baseline configuration is 18 minutes. That is not a deployable system. It is a research baseline that makes the computational scope of the problem clear: full game state reconstruction, done correctly, requires significantly more compute than any single-model task benchmark suggests.
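The arithmetic behind that figure is worth making explicit. The short calculation below takes the 18-minute and 30-second numbers quoted above and extrapolates to a full match, assuming processing time scales linearly with footage length on the same hardware. That linear scaling is our assumption; the baseline was not necessarily measured that way.

```python
def slowdown_factor(processing_seconds, clip_seconds):
    """How many seconds of compute are spent per second of footage."""
    return processing_seconds / clip_seconds


factor = slowdown_factor(processing_seconds=18 * 60, clip_seconds=30)
print(f"slowdown versus real time: {factor:.0f}x")

# Extrapolate to a 90-minute match, assuming linear scaling.
match_hours = 90 * 60 * factor / 3600
print(f"estimated time for a full match: {match_hours:.0f} hours")
```

At 36 times slower than real time, a single match would take more than two days to process in that configuration, which is why the baseline is best read as a reference point rather than a product.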
The variability in GS-HOTA scores across clips is equally instructive. The same pipeline, run on different 30-second sequences, can produce dramatically different results. GS-HOTA can be high in one clip and near zero in the next, because a single camera shot with limited pitch line visibility triggers a calibration failure that corrupts the minimap for that sequence. The metric is sensitive to the real failure modes of real broadcast video, rather than to the controlled lighting and coverage of curated benchmark subsets.
This is the signal that evaluation frameworks built around isolated tasks miss. A tracker benchmarked on clean, well-lit, side-view footage is being benchmarked on a different problem than a tracker embedded in a full GSR pipeline processing 90 minutes of broadcast video with camera cuts, zooms, and partial pitch visibility.
Why this changes the development frame
Accepting GSR as the organising goal changes how a sports AI team should prioritise work.
The question is no longer "which model is best at tracking?" It becomes "where does our pipeline fail, and what kind of failure causes the largest drop in system-level output?" Calibration errors dominate pipeline failures in the GSR baseline. That means camera calibration is not a supporting subtask. It is the highest-leverage investment in a GSR system, because its failures are catastrophic rather than merely degrading.
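A first pass at answering that question can be very simple. The sketch below assumes you have a system-level score for each clip and a record of which stages were judged to have failed on it, then compares the average score with and without each failure. The clip records, the stage labels, and the scores are all hypothetical, and the comparison is a naive correlation, not a causal analysis.

```python
from statistics import mean


def score_drop_by_stage(clips):
    """Mean score without a stage failure minus mean score with it, per stage."""
    stages = sorted({stage for clip in clips for stage in clip["failed"]})
    drops = {}
    for stage in stages:
        with_failure = [c["score"] for c in clips if stage in c["failed"]]
        without_failure = [c["score"] for c in clips if stage not in c["failed"]]
        if with_failure and without_failure:
            drops[stage] = mean(without_failure) - mean(with_failure)
    return drops


# Hypothetical per-clip results: a system-level score and the stages that failed.
clips = [
    {"score": 0.60, "failed": set()},
    {"score": 0.30, "failed": {"identity"}},
    {"score": 0.05, "failed": {"calibration"}},
    {"score": 0.55, "failed": set()},
]

drops = score_drop_by_stage(clips)
for stage, drop in sorted(drops.items(), key=lambda item: -item[1]):
    print(f"{stage}: average score drop {drop:.2f}")
```

Ranking stages by the drop they are associated with gives a rough priority list. A more careful version would swap ground truth in for one stage at a time and re-run the pipeline, which separates the effect of each failure from the others.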
The same logic applies to identity. Tracking without identity produces trajectories for anonymous blobs. Analysts cannot use anonymous trajectories to understand positioning, pressing, or off-ball movement by specific players. The gap between "tracking a blob" and "knowing who it is" is the gap between raw signal and actionable intelligence. GSR makes that gap explicit in its metric structure.
The benchmarking shift also changes how you think about trade-offs across the pipeline. Investing heavily in detection while under-investing in calibration produces a system that scores well on detection benchmarks but poorly on what actually matters for match analysis. GS-HOTA penalises that kind of uneven development directly.
The democratisation argument
The SoccerNet-GSR paper frames its purpose with a specific goal: the benchmark was introduced "aiming to democratize access to this valuable game state data for all leagues." That framing is pointed.
Game state data of the quality that top leagues use (structured player positions, ball trajectories, tactical attributes, event timestamps) exists today only behind significant infrastructure investment. Hawk-Eye runs multi-camera installations across 25 sports in 100+ countries. Second Spectrum operates proprietary tracking for the Premier League, NBA, and MLS. Accessing that quality of data requires a contract with a top-tier provider or membership in an elite league.
The GSR benchmark exists because the research community believes that single-camera, AI-native pipelines can close the quality gap significantly. The aim is not to match elite multi-camera installations in raw precision, but to produce game state data good enough to be analytically useful, from cameras that already exist at every level of the game.
That is the commercial opportunity that companies like Veo have begun to address at scale, with over four million games recorded through automated single-camera capture. The missing piece is not the camera. It is the pipeline that converts that footage into structured game state reliable enough to support coaching decisions, scouting workflows, and performance analysis at clubs that have never had access to this kind of data before.
What this means for building sports AI
The practical implication of GSR as the north star is that building sports AI well means building a system, not a collection of models.
That requires evaluating pipeline outputs rather than model outputs. It requires understanding failure propagation, not just individual component accuracy. It requires treating calibration as a first-class engineering concern rather than a pre-processing step. And it requires having a clear definition of "game state" before writing a line of code, because GS-HOTA is measuring something specific: tracked identities in spatial correspondence with the real pitch.
At Beach, this framing is central to how we approached MatchGraph. The target was not better component benchmarks. It was reliable game state from a single camera, at quality useful for analysts, coaches, and scouts who currently have no access to structured match data at all.
The GSR benchmark provides an honest evaluation framework for that goal. It does not let you claim success by improving one layer while ignoring the rest. It measures what actually matters for football intelligence: complete, consistent, spatially accurate game state, produced from the video that is actually available.
The rest of this series works through each layer of the stack in depth. We start with why football has become the dominant proving ground for sports CV research, and what that concentration of work means for the field beyond the game. If your organisation is evaluating what structured match data could enable, our Sports Performance playbook covers the assessment frameworks and engagement models.