Tracking 22 Players Through 90 Minutes of Continuous Play
Multi-object tracking in football breaks linear motion assumptions. Association quality, not detection, is the real engineering bottleneck.
Detection tells you where players are. Tracking tells you who they were and where they went.
Player detection produces a set of bounding boxes for each frame, independent of every other frame. Tracking takes those per-frame detections and connects them across time into consistent trajectories: this bounding box in frame 1247 belongs to the same player as this bounding box in frame 1248, and in frame 1302 after an occlusion, and in frame 1467 after a camera cut. Without tracking, you have a sequence of snapshots. With tracking, you have the continuity of identity that makes tactical analysis possible.
Multi-object tracking in football is harder than it looks from the outside, and for reasons that differ from those the mainstream tracking literature emphasises.
Tracking-by-detection and how pipelines are actually built
Modern sports tracking pipelines follow the tracking-by-detection paradigm. A detection model runs on each frame, producing candidate bounding boxes. A separate association module links those detections to existing tracks. When a detection cannot be matched to a known track, a new track is created. When a track goes unmatched for enough frames, it is terminated.
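The match, create, and terminate cycle described above can be sketched in a few lines. This is a minimal illustration rather than any particular tracker's implementation: the class name `SimpleTracker`, the greedy best-overlap matching, and the threshold values are all assumptions made for the example. Production trackers typically solve the assignment with the Hungarian algorithm and match against motion-predicted boxes rather than the last observed ones.

```python
from dataclasses import dataclass, field

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Track:
    track_id: int
    box: Box
    missed: int = 0  # consecutive frames without a matched detection


@dataclass
class SimpleTracker:
    iou_threshold: float = 0.3  # hypothetical value, tune per deployment
    max_missed: int = 30        # hypothetical value, tune per deployment
    tracks: list[Track] = field(default_factory=list)
    next_id: int = 1

    def update(self, detections: list[Box]) -> list[Track]:
        # Score every (track, detection) pair and visit best overlaps first.
        pairs = sorted(
            ((iou(t.box, d), ti, di)
             for ti, t in enumerate(self.tracks)
             for di, d in enumerate(detections)),
            reverse=True,
        )
        used_tracks: set[int] = set()
        used_dets: set[int] = set()
        for score, ti, di in pairs:
            if score < self.iou_threshold:
                break  # remaining pairs overlap too little to link
            if ti in used_tracks or di in used_dets:
                continue
            self.tracks[ti].box = detections[di]
            self.tracks[ti].missed = 0
            used_tracks.add(ti)
            used_dets.add(di)

        # Unmatched tracks age, and are terminated once they exceed max_missed.
        for ti, t in enumerate(self.tracks):
            if ti not in used_tracks:
                t.missed += 1
        self.tracks = [t for t in self.tracks if t.missed <= self.max_missed]

        # Unmatched detections start new tracks.
        for di, d in enumerate(detections):
            if di not in used_dets:
                self.tracks.append(Track(self.next_id, d))
                self.next_id += 1
        return self.tracks
```

Calling `update` once per frame with that frame's boxes keeps `tracks` current. Every identity decision happens inside the matching loop, which is why the rest of this article concentrates on association.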
This separation of concerns has practical advantages. The detection and tracking components can be developed, evaluated, and optimised independently. Detection quality directly determines the ceiling of tracking quality: if a player is not detected, tracking cannot maintain continuity through that gap by association alone.
The most widely deployed tracking algorithm in current sports CV pipelines is ByteTrack, published at ECCV 2022. Its core insight: low-confidence detections, which most trackers discard, carry useful information about objects that are partially occluded or temporarily unclear. ByteTrack associates high-confidence detections first, then matches remaining unassigned tracks against low-confidence detections in a second pass. This recovers many partially occluded objects that simpler thresholding approaches lose, producing HOTA and IDF1 improvements across standard benchmarks without changing the underlying detector at all.
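A rough sketch of the two-pass idea follows. It is not ByteTrack's actual code: the function names, the greedy matcher, and every threshold value are assumptions chosen for illustration, and the real algorithm matches against Kalman-predicted boxes using an optimal assignment.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks: dict[int, Box], dets: dict[int, Box],
                 min_iou: float) -> dict[int, int]:
    """Return {track_id: detection_index}, taking highest-IoU pairs first."""
    pairs = sorted(
        ((iou(tb, db), tid, di)
         for tid, tb in tracks.items()
         for di, db in dets.items()),
        reverse=True,
    )
    matches: dict[int, int] = {}
    used_dets: set[int] = set()
    for score, tid, di in pairs:
        if score < min_iou:
            break
        if tid in matches or di in used_dets:
            continue
        matches[tid] = di
        used_dets.add(di)
    return matches


def two_pass_associate(tracks: dict[int, Box],
                       detections: list[tuple[Box, float]],
                       high_score: float = 0.6,      # hypothetical threshold
                       low_score: float = 0.1,       # hypothetical threshold
                       min_iou_first: float = 0.3,   # hypothetical threshold
                       min_iou_second: float = 0.5,  # hypothetical threshold
                       ) -> tuple[dict[int, int], list[int], list[int]]:
    # Split detections by confidence; anything below low_score is dropped.
    high = {i: b for i, (b, s) in enumerate(detections) if s >= high_score}
    low = {i: b for i, (b, s) in enumerate(detections) if low_score <= s < high_score}

    # Pass 1: all tracks against high-confidence detections.
    first = greedy_match(tracks, high, min_iou_first)

    # Pass 2: tracks still unmatched against low-confidence detections.
    remaining = {tid: b for tid, b in tracks.items() if tid not in first}
    second = greedy_match(remaining, low, min_iou_second)

    matches = {**first, **second}
    unmatched_tracks = [tid for tid in tracks if tid not in matches]
    # Only unmatched high-confidence detections are allowed to start new tracks.
    matched_high = set(first.values())
    new_track_candidates = [i for i in high if i not in matched_high]
    return matches, unmatched_tracks, new_track_candidates
```

The design point is that a low-confidence box never creates a track on its own. It can only extend a track that already exists, which is what lets a partially occluded player keep their identity without flooding the tracker with false positives.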
BoT-SORT extends ByteTrack with two additions that matter considerably in football: camera motion compensation, which adjusts bounding box predictions to account for global camera movement before running the Kalman state update, and an optional appearance re-identification module that uses visual embeddings to support association when motion-based matching is ambiguous. Both prove valuable in broadcast conditions where the camera follows play continuously and occlusion events are frequent.
Why the Kalman filter hurts in sports
The standard motion model in tracking pipelines is the Kalman filter, usually configured with a constant-velocity model: it predicts where an object will appear in the next frame from its estimated position and velocity. For pedestrians or vehicles moving in roughly consistent directions at consistent speeds, this prediction is usually accurate enough to support reliable association.
For football players, it is not. Players stop, pivot, and accelerate in ways that violate the linear motion assumption the Kalman filter encodes. When a player decelerates from a full sprint to change direction, a constant-velocity prediction keeps extrapolating along the old path, and the gap between predicted and actual position widens with every frame the player goes unmatched. In a crowded frame during a pressing sequence or a corner, those prediction errors hit several players at once, and the association module mismatches them.
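To get a feel for the size of the effect, the sketch below compares a plain constant-velocity extrapolation with a player braking at a constant rate. It is a deliberately simplified stand-in for a full Kalman filter (no covariance, no measurement update), it works in metres on the pitch rather than pixels, and the sprint speed, braking rate, and frame rate are illustrative assumptions, not measured values.

```python
def constant_velocity_error(speed_mps: float, decel_mps2: float,
                            frames: int, fps: float = 25.0) -> float:
    """Distance in metres between a constant-velocity extrapolation and a
    player who brakes at a constant rate, after `frames` unobserved frames."""
    t = frames / fps
    predicted = speed_mps * t           # where the linear model puts the player
    t_stop = speed_mps / decel_mps2     # time needed to come to a standstill
    if t <= t_stop:
        actual = speed_mps * t - 0.5 * decel_mps2 * t * t
    else:
        actual = speed_mps ** 2 / (2.0 * decel_mps2)  # player has already stopped
    return predicted - actual


if __name__ == "__main__":
    # Hypothetical scenario: 8 m/s sprint, 5 m/s^2 braking, 25 fps footage.
    for gap in (1, 5, 12, 25):
        err = constant_velocity_error(8.0, 5.0, gap)
        print(f"{gap:>2} unobserved frames -> {err:.3f} m prediction error")
```

Under these assumptions the single-frame error is negligible, but it grows quadratically with the length of the gap. Linear prediction therefore does the most damage when a player is occluded for a run of frames, which is also when association most needs the help.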
SportsMOT, published at ICCV 2023, makes this concrete with a striking ablation. The dataset covers 240 sequences across football, basketball, and volleyball, totalling more than 150,000 frames and 1.6 million annotated bounding boxes. Running IoU-based association without any Kalman filter prediction produced a HOTA score of 71.5 on the SportsMOT benchmark. The same pipeline with Kalman filter prediction included dropped to 64.1. The motion model that helps in generic tracking actively damaged association accuracy in sports, because linear motion assumptions fail at precisely the moments when players are most densely grouped and physically interactive.
The practical implication is direct: trackers designed for pedestrians need modification before deployment on football footage. The Kalman filter is not categorically wrong for sports, but its parameters and assumptions require adjustment for the motion distributions that athletes actually produce. Several sport-specific trackers have addressed this by using higher-order motion models or by downweighting the Kalman prediction during high-density sequences where the linear assumption breaks down most severely.
What the benchmark numbers show
HOTA (Higher-Order Tracking Accuracy) decomposes into two components: detection accuracy (DetA) and association accuracy (AssA). At each localisation threshold the score is the geometric mean of the two, and the final figure averages over thresholds. The result is an overall tracking quality score that rewards systems performing well on both, rather than allowing high detection scores to mask poor association.
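The effect of the geometric mean is easiest to see with numbers. The snippet below is a small illustration with made-up DetA and AssA values; the function names are chosen for the example and are not taken from any evaluation toolkit.

```python
import math


def hota_at_threshold(det_a: float, ass_a: float) -> float:
    """HOTA at one localisation threshold: geometric mean of DetA and AssA."""
    return math.sqrt(det_a * ass_a)


def hota(per_threshold: list[tuple[float, float]]) -> float:
    """Average the per-threshold scores over all localisation thresholds."""
    scores = [hota_at_threshold(d, a) for d, a in per_threshold]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Illustrative values only: strong detection, weak association.
    det_a, ass_a = 0.9, 0.4
    print("arithmetic mean:", (det_a + ass_a) / 2)
    print("geometric mean :", hota_at_threshold(det_a, ass_a))
```

A tracker with DetA 0.9 and AssA 0.4 scores 0.65 on an arithmetic mean but only 0.6 on the geometric mean, so a strong detector cannot fully paper over frequent identity switches.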
On the TeamTrack dataset, evaluated across football sequences captured from fisheye and drone cameras, current trackers achieve HOTA in the range of 53 to 59 on the soccer views. The best models reach approximately 60% MOTA on football sequences. These figures are meaningfully lower than performance on pedestrian benchmarks like MOT17, where leading trackers exceed 70% HOTA, and the gap reflects the specific difficulties of football: similar player appearances within a team, non-linear motion, and broadcast cameras that introduce global apparent motion into every frame prediction.
SoccerNet-Tracking, introduced at CVPR 2022 and maintained as an ongoing benchmark, evaluates tracking across 200 broadcast sequences of 30 seconds each, plus a complete 45-minute half for long-term tracking assessment. It covers multiple object classes: players from each team, goalkeepers, referees, and the ball, reflecting the full set of tracked entities a production system must handle. The SoccerNet team has documented that tracking under fast motion and severe occlusion in broadcast football is "far from being solved." That characterisation remains accurate in 2026. Recent methods such as Deep HM-SORT have pushed SportsMOT performance above 80 HOTA and SoccerNet-Tracking above 85 HOTA on test sets, but broadcast conditions and challenging sequences continue to expose meaningful failure rates in practice.
Camera motion compensation in practice
Broadcast cameras in football move constantly. They follow the ball, zooming into dangerous attacks and pulling back for wide-angle views of defensive shape. The camera operator introduces motion that affects every player in the frame simultaneously: when the camera pans right, all players appear to move left in pixel space, even those standing still on the far side of the pitch.
A Kalman filter modelling each player's motion independently will interpret global apparent camera motion as real player movement. When the camera moves fast, this creates large prediction errors that break association reliability across the whole frame at once. BoT-SORT addresses this by estimating global camera motion from sparse optical flow and applying the resulting frame-to-frame transform to the Kalman state before running the association step. Each prediction is then expressed in the current frame's coordinates with the camera's contribution factored out, so what the filter has to model is the player's own movement rather than the camera's.
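The compensation step itself is small once a frame-to-frame transform is available. The sketch below assumes a 2x3 affine matrix has already been estimated elsewhere (for example from background keypoints followed with sparse optical flow) and applies it to a predicted box. It is a simplification: BoT-SORT corrects the full Kalman state, including its covariance, whereas this example only warps the box corners. The names `warp_point` and `compensate_box` are made up for the illustration.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
Affine = tuple[tuple[float, float, float], tuple[float, float, float]]


def warp_point(m: Affine, x: float, y: float) -> tuple[float, float]:
    """Apply a 2x3 affine transform to a single pixel coordinate."""
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])


def compensate_box(m: Affine, box: Box) -> Box:
    """Map a box predicted in the previous frame's pixel coordinates into the
    current frame's coordinates, given the previous-to-current camera transform."""
    x1, y1, x2, y2 = box
    corners = [warp_point(m, x, y) for x, y in
               ((x1, y1), (x2, y1), (x1, y2), (x2, y2))]
    xs = [c[0] for c in corners]
    ys = [c[1] for c in corners]
    # Take the axis-aligned envelope so rotation or zoom still yields a valid box.
    return (min(xs), min(ys), max(xs), max(ys))


if __name__ == "__main__":
    # Hypothetical pan to the right: scene content shifts 40 px left in the image.
    pan_right: Affine = ((1.0, 0.0, -40.0), (0.0, 1.0, 0.0))
    print(compensate_box(pan_right, (100.0, 50.0, 120.0, 100.0)))
```

With the pan in the example, a stationary player's predicted box moves 40 pixels left to follow the image content. Without the correction, the tracker would look for that player at the old pixel location and could lose the match.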
The technique matters more for football than for many other sports because broadcast cameras are designed for viewer narrative, not metric tracking. They zoom into the player about to shoot, cutting off the defensive shape. They switch to close-up replays and return to wide-angle mid-sequence. Each of these events represents a frame where camera-motion-unaware trackers generate incorrect predictions and trigger unnecessary track terminations, producing fragmented trajectories that later analysis layers cannot repair.
Appearance features and their limits
Short-term re-identification using visual appearance features supports tracking through occlusion. When a player disappears behind a cluster and re-emerges, a cosine distance comparison between the current detection's embedding and the stored embedding for the lost track can recover the correct match. This is one of the additions BoT-SORT makes over ByteTrack, and it is useful for recovering tracks across the brief occlusion events that are common in broadcast football.
The limitation specific to football is that within-team appearance is nearly identical. Players on the same team wear the same kit in the same colours, and in a wide broadcast shot they appear at similar scale and low resolution. Generic appearance embeddings trained on pedestrian data, where people wear distinct clothing combinations, do not discriminate reliably between players on the same team. Sport-specific re-identification training using jersey-level features and jersey number recognition is required to get discriminative signal for within-team tracking.
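Both the mechanism and its weakness can be shown in a short sketch. The code below matches a detection's embedding against stored embeddings for lost tracks by cosine distance, and refuses to decide when the two best candidates are nearly tied, which is the situation teammates tend to create. The function names, the distance threshold, and the margin value are assumptions for illustration and are not taken from BoT-SORT.

```python
import math
from typing import Optional


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; 0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0 else 1.0


def reidentify(det_embedding: list[float],
               lost_tracks: dict[int, list[float]],
               max_distance: float = 0.25,  # hypothetical threshold
               min_margin: float = 0.05,    # hypothetical threshold
               ) -> Optional[int]:
    """Return the lost track id that best matches the detection, or None if
    nothing is close enough or the best two candidates are too similar."""
    ranked = sorted(
        (cosine_distance(det_embedding, emb), tid)
        for tid, emb in lost_tracks.items()
    )
    if not ranked or ranked[0][0] > max_distance:
        return None  # no stored appearance is close enough
    if len(ranked) > 1 and ranked[1][0] - ranked[0][0] < min_margin:
        return None  # ambiguous: e.g. two teammates in near-identical kit
    return ranked[0][1]


if __name__ == "__main__":
    # Toy 3-dimensional embeddings; real models produce hundreds of dimensions.
    opponents = {7: [1.0, 0.0, 0.0], 9: [0.0, 1.0, 0.0]}
    teammates = {7: [1.0, 0.0, 0.0], 8: [0.99, 0.05, 0.0]}
    print(reidentify([0.9, 0.1, 0.0], opponents))   # clear winner
    print(reidentify([1.0, 0.02, 0.0], teammates))  # too close to call
```

Returning `None` in the ambiguous case hands the decision back to motion cues or jersey-number recognition instead of committing to a guess that could swap two identities for the rest of the sequence.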
This is why the association problem in football is harder than the raw detection numbers suggest. Motion models are unreliable because of non-linear athlete movement. Appearance models are unreliable because of visual similarity within a team. The tracking system must manage both failure modes simultaneously, across 22 players, for 90 minutes.
What this means for game state
Tracking quality determines what downstream analysis is possible. A broken track during a pressing sequence means the pressing metrics for that phase cannot be attributed correctly. A track swap during a corner kick means defensive assignments are misrecorded for all subsequent analysis of that set piece. Game state reconstruction depends on continuous, correctly attributed tracks from the detection layer all the way through to the tactical reasoning layer. Tracking errors do not stay contained; they propagate into every analysis built above them.
The field is making progress. Specialised sports trackers, better appearance models, and camera-motion-aware architectures have produced genuine benchmark improvements. The gap between research benchmarks and production broadcast conditions remains significant, and "far from solved" is an honest summary of where even strong systems still fail.
MatchGraph treats tracking as a primary engineering concern rather than a solved component. The current benchmark numbers are good enough for many analytical applications, but understanding the specific failure modes for the camera setups and match conditions that will actually be encountered in deployment requires careful characterisation beyond headline HOTA figures. For football organisations evaluating what video analytics can extract from their existing infrastructure, our Football Video Intelligence engagement covers the full pipeline, from tracking quality assessment through to the event-level and tactical outputs that coaching staff can act on.