The Tiny Object Problem: Why Detecting a Football Is Harder Than You Think

Ball detection is a deceptively hard problem in sports computer vision. Motion blur, occlusion, and confusion with pitch markings push it beyond what general-purpose detectors handle well, and have motivated a family of specialised models.

Player detection gets most of the attention in sports computer vision research. The papers are longer, the benchmarks are richer, and the downstream applications (tracking, identity, tactical analysis) all depend on getting player positions right. But there is a smaller problem, in the most literal sense, that turns out to be surprisingly resistant to standard methods: finding the ball.

A football in wide-angle broadcast footage can occupy fewer than five pixels across. It is spherical, so it lacks the distinctive edges and corners that help detectors find other objects. It moves faster than players, producing more severe motion blur at the same shutter speed. It is frequently occluded by the players fighting for it, precisely at the moments that matter most analytically. And the pitch is covered in circular markings, white lines, and specular highlights from stadium lighting that generate false positives in the same size and shape class as the ball itself.

These constraints compound into a detection problem that sits outside the comfortable performance envelope of general-purpose object detectors. That problem has driven a specialised research thread running alongside the mainstream sports CV literature.

Why general-purpose detectors struggle with the ball

The difficulty starts with scale. YOLO-family models and their transformer counterparts were developed on benchmarks like COCO, where the target distribution spans a wide range of object sizes and anything under 32 × 32 pixels in area already counts as small. A football in broadcast footage falls at the extreme low end of that range: at five pixels across it covers roughly 25 pixels of area, a region that many detection architectures simply do not represent well in their downsampled feature maps.
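To make the scale argument concrete, here is a minimal sketch of how wide an object is once it has been downsampled into a feature map. The strides of 8, 16, and 32 are an assumption (they are common in multi-scale detection heads, but any given model may differ), and the 80-pixel player is a hypothetical comparison point.

```python
# Output strides of a three-level detection head (assumed, not model-specific).
STRIDES = (8, 16, 32)


def footprint_cells(object_px, stride):
    """Width of an object measured in feature-map cells at a given stride."""
    return object_px / stride


def report(object_px):
    """Map each stride to the object's width in feature-map cells."""
    return {stride: footprint_cells(object_px, stride) for stride in STRIDES}


ball = report(5.0)     # a ball about 5 px across, as in the text
player = report(80.0)  # hypothetical player about 80 px tall

for stride in STRIDES:
    print(
        f"stride {stride:>2}: ball {ball[stride]:.2f} cells, "
        f"player {player[stride]:.2f} cells"
    )
```

Under these assumptions the ball never spans a full cell at any level, while the player spans several cells even at the coarsest one. That is the sense in which the ball is poorly represented: its evidence is averaged together with the grass around it.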

The standard response, increasing input resolution to give small objects more pixels, creates its own problem. Processing a 4K frame through a heavy detector at 25 frames per second demands significant compute, and the operational constraint for live sports analysis is often tight. A system that correctly detects the ball but runs at one tenth of real-time throughput is not deployable.
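The throughput constraint is simple arithmetic, sketched below. The resolutions are the standard 4K UHD and 1080p frame sizes; the 90-minute match length and the tenfold slowdown are illustrative values taken from the scenario described above.

```python
def frame_budget_ms(fps):
    """Time available per frame to keep pace with the video, in milliseconds."""
    return 1000.0 / fps


def pixels_per_second(width, height, fps):
    """Raw pixel throughput a detector must sustain at a given frame rate."""
    return width * height * fps


FPS = 25
uhd = pixels_per_second(3840, 2160, FPS)  # 4K UHD
hd = pixels_per_second(1920, 1080, FPS)   # 1080p
budget = frame_budget_ms(FPS)

# A detector running at one tenth of real-time throughput.
slowdown = 10
match_minutes = 90
processing_minutes = match_minutes * slowdown

print(f"per-frame budget at {FPS} fps: {budget:.0f} ms")
print(f"4K carries {uhd // hd}x the pixels of 1080p")
print(f"a {match_minutes}-minute match takes {processing_minutes} minutes to process")
```

Moving from 1080p to 4K quadruples the pixels that must pass through the network within the same per-frame budget, which is why raising resolution alone rarely solves the small-object problem for live use.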

Motion blur introduces a second challenge that affects the ball disproportionately. A player at full sprint produces blur across a relatively large pixel region; the shape remains broadly recognisable. A football moving at 80 kilometres per hour during a driven pass produces a smear of perhaps two to three pixels that bears almost no resemblance to the circular object a detector was trained to find. In a sequence of frames capturing a shot on goal, the sharpest view of the ball may be the one just before it leaves the foot, not any frame of the ball in flight.
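A back-of-envelope sketch of the blur geometry follows. The exposure time (1/250 s) and image scale (20 pixels per metre) are assumptions chosen for illustration; real values depend on the camera, lens, and where on the pitch the ball is.

```python
def speed_m_per_s(speed_kmh):
    """Convert kilometres per hour to metres per second."""
    return speed_kmh / 3.6


def blur_px(speed_kmh, exposure_s, px_per_m):
    """Distance the ball travels in the image while the shutter is open."""
    return speed_m_per_s(speed_kmh) * exposure_s * px_per_m


def step_px(speed_kmh, fps, px_per_m):
    """Distance the ball travels in the image between consecutive frames."""
    return speed_m_per_s(speed_kmh) / fps * px_per_m


# Assumed capture settings (hypothetical, for illustration only).
EXPOSURE_S = 1 / 250
PX_PER_M = 20
FPS = 25

blur = blur_px(80, EXPOSURE_S, PX_PER_M)
step = step_px(80, FPS, PX_PER_M)

print(f"travel during exposure: {blur:.2f} px")
print(f"travel between frames:  {step:.2f} px")
```

Two things fall out of this. The travel during a single exposure is a large fraction of the ball's own width, so the shape is heavily distorted. And the ball moves many of its own diameters between frames, which is why frame-to-frame association cannot rely on spatial overlap.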

The confusion problem is more specific to football than to other sports. Penalty spots, corner arcs, the centre circle, and the penalty arc (the "D") at the edge of each penalty area all create white, often curved features on a green background. Specular highlights on a wet pitch can produce bright spots in the same size range as the ball. A detector trained on standard football footage will encounter these as recurring false-positive candidates, and suppressing them reliably takes careful training data curation and post-processing.
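One simple form of post-processing is to discard candidates that never move. The sketch below assumes a fixed camera, so that pitch markings stay at the same pixel location, and assumes detections arrive as per-frame lists of `(x, y)` centres. The function names, the cell size, and the 80 percent threshold are all hypothetical choices, not part of any method discussed here.

```python
from collections import Counter


def to_cell(x, y, cell):
    """Quantise a pixel position to a coarse grid cell."""
    return (int(x // cell), int(y // cell))


def static_cells(frames, cell=4, min_fraction=0.8):
    """Grid cells that hold a candidate in at least min_fraction of frames."""
    counts = Counter()
    for candidates in frames:
        # Count each cell at most once per frame.
        counts.update({to_cell(x, y, cell) for x, y in candidates})
    threshold = min_fraction * len(frames)
    return {c for c, n in counts.items() if n >= threshold}


def suppress_static(frames, cell=4, min_fraction=0.8):
    """Drop candidates that fall in cells flagged as static."""
    static = static_cells(frames, cell, min_fraction)
    return [
        [(x, y) for x, y in candidates if to_cell(x, y, cell) not in static]
        for candidates in frames
    ]


# A penalty-spot-like candidate at (100, 200) in every frame, plus a moving ball.
frames = [[(100, 200), (10 + 20 * i, 50)] for i in range(5)]
cleaned = suppress_static(frames)
print(cleaned)
```

The obvious limitation is that a ball at rest before a free kick or corner would also be suppressed, so a filter like this is better used to down-weight candidates than to remove them outright. Candidates that jitter across a cell boundary can also escape the count.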

FootAndBall: specialised architecture for a specialised problem

The FootAndBall detector, published by Komorowski et al. in 2020, takes the opposite approach from scaling up a general-purpose model. The architecture is a fully convolutional network with a Feature Pyramid Network structure, specifically designed to detect both players and the ball in high-resolution, wide-angle footage.

The efficiency numbers make the design philosophy clear. FootAndBall has approximately 199,000 parameters, two orders of magnitude fewer than Faster R-CNN and around 200 times fewer than general-purpose detectors like SSD or YOLO at comparable input sizes. This reduction in model complexity is not a compromise: it is the design target. The Feature Pyramid design combines lower-level features, which have higher spatial resolution, with higher-level features, which have a larger receptive field. This improves discriminability for small objects because the model can draw on the context around the object rather than the object's pixel region alone.
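The core of a feature pyramid is a top-down merge: upsample the coarse map and combine it with the fine one. The sketch below shows that single step on toy 2D grids in plain Python. It is a generic illustration of the idea, not FootAndBall's actual layer configuration; real pyramids also apply learned convolutions before and after the merge, which are omitted here.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def merge(fine, coarse):
    """Element-wise sum of a fine map and the upsampled coarse map."""
    up = upsample2x(coarse)
    return [
        [f + c for f, c in zip(fine_row, up_row)]
        for fine_row, up_row in zip(fine, up)
    ]


# Fine map: a sharp, single-cell response (precise location, little context).
fine = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
# Coarse map: a broad response covering the same region (context, poor location).
coarse = [
    [0.5, 0.0],
    [0.0, 0.0],
]

merged = merge(fine, coarse)
for row in merged:
    print(row)
```

The merged map keeps the fine map's precise peak while raising the score of the whole neighbourhood that the coarse level considered ball-like, which is the "context plus location" effect described above.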

At 37 frames per second on high-definition video, FootAndBall runs at real-time speeds on commodity hardware. For grassroots deployments where compute is constrained and the primary analytical goal is understanding ball possession and movement patterns, this efficiency-first design is more appropriate than a heavier architecture that achieves higher benchmark mAP but cannot sustain the throughput required.

TrackNet: heatmap-based detection for high-speed objects

Where FootAndBall treats ball detection as a variant of the standard object detection problem, the TrackNet family frames it as a different problem entirely. TrackNet, introduced by Huang et al. in 2019, predicts the ball position as a heatmap over the input frame, using multiple consecutive frames stacked as input to provide temporal context.

The motivation is that a single-frame detector sees only a blurred smear during high-speed ball flight. Multiple frames together encode motion direction, which constrains where the ball plausibly is even when direct visual evidence is degraded. The temporal input converts what is visually ambiguous in one frame into something geometrically constrained across several.
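A minimal sketch of the heatmap formulation: the training target is a 2D Gaussian centred on the ball, and inference reads the ball position off the peak of the predicted map. The map size, `sigma`, and the 0.5 threshold are illustrative assumptions rather than the values used in the TrackNet papers, and the network that would sit between target and prediction is left out.

```python
import math


def gaussian_heatmap(width, height, cx, cy, sigma=2.0):
    """Target heatmap: a 2D Gaussian bump centred on the ball position."""
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def decode_peak(heatmap, threshold=0.5):
    """Return (x, y) of the strongest response, or None if it is too weak."""
    best, best_xy = -1.0, None
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best:
                best, best_xy = value, (x, y)
    return best_xy if best >= threshold else None


target = gaussian_heatmap(32, 18, cx=20, cy=7)
empty = [[0.0] * 32 for _ in range(18)]

print(decode_peak(target))
print(decode_peak(empty))
```

Returning `None` for a weak peak matters in practice: it lets the model say "no ball visible" instead of forcing a position, which is the case a bounding-box detector with a fixed confidence threshold handles less gracefully. In a TrackNet-style setup the input to the network would be several consecutive frames stacked along the channel axis.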

TrackNetV4, published in September 2024, extends this approach with learnable motion attention maps. The method fuses high-level visual features with motion-derived signals via a motion-aware fusion mechanism, using frame differencing to highlight regions of fast motion and modulating attention toward those regions. The result is a lightweight, plug-in module that can be added on top of existing TrackNet architectures rather than requiring a full model redesign. Experimental results on tennis ball and shuttlecock tracking datasets show consistent improvements, and the underlying principle should carry over to football, where the ball is similarly small and fast.
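The frame-differencing idea can be sketched in a few lines. This is an illustration of the principle only, not the TrackNetV4 formulation: the sigmoid squashing and its `slope` and `shift` parameters are stand-ins for whatever the learnable attention actually uses, and the "features" here are just the current frame's intensities.

```python
import math


def frame_difference(prev, curr):
    """Absolute per-pixel difference between two grayscale frames."""
    return [
        [abs(c - p) for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def motion_attention(diff, slope=10.0, shift=0.2):
    """Squash differences into (0, 1); slope and shift are hypothetical stand-ins
    for parameters that would be learned."""
    return [
        [1.0 / (1.0 + math.exp(-slope * (d - shift))) for d in row]
        for row in diff
    ]


def modulate(features, attention):
    """Scale each feature by the motion attention at that location."""
    return [
        [f * a for f, a in zip(feat_row, att_row)]
        for feat_row, att_row in zip(features, attention)
    ]


# A static bright marking at (x=0, y=0); a ball moving from (1, 1) to (2, 1).
prev = [[0.9, 0.0, 0.0], [0.0, 0.9, 0.0], [0.0, 0.0, 0.0]]
curr = [[0.9, 0.0, 0.0], [0.0, 0.0, 0.9], [0.0, 0.0, 0.0]]

attention = motion_attention(frame_difference(prev, curr))
out = modulate(curr, attention)
for row in out:
    print([round(v, 3) for v in row])
```

The static marking is just as bright as the ball in the raw frame but is strongly attenuated after modulation, while the ball's current position passes through almost unchanged. With a moving camera the frames would need to be aligned first, which this sketch does not attempt.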

TOTNet: occlusion-aware temporal tracking

Occlusion is the hardest case in ball detection, and it is the case that matters most. A ball blocked from view by a player during a tackle or a contested header is almost always in a situation that will produce an event worth logging. The system's inability to locate the ball at that moment is not a minor accuracy loss: it is a gap in the analytical record precisely where the analytical record is most needed.

TOTNet (Temporal Occlusion Tracking Network) addresses this directly with three design choices: 3D convolutions to process temporal sequences rather than single frames, a visibility-weighted loss function that upweights gradient signal from occluded frames during training, and an occlusion augmentation strategy that forces the model to develop robust representations for the occluded case rather than learning to avoid it.
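The visibility-weighted loss is the easiest of the three choices to illustrate. Below is a minimal sketch using a weighted binary cross-entropy over per-sample predictions. The three visibility classes and their weights are hypothetical; the source does not state the values TOTNet uses, and the real loss operates on heatmaps rather than single probabilities.

```python
import math

# Hypothetical weights: harder visibility classes contribute more to the loss.
VISIBILITY_WEIGHTS = {"visible": 1.0, "partial": 2.0, "occluded": 3.0}


def bce(pred, target, eps=1e-7):
    """Binary cross-entropy for one prediction, clamped for numerical safety."""
    p = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def weighted_loss(samples):
    """Weighted mean BCE over (pred_prob, target, visibility) samples."""
    total = 0.0
    weight_sum = 0.0
    for pred, target, visibility in samples:
        weight = VISIBILITY_WEIGHTS[visibility]
        total += weight * bce(pred, target)
        weight_sum += weight
    return total / weight_sum


# Same two predictions, but the poor one lands on a different visibility class.
poor_on_occluded = [(0.9, 1.0, "visible"), (0.3, 1.0, "occluded")]
poor_on_visible = [(0.3, 1.0, "visible"), (0.9, 1.0, "occluded")]

loss_a = weighted_loss(poor_on_occluded)
loss_b = weighted_loss(poor_on_visible)
print(f"poor prediction on occluded frame: {loss_a:.3f}")
print(f"poor prediction on visible frame:  {loss_b:.3f}")
```

The same mistake costs more when it happens on an occluded frame, so gradient descent spends more of its effort there. Without the weighting, occluded frames are a small minority of the data and the model can reach a low average loss while largely ignoring them.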

The results show what targeted architecture design can achieve. Compared to prior state-of-the-art methods, TOTNet reduces RMSE from 37.30 to 7.19 overall, a reduction of roughly 81 percent. For fully occluded ball positions, accuracy improves from 0.63 to 0.80. These are not marginal improvements from a new training recipe. They represent a qualitative change in what the model can infer when the ball is hidden from view.

The mechanism is essentially a learned physics prior. Given the ball's known last position and velocity, the 3D convolutional architecture develops internal representations that model plausible ball trajectories through the occluded period. It does not see the ball, but it reasons about where the ball must be, constrained by what it observed before and after the occlusion. Human analysts perform similar reasoning when tracking play; the model is acquiring the same capability from temporal pattern recognition rather than explicit physics simulation.

Temporal smoothing and the online versus offline trade-off

Temporal reasoning comes at a cost: it requires buffering multiple frames before producing a detection. This creates a practical fork in the design space.

Offline processing, where a full match is analysed after the fact, can buffer as many frames as the model needs and apply smoothing across the entire sequence. Post-hoc trajectory smoothing using physics priors, for example enforcing ball positions that satisfy reasonable velocity and acceleration constraints, can resolve ambiguities that no frame-by-frame detector would catch. The analytical output is more accurate because the model has access to future context.
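A minimal sketch of post-hoc smoothing with a velocity constraint follows. It assumes the track is a per-frame list of `(x, y)` pixel positions with `None` for missed detections, and that a maximum plausible per-frame displacement (`max_step_px`, a hypothetical parameter) has been derived from ball speed, frame rate, and image scale. It rejects implausible jumps, then fills gaps by linear interpolation between accepted points on either side.

```python
def reject_jumps(track, max_step_px):
    """Replace points that jump implausibly far from the last accepted point."""
    cleaned = []
    last = None  # (frame_index, x, y) of the last accepted point
    for i, point in enumerate(track):
        if point is not None and last is not None:
            frames = i - last[0]
            dist = ((point[0] - last[1]) ** 2 + (point[1] - last[2]) ** 2) ** 0.5
            if dist > max_step_px * frames:
                point = None  # faster than the ball could plausibly travel
        cleaned.append(point)
        if point is not None:
            last = (i, point[0], point[1])
    return cleaned


def interpolate_gaps(track):
    """Linearly fill None entries that have accepted points on both sides."""
    out = list(track)
    known = [i for i, point in enumerate(track) if point is not None]
    for a, b in zip(known, known[1:]):
        (xa, ya), (xb, yb) = track[a], track[b]
        for i in range(a + 1, b):
            t = (i - a) / (b - a)
            out[i] = (xa + t * (xb - xa), ya + t * (yb - ya))
    return out


# Frame 2 is a false positive far from the trajectory; frame 3 is a miss.
raw = [(0, 0), (10, 0), (300, 200), None, (40, 0)]
smoothed = interpolate_gaps(reject_jumps(raw, max_step_px=20))
print(smoothed)
```

Filling frames 2 and 3 depends on the detection at frame 4, which is exactly the future context an online system does not have. This sketch trusts the first detection in the track and uses straight-line interpolation; a production smoother would more likely fit a motion model over the whole sequence.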

Online processing, required for live broadcasting overlays or real-time coaching feedback during training sessions, cannot look ahead. Detection quality is bounded by what the model can infer from past and present frames only. The occlusion-aware temporal approach of TOTNet is well-suited here because the 3D convolutional stack operates on a fixed-length window of past frames rather than the full sequence, keeping latency bounded while providing meaningful temporal context.
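The fixed-length window is straightforward to sketch with a bounded queue. The class name and interface below are hypothetical, and the integers stand in for video frames; the point is only that each output depends on the current frame and a bounded number of past frames, never on future ones.

```python
from collections import deque


class OnlineWindow:
    """Fixed-length buffer of the most recent frames for causal inference."""

    def __init__(self, size):
        # deque with maxlen discards the oldest frame automatically.
        self.frames = deque(maxlen=size)

    def push(self, frame):
        """Add a frame; return the full window once enough frames have arrived."""
        self.frames.append(frame)
        if len(self.frames) < self.frames.maxlen:
            return None  # still warming up
        return list(self.frames)


window = OnlineWindow(size=3)
outputs = [window.push(frame_id) for frame_id in range(5)]
print(outputs)
```

After a warm-up of `size - 1` frames, every new frame yields a window immediately, so the added latency is the model's inference time rather than any look-ahead delay. Memory use is also bounded by the window size regardless of how long the stream runs.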

Most production deployments in sports analysis actually operate in the offline regime: match footage is uploaded after the game, processed overnight, and presented to analysts the next morning. The real-time constraint is genuine, but it applies to a narrower set of use cases than the research framing suggests.

What this means for pipeline design

Ball detection errors have different downstream consequences than player detection errors. A missed player detection creates a gap in tracking that later association logic may or may not bridge. A missed ball detection during a key sequence removes the ability to attribute a pass, shot, or clearance to the correct player and phase of play. Event detection and tactical analysis layers depend on continuous ball trajectory, not approximate position.

The implication for game state reconstruction is that ball detection should be treated as its own subsystem within the pipeline, with its own architecture choices, training data strategy, and performance characterisation, rather than bundled into the player detection problem and addressed with the same model.

MatchGraph treats ball detection as a distinct layer with its own inference path. The temporal approaches described in this post are part of how the system maintains ball position through the occlusion events that matter most. If you are assessing the feasibility of ball-level analytics from your existing video infrastructure, our Football Video Intelligence engagement covers the full pipeline, from camera constraints through to event-level outputs that coaching staff can act on.
