Recursive Reasoning for Video: Do Models Need to Be Bigger to Perform Better?

By Arun Ravindran, Eren Aldis, Tom Berg, and Adel Abdalla
Blog Post

Across industries, organizations capture vast amounts of video, but most of it is never used. From retail floors and factory lines to hospitals, farms, and energy infrastructure, turning that footage into timely, reliable insight remains difficult. That is beginning to change as organizations use AI to interpret video as it is captured.

But the economics of video AI are brutal. Models are large, inference is expensive, and the hardware required to run them at the edge often can’t support today's state-of-the-art architectures.

This leads to a tradeoff between accuracy and deployability, slowing adoption and limiting where video intelligence can run. In response, much of the field has opted to scale models larger. Yet scaling often shifts the tradeoff instead of solving it, improving accuracy but making deployment more difficult.

This creates an opportunity to rethink how models use their existing capacity.

At BCG X’s AI Science Institute, we work at the intersection of foundational AI research and enterprise deployment. Our teams partner directly with organizations navigating real constraints: hardware budgets, latency SLAs, privacy requirements, and limited connectivity.

In this work, we focus on video understanding models—specifically those used to recognize actions in video—and apply recursive reasoning, a technique in which a model iteratively refines its understanding using shared parameters. Instead of getting bigger, the model revisits and sharpens its interpretation of each frame.

The result is an 11 percentage-point improvement in action recognition accuracy without increasing model size. Both the baseline and the recursive model have roughly six million parameters, so the improvement is driven entirely by architecture, not scale.

That combination—higher accuracy at the same size—has immediate real-world implications. Models become cheaper, faster, and viable on constrained hardware—such as self-checkout terminals, drones, factory cameras, and mobile devices—where video data is actually generated. It also opens the door to bringing the same approach to on-device language models, a direction we are actively exploring.

In this post, we’ll look at why this matters, where recursive reasoning can be applied, and how it works.

Performance Is More Than Accuracy

These improvements change the economics of video AI. They affect what models cost to run, where they can operate, and how their behavior can be understood.

This approach delivers the most value in systems operating locally under real-world constraints. In those settings, performance is defined not just by accuracy, but by cost, latency, and whether the model can be deployed at all.

Where Recursive Reasoning Delivers Value

The advantage of recursive reasoning is straightforward: higher accuracy without a larger parameter footprint. This makes it well suited to models that must run directly on constrained hardware, particularly in scenarios such as:

On-device language models

The same recursive reasoning principle that improves vision models applies to language. The weight-sharing architecture allows small models to perform better than their size would suggest by iterating over representations rather than scaling depth. This is directly relevant to the push toward on-device LLMs—running assistants, classifiers, and summarizers locally on phones, tablets, or embedded systems where latency, privacy, and connectivity constraints rule out cloud inference.

Retail: self-checkout video analysis

Self-checkout systems generate continuous video streams that need to be analyzed in real time for loss prevention, product recognition, and customer assistance. These systems run on compact, store-level hardware with a limited compute budget. A 6M-parameter model that achieves 69% action recognition accuracy without needing a cloud round-trip fits this deployment profile directly. Multiply that across thousands of checkout lanes, and the cost and latency savings become substantial.

Agriculture: drone-based vision

Precision agriculture increasingly relies on drone-mounted cameras for crop monitoring, disease detection, and yield estimation. Drones operate under strict weight, power, and compute constraints. There is no room for a large model or a reliable cloud connection in the field. A lightweight recursive vision model running directly on the drone's onboard processor can classify crop conditions, detect irrigation issues, and identify pest damage in real time during a flyover, rather than requiring footage to be processed after the fact.

Industrial inspection: real-time quality control

Manufacturing quality inspection often relies on cameras mounted directly on production equipment. These edge devices need to classify defects, detect anomalies, and verify assembly in milliseconds. Recursive reasoning's ability to achieve higher accuracy at a fixed model size means better detection rates on the same hardware, without requiring infrastructure upgrades or introducing cloud dependencies.

Infrastructure and energy: remote monitoring

Pipeline inspection drones, solar panel monitoring systems, and remote wind farm cameras all operate in environments with limited connectivity and compute. Running a capable vision model on-device enables real-time alerting for damage, corrosion, or equipment failure, reducing the time between detection and response from hours to seconds.

What Happens When Models Refine Instead of Scale

Instead of scaling models larger, we tested what happens when they use existing capacity more effectively. We applied recursive reasoning to a standard video classification pipeline and compared it to a baseline under identical conditions.

The problem

Most video AI systems follow a straightforward pipeline: extract features from each frame, aggregate them over time, and classify. Frame-level understanding is shallow—typically reduced to a single summary vector—and the system compensates by stacking temporal layers to capture patterns across time.
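The three-stage pipeline described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the baseline used in the experiments: the function names, the toy dimensions, and the use of mean pooling for both the frame summary and the temporal step are all hypothetical stand-ins (the baseline in this post uses stacked temporal layers rather than a simple mean).

```python
import random

def mean_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def linear(x, weight):
    """Apply a (classes x dim) weight matrix to a feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def baseline_classify(video, weight):
    """video: list of frames; each frame is a list of patch feature vectors."""
    # 1. Frame level: collapse every frame to a single summary vector.
    #    All spatial detail inside the frame is discarded at this point.
    frame_summaries = [mean_vectors(frame) for frame in video]
    # 2. Temporal level: aggregate the per-frame summaries over time.
    #    (Mean pooling is an assumption made to keep the sketch short.)
    clip_vector = mean_vectors(frame_summaries)
    # 3. Classification: pick the class with the highest score.
    logits = linear(clip_vector, weight)
    return max(range(len(logits)), key=logits.__getitem__)

random.seed(0)
T, N, D, C = 8, 16, 4, 51  # frames, patches per frame, feature dim, classes
video = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
         for _ in range(T)]
weight = [[random.gauss(0, 1) for _ in range(D)] for _ in range(C)]
predicted = baseline_classify(video, weight)  # an index in range(C)
```

The point to notice is step 1: once a frame has been reduced to one vector, nothing downstream can recover the relationships between its patches.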

This works, but it leaves spatial information on the table. Frame features are taken at face value, with no mechanism to revisit or refine them before they're passed downstream. The dominant response has been to scale models larger, but that only deepens the tradeoff between accuracy and deployability.

Recent research has shown that architectural efficiency can match or exceed the gains from raw scale. The Tiny Recursive Model (TRM), originally developed at Samsung SAIL Montreal, demonstrated that a 7M-parameter model with a recursive structure could achieve 45% on the ARC-AGI-1 benchmark—a test of abstract reasoning—and remain competitive with far larger systems.

We set out to test whether recursive reasoning could improve video classification accuracy without increasing model size or compute requirements.

What we changed

We modified the standard video classification pipeline in one key way. Instead of extracting a single summary from each video frame and moving on, we applied iterative reasoning cycles over the full spatial representation of each frame before aggregating across time.

In a standard model, one summary token is extracted per frame, and a deeper temporal module (typically three layers) is used to model relationships across frames. In our approach, all spatial detail is retained and passed through a recursive reasoning module—a shared-weight transformer that runs multiple refinement cycles over the same data.

Each cycle sharpens the model's understanding of spatial relationships within the frame. Since the weights are shared across cycles, the model does not grow larger with additional iterations.
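A minimal sketch of the shared-weight idea follows, in plain Python. It is not the module used in this work: the real component is described as a shared-weight transformer, whereas this sketch uses one attention-style mixing step and a single hypothetical matrix W, with a tanh residual update chosen only to keep the example short. What it does show faithfully is that the same weights are reused on every cycle, so the parameter count does not depend on the number of cycles.

```python
import math
import random

def matvec(W, x):
    """Multiply a square matrix W by a vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def refine_once(tokens, W):
    """One refinement cycle over all spatial tokens of a frame."""
    dim = len(tokens[0])
    refined = []
    for query in tokens:
        # Each token attends to every token in the same frame.
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        attn = softmax(scores)
        context = [sum(a * key[d] for a, key in zip(attn, tokens))
                   for d in range(dim)]
        # Residual update through the shared matrix W.
        update = matvec(W, context)
        refined.append([x + math.tanh(u) for x, u in zip(query, update)])
    return refined

def recursive_refine(tokens, W, cycles):
    """Apply the same refinement step `cycles` times with the same W."""
    for _ in range(cycles):
        tokens = refine_once(tokens, W)
    return tokens

def parameter_count(W):
    return sum(len(row) for row in W)

random.seed(0)
N, D = 6, 4  # spatial tokens per frame, feature dimension (toy sizes)
tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
W = [[random.gauss(0, 0.5) for _ in range(D)] for _ in range(D)]

refined_1 = recursive_refine(tokens, W, cycles=1)
refined_4 = recursive_refine(tokens, W, cycles=4)

shared_params = parameter_count(W)        # same for 1 cycle or 4 cycles
stacked_params = 4 * parameter_count(W)   # four separate layers would need 4x
```

More cycles cost more compute per frame but no extra parameters, which is the tradeoff the approach relies on.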

Results

We evaluated both models on HMDB51, a widely used benchmark for recognizing 51 human actions (e.g., running, jumping, climbing) across varied video conditions. All experiments used identical training configurations to isolate the architecture’s impact.

The results show that stronger frame-level representations reduce reliance on temporal depth. Our model uses a single temporal layer rather than three, yet still outperforms the baseline.

We also found that recursive reasoning amplifies the impact of pretrained features. With pretrained weights, the recursive model improves by 29 percentage points over its non-pretrained version, compared to 20 points for the baseline. Without pretraining, both models perform similarly at around 40 percent. This suggests that recursive reasoning strengthens strong representations rather than replacing them.
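The figures quoted in this post can be cross-checked with simple arithmetic. The sketch below assumes that the 69% accuracy cited earlier for the 6M-parameter model is the pretrained recursive result on HMDB51, and that the 11-point gain is measured against the pretrained baseline. Neither assumption is stated explicitly, so the derived values are implied estimates rather than reported results.

```python
# Figures taken from the text (all in percent or percentage points).
recursive_pretrained = 69.0        # accuracy quoted for the 6M-parameter model
gain_over_baseline = 11.0          # recursive vs. baseline
recursive_pretraining_gain = 29.0  # recursive: pretrained vs. non-pretrained
baseline_pretraining_gain = 20.0   # baseline: pretrained vs. non-pretrained

# Implied values under the assumptions named in the lead-in.
baseline_pretrained = recursive_pretrained - gain_over_baseline
recursive_scratch = recursive_pretrained - recursive_pretraining_gain
baseline_scratch = baseline_pretrained - baseline_pretraining_gain

# Percentage points are not the same as relative improvement.
relative_gain = gain_over_baseline / baseline_pretrained * 100

print(f"implied pretrained baseline:      {baseline_pretrained:.0f}%")
print(f"implied non-pretrained recursive: {recursive_scratch:.0f}%")
print(f"implied non-pretrained baseline:  {baseline_scratch:.0f}%")
print(f"relative gain over baseline:      {relative_gain:.1f}%")
```

Under these assumptions the two non-pretrained models land at 40% and 38%, which is consistent with the statement that both perform similarly at around 40 percent, and the 11-point gain corresponds to roughly a 19% relative improvement.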

What Makes This Approach Distinct

Recursive reasoning itself is not new. What’s unique here is how it’s applied to video models and how the model is designed and tested.

Implications for Model Design

These findings extend beyond a single experiment. They point to a broader set of implications for how models are designed and computation is used.

Practical Takeaways for Deployment

For teams evaluating AI architectures for video or edge deployment, our research points to a few takeaways:

Extending Recursive Reasoning Beyond Video

Video classification is one entry point, but the recursive reasoning architecture is not limited to a single modality. We are already applying the same approach in other settings, with early results that suggest it carries over.

In earth observation, we combined recursive reasoning with Adaptive Fourier Neural Operators (AFNO) for satellite image reconstruction. The model achieves strong reconstruction quality (89.2% perceptual score, 29.0 dB PSNR) on the GID-15 land cover dataset using only 4M parameters, compared with the 20M+ typically required. This has clear relevance for applications such as climate monitoring, agricultural land-use classification, and infrastructure change detection from satellite data.
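For readers less familiar with PSNR, the standard definition is 10 · log10(MAX² / MSE), where MAX is the largest possible pixel value and MSE is the mean squared error between the reference and the reconstruction. The sketch below assumes 8-bit pixel values (MAX = 255); the post does not state the pixel range used in the experiment, and the helper names are illustrative.

```python
import math

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists."""
    mse = sum((r - x) ** 2
              for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def rmse_for_psnr(psnr_db, max_value=255.0):
    """Root-mean-square pixel error that corresponds to a given PSNR."""
    return max_value / (10 ** (psnr_db / 20.0))

# A constant error of 10 grey levels on every pixel.
example_db = psnr([100.0] * 4, [110.0] * 4)

# Under the 8-bit assumption, 29.0 dB corresponds to an RMSE of about 9.05.
implied_rmse = rmse_for_psnr(29.0)
```

Read this way, a 29.0 dB score would mean the reconstruction is off by roughly 9 grey levels out of 255 in the root-mean-square sense, if the 8-bit assumption holds.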

In biomedical imaging, we evaluated the approach on PatchCamelyon, a histopathology benchmark for detecting cancer metastases in lymph node tissue. The model delivered competitive performance at a fraction of the parameter count of standard approaches—highlighting its potential in clinical environments where compute, privacy, and turnaround time matter.

We are also beginning to apply the same idea to audio, treating spectrograms as visual inputs and refining them iteratively.

Across these domains, the common thread is not the data itself, but how the model uses its capacity. Repeated refinement over the same representation can be more effective than simply increasing model size.

The next question we want to answer is one of scale. Our current results use a 6M-parameter model on HMDB51—one of the most challenging action recognition benchmarks due to its fine-grained categories and variability in viewpoint, lighting, and video quality.

The natural follow-up is: can a scaled recursive model match or exceed the best reported results on HMDB51, and how many fewer parameters would it require compared to current leading approaches? If the 11-point gain we observed at 6M parameters holds as models scale, the implications for parameter efficiency in production video systems would be significant.

Beyond vision, we are exploring whether the same principle applies to large language models (LLMs). The core idea—reusing shared weights across iterative refinement cycles instead of scaling depth—maps directly to the challenge of building capable LLMs that can run on-device. If recursive reasoning can deliver meaningful quality gains at fixed parameter counts for language tasks, it could provide a path toward smaller, faster models that retain performance without relying on compression techniques that introduce tradeoffs.

Our work at the BCG X AI Science Institute points to a different direction for model design. Progress does not need to come only from increasing model size. It can also come from how computation is applied within the model itself. Recursive reasoning shows that stronger results can come from making better use of existing model capacity, rather than expanding parameter count. As this approach extends across domains, it offers a way to build systems that are not only more capable, but also better aligned with the constraints they operate under.

This research was conducted by the BCG X AI Science Institute as part of an ongoing study into efficient model architectures for visual understanding.