Visual understanding: Unlocking the next frontier in AI



At the NYC AIAI Summit, Joseph Nelson, CEO & Co-Founder of Roboflow, took the stage to spotlight a critical but often overlooked frontier in AI: vision.

In a field dominated by breakthroughs in language models, Nelson argued that visual understanding – or how machines interpret the physical world – is just as essential for building intelligent systems that can operate in real-world conditions.

From powering instant replay at Wimbledon to enabling edge-based quality control in electric vehicle factories, his talk offered a grounded look at how visual AI is already transforming industries – and what it will take to make it truly robust, accessible, and ready for anything.

Roboflow now supports a million developers. Nelson walked through what some of them are building: real-world, production-level applications of visual AI across industries, open-source projects, and more. These examples show that visual understanding is already being deployed at scale.

Three key themes in visual AI today

Nelson outlined three major points in his talk:

The long tails of computer vision. In visual AI, long-tail edge cases are a critical constraint. These rare or unpredictable situations limit the ability of models, including large vision-language models, to fully understand the real world.
What the future of visual models looks like. A central question is whether one model will eventually rule them all, or whether the future lies in a collection of smaller, purpose-built models. The answer will shape how machine learning is applied to visual tasks going forward.
Running real-time visual AI at the edge. Nelson emphasized the importance of systems that run on your own data, in real time, at the edge. This isn’t just a technical detail; it’s foundational to how visual AI will be used in the real world.

Where AI meets the real world: The role of computer vision

Joseph Nelson framed computer vision as the point where artificial intelligence intersects directly with the physical world. “At Roboflow, we think that computer vision is where AI meets the real world,” he explained.

He emphasized that vision is a primary human sense, predating even language, and pointed out that some civilizations thrived without a written system, relying instead on visual understanding. That same principle applies when building software systems: giving them vision is like giving them read access to the physical world.

This visual “read access” enables software to answer practical questions:

How many people are in a conference room?
Were a set of products manufactured correctly?
Did objects make their way from point A to point B?

Nelson illustrated this with a range of real-world examples:

Counting candies produced in a shop
Validating traffic flows and lane usage
Evaluating a basketball player’s shot
Measuring field control in a soccer match

Despite the diversity of these applications, the common thread is clear: each uses visual understanding to generate actionable insights.
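As a concrete (if simplified) illustration, the sketch below answers the first of those questions, counting people in a room, using Roboflow’s open-source inference package together with the Supervision library; the model ID and image path are illustrative assumptions, not details from the talk.

```python
# Minimal people-counting sketch using Roboflow's open-source `inference`
# package and Supervision (model ID and image path are illustrative).
import cv2
import supervision as sv
from inference import get_model

model = get_model(model_id="yolov8n-640")   # generic COCO-trained detector
image = cv2.imread("conference_room.jpg")

result = model.infer(image)[0]
detections = sv.Detections.from_inference(result)

# Count only "person" detections to answer "how many people are in the room?"
class_names = detections.data["class_name"]
print("People in the room:", int((class_names == "person").sum()))
```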

How Roboflow powers visual AI at scale

Roboflow’s mission is to provide the tools, platform, and solutions that enable enterprises to build and deploy visual AI. According to Nelson, users typically approach Roboflow in one of two ways:

By creating open-source projects, much like publishing code on GitHub
By building private projects for internal or commercial use

On the open-source front, Roboflow has become the largest community of computer vision developers on the web. The scale is significant:

Over 500 million user-labeled and shared images
More than 200,000 pre-trained models available

This ecosystem provides Roboflow with a unique insight into how computer vision is being applied, where it encounters challenges, and where it delivers the most value.

Roboflow also serves a wide range of enterprise customers. Nelson shared that more than half of the Fortune 100 have built with Roboflow, especially in domains grounded in the physical world, such as:

Retail operations
Electric vehicle manufacturing
Product logistics and shipping

Backed by a strong network of investors, Roboflow continues to grow its platform and support the expanding needs of developers and businesses working at the intersection of AI and the real world.


Enabling the future of visual AI

Joseph Nelson outlined Roboflow’s vision for the future of visual AI: empowering developers to build models tailored to the specific scenes and contexts they care about. For this future to materialize, he emphasized two critical factors:

Making it easy to build models that understand the right context
Making it just as easy to deploy those models into larger systems

“When you do that,” Nelson said, “you give your software the ability to have read access to the real world.”

Open source as the backbone of progress

Roboflow’s commitment to open source is a core part of that strategy. One of their flagship open-source tools is a package called Supervision, designed to help developers process detections and masks and integrate them into broader systems.

“We’re big believers in open source AI,” Nelson explained, “as a way by which the world understands everything this technology can do.”

To illustrate adoption, he compared Supervision’s usage with that of PyTorch Vision, a widely recognized set of tools in the machine learning community. “The purple line is Supervision. The red line is PyTorch Vision. And we’re proud that some of the tools that we make are now as well recognized across the industry.”
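To ground that in code, here is a minimal sketch of the kind of post-processing Supervision handles: taking raw detections from a detector and turning them into annotated output. The detector choice (Ultralytics YOLOv8) and file paths are assumptions for the example, not specifics from the talk.

```python
# Supervision sketch: wrap a detector's raw output in sv.Detections and
# annotate the frame (detector and paths are illustrative assumptions).
import cv2
import supervision as sv
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
image = cv2.imread("frame.jpg")

result = model(image)[0]
detections = sv.Detections.from_ultralytics(result)

# Draw boxes and class labels so the detections can feed dashboards,
# alerts, or whatever larger system sits downstream.
annotated = sv.BoxAnnotator().annotate(scene=image.copy(), detections=detections)
annotated = sv.LabelAnnotator().annotate(scene=annotated, detections=detections)
cv2.imwrite("annotated.jpg", annotated)
```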

A growing ecosystem of visual AI tools

Beyond Supervision, Roboflow maintains a suite of open-source offerings, including:

Trackers for following objects across scenes
Notebooks for visual understanding tasks
Inference tools for deployment
Maestro, a package for fine-tuning vision-language models (VLMs)

Together, these tools aim to simplify the entire visual AI pipeline, from training and understanding to deployment and fine-tuning, anchored in a philosophy of accessibility and openness.
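As one hedged example of how these pieces compose, the sketch below follows objects across the frames of a video using Supervision’s ByteTrack wrapper, standing in for the tracking tools listed above; the detector and video path are assumptions.

```python
# Tracking sketch: assign persistent IDs to objects across video frames
# (detector choice, tracker, and video path are illustrative assumptions).
import supervision as sv
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
tracker = sv.ByteTrack()

for frame in sv.get_video_frames_generator("loading_dock.mp4"):
    result = model(frame)[0]
    detections = sv.Detections.from_ultralytics(result)
    detections = tracker.update_with_detections(detections)
    # tracker_id is stable per object, which is what lets downstream logic
    # answer questions like "did this pallet get from point A to point B?"
    print(detections.tracker_id)
```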

The range and reality of visual AI in action

Joseph Nelson’s perspective on visual AI is grounded in scale. Roboflow has visibility into hundreds of millions of images, hundreds of thousands of datasets, and hundreds of thousands of models. This massive footprint offers deep insight into how visual AI is actually being applied in the real world.

These applications span industries and creativity levels alike, ranging from aerial understanding to pill counting in pharmaceutical settings and defect detection during product manufacturing. “And a whole host of things in between,” Nelson added.

From silly to serious: A spectrum of use cases

Roboflow, as an open-source platform, supports a wide range of users: from hobbyists working on whimsical experiments to enterprises solving mission-critical problems.

Nelson highlighted this spectrum with a few vivid examples:

Dice roll trackers for D&D (Dungeons and Dragons): Systems that visually detect dice results during tabletop gaming sessions.
Roof damage detection: Insurance companies use drone imagery to assess hailstorm-related roof damage.
Vehicle production monitoring: Manufacturers inspect parts, such as those produced by stamping presses on car doors, to detect defects in real time.

Flamethrowers and laser pointers: A creative approach

Among the more eccentric community creations:

Flamethrower weed-killing robots. Two different developers used Roboflow to design robots that identify and eliminate weeds using flamethrowers. “Why pull them out,” Nelson recounted, “when I can use a flamethrower to eliminate said weeds?”
Cat exercise machine. A developer, during COVID lockdowns, built an exercise tool for his cat by mounting a laser pointer on a robotic arm. A computer vision model tracked the cat’s position and aimed the laser ten feet away to keep the cat moving. “This one does come with a video,” Nelson joked, “and I’m sure it’s one of the most important assets you’ll get from this conference.”

These projects highlight accessibility. “A technology has really arrived,” Nelson noted, “when a hacker can take an idea and build something in an afternoon that otherwise could have seemed like a meme or a joke.”

Nelson continued with more creative and practical applications of Roboflow-powered models:

Drone-based asset detection. Models identify swimming pools and solar panels in aerial imagery, useful for projects like LEED (Leadership in Energy and Environmental Design) certification, which measures the renewable energy contributions of buildings.
Zoom overlays via OBS. By integrating Roboflow with OBS Studio (Open Broadcaster Software), users can overlay effects during video calls. For instance, specific hand gestures can trigger on-screen animations, like “Lenny,” Roboflow’s purple raccoon mascot, popping into view.
Rickroll detection and prevention. For April Fools’ Day, the Roboflow team built a browser-based model that detects Rick Astley’s face and automatically blacks out the screen and mutes the sound, preventing the infamous “Rickroll.” This demo uses Inference JS, a library that runs models entirely client-side in the browser. It even works offline.

While playful, this approach also has commercial applications, such as content moderation, page navigation, and web agent vision for interpreting and interacting with on-page elements.

Some of Roboflow’s early inspiration came from augmented reality. Nelson shared an example of an app that uses Roboflow tools to solve AR Sudoku puzzles. Here, the front end is AR, but the back end is computer vision, which detects puzzle edges, identifies number placements, and performs a breadth-first search to solve the board.

“It’s an example of using models in production in a fun way,” Nelson said.


From demos to deployment: Roboflow in the enterprise

On the enterprise side, Roboflow supports large-scale deployment at companies like Rivian, the electric vehicle manufacturer. Over the last three years, Rivian has used Roboflow to deploy approximately 300 models into production at its Normal, Illinois, general assembly facility.

The use cases are essential to ensuring product quality:

Verifying the correct number of screws in a battery
Detecting bolts left on the factory floor (which can cause vehicles to jam or collide)
Checking that paint is applied evenly and correctly

“There are hundreds and hundreds of checks that need to take place,” Nelson explained, “to ensure that a product is made correctly before it makes its way to an end user.”

Roboflow’s tools are designed not just for experimentation, but for deployment at scale. Joseph Nelson emphasized how companies like Rivian are taking advantage of this flexibility by pairing manufacturing engineers with technology teams. The result? Rapid iteration:

“I see a problem, I capture some data, I train the model, and I deploy it in an afternoon.”

This agility enables teams to solve a wide variety of small but critical manufacturing problems without waiting for long development cycles.
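To give a rough sense of that loop in code, the sketch below uses the roboflow Python SDK with placeholder API key, workspace, project name, and version number; the talk did not specify these calls, so treat this as an assumption about one possible workflow rather than Rivian’s actual setup.

```python
# Rough capture -> train -> deploy loop with the `roboflow` Python SDK.
# API key, workspace, project name, and version number are placeholders.
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("your-workspace").project("door-panel-defects")

# Capture: push newly collected images into the dataset.
project.upload("stamping_press_frame.jpg")

# Train: training runs on the Roboflow platform; grab a trained version here.
model = project.version(1).model

# Deploy: run the trained model against a fresh frame.
prediction = model.predict("new_frame.jpg", confidence=40)
print(prediction.json())
```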

Self-checkout at scale: Visual AI at Walmart

Roboflow also powers instant checkout kiosks at Walmart. Describing the company playfully as an “up-and-coming retailer,” Nelson explained that Walmart’s kiosks use cameras to detect every item in a shopping cart as it enters a designated scanning zone. The system then generates a receipt automatically.

This challenge is far from trivial. Walmart stores carry hundreds of thousands of SKUs, making this a true test of model generalization and accuracy in real-world conditions. Despite the complexity, Nelson noted with pride that these systems have made it into production at multiple locations, with photo evidence to prove it.

Instant replay at major tennis events

Another high-profile deployment: instant replay at Wimbledon, the US Open, and the French Open (Roland Garros). Roboflow’s computer vision models are used to:

Detect the positions of tennis players and the ball
Map out the court in real time
Enable automated camera control and clip selection for ESPN+ replays

This use case reinforces the versatility of visual AI. Nelson even drew a humorous parallel:

“Remember that cat workout machine? Turns out it’s actually not so different from the same things that people enjoy at one of the world’s most popular tennis events.”

From hacks to healthcare: More unconventional use cases

Nelson wrapped this section with two more examples that demonstrate visual AI’s creative range, from civic life to scientific research.

Tennis court availability alerts. A hacker in San Francisco set up a camera on his windowsill to monitor public tennis courts. Using a custom-trained model, the system detects when the court is empty and sends him a text, so he can be the first to play. (A hedged sketch of this loop follows after the list.)
Accelerating cancer research. At a university in the Netherlands, researchers are using computer vision to count neutrophils, a type of white blood cell, in images captured after lab experiments. Previously, this required manual analysis by graduate students. Now, automation speeds up the review process, allowing faster iteration and contributing, albeit in a small way, to accelerating treatments for critical diseases.
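Here is the promised sketch of the court-watching loop: poll a camera, check for people, and send a notification when the court empties. The detector, camera source, polling interval, and notify helper are all hypothetical placeholders rather than details of the actual project.

```python
# Court-availability sketch: poll a camera, detect people, and notify
# when the court empties (model, camera source, and notify() are
# hypothetical placeholders).
import time
import cv2
from ultralytics import YOLO

def notify(message: str) -> None:
    # Placeholder: swap in an SMS gateway, webhook, or push service.
    print("ALERT:", message)

model = YOLO("yolov8n.pt")        # COCO-pretrained; class 0 is "person"
camera = cv2.VideoCapture("rtsp://windowsill-camera/stream")
court_was_busy = True

while True:
    ok, frame = camera.read()
    if not ok:
        break
    result = model(frame)[0]
    players = [box for box in result.boxes if int(box.cls) == 0]
    if not players and court_was_busy:
        notify("Court looks empty - time to play!")
    court_was_busy = bool(players)
    time.sleep(60)                # check roughly once a minute
```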

What comes next for visual AI?

With a wide array of applications, ranging from retail to robotics to research, visual AI is already shaping various industries. As Joseph Nelson transitioned to the next part of his talk, he posed the central question:

“So what gives? What’s the holdup? What happens next, and where is the field going?”

That’s the conversation he takes on next.

As visual AI moves from research labs into production, it must contend with the messy, unpredictable nature of the real world. Joseph Nelson explained this challenge using a familiar concept: the normal distribution.

When it comes to visual tasks, the distribution is far from tidy. While some common categories, like people, vehicles, and food, fall within the high-density “center” of the curve (what Nelson calls “photos you might find on Instagram”), there’s a very long tail of edge cases that AI systems must also learn to handle.

These long tails include:

Hundreds of thousands of unique SKUs in retail environments like Walmart
Novel parts produced in advanced manufacturing settings like Rivian’s facilities
Variable lighting conditions for something as niche as solving a Sudoku puzzle


Why visual data is uniquely complex

Nelson also emphasized a fundamental difference between visual and textual data: density and variability. While text can be encoded compactly in Unicode, even a single image contains vastly more data, with RGB channels, lighting nuances, textures, and perspective all influencing interpretation.

“From a very first principles view, you can see that the amount of richness and context to understand in the visual world is vast.”
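Some back-of-the-envelope arithmetic (my numbers, not Nelson’s) makes the gap concrete:

```python
# Illustrative arithmetic: raw bytes in a short paragraph of text versus
# one uncompressed 720p RGB frame (numbers are mine, not from the talk).
paragraph_chars = 500
text_bytes = paragraph_chars * 1            # ~1 byte per ASCII character

width, height, channels = 1280, 720, 3
image_bytes = width * height * channels     # 2,764,800 bytes of raw pixels

print(f"text:  {text_bytes:,} bytes")
print(f"image: {image_bytes:,} bytes")
print(f"ratio: ~{image_bytes // text_bytes:,}x more raw data per image")
```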

Because of this complexity, fully reliable zero-shot or multi-shot performance in visual AI remains aspirational. To achieve dependable results, teams must often rely on:

Multi-turn prompting
Fine-tuning
Domain adaptation

Put simply, Nelson said, “The real world’s messy.”

When small errors become big problems

The stakes for accuracy grow when visual AI is used in regulated or high-risk environments. Nelson shared an example from a telepharmacy customer using Roboflow to count pills.

This customer is accountable to regulators for correctly administering scheduled substances. A count that’s off by just one or two pills per batch may seem minor, but when that error scales across all patients, it creates major discrepancies in inventory.

Worse, Nelson noted a surprising discovery: telepharmacists with painted long nails occasionally caused the model to misclassify fingers as pills. This kind of visual anomaly wouldn’t appear in a clean training dataset, but it’s common in real-world deployments.

This example illustrates a critical principle in visual AI:

Real-world reliability requires robust systems trained on representative data and a readiness for edge cases that defy textbook distributions.

Why existing datasets aren’t enough

Joseph Nelson pointed to a key limitation in visual AI: many models are trained and evaluated on narrow, conventional datasets that don’t reflect the full diversity of the real world.

A common reference point is the COCO (Common Objects in Context) dataset, created by Microsoft between 2012 and 2014. It includes 80 familiar classes such as person, food, and dog. Nelson described this as data that falls “right up the center of the bell curve.”

But the world is far messier than that. To push the field forward, Nelson argued, we need more than common object detection; we need better evaluation frameworks that test whether models can adapt to novel domains.

Introducing RF100VL: A new benchmark for visual AI

To address this, Roboflow introduced RF100VL, short for Roboflow 100 Vision Language. While the name may not be flashy, the purpose is clear: provide a more accurate view of model performance in the real world.

RF100VL consists of:

100 datasets
100,000 images
Coverage of diverse domains, such as:
Microscopic and telescopic imagery
Underwater environments
Document analysis
Digital worlds
Industrial, medical, aerial, and sports contexts
Flora and fauna categories

This dataset reflects how developers and companies are actually using computer vision. Because Roboflow hosts the largest community of computer vision developers, RF100VL is built from real-world production data and maintained by the community itself.

“If COCO is a measurement of things you would commonly see,” Nelson said, “RF100VL is an evaluation of everything else, the messy parts of the real world.”

Testing zero-shot performance

Nelson highlighted the limitations of current models by sharing evaluation results. For example, the best-performing zero-shot model on RF100VL (Grounding DINO) achieved just 15.7% accuracy, with no fine-tuning, a setup meant to measure a model’s raw ability to generalize across unseen tasks.

“Even the best models in the world today still lack the ability to do what’s called grounding and understanding of relatively basic things.”

RF100VL aims to fill that gap. It allows developers to hold large models accountable and measure whether they generalize well across domains.
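For a concrete sense of what the zero-shot setting looks like, the sketch below prompts an open-vocabulary detector with free-text class names and takes whatever comes back, with no fine-tuning. It uses the Hugging Face Transformers port of Grounding DINO; the checkpoint, prompt, and thresholds are my assumptions, not the harness behind the RF100VL numbers.

```python
# Zero-shot detection sketch with Grounding DINO via Hugging Face
# Transformers (checkpoint, prompt, and thresholds are illustrative;
# this is not necessarily the RF100VL evaluation setup).
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

checkpoint = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForZeroShotObjectDetection.from_pretrained(checkpoint)

image = Image.open("factory_floor.jpg")
text = "a loose bolt. a screw."   # free-text classes, lowercase, "."-separated

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)[0]
print(results["labels"], results["boxes"])
```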

Scaling like language models

So how can we improve visual understanding? Nelson pointed to the same approach used in language AI:

Pre-training at a large scale
Expanding datasets to include the long tail
Providing rich, diverse context for model adaptation

In the same way large language models improved through massive, inclusive training sets, visual AI must follow suit. Nelson emphasized that context is king in this process.

Tools for diagnosing model gaps

To make evaluation accessible, Roboflow launched visioncheckup.com, a playful but practical tool for visual model assessment. “It’s like taking your LLM to the optometrist,” Nelson said. The site simulates a vision test, showing where a model struggles and where it succeeds.

While some visual tasks, such as counting or visual question answering, still trip up models, Nelson noted that progress is accelerating. And he made his bet clear:

“I would bet on the future of open science.”

Why real-time performance at the edge matters

As the talk turned toward deployment, Nelson highlighted one major difference between language and vision models: where and how they run.

Language tasks often benefit from:

Cloud hosting
Access to extensive compute resources
Room for delayed, test-time reasoning (e.g., web lookups)

But visual systems frequently need to operate:

In real time
On edge devices
With no reliance on cloud latency or large compute pools

“Models need to run fast, real time, and often at the edge,” Nelson said.

This makes deployment requirements for visual AI fundamentally different and much more constrained than those for language-based systems.

Why edge performance matters

Joseph Nelson closed his talk by returning to a central constraint in visual AI: latency.

Unlike many language-based AI tasks, visual systems often have to operate under real-time conditions, especially in settings like live sports broadcasts. For example, Wimbledon doesn’t have the luxury of cloud processing, even with high-speed internet. Every frame must be processed live, with sub–10 millisecond latency.

This creates a pressing requirement: models that run efficiently on constrained hardware, without sacrificing accuracy.

“You need to have models that can perform in edge conditions… on smaller or constrained compute and run in real time.”

Edge deployment isn’t just for industrial hardware, like NVIDIA Jetsons. Nelson highlighted another vision: empowering individual developers to run models on their own machines, locally and independently.

“We can actually kind of power the rebels so that anyone can create, deploy, and build these systems.”

Introducing RF-DETR: Transformers at the edge

To address the challenge of bringing large-model performance to real-time use cases, Roboflow developed RF-DETR, short for Roboflow Detection Transformer. This model is designed to combine the contextual strength of transformers with the speed and deployability needed at the edge.

Nelson contrasted it with existing models, like the YOLO family, which are built around CNNs and optimized for speed. RF-DETR aims to bring the pre-training depth of transformers into the real-time performance zone.

“How do we take a large transformer and make it run in real time? That’s what we did.”
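As a usage sketch, the snippet below follows the pattern of the open-source rfdetr package as I understand it; treat the class and method names (RFDETRBase, predict) as assumptions to check against the package documentation rather than confirmed API.

```python
# RF-DETR inference sketch. The `rfdetr` package API shown here
# (RFDETRBase, predict returning sv.Detections) is my best understanding
# and should be verified against the package docs.
from PIL import Image
from rfdetr import RFDETRBase

model = RFDETRBase()                         # COCO-pretrained checkpoint
image = Image.open("assembly_line.jpg")

detections = model.predict(image, threshold=0.5)
print(len(detections), "objects detected")
print(detections.class_id, detections.confidence)
```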

RF-DETR was benchmarked on both:

Microsoft COCO for conventional object detection
RF100VL to measure real-world adaptability


Bridging the gap to visual AGI

In wrapping up, Nelson tied together the major themes of his talk:

Better datasets (like RF100VL)
Better models (like RF-DETR)
Deployment flexibility, including on constrained and local hardware

Together, these advancements move us beyond the metaphor of “a brain in a jar.” Instead, Nelson described the vision for a true visual cortex, a key step toward real-world AI systems that can see, reason, and act.

“When we build with Roboflow… you’re a part of making sure that AI meets the real world and delivers on the promise of what we know is possible.”

Final thoughts

Joseph Nelson closed his talk at the NYC AIAI Summit with a clear message: for AI to meet the real world, it must see and understand it. That means building better datasets, such as RF100VL, creating models that generalize across messy, real-world domains, and ensuring those models can run in real time, often at the edge.

From live sports broadcasts to pharmaceutical safety checks, and from open-source cat toys to advanced vehicle assembly lines, the breadth of visual AI’s impact is already vast. But the work is far from over. As Nelson put it, we’re still crossing the bridge, from large models as “brains in a jar” to intelligent systems with a working visual cortex.

By contributing to open-source tools, adapting models for deployment in the wild, and holding systems accountable through realistic evaluations, developers and researchers alike play a crucial role in advancing visual understanding. Roboflow’s mission is to support that effort, so that AI not only thinks, but sees.


