The Path to Physical AI on the Factory Floor

published May 27, 2026 In

Digital & AI

Byron Winn

Digital & AI The Path to Physical AI on the Factory Floor

Digital & AI The Path to Physical AI on the Factory Floor

The Path to Physical AI on the Factory Floor

In previous explorations of agentic AI in manufacturing, I’ve traced the evolution of intelligence from scheduling (S&OP) and planning down to the digital nervous systems of networked machinery on the factory floor. Yet the ultimate horizon isn’t just a smarter spreadsheet or more communicative machinery. It’s physical AI.

Physical or “embodied” AI refers to systems that can perceive, reason, and act within the three-dimensional, physical world. These are the robots and autonomous machines that move with the nuance of a human and the precision of a computer.

While autonomous humanoid robots are not available yet, enormous investment is going into foundational models, compute/chips, and robot bodies or platforms by companies like Figure AI, Tesla, Google, Boston Dynamics, NVIDIA, and more. This is the ultimate vision, but a quiet, fundamental bottleneck has emerged: to move from a digital brain to a physical body, AI requires a different kind of fuel.

Foundational physical AI data is required to train the models and fine-tune the robots.

Understanding the data bottleneck

Large language models (LLMs) lean heavily on the internet. They learn from symbolic content like text, code, and images. Physical AI, however, learns from sensorimotor experience: what was seen, what action was taken, what happened next, and whether the action succeeded in the real or simulated world. It needs to understand that when a gripper applies X amount of force, Y happens to the material.

This requires time-aligned, multimodal, action-labeled data — a granular record of observations, actions, context, and outcomes.

This data does not exist on the internet, and it’s what’s necessary to train the foundational models of physical AI. It is expensive to harvest and even harder to scale.

The rise of VLA models

Vision-Language-Action (VLA) models are the current frontier for robotics. These systems take visual inputs and language instructions (e.g., “Tighten the bolt on the interior housing”) and translate them directly into robotic motor commands. They map robot observations to actions while retaining the benefits of large-scale vision-language pretraining.

That pretraining is the challenge. VLA models are typically trained on large robot demonstration datasets gathered across many tasks and robot types, like the Open X-Embodiment dataset, which contains more than one million real robot trajectories spanning 22 robot embodiments contributed by 34 labs. While datasets like this are promising, there still aren’t enough of them. As NVIDIA’s Jim Fan has noted, data is the primary constraint for robot foundation models.

Current data strategies

Recognizing the need for this data, industry leaders are experimenting with four primary methods to build the type of training data repositories they need to train physical models. However, each of these current strategies comes with its own set of compromises:

Teleoperation/demonstrations: Collecting and aggregating data from robot embodiments. This method produces high-quality data but is slow, labor-intensive, and expensive to scale to the levels required for general-purpose intelligence. It is also inherently limited by what robots can currently actually do.
Simulation: Creating digital twins of robots that can practice millions of times every second. While this can be an appealing solution to the constraints of collecting physical data, the “sim-to-real gap” — the subtle differences between digital physics and the messy real world — is a persistent hurdle, and it remains costly to build simulators.
Synthetic data: Using AI to dream up new scenarios based on a small seed of real-world data. This method depends on real teleoperated data to post-train the model, and the seed data must be high-fidelity, physics-aware, diverse, and representative, which means results are only as good as the initial seed. It also requires extensive computational resources.
World models: Predicting robot observations and task outcomes. While world models show promise, they often struggle with unfamiliar objects and navigation errors over longer periods of time.

AI leaders are working urgently to overcome this data problem, but the expense, speed, and lack of scalability continue to hold organizations back.

Breaking through the data bottleneck

If robots and simulations can’t yet provide enough data, why not look at the experts already on the floor?

Companies are using human video capture — often from the worker’s perspective — to train their models. By recording a veteran technician performing a complex assembly, it’s possible to capture the gold standard of physical action. Figure, for example, has done this to build training data for their Helix VLA model, and they’ve reported zero-shot human-video-to-robot transfer for navigation.

However, raw video is just noise until it’s labeled and searchable for operational knowledge.

Historically, this meant tedious manual annotation. That process was expensive and difficult to scale — so much so that there have been attempts to crowdsource the work. Now, AI is capable of largely taking it over.

Modern systems can now pre-label video, chunking it into task-sized segments and labeling what task is being performed, what equipment is used, what actions are taken by the worker, what anomalies or deviations occurred, and the outcome of the task. AI systems are increasingly capable of pre-labeling data and selecting only the most informative examples for human review. This makes building physical AI datasets easier, faster, and more manageable at scale.

SOLUTION SPOTLIGHT

AI ambition is easy. AI ROI isn’t. Catalant Forward Deployed Experts close the gap.

Learn More

Setting the foundation today for physical AI tomorrow

Even when it’s possible to start building these datasets, there is often pushback about building datasets for robots that don’t yet exist. But if the training doesn’t exist, it’s impossible for the robot builders to make robots that are fit for the tasks at hand.

Don’t wait for the robot to arrive to begin documenting the work. Physical AI data — specifically labeled, point-of-view video — is immensely valuable today for human-centric operations, especially in areas where fixed cameras cannot reach. It presents opportunities to make work processes safer and more productive today while also being useful in training the robots of the future.

Consider these high-stakes environments:

Aerospace: Fastening and sealing inside fuel tanks or closed structures
Medical devices: Assembling catheters or tubing inside lumens or hub connections
Automotive: Installing wiring harnesses, connectors, or clips inside body cavities or dash structures
Shipbuilding: Welding in confined spaces or fitting inside hull sections or tanks

These activities are common and often critical. When you document and label these activities, you aren’t just training a future robot; you are creating searchable operational knowledge that can unlock improvements in:

Production operations: Efficiency, throughput, cycle time, schedule adherence
Safety and regulatory compliance: Compliance improvements, identification and remediation of safety risks
Asset management: Downtime, maintenance execution, reliability

The immediate impact

Labeled physical data allows you to document how work is being done, rather than just logging the final result. This shifts the focus from post-mortem analysis to real-time variance reduction and compliance.

What does this look like? Consider an aerospace final assembly line. The ability to see exactly how a fastener was installed in a blind cavity could save millions per year. It avoids non-conformance investigations, reduces out-of-station rework, and mitigates the risk of high-consequence safety incidents.

Furthermore, this data can feed into existing agentic systems, providing a better understanding of what’s really happening. Conventional agents know what should happen based on ERP, MRP, and MES data, along with other forms and logs, but can still miss what is actually happening. Direct, contextual shop floor data collection improves visibility into actual execution and gives AI systems a more accurate basis for analysis, orchestration, and decision-making.

Selecting a starting point

This type of data collection requires coordinated effort, and most manufacturers start with a pilot program to prove value.

To select the best-fit processes or activities to prioritize, first compile a set of activities that are difficult to observe, yield variable outcomes, and put value at risk. Then, score them against each other for:

Value at stake: What is the economic exposure (volume X deviation rate X cost per deviation, including the cost of downtime)?
Variance: How much do outcomes vary across different operators or crews?
Task-boundary clarity: Can you clearly and effectively identify tasks and deviations to label actions and objects?
Correctability: Can systemic deviations be reliably and cost-effectively corrected?
Time-to-value: Can a pilot or demonstration show measurable benefit in 90 to 180 days?
Automation adjacency: Does this process lend itself to agentic automation or assistance?
Change management feasibility: How difficult will it be to deploy wearables, gain workforce acceptance, and integrate data into QA?

How you weigh these elements depends on your objectives and resources. For initial projects, I would recommend prioritizing areas with high potential impact — those where deviations could be corrected with training, aids, or process redesign or where value could be generated quickly via rework reduction or yield improvement — and areas that are natural fits for automation and robotics.

Constructing a new industrial foundation

Building physical AI datasets is the foundational work of the next decade. It is a rare “no regrets” move: short-term, it secures your operations, improves safety, and slashes the cost of quality; long-term, it feeds the proprietary brain that will eventually inhabit an autonomous factory.

The path to robotics starts with capturing intelligence about the work already being done on the factory floor. Agentic management of these activities isn’t just a side benefit — it’s the first economically defensible step into the future of manufacturing.

Take the next step toward the factory of the future.

Get in touch

Meet the Author

A principal at eos consulting, Byron Winn has worked with global OEMs such as John Deere and Cummins (as well as industrial suppliers deep in the value chain) to build and execute growth strategies (where to compete, how to win), manufacturing strategies (what to build where, and how), and go to market programs (including market mapping and segmentation; value propositions; channel and marketing strategies). He is a former fighter pilot and a Harvard Ph.D.

What is the primary strategic bottleneck preventing the transition from digital AI to physical AI in manufacturing?

The lack of sensorimotor, multimodal, action-labeled data constitutes the fundamental constraint for training foundational physical AI models. Unlike large language models that ingest symbolic internet text, embodied AI requires data that aligns visual observations with specific motor commands and outcomes. According to Catalant consultant Byron Winn, this fuel for physical AI does not exist on the open internet. Consequently, organizations must develop proprietary strategies to harvest expensive, high-fidelity experience data to move intelligence from digital brains to physical robotic bodies.

How do Vision-Language-Action (VLA) models redefine robotic motor command execution?

VLA models act as the translation layer that maps high-level language instructions and visual inputs directly into precise robotic actions. These systems allow robots to understand complex commands, such as tightening specific interior bolts, by pretraining on massive demonstration datasets like Open X-Embodiment. While VLA models retain the benefits of large-scale vision-language pretraining, their scalability remains hindered by a shortage of diverse robot trajectories. Bridging this gap is essential for developing robots that move with human-like nuance.

Why is human-centric video capture considered the “gold standard” for training future autonomous systems?

Point-of-view video of veteran technicians captures the expert physical actions and situational context that simulations often fail to replicate. By recording high-stakes tasks in aerospace or medical device assembly, firms create a high-fidelity seed for physical AI training. When this video is pre-labeled by AI into task-sized segments, it becomes searchable operational knowledge. This approach bypasses the labor-intensive limitations of teleoperation and the “sim-to-real gap” inherent in digital twins, providing a scalable pathway to zero-shot robot transfer.

How can documentation of blind-cavity assembly tasks provide immediate ROI before robots are deployed?

Labeling physical data for complex assemblies reduces out-of-station rework and non-conformance investigations, saving millions in annual operational costs. In environments like aerospace final assembly, having a record of how a fastener was installed in a blind cavity mitigates high-consequence safety risks and reduces reliance on post-mortem analysis. This data improves real-time variance reduction and compliance. This “no regrets” move secures current operations and slashes the cost of quality while simultaneously building the proprietary datasets required for future autonomous agents.

What criteria should executives use to prioritize pilot programs for physical AI data collection?

Manufacturers should score activities based on value at stake, variance in outcomes, and automation adjacency to ensure a high probability of success. Select processes that are difficult to observe but put significant economic volume at risk, such as confined-space welding or dash structure wiring. Organizations should prioritize tasks where systemic deviations can be cost-effectively corrected through training or process redesign. By focusing on pilots with a 90-to-180-day time-to-value, leaders can prove the economic defensibility of capturing shop floor intelligence.

Measuring AI: Hours, Tokens, Dollars, and Results

Why Your Organization Can’t Embrace AI Yet

The Expertise Ceiling: How Leading LLMs Perform on Specialized Consulting Work