Autonomous laboratory research requires more than plausible protocol text. In real-world biology, scientific intent must be tied to physical constraints, hardware-level validity, and instrument feedback. A team from the Shanghai AI Lab and Genoria AI has introduced ProtoPilot, a multi-agent system designed to close the gap between intent and execution. Led by Yankai Jiang and Meng Yang, the researchers point out that an idea only becomes an experiment when it is translated into specific reagents, volumes, and manipulations that won't "brick" expensive equipment. The path from a natural-language goal to executable code is fragile, because a draft protocol is only an intermediate step, not a finished product.
Layered Verification and Skill Libraries
ProtoPilot addresses the complexity of the "wet lab" by implementing layered verification. The system doesn't just spit out a block of code; it generates a protocol, expands Standard Operating Procedures (SOPs), and synthesizes SDK-compatible commands that account for specific labware and incubation conditions. The Shanghai AI Lab report explains that different system levels eliminate different types of uncertainty: protocol logic captures the biological essence, while SOPs ground that logic in concrete volumes and concentrations. An updatable skill library allows the agent to adjust its actions based on real-world feedback. This is a self-evolution mechanism: instead of endless "hallucinations in a test tube," the model learns from its own mistakes in a physical environment.
ProtoPilot combines layered verification, agent orchestration, and a real-time updatable skill library to generate protocols, expand SOPs, and refine workflows based on laboratory data.
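The layering described above can be pictured with a minimal sketch. This is not ProtoPilot's implementation: the `Step` type, the three check functions, the pipette range, and the 96-well plate are all hypothetical names and values chosen for illustration. It only shows how each layer can rule out a different kind of error before anything reaches the hardware.

```python
from dataclasses import dataclass


@dataclass
class Step:
    reagent: str
    volume_ul: float  # transfer volume in microliters
    dest_well: str    # destination well, e.g. "A1"


# Hypothetical hardware limits, chosen only for illustration.
PIPETTE_RANGE_UL = (1.0, 20.0)
PLATE_WELLS = {f"{row}{col}" for row in "ABCDEFGH" for col in range(1, 13)}


def check_protocol_logic(steps, required_reagents):
    """Layer 1: does the plan use every reagent the experiment needs?"""
    used = {s.reagent for s in steps}
    return [f"missing reagent: {r}" for r in sorted(set(required_reagents) - used)]


def check_sop(steps):
    """Layer 2: are the concrete quantities well-formed?"""
    return [f"step {i}: volume must be positive"
            for i, s in enumerate(steps) if s.volume_ul <= 0]


def check_hardware(steps):
    """Layer 3: can the (assumed) pipette and plate carry out each step?"""
    lo, hi = PIPETTE_RANGE_UL
    issues = []
    for i, s in enumerate(steps):
        if not lo <= s.volume_ul <= hi:
            issues.append(f"step {i}: {s.volume_ul} uL outside pipette range")
        if s.dest_well not in PLATE_WELLS:
            issues.append(f"step {i}: unknown well {s.dest_well}")
    return issues


def verify(steps, required_reagents):
    """Run the layers in order; report the first layer that finds issues."""
    layers = [
        ("protocol", lambda: check_protocol_logic(steps, required_reagents)),
        ("sop", lambda: check_sop(steps)),
        ("hardware", lambda: check_hardware(steps)),
    ]
    for name, check in layers:
        issues = check()
        if issues:
            return name, issues
    return None, []
```

In this sketch a 50 uL transfer passes the protocol and SOP layers but is stopped at the hardware layer, because it exceeds the assumed 20 uL pipette limit. Running the layers in order means a hardware check is never performed on a plan that is already biologically incomplete.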
The key is the formalization of the feedback loop. If a run fails or produces anomalous data, the system doesn't blindly repeat the task; it triggers a plan revision process. This transforms the LLM from a chatty bot into a reliable controller of physical processes. In tests involving Polymerase Cycling Assembly (PCA), the system successfully used lab data to refine procedures, indicating that the gap between digital instructions and biological results can be closed iteratively.
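The revise-instead-of-retry loop can be sketched as follows. Everything here is an assumption made for illustration: `run_with_revision`, the `execute` and `revise` callables, and the toy annealing-temperature threshold are hypothetical and are not taken from ProtoPilot or from its PCA experiments.

```python
def run_with_revision(plan, execute, revise, skills, max_rounds=3):
    """Execute a plan; on failure, revise it from the observation and try again."""
    history = []
    for _ in range(max_rounds):
        ok, observation = execute(plan)
        history.append((dict(plan), ok))
        if ok:
            return plan, history
        # Do not blindly retry: derive a new plan from what was observed.
        plan = revise(plan, observation, skills)
    return None, history


# Toy stand-ins for the instrument and the planner (illustrative only).
def fake_execute(plan):
    if plan["anneal_temp_c"] >= 58:
        return True, "expected product"
    return False, "nonspecific product"


def fake_revise(plan, observation, skills):
    # Record the lesson in the skill library, then adjust the plan.
    skills.append(f"{observation} at {plan['anneal_temp_c']} C")
    return {**plan, "anneal_temp_c": plan["anneal_temp_c"] + 4}
```

Starting from `{"anneal_temp_c": 50}`, the toy loop fails twice, records both observations in `skills`, and succeeds on the third round at 58. Passing the skill list into `revise` is the design point: each failure leaves a record that later plans can draw on, rather than being discarded.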
Benchmarking Against Physical Reality
To test reliability, the researchers developed an expert benchmark of 294 tasks in synthetic and molecular biology. These tasks are based on 98 "gold standard" protocols and were evaluated against criteria set by expert biologists and against hardware validity gates. The reported results show that ProtoPilot achieved an expert preference level (Top@3) of 90.2%. However, another metric matters more for the industry: the overall "protocol to code" success rate reached 96.6%. This stands in sharp contrast to existing tools; for instance, on Opentrons hardware, the system scored 88.2%, while the specialized OpenTrons-AI stalled at 32.4%.
The framework covers 294 tasks based on reference protocols, expert assessments, and real experimental tests with hardware-level validation.
System validation included products confirmed by Sanger sequencing, providing a verified path toward truly autonomous research. By treating protocol-to-code conversion as a multi-stage automation task rather than simple translation, ProtoPilot avoids the pitfalls of one-shot translation approaches. The researchers confirmed that the system handles scientific reasoning and sample tracking even in complex molecular work while adhering to strict pipetting constraints. The bottleneck in AI science has shifted from pure reasoning to the precision of translating intent into physical action. ProtoPilot's success depends directly on infrastructure and manufacturer-specific SDKs, but the signal for tech leads is clear: effective agents in biology aren't those that write well, but those that can work with verified feedback.