An AI benchmark where agents act as food safety investigators — tracing contaminated food through a live supply chain under noisy sensors, delayed reports, and limited budgets.
True contamination is hidden. Agents only see noisy sensor readings that can spike on clean nodes and miss real sources.
Consumer illness reports arrive hours after exposure. By the time signals are obvious, contamination may already be downstream.
Lab tests and the recall budget are finite, so inspecting every node or recalling every batch is not an option.
False quarantines and unnecessary alerts damage trust. Overreaction carries a real cost, just as underreaction does.
Every observation includes a natural language summary. Drop in any LLM as an agent with our prompt template.
Episodes are scored on containment, precision, speed, and trust preservation. Runs are reproducible with seed control.
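To make the hidden-state idea concrete, here is a minimal sketch of a noisy sensor layered over a hidden ground truth, with a seeded random generator for reproducibility. The names `SensorModel` and `observe` and the two error rates are assumptions made for illustration; they are not the benchmark's actual API or parameters.

```python
import random
from dataclasses import dataclass
from typing import Dict


@dataclass
class SensorModel:
    # Both rates are illustrative assumptions, not values from the benchmark.
    false_positive_rate: float = 0.10  # clean node reads as contaminated
    false_negative_rate: float = 0.20  # contaminated node reads as clean

    def read(self, contaminated: bool, rng: random.Random) -> bool:
        """Return a noisy flag for a node whose true state is hidden."""
        if contaminated:
            # A real source is missed with probability false_negative_rate.
            return rng.random() >= self.false_negative_rate
        # A clean node spikes with probability false_positive_rate.
        return rng.random() < self.false_positive_rate


def observe(truth: Dict[str, bool], seed: int,
            sensor: SensorModel = SensorModel()) -> Dict[str, bool]:
    """Produce one noisy observation per node; the same seed gives the same readings."""
    rng = random.Random(seed)
    return {node: sensor.read(flag, rng) for node, flag in truth.items()}


if __name__ == "__main__":
    hidden = {"farm_a": True, "warehouse_b": False, "retail_c": False}
    print(observe(hidden, seed=42))
```

The agent would only ever see the output of `observe`, never `hidden`, which is why a single reading is weak evidence and a lab test is worth its cost.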
Start an episode. The supply chain spawns with hidden contamination sources and batches already moving downstream.
Choose from 7 actions — INSPECT, QUARANTINE, LIFT, RECALL, TRACE, ALERT, or WAIT — each with real consequences.
Find the source, block the spread, recall contaminated batches. Your score reflects containment, precision, speed, and trust.
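The text names four scoring components but does not say how they are combined. The sketch below assumes a simple weighted average with equal weights, and assumes each component is normalized to the range 0 to 1. `EpisodeStats`, `episode_score`, and every field name are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class EpisodeStats:
    # All fields are hypothetical bookkeeping for this sketch.
    contaminated_batches: int   # batches that were truly contaminated
    contained_batches: int      # contaminated batches stopped in time
    interventions: int          # quarantines and recalls issued
    correct_interventions: int  # those that hit a real source or bad batch
    steps_used: int
    max_steps: int
    trust: float                # assumed to lie in [0, 1]


def episode_score(stats: EpisodeStats,
                  weights: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25)) -> float:
    """Combine containment, precision, speed, and trust (equal weights assumed)."""
    containment = (stats.contained_batches / stats.contaminated_batches
                   if stats.contaminated_batches else 1.0)
    precision = (stats.correct_interventions / stats.interventions
                 if stats.interventions else 1.0)
    speed = 1.0 - stats.steps_used / stats.max_steps
    components = (containment, precision, speed, stats.trust)
    return sum(w * c for w, c in zip(weights, components))


if __name__ == "__main__":
    stats = EpisodeStats(4, 2, 2, 1, 5, 10, 0.5)
    print(episode_score(stats))
```

Whatever the real formula is, the design point carries over: an agent that quarantines everything would score well on containment and poorly on precision and trust.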
INSPECT: Lab test a node. Returns the exact contamination result and costs 1 lab token.
QUARANTINE: Block outbound spread from a node. +4.0 if the node is a source, −2.0 if wrong.
LIFT: Remove a quarantine and restore flow. Builds trust if the node was clean.
RECALL: Remove a batch from the chain. +1.5 if the batch is contaminated, −1.0 if clean.
TRACE: Trace a batch's path upstream. Low-cost intel for finding the true source.
ALERT: Issue a retailer warning. Slows exposure but permanently reduces trust.
WAIT: Let the system evolve. Useful when waiting for lab results.
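The seven actions and the reward values quoted above can be written down as a small table. This is a sketch rather than the benchmark's own code: the `Action` enum values, the helper names, and `LabBudget` are assumptions, and "wrong" is read here as "the node is not a source".

```python
from enum import Enum


class Action(Enum):
    INSPECT = "inspect"
    QUARANTINE = "quarantine"
    LIFT = "lift"
    RECALL = "recall"
    TRACE = "trace"
    ALERT = "alert"
    WAIT = "wait"


def quarantine_reward(node_is_source: bool) -> float:
    """+4.0 for quarantining a source, -2.0 otherwise (values from the text)."""
    return 4.0 if node_is_source else -2.0


def recall_reward(batch_is_contaminated: bool) -> float:
    """+1.5 for recalling a contaminated batch, -1.0 for a clean one."""
    return 1.5 if batch_is_contaminated else -1.0


class LabBudget:
    """Hypothetical tracker for lab tokens; each inspection costs 1 token."""

    def __init__(self, tokens: int) -> None:
        self.tokens = tokens

    def spend(self) -> None:
        if self.tokens < 1:
            raise RuntimeError("no lab tokens left")
        self.tokens -= 1


if __name__ == "__main__":
    budget = LabBudget(tokens=3)
    budget.spend()  # one INSPECT
    print(budget.tokens, quarantine_reward(True), recall_reward(False))
```

The asymmetry matters for strategy: under these values, one correct quarantine offsets two wrong ones, while a wrong recall costs two thirds of what a correct one earns.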
Launch the interactive simulation — control the environment manually or let an LLM agent decide.