Brent Haskins / Applied AI
The Interface Contract Is the Product: Why AI Features Fail Before the Model Runs
Most AI product failures aren't model errors—they're contract violations between what the UI promises and what the inference pipeline delivers. Drawing from real-world retrofit patterns in physical security and production AI systems, this post explains why latency budgets, failure modes, and honest loading states matter more than accuracy scores. Written May 2026, after shipping AI features in mortgage, dashboard, and security products.
The short answer
Every AI feature I've seen fail in production, including a few I shipped myself, looked like a model problem but was actually a contract problem. The UI claimed the system would return an answer in under a second. The backend, under real load, took three seconds plus a retry. Users didn't blame the model; they blamed the product. And trust, once broken, is slow to rebuild: in my experience it takes about four correct interactions to win it back.
This isn't about accuracy scores or benchmarking. It's about what the interface promises versus what the inference pipeline can actually deliver. When you retrofit AI into an existing product—whether it's a physical security system with ONVIF cameras or a B2B dashboard with batch data—the architectural pattern you choose determines the user experience more than any prompt or fine-tuning run. Latency budgets, failure modes, and honest loading states are the real product decisions.
Key takeaways
- Define the interface contract before writing any API route: what latency is acceptable, what happens on timeout, what the user sees while waiting.
- Build failure modes before model selection. A fallback to a simpler model, a cached response, or an explicit "I don't know" is a product quality choice, not a technical detail.
- Streaming is a UX pattern, not just a protocol. If the output must be coherent before display, batch inference with a progress indicator beats partial token streaming.
- The five architectural patterns for AI retrofits—edge-only, cloud-only, edge-inference with cloud fallback, cloud-inference with edge buffering, and hybrid—each impose a different contract on the UI. Pick based on latency budget, not hype.
- Trust is earned by predictable behavior under failure. A spinning loader that never resolves is worse than a fast "Couldn't process this request. Try rephrasing?"
The unobserved constraint: latency budgets
The team that designed the retrofit architecture for physical security systems didn't start with model accuracy. They started with the latency budget: how many milliseconds can pass between a camera trigger and a UI update before security operators stop relying on the tool. That number—often 200 ms for real-time alerts, up to 5 seconds for batch analytics—dictates whether inference runs on the camera, on an edge server, or in the cloud. It also dictates whether the UI shows a live stream with overlaid bounding boxes or a delayed log entry.
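To make the budget-first reasoning concrete, here is a minimal Python sketch. It assumes you have already measured p95 latency under production load for each place inference could run. The tier names, the numbers, and the rule "prefer the richest tier that fits" are illustrative assumptions, not a published method.

```python
from typing import Optional

# Hypothetical tiers, ordered richest (and slowest) first.
TIERS_RICHEST_FIRST = ["cloud", "edge_server", "camera"]


def choose_tier(budget_ms: float, p95_ms: dict[str, float]) -> Optional[str]:
    """Return the richest tier whose measured p95 latency fits the budget, or None."""
    for tier in TIERS_RICHEST_FIRST:
        if tier in p95_ms and p95_ms[tier] <= budget_ms:
            return tier
    return None  # nothing fits: change the inference path or change the contract


# Illustrative numbers, not benchmarks: p95 latency under production load, in ms.
measured = {"camera": 60, "edge_server": 150, "cloud": 3000}

print(choose_tier(200, measured))   # real-time alert budget -> edge_server
print(choose_tier(5000, measured))  # batch analytics budget -> cloud
print(choose_tier(50, measured))    # tighter than any tier -> None
```

A `None` result is the useful signal here: it means the contract cannot be met on any available path.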
Most product engineers skip this step. They prototype with a cloud API, get sub-second results in testing, and ship. Under production load, the same API averages 3 seconds per request. The UI, built for instant response, now shows stale data or hangs. The fix isn't a better model; it's an interface that distinguishes "live inference" from "recorded inference" and sets user expectations accordingly.
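A small sketch of that distinction, assuming each result carries the timestamp of the frame or request that triggered it (`captured_at_s` is a hypothetical field) and that the live budget comes from your contract. Results that arrive inside the budget are labeled live; anything later is labeled with its age instead of being passed off as current.

```python
def freshness_label(captured_at_s: float, now_s: float, live_budget_s: float) -> str:
    """Call a result 'Live' only if it reached the UI within the live latency budget."""
    age_s = now_s - captured_at_s
    if age_s <= live_budget_s:
        return "Live"
    # Past the budget, say so instead of presenting stale output as current.
    return f"Recorded {age_s:.1f}s ago"


print(freshness_label(captured_at_s=999.85, now_s=1000.0, live_budget_s=0.2))  # Live
print(freshness_label(captured_at_s=997.0, now_s=1000.0, live_budget_s=0.2))   # Recorded 3.0s ago
```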
Honest loading and the "I don't know" state
Source 1's frameworks emphasize making AI "trustworthy and controllable for users." That's not about explainability widgets. It's about the loading state. When the backend is still computing, the UI should say something honest: "Analyzing frame 7 of 30…" or "Waiting for secondary model." The worst pattern is a generic spinner that hides the uncertainty.
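The difference between an honest loading state and a generic spinner is mostly what the backend reports while it works. A minimal sketch, assuming frame-by-frame analysis; `analyze` and `on_status` are hypothetical hooks standing in for your model call and your UI update.

```python
from typing import Callable, Sequence, TypeVar

Frame = TypeVar("Frame")
Result = TypeVar("Result")


def analyze_frames(
    frames: Sequence[Frame],
    analyze: Callable[[Frame], Result],
    on_status: Callable[[str], None],
) -> list[Result]:
    """Run analysis frame by frame, reporting specific progress instead of a bare spinner."""
    total = len(frames)
    results: list[Result] = []
    for i, frame in enumerate(frames, start=1):
        on_status(f"Analyzing frame {i} of {total}…")
        results.append(analyze(frame))
    on_status(f"Analyzed {total} frames.")
    return results


statuses: list[str] = []
out = analyze_frames(["f1", "f2", "f3"], analyze=str.upper, on_status=statuses.append)
print(statuses[1])  # Analyzing frame 2 of 3…
```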
But honest loading requires that you know your model's failure profile. Does it return low-confidence results that should be flagged? Does it time out on certain inputs? If you haven't instrumented those outcomes, you can't craft a UI that respects them. I've seen products ship a "confidence score" badge that the front end just parrots from the model, so when the model mistakenly outputs 99% confidence on garbage, the UI proudly displays the lie. Better to design a threshold, and to check it against the outcomes you've instrumented so the score actually tracks accuracy: below 80%, no answer shown.
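A minimal sketch of that gate, using the 80% figure above as the default. It assumes the confidence value has been checked against logged outcomes; as the 99%-on-garbage example shows, a threshold on an uncalibrated score filters nothing. `Prediction`, `gate`, and `render` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prediction:
    label: str
    confidence: float  # assumed calibrated against instrumented outcomes, not raw model output


def gate(pred: Prediction, threshold: float = 0.8) -> Optional[str]:
    """Return the label only when confidence clears the threshold; None means 'no answer'."""
    return pred.label if pred.confidence >= threshold else None


def render(pred: Prediction) -> str:
    """Turn a gated prediction into user-facing text, never a bare low-confidence guess."""
    answer = gate(pred)
    return answer if answer is not None else "Not confident enough to answer."


print(render(Prediction("person", 0.93)))   # person
print(render(Prediction("vehicle", 0.55)))  # Not confident enough to answer.
```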
Real architecture, real contracts
The physical security technology briefing (source 8) lays out five retrofit patterns. Each one changes what the UI can promise:
- Edge-only inference: UI gets results in <100 ms but limited to simple detections. Contract: fast, narrow.
- Cloud-only inference: UI waits 1-5 seconds but gets rich classification. Contract: slow, deep.
- Edge inference with cloud fallback: UI shows immediate light result, then replaces with deeper result when cloud responds. Contract: two-phased, requires careful transition.
- Cloud inference with edge buffering: UI shows a queue of pending results. Contract: batch-oriented, requires progress bars.
- Hybrid: UI requests edge result but cloud may override. Contract: final answer may be delayed.
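The two-phased contract is the easiest of the five to get wrong, because the UI must show one answer and then possibly swap it. Here is a minimal asyncio sketch of edge inference with cloud fallback; `edge`, `cloud`, and `show` are hypothetical callables for your detectors and UI update, and the phase labels are assumptions rather than anything from the briefing.

```python
import asyncio
from typing import Awaitable, Callable


async def two_phase(
    edge: Callable[[], Awaitable[str]],
    cloud: Callable[[], Awaitable[str]],
    show: Callable[[str, str], None],
    cloud_timeout_s: float,
) -> str:
    """Show the fast edge result at once, then replace it if the cloud answers in time."""
    cloud_task = asyncio.create_task(cloud())  # start the deep call in parallel
    light = await edge()
    show("preliminary", light)
    try:
        # The timeout is counted from the moment the preliminary result is shown.
        deep = await asyncio.wait_for(cloud_task, timeout=cloud_timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the cloud task; keep the edge result and label it honestly.
        show("final-edge-only", light)
        return light
    show("final", deep)
    return deep


# Stand-ins for real detectors, with simulated latencies.
async def fake_edge() -> str:
    await asyncio.sleep(0.01)
    return "person"


async def fake_cloud() -> str:
    await asyncio.sleep(0.05)
    return "person: delivery driver"


events: list[tuple[str, str]] = []
result = asyncio.run(
    two_phase(fake_edge, fake_cloud, lambda phase, r: events.append((phase, r)), cloud_timeout_s=1.0)
)
print(events)  # [('preliminary', 'person'), ('final', 'person: delivery driver')]
```

The explicit "final-edge-only" phase is the design choice that matters: the UI can tell the user the deeper check never arrived, rather than silently treating the light result as final.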
Choosing a pattern is a product decision. Most teams default to cloud-only because it's easiest to prototype, then struggle with latency UX. The right choice depends on your users' tolerance for delay and the cost of a false positive. For a security alert, a fast false alarm is less damaging than a delayed true detection. The UI must communicate that tradeoff.
The platform shift: from prompt to contract
Google's Antigravity platform (source 2) aims to take ideas "from a prompt to a production-ready application." That's the right aspiration, but prompts don't have contracts. A production app has timeouts, retries, loading states, and error boundaries. The platform's real value isn't generating code fast; it's codifying the interface contract so that every generated endpoint includes a latency budget, a failure mode, and a fallback.
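What codifying the contract could look like at the endpoint level: a minimal Python sketch, not a description of how Antigravity works. The `with_contract` decorator and its parameters are hypothetical; the point is that the budget and the fallback are declared next to the endpoint instead of discovered in production.

```python
import asyncio
import functools
from typing import Any, Awaitable, Callable


def with_contract(budget_s: float, fallback: Callable[..., Any]):
    """Decorate an async endpoint so it answers within budget_s or returns fallback(...).

    Assumes the fallback is cheap and synchronous (a canned message or cached value).
    """
    def decorate(fn: Callable[..., Awaitable[Any]]):
        @functools.wraps(fn)
        async def wrapper(*args: Any, **kwargs: Any) -> Any:
            try:
                return await asyncio.wait_for(fn(*args, **kwargs), timeout=budget_s)
            except Exception:
                # Timeouts and model errors both land here; log them in a real system.
                return fallback(*args, **kwargs)
        return wrapper
    return decorate


def polite_failure(query: str) -> dict:
    return {"answer": None, "message": "Couldn't process this request. Try rephrasing?"}


@with_contract(budget_s=0.05, fallback=polite_failure)
async def summarize(query: str) -> dict:
    await asyncio.sleep(1.0)  # simulate an overloaded model call
    return {"answer": query.upper(), "message": None}


print(asyncio.run(summarize("quarterly risk")))  # falls back to the polite failure message
```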
As we move toward agentic systems (source 7), contracts become even more critical. An agent that calls tools or retrieves documents introduces variable latency. The UI can't assume instant responses. It must show reasoning steps, intermediate results, and a clear "still working" indicator—or risk users refreshing and breaking the pipeline.
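A minimal asyncio sketch of a "still working" indicator for a slow tool call, assuming the call is an awaitable. `run_with_heartbeat` and `on_status` are hypothetical names; a fuller version would also forward the reasoning steps and intermediate results the paragraph mentions.

```python
import asyncio
import time
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def run_with_heartbeat(
    work: Awaitable[T],
    on_status: Callable[[str], None],
    interval_s: float = 2.0,
) -> T:
    """Await `work`, emitting a 'still working' status every interval_s until it finishes."""
    task = asyncio.ensure_future(work)
    start = time.monotonic()
    while True:
        done, _pending = await asyncio.wait({task}, timeout=interval_s)
        if done:
            return task.result()  # re-raises if the tool call failed
        on_status(f"Still working ({time.monotonic() - start:.1f}s)…")


async def slow_tool_call() -> str:
    await asyncio.sleep(0.35)  # stand-in for a retrieval step with variable latency
    return "retrieved 3 documents"


statuses: list[str] = []
result = asyncio.run(run_with_heartbeat(slow_tool_call(), statuses.append, interval_s=0.1))
print(result)
print(statuses)  # a few "Still working" lines; the exact count depends on timing
```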
What to do Monday
Before you add an AI feature to your product, write down the interface contract: the expected latency range, the acceptable error modes, and the user-facing language for each failure. Share it with a product manager and an SRE. If the contract can't be met with your current inference path, change the path or change the contract. Then build the UI. The model is the easy part.
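The contract can also live in code next to the feature, so it gets reviewed like anything else. A minimal Python sketch; the field names, the p50/p95 split, and the list of required failure modes are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical minimum set of failure modes every contract must cover.
REQUIRED_FAILURE_MODES = ("timeout", "low_confidence")


@dataclass
class InterfaceContract:
    feature: str
    p50_budget_ms: float
    p95_budget_ms: float
    # Error mode -> the exact text the user sees when it happens.
    failure_messages: dict[str, str] = field(default_factory=dict)

    def violations(self, measured_p50_ms: float, measured_p95_ms: float) -> list[str]:
        """List the ways the measured pipeline breaks this contract (empty means it holds)."""
        problems = []
        if measured_p50_ms > self.p50_budget_ms:
            problems.append(f"p50 {measured_p50_ms:.0f} ms exceeds {self.p50_budget_ms:.0f} ms")
        if measured_p95_ms > self.p95_budget_ms:
            problems.append(f"p95 {measured_p95_ms:.0f} ms exceeds {self.p95_budget_ms:.0f} ms")
        for mode in REQUIRED_FAILURE_MODES:
            if mode not in self.failure_messages:
                problems.append(f"no user-facing message for '{mode}'")
        return problems


contract = InterfaceContract(
    feature="document summary",
    p50_budget_ms=1000,
    p95_budget_ms=2500,
    failure_messages={"timeout": "Couldn't process this request. Try rephrasing?"},
)
for problem in contract.violations(measured_p50_ms=1200, measured_p95_ms=3100):
    print(problem)
```

A non-empty list is the cue to change the path or change the contract before building the UI.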
FAQ
What's an interface contract in AI products?
The interface contract is the set of assumptions the UI makes about timing, data availability, and error handling. A typical mismatch: the UI promises an instant result, but the backend has a 2-second inference window and no fallback for low-confidence predictions. That gap is where trust breaks.
How do you decide between streaming and batch for AI responses?
Start with the user's tolerance for delay and the cost of partial output. Streaming shines when users can start reading before generation finishes, as in chat or summarization. Batch is safer when the output must be fully coherent before display, like compliance checks or reports. Measure latency budgets end to end, not just model time.
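As a rough sketch, the decision can be written as a rule of thumb. The inputs and their ordering are assumptions, not a standard; the one point taken directly from the answer above is that timings should be measured end to end.

```python
def choose_delivery(must_be_coherent: bool, first_token_ms: float, budget_ms: float) -> str:
    """Rule-of-thumb choice between streaming and batch delivery.

    first_token_ms is measured end to end (network, retrieval, queueing, model),
    not just the model's own time to first token.
    """
    if must_be_coherent:
        return "batch"   # compliance checks, reports: never display partial output
    if first_token_ms <= budget_ms:
        return "stream"  # users can start reading before generation finishes
    return "batch"       # even the first token misses the budget: show progress instead


print(choose_delivery(must_be_coherent=True, first_token_ms=300, budget_ms=1000))    # batch
print(choose_delivery(must_be_coherent=False, first_token_ms=300, budget_ms=1000))   # stream
print(choose_delivery(must_be_coherent=False, first_token_ms=1500, budget_ms=1000))  # batch
```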
What's the most common failure mode in AI product engineering?
Relying on a single inference pipeline with no fallback. When the model times out, returns gibberish, or exceeds its latency budget, the UI often hangs indefinitely or shows a partial result. A good failure mode retries with a simpler model, shows a cached fallback, or says "I don't know," all of which require a contract designed before the feature ships.
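A minimal sketch of such a chain, assuming each fallback is a function that returns an answer or `None`. The step functions and the cache below are hypothetical stand-ins; in a real system each step would also enforce its own timeout.

```python
from typing import Callable, Optional, Sequence

IDK = "I don't know. Try rephrasing?"


def answer_with_fallbacks(query: str, steps: Sequence[Callable[[str], Optional[str]]]) -> str:
    """Try each step in order (e.g. primary model, simpler model, cache), else say 'I don't know'.

    Each step is assumed to enforce its own timeout and to return None
    when it has no usable answer (for example, low confidence).
    """
    for step in steps:
        try:
            result = step(query)
        except Exception:
            continue  # a timeout or model error moves on to the next option
        if result:
            return result
    return IDK


def primary_model(query: str) -> Optional[str]:
    raise TimeoutError("exceeded latency budget")  # simulated failure


def simpler_model(query: str) -> Optional[str]:
    return None  # simulated low-confidence result


cache = {"reset password": "Use the 'Forgot password' link on the sign-in page."}
chain = [primary_model, simpler_model, cache.get]

print(answer_with_fallbacks("reset password", chain))  # served from cache
print(answer_with_fallbacks("something new", chain))   # I don't know. Try rephrasing?
```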
Sources