Replacing the broken survey dashboard with a proper, purpose-built customer service improvement tool inside command.p2ops.com
The old survey dashboard hit a hard wall. This is the right moment to build it correctly rather than patch something that was already overdue for retirement.
A pre-aggregation fix (Option 3 from the original analysis) would work technically, but it buys maybe 60 days before the problem returns, wastes Meg's build time on a system already marked for sunset, and leaves the dashboard serving a 216KB single-file page that Meg called "embarrassing for external partners." The smart move is to build it once and build it right.
The ops dashboard at command.p2ops.com is live, stable, and already where Kirk spends his time. Survey features belong there.
A new primary tab in command.p2ops.com. Not a replacement dashboard — an integrated feature that lives where operations actually happens. Kirk's framing: "a useful, motivating tool to help us improve customer service."
Wireframe — illustrative data only. Design: Edna Mode.
Clean migration from a broken flat-file system to a proper data store. Survey queries live in Pages Functions with a direct D1 binding — no Mac Mini dependency for reads.
SQLite-based, serverless, no infrastructure to manage. Three-layer schema:
mdos, drivers, surveys — write-once, UNIQUE on order_number, idempotent ingest. Survey endpoints live in functions/survey/*.js — separate from the existing API proxy. D1 binding is native; no HTTP calls to FastAPI backend required.
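A minimal sketch of what the three tables could look like. Only the table names, the UNIQUE(order_number) dedup key, and the stored score come from the plan; every other column, and the surveys-db database name, is an illustrative assumption.

```python
# Illustrative schema sketch (not the confirmed schema), applied once via:
#   wrangler d1 execute surveys-db --file=schema.sql
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS mdos (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);

CREATE TABLE IF NOT EXISTS drivers (
    id           INTEGER PRIMARY KEY,
    mdo_id       INTEGER NOT NULL REFERENCES mdos(id),
    name         TEXT NOT NULL,
    bayes_score  REAL,              -- computed at ingest, read as-is
    survey_count INTEGER NOT NULL DEFAULT 0
);

CREATE TABLE IF NOT EXISTS surveys (
    id           INTEGER PRIMARY KEY,
    order_number TEXT NOT NULL UNIQUE,  -- dedup key: re-ingest is a no-op
    driver_id    INTEGER NOT NULL REFERENCES drivers(id),
    score        REAL NOT NULL,
    comment      TEXT,
    delivered_at TEXT NOT NULL          -- ISO date, for trend queries
);
"""

with open("schema.sql", "w") as f:
    f.write(SCHEMA_SQL)
```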
Formula: (C × global_mean + n × driver_mean) / (C + n) — where C=10, global mean=4.52 (from actual data). A driver with 3 surveys barely moves the needle; 50 surveys converges to their raw mean. This prevents a driver with 2 five-star reviews from appearing above a driver with 50 consistent reviews. Recalculate global mean quarterly.
Scores are computed in Python at ingest time and stored. Dashboard reads a number, never recomputes.
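A minimal sketch of that ingest-time computation, using the formula and constants above; the function name is illustrative:

```python
C = 10              # prior weight: pseudo-surveys at the global mean
GLOBAL_MEAN = 4.52  # from the full P2 dataset; recalculated quarterly

def bayes_score(driver_mean: float, n: int) -> float:
    """Shrink a driver's raw mean toward the global mean.

    Few surveys keep the score near 4.52; around 50 surveys,
    it converges to the driver's own mean.
    """
    return (C * GLOBAL_MEAN + n * driver_mean) / (C + n)

# Two five-star reviews barely move off the prior...
print(round(bayes_score(5.0, 2), 2))   # 4.60
# ...while 50 consistent reviews dominate it.
print(round(bayes_score(5.0, 50), 2))  # 4.92
```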
Python prep script generates batch SQL files from the existing JSON, loaded via wrangler d1 execute. UNIQUE(order_number) constraint means re-running the migration is safe. Nightly processor (process_surveys.py) updated to write new records to D1 with the same dedup logic. JSON kept as parallel write for 30 days as safety net, then sunset.
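A sketch of the prep script's core. The order_number dedup key is from the plan; the other record fields (driver_id, score, comment, delivered_at) are placeholders for the real JSON shape:

```python
import json

def sql_str(v) -> str:
    # Minimal SQL string escaping for generated batch files.
    return "'" + str(v).replace("'", "''") + "'"

def write_batches(records: list[dict], batch_size: int = 100) -> None:
    """Emit batch_NNN.sql files for `wrangler d1 execute DB --file=...`.

    INSERT OR IGNORE plus UNIQUE(order_number) makes re-runs a no-op:
    rows already loaded are skipped, never duplicated.
    """
    for i in range(0, len(records), batch_size):
        stmts = [
            "INSERT OR IGNORE INTO surveys "
            "(order_number, driver_id, score, comment, delivered_at) VALUES "
            f"({sql_str(r['order_number'])}, {int(r['driver_id'])}, "
            f"{float(r['score'])}, {sql_str(r.get('comment', ''))}, "
            f"{sql_str(r['delivered_at'])});"
            for r in records[i:i + batch_size]
        ]
        with open(f"batch_{i // batch_size:03d}.sql", "w") as f:
            f.write("\n".join(stmts))

write_batches(json.load(open("survey_data.json")))
```

At 244 P2 records this yields three batch files, matching the migration estimate below.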
Two reconciliation issues before migration: (1) The master file stores carrier_normalized = "TopHat" for P2 records, while the monthly _ours files use "P2 Last Mile - Ukiah", "Two Phillips Enterprises (Hurricane)", etc. These must be reconciled or the leaderboard splits P2 records across phantom carriers. (2) The 52K-record master file covers only 43 days (March 31–May 2026); P2's full history lives in the monthly _ours files going back to May 2025. Migration must include both, or the leaderboard starts from scratch in late March.

The design problem is real: a ranking tool inherently creates winners and losers. The goal is to make the winners feel recognized and the improvers feel seen — not to create a scoreboard that demoralizes.
Kirk checks this from his phone. Mobile-first decisions: sub-nav stays as text labels with horizontal scroll (no icons — they're ambiguous on a performance tool). Leaderboard collapses to 4 columns on mobile: RANK · DRIVER · SCORE · TREND. Category heatmap hidden on mobile with a tooltip directing to desktop. Comments browser works well on mobile — card format is naturally responsive.
Same dark UI, P2 blue (#3b8def), same token system as the rest of command.p2ops.com. The Survey Performance tab slots into the existing nav as the third primary item, filling the "+" placeholder. No new design language — no visual discontinuity for the user.
Leaderboard is the first deliverable. Comments and Driver Detail follow. Trend lines wait for data accumulation. survey.p2ops.com is redirected only after the new section is verified.
| Phase | Deliverable | Estimate | Gate |
|---|---|---|---|
| Foundation | D1 schema, migration, nightly writer | 3 days | Gandalf security review |
| Tier 1 | Leaderboard, Markets, Categories | 4 days | Coach Beard leaderboard sign-off |
| Tier 2 | Driver Detail, Comments, Nav | 4 days | None blocking (builds on Tier 1) |
| QA + Sunset | Testing, redirect, cleanup | 2.5 days | Kirk confirmation before redirect |
| Trend Lines | Weekly trend views | 2 days | Deferred to August 2026+ |
Each agent went deep on their domain. These are their actual findings, not summaries of assumptions.
The 99.5% revelation: The 52K-record file is almost entirely competitor data. P2's share is 244 records across 28 drivers — across all four MDOs combined. This changes the D1 scope entirely: the database stays small, query performance is trivial, and PII exposure is minimal.
On Bayesian scoring: Formula confirmed — C=10, global mean=4.52 from the full P2 dataset. A driver needs ~50 surveys to converge to their raw mean. This is the right formula; the implementation must be validated against the existing leaderboard before cutover.
Critical migration finding: The main survey_data.json only covers March 31–May 2026 (43 days). P2's full history going back to May 2025 lives in the monthly _ours files. Migration must include both or the new leaderboard starts with 6 weeks of history instead of 12 months.
Most valuable analysis we're not doing: Score-to-pullback correlation. Join survey scores to invoice pullbacks by delivery date + MDO. If lower-scoring deliveries generate more pullbacks, that's a direct dollar value on improving customer service. The data to do this is already in-house — no new collection needed.
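A loudly hypothetical sketch of that join. The plan doesn't specify the pullback data's shape, so invoice_pullbacks.csv, its delivery_date and mdo columns, and the survey-side column names are all assumptions; only the idea (join on delivery date plus MDO, compare pullback rates by score band) is from the plan:

```python
import pandas as pd

surveys = pd.read_json("survey_data.json")        # order-level scores
pullbacks = pd.read_csv("invoice_pullbacks.csv")  # hypothetical export

merged = surveys.merge(
    pullbacks,
    on=["delivery_date", "mdo"],  # assumed join keys
    how="left",
    indicator=True,
)
merged["had_pullback"] = merged["_merge"] == "both"

# Pullback rate by score band: if low bands pull back more often,
# that's a direct dollar figure on improving customer service.
merged["score_band"] = pd.cut(merged["score"], bins=[0, 3, 4, 4.5, 5])
print(merged.groupby("score_band", observed=True)["had_pullback"].mean())
```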
D1 storage outlook: At current growth, the free tier (500MB) handles ~18 months. Budget conversation for the paid tier is a 2027 problem, not today's.
Architecture call: Pages Functions → D1 direct binding. Not through FastAPI backend. Doing reads through the backend adds 150–500ms latency, makes the Mac Mini/cloudflared a dependency for leaderboard loads, and hits rate limits not designed for interactive traffic. Survey endpoints live in functions/survey/*.js, separate from the existing catch-all proxy — no conflict with current architecture.
Migration risk is low: 244 records, clean order_number dedup key, existing processor logic unchanged. Python one-time migration script, 3 batches of 100 records. Under 5 minutes.
Top risk: CF Access status on command.p2ops.com needs confirmation before a single line of feature code is written. Driver names, scores, and customer comments sit behind this endpoint. If CF Access isn't live and validated, that's task one — everything else is blocked.
Second risk: wrangler.toml doesn't exist in command-pwa yet, and adding the D1 binding requires one. A misconfigured file breaks the existing Pages deploy, so this must be handled carefully, not rushed.
What to cut: Manual tag system → auto-classify at ingest instead (covers 80% of the use case, no write endpoint, no extra security surface). Keyword search in Comments → defer (SQLite FTS5 not available in Workers runtime). Trend Lines → defer until data is deeper.
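For the auto-classify swap, a sketch of ingest-time keyword tagging in the nightly processor; the category names and keyword lists are placeholders for the real taxonomy:

```python
# Illustrative keyword rules; real categories and terms would come
# from the actual survey taxonomy.
CATEGORY_KEYWORDS = {
    "timeliness":    ["late", "on time", "early", "window"],
    "communication": ["call", "text", "notified", "no notice"],
    "care":          ["damage", "careful", "scratched", "placement"],
    "courtesy":      ["rude", "friendly", "polite", "professional"],
}

def classify(comment: str) -> list[str]:
    """Tag a comment with every category whose keywords appear.

    Runs once at ingest: no write endpoint, no manual tagging UI,
    no added security surface.
    """
    text = comment.lower()
    return [cat for cat, words in CATEGORY_KEYWORDS.items()
            if any(w in text for w in words)] or ["uncategorized"]

print(classify("Driver was friendly but arrived late"))
# ['timeliness', 'courtesy']
```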
Gates flagged by Meg: Gandalf must see the architecture before implementation. Coach Beard must sign off if leaderboard ordering differs from the current display. Belle should see the D1 resource addition (low stakes, appropriate visibility).
Navigation: Add "SURVEY PERFORMANCE" as the third primary tab — not "Rankings." That name front-loads the wrong framing. Five sub-nav tabs inside: LEADERBOARD · CATEGORIES · MARKETS · TRENDS · COMMENTS. Driver Detail is a drill-down destination from the leaderboard, not a peer navigation item. URL-addressable, full page.
Leaderboard framing call: The core design problem is that ranking inherently creates losers. Three decisions protect the "motivating" goal: (1) the trend arrow is the visual anchor, with the rank number present but not dominant; (2) no red for low scorers — score in secondary text color only; (3) top 3 get a 1px gold left border and rank in gold text — the same treatment as the active nav tab. Clean. Consistent. Not cartoonish.
On gradient glow medals: Rejected. Cheap gamification. Not what this platform's design language is.
Comments browser: 3px danger-red left border on negative reviews — nothing else changes. The customer's words are in standard text. The stripe says "needs attention" without being alarming. Asymmetry is intentional — negative cards need to be locatable in a scroll.
Edna's flags for routing: Coach Beard required before Meg touches the leaderboard (four specific policy questions: rank visibility for drivers 4+, whether below-rank-10 divider is appropriate, whether low-performer scores should appear at all, and whether drivers know they're being rated). Hermione for HR/privacy on individual performance data. Matilda for comms strategy — if drivers become aware this tool exists (they will), that changes behavior. Answer the comms question before launch, not after.
Don't display a numeric rank. Display a tier. Three tiers: High Performer → Solid → Developing. Drivers see their tier, their score, and their trend arrow. They don't see "you are #17 of 28" — that number doesn't tell them what to do differently, it just tells them where they stand in a hierarchy. The trend arrow is what actually changes behavior: "I was in the middle and now I'm moving up" beats "I am ranked 12th" every time.
Minimum survey threshold before showing a tier: 10 confirmed surveys. Before that: "Building your baseline — not enough surveys yet for a reliable picture." Protects newer drivers and low-volume routes from unfair snapshots.
Absolute thresholds, not percentile rank. "Gold" should mean "your score is above 4.5 with at least 20 surveys" — achievable by everyone. If it means "you're in the top 20%", most of the team is structurally excluded no matter what they do. That's not motivating, that's a slow morale drain.
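Putting the last three decisions together, a sketch of the tier gate. The 10-survey floor and the above-4.5-with-20-surveys gold rule come straight from the text above; the 4.0 cutoff for "Solid" is an assumption:

```python
MIN_SURVEYS = 10  # below this, no tier is shown at all

def tier(bayes_score: float, n: int) -> str:
    """Map a stored score and survey count to a driver-facing tier."""
    if n < MIN_SURVEYS:
        return "Building your baseline"  # not enough surveys yet
    if bayes_score > 4.5 and n >= 20:
        return "High Performer"          # absolute bar: achievable by everyone
    if bayes_score >= 4.0:               # assumed cutoff, not from the plan
        return "Solid"
    return "Developing"

print(tier(4.8, 5))   # Building your baseline
print(tier(4.7, 32))  # High Performer
print(tier(4.2, 15))  # Solid
```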
Two separate use cases that need separate designs: (1) Supervisor coaching aid — full data, individual scores, trends, specific comments, flag for 1:1. (2) Driver self-service — their own score, tier, trend, and anonymized positive feedback only. Drivers almost never hear directly from people they served. When they do, it lands. Don't show drivers their negative comments — that's for supervisors to deliver in context, not for a dashboard to surface without framing.
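For the driver self-service case, a sketch of the filtering rule, reusing the tier() helper above. The 4.5 "positive" cutoff and the field names are assumptions; the rule itself (own score, tier, trend, anonymized positive feedback only, no negative comments) is from the plan:

```python
def driver_view(driver: dict, surveys: list[dict]) -> dict:
    """Build the driver-facing payload; supervisors get a fuller view."""
    positive = [
        {"comment": s["comment"], "date": s["delivered_at"]}  # no customer fields
        for s in surveys
        if s["driver_id"] == driver["id"]
        and s["score"] >= 4.5        # assumed "positive" cutoff
        and s.get("comment")
    ]
    return {
        "score": driver["bayes_score"],
        "tier": tier(driver["bayes_score"], driver["survey_count"]),
        "positive_feedback": positive,
        # Negative comments are deliberately absent: supervisors deliver
        # those in a 1:1, with context, not via a dashboard.
    }
```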
No cross-market driver ranking visible to drivers. Hurricane and Wenatchee are different routes, different customers, different volumes. Cross-market comparison belongs in management reporting only, with a disclaimer. Drivers seeing each other's market scores will generate grievances, not motivation.
Cadence: Monthly for formal review. Weekly is too reactive — one bad week creates anxiety, not insight. Supervisors get real-time access to catch a sharp decline early.
Hermione flag — Coach Beard's priority: "The Hermione flag is the one I'd move on before this tool goes live — if scores are touching compensation or scheduling in any way, she needs to see the design. Everything else can be iterative." This is his clearest directive. If there's any possibility survey scores influence pay, scheduling priority, or disciplinary action downstream, Hermione reviews the design before it ships. Non-negotiable.
What tools like this usually get wrong: Optimizing for reporting, not behavior change. Hiding the criteria (drivers don't know what it takes to move up). Deploying without explanation — Kirk's rollout message matters as much as the design. Treating recognition as a one-time leaderboard post rather than an ongoing conversation.
Kirk's framing wasn't "build a nicer leaderboard." It was "build a tool that improves customer service." That's a different goal and it shapes what we track.
These are not optional reviews. Each one is a hard gate for the relevant phase.
This plan is ready to execute the moment you say go. The team is briefed. The architecture is designed. The gates are defined. Kirk decides.
| | Full Upgrade Now (recommended) | Patch + Defer |
|---|---|---|
| Dashboard downtime | 2–3 weeks until Leaderboard is live | 48–72 hours for patch |
| Build investment | ~16 dev days over 4–6 weeks | ~3 dev days (patch), then ~16 days later |
| Total dev cost | ~16 dev days | ~19 dev days (patch + full build) |
| survey.p2ops.com | Redirected and sunset | Lives on for months |
| Data architecture | D1 — scalable, queryable, no size ceiling | Pre-aggregated JSON — same fragility |
| Score-to-pullback analysis | Enabled (Athena can run it) | Not enabled |
| Platform direction | One dashboard for all of P2 ops | Two dashboards in parallel |
Jane's recommendation: greenlight the full upgrade. The patch wastes build time and leaves the architecture broken. This is the right moment — the tool is broken, the team is available, and the build is well-scoped.
First action after greenlight: Jane briefs Gandalf with Meg's architecture doc. Coach Beard and Hermione are briefed simultaneously. Phase 1 starts when Gandalf gives the all-clear.