Replacing the broken survey dashboard with a proper, purpose-built customer service improvement tool inside command.p2ops.com
The old survey dashboard hit a hard wall. This is the right moment to build it correctly rather than patch something that was already overdue for retirement.
A pre-aggregation fix (Option 3 from the original analysis) would work technically, but it buys maybe 60 days before the problem returns, wastes Meg's build time on a system already marked for sunset, and leaves the dashboard serving a 216KB single-file page that Meg called "embarrassing for external partners." The smart move is to build it once and build it right.
The ops dashboard at command.p2ops.com is live, stable, and already where Kirk spends his time. Survey features belong there.
A new primary tab in command.p2ops.com. Not a replacement dashboard — an integrated feature that lives where operations actually happens. Kirk's framing: "a useful, motivating tool to help us improve customer service."
Wireframe — illustrative data only. Design: Edna Mode.
Clean migration from a broken flat-file system to a proper data store. Survey queries live in Pages Functions with a direct D1 binding — no Mac Mini dependency for reads.
SQLite-based, serverless, no infrastructure to manage. Three-layer schema:
mdos, drivers, surveys — write-once, UNIQUE on order_number, idempotent ingest. Survey endpoints live in functions/survey/*.js — separate from the existing API proxy. D1 binding is native; no HTTP calls to FastAPI backend required.
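A minimal sketch of what the three tables could look like. Only the table names, the UNIQUE(order_number) dedup key, and the stored score come from the plan; every other column, and the surveys-db database name, is an illustrative assumption.

```python
# Illustrative schema sketch (not the confirmed schema), applied once via:
#   wrangler d1 execute surveys-db --file=schema.sql
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS mdos (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);

CREATE TABLE IF NOT EXISTS drivers (
    id           INTEGER PRIMARY KEY,
    mdo_id       INTEGER NOT NULL REFERENCES mdos(id),
    name         TEXT NOT NULL,
    bayes_score  REAL,              -- computed at ingest, read as-is
    survey_count INTEGER NOT NULL DEFAULT 0
);

CREATE TABLE IF NOT EXISTS surveys (
    id           INTEGER PRIMARY KEY,
    order_number TEXT NOT NULL UNIQUE,  -- dedup key: re-ingest is a no-op
    driver_id    INTEGER NOT NULL REFERENCES drivers(id),
    score        REAL NOT NULL,
    comment      TEXT,
    delivered_at TEXT NOT NULL          -- ISO date, for trend queries
);
"""

with open("schema.sql", "w") as f:
    f.write(SCHEMA_SQL)
```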
Formula: (C × global_mean + n × driver_mean) / (C + n) — where C=10, global mean=4.52 (from actual data). A driver with 3 surveys barely moves the needle; 50 surveys converges to their raw mean. This prevents a driver with 2 five-star reviews from appearing above a driver with 50 consistent reviews. Recalculate global mean quarterly.
Scores are computed in Python at ingest time and stored. Dashboard reads a number, never recomputes.
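A minimal sketch of that ingest-time computation, using the formula and constants above; the function name is illustrative:

```python
C = 10              # prior weight: pseudo-surveys at the global mean
GLOBAL_MEAN = 4.52  # from the full P2 dataset; recalculated quarterly

def bayes_score(driver_mean: float, n: int) -> float:
    """Shrink a driver's raw mean toward the global mean.

    Few surveys keep the score near 4.52; around 50 surveys,
    it converges to the driver's own mean.
    """
    return (C * GLOBAL_MEAN + n * driver_mean) / (C + n)

# Two five-star reviews barely move off the prior...
print(round(bayes_score(5.0, 2), 2))   # 4.60
# ...while 50 consistent reviews dominate it.
print(round(bayes_score(5.0, 50), 2))  # 4.92
```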
Python prep script generates batch SQL files from the existing JSON, loaded via wrangler d1 execute. UNIQUE(order_number) constraint means re-running the migration is safe. Nightly processor (process_surveys.py) updated to write new records to D1 with the same dedup logic. JSON kept as parallel write for 30 days as safety net, then sunset.
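A sketch of the prep script's core. The order_number dedup key is from the plan; the other record fields (driver_id, score, comment, delivered_at) are placeholders for the real JSON shape:

```python
import json

def sql_str(v) -> str:
    # Minimal SQL string escaping for generated batch files.
    return "'" + str(v).replace("'", "''") + "'"

def write_batches(records: list[dict], batch_size: int = 100) -> None:
    """Emit batch_NNN.sql files for `wrangler d1 execute DB --file=...`.

    INSERT OR IGNORE plus UNIQUE(order_number) makes re-runs a no-op:
    rows already loaded are skipped, never duplicated.
    """
    for i in range(0, len(records), batch_size):
        stmts = [
            "INSERT OR IGNORE INTO surveys "
            "(order_number, driver_id, score, comment, delivered_at) VALUES "
            f"({sql_str(r['order_number'])}, {int(r['driver_id'])}, "
            f"{float(r['score'])}, {sql_str(r.get('comment', ''))}, "
            f"{sql_str(r['delivered_at'])});"
            for r in records[i:i + batch_size]
        ]
        with open(f"batch_{i // batch_size:03d}.sql", "w") as f:
            f.write("\n".join(stmts))

write_batches(json.load(open("survey_data.json")))
```

At 244 P2 records this yields three batch files, matching the migration estimate below.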
Two reconciliation issues before migration: (1) The master file stores carrier_normalized = "TopHat" for P2 records, while the monthly _ours files use "P2 Last Mile - Ukiah", "Two Phillips Enterprises (Hurricane)", etc. These must be reconciled or the leaderboard splits P2 records across phantom carriers. (2) The 52K-record master file covers only 43 days (March 31–May 2026); P2's full history lives in the monthly _ours files going back to May 2025. Migration must include both, or the leaderboard starts from scratch in late March.

The design problem is real: a ranking tool inherently creates winners and losers. The goal is to make the winners feel recognized and the improvers feel seen — not to create a scoreboard that demoralizes.
Kirk checks this from his phone. Mobile-first decisions: sub-nav stays as text labels with horizontal scroll (no icons — they're ambiguous on a performance tool). Leaderboard collapses to 4 columns on mobile: RANK · DRIVER · SCORE · TREND. Category heatmap hidden on mobile with a tooltip directing to desktop. Comments browser works well on mobile — card format is naturally responsive.
Same dark UI, P2 blue (#3b8def), same token system as the rest of command.p2ops.com. The Survey Performance tab slots into the existing nav as the third primary item, filling the "+" placeholder. No new design language — no visual discontinuity for the user.
Leaderboard is the first deliverable. Comments and Driver Detail follow. Trend lines wait for data accumulation. survey.p2ops.com is redirected only after the new section is verified.
| Phase | Deliverable | Estimate | Gate |
|---|---|---|---|
| Foundation | D1 schema, migration, nightly writer | 3 days | Gandalf security review |
| Tier 1 | Leaderboard, Markets, Categories | 4 days | Coach Beard leaderboard sign-off |
| Tier 2 | Driver Detail, Comments, Nav | 4 days | None blocking (builds on Tier 1) |
| QA + Sunset | Testing, redirect, cleanup | 2.5 days | Kirk confirmation before redirect |
| Trend Lines | Weekly trend views | 2 days | Deferred to August 2026+ |
Each agent went deep on their domain. These are their actual findings, not summaries of assumptions.
The 99.5% revelation: The 52K-record file is almost entirely competitor data. P2's share is 244 records across 28 drivers — across all four MDOs combined. This changes the D1 scope entirely: the database stays small, query performance is trivial, and PII exposure is minimal.
On Bayesian scoring: Formula confirmed — C=10, global mean=4.52 from the full P2 dataset. A driver needs ~50 surveys to converge to their raw mean. This is the right formula; the implementation must be validated against the existing leaderboard before cutover.
Critical migration finding: The main survey_data.json only covers March 31–May 2026 (43 days). P2's full history going back to May 2025 lives in the monthly _ours files. Migration must include both or the new leaderboard starts with 6 weeks of history instead of 12 months.
Most valuable analysis we're not doing: Score-to-pullback correlation. Join survey scores to invoice pullbacks by delivery date + MDO. If lower-scoring deliveries generate more pullbacks, that's a direct dollar value on improving customer service. The data to do this is already in-house — no new collection needed.
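A loudly hypothetical sketch of that join. The plan doesn't specify the pullback data's shape, so invoice_pullbacks.csv, its delivery_date and mdo columns, and the survey-side column names are all assumptions; only the idea (join on delivery date plus MDO, compare pullback rates by score band) is from the plan:

```python
import pandas as pd

surveys = pd.read_json("survey_data.json")        # order-level scores
pullbacks = pd.read_csv("invoice_pullbacks.csv")  # hypothetical export

merged = surveys.merge(
    pullbacks,
    on=["delivery_date", "mdo"],  # assumed join keys
    how="left",
    indicator=True,
)
merged["had_pullback"] = merged["_merge"] == "both"

# Pullback rate by score band: if low bands pull back more often,
# that's a direct dollar figure on improving customer service.
merged["score_band"] = pd.cut(merged["score"], bins=[0, 3, 4, 4.5, 5])
print(merged.groupby("score_band", observed=True)["had_pullback"].mean())
```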
D1 storage outlook: At current growth, the free tier (500MB) handles ~18 months. Budget conversation for the paid tier is a 2027 problem, not today's.
Architecture call: Pages Functions → D1 direct binding. Not through FastAPI backend. Doing reads through the backend adds 150–500ms latency, makes the Mac Mini/cloudflared a dependency for leaderboard loads, and hits rate limits not designed for interactive traffic. Survey endpoints live in functions/survey/*.js, separate from the existing catch-all proxy — no conflict with current architecture.
Migration risk is low: 244 records, clean order_number dedup key, existing processor logic unchanged. Python one-time migration script, 3 batches of 100 records. Under 5 minutes.
Top risk: CF Access status on command.p2ops.com needs confirmation before a single line of feature code is written. Driver names, scores, and customer comments sit behind this endpoint. If CF Access isn't live and validated, that's task one — everything else is blocked.
Second risk: wrangler.toml doesn't exist in command-pwa yet, and adding the D1 binding requires one. A misconfigured file breaks the existing Pages deploy, so this must be handled carefully, not rushed.
What to cut: Manual tag system → auto-classify at ingest instead (covers 80% of the use case, no write endpoint, no extra security surface). Keyword search in Comments → defer (SQLite FTS5 not available in Workers runtime). Trend Lines → defer until data is deeper.
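For the auto-classify swap, a sketch of ingest-time keyword tagging in the nightly processor; the category names and keyword lists are placeholders for the real taxonomy:

```python
# Illustrative keyword rules; real categories and terms would come
# from the actual survey taxonomy.
CATEGORY_KEYWORDS = {
    "timeliness":    ["late", "on time", "early", "window"],
    "communication": ["call", "text", "notified", "no notice"],
    "care":          ["damage", "careful", "scratched", "placement"],
    "courtesy":      ["rude", "friendly", "polite", "professional"],
}

def classify(comment: str) -> list[str]:
    """Tag a comment with every category whose keywords appear.

    Runs once at ingest: no write endpoint, no manual tagging UI,
    no added security surface.
    """
    text = comment.lower()
    return [cat for cat, words in CATEGORY_KEYWORDS.items()
            if any(w in text for w in words)] or ["uncategorized"]

print(classify("Driver was friendly but arrived late"))
# ['timeliness', 'courtesy']
```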
Gates flagged by Meg: Gandalf must see the architecture before implementation. Coach Beard must sign off if leaderboard ordering differs from the current display. Belle should see the D1 resource addition (low stakes, appropriate visibility).
Navigation: Add "SURVEY PERFORMANCE" as the third primary tab — not "Rankings." That name front-loads the wrong framing. Five sub-nav tabs inside: LEADERBOARD · CATEGORIES · MARKETS · TRENDS · COMMENTS. Driver Detail is a drill-down destination from the leaderboard, not a peer navigation item. URL-addressable, full page.
Leaderboard framing call: The core design problem is that ranking inherently creates losers. Three decisions protect the "motivating" goal: (1) the trend arrow is the visual anchor, with the rank number present but not dominant; (2) no red for low scorers — score in secondary text color only; (3) top 3 get a 1px gold left border and rank in gold text — the same treatment as the active nav tab. Clean. Consistent. Not cartoonish.
On gradient glow medals: Rejected. Cheap gamification. Not what this platform's design language is.
Comments browser: 3px danger-red left border on negative reviews — nothing else changes. The customer's words are in standard text. The stripe says "needs attention" without being alarming. Asymmetry is intentional — negative cards need to be locatable in a scroll.
Edna's flags for routing: Coach Beard required before Meg touches the leaderboard (four specific policy questions: rank visibility for drivers 4+, whether below-rank-10 divider is appropriate, whether low-performer scores should appear at all, and whether drivers know they're being rated). Hermione for HR/privacy on individual performance data. Matilda for comms strategy — if drivers become aware this tool exists (they will), that changes behavior. Answer the comms question before launch, not after.
Don't display a numeric rank. Display a tier. Three tiers: High Performer → Solid → Developing. Drivers see their tier, their score, and their trend arrow. They don't see "you are #17 of 28" — that number doesn't tell them what to do differently, it just tells them where they stand in a hierarchy. The trend arrow is what actually changes behavior: "I was in the middle and now I'm moving up" beats "I am ranked 12th" every time.
Minimum survey threshold before showing a tier: 10 confirmed surveys. Before that: "Building your baseline — not enough surveys yet for a reliable picture." Protects newer drivers and low-volume routes from unfair snapshots.
Absolute thresholds, not percentile rank. "Gold" should mean "your score is above 4.5 with at least 20 surveys" — achievable by everyone. If it means "you're in the top 20%", most of the team is structurally excluded no matter what they do. That's not motivating, that's a slow morale drain.
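Putting the last three decisions together, a sketch of the tier gate. The 10-survey floor and the above-4.5-with-20-surveys gold rule come straight from the text above; the 4.0 cutoff for "Solid" is an assumption:

```python
MIN_SURVEYS = 10  # below this, no tier is shown at all

def tier(bayes_score: float, n: int) -> str:
    """Map a stored score and survey count to a driver-facing tier."""
    if n < MIN_SURVEYS:
        return "Building your baseline"  # not enough surveys yet
    if bayes_score > 4.5 and n >= 20:
        return "High Performer"          # absolute bar: achievable by everyone
    if bayes_score >= 4.0:               # assumed cutoff, not from the plan
        return "Solid"
    return "Developing"

print(tier(4.8, 5))   # Building your baseline
print(tier(4.7, 32))  # High Performer
print(tier(4.2, 15))  # Solid
```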
Two separate use cases that need separate designs: (1) Supervisor coaching aid — full data, individual scores, trends, specific comments, flag for 1:1. (2) Driver self-service — their own score, tier, trend, and anonymized positive feedback only. Drivers almost never hear directly from people they served. When they do, it lands. Don't show drivers their negative comments — that's for supervisors to deliver in context, not for a dashboard to surface without framing.
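For the driver self-service case, a sketch of the filtering rule, reusing the tier() helper above. The 4.5 "positive" cutoff and the field names are assumptions; the rule itself (own score, tier, trend, anonymized positive feedback only, no negative comments) is from the plan:

```python
def driver_view(driver: dict, surveys: list[dict]) -> dict:
    """Build the driver-facing payload; supervisors get a fuller view."""
    positive = [
        {"comment": s["comment"], "date": s["delivered_at"]}  # no customer fields
        for s in surveys
        if s["driver_id"] == driver["id"]
        and s["score"] >= 4.5        # assumed "positive" cutoff
        and s.get("comment")
    ]
    return {
        "score": driver["bayes_score"],
        "tier": tier(driver["bayes_score"], driver["survey_count"]),
        "positive_feedback": positive,
        # Negative comments are deliberately absent: supervisors deliver
        # those in a 1:1, with context, not via a dashboard.
    }
```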
No cross-market driver ranking visible to drivers. Hurricane and Wenatchee are different routes, different customers, different volumes. Cross-market comparison belongs in management reporting only, with a disclaimer. Drivers seeing each other's market scores will generate grievances, not motivation.
Cadence: Monthly for formal review. Weekly is too reactive — one bad week creates anxiety, not insight. Supervisors get real-time access to catch a sharp decline early.
Hermione flag — Coach Beard's priority: "The Hermione flag is the one I'd move on before this tool goes live — if scores are touching compensation or scheduling in any way, she needs to see the design. Everything else can be iterative." This is his clearest directive. If there's any possibility survey scores influence pay, scheduling priority, or disciplinary action downstream, Hermione reviews the design before it ships. Non-negotiable.
What tools like this usually get wrong: Optimizing for reporting, not behavior change. Hiding the criteria (drivers don't know what it takes to move up). Deploying without explanation — Kirk's rollout message matters as much as the design. Treating recognition as a one-time leaderboard post rather than an ongoing conversation.
Kirk's framing wasn't "build a nicer leaderboard." It was "build a tool that improves customer service." That's a different goal and it shapes what we track.
These are not optional reviews. Each one is a hard gate for the relevant phase.
This plan is ready to execute the moment you say go. The team is briefed. The architecture is designed. The gates are defined. Kirk decides.
| | Full Upgrade Now (recommended) | Patch + Defer |
|---|---|---|
| Dashboard downtime | 2–3 weeks until Leaderboard is live | 48–72 hours for patch |
| Build investment | ~16 dev days over 4–6 weeks | ~3 dev days (patch), then ~16 days later |
| Total dev cost | ~16 dev days | ~19 dev days (patch + full build) |
| survey.p2ops.com | Redirected and sunset | Lives on for months |
| Data architecture | D1 — scalable, queryable, no size ceiling | Pre-aggregated JSON — same fragility |
| Score-to-pullback analysis | Enabled (Athena can run it) | Not enabled |
| Platform direction | One dashboard for all of P2 ops | Two dashboards in parallel |
Jane's recommendation: greenlight the full upgrade. The patch wastes build time and leaves the architecture broken. This is the right moment — the tool is broken, the team is available, and the build is well-scoped.
First action after greenlight: Jane briefs Gandalf with Meg's architecture doc. Coach Beard and Hermione are briefed simultaneously. Phase 1 starts when Gandalf gives the all-clear.