Choosing the Right Pilot: The Three Selection Criteria
The most common pilot design mistake is choosing a use case that is either too ambitious or too invisible. Too ambitious means the stakes are high, the scope is large, and failure has real consequences — this is the wrong environment to prove a new technology. Too invisible means the results are difficult to attribute to the AI, the business impact is minimal, and success doesn't create momentum for the next initiative.
The right pilot sits at the intersection of three criteria:
1. Small scope, definable boundaries
The pilot should involve one team, one use case, and one tool — not a department, not multiple workflows, not three tools in competition. Small scope means faster results, fewer variables, and a cleaner causal story when you report results. "We used this tool for meeting summaries in the product team for 30 days" is a story you can tell clearly. "We piloted AI across operations" is not.
2. High visibility, meaningful outcome
Small scope doesn't mean invisible outcome. The task you choose should produce results that are noticed and valued by your organization. If you save 10 hours per week of work that nobody was waiting for, the success doesn't build momentum. If you reduce report turnaround from 3 days to same-day, people notice.
3. Measurable before and after
You need to be able to measure the current state before the pilot begins. If you can't measure it now, you won't be able to prove improvement later. Time is the easiest metric to establish: how long does this task take today? If you don't know, spend week one measuring it before turning the AI on.
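If no measurement exists yet, a shared log that everyone fills in during that first week is enough; the goal is simply minutes per task before the AI is introduced. Below is a minimal sketch of how you might average such a log, assuming a hypothetical baseline_log.csv with one row per completed task:

```python
import csv
from statistics import mean

# Hypothetical log format, one row per completed task during the baseline week:
#   person,task,minutes
#   dana,weekly report,95
def baseline_minutes(path="baseline_log.csv"):
    with open(path, newline="") as f:
        times = [float(row["minutes"]) for row in csv.DictReader(f)]
    return {"completions": len(times), "avg_minutes_per_task": round(mean(times), 1)}

print(baseline_minutes())  # e.g. {'completions': 14, 'avg_minutes_per_task': 92.5}
```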
The 30-Day Pilot Playbook
Week 1 — Days 1-7
Baseline and Setup
- Measure current state: time spent, error rate, and output volume for your target task
- Select and onboard your pilot team (2-5 people is ideal for a first pilot)
- Complete vendor setup: access provisioned, security review done, data handling confirmed
- Run a 2-hour training session: the AI tool itself plus your specific use case prompts
- Set week 1 goal: everyone on the pilot team completes the target task once using the AI, then documents their experience in writing
Week 2 — Days 8-14
Calibration
- Run daily standup check-ins (10 minutes) — what's working, what's not, what questions came up
- Collect all outputs from week 1 and do a quality review: what did the AI get right? What required significant editing?
- Refine your prompts and instructions based on what you learned in week 1
- Document the top 3 patterns where AI performs well and the top 3 where human review caught errors
- Begin time-tracking in parallel with the AI use — measure actual time per task
Week 3 — Days 15-21
Production Run
- Remove the extra check-ins — team runs on the AI tool independently
- Continue time-tracking for all target tasks
- Run a mid-point quality audit: sample 20% of AI outputs and score for accuracy and completeness (a sampling sketch follows this list)
- Identify any workflow integration improvements that would reduce friction
- Collect informal feedback from the pilot team: would you keep using this? Would you recommend it?
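For the mid-point audit above, drawing the 20% sample at random keeps the review honest; hand-picking outputs to score tends to flatter the tool. A minimal sketch, assuming you keep a simple list of output identifiers (the names below are made up):

```python
import random

def audit_sample(output_ids, fraction=0.20, seed=30):
    # Reproducible ~20% sample of pilot outputs for manual accuracy/completeness scoring
    k = max(1, round(len(output_ids) * fraction))
    return sorted(random.Random(seed).sample(output_ids, k))

# Example: 40 meeting summaries produced so far, so audit 8 of them
print(audit_sample([f"summary-{i:03d}" for i in range(1, 41)]))
```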
Week 4 — Days 22-30
Measurement and Decision
- Complete final time-tracking data collection for all pilot participants
- Calculate actual time savings vs. baseline (Week 1 data); a worked calculation follows this list
- Final quality audit: compare error rates, revision requirements, and output quality to baseline
- Survey pilot team with three questions: usefulness, ease of use, and likelihood to continue
- Write a 1-page pilot results summary for leadership: what we tested, what we measured, what we found, our recommendation
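The time-savings calculation flagged in the list above is just the Week 1 average against the Week 3-4 average for the same task. A minimal sketch with hypothetical per-task minutes:

```python
from statistics import mean

def time_savings(baseline_minutes, pilot_minutes):
    # Compare Week 1 (pre-AI) task times with Week 3-4 (with-AI) times for the same task
    before, after = mean(baseline_minutes), mean(pilot_minutes)
    return {
        "baseline_avg_min": round(before, 1),
        "pilot_avg_min": round(after, 1),
        "savings_pct": round(100 * (before - after) / before, 1),
    }

# Hypothetical numbers: minutes per report draft
print(time_savings(baseline_minutes=[95, 110, 88, 102], pilot_minutes=[60, 55, 72, 58]))
# -> {'baseline_avg_min': 98.8, 'pilot_avg_min': 61.2, 'savings_pct': 38.0}
```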
The 5 Metrics That Actually Matter
1. Time per task
The most credible metric. Measure average time to complete the target task before and after. Requires week 1 baseline measurement.
2. First-pass quality
Percentage of AI outputs that required no or minimal revision before use. Measures accuracy + usefulness in one number. Track over 30 days to see the trend.
3. Adoption rate
What percentage of target task completions used the AI tool vs. the old way? Less than 70% usage signals an adoption problem, not a tool problem.
4. Error rate
How often did AI outputs contain significant errors that were caught before use? Track the rate, the types of errors, and whether the rate improved over the 30 days.
5. Team NPS
One question, scored 0-10: "How likely are you to keep using this tool?" 9-10 = Promoter, 7-8 = Passive, 0-6 = Detractor. Calculate NPS = % Promoters - % Detractors. Target: above +20 for a successful pilot.
Common Failure Modes (and How to Avoid Them)
Failure Mode 1: The pilot runs without a baseline
You start the pilot without measuring the current state first. At the end, you can't prove improvement because you don't know what "before" looked like. Prevention: spend the first 3 days of Week 1 measuring baseline before you turn the AI on.
Failure Mode 2: The pilot team is all enthusiasts
You staffed the pilot with your most tech-enthusiastic team members. They love the tool. But when you expand to the rest of the team, adoption collapses because the pilot didn't surface the real resistance. Prevention: include at least one skeptic in your pilot team. Their pushback makes the eventual rollout stronger.
Failure Mode 3: The tool is evaluated in isolation
The pilot team uses the AI tool in a bubble, but it doesn't integrate with the systems and workflows they use for everything else. The task that takes 2 hours with the old workflow takes 3 hours with the new one because of switching costs. Prevention: map the full workflow before the pilot, including the steps before and after the AI task.
Failure Mode 4: Success is declared before it's earned
The week 2 check-in goes well, the team seems to like the tool, and you go to leadership with a success story before you have data. Then the week 4 measurement is disappointing and you have already spent your credibility. Prevention: share findings only after you have 30 days of data. "We're encouraged by early signals" is not a pilot result.
Day 4 Exercise
Design a 30-Day AI Pilot for Your Team
Using the pilot selection criteria and playbook above, design a complete pilot you could launch with your next budget approval. Work through these four decisions:
- Choose the use case. Apply the three selection criteria: small scope, high visibility, measurable. Write one sentence for each criterion explaining why your chosen use case qualifies.
- Identify your pilot team. Name 3-5 specific people. Confirm that at least one is a skeptic. Note what training they will need.
- Define your baseline measurement. Specifically: what will you measure, how will you measure it, and who will collect the data in Week 1 before the AI is turned on?
- Set your go/no-go criteria. "We will recommend expanding this pilot to the full team if: [specific metric] is [specific threshold] at the end of 30 days." Be specific enough that there is no ambiguity about what success means.
The output of this exercise is a pilot design document that, combined with the Day 3 business case, gives you everything you need to get started.
Key Takeaways from Day 4
- The right pilot is: small scope with clear boundaries, high-visibility meaningful outcome, and measurable before you start.
- The 30-day playbook: Week 1 baseline + setup, Week 2 calibration, Week 3 production run, Week 4 measurement and decision.
- The 5 metrics that matter: time per task, first-pass quality, adoption rate, error rate, and team NPS.
- The 4 failure modes: no baseline, enthusiast-only team, tool in isolation, premature success declaration. Know them before you start.
- You now have a complete pilot design. With the Day 3 business case, you have the full package to launch.
Run your first pilot with expert guidance
Our bootcamp includes a full pilot design session — we help you design, scope, and structure a pilot you can launch within 30 days of the course.
Reserve Your Seat →