Day 2 of 5
AI for Managers

Evaluating AI Tools for Your Team

A structured 5-dimension framework for evaluating any AI tool. How to run an evaluation that actually tells you something (not just a polished vendor demo). Red flags that tell you to walk away. And a build vs buy decision matrix to use when the question comes up.

45 min read · Scorecard template · 1 exercise

Why Vendor Demos Are Designed to Deceive You (Not Maliciously)

Vendor demos are not evil. They are just optimized to show you the product at its absolute best — with curated data, prepared scenarios, and an expert operator who knows all the shortcuts. This is not deception in the malicious sense. It is marketing. The problem is that demos are structurally incapable of showing you what the product actually does for your team, on your data, in your workflows.

The evaluation framework in this lesson is designed to replace the information you didn't get from the demo. It asks the questions that separate tools that work from tools that look good in presentations.

The 5-Dimension Evaluation Framework

Every AI tool evaluation should score five dimensions. These are listed in priority order — dimension 1 is a hard requirement, dimension 2 is a near-hard requirement, and dimensions 3-5 are scored comparatively.

AI Tool Evaluation Scorecard — score each dimension 1-5

1. Security & Compliance — __/5 (hard gate, minimum: 4)
   Does the tool meet your organization's data handling requirements? SOC 2 Type II? GDPR/CCPA? Industry-specific (HIPAA, FedRAMP, etc.)? Where is data stored and processed? Does the vendor train on your data?

2. Accuracy on Your Data — __/5 (hard gate, minimum: 3)
   How well does the tool perform on examples drawn from your actual work, not vendor-curated demos? Test with 10-20 real examples. Score each output: correct, partially correct, or wrong.

3. Cost & ROI — __/5 (weight: 25%)
   Total cost of ownership including setup, training, and ongoing fees. Quantified value against your specific use case. Payback period. Compare against your best non-AI alternative (usually more staff time).

4. Integration — __/5 (weight: 15%)
   How well does it fit into existing workflows and systems? Native integrations with tools your team already uses. API availability. Implementation complexity and cost.

5. Usability — __/5 (weight: 10%)
   Will your actual team use this in practice? Give it to 2-3 representative end users for 30 minutes. Count unprompted errors and questions. Measure task completion rate.

Hard gates: a score below 4 on Security or below 3 on Accuracy means stop — regardless of how good the other dimensions look. A tool that creates compliance risk or that gets your data wrong is not worth deploying, no matter how elegant the interface.
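The scorecard logic can be sketched in a few lines. One assumption here: the lesson states explicit weights only for dimensions 3-5 (25% / 15% / 10%), so this sketch treats Security and Accuracy purely as pass/fail gates and normalizes the remaining weights; the function and dimension names are illustrative.

```python
# Hard-gated, weighted scorecard — a minimal sketch of the framework above.
HARD_GATES = {"security": 4, "accuracy": 3}   # minimum 1-5 score to proceed
WEIGHTS = {"cost_roi": 0.25, "integration": 0.15, "usability": 0.10}

def evaluate(scores: dict) -> tuple[bool, float]:
    """Return (passes_gates, comparative_score_out_of_5)."""
    for dim, minimum in HARD_GATES.items():
        if scores[dim] < minimum:
            return False, 0.0                 # stop regardless of other dimensions
    # Weighted average of the comparative dimensions, normalized so the
    # result stays on the same 1-5 scale as the individual scores.
    total_weight = sum(WEIGHTS.values())
    weighted = sum(scores[d] * w for d, w in WEIGHTS.items()) / total_weight
    return True, round(weighted, 2)

passed, score = evaluate(
    {"security": 4, "accuracy": 3, "cost_roi": 4, "integration": 3, "usability": 5}
)
```

The design choice worth noting: gates are checked before any weighting happens, which mirrors the rule above — no weighted average can rescue a tool that fails Security or Accuracy.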

How to Run a Real Evaluation (Not Just a Demo)

Structure your evaluation in three phases spanning about one week in total:

Phase 1: Vendor Qualification (Days 1-2)

Before you spend time evaluating a tool, qualify the vendor. Ask in writing, before any demo: security documentation (SOC 2 report or equivalent), a reference customer list you can call, pricing structure, and a data processing agreement. Vendors that stall, redirect, or can't provide these within 48 hours fail qualification. Move on.

Phase 2: Structured Testing (Days 3-5)

Create a test set of 15-20 real examples from your team's actual work. For each example, you already know the correct answer. Run every tool through the same test set without vendor involvement. Score outputs yourself using a simple rubric: 2 points for fully correct, 1 point for partially correct, 0 for wrong or misleading.

This is the step most evaluations skip. It is also the step that surfaces the most important information. You will almost always find that tools that performed beautifully in the vendor demo score significantly lower on real data. That is useful information.
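The Phase 2 rubric reduces to simple arithmetic. A minimal sketch, with illustrative label names (the lesson names the categories but not a scoring function):

```python
# Phase 2 rubric: 2 points fully correct, 1 partially correct, 0 wrong.
RUBRIC = {"correct": 2, "partial": 1, "wrong": 0}

def score_test_set(labels: list[str]) -> float:
    """Return a tool's result as a percentage of the maximum possible score."""
    points = sum(RUBRIC[label] for label in labels)
    return 100 * points / (2 * len(labels))

# e.g. a 15-example test set scored by hand after running one tool
results = ["correct"] * 8 + ["partial"] * 4 + ["wrong"] * 3
pct = score_test_set(results)   # ≈ 66.7% of the maximum score
```

Running every candidate tool through the same `score_test_set` call makes the comparison apples-to-apples, which is the whole point of a fixed test set.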

Phase 3: End-User Testing (Days 6-7)

Give tools that passed Phase 2 to 2-3 representative end users — people who will actually use the tool, not power users or tech enthusiasts. Give them a realistic task and observe. Count how many times they get stuck, make errors, or need help. The tool with which your end users complete the task fastest and with the fewest errors is probably the right choice, even if it scored lower on features.

Red Flags in AI Vendor Pitches

These are signals to slow down significantly or walk away:

The Build vs. Buy Decision Matrix

At some point you will face the question: should we buy an existing AI tool, or should we build something custom? This question is usually framed badly. It is almost never a pure build vs. buy choice — it is a spectrum from "buy off-the-shelf" to "buy a platform and configure" to "build custom on a foundation model." Here is how to think through it:

For each factor, note whether it points toward buying or building:

Use case specificity
  Buy: general task (writing, summarization, Q&A) — off-the-shelf tools do this well
  Build: highly specific domain knowledge required (proprietary terminology, specialized workflows)

Data sensitivity
  Buy: standard commercial data privacy is sufficient
  Build: data cannot leave your environment under any circumstances

Competitive advantage
  Buy: the AI capability itself is not a differentiator — you just need the task done
  Build: the AI capability is core to your product or competitive position

Engineering capacity
  Buy: no dedicated AI/ML engineering team available
  Build: you have or can hire engineers who can build and maintain AI systems

Speed to value
  Buy: need results in weeks
  Build: can invest 6-18 months before seeing results
The honest default: For 90% of business use cases, buy is the right answer. Building custom AI is expensive, slow, and requires ongoing maintenance your team will resent. Reserve custom builds for cases where you genuinely cannot buy what you need or where the AI capability is genuinely core to what makes you different.
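The matrix can be read as a simple tally: answer each factor "buy" or "build" and count. The factor names below follow the table; the tie-break toward "buy" is an assumption that encodes the lesson's stated default. This is a judgment aid, not a formula.

```python
# Build-vs-buy tally — a rough sketch of the decision matrix above.
FACTORS = ["use_case_specificity", "data_sensitivity", "competitive_advantage",
           "engineering_capacity", "speed_to_value"]

def build_or_buy(answers: dict) -> str:
    """Answers map each factor to "buy" or "build"; default leans toward buy."""
    build_votes = sum(1 for f in FACTORS if answers[f] == "build")
    # Only a clear majority of factors pointing to "build" overrides the default
    return "build" if build_votes > len(FACTORS) / 2 else "buy"

decision = build_or_buy({
    "use_case_specificity": "buy", "data_sensitivity": "build",
    "competitive_advantage": "buy", "engineering_capacity": "buy",
    "speed_to_value": "buy",
})   # → "buy"
```

One factor pointing toward build (here, data sensitivity) is not enough on its own — though in practice a truly absolute data constraint may deserve veto power rather than one vote.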
Day 2 Exercise

Evaluate 2 AI Tools for Your Team Using the Framework

Take 2 AI tools — either tools you are currently considering or two of the most commonly discussed tools in your industry. Apply the 5-dimension framework to each:

  1. Run the security check first. Look up each tool's security documentation (search the vendor name + "SOC 2" or "security whitepaper"). Note whether they meet your data requirements. If one fails, eliminate it immediately.
  2. Create a 5-question test based on tasks your team actually does. Run each tool through the same 5 questions. Score outputs: correct (2), partial (1), wrong (0). Calculate a score out of 10.
  3. Compare total cost. Find the per-seat annual cost for each tool. Calculate total cost for your team size.
  4. Score dimensions 4 and 5 based on what you can learn from documentation and a brief trial.
  5. Write one sentence summarizing your recommendation: "I recommend [tool] over [tool] because [primary reason], with the caveat that [biggest concern]."

Key Takeaways from Day 2

Need help evaluating AI tools for your team?

Our bootcamp includes live tool evaluation sessions — we walk through the framework together with real tools in real time. Five cities. $1,490 per seat.

Reserve Your Seat →