The Scenario
Your Head of Operations is sceptical about spending time configuring Cowork projects. "Just type the prompt and get the answer," he says. "All that setup with context files and instructions is overhead." Half the team agrees with him. The other half has invested hours building detailed context files and swears the output is dramatically better.
You need data, not opinions. Today, you're going to run a controlled A/B test: the exact same task, executed twice — once with a bare Cowork session (no project, no context files, no instructions) and once with a fully configured project. You'll score both outputs using the same rubric and produce a comparison document that settles the debate with evidence.
This tutorial is deliberately structured as a scientific experiment. You define the hypothesis ("context files improve output quality"), control the variables (same prompt, same model, same task), measure the results (structured scoring rubric), and draw conclusions from data. The rigour matters — it's the difference between "I think it helps" and "here's the measured improvement across six dimensions."
What You'll Learn
By completing this A/B test, you'll be able to:
- Design a fair, controlled comparison between configured and unconfigured Cowork sessions
- Build and apply a structured scoring rubric that eliminates subjective bias
- Quantify the quality improvement from context files across multiple dimensions
- Calculate the time-to-value of context file setup (setup cost vs per-task savings)
- Present evidence-based findings to a sceptical stakeholder
Prerequisites
- Claude Desktop with Cowork enabled (Pro or Max plan)
- A configured Cowork project with context files and Project Instructions (if you completed "The Cowork Colleague Setup" tutorial, use that project)
- A complex, realistic task that you can run twice — something with enough nuance that context actually matters (a client email, a report section, a proposal draft)
- 15 minutes to design a fair scoring rubric before you start
For this test to be valid, you must not prime either session. Run the bare session first, completely cold — no preamble, no "by the way, I prefer British English," no context whatsoever. If you leak context into the bare session, the comparison is contaminated.
Step 1: Design the Test Task
Choose a task that meets all of these criteria:
- Complex enough that context matters — a simple "define this term" will produce identical results regardless of context
- Has measurable quality dimensions — tone, accuracy, formatting, terminology, specificity
- Is representative of your actual work — so the findings apply to real use cases
- Can be run twice without the second run being influenced by the first
Good test tasks:
- Draft a client proposal for a specific service
- Write a board-level executive summary of last quarter's performance
- Create an internal memo about a policy change
- Produce a project status update for a cross-functional stakeholder
Bad test tasks (avoid these):
- "Define the term 'ROI'" — too simple, context adds nothing
- "Summarise this PDF" — summarisation quality depends more on the document than on context
- "Write a haiku about productivity" — creative tasks are too subjective to score consistently
- "Calculate the sum of these numbers" — computational tasks are context-independent
Write your test task as a single prompt. Keep it identical for both runs. Example:
Draft an email to our largest client informing them that we're increasing service fees by 8% effective next quarter. Explain the rationale (increased operational costs, expanded service scope), emphasise the value they receive, and propose a call to discuss.
Save this prompt in a file. You'll copy-paste it verbatim into both sessions.
Checkpoint: You have a test task prompt saved, and it meets all four criteria (complex, measurable, representative, repeatable).
Step 2: Build the Scoring Rubric
Before running either test, define how you'll score the outputs. This prevents post-hoc rationalisation — you decide what matters before you see the results.
Create a scoring sheet with these dimensions (adjust to your context):
| Dimension | Weight | Score (1-5) | Notes |
|---|---|---|---|
| Tone accuracy — Does it match the expected formality and warmth for this communication type? | 20% | | |
| Brand voice — Does it sound like your company, or like generic AI output? | 20% | | |
| Formatting — Correct heading structure, paragraph length, naming conventions? | 15% | | |
| Terminology — Uses correct internal/industry terms, avoids banned vocabulary? | 15% | | |
| Recipient awareness — Addresses the recipient appropriately, references the relationship correctly? | 15% | | |
| Completeness — Covers all required elements without adding fabricated content? | 15% | | |
| Weighted Total | 100% | | |
Define what each score means:
- 5 — Indistinguishable from what a skilled human team member would produce
- 4 — Minor adjustments needed (one or two word changes)
- 3 — Usable but requires noticeable editing
- 2 — Structure is right but tone, terminology, or format is clearly off
- 1 — Would need to be rewritten from scratch
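If you want the rubric in a machine-checkable form, here is a minimal sketch of it as Python data. The dimension names and weights mirror the table above; the assertion simply catches a rubric whose weights don't sum to 100%.

```python
# The scoring rubric from the table above, expressed as data.
# Adjust the dimensions and weights to your own context.
RUBRIC_WEIGHTS = {
    "Tone accuracy": 0.20,
    "Brand voice": 0.20,
    "Formatting": 0.15,
    "Terminology": 0.15,
    "Recipient awareness": 0.15,
    "Completeness": 0.15,
}

# Guard against a rubric whose weights don't sum to 100%.
assert abs(sum(RUBRIC_WEIGHTS.values()) - 1.0) < 1e-9, "weights must sum to 100%"
```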
Checkpoint: Your scoring rubric is written with 6 dimensions, weights, a 1-5 scale, and clear definitions for each score level.
Step 3: Run Test A — No Context (Bare Session)
Open a completely fresh Cowork session. Don't use any project — start from the default, unconfigured state. Don't provide any preamble, background, or preferences. Just paste your test prompt exactly as written.
When Cowork produces its output, save it immediately as test-a-bare.md. Don't read it carefully yet — you don't want your reaction to the first output to influence how you run the second test.
Note: resist the temptation to add "by the way, use British English" or "keep it formal." The point of Test A is to see what Cowork produces when it knows nothing about you.
When you read the output later, watch for these common bare-session patterns:
- American English defaults: "organization" instead of "organisation," dates in MM/DD/YYYY format
- Generic opener: "I hope this email finds you well" or "Dear Sir/Madam" — phrases your company may never actually use
- Corporate filler: "We value your continued partnership" — clichés that pad the word count without adding meaning
- Wrong formality level: Either too stiff or too casual for the context, because the model has no reference for your standards
- Fabricated specifics: Plausible but fictional details to fill knowledge gaps — project names, dates, or figures it has no source for
These aren't errors. They're the natural result of a system operating without context. Documenting them precisely is what makes your A/B comparison persuasive.
Checkpoint: Test A output is saved. You didn't provide any context beyond the raw prompt.
Step 4: Run Test B — Full Context (Configured Project)
Switch to your configured project — the one with about-me.md, brand-voice.md, Project Instructions, and Memory entries. Paste the exact same prompt.
Save the output as test-b-context.md.
If you don't have a configured project, you can create a minimal one for this test. Upload a brand-voice.md with your company's writing standards, add Project Instructions with your formatting rules, and add a few Memory entries with key facts (client names, your role, your company's communication style). Even a 15-minute setup will produce a measurable difference.
Checkpoint: Test B output is saved. The same prompt was used in a fully configured project.
Step 5: Score Both Outputs
Now read both outputs carefully. For each one, fill in the scoring rubric independently — score Test A completely before looking at Test B's scores.
Scoring Test A (bare session):
Read test-a-bare.md and score each dimension. Common observations for bare sessions:
- Tone is generically professional (no personality)
- Uses American English by default (if you're in a British English organisation)
- Addresses the recipient formally ("Dear Mr. Smith") when your company uses first names
- Uses corporate clichés ("we value your partnership") instead of authentic language
- Format follows a generic template, not your company's standards
Scoring Test B (configured project):
Read test-b-context.md and score each dimension. Common observations for configured sessions:
- Tone matches the brand voice file
- Correct spelling convention applied automatically
- Recipient addressed correctly (using name from Memory)
- Company-specific terminology used naturally
- Format follows the rules from Project Instructions
Calculate the weighted total for each.
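The weighted total is just the sum of each score multiplied by its weight. Here is a minimal sketch, reusing the rubric weights from Step 2; the scores below are hypothetical placeholders, not real results:

```python
# Weighted totals for both tests. The scores are hypothetical
# placeholders; replace them with the values from your scoring sheets.
RUBRIC_WEIGHTS = {
    "Tone accuracy": 0.20,
    "Brand voice": 0.20,
    "Formatting": 0.15,
    "Terminology": 0.15,
    "Recipient awareness": 0.15,
    "Completeness": 0.15,
}

test_a = {"Tone accuracy": 3, "Brand voice": 2, "Formatting": 3,
          "Terminology": 2, "Recipient awareness": 2, "Completeness": 4}
test_b = {"Tone accuracy": 5, "Brand voice": 4, "Formatting": 5,
          "Terminology": 4, "Recipient awareness": 5, "Completeness": 4}

def weighted_total(scores: dict[str, int]) -> float:
    """Sum of (score × weight) across all six dimensions, on the 1-5 scale."""
    return sum(scores[dim] * w for dim, w in RUBRIC_WEIGHTS.items())

print(f"Test A (bare):    {weighted_total(test_a):.2f} / 5")
print(f"Test B (context): {weighted_total(test_b):.2f} / 5")
```

With these placeholder scores, Test A totals 2.65/5 and Test B totals 4.50/5; your real numbers will differ.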
Scoring Discipline
Two rules to keep the scoring honest:
- Score independently. Fill in every cell for Test A before you open the Test B scoring sheet. If you score both simultaneously, your brain will anchor one against the other rather than evaluating each on its own merits.
- Use the full scale. If most scores cluster at 3 or 4, your rubric is too vague. A score of 5 should mean "I'd send this to a client without any changes." A score of 1 should mean "I'd need to rewrite this from scratch." If your outputs genuinely fall in a narrow range, that's a valid finding — but check your rubric definitions first.
If you're unsure about a score, read the specific passage out loud. AI-generated text that looks fine on screen often reveals its generic quality when spoken. If it sounds like something anyone at any company could have written, that's a 2 on brand voice, not a 4.
Checkpoint: Both outputs scored on all six dimensions. Weighted totals calculated.
Step 6: Analyse the Differences
Create your comparison document. Open a new file called context-ab-test-results.md and structure it:
Summary Table
| Dimension | Test A (Bare) | Test B (Context) | Difference |
|---|---|---|---|
| Tone accuracy | /5 | /5 | |
| Brand voice | /5 | /5 | |
| Formatting | /5 | /5 | |
| Terminology | /5 | /5 | |
| Recipient awareness | /5 | /5 | |
| Completeness | /5 | /5 | |
| Weighted Total | /5 | /5 | |
Side-by-Side Excerpts
For each dimension where the scores differed, include a direct quote from each output showing the difference. For example:
Tone accuracy:
- Test A: "We would like to inform you that effective Q3 2026, service fees will be adjusted upward by 8%."
- Test B: "Sarah, I wanted to give you a heads-up that we'll be adjusting our fees from next quarter — here's why, and what it means for the work we're doing together."
These excerpts make the difference tangible. Abstract scores are less persuasive than seeing the actual words side by side.
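If you scored in code as sketched in Step 5, a few more lines will render the summary table with the Difference column computed. This uses the same hypothetical placeholder scores as before:

```python
# Render the Step 6 summary table as markdown, with differences computed.
# Same hypothetical placeholder scores as the Step 5 sketch.
weights = {"Tone accuracy": 0.20, "Brand voice": 0.20, "Formatting": 0.15,
           "Terminology": 0.15, "Recipient awareness": 0.15, "Completeness": 0.15}
test_a = {"Tone accuracy": 3, "Brand voice": 2, "Formatting": 3,
          "Terminology": 2, "Recipient awareness": 2, "Completeness": 4}
test_b = {"Tone accuracy": 5, "Brand voice": 4, "Formatting": 5,
          "Terminology": 4, "Recipient awareness": 5, "Completeness": 4}

print("| Dimension | Test A (Bare) | Test B (Context) | Difference |")
print("|---|---|---|---|")
for dim in weights:
    diff = test_b[dim] - test_a[dim]
    print(f"| {dim} | {test_a[dim]}/5 | {test_b[dim]}/5 | {diff:+d} |")

total_a = sum(test_a[d] * w for d, w in weights.items())
total_b = sum(test_b[d] * w for d, w in weights.items())
print(f"| Weighted Total | {total_a:.2f}/5 | {total_b:.2f}/5 | {total_b - total_a:+.2f} |")
```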
Checkpoint: Your comparison document includes the summary table and side-by-side excerpts for every dimension where scores differed.
Step 7: Quantify the Time Impact
The sceptic's argument is that context file setup is "overhead." Counter this by calculating the time economics:
Time spent on context setup:
- How long did it take to write about-me.md? (One-time cost)
- How long for brand-voice.md? (One-time cost)
- How long for Project Instructions? (One-time cost)
- How long for Memory entries? (One-time cost, occasional updates)
- Total one-time setup: _____ minutes
Time saved per task:
- How many minutes of editing did Test A's output need to reach "ready to send" quality?
- How many minutes of editing did Test B's output need?
- Time saved per task: _____ minutes
Break-even calculation:
- Total setup time divided by time saved per task = number of tasks to break even
- If setup took 60 minutes and each task saves 10 minutes of editing, you break even after 6 tasks
- After break-even, every task is pure time savings
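To make the arithmetic concrete, here is a minimal sketch of the break-even calculation. The minute values are hypothetical and match the worked example above:

```python
# Break-even arithmetic for context file setup. All minute values are
# hypothetical; substitute your own measurements.
setup_minutes = 60          # one-time cost of writing the context files
edit_minutes_bare = 15      # editing to get Test A's output ready to send
edit_minutes_context = 5    # editing to get Test B's output ready to send

saved_per_task = edit_minutes_bare - edit_minutes_context
tasks_to_break_even = setup_minutes / saved_per_task
print(f"Time saved per task: {saved_per_task} min")
print(f"Break-even after {tasks_to_break_even:.0f} tasks")

# Beyond break-even, every task is pure savings.
tasks_per_week = 5
print(f"Ongoing savings: {tasks_per_week * saved_per_task} min/week "
      f"at {tasks_per_week} tasks/week")
```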
Add this analysis to your comparison document.
Most people break even within the first week of using a configured project. The compounding effect is significant — not just time savings per task, but the elimination of the cognitive load of remembering to specify preferences every single time.
Checkpoint: Your document includes a time-impact analysis with setup cost, per-task savings, and break-even calculation.
Step 8: Write the Recommendation
Conclude your document with a recommendation section aimed at the sceptics on your team:
- Headline finding: What was the overall score difference? Express it as both a number and a percentage improvement.
- Where context matters most: Which dimensions showed the largest score gap? (Usually brand voice and recipient awareness.)
- Where context matters least: Were there any dimensions where scores were equal? (Usually completeness — the bare session includes the same elements, just expressed differently.)
- The economic case: How many tasks does it take to recoup the setup investment? What are the ongoing savings?
- Recommendation: Should your team invest in context file configuration? Under what conditions? (e.g., "Every team member who uses Cowork more than 3 times per week should have a configured project.")
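For the headline finding, the percentage improvement is simply the score delta divided by the bare-session baseline. A worked example, using the hypothetical totals from the earlier sketches:

```python
# Express the overall difference as a number and a percentage.
# Totals are the hypothetical placeholders from the scoring sketch.
total_a, total_b = 2.65, 4.50
delta = total_b - total_a
pct = delta / total_a * 100
print(f"Test B scored {delta:+.2f} points higher ({pct:.0f}% improvement over bare)")
```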
Checkpoint: Your recommendation is written with specific findings, not generic advocacy.
Expected Output
Your deliverable is a documented A/B comparison:
- test-a-bare.md — output from the unconfigured session
- test-b-context.md — output from the configured project
- context-ab-test-results.md — the full comparison document with scoring table, side-by-side excerpts, time-impact analysis, and recommendation
This is evidence-based persuasion. When someone on your team says "context files are a waste of time," you hand them the comparison document and let the data speak.
Extension Challenges
- Multi-task test — Run the A/B comparison across three different task types (email, report, presentation brief). Does the context advantage hold across all task types, or is it more pronounced for some?
- Minimal context test — Run a third test (Test C) with just the Project Instructions and no Knowledge Base files. This isolates how much value comes from the instructions alone versus the reference documents. Plot the scores as A < C < B and identify the diminishing-returns point.
- Peer blind test — Give both outputs (unlabelled) to three colleagues and ask them to identify which one was written by your team versus AI. If the configured output fools more people, that's the strongest possible evidence for context file investment.