Domain 2 · Task Statement 2.4

Batch Processing & Data Transformation

TL;DR

Design batch processing pipelines in Cowork — from triggering parallel sub-agents across mixed file types to extracting structured data, cleaning inconsistent formats, and delivering polished output files.

What You Need to Know

Batch processing is where Cowork stops being a convenience and becomes a competitive advantage. Point at a folder of 30 invoices, 50 receipt photos, or 200 CSVs and say "extract, clean, compile" — that single operation saves entire afternoons. But doing it well requires pipeline thinking, not hopeful prompting.

The Three-Stage Pipeline

Every successful batch operation follows the same structure, whether you're renaming five files or extracting data from five hundred. Three stages:

  1. Source — the files in your folder (what you're working with)
  2. Transform — what happens to each file (extraction, conversion, cleaning)
  3. Output — where and how the results land (single spreadsheet, renamed files, formatted report)

The best prompts describe all three stages explicitly. Compare:

You need to process a folder of CSV files into standardised Excel workbooks.

Before

Process the files in this folder.

After

For each CSV file in this folder, convert it to Excel format. Add a header row if missing, standardise the date column to YYYY-MM-DD, and remove any rows where the Amount column is blank. Save each converted file with the same name but .xlsx extension.

The first prompt tells Claude nothing about the transform or output stages. It'll guess — and different sub-agents will guess differently, producing inconsistent results across your files. The second prompt is a complete pipeline specification: source (CSV files), transform (add headers, standardise dates, remove blanks), output (.xlsx with matching names).
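Cowork handles all of this from the prompt alone, but the transform stage is easy to picture as ordinary code. A minimal Python sketch of what that specification asks for, assuming a simple CSV with `Date` and `Amount` columns and two possible date layouts (the sample data and function name are illustrative, not Cowork internals):

```python
import csv
import io
from datetime import datetime

def transform(csv_text: str) -> list[dict]:
    """Transform stage: standardise dates to YYYY-MM-DD, drop blank-Amount rows."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if not row.get("Amount", "").strip():
            continue  # remove any row where the Amount column is blank
        # Try a couple of common date layouts; an assumption for this sketch
        for fmt in ("%d/%m/%Y", "%Y-%m-%d"):
            try:
                row["Date"] = datetime.strptime(row["Date"], fmt).strftime("%Y-%m-%d")
                break
            except ValueError:
                continue
        cleaned.append(row)
    return cleaned

sample = "Date,Amount\n03/01/2025,120.50\n04/01/2025,\n2025-01-05,80.00\n"
print(transform(sample))
```

Note that every rule in the code above came straight from the prompt text. That is the point: a complete pipeline specification reads like pseudocode for the transform you want.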

How Batch Parallelisation Works

When Cowork encounters a batch operation — 20 invoices to process, say — it doesn't plod through them one at a time. It spawns sub-agents: parallel workers that each handle a portion of the batch. Those 20 invoices might become five sub-agents processing four each, all running simultaneously. What would take 25 minutes sequentially finishes in under five.

The trigger is natural language, not special syntax. Claude picks up on patterns that imply iteration across a collection:

  • "For each receipt image..." — the classic parallelisation signal
  • "Process all the PDFs..." — batch language that triggers parallel execution
  • "Analyse every spreadsheet..." — "every" signals completeness across a set
  • "Do this in parallel" — removes all ambiguity

No /parallel command exists. No special keyword you must type exactly. Claude infers the parallelisation opportunity from the shape of your request. If the work is naturally decomposable — the same operation applied independently to multiple files — it'll typically parallelise without being told.
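There is no Cowork API to call here, but the fan-out shape is the same one you would write by hand with a worker pool. A sketch, with `process_file` standing in for one sub-agent's work on one file (the file names and worker count are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

files = [f"invoice-{i:02d}.pdf" for i in range(1, 21)]  # a 20-file batch

def process_file(name: str) -> str:
    # Stand-in for one sub-agent's work: each file is handled
    # independently, with no shared state between workers.
    return f"{name}: processed"

# Five workers over 20 files, roughly four each, all running at once.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(process_file, files))

print(len(results))  # one result per file; map preserves input order
```

The parent agent plays the role of the pool: it distributes the work, then collects one result per file at the end.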


Sub-Agents Work in Isolation

Once spawned, sub-agents can't communicate with each other. Each one processes its assigned files independently and returns results to the parent agent, which combines everything at the end. If processing File B genuinely depends on what was found in File A, parallelism will produce errors. This constraint shapes every batch pipeline you design.
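The constraint is easiest to see with a running total, the classic order-dependent task. Each partial sum needs the previous one, so isolated workers cannot compute it independently; a sketch with illustrative numbers:

```python
amounts = [120.0, 75.5, 40.0, 15.25]

# A running total is inherently sequential: each entry depends on the
# previous partial sum, so parallel workers that can't see each other's
# intermediate output would each have to start from zero.
running = []
total = 0.0
for amount in amounts:
    total += amount
    running.append(total)

print(running)
```

Extracting the four amounts is parallelisable; computing the running column is not. Designing a pipeline means separating the two kinds of step.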

Structured Data Extraction

This is the highest-value batch operation for most professionals: pulling structured data from unstructured sources. A folder of photographed receipts, scanned invoices, or free-form PDF reports — each formatted differently, each burying the same types of information in different layouts.

Cowork reads each source file, interprets the content regardless of format, and writes standardised data to your output. The critical ingredient is an explicit output schema. Without one, sub-agents independently decide what to call things — one extracts "Amount" while another extracts "Total" for the same field. That inconsistency compounds across dozens of files. Specify exactly what you want:

You need to extract expense data from 25 photographed receipts.

Before

Extract data from these receipts.

After

For each receipt image in this folder, extract: Date (YYYY-MM-DD), Vendor Name, Subtotal, Tax, and Total. If any field is not visible, write 'UNREADABLE'. Compile all data into Receipt-Summary.xlsx with those columns plus a Source Image column. Add a SUM formula for the Total column and flag any receipt where Total exceeds $500 with red conditional formatting.

The second prompt specifies exact field names, a date format, a fallback value for unreadable data, the output file name, column structure, a formula, and a formatting rule. Every sub-agent follows the same schema. The data compiles cleanly without manual reconciliation.
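An explicit schema works because it forces every record into the same shape before compilation. A sketch of that idea, with the field list taken from the prompt above and `normalise_record` as a hypothetical name for the step each sub-agent effectively performs:

```python
SCHEMA = ["Date", "Vendor Name", "Subtotal", "Tax", "Total"]

def normalise_record(extracted: dict) -> dict:
    """Force a record into the shared schema, filling gaps with UNREADABLE."""
    return {field: extracted.get(field, "UNREADABLE") for field in SCHEMA}

# Two sub-agents returning differently complete records still compile cleanly.
a = normalise_record({"Date": "2025-03-01", "Vendor Name": "Acme", "Total": "42.00"})
b = normalise_record({"Vendor Name": "Beta Ltd"})
print(a["Subtotal"], b["Date"])
```

Without the shared `SCHEMA`, record `a` and record `b` would have different columns and the compiled spreadsheet would need manual reconciliation.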

Format Conversion with Content Restructuring

Cowork handles format conversion that goes well beyond changing the file extension. A CSV-to-Excel conversion can include adding formulas, creating pivot tables, applying conditional formatting, and splitting data across multiple tabs. Text files can become formatted Word documents with headings and tables. Spreadsheet data can become presentation slides with charts.

The key insight: specify what the output should look like, not just what format it should be in. "Convert to Excel" produces a basic container change. "Convert to Excel with a summary tab showing monthly totals, a pivot table by category, and conditional formatting that highlights amounts over budget" produces a finished business tool. Same source data, wildly different value.

Data Cleaning and Normalisation

Files from different sources arrive in different formats. Dates show up as DD/MM/YYYY in one file and MM-DD-YYYY in another. Currency symbols are mixed in with plain numbers. Category names drift — "Travel", "Travelling", "Travel Expense" all mean the same thing.

Cowork normalises everything during batch processing when you specify the target format. Build your cleaning rules into the prompt:

  • Dates: "Standardise all dates to YYYY-MM-DD format"
  • Currency: "Remove currency symbols and format amounts to two decimal places"
  • Categories: "Map all category variations to these standard names: Travel, Meals, Accommodation, Office Supplies, Other"
  • Missing data: "If a field is empty, write 'N/A' — never leave blank cells"

These rules apply consistently across every file in the batch because every sub-agent receives the same instructions. Consistency at scale — that's what separates a Cowork pipeline from manual processing.
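Each cleaning rule in the list above is, conceptually, a tiny function applied to every value in the batch. A sketch of the currency and category rules (the mapping table and function names are illustrative, not Cowork behaviour):

```python
# Map drifting category names onto the standard set; anything unmapped
# falls through to "Other", matching the rule in the prompt above.
CATEGORY_MAP = {
    "Travel": "Travel",
    "Travelling": "Travel",
    "Travel Expense": "Travel",
}

def clean_amount(raw: str) -> str:
    """Remove currency symbols and format to two decimal places."""
    digits = raw.replace("$", "").replace("£", "").replace("€", "").replace(",", "").strip()
    return f"{float(digits):.2f}" if digits else "N/A"

def clean_category(raw: str) -> str:
    return CATEGORY_MAP.get(raw.strip(), "Other")

print(clean_amount("£1,200.5"), clean_category("Travelling"))
```

When you write rules like these into the prompt, every sub-agent applies the same ones, which is exactly why the compiled output comes back uniform.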

Mixed File Types in a Single Batch

A common misconception: Cowork can only batch-process files of the same type. Not true. PDFs, spreadsheets, images, and text files sitting in one folder can all run through a single batch operation. Cowork identifies each file's type and adapts accordingly — parsing text from PDFs, interpreting visual content from images, reading structured data from spreadsheets.

This matters because real-world data rarely arrives in a uniform format. An expense folder might contain photographed receipts (JPG), emailed invoices (PDF), and manually logged amounts (CSV). One prompt, one pipeline, one output — regardless of source format.
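Under the hood this is a dispatch problem: route each file to a type-appropriate handler, then merge the results. A sketch of that routing step, with the handler table and fallback as assumptions of the sketch:

```python
from pathlib import Path

# Illustrative handler table: file extension -> how the content is read.
HANDLERS = {
    ".pdf": "parse text",
    ".jpg": "interpret visual content",
    ".csv": "read structured data",
}

def route(filename: str) -> str:
    # Dispatch by extension; unknown types fall back to plain-text
    # reading in this sketch.
    return HANDLERS.get(Path(filename).suffix.lower(), "read as plain text")

batch = ["receipt-01.JPG", "invoice-a.pdf", "log.csv", "notes.txt"]
print([route(f) for f in batch])
```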

The Token Cost Reality

Parallel batch processing is faster but more expensive in token usage. Each sub-agent needs its own reasoning stream, its own context window, its own tool calls. Five sub-agents processing four files each burn considerably more of your usage allocation than one agent working through 20 files sequentially.

The trade-off is speed for cost. Know when it matters. Urgent deadline with 50 client reports due today? Parallelise aggressively. Routine weekly cleanup? Run it sequentially and preserve your allocation. The exam tests whether you understand this trade-off, not just whether you can trigger parallelism.
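A back-of-envelope model makes the trade-off concrete. The token figures below are invented purely for illustration; only the shape of the result matters: wall-clock time shrinks with more workers while token cost grows:

```python
import math

files, minutes_per_file = 20, 1.25
tokens_per_file = 5_000            # assumed processing cost per file
overhead_tokens_per_agent = 2_000  # assumed spin-up cost per sub-agent

def estimate(workers: int) -> tuple[float, int]:
    wall_clock = math.ceil(files / workers) * minutes_per_file
    tokens = files * tokens_per_file + workers * overhead_tokens_per_agent
    return wall_clock, tokens

print(estimate(1))  # sequential: slow, cheapest
print(estimate(5))  # five sub-agents: much faster, more tokens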


Common Mistakes

Common Mistake

Listing files in a linear sequence — 'First process invoice-A.pdf, then invoice-B.pdf, then invoice-C.pdf' — forcing sequential execution across the entire batch.

Instead: Use batch language that enables parallelisation: 'For each PDF in this folder, extract the invoice number, date, and total.' Every file's extraction is independent, so there's no reason to impose an order. Let Cowork decide how to distribute the work across sub-agents.

Common Mistake

Prompting 'extract data from these invoices' without specifying which fields to extract, what to name the columns, or how to handle missing values.

Instead: Define an explicit output schema: 'Columns: Date (YYYY-MM-DD), Vendor (text), Amount (number, two decimals), Category. If a field is missing, write N/A.' Every sub-agent follows the same schema, so the data compiles cleanly without manual reconciliation.

Common Mistake

Parallelising tasks where each file's processing depends on the output of the previous file — a running total, cumulative analysis, or sequential numbering.

Instead: Recognise sequential dependencies and instruct Cowork accordingly: 'Process these files in sequence — each row's running total depends on the previous result.' Parallel sub-agents can't see each other's intermediate outputs, so dependent tasks must run in order.


Hands-On Activity


Build a Batch Processing Pipeline

15 min

Create a folder of mixed files, design a complete source-transform-output pipeline, and verify that parallel sub-agents produce consistent, normalised results.

What you will learn

  • Design a three-stage batch pipeline (source, transform, output) in a single prompt
  • Trigger sub-agent parallelisation across mixed file types
  • Verify data normalisation consistency across files processed by different sub-agents
  • Identify when batch operations require explicit schema definitions
  1. Create a folder called 'Batch-Lab'. Place 8-10 small files inside — a mix of text files, CSVs, or PDFs. Each file should contain structured information: dates, names, and monetary amounts. Use inconsistent formats deliberately — write dates as DD/MM/YYYY in some files and MM-DD-YYYY in others. Mix currency symbols ($, £, €) with plain numbers. Fictional data works fine.

    Why: You need enough files to trigger sub-agent parallelisation (typically 5+) and deliberate inconsistency to test the normalisation stage of your pipeline.

    Expected: A folder with 8-10 files containing extractable data points in deliberately inconsistent formats.

  2. In Cowork, select the 'Batch-Lab' folder and prompt: 'For each file in this folder, extract any dates, names, and monetary amounts you find. Compile all extracted data into a single file called Extracted-Data.xlsx with columns: Source File, Date, Name, Amount. Standardise all dates to YYYY-MM-DD format. Remove currency symbols and format amounts to two decimal places. If a field is not present, write N/A.'

    Why: This tests the full three-stage pipeline: parallel source reading, structured extraction with normalisation rules, and compiled output with a defined schema.

    Expected: An execution plan showing sub-agents assigned to process files in parallel, followed by a compilation step.

  3. Allow the plan to execute. While it runs, watch the sub-agent progress bars — you should see multiple bars advancing simultaneously.

    Why: Seeing multiple sub-agents work at once reinforces how batch parallelisation works in practice and gives you a feel for how long similar operations take.

    Expected: Multiple progress bars moving simultaneously, with the total elapsed time noticeably shorter than processing each file individually would take.

  4. Open Extracted-Data.xlsx and check three things: (a) every source file appears in the Source File column, (b) all dates are in YYYY-MM-DD format regardless of the original, and (c) all amounts are plain numbers with two decimal places — no currency symbols.

    Why: This quality check confirms that normalisation worked consistently across files processed by different sub-agents and that your output schema was followed uniformly.

    Expected: A clean spreadsheet with consistent date formatting, uniform amount formatting, every source file represented, and N/A in any cells where data was not found.


Practice Question


An office manager has a folder containing 25 photographed paper receipts (JPG images) from a team event. They need to extract the date, vendor name, and total amount from each receipt into a single expense spreadsheet. What is the most efficient approach?
