Interview Study Workflow

Study Design

Before any analysis can run, the study must be designed to generate the right data. Every downstream analysis decision — what to code, what to segment on, what to report — flows from what questions were asked. A poorly designed guide cannot be rescued by better analysis.

0.1

Interview Guide Design

The instrument that determines everything downstream

The interview guide is the instrument. Every downstream analysis decision flows from what questions were asked. There are three question modules, and the distinction between them is the most important design decision in the study.

Three Question Modules

Module A: Historical evaluation — triggers, criteria, alternatives considered, reasons for rejection. These produce basis variables for segmentation because they capture state of mind that preceded the purchase decision. They are not contaminated by what the participant is using today.
Module B: Current state — satisfaction, frustrations, current gaps, goals. These produce descriptor and criterion variables — they describe what happened after the decision was made, not what drove it.
Module C: Perceptual/hypothetical — brand perceptions, willingness to pay. Descriptor variables.

On criterion variable capture: In our current study design, the participant's current tool is asked directly in the interview and coded as an criterion variable during the application phase — not from a pre-interview screener. The core principle remains: keep criterion variables (what you are trying to explain) separate from basis variables (what you cluster on). Module B satisfaction and frustration data are descriptor variables by default, not basis variables.

JTBD Framing for Module A

When writing evaluation trigger questions, use a Jobs-to-be-Done framing: what was the participant trying to accomplish, what context made them start looking, what would need to be true for them to take action? People do not naturally articulate evaluation criteria — they tell stories about situations. The stories contain the criteria.

Output

study_config.json — question IDs, coding types, module classifications

0.2

Participant Simulation

End-to-end pipeline testing before real interviews

Before running real interviews, the full pipeline is tested with simulated participants. Simulation allows you to catch bugs, validate script paths, and build an initial codebook before spending budget on real transcripts.

Configure Parameters Before Launching

Simulation is not one-size-fits-all. Before running, explicitly decide on these roster parameters:

Company size distribution — what mix of small, mid-market, and enterprise participants should the roster reflect?
Job title and seniority — VP, Director, Manager in what proportions? Seniority affects communication style, decision authority, and frustration profiles.
Current tool distribution — which products are represented and at roughly what market-share proportions?
Industry distribution — which sectors are in scope?
Any other roster fields the study requires — these are passed to the simulation agent so each participant's answers are grounded in their assigned profile.

Decide these before launching and document them in the study's roster design notes. The simulation is only as realistic as the roster it draws from.

Verboseness Distribution — Where These Numbers Come From

Each participant is simulated with one of three verboseness levels. These defaults are empirically calibrated from a real Maze study.

Source: We analyzed N=74 real interview transcripts from a Maze study with UX researchers — actual professional interviews on the Maze platform. For each participant we measured total response word count and identified three natural clusters. Results from that analysis:

Level	Share	Observed avg	Observed range
Not Verbose	~34%	808 words	500–1,200
Somewhat Verbose	~43%	~1,500 words	1,200–2,100
Very Verbose	~23%	2,100+ words	1,900–3,000

35% Not Verbose — target 900 words 43% Somewhat Verbose — target 1,500 words 22% Very Verbose — target 2,400 words

Why targets differ from observed averages — the overshoot problem: The simulation targets (900 / 1,500 / 2,400) are guidance given to the simulation agent, not hard caps. LLM-simulated participants consistently overshoot their word count guidance — tell the agent to aim for 900 words and it often produces 1,100–1,200. By setting targets somewhat below where we want the output to land, the overshoot brings the final word count back into the real-data range. If a colleague asks why the targets don't exactly match the observed averages, the answer is: they are deliberately set conservatively to compensate for the LLM tendency to be more verbose than instructed.

Simulated participants are assigned realistic demographic variation from the roster parameters above. The simulation agent answers as a realistic professional in this domain — with the vocabulary, concerns, and communication patterns that role and domain entail.

Agent Configuration

Transcript Generator

Scriptresearch/00 How to simulate participants/simulate.py

Modelclaude-sonnet-4-6

Temperaturedefault (not set explicitly — Anthropic default of 1.0 for natural variation across participants)

Max tokens24,000 (sized for ~5 participants × ~3,000 words at Very Verbose + JSON overhead)

Batch size5 participants per API call

Inter-batch pause1.5 seconds

Retry attempts2

System promptNone — all instructions live in the user message

PromptYou are generating simulated interview transcripts for the study: "{study_name}"

PARTICIPANT POPULATION: {population}
STUDY CONTEXT: {study_context}

PARTICIPANTS IN THIS BATCH:
- Participant {id}: {dimension_fields}, Verboseness: {level} (target {N} words, range {min}–{max} words)
...(5 participants per batch)

SIMULATION RULES — apply every rule to every participant without exception:

1. ALL QUESTIONS REQUIRED. Every question in the interview guide must receive a response. No question may be skipped, merged with another, or omitted.

2. VERBOSENESS-DRIVEN LENGTH. Each participant has a word-count target and acceptable range. Calibrate the depth and length of all responses so the total word count lands within that range. Include the actual word count in the "total_words" field of your output.

3. MAZE-TRANSCRIPT SPEECH STYLE. These transcripts simulate Maze AI transcription output — cleaned and conversational, not raw speech. OMIT: um, uh, hmm, er, and heavy false starts (Maze removes these entirely). INCLUDE: hedge markers — "I think," "I mean," "I guess," "kind of," "you know," "honestly" — at roughly 2–4% of total words. Sentences starting with "So," "And," "But." Occasional subtle self-repairs (0–1 per participant). SENTENCE LENGTH: 15–22 words on average. Do not write short fragments.

4. PROFILE-APPROPRIATE RESPONSES. Each participant's answers must reflect their actual seniority, industry, and company size. A VP at a 5,000-person tech company uses different vocabulary, references different tools, and describes different problems than a manager at a 200-person healthcare company.

5. NO BRAND SKEWING. Do not bias awareness, sentiment, or behavior toward any specific product or vendor unless the study context explicitly directs it.

6. GEOGRAPHIC DISTRIBUTION. Weight company HQ toward high-density business states (CA, TX, NY, FL, IL, MA) while including reasonable spread across others.

7. DOLLAR SCALING. All budget figures, spend amounts, and WTP numbers must scale realistically with the participant's company size tier.

8. TOOL/BRAND AWARENESS BY PROFILE. Enterprise tools (Workday, SAP, iCIMS) for larger companies; SMB tools (BambooHR, Lever, Breezy) for smaller ones. VPs and Directors have broader market awareness than Managers.

9. PARTICIPANT DIVERSITY. Every participant in this batch must sound clearly different from the others. Vary the tools mentioned, examples used, problems described, frustrations expressed, and vocabulary chosen. Do not let any two participants share the same anecdotes, use the same phrases, or describe the same situation.

10. RESPONSE VARIATION WITHIN EACH PARTICIPANT. Response length should vary naturally across questions (log-normal, not uniform). Expect: 10–15% of responses ≤10 words (confirmations, "I don't know," transitions); 50–60% of responses: 40–180 words (standard substantive answers); 10–15% of responses: 200–400 words (narrative / experience-heavy questions).

11. SESSION ARC. First 2–3 questions: slightly shorter, slightly more tentative. Middle questions: most expansive and narrative. Final questions: slightly shorter and more direct.

INTERVIEW GUIDE:
{full interview guide text}

OUTPUT INSTRUCTIONS: Return ONLY a valid JSON object. No text before or after it. No markdown code blocks. Match this exact structure: {"study": "...", "participants": [{"participant_id", {dimension_fields}, "verboseness", "total_words", "estimated_duration_min", "transcript": [{"question_id", "question", "response"}]}]}. Generate all 5 participants. Make each one distinctly different from the others.

Why Sonnet (Not Haiku or Opus)

Participant simulation is synthetic text generation, not expert reasoning. Opus's reasoning advantage delivers effectively zero value on this task while costing ~5× Sonnet (~$29/study vs. ~$6/study for 180 participants). Haiku 4.5 would cut the cost from ~$6 to ~$2 per study but carries real risk on rule adherence: the prompt contains 11 numbered simulation rules plus the full interview guide, and smaller models tend to drop rules late in long prompts. Sonnet reliably holds all rules across the batch and produces distinct-sounding participants. The extra ~$4/study over Haiku is a cheap insurance policy on the input data every downstream pipeline step depends on.

Cost comparison per 180-participant study (~144K input + ~360K output tokens across 36 batches): Haiku 4.5 ~$1.94 | Sonnet 4.6 ~$5.83 (current default) | Opus 4.6 ~$29.16.

Post-Generation Validation

After each batch returns, simulate.py runs a word-count check against the verboseness spec, overwrites the reported total_words with the ground-truth count, and flags any participant whose word count falls outside their target range or who answered fewer questions than the interview guide contains. Warnings print per batch but do not fail the run — the roster designer reviews warnings and decides whether to re-run individual batches.

Output → A6 Data Files - Simulated/

transcripts.json — array of participants with transcript Q&A pairs

0.3

Maze Transcript Processing

Converting real interview exports into pipeline-ready JSON

Real interviews conducted via Maze are exported as a CSV where each column is one participant's transcript. The Maze export has a known inconsistency: some participants have their full transcript in a single cell; others have it split across multiple rows.

Script

python maze_to_json.py path/to/maze-export.csv

What the Script Does

Reads each column as one participant, concatenates all non-empty cells to handle split-row exports
Uses regex to identify turn headers (sequence number, timestamps, speaker label)
Groups turns into question-response pairs: each interviewer turn starts a new question
Prints participant count, sequence gaps, and warnings for any participants that could not be parsed

Quality check: Review the printed summary before proceeding. Any participant with a parse warning should be spot-checked manually against the raw CSV.

Codebook Discovery

The discovery phase converts raw participant responses into a structured codebook — the instrument that defines exactly what themes exist, how they are bounded, and what counts as an inclusion. Every downstream analysis depends on getting this right.

Hybrid execution model. Discovery runs across two modes. Steps 1.1, 1.2a, 1.2b, and 1.6 run in the Claude Code conversation window using the Claude Code subscription (no per-call API charges). Steps 1.3 (extraction) and 1.4 (clustering) run via run_extract_cluster.py on the Anthropic API because of their high parallelism. Step 1.5 (enhanced six-pass codebook construction) runs via run_codebook.py on the API because its ~60–75K token structured JSON output exceeds what can be produced reliably in a single conversation session.

1.1

Study Context Generation

[CONVERSATION MODE] Context document passed to all downstream agents

Before any other coding step runs, Claude Code (running in the conversation window using the Claude Pro subscription) reads a random sample of 50 participant responses and produces a structured four-section study context document. This step used to run as a Python+API subprocess; moving it into conversation mode eliminates its API charge entirely.

Agent Configuration

Keycontext_generator

Modelclaude-sonnet-4-20250514

Temperature0 — deterministic; factual characterization, no variation needed

Max tokens8,000

Persona"You are a qualitative research analyst. You have been given a sample of interview responses and your job is to produce a structured context document that will help downstream agents understand who was interviewed, how they communicate, and what subjects they discuss. Be descriptive, not prescriptive. Capture patterns, vocabulary, and professional context. Do not draw analytical conclusions — that is for the coding agents."

TaskReads 50 randomly sampled responses. Produces a 4-section JSON: who was interviewed, what they were asked, how they communicate, and dominant topics. Explicitly instructed not to pre-determine codes or draw conclusions.

Four Sections of the Context Document

Who was interviewed — roles, seniority, company sizes, industries, range of variation
What they were asked — plain-language summary of each question's intent, inferred from actual responses
How they communicate — vocabulary, technical vs. non-technical language, response length tendencies, jargon
Dominant topics and patterns — themes appearing most frequently across the 50 sampled responses

Why this exists: Agents making segmentation and grouping decisions perform better when they understand the professional vocabulary of the people they're analyzing. Without context, the extraction agent treats all language generically. With it, it recognizes that "running payroll in-house" and "managing payroll processing internally" are the same concept.

What it is not: Descriptive, not prescriptive. It tells downstream agents what the study population is like — not which codes to create. Analytical conclusions come from the data.

Output → A6 Data Files - Simulated/

study-context.json

1.2

Question Classification

[CONVERSATION MODE] Two-part process: type determination (Part 1a) + structural definition (Part 1b)

Classification runs in two sequential sub-steps, both now executed in the Claude Code conversation window using the Claude Pro subscription. Part 1a determines question type cheaply. Part 1b defines the structural codebook for non-thematic questions with full data coverage. Output is written directly to questions-registry.csv in the study data folder.

Part 1a — Type Determination (5-participant sample per question)

Agent Configuration — Part 1a

Keyquestion_classifier

Modelclaude-opus-4-6

Temperature0 — deterministic; this is a gate decision, not a creative one

Max tokens2,000

TaskReceives the question text and 5 randomly sampled responses. Classifies the question as thematic, categorical, rank-order, or binary. Returns only the coding type and brief reasoning — no structural details.

PromptYou are a senior qualitative methodologist classifying interview questions by coding type.

TASK: Determine the coding type for this interview question. Read the question text and the sample responses, then classify the question.

QUESTION: "{question_text}"

SAMPLE RESPONSES (small sample — for type determination only):
1. "{response}"
...(up to 5 randomly sampled)

CODING TYPES:
1. "thematic" — open-ended question producing complex, multi-faceted responses. Responses contain multiple ideas, experiences, or judgments that need to be broken into meaning units and grouped into themes. Most open-ended "why" and "how" questions fall here.
2. "categorical" — question where responses naturally cluster into a small set of distinct categories (3–7). Typically satisfaction, sentiment, or preference questions where each response maps to one category.
3. "rank_order" — question where responses contain a numeric value or quantity that maps to ordinal buckets. Company size, years of experience, frequency counts.
4. "binary" — question where responses indicate yes or no, presence or absence. "Did you..." or "Have you..." questions.

INSTRUCTIONS: (1) Read the question text and all sample responses. (2) Determine which coding type best fits. (3) Provide brief reasoning. (4) Do NOT define categories, buckets, or binary criteria — that comes in a separate step with full data.

Output JSON: {"coding_type": "thematic|categorical|rank_order|binary", "reasoning": "Why this coding type fits"}

Flow: How Part 1a Handles Information

Input

One question at a time, with 5 randomly sampled participant responses to that question.

ExampleQ7: "What frustrates you about your current HR system?" P5: "Reporting is terrible..." P12: "Interface looks like 2005" P23: "Re-entering data 3 times..." P41: "Useless for analytics" P88: "Can't customize permissions"

→

What the Agent Does

Reads the question and the 5 sample responses. Asks: which of the four coding types fits? Decides between thematic, categorical, rank-order, or binary. Writes brief reasoning. Does NOT define structure yet — that is Part 1b's job.

Reasoning"Open-ended frustrations question. Responses contain multiple distinct ideas per person (reporting, UI, data entry, analytics, permissions). Not enumerable into 3-7 categories. → thematic"

→

Output → Part 1b or Step 1.3

A single classification record per question. Thematic questions skip Part 1b and go straight to extraction (Step 1.3). Non-thematic questions go to Part 1b for structural definition.

Passes On{ "question_id": "Q7", "coding_type": "thematic", "reasoning": "Open-ended; multiple ideas per response; not enumerable" }

Why only 5 responses for type determination: Question type is largely apparent from the question text itself; the responses confirm it. Binary and rank-order questions are identifiable from even one or two answers. Categorical vs. thematic distinction needs a few more — 5 is sufficient. Keeping Part 1a cheap lets it run across all questions in parallel, preserving the token budget for Part 1b.

Why Opus (Not Sonnet) for Question Type Classification

Classification is a high-leverage, low-volume decision. The pipeline makes one call per question, typically 15-25 calls per study, so the cost delta between Sonnet and Opus is trivial (roughly $1-2 extra per study). The failure mode, however, is catastrophic and silent. A thematic question misclassified as categorical collapses all inductive themes into a handful of named buckets and cannot be recovered downstream. A categorical question misclassified as thematic fills the codebook with noisy open-ended clusters that should have been clean counts. A rank-order question misclassified as thematic loses the ordinal structure entirely.

Because the classification drives every downstream branch of the pipeline, and because the error is invisible until a human reviewer catches it in the finished codebook, the strongest available reasoning model is warranted even though the task itself is usually straightforward. Opus is also better at catching the subtle case where the question text looks open-ended but the responses themselves fall cleanly into a small set of categories, or where a nominally closed question receives rich open-ended explanations that deserve thematic coding. Sonnet tends to classify from the question text alone; Opus weighs both the text and how participants actually answered.

The 5-participant sample is retained because it is sufficient for the reasoning model to spot the pattern — going larger would raise cost without materially improving accuracy on a decision this structural.

Part 1b — Structural Definition (all participants, non-thematic questions only)

Agent Configuration — Part 1b

Keyextractor_1

Modelclaude-sonnet-4-20250514

Temperature0.2

TaskReceives the question text, the coding type from Part 1a, and ALL participant responses to that question. For categorical: proposes 3-7 mutually exclusive category names and definitions. For rank-order: proposes bucket boundaries from the full observed range. For binary: proposes positive/negative labels, definition, and inclusion/exclusion criteria.

PromptYou are an expert qualitative methodologist.

TASK: This question has been classified as "{coding_type}". {Type-specific instruction: For categorical — examine all responses and identify 3–7 mutually exclusive categories that capture the full range of actual answers. For rank_order — examine all responses and propose sensible ordinal buckets that span the full observed range. For binary — examine all responses and define what counts as the positive case with clear inclusion and exclusion criteria.}

You are reading ALL {n} participant responses to ensure the structural definition covers the full range of actual answers in the dataset.

QUESTION: "{question_text}"

ALL RESPONSES ({n} total):
1. "{response}"
...

Output JSON (type-specific format):
Categorical: {"category_details": {"categories": ["Cat 1", "Cat 2", ...], "category_definitions": {"Cat 1": "Definition...", ...}}}
Rank-order: {"rank_order_details": {"buckets": [{"label": "1–100", "range_low": "1", "range_high": "100"}, ...]}}
Binary: {"binary_details": {"positive_label": "Yes", "negative_label": "No", "definition": "...", "inclusion_criteria": "...", "exclusion_criteria": "..."}}

Flow: How Part 1b Handles Information

Input

One non-thematic question + the coding type from Part 1a + ALL ~180 participant responses to that question.

ExampleQ3 [categorical]: "How many employees does your company work with?" P1: "About 800 people" P2: "Around 250" P3: "Just under 5,000" P4: "We're a 40-person team" P5: "12,000 globally" ... 175 more responses

→

What the Agent Does

Reads the full range of actual answers. For categorical: identifies 3-7 mutually exclusive categories with definitions. For rank-order: proposes ordinal buckets spanning the full observed range. For binary: defines positive/negative labels with inclusion/exclusion criteria. Structure is data-driven, not predefined.

Reasoning"Responses span 12 → 12,000 employees. Natural B2B segmentation breaks fall at SMB (1-250), Mid-Market (251-1,000), Enterprise (1,001+). All 180 responses fit cleanly into these three buckets."

→

Output → Final Codebook

A complete, authoritative structural definition for that question. Goes directly into the final codebook — not refined later by the dual architects (Steps 1.5-1.6), which only touch thematic questions.

Passes On{ "question_id": "Q3", "coding_type": "categorical", "categories": [ {"label": "SMB", "definition": "1-250"}, {"label": "Mid-Market", "definition": "251-1,000"}, {"label": "Enterprise", "definition": "1,001+"} ] }

Why all participants for structural definition: Category names, bucket boundaries, and binary criteria must reflect the full range of answers actually observed. A category appearing in 5% of a 180-person study (9 people) might appear only once in a 30-person sample. Reading all participants guarantees complete coverage. The token cost is low because non-thematic responses are short (a few words to a sentence) and Part 1b runs only for the handful of non-thematic questions. For a study with 5 non-thematic questions and 180 participants, Part 1b reads approximately 45,000 tokens total — less than reading 30 participants across all questions.

Thematic — open-ended, multi-code Categorical — Part 1b proposes categories from full dataset Rank-Order — Part 1b proposes buckets from full value range Binary — Part 1b proposes labels and criteria from full dataset

The structural definition from Part 1b is authoritative. For any question classified as categorical, rank-order, or binary, the structure defined in Part 1b becomes the final codebook structure for that question. It is not refined by the codebook construction agent (Step 1.5), which only operates on thematic questions. Review the non-thematic entries in questions-registry.csv during the Step 1.6 human review.

Why One Agent (Not Two) for Classification

Classification is a structured decision with a finite, well-defined outcome space — the four types are exhaustive and mutually exclusive. It is not an interpretive judgment the way codebook construction is. Two agents classifying the same question would almost always agree; the rare disagreement would be on edge cases better resolved by reading more responses, not by running a second agent. The complexity and cost of dual-agent classification with reconciliation is not justified by the improvement in output quality.

Output → A6 Data Files - Simulated/

questions-registry.csv — question_id, question_text, coding_type, n_codes_or_categories

1.3

Meaning Unit Extraction

[SCRIPT: run_extract_cluster.py] One agent reads study context, then breaks every response into the smallest distinct ideas

One Sonnet agent processes all participant responses to thematic questions. Before reading any responses, it receives the study context document from Step 1.1 — who was interviewed, how they communicate, and what topics they discuss. This primes the agent with the professional vocabulary and communication style of the participants so it can make better interpretation decisions.

Agent Configuration

Keyextractor_1

Modelclaude-sonnet-4-20250514

Temperature0.2 — slight openness to capture borderline meaning units a temperature-0 agent would skip

Batch size20 responses per API call, 18 parallel workers

Persona"You are an expert qualitative researcher performing inductive coding. You are thorough and nuanced. Capture both explicit statements and clearly implied meaning. Look for subtle differences in how participants express similar ideas. It is better to extract a borderline meaning unit than to miss one."

Prompt{persona}

TASK: Read these interview responses to the question below and break each one into discrete meaning units. A meaning unit is the smallest segment of text that expresses a single idea, experience, or judgment (Graneheim & Lundman, 2004).

For each meaning unit, write a short descriptive code (3-8 words) that captures the specific meaning. Stay close to the participant's language. This is first-cycle descriptive coding (Saldana, 2016).

QUESTION: "{question_text}"

RESPONSES:
Participant {id}: "{response}"
...

RULES:
- One response can contain multiple meaning units
- Descriptive codes should be specific, not generic ("had to re-enter data three times" not "data issues")
- Preserve the participant's meaning. Do not interpret or abstract yet.
- If empty or off-topic: mark as "no_codable_content"
- Include the exact quote for each meaning unit

GUARD RULES:
- WORD BOUNDARY: Only code complete, standalone words. Never extract a word found inside a longer word (e.g., "equity" is not evidence that the participant said "quit").
- NEGATION: Pay attention to negation (not, never, didn't, wasn't). "Not satisfied" means dissatisfied. Identify the full negated phrase before coding.
- SARCASM: Watch for ironic statements where context indicates the speaker means the opposite of their literal words. Code the intended meaning, not the literal words.
- HEDGING: Distinguish definitive statements from hedged ones. "Kind of," "sort of," "I guess," "maybe" weaken meaning — code the actual strength of the statement.
- ABSENCE: If a participant does not mention a topic, do NOT code that as absence. Only code what is stated or clearly implied. Silence is not data.
- CONTEXT: Read the full response before coding any part. A phrase can change meaning based on what surrounds it.

Output JSON: {"meaning_units": [{"participant_id": 1, "text": "exact quote", "descriptive_label": "short label"}, ...]}

Flow: How the Extractor Handles Information

Input

One thematic question + a batch of 20 participant responses to that question. The agent has already been primed with the study context document from Step 1.1.

ExampleQ7: "What frustrates you about your current HR system?" P5: "The reporting is terrible — I can't pull anything on demand. And we have to re-enter the same data three times every onboarding. Honestly the only thing it does well is payroll runs."

→

What the Agent Does

Reads each response in full, then breaks it into the smallest distinct ideas. One sentence can yield multiple meaning units; one idea can span multiple sentences. Applies guard rules (negation, sarcasm, hedging, word boundary). Writes a 3-8 word descriptive code for each unit, staying close to the participant's language.

SegmentationP5's response → 3 distinct ideas: (1) Reporting can't be pulled on demand (2) Onboarding requires triple data entry (3) Payroll runs are the only strength

→

Output → Step 1.4

A flat list of meaning unit records. Each record contains the participant ID, the exact verbatim quote, and a short descriptive label. ~340 meaning units per question for a 180-person study.

Passes On[ { "participant_id": 5, "text": "reporting is terrible — I can't pull anything", "descriptive_label": "Cannot pull reports on demand" }, { "participant_id": 5, "text": "re-enter the same data three times every onboarding", "descriptive_label": "Onboarding triple data entry" }, { "participant_id": 5, "text": "the only thing it does well is payroll runs", "descriptive_label": "Payroll is sole strength" } ]

Methodological Grounding for the Guard Rules

Dunivin (2024), Scalable qualitative coding with LLMs, establishes the central principle: LLMs require more precise codebook descriptions than human coders do because they lack the contextual understanding human coders develop through training and discussion. Every guard rule below is a precision instruction the model would otherwise miss.

Meaning unit definition (cited inline in prompt): Graneheim & Lundman (2004) — the smallest segment expressing a single idea
Descriptive label definition (cited inline in prompt): Saldana (2016), first-cycle descriptive coding — labels close to the participant's language
Negation, absence, context rules: Hsieh & Shannon (2005), Three approaches to qualitative content analysis — addresses interpretation decisions including the risk of misreading negated statements and the principle that absence of a statement is not coded data
Sarcasm and hedging rules: De Paoli (2024), Performing an inductive thematic analysis of semi-structured interviews with a large language model, identifies sarcasm, irony, and hedged language as the dominant failure modes for LLM-based qualitative analysis. Explicit prompt instructions to watch for these are the documented mitigation
Word boundary rule: LLM-specific. Derived from observed extraction failures where the model identifies a concept inside a longer word as a keyword match. Khalid & Witmer (2025), Best practices for prompt engineering in qualitative analysis with LLMs, recommends explicit boundary instructions for token-based models
Human oversight at Step 1.6: Bhaduri et al. (2024), LLMs as qualitative research assistants, argues that LLM-generated codes require structured human review before downstream use. Our Step 1.6 codebook review is this checkpoint

Reliability target grounding. Dunivin (2024) reports that GPT-4 with chain-of-thought prompting achieved Cohen's κ ≥ 0.79 (excellent agreement) on 3 of 9 codes and κ ≥ 0.6 (substantial) on 8 of 9 codes against human coders. Our HR Leaders BambooHR study achieved an overall weighted κ of 0.909 — above the strongest results in the published literature. This is why we are confident the single-extractor design is sufficient.

Chain-of-thought grounding. Dunivin (2024) finds that requiring the model to reason about each code before assigning it improves coding fidelity. This is the basis for the landscape analysis requirement in Steps 1.4 and 1.5 — clustering and codebook construction agents must write what they observe before proposing structure.

Why One Agent (Not Two)

The HR Leaders BambooHR study produced an overall weighted Kappa of 0.909. The four codes that fell below threshold were definition problems, not extraction failures — a second extractor would not have fixed them. Dual extraction was adding methodological complexity without improving downstream reliability. The complexity budget is better spent at the codebook construction step, where boundary-drawing actually matters.

Why Sonnet (Not Opus) for Extraction

Extraction is the highest-volume call in the entire pipeline. For a typical 180-participant × 13-thematic-question study, that is ~2,340 extraction calls — roughly 10× the number of Opus calls everywhere else in discovery combined. At ~800 output tokens each, that is ~1.9M output tokens just for extraction. Opus is ~5× the cost of Sonnet, so upgrading extraction is the single biggest cost delta you could make to the pipeline.

The HR Leaders BambooHR study hit weighted κ = 0.909 with Sonnet extraction. That is above the strongest published results (Dunivin 2024 reports GPT-4 at κ ≥ 0.79 on 3 of 9 codes). When you debrief that study, the four codes that fell below threshold were definition problems at construction time, not extraction misses. Opus at extraction would not have fixed them.

Extraction is a more mechanical task than construction. It is "find spans that express one idea, write a short label." It is not extended reasoning. Opus's reasoning advantage (GPQA Diamond, etc.) is smallest on pattern-recognition tasks like this. The reasoning-heavy step is codebook construction — which is exactly where we already spend the Opus budget.

Where Opus Would Actually Help

The documented failure modes for LLM qualitative extraction are:

Sarcasm and irony (De Paoli 2024)
Negation handling ("I don't care about reporting" being coded as reporting-interested)
Hedged language ("I guess it's fine" being coded as satisfaction)
Compound responses where one sentence contains two distinct ideas

On #1 and #3 (tone and implicature), Opus is genuinely better. On #2 and #4 (structural), the gap is small — Sonnet handles these well with explicit prompt instructions, which the current extractor has. The current extractor prompt includes guard rules for sarcasm, hedging, negation, and compound sentences; if a future study surfaces a concentration of errors in the tone categories specifically, extraction can be upgraded to Opus as a targeted fix rather than a blanket cost increase.

Why exact quotes are preserved: Codebook agents downstream need the actual language participants used to write grounded examples and counter-examples. Abstract descriptive labels without verbatim quotes produce ambiguous codebook entries. Reports also need verbatim quotes for client deliverables.

Output → A6 Data Files - Simulated/

meaning-units-log.csv — meaning_unit_id, participant_id, question_id, exact_quote, descriptive_label

1.4

Per-Question Clustering

[SCRIPT: run_extract_cluster.py] Opus groups meaning units into conceptually coherent clusters, one question at a time

Agent Configuration

Keyclusterer

Modelclaude-opus-4-6

Temperature0 — deterministic; clustering decisions must be consistent across questions

Max tokens32,000

Persona"You are a qualitative researcher performing preliminary grouping of descriptive codes. Your job is to compress meaning units into rough semantic clusters that will be passed to a global synthesis agent. When in doubt about whether two codes belong together, keep them separate. Under-clustering is recoverable downstream; over-merging is not — once two distinct concepts are collapsed into one cluster, the distinction is lost permanently. Err strongly on the side of finer-grained clusters."

Prompt{persona}

{study_context_block — who was interviewed, how they communicate, dominant topics from Step 1.1}

TASK: You are performing preliminary clustering of meaning unit descriptions from a qualitative interview. This is an intermediate compression step — your clusters will be passed to a single Opus codebook architect that will build the canonical global codebook. The richer your output, the better that downstream agent can do its job.

NOTE ON TERMINOLOGY: At this stage we are working with "meaning unit descriptions" (the short descriptive labels the prior extraction agent assigned to each meaning unit). The word "code" is reserved for the final codebook entries that the downstream architect will define. You are not creating codes here — you are grouping descriptions into clusters that will inform code creation later.

Your goal is to group semantically related meaning unit descriptions into rough clusters. This does NOT need to be perfect. The downstream architect will refine, split, and merge your clusters. Be inclusive: it is better to put a borderline description in a cluster than to leave it unclustered.

QUESTION: "{question_text}"

MEANING UNIT DESCRIPTIONS (participant_id + descriptive label):
{stripped_units}

INSTRUCTIONS:
1. Read through all meaning unit descriptions and identify recurring conceptual patterns.
2. Group related descriptions into clusters. Each cluster should represent one coherent idea.
3. Give each cluster a short, descriptive label (4-8 words).
4. For each cluster, include AT LEAST 5 representative meaning unit descriptions that characterize the cluster. More is better — the downstream architect relies on these examples to understand what the cluster contains. If a cluster genuinely has fewer than 5, include all of them.
5. For each cluster, list AT LEAST 5 participant IDs whose meaning units best exemplify it. These will be used to pull representative quotes for the downstream architect. Again, more is better.
6. Estimate how many unique participants contributed to each cluster.
7. Let the data decide the number of clusters. There is NO minimum and NO maximum — create as many clusters as the meaning unit descriptions themselves warrant. If the responses contain 7 distinct ideas, make 7 clusters. If they contain 35, make 35. Do not force artificial merging to hit a lower number, and do not artificially split to hit a higher number. Under-clustering is irreversible (distinctions collapsed here cannot be recovered); over-clustering is recoverable. When in doubt, keep separate.
8. Meaning unit descriptions that don't fit any cluster: list only up to 20 uncategorized examples.

Output JSON: {"clusters": [{"cluster_label", "representative_meaning_units", "representative_participant_ids", "approximate_participant_count"}], "uncategorized_meaning_units": [...]}

Clustering per question first reduces cognitive load on the global construction agent. Instead of receiving thousands of raw descriptive labels in one block, it receives pre-organized clusters per question — making the landscape analysis step more tractable.

Batching: One call per question. The clusterer runs once per thematic question, processing all meaning units for that question in a single Opus call. Questions are processed in parallel via thread pool, but within a question there is no internal batching. Input is capped at 500 meaning unit descriptions per question (evenly sampled if more) — comfortably within Opus's context window.

No minimum or maximum on the number of clusters. The prompt explicitly instructs the agent to let the data decide how many clusters to create. If a question's responses contain 7 distinct ideas, the agent should make 7 clusters; if they contain 35, the agent should make 35. Under-clustering is irreversible (distinctions collapsed here cannot be recovered); over-clustering is recoverable downstream. The agent is told to err toward finer-grained clusters.

Receives the study context from Step 1.1. Before reading any meaning unit descriptions, the clusterer is primed with the same who/how/what context document the extractor sees. This helps it recognize when descriptions that look superficially different are referring to the same underlying concept in the participants' shared vocabulary.

Terminology note: At this stage we deliberately use the phrase "meaning unit description" rather than "code." The word "code" is reserved for the final codebook entries produced in Step 1.5. The clusterer is grouping descriptive labels — it is not creating codes.

Why minimums of 5: The clusterer must include at least 5 representative meaning unit descriptions and at least 5 participant IDs (with quotes attached downstream) per cluster. The richer the cluster summary, the better the downstream architect can decide what to merge, split, and define. Opus is configured with a 32,000-token max output here, so the richer cluster summaries fit comfortably without crowding the budget.

Flow: How the Clusterer Handles Information

Input

The study context document from Step 1.1 (who was interviewed, how they communicate, dominant topics) plus all meaning unit descriptions for ONE question, stripped to just {participant_id, descriptive_label} — verbatim quotes are NOT sent to keep input tokens manageable. Capped at 500 units (evenly sampled if more).

Example (Q7, ~340 units)[ {P5: "Cannot pull reports on demand"}, {P5: "Onboarding triple data entry"}, {P12: "Interface looks like 2005"}, {P23: "Re-entering same data 3x"}, {P41: "Useless for analytics"}, {P67: "No real-time dashboards"}, {P88: "No per-role permissions"}, {P102:"Can't extract custom data"}, ... 332 more ]

→

What the Agent Does

Reads all meaning unit descriptions, identifies recurring conceptual patterns, and groups semantically related descriptions into rough clusters. The number of clusters is whatever the data warrants — no minimum, no maximum. Names each cluster, picks at least 5 representative meaning unit descriptions, lists at least 5 representative participant IDs, and estimates how many participants contributed. Errs toward finer-grained clusters — over-merging here is irreversible.

Grouping"Cannot pull reports", "Useless for analytics", "No real-time dashboards", "Can't extract custom data" → all about REPORTING "Onboarding triple data entry", "Re-entering same data 3x" → DUPLICATE DATA ENTRY

→

Output → Step 1.5

A set of cluster objects per question — however many the data warrants. Each cluster has a label, AT LEAST 5 representative meaning unit descriptions, AT LEAST 5 representative participant IDs, and an approximate participant count. The pipeline then bolts on 5 verbatim quotes per cluster (looked up from the originals) before passing to Step 1.5.

Passes On[ { "cluster_label": "Reporting & analytics gaps", "representative_meaning_units": [ "Cannot pull reports on demand", "Useless for analytics", "No real-time dashboards", "Can't extract custom data", "No exportable raw data" ], "representative_participant_ids": [5, 41, 67, 102, 134], "approximate_participant_count": 28 }, { "cluster_label": "Duplicate data entry burden", "representative_meaning_units": [ "Onboarding triple data entry", "Re-entering same data 3x", "Sync gaps force manual re-key", "Same field across 3 modules", "No single source of truth" ], "representative_participant_ids": [5, 23, 89, 145, 167], "approximate_participant_count": 19 }, ... ~18 more clusters ]

Why Opus here: Over-merging at the cluster stage causes irreversible information loss — the global synthesis agents can only see what the clusterer passes. If two distinct concepts are collapsed here, they cannot be recovered. Opus errs toward finer-grained clusters. Under-clustering is recoverable; over-merging is not.

1.5

Global Codebook Construction

[SCRIPT: run_codebook.py] Enhanced six-pass Opus architect with adjacency disambiguation and boundary examples

Codebook Architect (single Opus agent)

Keycodebook_architect

Modelclaude-opus-4-6

Temperature0.2 — low enough for consistent merge/split calls, just high enough for meaningful internal deliberation

Max tokens32,000

Persona"You are a senior qualitative methodologist building the canonical global codebook for an interview study. You think rigorously about where to merge codes (when two ideas are conceptually adjacent) and where to keep them separate (when distinguishing them carries real analytical signal). You always run through both lenses before committing to a final codebook: a parsimony pass that asks 'can these be merged without losing meaningful distinction?' and a distinction critique that asks 'have I collapsed anything that should stay separate?' Every theme you commit to must be distinct enough that two independent coders could reliably tell it apart from every other theme using only the written definition and criteria. You record the reasoning behind every non-obvious merge or split decision in a self-audit log so a reviewer can trace why each code exists."

Prompt{persona}

{study_context_block — who was interviewed, how they communicate, dominant topics from Step 1.1}

TASK: Build the canonical global codebook for this interview study.

You are seeing preliminary clusters from {N} thematic questions across {total_participants} participants. Your job is to synthesize these clusters into a definitive, study-wide list of codes that applies consistently across all questions.

CRITICAL NAMING RULE — NO "AND" IN CODE NAMES: Every code name must express exactly one concept. If you find yourself wanting to use "and" or "&" in a name (e.g., "Reporting and Analytics Capability Gap"), that is a structural signal that the code conflates two ideas. You MUST either split it into two separate codes, or pick the dominant concept and drop the other. Conflated names are the largest single source of dual-coder disagreement downstream.

REQUIRED REASONING PROCESS — work through these six passes in order:

PASS 1 — LANDSCAPE ANALYSIS (write this out, 4-6 sentences): What patterns do you see across all questions? Which conceptual ideas appear in multiple questions under different framings? Which clusters look like genuine standalones? Which look like edge cases that will need a judgment call?

PASS 2 — PARSIMONY DRAFT (internal): Build a first draft by merging clusters that capture the same core idea. For each candidate merge, ask "can these be merged without losing meaningful distinction?"

PASS 3 — DISTINCTION CRITIQUE (internal): Walk back through the parsimony draft and ask "have I collapsed anything that should stay separate?" Split any code where two coders could not reliably tell its sub-meanings apart.

PASS 4 — ADJACENCY DISAMBIGUATION (internal, NEW): For every code in the draft, identify the 2 nearest neighbor codes and write a one-sentence disambiguation test answering "what specifically tells a coder this is THIS code, not the neighbor?" If you cannot write a sharp test, the boundary is too soft — either merge the two codes or sharpen the definitions until a test becomes writable.

PASS 5 — PARSIMONY CHALLENGE (internal, NEW): Final merge-or-keep check. For each code, ask "if I removed this entirely and forced its meaning units into the closest neighbor, would the analysis lose anything important?" If no, merge.

PASS 6 — COMMIT FINAL: Output the final codebook. For every non-obvious merge or split decision, write a `decisions_log` entry explaining the reasoning.

CLUSTERS BY QUESTION:
{questions_text — each cluster includes its label, up to 5 sample meaning unit descriptions, ~participant count, and representative verbatim quotes}

QUALITY BAR for each committed code:
- One coherent concept (not a bundle of 2-3 ideas)
- Name contains no "and" or "&"
- Distinct enough that two independent coders could tell it apart from every other code using only the written definition and criteria
- Named for the concept, not for the question that prompted it
- Has 2 adjacency tests naming its nearest neighbors with a one-sentence disambiguation test each
- Has exactly 3 positive example quotes with reasoning and exactly 3 BOUNDARY example quotes with reasoning

NO LIMIT ON THE NUMBER OF CODES: Let the data decide how many final codes the codebook contains. There is NO minimum and NO maximum. Create as many codes as the clusters themselves warrant — if the study contains 12 distinct concepts, make 12 codes; if it contains 45, make 45. Do not force merges to hit a lower number, and do not force splits to hit a higher number. The parsimony pass, distinction critique, adjacency disambiguation, and parsimony challenge exist to find the right number, not a target number.

For each final code, write:
- A precise one-sentence definition applicable across all questions
- Inclusion criteria: when to assign this code (specific and testable)
- Exclusion criteria: when NOT to assign, and which adjacent code to assign instead
- Adjacency tests: the 2 nearest neighbor codes, each with a one-sentence disambiguation test
- EXACTLY 3 positive example quotes from the clusters above, each with reasoning for why it fits the code
- EXACTLY 3 BOUNDARY example quotes. A boundary example is a quote that looks like it could match this code but actually belongs to a specific named adjacent code. For each, name the adjacent code it belongs to and explain why. These replace the prior "negative examples" field — examples of obviously-unrelated content do not help coders; examples of almost-matches do.

Output JSON: {"landscape_analysis", "themes": [{code_name, definition, inclusion_criteria, exclusion_criteria, adjacency_tests, source_questions, examples, boundary_examples}], "decisions_log": [{decision_type, final_code_name, clusters_involved, reasoning}], "other_codes", "total_themes", "construction_notes"}

Flow: How the Codebook Architect Handles Information

Input

The study context document from Step 1.1 (who was interviewed, how they communicate, dominant topics) plus ALL clusters from ALL ~13 thematic questions in the study, simultaneously. Each cluster includes its label, up to 5 sample meaning unit descriptions, ~3 representative verbatim quotes, and approximate participant count. ("Meaning unit descriptions" is the term we use at this stage — the word "code" is reserved for the final codebook entries that this agent is about to define.)

Example (~400 clusters across 13 Qs)Q5 — "What triggers a search for a new HR system?": Cluster: "Reporting workflow gaps" (~22) quotes: P14, P67, P122 Cluster: "Compliance audit failures" (~18) ... Q7 — "What frustrates you about your current HR system?": Cluster: "Reporting & analytics gaps" (~28) quotes: P5, P41, P102 Cluster: "Duplicate data entry" (~19) ... Q11 — "What capabilities are missing?": Cluster: "Cannot extract data" (~24) quotes: P34, P88, P156 ... [~10 more questions, ~400 clusters total]

→

What the Agent Does

Six-pass reasoning. (1) Landscape analysis written out. (2) Parsimony draft merging conceptually adjacent clusters. (3) Distinction critique splitting anything two coders could not tell apart. (4) Adjacency disambiguation — name the 2 nearest neighbors for every code and write a one-sentence test distinguishing each. (5) Parsimony challenge — second merge-or-keep check. (6) Commit final codebook with a decisions_log entry for every non-obvious merge or split. Cross-question synthesis: the same underlying concept appearing under different framings becomes ONE canonical code. Code names cannot contain "and" or "&".

Cross-Question SynthesisQ5 "Reporting workflow gaps" Q7 "Reporting gaps" Q11 "Cannot extract data" Q15 "Manual report compilation" ↓ (parsimony merges; adjacency test keeps separate from "Data Entry Burden") ONE global code: "Reporting Capability Gap" (not "Reporting and Analytics" — name must be one concept)

→

Output → Step 1.6

A committed codebook with as many global codes as the data warrants — no target number. Each code is fully defined with inclusion and exclusion criteria, 2 adjacency tests naming its nearest neighbors with disambiguation tests, exactly 3 positive example quotes with reasoning, and exactly 3 boundary example quotes with reasoning (each naming the adjacent code the quote actually belongs to). Plus a decisions_log of every non-obvious merge/split and a landscape_analysis chain-of-thought summary. Each code is named for the concept, not the question, and no code name contains "and" or "&".

Passes On{ "code_name": "Reporting Capability Gap", "definition": "Participant cannot get the reports or data they need from their current system in a timely or self-serve way", "inclusion_criteria": "Mentions inability to pull reports, manual compilation, or IT dependence for routine data extraction", "exclusion_criteria": "If issue is about data entry or duplication, code as 'Data Entry Burden' instead", "adjacency_tests": [{ "neighbor_theme": "Data Entry Burden", "disambiguation_test": "If the complaint is about getting data OUT, this code; if about putting data IN, Data Entry Burden" }], "source_questions": ["Q5","Q7","Q11","Q15"], "examples": [{ "text": "I can't pull anything on demand", "reasoning": "Direct statement of inability to extract reports" }], "boundary_examples": [{ "text": "We re-enter the same data three times", "belongs_to_theme": "Data Entry Burden", "reasoning": "About input friction, not reporting" }] } ... N more codes (data decides N) + decisions_log entries

2026 enhancement — six-pass reasoning for dual-coder reliability. The prompt now requires a six-pass reasoning process instead of four: (1) Landscape analysis, (2) Parsimony draft, (3) Distinction critique, (4) Adjacency disambiguation — for every code, identify the 2 nearest neighbor codes and write a one-sentence disambiguation test per neighbor, (5) Parsimony challenge — a second parsimony walk-through to catch anything the first pass missed, (6) Commit final codebook. The enhancement also bans "and"/"&" in code names (conflated names like "Reporting and Analytics Capability Gap" were the dominant source of Phase 2 disagreement in prior studies — Coder A locked onto the first concept, Coder B locked onto the second), and replaces generic negative examples with boundary examples — quotes that look like they should match but actually belong to a specific, named adjacent code. Projected effect: 15–25 percentage point reduction in Phase 2 dual-coder disagreement, saving $100–200 per study in arbiter API calls.

Why this step stays on the Anthropic API (not conversation mode): The enhanced output is roughly 60–75K tokens of structured JSON — themes with definitions, inclusion/exclusion criteria, adjacency tests, positive examples, and boundary examples, plus the decisions log and landscape analysis. This exceeds what can be produced reliably in a single sustained conversation session. Running it via run_codebook.py on the API produces it in one validated call; conversation-mode attempts risk dropped fields or broken schemas in the single most-critical file in the entire pipeline.

Why a single Opus agent (vs. the prior dual + reconciler architecture): The previous design used two Sonnet agents (parsimony + distinction) running in parallel and an Opus reconciler making the final calls. That architecture cost three API calls for one decision and added an integration step that could itself introduce errors. A single Opus call with internal parsimony pass + distinction critique chain-of-thought captures the bulk of the divergence-then-reconcile benefit at a fraction of the cost and complexity. Opus is strong enough to hold both lenses simultaneously inside one extended reasoning pass.

Why Opus: Codebook construction is the single highest-leverage decision in the pipeline. Theme names coined here propagate into every downstream code assignment by every application coder. Expert synthesis reasoning across all questions simultaneously requires the strongest available model. Opus scores 17 points higher than Sonnet on expert reasoning benchmarks (GPQA Diamond) at 1.67x the cost — a worthwhile trade for the one call that determines the entire instrument.

Why temperature 0.2: Low enough that the agent is consistent and deterministic on routine merge/split calls. Just high enough that the parsimony-vs-distinction tension produces meaningful internal deliberation rather than collapsing to the first plausible grouping.

Output → A6 Data Files - Simulated/

codebook-audit-trail.csv — architect's non-obvious merge and split decisions with reasoning
codebook-landscape-analysis.txt — architect's chain-of-thought landscape analysis

1.6

Validator + Dimension Architect + Human Review

[CONVERSATION MODE] Two reasoning passes run in conversation, then human approval

After run_codebook.py commits the global codebook, Step 1.6 runs in the Claude Code conversation window (not as a Python+API subprocess). Claude Code performs the per-question validator pass and the dimension architect pass sequentially, assembles the final codebook.json, and pauses for human review. Moving this step into conversation mode eliminates its API cost entirely.

The two reasoning passes Claude Code executes in conversation:

Per-Question Validator (Phase 5)

Keyvalidator

Modelclaude-sonnet-4-20250514

Temperature0

Max tokens16,000

Persona"You are a qualitative methodologist applying a global codebook to a specific interview question. Your job is to identify which global themes are relevant to this question's responses, add question-specific coding notes where needed, and flag any patterns the global codebook does not cover. Be precise: only mark a theme as applicable if it genuinely appears in the sample responses."

Prompt{persona}

TASK: Apply the global codebook to this specific interview question.

The global codebook was built from ALL thematic questions in this study. Your job is to:
1. Identify which global themes actually appear in responses to THIS question
2. Add question-specific coding notes where the global definition needs clarification
3. Flag any patterns in this question's responses that the global codebook doesn't cover

QUESTION: "{question_text}"
TOTAL PARTICIPANTS: {total_participants}

GLOBAL CODEBOOK (study-wide canonical themes):
{themes_text — name, definition, include/exclude for each theme}

SAMPLE RESPONSES TO THIS QUESTION:
{sample_text — 15 random participant responses}

INSTRUCTIONS:

1. APPLICABLE THEMES: Code each sample response against the global codebook. List every theme that appears in at least one sample response.

2. QUESTION-SPECIFIC NOTES: For applicable themes, add a note if the global definition needs clarification for this question's framing. Example: the global theme "Ease of Use" has a positive definition, but in a frustrations question, it will always appear as an absence — note this for coders.

3. COVERAGE GAPS: Are there patterns in these sample responses that no global theme captures? If so, describe them.

4. QUESTION-SPECIFIC ADDITIONS: For any substantial uncovered patterns, propose a new theme entry to be added to this question's codebook only. Only create additions for genuine patterns, not single outliers.

Output JSON: {"applicable_themes": [...], "coverage_gaps": [...], "question_specific_additions": [...], "validation_notes": "..."}

Dimension Architect (Phase 6)

Keydimension_architect

Modelclaude-opus-4-6

Temperature0

Max tokens16,000

Persona"You are a senior market research methodologist designing the segmentation variable structure for a B2B interview study. You understand the distinction between basis variables (go into cluster analysis), criterion variables (validate clusters, never enter clustering), and descriptor variables (describe segments after clustering). You apply the N/10 rule for maximum cluster variables, the variance filter (9-91% prevalence for binary themes), and the temporal contamination principle (Module B data is downstream of the tool choice and must not be used as basis variables). You group themes into composite dimensions using conceptual grouping, not statistical methods."

Prompt{persona}

TASK: Design the segmentation dimension structure for this interview study.

You have a completed codebook with {N_themes} global themes and {N_questions} coded questions. Your job is to produce a `dimensions` section that classifies every variable into one of three roles and groups thematic variables into composite dimensions for cluster analysis.

CRITERION VARIABLE (from screener — never enters clustering):
{criterion_text}

QUESTION CONTEXT (researcher-provided temporal and purpose hints):
{context_text}

GLOBAL THEMES (from thematic questions):
{themes_text}

NON-THEMATIC QUESTIONS (categorical, rank_order, binary):
{simple_text}

HARD CONSTRAINT: Maximum {max_dimensions} basis dimensions (derived from N/10 rule: sample_size // 10). This is a ceiling, not a target.

THREE ROLES — classify every variable into exactly one:

BASIS — goes into cluster analysis. Must pass all three tests: (1) Varies across participants (binary: 9-91% prevalence); (2) Variation predicts different purchase behavior; (3) Not redundant with another basis variable.

CRITERION — never enters clustering. Used after clustering to validate that segments predict something useful. TEMPORAL RULE: Module B data (current-state questions) is downstream of the tool choice and must be treated as descriptor, not basis.

DESCRIPTOR — never enters clustering. Used after clustering to describe and communicate each segment.

DIMENSION TYPES: composite_binary (multiple binary theme codes, OR logic), ordinal_encoded (single ordered categorical), binary_field (single yes/no), passthrough (as-is).

INSTRUCTIONS:
Step 1 — Identify the criterion variable from study config.
Step 2 — Identify firmographic basis dimensions (company size, seniority, team size, budget authority).
Step 3 — Group thematic themes from Module A (historical evaluation) into composite basis dimensions. Apply variance filter logic.
Step 4 — Classify Module B themes as descriptor (temporal contamination).
Step 5 — Check basis count against the {max_dimensions} ceiling; remove weakest differentiators if over.
Step 6 — Assign all remaining themes and fields to descriptor dimensions. Nothing may be left unclassified.

Output JSON: {"dimensions": [{name, label, purpose, type, component_themes, source_questions, logic, rationale}], "basis_count", "max_dimensions_applied", "architecture_notes"}

Why Opus for dimension architecture: An error here — classifying a Module B variable as basis, or including a near-zero-variance binary as a basis dimension — will silently distort every downstream segment. The error cannot be recovered by the segmentation pipeline. Opus is used for irreversible, high-stakes structural decisions.

Human Review Priorities

Every entry in codebook-audit-trail.csv — these are the merge and split decisions the architect flagged as non-obvious, exactly where the instrument is weakest
Codes with participant counts near the minimum threshold — assess whether they are substantial enough to warrant a dedicated code
Non-thematic question structures (categories, buckets, binary criteria) from Step 1.2 — the classification agent's proposals are authoritative but should be verified
Dimension classifications (basis vs. criterion vs. descriptor) — any Module B variable classified as basis is a red flag

The most important check: Read each code definition and test it against 3-5 actual participant responses. If you cannot confidently predict whether a response would receive the code, the definition needs sharpening. A weak definition produces systematic error across the entire dataset.

Output → A6 Data Files - Simulated/

codebook.json — final approved codebook with dimensions section

2.0

Phase 2 Prelude — Pilot Calibration Round

A low-cost dry run on 20 stratified participants before committing to the full Phase 2 coding run. Catches soft definitions before they propagate across the full dataset and become expensive arbiter calls.

2.0

Pilot Calibration

[SCRIPT: run_coding.py on pilot subset] + [CONVERSATION: disagreement review and definition refinement]

Pilot calibration runs the full dual-coder application pipeline against a stratified sample of 20 participants, reviews the disagreements, and refines any code definitions that proved soft in practice — all before committing to the full ~180-participant Phase 2 run where definition problems become expensive.

Why this step exists

Even with the enhanced Step 1.5 prompt (adjacency tests, boundary examples, banned "and" in names), some definitions only reveal their softness when two independent coders apply them to real responses. Catching those in a 20-participant pilot costs roughly 1/9th of the full run. Catching them only after a full run means re-running the arbiter across hundreds of disagreements, or worse, shipping a weaker final dataset.

Procedure

Select 20 stratified pilot participants in conversation. Claude Code reads the transcripts and picks 20 participants stratified by company size, seniority, and any other dimensions that matter for the study. Writes the pilot participant IDs to a pilot manifest.
Run run_coding.py on the pilot subset. The application pipeline runs dual-coder extraction + Kappa + arbiter against only the 20 pilot participants. Cost is roughly $5–8 instead of ~$50 for the full run.
Review disagreements in conversation. Claude Code reads the pilot outputs and flags any code with Kappa < 0.75 or with more than one definition_ambiguous=true flag from the arbiter. These are the codes whose definitions need tightening.
Refine codebook definitions in conversation. For each flagged code, Claude Code proposes a sharpened definition, tighter inclusion/exclusion criteria, or an additional boundary example drawn from the actual pilot disagreements. Forest reviews and approves each change.

What pilot calibration does NOT do: It does not restructure the codebook. No merging, splitting, renaming, or adding new themes. Those decisions belong to Step 1.5 / 1.6, which operated on the full dataset. Pilot calibration only tightens wording for codes that were already committed — it is a last-mile definition refinement, not a second discovery pass.

Expected outcome: In a typical study, 3–8 codes out of ~50 need some definition tightening after the pilot. If more than ~15 codes need tightening, that is a signal the Step 1.5 codebook has deeper problems and should be rebuilt rather than patched.

Output → A6 Data Files - Simulated/

pilot-participants.json — stratified pilot manifest
pilot-reliability.json — Kappa results for the 20-participant subset
codebook.json — updated with refined definitions (version bumped)

Codebook Application

The application phase applies the finalized codebook to all participant transcripts using two independent coding agents, calculates inter-rater reliability, and resolves disagreements. Script: run_coding.py

2.1

Dual-Agent Coding

Two independent agents apply the codebook to every participant

Two independent agents each receive the finalized codebook and all participant responses. Each agent independently processes every participant's responses to every thematic question and produces a participant-level output: for each participant, which codes apply.

Unit of analysis: Agents use meaning unit segmentation internally as a thinking tool, but they report at the participant level: which codes apply to this participant's responses to this question. This is the correct unit of analysis for both Kappa calculation and downstream frequency analysis.

Why Dual Agents

Independent application by two agents replicates the intercoder reliability design from qualitative research (O'Connor & Joffe, 2020). Disagreements between the two agents flag cases where the codebook definition is ambiguous enough to produce different readings — exactly the cases that need a resolver and may warrant codebook refinement.

Agent 1 — Inclusive Coder

Agent Configuration — Agent 1

Keyagent_1

Modelclaude-sonnet-4-20250514

Temperature0 — deterministic application of the codebook

Max tokens4,000 per response

Codebook emphasisinclusion_first — the "INCLUDE when" criteria are presented before the "EXCLUDE when" criteria in the formatted codebook, subtly biasing the reading toward applying the code when the evidence partially matches.

TaskReceives one participant's response to one thematic question at a time along with the full code list for that question. Decides which codes apply to the response and returns a JSON object with per-code reasoning, the list of codes assigned, and a confidence rating.

PersonaYou are a thorough, inclusive qualitative coder. Capture both explicit statements and clearly implied meaning. When evidence partially matches a code definition, lean toward including the code. It is better to over-include than to miss a relevant code.

Prompt (thematic){persona}

You are coding an interview response for the question: "{question_text}"

{multi_code_instruction}

CODEBOOK:
--- CODE: {code_name} ---
Definition: {definition}
INCLUDE when: {inclusion_criteria}
EXCLUDE when: {exclusion_criteria}
How to tell this code apart from its nearest neighbors: {2 adjacency tests}
Positive examples (this code applies): {3 positive examples with reasoning}
Boundary examples (looks like this code but actually belongs elsewhere): {3 boundary examples with reasoning, each naming the adjacent code it belongs to}
...(one block per code)

PARTICIPANT RESPONSE:
"{response_text}"

INSTRUCTIONS:
1. For each code in the codebook, explain whether the participant's response matches the definition.
2. Be specific: quote the exact words from the response that match (or don't match) each code.
3. WORD BOUNDARY: Only match words that appear as complete, standalone words. Never match a word found inside a longer word.
4. NEGATION: Pay attention to negation words. "Not satisfied" means dissatisfied.
5. SARCASM: Watch for sarcastic or ironic statements where context suggests the speaker means the opposite of their literal words.
6. HEDGING: Distinguish between definitive and hedged statements. "Kind of" or "I guess" weaken meaning.
7. ABSENCE: If the participant does not mention a topic, do NOT treat that as evidence for or against any code.
8. CONTEXT: Read the entire response before coding any part of it.
9. Then list all codes that apply.

Output JSON: {"reasoning": {"code_name_1": "..."}, "codes_assigned": ["code_name_1"], "confidence": "high|medium|low"}

Agent 2 — Conservative Coder

Agent Configuration — Agent 2

Keyagent_2

Modelclaude-sonnet-4-20250514

Temperature0.3 — deliberate variation from Agent 1 to generate productive disagreement on ambiguous cases

Max tokens4,000 per response

Codebook emphasisexclusion_first — the "EXCLUDE when" criteria are presented before the "INCLUDE when" criteria, subtly biasing the reading toward rejecting the code unless the evidence clearly meets the definition.

TaskSame structural task as Agent 1 — one participant's response to one thematic question at a time, returns JSON with per-code reasoning and assigned codes. The persona, temperature, and emphasis flip are what make this agent's output diverge from Agent 1 on borderline cases.

PersonaYou are a conservative, precise qualitative coder. Only assign a code when the participant's words clearly and explicitly match the codebook definition. Do not infer or interpret beyond what was said. When in doubt, do not assign the code.

PromptSame template as Agent 1, but the codebook is rendered with EXCLUDE criteria first. All guard rules (word boundary, negation, sarcasm, hedging, absence, context) are identical — the reliability signal must come from the persona, temperature, and emphasis order, not from differences in the rulebook.

Flow: How the Two Coders Produce Disagreement

Input

One participant's response to one thematic question plus the full code list for that question. Both agents receive the same evidence and the same codebook.

ExampleQ7: "What frustrates you about your current HR system?" P23: "Reporting is kind of a nightmare — we spend hours pulling data for board meetings."

→

What the Agents Do

Each agent independently walks through every code, reasoning about whether the response matches. Agent 1 reads inclusion criteria first and leans toward including. Agent 2 reads exclusion criteria first and leans toward rejecting. Both follow the same guard rules.

DivergenceAgent 1: "reporting_analytics_gap" (kind of weakens it but P23 describes hours of manual work — clearly matches) Agent 2: No code (P23 said "kind of" — hedged; not a clear match)

→

Output → Step 2.2 / 2.3

One JSON record per agent per participant per question. Cases where both agents agree are treated as reliable and pass through. Cases where they disagree are flagged for the arbiter in Step 2.3.

Agreement PatternAgreement on most cases → kappa > 0.81 Disagreement on ambiguous cases → sent to arbiter

Why Sonnet (Not Opus) for the Two Coders

Application coding is the highest-volume phase in the pipeline. A typical study is ~180 participants × ~15 thematic questions × 2 agents = ~5,400 coding calls, plus the arbiter calls on disagreements. Running all of that on Opus would cost roughly 5× more and slow the wall-clock significantly. That cost would be hard to justify for a task where Sonnet is already strong.

Application coding is fundamentally a pattern-matching problem, not a novel reasoning problem. The codebook already exists. The inclusion and exclusion criteria, the definition, the adjacency tests, and the three positive and three boundary examples — all the hard thinking has been done upstream by the Opus codebook architect in Step 1.5. The coder's job is to read a participant's words and decide whether they match the criteria. Sonnet's accuracy on that task is close to Opus's, and the HR Leaders BambooHR simulated run hit a weighted κ = 0.909 with this exact setup, well above the Landis & Koch "almost perfect" threshold of 0.81 and well above the client-deliverable threshold set at κ ≥ 0.65. The four codes that fell below threshold in that run were definition problems in the codebook, not coder errors — Opus coders would not have rescued them.

The productive signal in dual coding is disagreement — the arbiter cannot do its job if both coders agree on everything. That disagreement is generated by prompting Agent 1 as inclusive (temperature 0) and Agent 2 as conservative (temperature 0.3), and by flipping the order in which the codebook inclusion and exclusion criteria are presented. Switching one or both agents to Opus would not create more useful disagreement; it would just make both agents more confident. The goal is two coherent but different coding philosophies, not two different raw IQs.

The one failure mode this architecture does not catch is when both Sonnet coders confidently agree on the wrong answer. No disagreement means no arbiter trigger. That risk is a codebook-quality problem, not a model-choice problem — if the definition and the positive and negative examples are sharp enough, two independent Sonnet agents with different temperatures and different emphasis orders will almost always diverge on ambiguous cases. The fix for that failure mode is better definitions in Step 1.5, which is why the codebook architect is required to provide adjacency tests plus exactly 3 positive and 3 boundary examples per code.

Concurrency: Both coding agents run in parallel using a semaphore-controlled thread pool (18 concurrent workers). The API output token rate limit is the binding constraint, not compute.

Output → A6 Data Files - Simulated/

agent-1-codes.json, agent-2-codes.json

2.2

Kappa Quality Gate

Inter-rater reliability calculated at the participant × code level

After both agents have coded all participants, Cohen's Kappa is calculated per code and as an overall weighted average at the participant × code level.

Why participant × code, not meaning unit × code: Our downstream analysis asks "what percentage of participants expressed theme X" — a participant-level question. Kappa at the meaning unit level would be influenced by segmentation variability between agents. Participant-level Kappa measures what actually matters.

0.81+

Almost Perfect

0.61–0.80

Substantial

0.41–0.60

Moderate

Below 0.65

Flag for Review

Source: Landis & Koch, 1977

Benchmark: HR Leaders BambooHR achieved overall weighted Kappa of 0.909. The four codes that fell below threshold were definition problems, not model capability problems — Opus would not have fixed them.

On the independence limitation: Two LLM agents running the same underlying model are not independent in the same way two human coders are. Kappa scores are valid as internal quality metrics but should be noted as LLM-to-LLM agreement if presented to a research-savvy client.

Output → A6 Data Files - Simulated/

reliability.txt, reliability-summary.json, flagged-items.json

2.3

Disagreement Resolution

A third agent resolves every contested coding decision

For any participant × code combination where the two agents disagreed, a third resolver agent reviews both agents' reasoning alongside the participant's actual transcript and the codebook definition, and makes a final determination.

The resolver records its reasoning in the output file alongside the final code assignment. This creates an audit trail for every contested coding decision.

Agent 3 — Neutral Arbiter

Agent Configuration — Agent 3

Keyagent_3

Modelclaude-opus-4-6

Temperature0 — deterministic; this is a judgment call, not a creative task

Max tokens4,000 per resolution

Codebook emphasisbalanced — inclusion and exclusion criteria shown in neutral order so the arbiter reads the rulebook straight rather than with either coder's bias

TaskReceives the participant's exact response, the codebook definition and criteria, and both coders' code assignments with their reasoning. Decides which codes are correct based strictly on the codebook, notes which coder's reading was favored, and flags whether the definition itself is ambiguous (both readings reasonable).

PersonaYou are a neutral arbiter resolving a coding disagreement. Review the participant's words, the codebook definition, and both coders' reasoning. Decide strictly based on whether the evidence meets the codebook definition. Do not favor either coder.

Prompt{persona}

Two independent coders have coded the same interview response and disagree. Your task is to determine the correct coding based strictly on the codebook definition.

QUESTION: "{question_text}"

PARTICIPANT RESPONSE:
"{response_text}"

CODEBOOK: {full code list with definitions, include/exclude criteria, adjacency tests, positive and boundary examples}

CODER A's ASSESSMENT:
Codes assigned: {agent_1 codes}
Reasoning: {agent_1 per-code reasoning}

CODER B's ASSESSMENT:
Codes assigned: {agent_2 codes}
Reasoning: {agent_2 per-code reasoning}

INSTRUCTIONS:
1. Review the participant's exact words.
2. Review the codebook definition, inclusion criteria, and exclusion criteria.
3. Evaluate each coder's reasoning.
4. WORD BOUNDARY: Only match complete, standalone words.
5. NEGATION: Identify the complete negated phrase before coding.
6. SARCASM: Code the intended meaning, not the literal words.
7. HEDGING: Code the actual strength of the statement.
8. ABSENCE: Silence on a topic is not data.
9. CONTEXT: Read the entire response before coding any part of it.
10. Determine the correct codes based on the codebook definition.
11. If the codebook definition is ambiguous (both coders' interpretations are reasonable), flag it.

Output JSON: {"reasoning": "...", "codes_assigned": ["code_name"], "favored_coder": "A|B|neither", "definition_ambiguous": true|false, "ambiguity_note": "..."}

Flow: How the Arbiter Handles Information

Input

Only the cases where Agent 1 and Agent 2 disagreed. The arbiter sees the original participant response, the relevant codebook entries, and both coders' reasoning side by side.

ExampleDisagreement on P23, Q7, code "reporting_analytics_gap": A: assigned (inclusive reading of "kind of a nightmare") B: not assigned (hedging rule — "kind of" weakens)

→

What the Arbiter Does

Re-reads the response against the codebook definition with neutral bias. Weighs the hedging guard rule against the strength of the concrete evidence ("we spend hours pulling data for board meetings"). Decides based on the rulebook and notes whether the disagreement revealed an ambiguous definition.

Reasoning"'Kind of a nightmare' is hedged, but the participant then provides concrete behavioral evidence (hours pulling data) that clearly meets the inclusion criteria. Code applies. Coder A's reading was correct. Definition is not ambiguous."

→

Output → application-coding-detail.json

Final code assignment for every contested case with full arbiter reasoning, favored-coder flag, and ambiguity flag. Cases flagged as ambiguous feed back into codebook refinement.

Record{ "segment_id": "P23_Q7", "codes_assigned": ["reporting_analytics_gap"], "favored_coder": "A", "definition_ambiguous": false }

Why Opus (Not Sonnet) for the Arbiter

Disagreements are where the hard calls cluster. By definition, the arbiter only sees cases where two reasonable coders looked at the same evidence and came to different conclusions — the edge cases involving sarcasm, hedged language, compound statements, and borderline definition fits. This is exactly the shape of task where Opus's reasoning advantage is largest, because the decision requires weighing competing considerations rather than pattern-matching to a clear example.

The arbiter is also low-volume and high-leverage. If the codebook is clean, dual-coder agreement on most items runs 80–90%, meaning the arbiter fires on only the remaining 10–20%. On a 5,400-call study, that is roughly 540–1,080 Opus calls — a rounding error in cost — yet each of those calls directly determines a final code assignment that enters the participant database. Spending more per call on the highest-stakes decisions is exactly the right place to put the token budget.

This mirrors the architecture used on the discovery side. The Step 1.5 codebook architect is Opus because its decisions propagate everywhere downstream. The arbiter is the mirror image on the application side: a small number of decisions that disproportionately determine final output quality. Running the arbiter on Sonnet would save almost nothing and would risk propagating coder-level confusion into the final dataset on exactly the cases that matter most.

The arbiter also carries a second responsibility beyond code assignment: it flags whether the underlying definition was ambiguous. Those flags feed back into codebook refinement and are the primary signal for whether a code needs rewording before the next study. That meta-judgment — "is this disagreement about the evidence, or about the rulebook?" — requires reasoning about the codebook itself, not just applying it.

Output → A6 Data Files - Simulated/

application-coding-detail.json — full per-agent detail with resolver notes coding-summary.md — run summary with participant counts and quality metrics

Dataset Assembly

Coded data is assembled into a flat participant-by-code matrix — the single source of truth for all analysis and reporting downstream. Scripts: build-master-dataset.py and build-frequency-report.py

3.1

Master Participant Dataset

One row per participant — grows as analysis progresses

Takes final_codes.json, codebook.json, and roster.json and assembles a flat CSV with one row per participant.

Column Structure (in order)

participant_id
All roster variables (company size, industry, current tool, seniority, etc.)
All non-thematic coded variables as individual columns (categorical, ordinal, binary)
For each thematic question × each code: {qid}_{code_name_snake_case} = 0 or 1
Segment assignment (Run A and Run B) appended after MCA + Ward clustering

Why question × code columns: A participant who mentioned "reporting analytics gaps" in a frustrations question is analytically different from one who mentioned it in an evaluation criteria question. Storing Q5_reporting_analytics_gap and Q8_reporting_analytics_gap as separate columns preserves that distinction. Collapsing to a single column per code loses it permanently.

With 140 codes across 5 thematic questions, this produces approximately 700 binary columns. This is correct — do not collapse them.

Output → A6 Data Files - Simulated/

master-participants.csv — one row per participant, all coded variables + roster

3.2

Frequency Report

First analytical view of the data — used before writing any client section

Reads master-participants.csv and codebook.json and produces code frequency tables and cross-tabs by firmographic variable.

What to Look For

The 5-8 most prevalent themes overall — these anchor the narrative
Themes with the most striking cross-tab differences by firmographic or segment variable — these become the report's key analytical findings
Themes that appear primarily in one question context vs. across multiple questions (pervasive vs. situational)

Output → A6 Data Files - Simulated/

frequency-report.html — sortable tables, bar charts, collapsible cross-tab sections frequency-report.csv — machine-readable flat version

Segmentation

Basis variables are identified and validated, block-wise specific MCA compresses the binary code set into interpretable dimensions, and Ward's hierarchical clustering with bootstrap stability testing discovers natural participant groups. Scripts: segment-prep.py and run-segmentation.py

4.1

Basis Variable Preparation (Prevalence + Yule's Q + Blocking + Transforms)

Preparing the binary code set for block-wise MCA under an additive-only data protocol

The segment-prep.py script reads final_codes.json and the dimensions section of codebook.json, classifies every variable by Wedel & Kamakura (2000) role, then applies two filters to the binary basis set and groups the survivors into blocks so run-segmentation.py can run specific MCA block by block in Step 4.2.

Step 1 — Classify every variable (Wedel & Kamakura, 2000)

Basis variables — enter cluster analysis. Must vary meaningfully across participants, theoretically predict different buyer behavior, and not be an outcome of the tool choice.
Criterion variables — do NOT enter cluster analysis. Current tool adoption, evaluation status, willingness to pay. Used after clustering to validate that segments predict something useful. Putting criterion variables into clustering would build the answer into the question.
Descriptor variables — do NOT enter cluster analysis. Used after clustering to describe each segment in client-facing language (titles, industries, verbatims, observable identifiers).

The Module A/B temporal rule: Module A questions (historical evaluation triggers, criteria, rejection reasons) are basis-variable candidates because they capture the state of mind that preceded the purchase. Module B questions (current satisfaction, current frustrations, current feature usage) are descriptor by default — they are downstream of the tool choice and contaminated by reverse causality. An active Asana user who complains about Asana's timelines is a different person from an active Asana user who doesn't, but both differences are caused by the tool, not by the underlying buyer need.

Step 2 — Symmetric 9-91% prevalence filter

Binary basis variables with prevalence outside the symmetric 9-91% band are dropped from the clustering input. A code endorsed by 95% or 5% of participants carries almost no discriminating power and will dominate the first MCA dimension as noise. The filter is symmetric on purpose: dropping a 92%-prevalence code is as important as dropping an 8%-prevalence one, because near-universal codes create shared-presence artifacts that mirror the shared-absence artifacts at the other end. This matches the Le Roux & Rouanet (2010) guidance for MCA inputs and Dolnicar et al. (2018) on unstable low-prevalence binary indicators.

Dropped variables are logged in segmentation-validation-report.txt with their prevalence and the reason for exclusion, so the researcher can confirm that nothing important was removed.

Step 3 — Yule's Q review within each block

For every pair of surviving binary basis variables within the same conceptual block, the script computes Yule's Q (Yule, 1900; Agresti, 2013, ch. 2.4):

Q = (ad - bc) / (ad + bc) where a=both positive, b=first only, c=second only, d=both negative.

Yule's Q ranges from -1 to +1 regardless of marginal frequencies. This is the key property for sparse interview-coded data. Phi (the Pearson correlation between two binary variables) is artificially bounded by the base rates: two 10%-prevalence codes can be nearly perfectly associated and still only reach phi ≈ 0.5, which makes phi thresholds impossible to interpret consistently across variables with different marginals. Yule's Q does not have this problem.

Pairs with |Q| > 0.85 are flagged for manual review. The script does not automatically merge — the researcher decides: merge the pair into a single composite code, drop the weaker one, or declare the weaker one a supplementary variable that contributes to the MCA projection but not to the dimension definition. This replaces the earlier "80% raw overlap" rule, which had no research grounding and misbehaved under unequal base rates.

Step 4 — Block grouping

Basis variables are grouped into conceptual blocks so Step 4.2 can run specific MCA block by block. Block assignment follows this precedence: (1) an explicit block field on the dimension, (2) the temporal_layer field, (3) the primary source question. In practice this means pain-point codes form one block, evaluation-criteria codes form another, rejection-reason codes form a third, and so on. Block-wise MCA is the Greenacre (2017) and Husson, Lê & Pagès (2017) recommendation for categorical variables with natural conceptual groupings — it prevents a single block from dominating the global dimensions and makes each dimension interpretable as "how participants vary within this conceptual area."

The mapping is written to segmentation-blocks.json and consumed directly by run-segmentation.py.

N/k rule applies after MCA, not before. The sample_size / 10 heuristic (Wedel & Kamakura, 2000; Dolnicar et al., 2018) constrains the number of dimensions entering the clustering step, not the raw basis count. With N = 180 and 9-12 retained MCA dimensions, the ratio is a comfortable 15-20:1. This is why we no longer cap the pre-MCA basis count — the compression happens during MCA, and the ratio is checked on the compressed output.

Step 5 — Apply declared transforms (second pass)

After reviewing the first-pass validation report, the researcher authors segmentation-transforms.json per study, listing the merges and reclassifications that resolve the Yule's Q flags. The script is re-run with the transforms file as an optional fifth argument, applies the transforms declaratively, and writes the final basis set. Re-running with the same config file is idempotent — same inputs plus same transforms always produce the same outputs.

Two transform types are supported today:

merge_binary_or — creates a new column (prefixed seg_) as the logical OR of two or more existing binary basis components. The new variable goes into the clustering basis; the original components are preserved in segmentation-profile.csv.
reclassify_basis_to_descriptor — removes a variable from the clustering basis but keeps it in segmentation-profile.csv for post-hoc segment description. Used when a Yule's Q flag reveals a structural stayer/switcher contamination or another latent-variable problem that a merge cannot fix.

Every transform entry carries a free-text reason field that is copied verbatim into the provenance log. Nothing about a transform is implicit.

Data protocol: additive-only, provenance-tracked

The segmentation prep stage never modifies the source-of-truth files. roster.json, final_codes.json, and codebook.json are inputs only. Every operation in Step 4.1 is additive or reclassifying. Nothing is deleted from the dataset. This is by design: it prevents accidental contamination of the coded study, and it keeps the second-pass result fully reproducible from the same source files plus the same transforms config.

Prevalence-filtered variables are moved, not removed. Codes outside the 9-91% window are written to segmentation-profile.csv with a role of basis_dropped_by_prevalence_filter.
Reclassified variables are moved, not removed. Codes that a transform pulls out of the basis are written to segmentation-profile.csv with a role of basis_reclassified_to_descriptor_by_transform.
Merged variables carry a seg_ prefix so they are visually distinguishable from codes assigned by the coders in Phase 2. The original components of every merge are preserved in segmentation-profile.csv with a role of basis_merged_into_seg_variable (component preserved).
Every choice is logged in segmentation-provenance.json — prevalence drops with percentages, raw Yule's Q flags, each transform with its declared rationale, the final basis list, the final blocks mapping, and a column_roles dictionary that assigns a role to every column in both CSVs. Any value in either CSV can be traced back to its origin in one lookup.
The complement invariant. segmentation-ready.csv and segmentation-profile.csv are complements over the same participant set. Every variable that ever existed in the flattened dimensions list appears in exactly one of the two files. The provenance file asserts this.

Inputs (source of truth, never modified)

roster.json final_codes.json codebook.json segmentation-transforms.json — hand-authored per study after the first-pass validation report

Output → A6 / B6 Data Files/

segmentation-ready.csv — final basis variables for clustering (complement of profile CSV) segmentation-profile.csv — every variable NOT in the clustering basis (descriptors, criterion, dropped, reclassified, preserved merge components) segmentation-blocks.json — block → [variable names] mapping consumed by Step 4.2 segmentation-validation-report.txt — prevalence log, Yule's Q flags, block grouping segmentation-provenance.json — machine-readable audit trail: drops, transforms, column roles for every CSV column

4.1b

Basis Co-Occurrence Diagnostic [CONVERSATION]

5-minute Yule's Q heatmap between segment-prep and MCA review to catch cross-block duplicates and persona artifacts before clustering

A mandatory sanity check between segment-prep.py and run-mca.py. Read segmentation-ready.csv, compute pairwise Yule's Q across the entire post-transform basis set, and save yule-q-heatmap.png and yule-q-summary.txt to the study's A6 / B6 folder. Review the top 20 pairs by |Q| before running MCA.

This step exists because of a specific failure on the HR Leaders BGR simulated track (April 2026). The within-block-only Yule's Q filter in segment-prep.py missed cross-block duplicates — the same multistate-compliance concept was asked in four different question blocks, each surviving its own block's filter, but they were perfect |Q| = 1.0 cross-block duplicates of each other. Specific MCA then quadruple-counted the same signal, and Ward clustering produced Jaccard bootstrap stability of 0.40 (well below the 0.60 publish threshold). A 5-minute diagnostic would have caught it before any MCA review time was burned. See research/segmentation-failure-learnings.md.

Two failure modes to scan for

Perfect cross-block duplicates (|Q| ≈ 1.0) that the within-block filter missed. Usually the same concept asked in two or three different question blocks. Resolution: add a merge_binary_or or reclassify_basis_to_descriptor entry to segmentation-transforms.json and re-run Step 4.1.
Impossible mutual exclusions within a single block (two common basis codes from the same question with |Q| = −1.0 and zero co-occurrence across 30+ combined participants). On simulated tracks this is a persona artifact from the simulator — document it as a known limitation. On real-data tracks it should never appear; if it does, investigate the coding pipeline.

Healthy signal benchmarks

A healthy basis space typically has mean |Q| in the 0.10-0.25 range with a long tail of strong pairs (|Q| ≥ 0.30) distributed across many variable combinations, not concentrated on a few deterministic pairs. If mean |Q| is near zero, the codes are effectively independent and no clustering method will recover segments — document the null before spending MCA review time on data that cannot be segmented.

Outputs of Step 4.1b

Output → A6 / B6 data folder

yule-q-heatmap.png — symmetric |Q| heatmap over the full basis set yule-q-summary.txt — distribution stats, top-30 pairs by |Q|, bottom-10 near-zero pairs

4.2a

Block-wise MCA Review (analyst-in-the-loop dimension retention)

Running MCA per block, showing eigenvalues and loadings, approving what enters clustering

Dimension retention in MCA is an interpretive decision, not a mechanical one. A "retain 75% of corrected variance" rule will sometimes agree with the analyst and sometimes over- or under-retain. Le Roux & Rouanet (2010, ch. 7), Husson, Lê & Pagès (2017, ch. 3), and Greenacre (2017, ch. 11) all argue the analyst must read the scree, inspect the top contributing variables on each dimension's positive and negative poles, and decide which dimensions carry interpretable structure. That review cannot be automated, so we split it out as its own step.

The run-mca.py script runs block-wise specific MCA with Benzécri correction on every block in segmentation-blocks.json, extracts per-dimension positive and negative pole loadings from the presence ("_1") column coordinates, and produces a review report with a recommendation for every block. The researcher reviews the recommendation, approves or revises it, and Claude writes mca-dimensions.json with the final retention counts plus interpretive labels. Only then does run-segmentation.py run.

What the review report contains

Eigenvalue table per block: raw λ, Benzécri-corrected λ, variance %, cumulative %. The corrected eigenvalue is what matters; raw MCA eigenvalues on indicator data are systematically pessimistic.
ASCII scree bars so the analyst can visually spot the elbow on corrected variance.
Per-dimension interpretation: top five presence-code loadings on the positive pole and top five on the negative pole, with the coordinates shown. This tells the analyst what each dimension actually represents before it enters clustering.
Baked-in recommendation with rationale combining three rules: variance floor (≥ 5% of corrected variance), scree elbow detection (largest relative drop ≥ 40%), and an interpretability filter (trailing dims with no presence loadings above |coord| = 0.10 are trimmed). Cap of 5 dims per block. The recommendation is the starting point; the researcher decides.

Presentation rule — show the eigenvalues and loadings in chat, not just a summary

Hard requirement for the review conversation: when Claude presents the MCA review results for approval, the chat message must display the eigenvalue table (raw λ, Benzécri-corrected λ, variance %, cumulative %) and the top positive-pole and negative-pole variable loadings inline for every retained dimension in every block. A summary table of recommended dimension counts alone is not enough — the researcher cannot validate a retention recommendation without seeing the numbers driving it. The data is already in mca-review.md; it must also appear in the conversation, even when that produces a long message.

Why a mechanical "retain 75% of variance" rule is not the final say

A dimension driven almost entirely by a single binary code — for example, a dimension where one code loads at +2.5 and every other variable sits under ±0.3 — is mathematically a dimension but substantively just a continuous re-expression of that one code. Letting it into clustering re-weights that single code by whatever share of variance the dimension claims. Only human review catches this. The HR Leaders BGR simulated pass flagged exactly this failure mode on the Q15 block, where Dim 2 loaded almost entirely on aggressive_sales_experience; the analyst correctly dropped it.

Outputs of Step 4.2a

Output → A6 / B6 data folder

mca-review.json — machine-readable per-block eigenvalues, variance %, pole loadings, and recommendations mca-review.md — human-readable review report with eigenvalue tables, scree bars, pole listings, recommendations mca-dimensions.json — written after analyst approval; final retention counts + interpretive labels

Without mca-dimensions.json, run-segmentation.py in Step 4.2b refuses to run. The analyst-in-the-loop step is not optional and cannot be skipped by accident.

4.2b

Ward Clustering with Bootstrap Stability

Discovering natural participant groups from approved MCA dimensions

The method we use is block-wise specific Multiple Correspondence Analysis followed by Ward's hierarchical clustering. The run-segmentation.py script re-fits MCA block by block, applies the Benzécri eigenvalue correction, keeps only the dimensions approved in mca-dimensions.json (from Step 4.2a), standardizes them, then runs Ward's minimum-variance linkage across candidate values of k. The best solution is selected using bootstrap stability combined with silhouette score. The entire pipeline is executed twice — once without and once with log(company size) as a basis variable — so the question "should firmographics define segments?" is answered empirically rather than editorially.

Why MCA + Ward is the correct method for this data

Interview-coded data has three properties that rule out naive approaches: (1) the basis set is almost entirely binary code presence/absence; (2) there are 60-140 codes across N = 150-250 participants; (3) the codes cluster into conceptual groups (pain points, evaluation criteria, rejection reasons) that should not be mixed together during dimension extraction. Block-wise specific MCA is the textbook match for data with all three properties (Le Roux & Rouanet, 2010; Greenacre, 2017; Husson, Lê & Pagès, 2017). Ward's linkage (Ward, 1963) is deterministic, interpretable via the dendrogram, and the natural partner for Euclidean distance on standardized MCA scores.

Methods we considered and rejected:

k-means on raw binary codes, or k-means on PCA of the indicator matrix: Rejected. k-means assumes continuous Euclidean space; PCA on binary data distorts the geometry; the combination produces unstable segments that shift dramatically on tiny resamples. This is the failure mode Greenacre (2017, ch. 8) uses as the opening motivation for MCA.
Latent Class Analysis: Reasonable for pure binary data, but requires a model selection step (BIC-based k) that is less transparent to clients than Ward's dendrogram, and struggles with the high dimensionality relative to N. LCA is the right alternative when the basis set is small (under ~20 variables); for our typical basis width it does not compete with MCA + Ward on stability.
Gower + PAM: Rejected as the default. Gower handles mixed types but leaves the basis set at its raw width, which makes the clustering fragile at our typical N, and the resulting segments cannot be described in terms of dimensions — only in terms of individual codes. Gower + PAM remains the correct fallback when the basis set is genuinely mixed-type and mostly non-binary (see framework notes § 6.9).

Block-wise MCA compresses 60-140 binary codes into 9-12 interpretable axes that Ward can cluster cleanly, and the bootstrap tests the whole compression-plus-clustering pipeline end to end.

Pipeline steps in detail

Block-wise specific MCA (Le Roux & Rouanet, 2010; Greenacre, 2017): Each conceptual block of binary codes is re-fit through its own MCA. "Specific" means dimensions are interpreted on the presence ("_1") categories, so they reflect what participants actually said. The number of dimensions retained per block is the count approved by the analyst in Step 4.2a and recorded in mca-dimensions.json — run-segmentation.py refuses to run without that file.
Benzécri eigenvalue correction (Benzécri, 1979; Greenacre, 1984): Raw MCA eigenvalues on indicator matrices are systematically pessimistic — the "variance explained" numbers look terrible even when the structure is strong. We apply λ_corr = (J / (J − 1))² × (λ_raw − 1/J)² where J is the number of active variables in the block. This is not optional in modern MCA practice; raw eigenvalues will make you throw away real structure.
Standardization: Retained dimensions from all blocks are concatenated and z-standardized so no single block dominates Ward's distance calculations purely because it has more active variables.
Ward's hierarchical clustering (Ward, 1963; Murtagh & Legendre, 2014): Minimum-variance linkage on the standardized MCA scores, cut at k = 2 through max_k (default 6). Ward is deterministic (same input → same output every time), produces a real dendrogram for interpretability, and is the pairing Husson, Lê & Pagès (2017) specifically recommend for MCA outputs. This pairing is pre-committed — we do not try multiple algorithms and pick whichever produces the prettiest answer.
Dual-run company size comparison: Run A uses only psychographic blocks; Run B adds log(company_size) as a standalone standardized variable alongside the MCA dimensions. We compare the two solutions with the Adjusted Rand Index (Hubert & Arabie, 1985). Decision rule: if Run A's segments already track company-size tiers at ARI > 0.60, size is redundant and we ship Run A. If ARI(A, B) > 0.80, the two solutions are essentially the same and we ship Run A for parsimony. Otherwise Run B is adding real structure and we ship Run B.
Bootstrap stability on the FULL pipeline (Hennig, 2007; Dolnicar et al., 2018): For each candidate k, draw 200 bootstrap resamples of the participants and re-run the entire pipeline per iteration — MCA refit, Benzécri correction, standardization, Ward cut — then measure the mean Jaccard similarity between each bootstrap's segment assignments and the original. This is the critical correctness point: if we only re-clustered on fixed dimensions, we would be hiding the instability of the MCA step. Hennig (2007) shows that Jaccard-based resampling is the correct non-parametric stability measure for hard clusterings.
k selection (stability-first, not silhouette-first): Prefer k values where mean Jaccard ≥ 0.75. Among stable solutions, pick the highest silhouette. If nothing crosses 0.75, fall back on the most stable solution and flag it in the client-facing writeup. This order matters: silhouette on its own will happily pick a high-separation solution that is an artifact of this one sample, and that is the error we spent the whole pipeline trying to avoid.
Segment profiling on the ORIGINAL codes, not on MCA dimensions: The MCA dimensions are a means, not the deliverable. Once segments are assigned, we cross-tab them against the original binary basis variables and report relative risk ratios so the client sees "Segment 2 is 3.1× more likely to mention reporting gaps," not "Segment 2 has a high score on MCA dimension 4 of block 2."

Stability thresholds (Hennig, 2007; Dolnicar et al., 2018): Jaccard ≥ 0.75 = stable, present with confidence. 0.60-0.75 = patterns worth investigating, include caveats. < 0.60 = effectively noise, consolidate to a simpler k or rework the basis variables.

Why B = 200 bootstrap resamples? B controls the precision of the mean Jaccard estimate via SE = σ / √B. It does not depend on sample size or number of variables. Hennig's clusterboot defaults to B = 100; Dolnicar et al. (2018) explicitly recommend ≥ 200 for published results; Efron & Tibshirani (1993) place B = 50-200 as the defensible range for standard-error estimation. At B = 200, the 95% band around our mean Jaccard is roughly ±0.02 — an order of magnitude tighter than the 0.75 / 0.60 decision thresholds. Going higher is diminishing returns. Full justification in research/segmentation-framework-notes.md § 6.6.

Output → A6 Data Files - Simulated/

segmentation-assignments.csv — participant_id, run_a_segment, run_b_segment, selected_segment, selected_run segmentation-run-a-dimensions.csv — MCA dimension scores for Run A segmentation-run-b-dimensions.csv — MCA dimension scores for Run B segmentation-report.txt — Benzécri variance, silhouette, bootstrap stability, ARI comparison, segment sizes master-participants.csv — selected_segment appended

4.3

Segment Validation, Profiling, and Description

Confirming the solution is real, then describing it on original codes

Before any segment is described to a client, the solution from Step 4.2 must pass three gates: statistical stability, criterion-variable validation, and Kotler's five strategic criteria. Only solutions that clear all three become part of the report.

Gate 1 — Statistical stability (from the run-segmentation outputs)

Read segmentation-report.txt and confirm:

Bootstrap Jaccard ≥ 0.75 on the selected k for the shipped run (A or B). Between 0.60 and 0.75 is presentable with caveats; below 0.60 means the solution is not real and must not be shown as segments — fall back to a simpler k or rework the basis set.
Adjusted Rand Index between Run A and Run B has been reviewed and the decision rule from Step 4.2 was applied consistently (ARI with size-tier > 0.60 → ship A; ARI(A, B) > 0.80 → ship A; otherwise → ship B). Whichever run was shipped, the other run's assignments stay in master-participants.csv as a sensitivity check.
Silhouette was used to break ties among stable solutions, not as the primary selection criterion. Any solution where stability was ignored in favor of silhouette must be flagged and re-reviewed.

Gate 2 — Criterion variable validation (the "does it predict anything" test)

Cross-tab the selected segments against every criterion variable from segmentation-profile.csv — current tool adoption, active evaluation status, willingness to pay, renewal intent, and so on. Criterion variables were deliberately held out of the clustering step (Step 4.1), so this is an honest out-of-sample test of whether the segments have behavioral meaning.

At least one criterion variable must show a meaningful spread across segments. A 3× or larger relative risk ratio on any criterion variable is the minimum bar for calling the solution validated. If every criterion variable is flat across segments, the clustering found statistical structure that has no behavioral consequence, and the solution is not reportable as a segmentation.
Lead with relative risk ratios, not raw percentages. "Segment A is 3.8× more likely to be actively evaluating alternatives than the overall sample" requires no external market estimate, is robust to oversampling, and translates directly into sales prioritization (Dolnicar et al., 2018). Include raw percentages alongside for transparency.

Gate 3 — Kotler's five strategic criteria (Kotler & Keller, 2016)

Even a statistically stable, behaviorally predictive solution must clear five strategic tests before it is worth presenting:

Measurable — segment size and characteristics can be quantified. Segments defined purely by latent attitudes with no way to measure prevalence in the broader market fail here.
Substantial — each segment must represent at least 8-10% of the sample. Smaller segments are not actionable at a consulting budget level, and our N is not large enough to support them reliably.
Accessible — sales and marketing must be able to identify segment membership without re-running the research interview. This is where observable identifiers (below) come in.
Differentiable — segments must respond differently to the marketing mix. Two segments with the same pain points, the same evaluation criteria, and the same tool affinity are one segment.
Actionable — a distinct program (messaging, channel, offer, product roadmap priority) can be designed for each segment. A segmentation that produces no decision is a taxonomy exercise, not a strategic deliverable.

Profiling segments — on the original codes, not on MCA dimensions

Segment profiling is done by cross-tabbing the shipped segment assignment against the original binary basis variables (plus all descriptor variables). The MCA dimensions were compression intermediates — useful for building reproducible segments, useless for describing them to a client. The deliverable says "Segment 2 mentions reporting gaps 3.1× more often than the overall sample," not "Segment 2 scores high on block-2 dimension 4." Each segment's profile sheet includes:

Top 10 codes by relative risk ratio vs. the overall sample (the segment's defining pains, criteria, and rejection reasons)
Descriptor spread (titles, seniority, company size distribution, industry mix, current tool share)
Criterion variable spread (the Gate 2 evidence, expressed as relative risk ratios)
3-5 representative verbatim quotes drawn from participants closest to the segment centroid in MCA space

Observable identifiers (the accessibility test made concrete)

For each segment, identify 2-3 observable proxies a sales rep can assess without a full research interview — roughly in ascending order of observation cost:

Company size (LinkedIn, ZoomInfo, annual report)
Seniority and functional title (LinkedIn)
Industry and sub-industry (LinkedIn, company website)
Tech stack (G2, BuiltWith, job postings mentioning specific tools)
Buying signals (recent funding announcements, job postings complaining about the current tool, public roadmap posts)

If a segment cannot be identified from any combination of these — if membership truly requires a 45-minute research interview to determine — the segment has failed the accessibility test and must be either merged with a neighbor or demoted from "segment" to "persona note" in the final deliverable.

File	Created by	Contents
study-context.json	Discovery 1.1	Who was interviewed, communication patterns, dominant topics
questions-registry.csv	Discovery 1.2	One row per question with coding type and code count
meaning-units-log.csv	Discovery 1.3	All meaning units with exact quotes and descriptive labels
codebook-audit-trail.csv	Discovery 1.5	Architect's non-obvious merge and split decisions with reasoning
codebook-landscape-analysis.txt	Discovery 1.5	Architect's chain-of-thought landscape analysis
codebook.json	After human review	Final approved codebook with all code definitions, criteria, and examples
agent-registry.json	Discovery + Application	Full record of every agent used: model, temperature, persona hash, role, run date
agent-1-codes.json	Application 2.1	Raw coding output from Agent 1
agent-2-codes.json	Application 2.1	Raw coding output from Agent 2
application-coding-detail.json	Application 2.3	Full per-agent detail with resolver notes for every contested decision
reliability.txt	Application 2.2	Human-readable Kappa report
reliability-summary.json	Application 2.2	Machine-readable Kappa data per code
flagged-items.json	Application 2.2	Codes with Kappa below 0.65 and ambiguous definitions
coding-summary.md	Application end	Run summary with participant counts and quality metrics
master-participants.csv	Dataset 3.1 (grows)	One row per participant, all coded variables + roster, grows with factor scores and segment assignments
frequency-report.html	Dataset 3.2	Sortable tables, bar charts, collapsible cross-tab sections by firmographic variable
frequency-report.csv	Dataset 3.2	Machine-readable frequencies
segmentation-ready.csv	Segmentation 4.1	Final basis variables for clustering — complement of profile CSV over the same participants (A6 folder)
segmentation-profile.csv	Segmentation 4.1	Every variable NOT in the clustering basis: descriptors, criterion, prevalence-dropped, reclassified, preserved merge components (A6 folder)
segmentation-blocks.json	Segmentation 4.1	Block-to-variable mapping consumed by run-segmentation.py (A6 folder)
segmentation-transforms.json	Segmentation 4.1	Hand-authored per study — declares the merges and basis→descriptor reclassifications applied after the first-pass validation review (A6 folder)
segmentation-provenance.json	Segmentation 4.1	Machine-written audit trail: prevalence drops, Yule's Q flags, transforms applied, column roles for every CSV column (A6 folder)
segmentation-validation-report.txt	Segmentation 4.1	Human-readable prevalence log and Yule's Q review (A6 folder)
yule-q-heatmap.png	Segmentation 4.1b	Symmetric \|Yule's Q\| heatmap across the full post-transform basis set; scanned for cross-block duplicates and impossible within-block mutual exclusions before MCA review (A6 folder)
yule-q-summary.txt	Segmentation 4.1b	Distribution stats for \|Q\| across all basis pairs, top-30 pairs by \|Q\|, bottom-10 near-zero pairs — diagnostic companion to the heatmap (A6 folder)
mca-review.json	Segmentation 4.2a	Machine-readable block-wise MCA results: raw and Benzécri-corrected eigenvalues, variance %, per-dimension positive/negative pole loadings, recommended retention per block (A6 folder)
mca-review.md	Segmentation 4.2a	Human-readable MCA review with eigenvalue tables, ASCII scree bars, dimension interpretations, per-block recommendations. Eigenvalues and pole loadings must be presented inline in chat for analyst approval (A6 folder)
mca-dimensions.json	Segmentation 4.2a	Analyst-approved per-block retention counts + interpretive labels. Consumed by run-segmentation.py; missing file → Step 4.2b refuses to run (A6 folder)
segmentation-assignments.csv	Segmentation 4.2b	participant_id, run_a_segment, run_b_segment, selected_segment, selected_run (A6 folder)
segmentation-run-a-dimensions.csv	Segmentation 4.2b	Benzécri-corrected MCA dimension scores from the psychographic-only run (A6 folder)
segmentation-run-b-dimensions.csv	Segmentation 4.2b	MCA dimension scores from the run that includes log(company size) (A6 folder)
segmentation-report.txt	Segmentation 4.2b	Benzécri variance, silhouette, bootstrap Jaccard stability, ARI comparison, segment sizes (A6 folder)

Interview Study Workflow

Six phases, from blank canvas to deliverable

Study Design

Codebook Discovery

Codebook Application

Dataset Assembly

Segmentation

Analysis and Reporting

A hybrid of three validated frameworks

Directed Content Analysis

Framework Analysis

Qualitative Content Analysis

Multi-Agent LLM Coding

Study Design

Interview Guide Design

Three Question Modules

JTBD Framing for Module A

Participant Simulation

Configure Parameters Before Launching

Verboseness Distribution — Where These Numbers Come From

Agent Configuration

Why Sonnet (Not Haiku or Opus)

Post-Generation Validation

Maze Transcript Processing

What the Script Does

Codebook Discovery

Study Context Generation

Four Sections of the Context Document

Question Classification

Part 1a — Type Determination (5-participant sample per question)

Why Opus (Not Sonnet) for Question Type Classification

Part 1b — Structural Definition (all participants, non-thematic questions only)

Why One Agent (Not Two) for Classification

Meaning Unit Extraction

Methodological Grounding for the Guard Rules

Why One Agent (Not Two)

Why Sonnet (Not Opus) for Extraction

Where Opus Would Actually Help

Per-Question Clustering

Global Codebook Construction

Validator + Dimension Architect + Human Review

Human Review Priorities

Phase 2 Prelude — Pilot Calibration Round

Pilot Calibration

Why this step exists

Procedure

Codebook Application

Dual-Agent Coding

Why Dual Agents

Agent 1 — Inclusive Coder

Agent 2 — Conservative Coder

Why Sonnet (Not Opus) for the Two Coders

Kappa Quality Gate

Disagreement Resolution

Agent 3 — Neutral Arbiter

Why Opus (Not Sonnet) for the Arbiter

Dataset Assembly

Master Participant Dataset

Column Structure (in order)

Frequency Report

What to Look For

Segmentation

Basis Variable Preparation (Prevalence + Yule's Q + Blocking + Transforms)

Step 1 — Classify every variable (Wedel & Kamakura, 2000)

Step 2 — Symmetric 9-91% prevalence filter

Step 3 — Yule's Q review within each block

Step 4 — Block grouping

Step 5 — Apply declared transforms (second pass)

Data protocol: additive-only, provenance-tracked

Basis Co-Occurrence Diagnostic [CONVERSATION]

Two failure modes to scan for

Healthy signal benchmarks

Outputs of Step 4.1b

Block-wise MCA Review (analyst-in-the-loop dimension retention)

What the review report contains

Presentation rule — show the eigenvalues and loadings in chat, not just a summary

Why a mechanical "retain 75% of variance" rule is not the final say

Outputs of Step 4.2a

Ward Clustering with Bootstrap Stability

Why MCA + Ward is the correct method for this data