KwantumLabs Research Pipeline

Interview Study Workflow

A complete methodology for running large-scale qualitative interview studies — from participant simulation through final client report — with full methodological grounding and quality standards at every step.

run_discovery.py run_coding.py build-master-dataset.py build-frequency-report.py segment-prep.py run-segmentation.py

Pipeline Overview

Six phases, from blank canvas to deliverable

The pipeline is hybrid by design: qualitative rigor at the codebook-building stage, quantitative discipline at the analysis and reporting stage. That combination makes it possible to run 150-200 participant studies and produce defensible, statistically grounded claims — while maintaining the insight depth clients expect from qualitative research.

0

Study Design

Design the interview guide, configure the study, set up the roster, simulate participants for pipeline testing, and process real Maze transcripts when real interviews are complete.

Interview Guide Simulation Maze Processing
1

Codebook Discovery

Seven sub-steps convert raw participant responses into a finalized codebook instrument — the most methodologically intensive phase of the pipeline.

Context Generation Classification Extraction Clustering Dual Construction Reconciliation Human Review
2

Codebook Application

Two independent agents apply the finalized codebook to all participants. Inter-rater reliability is calculated, disagreements are resolved, and quality flags are issued.

Dual-Agent Coding Kappa Quality Gate Resolver Agent
3

Dataset Assembly

Coded data is assembled into a flat participant-by-code matrix — the single source of truth for all reporting and analysis downstream.

Master Dataset Frequency Report
4

Segmentation

Defining dimensions are identified and validated, PCA reduces the variable space, and k-means clustering discovers natural participant groups.

Dimension Classification Variance Filter PCA + K-Means Segment Profiling
5

Analysis and Reporting

Frequency and cross-tab analysis surfaces the key findings. The client report is built as an interactive HTML document, deployed to Cloudflare Pages.

Cross-Tab Analysis Competitive Vulnerability Report Build PDF Optimization Deployment

Methodological Foundation

A hybrid of three validated frameworks

The pipeline does not use Braun and Clarke's reflexive thematic analysis — that method was designed for interpretive, meaning-centered research and its authors explicitly argue that counting theme frequencies does not add analytic value. Our goals require a different foundation: methods built from the ground up to support systematic coding that produces comparable, quantifiable data across participants.

🎯

Directed Content Analysis

Codes are built inductively on the first study, then treated as a fixed measurement instrument for all subsequent studies in the same domain. This is what makes cross-study comparison valid.

Hsieh & Shannon, 2005

Framework Analysis

Originally developed for large-scale applied policy research. Produces a participant × code matrix enabling systematic cross-case comparison — the "how does this theme vary by segment" question.

Ritchie & Spencer, 1994
📐

Qualitative Content Analysis

Structured, rule-based approach that explicitly bridges qualitative interpretation and quantitative analysis. Each code requires a definition, decision rules, and inclusion/exclusion criteria.

Mayring, 2000
🤝

Multi-Agent LLM Coding

Multiple independent agents with differentiated personas produce codebooks of comparable quality to expert human-coded codebooks, when structured reconciliation is applied.

CollabCoder — Gao et al., CHI 2024
0

Study Design

Before any analysis can run, the study must be designed to generate the right data. Every downstream analysis decision — what to code, what to segment on, what to report — flows from what questions were asked. A poorly designed guide cannot be rescued by better analysis.

0.1

Interview Guide Design

The instrument that determines everything downstream
+

The interview guide is the instrument. Every downstream analysis decision flows from what questions were asked. There are three question modules, and the distinction between them is the most important design decision in the study.

Three Question Modules

  • Module A: Historical evaluation — triggers, criteria, alternatives considered, reasons for rejection. These produce defining variables for segmentation because they capture state of mind that preceded the purchase decision. They are not contaminated by what the participant is using today.
  • Module B: Current state — satisfaction, frustrations, current gaps, goals. These produce profiling and outcome variables — they describe what happened after the decision was made, not what drove it.
  • Module C: Perceptual/hypothetical — brand perceptions, willingness to pay. Profiling variables.
On outcome variable capture: In our current study design, the participant's current tool is asked directly in the interview and coded as an outcome variable during the application phase — not from a pre-interview screener. The core principle remains: keep outcome variables (what you are trying to explain) separate from defining variables (what you cluster on). Module B satisfaction and frustration data are profiling variables by default, not defining variables.

JTBD Framing for Module A

When writing evaluation trigger questions, use a Jobs-to-be-Done framing: what was the participant trying to accomplish, what context made them start looking, what would need to be true for them to take action? People do not naturally articulate evaluation criteria — they tell stories about situations. The stories contain the criteria.

Output
study_config.json — question IDs, coding types, module classifications
0.2

Participant Simulation

End-to-end pipeline testing before real interviews
+

Before running real interviews, the full pipeline is tested with simulated participants. Simulation allows you to catch bugs, validate script paths, and build an initial codebook before spending budget on real transcripts.

Configure Parameters Before Launching

Simulation is not one-size-fits-all. Before running, explicitly decide on these roster parameters:

  • Company size distribution — what mix of small, mid-market, and enterprise participants should the roster reflect?
  • Job title and seniority — VP, Director, Manager in what proportions? Seniority affects communication style, decision authority, and frustration profiles.
  • Current tool distribution — which products are represented and at roughly what market-share proportions?
  • Industry distribution — which sectors are in scope?
  • Any other roster fields the study requires — these are passed to the simulation agent so each participant's answers are grounded in their assigned profile.

Decide these before launching and document them in the study's roster design notes. The simulation is only as realistic as the roster it draws from.

Verboseness Distribution — Where These Numbers Come From

Each participant is simulated with one of three verboseness levels. These defaults are empirically calibrated from a real Maze study.

Source: We analyzed N=74 real interview transcripts from a Maze study with UX researchers — actual professional interviews on the Maze platform. For each participant we measured total response word count and identified three natural clusters. Results from that analysis:
LevelShareObserved avgObserved range
Not Verbose~34%808 words500–1,200
Somewhat Verbose~43%~1,500 words1,200–2,100
Very Verbose~23%2,100+ words1,900–3,000
35% Not Verbose — target 900 words 43% Somewhat Verbose — target 1,500 words 22% Very Verbose — target 2,400 words
Why targets differ from observed averages — the overshoot problem: The simulation targets (900 / 1,500 / 2,400) are guidance given to the simulation agent, not hard caps. LLM-simulated participants consistently overshoot their word count guidance — tell the agent to aim for 900 words and it often produces 1,100–1,200. By setting targets somewhat below where we want the output to land, the overshoot brings the final word count back into the real-data range. If a colleague asks why the targets don't exactly match the observed averages, the answer is: they are deliberately set conservatively to compensate for the LLM tendency to be more verbose than instructed.

Simulated participants are assigned realistic demographic variation from the roster parameters above. The simulation agent answers as a realistic professional in this domain — with the vocabulary, concerns, and communication patterns that role and domain entail.

Output → A6 Data Files - Simulated/
transcripts.json — array of participants with transcript Q&A pairs
0.3

Maze Transcript Processing

Converting real interview exports into pipeline-ready JSON
+

Real interviews conducted via Maze are exported as a CSV where each column is one participant's transcript. The Maze export has a known inconsistency: some participants have their full transcript in a single cell; others have it split across multiple rows.

Script
python maze_to_json.py path/to/maze-export.csv

What the Script Does

  • Reads each column as one participant, concatenates all non-empty cells to handle split-row exports
  • Uses regex to identify turn headers (sequence number, timestamps, speaker label)
  • Groups turns into question-response pairs: each interviewer turn starts a new question
  • Prints participant count, sequence gaps, and warnings for any participants that could not be parsed
Quality check: Review the printed summary before proceeding. Any participant with a parse warning should be spot-checked manually against the raw CSV.
1

Codebook Discovery

The discovery phase converts raw participant responses into a structured codebook — the instrument that defines exactly what themes exist, how they are bounded, and what counts as an inclusion. Every downstream analysis depends on getting this right. Script: run_discovery.py

1.1

Study Context Generation

Phase 0 in pipeline — context document passed to all downstream agents
+

Before any other coding step runs, one Sonnet agent reads a random sample of 50 participant responses and produces a structured four-section study context document.

Agent Configuration
Keycontext_generator
Modelclaude-sonnet-4-20250514
Temperature0 — deterministic; factual characterization, no variation needed
Max tokens8,000
Persona"You are a qualitative research analyst. You have been given a sample of interview responses and your job is to produce a structured context document that will help downstream agents understand who was interviewed, how they communicate, and what subjects they discuss. Be descriptive, not prescriptive. Capture patterns, vocabulary, and professional context. Do not draw analytical conclusions — that is for the coding agents."
TaskReads 50 randomly sampled responses. Produces a 4-section JSON: who was interviewed, what they were asked, how they communicate, and dominant topics. Explicitly instructed not to pre-determine codes or draw conclusions.

Four Sections of the Context Document

  • Who was interviewed — roles, seniority, company sizes, industries, range of variation
  • What they were asked — plain-language summary of each question's intent, inferred from actual responses
  • How they communicate — vocabulary, technical vs. non-technical language, response length tendencies, jargon
  • Dominant topics and patterns — themes appearing most frequently across the 50 sampled responses
Why this exists: Agents making segmentation and grouping decisions perform better when they understand the professional vocabulary of the people they're analyzing. Without context, the extraction agent treats all language generically. With it, it recognizes that "running payroll in-house" and "managing payroll processing internally" are the same concept.

What it is not: Descriptive, not prescriptive. It tells downstream agents what the study population is like — not which codes to create. Analytical conclusions come from the data.

Output → A6 Data Files - Simulated/
study-context.json
1.2

Question Classification

Phase 1 — 2-part process: type determination from a tiny sample, then structural definition from the full dataset
+

Classification runs in two sequential sub-steps. Part 1a determines question type cheaply. Part 1b defines the structural codebook for non-thematic questions with full data coverage.

Part 1a — Type Determination (5-participant sample per question)

Agent Configuration — Part 1a
Keyextractor_1
Modelclaude-sonnet-4-20250514
Temperature0.2 — slight flexibility to weigh ambiguous questions
TaskReceives the question text and 5 randomly sampled responses. Classifies the question as thematic, categorical, rank-order, or binary. Returns only the coding type and brief reasoning — no structural details.
PromptYou are an expert qualitative methodologist.

TASK: Determine the coding type for this interview question. Read the question text and the sample responses, then classify the question.

QUESTION: "{question_text}"

SAMPLE RESPONSES (small sample — for type determination only):
  1. "{response}"
  ...(up to 5 randomly sampled)

CODING TYPES:
1. "thematic" — open-ended question producing complex, multi-faceted responses. Responses contain multiple ideas, experiences, or judgments that need to be broken into meaning units and grouped into themes. Most open-ended "why" and "how" questions fall here.
2. "categorical" — question where responses naturally cluster into a small set of distinct categories (3–7). Typically satisfaction, sentiment, or preference questions where each response maps to one category.
3. "rank_order" — question where responses contain a numeric value or quantity that maps to ordinal buckets. Company size, years of experience, frequency counts.
4. "binary" — question where responses indicate yes or no, presence or absence. "Did you..." or "Have you..." questions.

INSTRUCTIONS: (1) Read the question text and all sample responses. (2) Determine which coding type best fits. (3) Provide brief reasoning. (4) Do NOT define categories, buckets, or binary criteria — that comes in a separate step with full data.

Output JSON: {"coding_type": "thematic|categorical|rank_order|binary", "reasoning": "Why this coding type fits"}
Flow: How Part 1a Handles Information
Input
One question at a time, with 5 randomly sampled participant responses to that question.
ExampleQ7: "What frustrates you about your current HR system?" P5: "Reporting is terrible..." P12: "Interface looks like 2005" P23: "Re-entering data 3 times..." P41: "Useless for analytics" P88: "Can't customize permissions"
What the Agent Does
Reads the question and the 5 sample responses. Asks: which of the four coding types fits? Decides between thematic, categorical, rank-order, or binary. Writes brief reasoning. Does NOT define structure yet — that is Part 1b's job.
Reasoning"Open-ended frustrations question. Responses contain multiple distinct ideas per person (reporting, UI, data entry, analytics, permissions). Not enumerable into 3-7 categories. → thematic"
Output → Part 1b or Step 1.3
A single classification record per question. Thematic questions skip Part 1b and go straight to extraction (Step 1.3). Non-thematic questions go to Part 1b for structural definition.
Passes On{ "question_id": "Q7", "coding_type": "thematic", "reasoning": "Open-ended; multiple ideas per response; not enumerable" }
Why only 5 responses for type determination: Question type is largely apparent from the question text itself; the responses confirm it. Binary and rank-order questions are identifiable from even one or two answers. Categorical vs. thematic distinction needs a few more — 5 is sufficient. Keeping Part 1a cheap lets it run across all questions in parallel, preserving the token budget for Part 1b.

Part 1b — Structural Definition (all participants, non-thematic questions only)

Agent Configuration — Part 1b
Keyextractor_1
Modelclaude-sonnet-4-20250514
Temperature0.2
TaskReceives the question text, the coding type from Part 1a, and ALL participant responses to that question. For categorical: proposes 3-7 mutually exclusive category names and definitions. For rank-order: proposes bucket boundaries from the full observed range. For binary: proposes positive/negative labels, definition, and inclusion/exclusion criteria.
PromptYou are an expert qualitative methodologist.

TASK: This question has been classified as "{coding_type}". {Type-specific instruction: For categorical — examine all responses and identify 3–7 mutually exclusive categories that capture the full range of actual answers. For rank_order — examine all responses and propose sensible ordinal buckets that span the full observed range. For binary — examine all responses and define what counts as the positive case with clear inclusion and exclusion criteria.}

You are reading ALL {n} participant responses to ensure the structural definition covers the full range of actual answers in the dataset.

QUESTION: "{question_text}"

ALL RESPONSES ({n} total):
  1. "{response}"
  ...

Output JSON (type-specific format):
Categorical: {"category_details": {"categories": ["Cat 1", "Cat 2", ...], "category_definitions": {"Cat 1": "Definition...", ...}}}
Rank-order: {"rank_order_details": {"buckets": [{"label": "1–100", "range_low": "1", "range_high": "100"}, ...]}}
Binary: {"binary_details": {"positive_label": "Yes", "negative_label": "No", "definition": "...", "inclusion_criteria": "...", "exclusion_criteria": "..."}}
Flow: How Part 1b Handles Information
Input
One non-thematic question + the coding type from Part 1a + ALL ~180 participant responses to that question.
ExampleQ3 [categorical]: "How many employees does your company work with?" P1: "About 800 people" P2: "Around 250" P3: "Just under 5,000" P4: "We're a 40-person team" P5: "12,000 globally" ... 175 more responses
What the Agent Does
Reads the full range of actual answers. For categorical: identifies 3-7 mutually exclusive categories with definitions. For rank-order: proposes ordinal buckets spanning the full observed range. For binary: defines positive/negative labels with inclusion/exclusion criteria. Structure is data-driven, not predefined.
Reasoning"Responses span 12 → 12,000 employees. Natural B2B segmentation breaks fall at SMB (1-250), Mid-Market (251-1,000), Enterprise (1,001+). All 180 responses fit cleanly into these three buckets."
Output → Final Codebook
A complete, authoritative structural definition for that question. Goes directly into the final codebook — not refined later by the dual architects (Steps 1.5-1.6), which only touch thematic questions.
Passes On{ "question_id": "Q3", "coding_type": "categorical", "categories": [ {"label": "SMB", "definition": "1-250"}, {"label": "Mid-Market", "definition": "251-1,000"}, {"label": "Enterprise", "definition": "1,001+"} ] }
Why all participants for structural definition: Category names, bucket boundaries, and binary criteria must reflect the full range of answers actually observed. A category appearing in 5% of a 180-person study (9 people) might appear only once in a 30-person sample. Reading all participants guarantees complete coverage. The token cost is low because non-thematic responses are short (a few words to a sentence) and Part 1b runs only for the handful of non-thematic questions. For a study with 5 non-thematic questions and 180 participants, Part 1b reads approximately 45,000 tokens total — less than reading 30 participants across all questions.
Thematic — open-ended, multi-code Categorical — Part 1b proposes categories from full dataset Rank-Order — Part 1b proposes buckets from full value range Binary — Part 1b proposes labels and criteria from full dataset
The structural definition from Part 1b is authoritative. For any question classified as categorical, rank-order, or binary, the structure defined in Part 1b becomes the final codebook structure for that question. It is not refined by the dual codebook construction agents (Steps 1.5–1.6), which only operate on thematic questions. Review the non-thematic entries in questions-registry.csv during the Step 1.7 human review.

Why One Agent (Not Two) for Classification

Classification is a structured decision with a finite, well-defined outcome space — the four types are exhaustive and mutually exclusive. It is not an interpretive judgment the way codebook construction is. Two agents classifying the same question would almost always agree; the rare disagreement would be on edge cases better resolved by reading more responses, not by running a second agent. The complexity and cost of dual-agent classification with reconciliation is not justified by the improvement in output quality.

Output → A6 Data Files - Simulated/
questions-registry.csv — question_id, question_text, coding_type, n_codes_or_categories
1.3

Meaning Unit Extraction

Phase 2 — one agent reads study context, then breaks every response into the smallest distinct ideas
+

One Sonnet agent processes all participant responses to thematic questions. Before reading any responses, it receives the study context document from Step 1.1 — who was interviewed, how they communicate, and what topics they discuss. This primes the agent with the professional vocabulary and communication style of the participants so it can make better interpretation decisions.

Agent Configuration
Keyextractor_1
Modelclaude-sonnet-4-20250514
Temperature0.2 — slight openness to capture borderline meaning units a temperature-0 agent would skip
Batch size20 responses per API call, 18 parallel workers
Persona"You are an expert qualitative researcher performing inductive coding. You are thorough and nuanced. Capture both explicit statements and clearly implied meaning. Look for subtle differences in how participants express similar ideas. It is better to extract a borderline meaning unit than to miss one."
Prompt{persona}

TASK: Read these interview responses to the question below and break each one into discrete meaning units. A meaning unit is the smallest segment of text that expresses a single idea, experience, or judgment (Graneheim & Lundman, 2004).

For each meaning unit, write a short descriptive code (3-8 words) that captures the specific meaning. Stay close to the participant's language. This is first-cycle descriptive coding (Saldana, 2016).

QUESTION: "{question_text}"

RESPONSES:
Participant {id}: "{response}"
...

RULES:
- One response can contain multiple meaning units
- Descriptive codes should be specific, not generic ("had to re-enter data three times" not "data issues")
- Preserve the participant's meaning. Do not interpret or abstract yet.
- If empty or off-topic: mark as "no_codable_content"
- Include the exact quote for each meaning unit

GUARD RULES:
- WORD BOUNDARY: Only code complete, standalone words. Never extract a word found inside a longer word (e.g., "equity" is not evidence that the participant said "quit").
- NEGATION: Pay attention to negation (not, never, didn't, wasn't). "Not satisfied" means dissatisfied. Identify the full negated phrase before coding.
- SARCASM: Watch for ironic statements where context indicates the speaker means the opposite of their literal words. Code the intended meaning, not the literal words.
- HEDGING: Distinguish definitive statements from hedged ones. "Kind of," "sort of," "I guess," "maybe" weaken meaning — code the actual strength of the statement.
- ABSENCE: If a participant does not mention a topic, do NOT code that as absence. Only code what is stated or clearly implied. Silence is not data.
- CONTEXT: Read the full response before coding any part. A phrase can change meaning based on what surrounds it.

Output JSON: {"meaning_units": [{"participant_id": 1, "text": "exact quote", "descriptive_label": "short label"}, ...]}
Flow: How the Extractor Handles Information
Input
One thematic question + a batch of 20 participant responses to that question. The agent has already been primed with the study context document from Step 1.1.
ExampleQ7: "What frustrates you about your current HR system?" P5: "The reporting is terrible — I can't pull anything on demand. And we have to re-enter the same data three times every onboarding. Honestly the only thing it does well is payroll runs."
What the Agent Does
Reads each response in full, then breaks it into the smallest distinct ideas. One sentence can yield multiple meaning units; one idea can span multiple sentences. Applies guard rules (negation, sarcasm, hedging, word boundary). Writes a 3-8 word descriptive code for each unit, staying close to the participant's language.
SegmentationP5's response → 3 distinct ideas: (1) Reporting can't be pulled on demand (2) Onboarding requires triple data entry (3) Payroll runs are the only strength
Output → Step 1.4
A flat list of meaning unit records. Each record contains the participant ID, the exact verbatim quote, and a short descriptive label. ~340 meaning units per question for a 180-person study.
Passes On[ { "participant_id": 5, "text": "reporting is terrible — I can't pull anything", "descriptive_label": "Cannot pull reports on demand" }, { "participant_id": 5, "text": "re-enter the same data three times every onboarding", "descriptive_label": "Onboarding triple data entry" }, { "participant_id": 5, "text": "the only thing it does well is payroll runs", "descriptive_label": "Payroll is sole strength" } ]

Methodological Grounding for the Guard Rules

Dunivin (2024), Scalable qualitative coding with LLMs, establishes the central principle: LLMs require more precise codebook descriptions than human coders do because they lack the contextual understanding human coders develop through training and discussion. Every guard rule below is a precision instruction the model would otherwise miss.

  • Meaning unit definition (cited inline in prompt): Graneheim & Lundman (2004) — the smallest segment expressing a single idea
  • Descriptive label definition (cited inline in prompt): Saldana (2016), first-cycle descriptive coding — labels close to the participant's language
  • Negation, absence, context rules: Hsieh & Shannon (2005), Three approaches to qualitative content analysis — addresses interpretation decisions including the risk of misreading negated statements and the principle that absence of a statement is not coded data
  • Sarcasm and hedging rules: De Paoli (2024), Performing an inductive thematic analysis of semi-structured interviews with a large language model, identifies sarcasm, irony, and hedged language as the dominant failure modes for LLM-based qualitative analysis. Explicit prompt instructions to watch for these are the documented mitigation
  • Word boundary rule: LLM-specific. Derived from observed extraction failures where the model identifies a concept inside a longer word as a keyword match. Khalid & Witmer (2025), Best practices for prompt engineering in qualitative analysis with LLMs, recommends explicit boundary instructions for token-based models
  • Human oversight at Step 1.7: Bhaduri et al. (2024), LLMs as qualitative research assistants, argues that LLM-generated codes require structured human review before downstream use. Our Step 1.7 codebook review is this checkpoint

Reliability target grounding. Dunivin (2024) reports that GPT-4 with chain-of-thought prompting achieved Cohen's κ ≥ 0.79 (excellent agreement) on 3 of 9 codes and κ ≥ 0.6 (substantial) on 8 of 9 codes against human coders. Our HR Leaders BambooHR study achieved an overall weighted κ of 0.909 — above the strongest results in the published literature. This is why we are confident the single-extractor design is sufficient.

Chain-of-thought grounding. Dunivin (2024) finds that requiring the model to reason about each code before assigning it improves coding fidelity. This is the basis for the landscape analysis requirement in Steps 1.4 and 1.5 — clustering and codebook construction agents must write what they observe before proposing structure.

Why One Agent (Not Two)

The HR Leaders BambooHR study produced an overall weighted Kappa of 0.909. The four codes that fell below threshold were definition problems, not extraction failures — a second extractor would not have fixed them. Dual extraction was adding methodological complexity without improving downstream reliability. The complexity budget is better spent at the codebook construction step, where boundary-drawing actually matters.

Why exact quotes are preserved: Codebook agents downstream need the actual language participants used to write grounded examples and counter-examples. Abstract descriptive labels without verbatim quotes produce ambiguous codebook entries. Reports also need verbatim quotes for client deliverables.
Output → A6 Data Files - Simulated/
meaning-units-log.csv — meaning_unit_id, participant_id, question_id, exact_quote, descriptive_label
1.4

Per-Question Clustering

Phase 3 — organizing meaning units before global codebook construction
+
Agent Configuration
Keyclusterer
Modelclaude-opus-4-6
Temperature0 — deterministic; clustering decisions must be consistent across questions
Max tokens32,000
Persona"You are a qualitative researcher performing preliminary grouping of descriptive codes. Your job is to compress meaning units into rough semantic clusters that will be passed to a global synthesis agent. When in doubt about whether two codes belong together, keep them separate. Under-clustering is recoverable downstream; over-merging is not — once two distinct concepts are collapsed into one cluster, the distinction is lost permanently. Err strongly on the side of finer-grained clusters."
Prompt{persona}

{study_context_block — who was interviewed, how they communicate, dominant topics from Step 1.1}

TASK: You are performing preliminary clustering of meaning unit descriptions from a qualitative interview. This is an intermediate compression step — your clusters will be passed to two parallel codebook architect agents who will build draft codebooks. The richer your output, the better those downstream agents can do their job.

NOTE ON TERMINOLOGY: At this stage we are working with "meaning unit descriptions" (the short descriptive labels the prior extraction agent assigned to each meaning unit). The word "code" is reserved for the final codebook entries that downstream agents will define. You are not creating codes here — you are grouping descriptions into clusters that will inform code creation later.

Your goal is to group semantically related meaning unit descriptions into rough clusters. This does NOT need to be perfect. The downstream architects will refine, split, and merge your clusters. Be inclusive: it is better to put a borderline description in a cluster than to leave it unclustered.

QUESTION: "{question_text}"

MEANING UNIT DESCRIPTIONS (participant_id + descriptive label):
{stripped_units}

INSTRUCTIONS:
1. Read through all meaning unit descriptions and identify recurring conceptual patterns.
2. Group related descriptions into clusters. Each cluster should represent one coherent idea.
3. Give each cluster a short, descriptive label (4-8 words).
4. For each cluster, include AT LEAST 5 representative meaning unit descriptions that characterize the cluster. More is better — the downstream architects rely on these examples to understand what the cluster contains. If a cluster genuinely has fewer than 5, include all of them.
5. For each cluster, list AT LEAST 5 participant IDs whose meaning units best exemplify it. These will be used to pull representative quotes for the downstream architects. Again, more is better.
6. Estimate how many unique participants contributed to each cluster.
7. It is fine to have 15-40 clusters. Do not force artificial merging.
8. Meaning unit descriptions that don't fit any cluster: list only up to 20 uncategorized examples.

Output JSON: {"clusters": [{"cluster_label", "representative_meaning_units", "representative_participant_ids", "approximate_participant_count"}], "uncategorized_meaning_units": [...]}

Clustering per question first reduces cognitive load on the global construction agents. Instead of receiving thousands of raw descriptive labels in one block, they receive pre-organized clusters per question — making the landscape analysis step more tractable.

Receives the study context from Step 1.1. Before reading any meaning unit descriptions, the clusterer is primed with the same who/how/what context document the extractor sees. This helps it recognize when descriptions that look superficially different are referring to the same underlying concept in the participants' shared vocabulary.

Terminology note: At this stage we deliberately use the phrase "meaning unit description" rather than "code." The word "code" is reserved for the final codebook entries produced in Steps 1.5 and 1.6. The clusterer is grouping descriptive labels — it is not creating codes.

Why minimums of 5: The clusterer must include at least 5 representative meaning unit descriptions and at least 5 participant IDs (with quotes attached downstream) per cluster. The richer the cluster summary, the better the downstream architects can decide what to merge, split, and define. Token budget is comfortable — bumping these from earlier ranges adds roughly 8-10K tokens to a 25K prompt, well inside Sonnet's context window.

Flow: How the Clusterer Handles Information
Input
The study context document from Step 1.1 (who was interviewed, how they communicate, dominant topics) plus all meaning unit descriptions for ONE question, stripped to just {participant_id, descriptive_label} — verbatim quotes are NOT sent to keep input tokens manageable. Capped at 500 units (evenly sampled if more).
Example (Q7, ~340 units):
[ {P5: "Cannot pull reports on demand"}, {P5: "Onboarding triple data entry"}, {P12: "Interface looks like 2005"}, {P23: "Re-entering same data 3x"}, {P41: "Useless for analytics"}, {P67: "No real-time dashboards"}, {P88: "No per-role permissions"}, {P102: "Can't extract custom data"}, ... 332 more ]
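The 500-unit cap with even sampling can be sketched in a few lines of Python. This is an illustrative sketch, not the pipeline's actual code; the function name `cap_units` and the default of 500 are assumptions taken from the description above.

```python
def cap_units(units, max_units=500):
    """Evenly sample meaning units when a question exceeds the cap.

    Preserves original order and spreads the sample across the full
    list rather than truncating the tail.
    """
    if len(units) <= max_units:
        return list(units)
    step = len(units) / max_units  # fractional stride across the list
    return [units[int(i * step)] for i in range(max_units)]
```

Even sampling matters because transcripts are ordered by participant: truncating the tail would silently drop the last-interviewed participants from clustering.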
What the Agent Does
Reads all meaning unit descriptions, identifies recurring conceptual patterns, and groups semantically related descriptions into rough clusters (~15-40 per question). Names each cluster, picks at least 5 representative meaning unit descriptions, lists at least 5 representative participant IDs, and estimates how many participants contributed. Errs toward finer-grained clusters — over-merging here is irreversible.
Grouping:
"Cannot pull reports", "Useless for analytics", "No real-time dashboards", "Can't extract custom data" → all about REPORTING
"Onboarding triple data entry", "Re-entering same data 3x" → DUPLICATE DATA ENTRY
Output → Step 1.5
~20 cluster objects per question. Each has a label, AT LEAST 5 representative meaning unit descriptions, AT LEAST 5 representative participant IDs, and an approximate participant count. The pipeline then bolts on 5 verbatim quotes per cluster (looked up from the originals) before passing to Step 1.5.
Passes On:
[
  {
    "cluster_label": "Reporting & analytics gaps",
    "representative_meaning_units": ["Cannot pull reports on demand", "Useless for analytics", "No real-time dashboards", "Can't extract custom data", "No exportable raw data"],
    "representative_participant_ids": [5, 41, 67, 102, 134],
    "approximate_participant_count": 28
  },
  {
    "cluster_label": "Duplicate data entry burden",
    "representative_meaning_units": ["Onboarding triple data entry", "Re-entering same data 3x", "Sync gaps force manual re-key", "Same field across 3 modules", "No single source of truth"],
    "representative_participant_ids": [5, 23, 89, 145, 167],
    "approximate_participant_count": 19
  },
  ... ~18 more clusters
]
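The quote bolt-on described above can be sketched as follows, assuming a lookup keyed by (participant_id, descriptive_label) built from the original extraction output. `attach_quotes` and the data shapes are illustrative, not the pipeline's actual names.

```python
def attach_quotes(cluster, quote_lookup, quotes_per_cluster=5):
    """Bolt verbatim quotes onto a cluster object before Step 1.5.

    quote_lookup maps (participant_id, descriptive_label) -> verbatim
    quote, recovered from the extraction originals (the clusterer itself
    never saw the quotes).
    """
    quotes = []
    for pid in cluster["representative_participant_ids"]:
        for label in cluster["representative_meaning_units"]:
            text = quote_lookup.get((pid, label))
            if text is not None:
                quotes.append({"participant_id": pid, "text": text})
                break  # one quote per representative participant
        if len(quotes) >= quotes_per_cluster:
            break
    return {**cluster, "representative_quotes": quotes}
```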
Why Opus here: Over-merging at the cluster stage causes irreversible information loss — the global synthesis agents can only see what the clusterer passes. If two distinct concepts are collapsed here, they cannot be recovered. Opus errs toward finer-grained clusters. Under-clustering is recoverable; over-merging is not.
1.5

Dual Codebook Construction

Phases 4a and 4b — two agents with opposing biases, run in parallel
Agent A — Parsimony Architect
Key: global_synthesizer_parsimony
Model: claude-sonnet-4-20250514
Temperature: 0.2 — low; consistent, predictable merging decisions
Max tokens: 32,000
Persona: "You are a senior qualitative methodologist building a thematic codebook. You lean toward parsimony: merge codes that are conceptually adjacent. A smaller codebook with clear, well-defined codes is better than a large codebook with overlapping or hard-to-distinguish codes. When in doubt, merge. Each code must be distinct enough that a coder could reliably tell it apart from every other code without needing to ask for clarification."
Agent B — Distinction Architect
Key: global_synthesizer_distinction
Model: claude-sonnet-4-20250514
Temperature: 0.7 — higher; explores more separation options, more variation
Max tokens: 32,000
Persona: "You are a senior qualitative methodologist building a thematic codebook. You lean toward distinction: if two codes capture meaningfully different ideas, keep them separate even if they are related. A richer codebook with more granular codes is better than a collapsed codebook that loses nuance. When in doubt, keep separate. Define each code precisely enough that the boundary between adjacent codes is clear."
Shared Prompt (both architects receive the same input)
Execution: Both agents run simultaneously via ThreadPoolExecutor with separate API clients
Prompt:
{persona}

{study_context_block — who was interviewed, how they communicate, dominant topics from Step 1.1}

TASK: Build a canonical global codebook for this interview study.

You are seeing preliminary clusters from {N} thematic questions across {total_participants} participants. Your job is to synthesize these clusters into a definitive, study-wide list of themes.

CRITICAL: The theme names and definitions you write here will be used consistently across ALL questions in this study. A participant mentioning "ease of use" in question 5 and another mentioning it in question 12 must both receive the same theme code. Name themes for what they ARE, not for which question they came from.

CLUSTERS BY QUESTION:
{questions_text — each cluster includes label, sample meaning unit descriptions, ~participant count, and 3 representative quotes}

INSTRUCTIONS:

Step 1 — Survey the full landscape: Read through all clusters from all questions. Note which conceptual patterns appear across multiple questions and which are unique to one question.

Step 2 — Define themes: Group clusters that represent the same underlying concept, even if they appear in different questions or are framed differently (positively in one question, negatively in another). Name the theme for the concept, not the framing.

Step 3 — Write codebook entries: For each theme, write:
- A precise one-sentence definition that applies regardless of which question raised it
- Inclusion criteria: when to assign this code
- Exclusion criteria: when NOT to assign, and which theme to assign instead
- 2-3 example quotes from the clusters above
- 1-2 negative examples (quotes that seem related but belong elsewhere)

Step 4 — Coverage: Every cluster should map to at least one theme. If clusters don't fit, either create a new theme or mark them as "Other."

Step 5 — Quality checks:
- Each theme covers one coherent concept (not a bundle of 2-3 ideas)
- Theme boundaries are clear enough that two coders would agree on the same responses
- Theme names are specific (not "General Issues" or "Other Concerns")

Output JSON: {"themes": [{"code_name", "definition", "inclusion_criteria", "exclusion_criteria", "source_questions", "examples", "negative_examples"}], "other_codes": [...], "synthesis_notes": "..."}
Flow: How One Codebook Architect Handles Information
Input
The study context document from Step 1.1 (who was interviewed, how they communicate, dominant topics) plus ALL clusters from ALL ~13 thematic questions in the study, simultaneously. Each cluster includes its label, up to 5 example meaning unit descriptions, ~3 representative verbatim quotes, and approximate participant count. Both architects (parsimony + distinction) receive the identical input.
Example (~400 clusters across 13 Qs):
Q5 — "What triggers a search for a new HR system?":
  Cluster: "Reporting workflow gaps" (~22) quotes: P14, P67, P122
  Cluster: "Compliance audit failures" (~18) ...
Q7 — "What frustrates you about your current HR system?":
  Cluster: "Reporting & analytics gaps" (~28) quotes: P5, P41, P102
  Cluster: "Duplicate data entry" (~19) ...
Q11 — "What capabilities are missing?":
  Cluster: "Cannot extract data" (~24) quotes: P34, P88, P156 ...
[~10 more questions, ~400 clusters total]
What the Agent Does
Cross-question synthesis. Looks for the same underlying concept appearing in clusters across multiple questions and names it ONE canonical theme that applies everywhere. Writes a precise definition, inclusion criteria, exclusion criteria, 2-3 example quotes, and 1-2 counter-examples. The parsimony agent merges aggressively (temp 0.2); the distinction agent keeps adjacent ideas separate (temp 0.7).
Cross-Question Synthesis:
Q5 "Reporting workflow gaps" + Q7 "Reporting & analytics gaps" + Q11 "Cannot extract data" + Q15 "Manual report compilation"
↓ (same concept, different framings)
ONE global theme: "Reporting & Analytics Capability Gap"
Output → Step 1.6
A complete draft codebook with ~20-30 global themes, each fully defined. Each theme is named for the concept (not the question), so the same theme code can be assigned to responses from any thematic question in the study.
Passes On:
{
  "code_name": "Reporting & Analytics Capability Gap",
  "definition": "Participant cannot get the data, reports, or analytics they need from their current system in a timely or self-serve way",
  "inclusion_criteria": "Mentions inability to pull reports, missing analytics, manual report compilation, or dependence on IT for routine data extraction",
  "exclusion_criteria": "If issue is purely about data entry or duplication, code as 'Data Entry Burden' instead",
  "source_questions": ["Q5", "Q7", "Q11", "Q15"],
  "examples": [{"text": "I can't pull anything on demand", "participant_id": 5, "question_id": "Q7"}]
}
... ~25 more themes

Both architects receive the study context from Step 1.1. The same who/how/what document is injected at the top of both prompts. This ensures the divergence between the two drafts comes from the parsimony vs. distinction bias, not from differing interpretations of the underlying data — both agents start from the same understanding of who was interviewed and how they communicate, then make different boundary calls from there.

Why the different temperatures produce useful divergence: The temperature gap (0.2 vs. 0.7) combined with opposing personas reliably produces two drafts with different boundary decisions on ambiguous cases. The places where they disagree are exactly where the codebook instrument is weakest — and where human attention in the review gate is most needed.
Why Sonnet (not Opus) for construction: Construction runs twice in parallel with 32,000-token outputs — Opus cost would be prohibitive for routine studies. Sonnet with chain-of-thought prompting produces adequate construction quality. Opus is reserved for reconciliation, where a single agent makes the final boundary calls.
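The parallel execution described for the two architects (ThreadPoolExecutor, one client per agent) can be sketched as follows. `call_model` is a hypothetical stand-in for the real API client call; only the two temperatures are taken from the configs above, everything else is illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

# Opposing-bias configs from the agent cards above.
ARCHITECTS = {
    "parsimony": {"temperature": 0.2},
    "distinction": {"temperature": 0.7},
}

def run_architects(shared_prompt, call_model):
    """Run both codebook architects concurrently on the same input.

    call_model(agent_name, temperature, prompt) is a stand-in for the
    real per-agent API client; returns a dict of draft codebooks keyed
    by agent name.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {
            name: pool.submit(call_model, name, cfg["temperature"], shared_prompt)
            for name, cfg in ARCHITECTS.items()
        }
        return {name: f.result() for name, f in futures.items()}
```

Because both agents receive the identical prompt, any divergence in the two drafts is attributable to the persona and temperature, which is exactly what the reconciliation step exploits.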
1.6

Codebook Reconciliation

Phase 4c — single Opus agent makes final boundary calls
Agent Configuration
Key: codebook_reconciler
Model: claude-opus-4-6
Temperature: 0 — boundary decisions must be deterministic and internally consistent; no creative variation
Max tokens: 32,000
Persona: "You are a senior qualitative methodologist reconciling two independently built codebooks. Your job is to produce a single, high-quality global codebook by working through each category systematically: full agreements (keep the better definition), conceptual agreements with different names (choose the more precise name), one-agent-only codes (apply the 10% participant threshold), and boundary disagreements (tighten the boundary with sharper inclusion and exclusion criteria). Record your reasoning at every step. The reconciled codebook must be more rigorous than either input."
Prompt:
{persona}

{study_context_block — who was interviewed, how they communicate, dominant topics from Step 1.1}

TASK: Reconcile two independently built global codebooks into one high-quality final codebook.

STUDY PARAMETERS:
Total participants: {total_participants}
Minimum participant threshold: ~{min_participant_threshold} (codes proposed by only one agent and below this threshold should be excluded)

CODEBOOK A — Parsimony Agent (biased toward merging):
{a_text — JSON of all parsimony themes}

CODEBOOK B — Distinction Agent (biased toward separation):
{b_text — JSON of all distinction themes}

RECONCILIATION PROTOCOL — work through these four categories in order:

1. FULL AGREEMENT — Both agents proposed the same code with the same concept. High confidence. Action: Take the better-written definition. Combine examples from both. Record as reconciliation_action = "agreement"

2. CONCEPTUAL AGREEMENT, DIFFERENT NAMES — Same meaning units grouped together but named differently. Action: Choose the more precise, specific name. Write a merged definition. Record as reconciliation_action = "renamed"

3. ONE-AGENT-ONLY CODES — One agent proposed a code the other did not. Action: If the territory could plausibly meet the {min_participant_threshold}-participant threshold, include it with a human review flag. If it clearly falls below threshold or overlaps an existing code, exclude it. Record as "included_one_agent" or "excluded_one_agent"

4. BOUNDARY DISAGREEMENTS — Both agents proposed codes covering similar territory but drew the line differently. Most important category. Write out where each agent drew the boundary. Propose a single tightened boundary with sharper inclusion and exclusion criteria. Record as "boundary_resolved"

BEFORE your output, write a brief landscape analysis (3-5 sentences) describing what you see across both codebooks: where they agree, where they diverge most, and the main reconciliation challenges. This is your chain-of-thought reasoning.

Output JSON: {"landscape_analysis", "reconciled_themes": [{code_name, definition, inclusion_criteria, exclusion_criteria, examples, negative_examples, human_review_flag, human_review_note}], "reconciliation_log": [{reconciliation_action, final_code_name, parsimony_name, distinction_name, note}], "reconciliation_summary": {agreements, renames, one_agent_included, one_agent_excluded, boundary_resolutions, total_final_codes, human_review_flags}}

Receives the study context from Step 1.1. The reconciler is the highest-stakes interpretive step in the discovery pipeline — every boundary call here propagates into every downstream code assignment. Priming it with the same who/how/what context document used by the extractor, clusterer, and architects lets it weigh boundary disagreements with awareness of the actual vocabulary participants used and the dominant topics in the dataset, sharpening inclusion and exclusion criteria.

Four Reconciliation Categories

  • Full agreement — same code, similar definitions. Take the better-written definition, combine examples from both.
  • Conceptual agreement, different names — same meaning units, different code names. Pick the more precise name, record why.
  • One agent only — one agent proposed a code the other did not. Apply the ~10% participant threshold. If it meets the threshold, include with a human review flag.
  • Boundary disagreement — both agents drew the line differently. Reconciler proposes a single tightened boundary with sharper inclusion/exclusion criteria. Highest-priority items for human review.
Why Opus: Reconciliation requires holding two complete codebooks simultaneously, tracking reasoning for each boundary call, and maintaining consistency across dozens of decisions in one pass. Opus scores meaningfully higher on extended reasoning benchmarks (GPQA Diamond) — that advantage is exactly what this step needs.
Output → A6 Data Files - Simulated/
codebook-audit-trail.csv — code evolution: both drafts + reconciliation decisions + chain-of-thought reasoning
1.7

Human Review Gate

Two automated agents run first, then the pipeline pauses for human approval

Before the pipeline pauses for human review, two more agents run automatically after reconciliation:

Per-Question Validator (Phase 5)
Key: validator
Model: claude-sonnet-4-20250514
Temperature: 0
Max tokens: 16,000
Persona: "You are a qualitative methodologist applying a global codebook to a specific interview question. Your job is to identify which global themes are relevant to this question's responses, add question-specific coding notes where needed, and flag any patterns the global codebook does not cover. Be precise: only mark a theme as applicable if it genuinely appears in the sample responses."
Prompt:
{persona}

TASK: Apply the global codebook to this specific interview question.

The global codebook was built from ALL thematic questions in this study. Your job is to:
1. Identify which global themes actually appear in responses to THIS question
2. Add question-specific coding notes where the global definition needs clarification
3. Flag any patterns in this question's responses that the global codebook doesn't cover

QUESTION: "{question_text}"
TOTAL PARTICIPANTS: {total_participants}

GLOBAL CODEBOOK (study-wide canonical themes):
{themes_text — name, definition, include/exclude for each theme}

SAMPLE RESPONSES TO THIS QUESTION:
{sample_text — 15 random participant responses}

INSTRUCTIONS:

1. APPLICABLE THEMES: Code each sample response against the global codebook. List every theme that appears in at least one sample response.

2. QUESTION-SPECIFIC NOTES: For applicable themes, add a note if the global definition needs clarification for this question's framing. Example: the global theme "Ease of Use" has a positive definition, but in a frustrations question, it will always appear as an absence — note this for coders.

3. COVERAGE GAPS: Are there patterns in these sample responses that no global theme captures? If so, describe them.

4. QUESTION-SPECIFIC ADDITIONS: For any substantial uncovered patterns, propose a new theme entry to be added to this question's codebook only. Only create additions for genuine patterns, not single outliers.

Output JSON: {"applicable_themes": [...], "coverage_gaps": [...], "question_specific_additions": [...], "validation_notes": "..."}
Dimension Architect (Phase 6)
Key: dimension_architect
Model: claude-opus-4-6
Temperature: 0
Max tokens: 16,000
Persona: "You are a senior market research methodologist designing the segmentation variable structure for a B2B interview study. You understand the distinction between defining variables (go into cluster analysis), outcome variables (validate clusters, never enter clustering), and profiling variables (describe segments after clustering). You apply the N/10 rule for maximum cluster variables, the variance filter (20-80% prevalence for binary themes), and the temporal contamination principle (Module B data is downstream of the tool choice and must not be used as defining variables). You group themes into composite dimensions using conceptual grouping, not statistical methods."
Prompt:
{persona}

TASK: Design the segmentation dimension structure for this interview study.

You have a completed codebook with {N_themes} global themes and {N_questions} coded questions. Your job is to produce a `dimensions` section that classifies every variable into one of three roles and groups thematic variables into composite dimensions for cluster analysis.

OUTCOME VARIABLE (from screener — never enters clustering):
{outcome_text}

QUESTION CONTEXT (researcher-provided temporal and purpose hints):
{context_text}

GLOBAL THEMES (from thematic questions):
{themes_text}

NON-THEMATIC QUESTIONS (categorical, rank_order, binary):
{simple_text}

HARD CONSTRAINT: Maximum {max_dimensions} defining dimensions (derived from N/10 rule: sample_size // 10). This is a ceiling, not a target.

THREE ROLES — classify every variable into exactly one:

DEFINING — goes into cluster analysis. Must pass all three tests: (1) Varies across participants (binary: 20-80% prevalence); (2) Variation predicts different purchase behavior; (3) Not redundant with another defining variable.

OUTCOME — never enters clustering. Used after clustering to validate that segments predict something useful. TEMPORAL RULE: Module B data (current-state questions) is downstream of the tool choice and must be treated as profiling, not defining.

PROFILING — never enters clustering. Used after clustering to describe and communicate each segment.

DIMENSION TYPES: composite_binary (multiple binary theme codes, OR logic), ordinal_encoded (single ordered categorical), binary_field (single yes/no), passthrough (as-is).

INSTRUCTIONS:
Step 1 — Identify the outcome variable from study config.
Step 2 — Identify firmographic defining dimensions (company size, seniority, team size, budget authority).
Step 3 — Group thematic themes from Module A (historical evaluation) into composite defining dimensions. Apply variance filter logic.
Step 4 — Classify Module B themes as profiling (temporal contamination).
Step 5 — Check defining count against the {max_dimensions} ceiling; remove weakest differentiators if over.
Step 6 — Assign all remaining themes and fields to profiling dimensions. Nothing may be left unclassified.

Output JSON: {"dimensions": [{name, label, purpose, type, component_themes, source_questions, logic, rationale}], "defining_count", "max_dimensions_applied", "architecture_notes"}
Why Opus for dimension architecture: An error here — classifying a Module B variable as defining, or including a near-zero-variance binary as a defining dimension — will silently distort every downstream segment. The error cannot be recovered by the segmentation pipeline. Opus is used for irreversible, high-stakes structural decisions.
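Because a misclassified dimension silently distorts every downstream segment, a reviewer can script the mechanical checks before the human review gate. This is a sketch assuming the JSON schema shown in the prompt above; `validate_dimensions` and its argument names are illustrative.

```python
def validate_dimensions(dimensions, max_dimensions, module_b_questions):
    """Mechanical sanity checks on the dimension architect's output.

    Assumes each dimension dict has 'name', 'purpose' (one of
    defining/outcome/profiling), and 'source_questions'.
    Returns a list of human-readable problems (empty list = clean).
    """
    problems = []
    defining = [d for d in dimensions if d["purpose"] == "defining"]
    # N/10 ceiling check
    if len(defining) > max_dimensions:
        problems.append(
            f"{len(defining)} defining dimensions exceeds N/10 ceiling of {max_dimensions}"
        )
    # Temporal contamination check: Module B sources must not be defining
    for d in defining:
        contaminated = set(d.get("source_questions", [])) & set(module_b_questions)
        if contaminated:
            problems.append(
                f"{d['name']}: Module B questions {sorted(contaminated)} used as defining"
            )
    return problems
```

These checks do not replace the human review; they only guarantee the two hard rules (ceiling, temporal contamination) are never violated by an otherwise plausible-looking output.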

Human Review Priorities

  • Boundary-disagreement cases flagged by the reconciliation agent — where the instrument is weakest
  • Codes proposed by only one agent — assess whether they are genuinely distinct
  • Non-thematic question structures (categories, buckets, binary criteria) from Step 1.2 — the classification agent's proposals stand by default but should be verified
  • Dimension classifications (defining vs. outcome vs. profiling) — any Module B variable classified as defining is a red flag
The most important check: Read each code definition and test it against 3-5 actual participant responses. If you cannot confidently predict whether a response would receive the code, the definition needs sharpening. A weak definition produces systematic error across the entire dataset.
Output → A6 Data Files - Simulated/
codebook.json — final approved codebook with dimensions section
2

Codebook Application

The application phase applies the finalized codebook to all participant transcripts using two independent coding agents, calculates inter-rater reliability, and resolves disagreements. Script: run_coding.py

2.1

Dual-Agent Coding

Two independent agents apply the codebook to every participant

Two independent agents each receive the finalized codebook and all participant responses. Each agent independently processes every participant's responses to every thematic question and produces a participant-level output: for each participant, which codes apply.

Unit of analysis: Agents use meaning unit segmentation internally as a thinking tool, but they report at the participant level: which codes apply to this participant's responses to this question. This is the correct unit of analysis for both Kappa calculation and downstream frequency analysis.

Why Dual Agents

Independent application by two agents replicates the intercoder reliability design from qualitative research (O'Connor & Joffe, 2020). Disagreements between the two agents flag cases where the codebook definition is ambiguous enough to produce different readings — exactly the cases that need a resolver and may warrant codebook refinement.

Concurrency: Both coding agents run in parallel using a semaphore-controlled thread pool (18 concurrent workers). The API output token rate limit is the binding constraint, not compute.

Output → A6 Data Files - Simulated/
agent-1-codes.json, agent-2-codes.json
2.2

Kappa Quality Gate

Inter-rater reliability calculated at the participant × code level

After both agents have coded all participants, Cohen's Kappa is calculated per code and as an overall weighted average at the participant × code level.

Why participant × code, not meaning unit × code: Our downstream analysis asks "what percentage of participants expressed theme X" — a participant-level question. Kappa at the meaning unit level would be influenced by segmentation variability between agents. Participant-level Kappa measures what actually matters.

0.81+ — Almost Perfect
0.61–0.80 — Substantial
0.41–0.60 — Moderate
Below 0.65 — Flag for Review

Source: Landis & Koch, 1977
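For binary participant × code cells, Cohen's Kappa reduces to a short calculation. A minimal self-contained sketch, not the pipeline's actual reliability module:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over binary participant x code cells.

    rater_a and rater_b are equal-length sequences of 0/1 assignments,
    e.g. the flattened participant x code cells for a single code.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    pa, pb = sum(rater_a) / n, sum(rater_b) / n
    # Chance agreement: both say 1, plus both say 0
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1:
        return 1.0  # degenerate case: no variation at all
    return (observed - expected) / (1 - expected)
```

Per-code kappas computed this way can then be weighted by code prevalence to produce the overall average the quality gate reports.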

Benchmark: The HR Leaders BambooHR study achieved an overall weighted Kappa of 0.909. The four codes that fell below threshold were definition problems, not model capability problems — Opus would not have fixed them.
On the independence limitation: Two LLM agents running the same underlying model are not independent in the same way two human coders are. Kappa scores are valid as internal quality metrics but should be noted as LLM-to-LLM agreement if presented to a research-savvy client.
Output → A6 Data Files - Simulated/
reliability.txt, reliability-summary.json, flagged-items.json
2.3

Disagreement Resolution

A third agent resolves every contested coding decision

For any participant × code combination where the two agents disagreed, a third resolver agent reviews both agents' reasoning alongside the participant's actual transcript and the codebook definition, and makes a final determination.

The resolver records its reasoning in the output file alongside the final code assignment. This creates an audit trail for every contested coding decision.

Output → A6 Data Files - Simulated/
application-coding-detail.json — full per-agent detail with resolver notes coding-summary.md — run summary with participant counts and quality metrics
3

Dataset Assembly

Coded data is assembled into a flat participant-by-code matrix — the single source of truth for all analysis and reporting downstream. Scripts: build-master-dataset.py and build-frequency-report.py

3.1

Master Participant Dataset

One row per participant — grows as analysis progresses

Takes final_codes.json, codebook.json, and roster.json and assembles a flat CSV with one row per participant.

Column Structure (in order)

  • participant_id
  • All roster variables (company size, industry, current tool, seniority, etc.)
  • All non-thematic coded variables as individual columns (categorical, ordinal, binary)
  • For each thematic question × each code: {qid}_{code_name_snake_case} = 0 or 1
  • Factor scores appended after PCA (one column per factor)
  • Segment assignment appended after clustering
Why question × code columns: A participant who mentioned "reporting analytics gaps" in a frustrations question is analytically different from one who mentioned it in an evaluation criteria question. Storing Q5_reporting_analytics_gap and Q8_reporting_analytics_gap as separate columns preserves that distinction. Collapsing to a single column per code loses it permanently.

With 140 codes across 5 thematic questions, this produces approximately 700 binary columns. This is correct — do not collapse them.
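The column construction can be sketched as follows, assuming `codebook` lists every possible code per question so that absent codes are written as explicit zeros rather than missing cells. `snake` and `build_row` are illustrative names, not the pipeline's actual functions.

```python
def snake(code_name):
    """'Reporting & Analytics Gap' -> 'reporting_and_analytics_gap'."""
    return code_name.lower().replace("&", "and").replace(" ", "_")

def build_row(participant_id, roster_row, assigned_codes, codebook):
    """One master-dataset row: roster fields plus {qid}_{code} binaries.

    codebook: {question_id: [every code name for that question]}
    assigned_codes: {question_id: set of codes this participant received}
    """
    row = {"participant_id": participant_id, **roster_row}
    for qid, codes in codebook.items():
        hits = assigned_codes.get(qid, set())
        for code in codes:
            # Explicit 0/1 for every question x code combination
            row[f"{qid}_{snake(code)}"] = int(code in hits)
    return row
```

Iterating over the full codebook (not just the assigned codes) is what guarantees every row has the same ~700 columns, which the downstream frequency and segmentation scripts depend on.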

Output → A6 Data Files - Simulated/
master-participants.csv — one row per participant, all coded variables + roster
3.2

Frequency Report

First analytical view of the data — used before writing any client section

Reads master-participants.csv and codebook.json and produces code frequency tables and cross-tabs by firmographic variable.

What to Look For

  • The 5-8 most prevalent themes overall — these anchor the narrative
  • Themes with the most striking cross-tab differences by firmographic or segment variable — these become the report's key analytical findings
  • Themes that appear primarily in one question context vs. across multiple questions (pervasive vs. situational)
Output → A6 Data Files - Simulated/
frequency-report.html — sortable tables, bar charts, collapsible cross-tab sections frequency-report.csv — machine-readable flat version
4

Segmentation

Defining dimensions are identified and validated, PCA reduces the variable space, and k-means clustering discovers natural participant groups. Scripts: segment-prep.py and run-segmentation.py

4.1

Dimension Classification and Variance Check

Identifying which variables go into clustering

The codebook's dimensions section classifies each coded variable. The segment-prep script reads these classifications and applies variance filters to produce a clean input for clustering.

Three Variable Types

  • Defining — go into clustering. Must vary meaningfully across participants, predict different purchase behavior, and not be redundant with another variable already in the set.
  • Outcome — do NOT go into clustering. Current tool adoption, evaluation status, WTP. Used after clustering to validate that segments predict something useful. Putting outcomes into clustering builds the answer into the question.
  • Profiling — do NOT go into clustering. Used to describe and communicate what each segment looks like for client deliverables.
The Module A/B rule: Module A variables (historical evaluation triggers, criteria, rejection reasons) are defining variable candidates — they capture state of mind that preceded the purchase. Module B variables (current satisfaction, frustrations) are profiling by default — they are downstream of the tool choice and contaminated by reverse causality.

Variance Filter

Binary variables outside the 20-80% prevalence range are excluded from clustering. A variable where 95% of participants scored 1 carries almost no discriminating power.

N/10 rule: Maximum defining dimensions = sample_size / 10. With 180 participants, maximum 18 defining dimensions. More than this produces unstable clustering at our sample sizes.
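A minimal sketch of the variance filter, assuming binary columns stored as lists of 0/1 values; the function and argument names are illustrative.

```python
def variance_filter(binary_columns, low=0.20, high=0.80):
    """Drop binary defining candidates outside the 20-80% prevalence band.

    binary_columns maps column name -> list of 0/1 values.
    Returns (kept, dropped_with_reason).
    """
    kept, dropped = [], {}
    for name, values in binary_columns.items():
        prevalence = sum(values) / len(values)
        if low <= prevalence <= high:
            kept.append(name)
        else:
            dropped[name] = (
                f"prevalence {prevalence:.0%} outside {low:.0%}-{high:.0%}"
            )
    return kept, dropped
```

The N/10 ceiling is then applied to the surviving list (e.g. at most `sample_size // 10` dimensions, dropping the weakest differentiators first if over).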

Output → A5 Report Segmentation - Simulated/
segmentation-ready.csv — defining dimensions after variance filter segmentation-validation-report.txt — variance check results
4.2

PCA and K-Means Clustering

Discovering natural participant groups from defining dimensions

The run-segmentation script standardizes the defining dimensions, reduces with PCA, and runs k-means clustering with automatic k selection.

Steps

  • Standardization: All defining dimension columns scaled to mean 0, standard deviation 1
  • PCA: Reduces defining dimensions to fewer orthogonal factors capturing the variance structure
  • K selection: Silhouette score computed for k=2 through max_k (default 6); the k with the highest score is selected. Silhouette measures how similar each participant is to their own cluster vs. other clusters
  • K-means on PCA factor scores using the selected k
  • Segment assignment written back to master-participants.csv

Why PCA before clustering: Clustering directly on many binary dimensions suffers from the curse of dimensionality — distance metrics become less meaningful as dimensions increase. PCA reduces the space to a manageable number of orthogonal factors while preserving most of the variance.

Bootstrap stability (Dolnicar et al., 2018): For production deliverables, draw 200+ bootstrap resamples, re-run clustering, measure Jaccard stability index. A stability index above 0.75 is required before presenting segment solutions. Below 0.6 — consolidate to a simpler k.
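The per-cluster Jaccard comparison underlying the stability index can be sketched as follows; averaging the best-match scores over 200+ bootstrap resamples gives the index compared against the 0.75 and 0.6 thresholds above. Names are illustrative.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of participant ids."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster_stability(original_clusters, bootstrap_clusters):
    """Best-match Jaccard per original cluster against one bootstrap solution.

    Each argument is a list of clusters, each cluster a collection of
    participant ids. A stable cluster keeps recovering a near-identical
    membership across resamples.
    """
    return [
        max(jaccard(orig, boot) for boot in bootstrap_clusters)
        for orig in original_clusters
    ]
```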
Output → A5 + A6
segmentation-assignments.csv — segment assignments with factor scores
segmentation-pca-report.txt — PCA statistics, silhouette scores, cluster profiles
master-participants.csv — factor scores + segment assignment appended
4.3

Segment Profiling and Validation

Checking that segments are meaningful and actionable

After clustering, each segment is cross-tabbed against outcome variables and profiling variables to build segment descriptions and validate the solution.

Kotler's Five Criteria (Kotler & Keller, 2016)

  • Measurable — size and characteristics can be quantified
  • Substantial — large enough to warrant a distinct strategy (minimum 8-10% of sample)
  • Accessible — reachable through distinct channels or sales motions
  • Differentiable — responds differently to the marketing mix
  • Actionable — effective programs can be designed for each segment
Lead with relative risk ratios, not raw percentages. "Segment A is 3.8x more likely to be evaluating alternatives" requires no external market estimate, is robust to oversampling, and translates directly into sales prioritization. Include raw percentages alongside for transparency.
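The relative risk ratio is simply the outcome prevalence inside the segment divided by the prevalence in the rest of the sample. A minimal sketch with hypothetical column names:

```python
import pandas as pd

def relative_risk(df, segment_col, segment, outcome_col):
    """P(outcome | in segment) / P(outcome | rest of sample)."""
    in_seg = df[segment_col] == segment
    return df.loc[in_seg, outcome_col].mean() / df.loc[~in_seg, outcome_col].mean()

# Toy data: 4/5 in segment A vs 3/10 in the rest -> 0.8 / 0.3
df = pd.DataFrame({
    "segment": ["A"] * 5 + ["B"] * 10,
    "evaluating_alternatives": [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
})
print(round(relative_risk(df, "segment", "A", "evaluating_alternatives"), 2))  # → 2.67
```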

Observable Identifiers (for Sales)

For each segment, identify 2-3 observable proxies a sales rep can assess without a full research interview: company size (LinkedIn), seniority and job title, industry/company type, tech stack (G2, job postings), buying signals (recent funding, tool migration postings).

5

Frequency and Cross-Tab Analysis

Before writing any section of the client report, the frequency report reveals which themes are prevalent, which are rare, and where the most interesting cross-tab differences appear.

5.1

Finding the Key Insights

What to look for in the frequency report before writing the report narrative

What to Look For

  • The 5-8 most prevalent themes overall — these anchor the executive summary and key findings
  • Themes with the most striking cross-tab differences by segment or firmographic — these become the report's analytical centerpiece
  • Themes appearing primarily in one question context vs. across multiple questions — pervasive themes vs. situational themes
  • Themes with high frequency in a specific segment but low frequency overall — the segment-specific story
Cross-tab highlighting direction (critical lesson from HR Leaders BambooHR): Highlighting direction must match percentage direction. Row percentages = highlight top 2 values across each row. Column percentages = highlight top 2 values down each column. This error affected all four original cross-tabs and had to be corrected retroactively.
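The rule can be encoded directly: compute percentages with the chosen normalization, then rank along the same axis the percentages sum over. A pandas sketch with hypothetical data (note that ties can flag more than two cells):

```python
import pandas as pd

def top2_mask(pct, axis):
    """Top-2 highlight mask. axis=1 for row percentages (top 2 across each
    row), axis=0 for column percentages (top 2 down each column)."""
    return pct.rank(axis=axis, ascending=False, method="min") <= 2

df = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "B", "C"],
    "theme":   ["cost", "ux", "cost", "speed", "ux", "speed"],
})

# Row percentages -> highlight across rows (axis=1)
row_pct = pd.crosstab(df["segment"], df["theme"], normalize="index")
row_highlight = top2_mask(row_pct, axis=1)

# Column percentages -> highlight down columns (axis=0)
col_pct = pd.crosstab(df["segment"], df["theme"], normalize="columns")
col_highlight = top2_mask(col_pct, axis=0)
```

Keeping the normalization and the ranking axis paired in one helper is what prevents the direction mismatch described above.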

Competitive Vulnerability Analysis

When rejection reason data exists and at least 3 competitors have rejection sample sizes of n ≥ 10, build a competitive vulnerability summary showing each competitor's top 3 rejection reasons with positioning angles for the client's sales team.

Include when rejection reasons show differentiated patterns across competitors and the client's strengths (pricing, implementation speed, ease of use) map to competitors' top rejection reasons.
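The inclusion criteria can be checked mechanically before building the table. A sketch with hypothetical column names and data:

```python
import pandas as pd

def eligible_competitors(rejections, min_n=10, min_competitors=3):
    """Brands with at least min_n rejection mentions; empty if fewer than
    min_competitors qualify (the analysis is skipped in that case)."""
    counts = rejections.groupby("brand").size()
    eligible = counts[counts >= min_n].index.tolist()
    return eligible if len(eligible) >= min_competitors else []

rejections = pd.DataFrame({
    "brand":  ["Alpha"] * 12 + ["Beta"] * 11 + ["Gamma"] * 10 + ["Delta"] * 4,
    "reason": ["price"] * 20 + ["complexity"] * 17,
})
eligible = eligible_competitors(rejections)  # Delta (n=4) is excluded

# Top 3 rejection reasons per eligible competitor
top3 = (rejections[rejections["brand"].isin(eligible)]
        .groupby("brand")["reason"].value_counts()
        .groupby(level=0).head(3))
```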

6

Report Building

The client report is an HTML document built with Astro and deployed to Cloudflare Pages. It is the primary deliverable of the study — interactive, printable, and structured to serve marketing, product, and sales simultaneously.

6.1

10-Section Report Structure

Standard structure developed from HR Leaders BambooHR study
  • 1. Executive Summary — research objective, methodology stat cards, key findings (insight boxes), segment profile, strategic implications, priority recommendations table, suggested next steps
  • 2. Team-Specific Recommendations — Marketing/Product/Sales each with strategic layer (3 insight boxes) and tactical layer (numbered table: Category, Recommendation, Source)
  • 3. Participants — current tool landscape (stacked bars by company size), sample profile demographics split by segment (seniority, function, company size, industry as side-by-side bar charts)
  • 4. Current Approach: Satisfaction, Frustrations and Goals — satisfaction by tool (sentiment bars), frustration charts (stacked by intensity: gray/amber/red), good outcomes cross-tab, representative quotes
  • 5. Category Entry Points — evaluation triggers and motivations (bar chart), brands considered by motivation (cross-tab), timeline, tools considered count
  • 6. Purchase Push and Pull — head-to-head win/loss table, incumbent retention table, attraction factors, brand by attraction factor (cross-tab), rejection reasons by brand (bar charts), competitive vulnerability summary table
  • 7. Brand Perceptions — awareness stat cards, sentiment by brand (users vs. non-users), perception themes by brand (side-by-side bar charts)
  • 8. Spend and WTP — spend distribution, WTP distribution (bar charts in two-column layout)
  • 9. Business Challenges — challenge frequency (bar chart), users vs. non-users table, challenges × frustrations cross-tab
  • 10. Segmentation — segment overview table, cross-tabs by frustration/competitive landscape/perceptions/challenges, implications insight boxes
6.2

Chart and Layout Conventions

Consistent visual treatment across all report sections

Bar Charts

  • Percentages displayed right-justified inside the bar fill (white text) — no separate label outside the bar
  • Top bar = 100% width; all others scale proportionally to the maximum count

Sort Orders

  • Seniority: Manager, Director, VP (ascending hierarchy)
  • Company size: 100-500, 500-2,000, 2,000-10,000 (ascending)
  • Industry: consistent order across all segments; show 0% bars for categories with no respondents (do not omit)

Stacked Intensity Bars (Frustrations)

  • Gray = low intensity, amber = moderate, red = high
  • Total count displayed to the right of each bar
  • Percentages inside segments use that item's total as denominator (not full sample)
Source notes: Every chart and table cites exact question numbers and full question wording. Insight boxes: at least one per section with a strategic recommendation.
6.3

PDF Print Optimization

Hard-won lessons from HR Leaders BambooHR — apply these from the start
Tables: Do NOT use break-inside: avoid on tables. Large tables push entirely to the next page, creating huge white gaps. Let tables break between rows; protect individual rows with break-inside: avoid on tr.

The Heading Chain

Use this CSS pattern to keep headings with their charts:

h2, h3, h4,
h3 + p, h4 + p,
.sentiment-legend, .stacked-legend {
  page-break-after: avoid;
  break-after: avoid;
}

This creates a chain: heading stays with description, description stays with legend/chart.

Cross-Tabs on Landscape Pages

  • Wrap each h3 + p + .crosstab in a <div class="crosstab-section">
  • Apply page: landscape to the wrapper, NOT to .crosstab alone (otherwise the title stays on the portrait page)
  • Use table-layout: fixed and width: 100%; set font size to 9px for data cells and 8px for headers
  • Allow headers to wrap with white-space: normal

Other Critical Settings

  • Page margins: @page { margin: 16mm; }
  • Preserve colors: add !important on all background colors + -webkit-print-color-adjust: exact !important on *
  • Section breaks: details.accordion { page-break-before: always; } with first-of-type excluded
  • Force accordions open in print: display: block !important on details and body
6.4

Deployment

Build and deploy to Cloudflare Pages
Steps
1. npm run build (inside the report Astro folder)
2. export CLOUDFLARE_API_TOKEN=... && export CLOUDFLARE_ACCOUNT_ID=... (explicit export — source .env does NOT work)
3. npx wrangler pages deploy dist --project-name [name] --branch master
4. git commit and push to GitHub
First deploy: Projects must be created before the first deployment. Run wrangler pages project create [name] --production-branch master before the deploy command.

Pipeline Outputs

All data files produced by a complete study

All output files go into A6 Data Files - Simulated/ or B6 Data Files - Real/ for new studies. Segmentation outputs go into the A5/B5 folder. (The HR Leaders BambooHR study uses A3/A4/A5 — a pre-convention study that is not being reorganized.)

File | Created by | Contents
study-context.json | Discovery 1.1 | Who was interviewed, communication patterns, dominant topics
questions-registry.csv | Discovery 1.2 | One row per question with coding type and code count
meaning-units-log.csv | Discovery 1.3 | All meaning units with exact quotes and descriptive labels
codebook-audit-trail.csv | Discovery 1.5-1.6 | Code evolution: both agent drafts + reconciliation decisions with reasoning
codebook.json | After human review | Final approved codebook with all code definitions, criteria, and examples
agent-registry.json | Discovery + Application | Full record of every agent used: model, temperature, persona hash, role, run date
agent-1-codes.json | Application 2.1 | Raw coding output from Agent 1
agent-2-codes.json | Application 2.1 | Raw coding output from Agent 2
application-coding-detail.json | Application 2.3 | Full per-agent detail with resolver notes for every contested decision
reliability.txt | Application 2.2 | Human-readable Kappa report
reliability-summary.json | Application 2.2 | Machine-readable Kappa data per code
flagged-items.json | Application 2.2 | Codes with Kappa below 0.65 and ambiguous definitions
coding-summary.md | Application end | Run summary with participant counts and quality metrics
master-participants.csv | Dataset 3.1 (grows) | One row per participant, all coded variables + roster; grows with factor scores and segment assignments
frequency-report.html | Dataset 3.2 | Sortable tables, bar charts, collapsible cross-tab sections by firmographic variable
frequency-report.csv | Dataset 3.2 | Machine-readable frequencies
segmentation-ready.csv | Segmentation 4.1 | Defining dimensions after variance filter (A5 folder)
segmentation-validation-report.txt | Segmentation 4.1 | Variance checks on all coded variables (A5 folder)
segmentation-assignments.csv | Segmentation 4.2 | Segment assignments with factor scores (A5 folder)
segmentation-pca-report.txt | Segmentation 4.2 | PCA statistics, silhouette scores, cluster profiles (A5 folder)

Research Citations

Methodological grounding

Every major design decision in this pipeline traces to published research. These are the sources cited when explaining the methodology to clients or peer reviewers.

Bonoma, T.V. & Shapiro, B.P. (1984). Evaluating Market Segmentation Approaches. Industrial Marketing Management, 13(4), 257-268. — Nested segmentation model, 3-4 segment expectation for B2B markets.
Braun, V., & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in Psychology, 3(2), 77-101. — Cited as the method we deliberately do not use for quantified studies.
Cortez, R., Clarke, A.H. & Freytag, P.V. (2021). B2B market segmentation: A systematic review and research agenda. Journal of Business Research, 126, 415-428. — Four-phase segmentation framework.
Dolnicar, S., Grun, B. & Leisch, F. (2018). Market Segmentation Analysis. Springer. — Bootstrap stability analysis, Jaccard stability threshold of 0.75.
Gao, J., et al. (2024). CollabCoder: A lower-barrier, rigorous workflow for inductive collaborative qualitative analysis with LLMs. CHI 2024. — Multi-agent qualitative coding design; basis for dual-agent codebook construction.
Graneheim, U.H., & Lundman, B. (2004). Qualitative content analysis in nursing research. Nurse Education Today, 24, 105-112. — Meaning unit definition and segmentation standards.
Hsieh, H.-F., & Shannon, S.E. (2005). Three approaches to qualitative content analysis. Qualitative Health Research, 15(9), 1277-1288. — Directed Content Analysis as the codebook-building philosophy.
Kotler, P. & Keller, K.L. (2016). Marketing Management (15th ed.). Pearson. — Five criteria for segment quality: measurable, substantial, accessible, differentiable, actionable.
Landis, J.R., & Koch, G.G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174. — Kappa agreement thresholds (0.81+ almost perfect; 0.61-0.80 substantial).
Mayring, P. (2000). Qualitative content analysis. Forum: Qualitative Social Research, 1(2). — Rule-based code definitions with inclusion/exclusion criteria; frequency analysis as a core output.
O'Connor, C., & Joffe, H. (2020). Intercoder reliability in qualitative research: Debates and practical guidelines. International Journal of Qualitative Methods, 19. — Justification for independent dual-agent coding and Kappa calculation.
Ritchie, J., & Spencer, L. (1994). Qualitative data analysis for applied policy research. In Bryman & Burgess (Eds.), Analysing Qualitative Data. Routledge. — Framework Analysis; participant × code matrix for systematic cross-case comparison.
Saldana, J. (2016). The Coding Manual for Qualitative Researchers (3rd ed.). SAGE. — First-cycle coding; descriptive labels as a distinct step from final codes.
Wedel, M. & Kamakura, W.A. (2000). Market Segmentation: Conceptual and Methodological Foundations (2nd ed.). Kluwer Academic. — Statistical criteria for k selection; substantiality filter.
Yankelovich, D. & Meer, D. (2006). Rediscovering Market Segmentation. Harvard Business Review, 84(2), 122-131. — Segmentation purpose must precede method; most segmentation fails because it is designed without a clear downstream decision in mind.