A complete methodology for running large-scale qualitative interview studies — from participant simulation through final client report — with full methodological grounding and quality standards at every step.
Pipeline Overview
The pipeline is hybrid by design: qualitative rigor at the codebook-building stage, quantitative discipline at the analysis and reporting stage. That combination makes it possible to run 150-200 participant studies and produce defensible, statistically grounded claims — while maintaining the insight depth clients expect from qualitative research.
Design the interview guide, configure the study, set up the roster, simulate participants for pipeline testing, and process real Maze transcripts when real interviews are complete.
Seven sub-steps convert raw participant responses into a finalized codebook instrument — the most methodologically intensive phase of the pipeline.
Two independent agents apply the finalized codebook to all participants. Inter-rater reliability is calculated, disagreements are resolved, and quality flags are issued.
Coded data is assembled into a flat participant-by-code matrix — the single source of truth for all reporting and analysis downstream.
Defining dimensions are identified and validated, PCA reduces the variable space, and k-means clustering discovers natural participant groups.
Frequency and cross-tab analysis surfaces the key findings. The client report is built as an interactive HTML document, deployed to Cloudflare Pages.
Methodological Foundation
The pipeline does not use Braun and Clarke's reflexive thematic analysis — that method was designed for interpretive, meaning-centered research and its authors explicitly argue that counting theme frequencies does not add analytic value. Our goals require a different foundation: methods built from the ground up to support systematic coding that produces comparable, quantifiable data across participants.
Codes are built inductively on the first study, then treated as a fixed measurement instrument for all subsequent studies in the same domain. This is what makes cross-study comparison valid. (Hsieh & Shannon, 2005)
Originally developed for large-scale applied policy research. Produces a participant × code matrix enabling systematic cross-case comparison — the "how does this theme vary by segment" question. (Ritchie & Spencer, 1994)
Structured, rule-based approach that explicitly bridges qualitative interpretation and quantitative analysis. Each code requires a definition, decision rules, and inclusion/exclusion criteria. (Mayring, 2000)
Multiple independent agents with differentiated personas produce codebooks of comparable quality to expert human-coded codebooks, when structured reconciliation is applied. (CollabCoder — Gao et al., CHI 2024)
Before any analysis can run, the study must be designed to generate the right data. Every downstream analysis decision — what to code, what to segment on, what to report — flows from what questions were asked. A poorly designed guide cannot be rescued by better analysis.
The interview guide is the instrument. Every downstream analysis decision flows from what questions were asked. There are three question modules, and the distinction between them is the most important design decision in the study.
When writing evaluation trigger questions, use a Jobs-to-be-Done framing: what was the participant trying to accomplish, what context made them start looking, what would need to be true for them to take action? People do not naturally articulate evaluation criteria — they tell stories about situations. The stories contain the criteria.
Before running real interviews, the full pipeline is tested with simulated participants. Simulation allows you to catch bugs, validate script paths, and build an initial codebook before spending budget on real transcripts.
Simulation is not one-size-fits-all. Before running, explicitly decide on these roster parameters:
Decide these before launching and document them in the study's roster design notes. The simulation is only as realistic as the roster it draws from.
Each participant is simulated with one of three verboseness levels. These defaults are empirically calibrated from a real Maze study.
| Level | Share | Observed avg | Observed range |
|---|---|---|---|
| Not Verbose | ~34% | 808 words | 500–1,200 |
| Somewhat Verbose | ~43% | ~1,500 words | 1,200–2,100 |
| Very Verbose | ~23% | 2,100+ words | 1,900–3,000 |
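The share-based assignment above can be sketched as follows. The helper name `assign_verboseness` and its rounding/padding behavior are illustrative assumptions, not the pipeline's actual roster code:

```python
import random

# Target shares from the calibration table above.
VERBOSENESS_LEVELS = [
    ("not_verbose", 0.34),       # ~808 words observed
    ("somewhat_verbose", 0.43),  # ~1,500 words observed
    ("very_verbose", 0.23),      # 2,100+ words observed
]

def assign_verboseness(n_participants: int, seed: int = 7) -> list[str]:
    """Assign a verboseness level to each simulated participant,
    matching the target shares as closely as rounding allows."""
    levels = []
    for name, share in VERBOSENESS_LEVELS:
        levels.extend([name] * round(n_participants * share))
    # Rounding can leave the list short; pad with the modal level.
    while len(levels) < n_participants:
        levels.append("somewhat_verbose")
    rng = random.Random(seed)   # fixed seed keeps rosters reproducible
    rng.shuffle(levels)
    return levels[:n_participants]
```

Shuffling with a fixed seed keeps the roster reproducible across pipeline test runs while avoiding any ordering pattern in the simulated transcripts.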
Simulated participants are assigned realistic demographic variation from the roster parameters above. The simulation agent answers as a realistic professional in this domain — with the vocabulary, concerns, and communication patterns that role and domain entail.
Real interviews conducted via Maze are exported as a CSV where each column is one participant's transcript. The Maze export has a known inconsistency: some participants have their full transcript in a single cell; others have it split across multiple rows.
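A minimal normalization sketch for that inconsistency, assuming a plain CSV with a one-row header of participant labels (the helper name and layout details are assumptions, not the Maze export spec). Joining every non-empty cell in a column handles both the single-cell and the multi-row case:

```python
import csv

def load_maze_transcripts(path: str) -> dict[str, str]:
    """Normalize an export where each column is one participant.
    Some cells hold the full transcript; others spill across rows.
    Joining all non-empty cells per column covers both cases."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]
    transcripts = {}
    for col, participant in enumerate(header):
        cells = [r[col].strip() for r in body
                 if col < len(r) and r[col].strip()]
        transcripts[participant] = "\n".join(cells)
    return transcripts
```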
The discovery phase converts raw participant responses into a structured codebook — the instrument that defines exactly what themes exist, how they are bounded, and what counts as an inclusion. Every downstream analysis depends on getting this right. Script: run_discovery.py
Before any other coding step runs, one Sonnet agent reads a random sample of 50 participant responses and produces a structured four-section study context document.
Agent: context_generator

The context document is descriptive, not prescriptive: it tells downstream agents what the study population is like — not which codes to create. Analytical conclusions come from the data.
Classification runs in two sequential sub-steps. Part 1a determines question type cheaply. Part 1b defines the structural codebook for non-thematic questions with full data coverage.
Agent: extractor_1

Why a single classification agent: Classification is a structured decision with a finite, well-defined outcome space — the four types are exhaustive and mutually exclusive. It is not an interpretive judgment the way codebook construction is. Two agents classifying the same question would almost always agree; the rare disagreement would be on edge cases better resolved by reading more responses, not by running a second agent. The complexity and cost of dual-agent classification with reconciliation is not justified by the improvement in output quality.
One Sonnet agent processes all participant responses to thematic questions. Before reading any responses, it receives the study context document from Step 1.1 — who was interviewed, how they communicate, and what topics they discuss. This primes the agent with the professional vocabulary and communication style of the participants so it can make better interpretation decisions.
Agent: extractor_1

Dunivin (2024), Scalable qualitative coding with LLMs, establishes the central principle: LLMs require more precise codebook descriptions than human coders do because they lack the contextual understanding human coders develop through training and discussion. Every guard rule below is a precision instruction the model would otherwise miss.
Reliability target grounding. Dunivin (2024) reports that GPT-4 with chain-of-thought prompting achieved Cohen's κ ≥ 0.79 (excellent agreement) on 3 of 9 codes and κ ≥ 0.6 (substantial) on 8 of 9 codes against human coders. Our HR Leaders BambooHR study achieved an overall weighted κ of 0.909 — above the strongest results in the published literature. This is why we are confident the single-extractor design is sufficient.
Chain-of-thought grounding. Dunivin (2024) finds that requiring the model to reason about each code before assigning it improves coding fidelity. This is the basis for the landscape analysis requirement in Steps 1.4 and 1.5 — clustering and codebook construction agents must write what they observe before proposing structure.
The HR Leaders BambooHR study produced an overall weighted Kappa of 0.909. The four codes that fell below threshold were definition problems, not extraction failures — a second extractor would not have fixed them. Dual extraction was adding methodological complexity without improving downstream reliability. The complexity budget is better spent at the codebook construction step, where boundary-drawing actually matters.
Agent: clusterer

Clustering per question first reduces cognitive load on the global construction agents. Instead of receiving thousands of raw descriptive labels in one block, they receive pre-organized clusters per question — making the landscape analysis step more tractable.
Agents: global_synthesizer_parsimony, global_synthesizer_distinction, codebook_reconciler

Before the pipeline pauses for human review, two more agents run automatically after reconciliation:
- validator
- dimension_architect

The application phase applies the finalized codebook to all participant transcripts using two independent coding agents, calculates inter-rater reliability, and resolves disagreements. Script: run_coding.py
Two independent agents each receive the finalized codebook and all participant responses. Each agent independently processes every participant's responses to every thematic question and produces a participant-level output: for each participant, which codes apply.
Independent application by two agents replicates the intercoder reliability design from qualitative research (O'Connor & Joffe, 2020). Disagreements between the two agents flag cases where the codebook definition is ambiguous enough to produce different readings — exactly the cases that need a resolver and may warrant codebook refinement.
Concurrency: Both coding agents run in parallel using a semaphore-controlled thread pool (18 concurrent workers). The API output token rate limit is the binding constraint, not compute.
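A sketch of that throttling pattern, with hypothetical helper names. The point is that one shared semaphore caps total in-flight API calls even when both agents' pools run at the same time:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 18  # binding constraint: API output token rate limit
_slots = threading.Semaphore(MAX_CONCURRENT)  # shared across both agents

def _code_one(agent_call, participant_id: str, responses: str):
    """Run one coding call under the semaphore so no more than
    MAX_CONCURRENT API requests are in flight at once."""
    with _slots:
        return participant_id, agent_call(responses)

def run_coding_pass(agent_call, participants: dict[str, str]) -> dict:
    """Code all participants with one agent; both agents' passes can
    run concurrently because the semaphore, not the pool, throttles."""
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        futures = [pool.submit(_code_one, agent_call, pid, text)
                   for pid, text in participants.items()]
        return dict(f.result() for f in futures)
```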
After both agents have coded all participants, Cohen's Kappa is calculated per code and as an overall weighted average at the participant × code level.
Why participant × code, not meaning unit × code: Our downstream analysis asks "what percentage of participants expressed theme X" — a participant-level question. Kappa at the meaning unit level would be influenced by segmentation variability between agents. Participant-level Kappa measures what actually matters.
Source: Landis & Koch, 1977
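At the participant × code level, per-code kappa reduces to a short calculation over two binary vectors, one entry per participant per agent. This sketch omits the overall weighted average, whose exact weighting scheme is not specified here:

```python
def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa for two binary coders over the same participants."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    p1a, p1b = sum(a) / n, sum(b) / n           # each coder's rate of 1s
    pe = p1a * p1b + (1 - p1a) * (1 - p1b)      # chance agreement
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)
```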
For any participant × code combination where the two agents disagreed, a third resolver agent reviews both agents' reasoning alongside the participant's actual transcript and the codebook definition, and makes a final determination.
The resolver records its reasoning in the output file alongside the final code assignment. This creates an audit trail for every contested coding decision.
Coded data is assembled into a flat participant-by-code matrix — the single source of truth for all analysis and reporting downstream. Scripts: build-master-dataset.py and build-frequency-report.py
Takes final_codes.json, codebook.json, and roster.json and assembles a flat CSV with one row per participant.
Column naming: {qid}_{code_name_snake_case}, with values 0 or 1. Keeping Q5_reporting_analytics_gap and Q8_reporting_analytics_gap as separate columns preserves the distinction between the same theme raised in different questions; collapsing to a single column per code loses it permanently. With 140 codes across 5 thematic questions, this produces approximately 700 binary columns. This is correct — do not collapse them.
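A sketch of the flattening step. The input shape `{participant_id: [(qid, code_name), ...]}` is an assumed simplification of final_codes.json, not its actual schema:

```python
def build_master_rows(final_codes: dict, all_columns: list[str]) -> list[dict]:
    """Flatten per-participant code assignments into one row per
    participant with a 0/1 column per (question, code) pair,
    named {qid}_{code_name_snake_case}."""
    rows = []
    for pid, assignments in final_codes.items():
        row = {"participant_id": pid}
        row.update({col: 0 for col in all_columns})  # explicit zeros
        for qid, code in assignments:
            col = f"{qid}_{code.lower().replace(' ', '_')}"
            row[col] = 1
        rows.append(row)
    return rows
```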
Reads master-participants.csv and codebook.json and produces code frequency tables and cross-tabs by firmographic variable.
Defining dimensions are identified and validated, PCA reduces the variable space, and k-means clustering discovers natural participant groups. Scripts: segment-prep.py and run-segmentation.py
The codebook's dimensions section classifies each coded variable. The segment-prep script reads these classifications and applies variance filters to produce a clean input for clustering.
Binary variables outside the 20-80% prevalence range are excluded from clustering. A variable where 95% of participants scored 1 carries almost no discriminating power.
N/10 rule: Maximum defining dimensions = sample_size / 10. With 180 participants, maximum 18 defining dimensions. More than this produces unstable clustering at our sample sizes.
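Both rules fit in a few lines. The tie-break used when more than N/10 variables survive the prevalence filter (keeping the most balanced variables first) is an assumption, not a documented pipeline rule:

```python
def select_defining_dimensions(prevalence: dict[str, float],
                               n_participants: int) -> list[str]:
    """Apply the 20-80% prevalence filter, then cap the surviving
    variables at sample_size / 10 defining dimensions."""
    kept = [c for c, p in prevalence.items() if 0.20 <= p <= 0.80]
    # Assumed tie-break: prefer variables closest to 50% prevalence.
    kept.sort(key=lambda c: abs(prevalence[c] - 0.5))
    return kept[: n_participants // 10]
```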
The run-segmentation script standardizes the defining dimensions, reduces with PCA, and runs k-means clustering with automatic k selection.
Why PCA before clustering: Clustering directly on many binary dimensions suffers from the curse of dimensionality — distance metrics become less meaningful as dimensions increase. PCA reduces the space to a manageable number of orthogonal factors while preserving most of the variance.
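A scikit-learn sketch of the standardize, PCA, k-means-with-automatic-k sequence. The 80% variance target, the k range, and the silhouette-based selection are illustrative defaults, not the script's actual settings:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def segment(X: np.ndarray, k_range=range(2, 7),
            variance_target=0.8, seed=0):
    """Standardize, keep enough PCA components to cover
    `variance_target` of the variance, then choose k by the
    best silhouette score."""
    Z = StandardScaler().fit_transform(X)
    F = PCA(n_components=variance_target).fit_transform(Z)
    best_k, best_labels, best_sil = None, None, -1.0
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(F)
        sil = silhouette_score(F, labels)
        if sil > best_sil:
            best_k, best_labels, best_sil = k, labels, sil
    return best_k, best_labels
```

Passing a float to `n_components` makes PCA keep the smallest number of components whose cumulative explained variance exceeds that fraction, which is the usual way to encode "preserve most of the variance" without hard-coding a component count.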
After clustering, each segment is cross-tabbed against outcome variables and profiling variables to build segment descriptions and validate the solution.
For each segment, identify 2-3 observable proxies a sales rep can assess without a full research interview: company size (LinkedIn), seniority and job title, industry/company type, tech stack (G2, job postings), buying signals (recent funding, tool migration postings).
Before writing any section of the client report, review the frequency report: it reveals which themes are prevalent, which are rare, and where the most interesting cross-tab differences appear.
When rejection reason data exists and at least 3 competitors have rejection sample sizes of n ≥ 10, build a competitive vulnerability summary showing each competitor's top 3 rejection reasons with positioning angles for the client's sales team.
Include when rejection reasons show differentiated patterns across competitors and the client's strengths (pricing, implementation speed, ease of use) map to competitors' top rejection reasons.
The client report is an HTML document built with Astro and deployed to Cloudflare Pages. It is the primary deliverable of the study — interactive, printable, and structured to serve marketing, product, and sales simultaneously.
Do not put break-inside: avoid on tables: large tables then push entirely to the next page, creating huge white gaps. Let tables break between rows, and protect individual rows with break-inside: avoid on tr instead. Use this CSS pattern to keep headings with their charts:
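The pattern is along these lines; the class names below are illustrative assumptions, not the report's actual selectors:

```css
/* Keep a heading with whatever follows it (description, legend, chart). */
h2, h3, .chart-description, .chart-legend {
  break-after: avoid;
}
/* Never split an individual chart block across pages. */
.chart-block {
  break-inside: avoid;
}
```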
This creates a chain: heading stays with description, description stays with legend/chart.
Additional print layout rules:
- Wrap each cross-tab in <div class="crosstab-section"> and apply page: landscape to the wrapper, NOT to .crosstab alone (otherwise the title stays on the portrait page).
- Use table-layout: fixed; width: 100%, with 9px data / 8px header font sizes and white-space: normal.
- Set @page { margin: 16mm; }.
- Use !important on all background colors, plus -webkit-print-color-adjust: exact !important on *.
- Use details.accordion { page-break-before: always; } with first-of-type excluded.
- Force display: block !important on details and body.
Deployment: run wrangler pages project create [name] --production-branch master before the deploy command.
Pipeline Outputs
All output files go into A6 Data Files - Simulated/ or B6 Data Files - Real/ for new studies. Segmentation outputs go into the A5/B5 folder. (The HR Leaders BambooHR study uses A3/A4/A5 — a pre-convention study that is not being reorganized.)
| File | Created by | Contents |
|---|---|---|
| study-context.json | Discovery 1.1 | Who was interviewed, communication patterns, dominant topics |
| questions-registry.csv | Discovery 1.2 | One row per question with coding type and code count |
| meaning-units-log.csv | Discovery 1.3 | All meaning units with exact quotes and descriptive labels |
| codebook-audit-trail.csv | Discovery 1.5-1.6 | Code evolution: both agent drafts + reconciliation decisions with reasoning |
| codebook.json | After human review | Final approved codebook with all code definitions, criteria, and examples |
| agent-registry.json | Discovery + Application | Full record of every agent used: model, temperature, persona hash, role, run date |
| agent-1-codes.json | Application 2.1 | Raw coding output from Agent 1 |
| agent-2-codes.json | Application 2.1 | Raw coding output from Agent 2 |
| application-coding-detail.json | Application 2.3 | Full per-agent detail with resolver notes for every contested decision |
| reliability.txt | Application 2.2 | Human-readable Kappa report |
| reliability-summary.json | Application 2.2 | Machine-readable Kappa data per code |
| flagged-items.json | Application 2.2 | Codes with Kappa below 0.65 and ambiguous definitions |
| coding-summary.md | Application end | Run summary with participant counts and quality metrics |
| master-participants.csv | Dataset 3.1 (grows) | One row per participant, all coded variables + roster, grows with factor scores and segment assignments |
| frequency-report.html | Dataset 3.2 | Sortable tables, bar charts, collapsible cross-tab sections by firmographic variable |
| frequency-report.csv | Dataset 3.2 | Machine-readable frequencies |
| segmentation-ready.csv | Segmentation 4.1 | Defining dimensions after variance filter (A5 folder) |
| segmentation-validation-report.txt | Segmentation 4.1 | Variance checks on all coded variables (A5 folder) |
| segmentation-assignments.csv | Segmentation 4.2 | Segment assignments with factor scores (A5 folder) |
| segmentation-pca-report.txt | Segmentation 4.2 | PCA statistics, silhouette scores, cluster profiles (A5 folder) |
Research Citations
Every major design decision in this pipeline traces to published research. These are the sources cited when explaining the methodology to clients or peer reviewers.