A complete methodology for running large-scale qualitative interview studies — from participant simulation through final client report — with full methodological grounding and quality standards at every step.
Pipeline Overview
The pipeline is hybrid by design: qualitative rigor at the codebook-building stage, quantitative discipline at the analysis and reporting stage. That combination makes it possible to run 150-200 participant studies and produce defensible, statistically grounded claims — while maintaining the insight depth clients expect from qualitative research.
Design the interview guide, configure the study, set up the roster, simulate participants for pipeline testing, and process real Maze transcripts when real interviews are complete.
Six sub-steps convert raw participant responses into a finalized codebook instrument — the most methodologically intensive phase of the pipeline.
Two independent agents apply the finalized codebook to all participants. Inter-rater reliability is calculated, disagreements are resolved, and quality flags are issued.
Coded data is assembled into a flat participant-by-code matrix — the single source of truth for all reporting and analysis downstream.
Basis variables are identified and validated, block-wise specific MCA compresses the binary code set into interpretable dimensions, and Ward's hierarchical clustering with bootstrap stability testing discovers natural participant groups.
Frequency and cross-tab analysis surfaces the key findings. The client report is built as an interactive HTML document and deployed to Cloudflare Pages.
Methodological Foundation
The pipeline does not use Braun and Clarke's reflexive thematic analysis — that method was designed for interpretive, meaning-centered research and its authors explicitly argue that counting theme frequencies does not add analytic value. Our goals require a different foundation: methods built from the ground up to support systematic coding that produces comparable, quantifiable data across participants.
Codes are built inductively on the first study, then treated as a fixed measurement instrument for all subsequent studies in the same domain. This is what makes cross-study comparison valid. (Hsieh & Shannon, 2005)

Originally developed for large-scale applied policy research. Produces a participant × code matrix enabling systematic cross-case comparison — the "how does this theme vary by segment" question. (Ritchie & Spencer, 1994)

Structured, rule-based approach that explicitly bridges qualitative interpretation and quantitative analysis. Each code requires a definition, decision rules, and inclusion/exclusion criteria. (Mayring, 2000)

Multi-perspective extraction and clustering feed a single high-capability codebook architect that runs internal parsimony and distinction passes, producing codebooks of comparable quality to expert human-coded ones. (CollabCoder — Gao et al., CHI 2024)

Before any analysis can run, the study must be designed to generate the right data. Every downstream analysis decision — what to code, what to segment on, what to report — flows from what questions were asked. A poorly designed guide cannot be rescued by better analysis.
The interview guide is the instrument. Every downstream analysis decision flows from what questions were asked. There are three question modules, and the distinction between them is the most important design decision in the study.
When writing evaluation trigger questions, use a Jobs-to-be-Done framing: what was the participant trying to accomplish, what context made them start looking, what would need to be true for them to take action? People do not naturally articulate evaluation criteria — they tell stories about situations. The stories contain the criteria.
Before running real interviews, the full pipeline is tested with simulated participants. Simulation allows you to catch bugs, validate script paths, and build an initial codebook before spending budget on real transcripts.
Simulation is not one-size-fits-all. Before running, explicitly decide on these roster parameters:
Decide these before launching and document them in the study's roster design notes. The simulation is only as realistic as the roster it draws from.
Each participant is simulated with one of three verboseness levels. These defaults are empirically calibrated from a real Maze study.
| Level | Share | Observed avg | Observed range |
|---|---|---|---|
| Not Verbose | ~34% | 808 words | 500–1,200 |
| Somewhat Verbose | ~43% | ~1,500 words | 1,200–2,100 |
| Very Verbose | ~23% | 2,100+ words | 1,900–3,000 |
Simulated participants are assigned realistic demographic variation from the roster parameters above. The simulation agent answers as a realistic professional in this domain — with the vocabulary, concerns, and communication patterns that role and domain entail.
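The roster assignment above can be sketched as a seeded draw from the calibrated shares. Everything here — the level names, the dict structure, and `assign_verboseness` — is illustrative, not the actual simulate.py interface:

```python
import random

# Verboseness levels with roster shares and target word ranges, taken from
# the calibration table above. Names and structure are illustrative.
VERBOSENESS_LEVELS = [
    ("not_verbose",      0.34, (500, 1200)),
    ("somewhat_verbose", 0.43, (1200, 2100)),
    ("very_verbose",     0.23, (1900, 3000)),
]

def assign_verboseness(n_participants: int, seed: int = 7) -> list[dict]:
    """Assign each simulated participant a verboseness level and word target."""
    rng = random.Random(seed)  # seeded so the roster is reproducible
    levels = [lvl for lvl, _, _ in VERBOSENESS_LEVELS]
    weights = [w for _, w, _ in VERBOSENESS_LEVELS]
    ranges = {lvl: rng_ for lvl, _, rng_ in VERBOSENESS_LEVELS}
    roster = []
    for pid in range(1, n_participants + 1):
        level = rng.choices(levels, weights=weights, k=1)[0]
        roster.append({"participant_id": f"P{pid:03d}",
                       "verboseness": level,
                       "target_words": ranges[level]})
    return roster
```

A fixed seed makes re-runs of the same study draw the same roster, which matters when individual batches are re-simulated after warnings.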
`research/00 How to simulate participants/simulate.py`

Participant simulation is synthetic text generation, not expert reasoning. Opus's reasoning advantage delivers effectively zero value on this task while costing ~5× Sonnet (~$29/study vs. ~$6/study for 180 participants). Haiku 4.5 would cut the cost from ~$6 to ~$2 per study but carries real risk on rule adherence: the prompt contains 11 numbered simulation rules plus the full interview guide, and smaller models tend to drop rules late in long prompts. Sonnet reliably holds all rules across the batch and produces distinct-sounding participants. The extra ~$4/study over Haiku is a cheap insurance policy on the input data every downstream pipeline step depends on.
After each batch returns, simulate.py runs a word-count check against the verboseness spec, overwrites the reported total_words with the ground-truth count, and flags any participant whose word count falls outside their target range or who answered fewer questions than the interview guide contains. Warnings print per batch but do not fail the run — the roster designer reviews warnings and decides whether to re-run individual batches.
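A minimal sketch of that post-batch check. The field names and batch shape here are assumptions for illustration, not simulate.py's actual schema:

```python
def check_word_counts(batch: list[dict], guide_question_count: int) -> list[str]:
    """Post-batch validation: recount words, overwrite the reported total
    with ground truth, and collect warnings (never fails the run)."""
    warnings = []
    for p in batch:
        true_count = sum(len(answer.split()) for answer in p["answers"])
        p["total_words"] = true_count  # overwrite the model-reported total
        lo, hi = p["target_words"]
        if not lo <= true_count <= hi:
            warnings.append(f"{p['participant_id']}: {true_count} words outside {lo}-{hi}")
        if len(p["answers"]) < guide_question_count:
            warnings.append(f"{p['participant_id']}: answered "
                            f"{len(p['answers'])}/{guide_question_count} questions")
    return warnings
```

The returned warnings are printed per batch; the roster designer decides whether any batch warrants a re-run.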
Real interviews conducted via Maze are exported as a CSV where each column is one participant's transcript. The Maze export has a known inconsistency: some participants have their full transcript in a single cell; others have it split across multiple rows.
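One way to normalize that export — assuming each header cell is a participant identifier and blank cells can be discarded — is to collapse every column into a single transcript string regardless of how Maze split it (`normalize_maze_export` is an illustrative name):

```python
import csv
from io import StringIO

def normalize_maze_export(csv_text: str) -> dict[str, str]:
    """Collapse each participant column into one transcript string, whether
    the export put the transcript in a single cell or split it across rows."""
    rows = list(csv.reader(StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    transcripts = {}
    for col, participant in enumerate(header):
        # Keep non-empty cells only; guard against ragged rows.
        cells = [r[col].strip() for r in body if col < len(r) and r[col].strip()]
        transcripts[participant] = "\n".join(cells)
    return transcripts
```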
The discovery phase converts raw participant responses into a structured codebook — the instrument that defines exactly what themes exist, how they are bounded, and what counts as an inclusion. Every downstream analysis depends on getting this right.
Hybrid execution model. Discovery runs across two modes. Steps 1.1, 1.2a, 1.2b, and 1.6 run in the Claude Code conversation window using the Claude Code subscription (no per-call API charges). Steps 1.3 (extraction) and 1.4 (clustering) run via run_extract_cluster.py on the Anthropic API because of their high parallelism. Step 1.5 (enhanced six-pass codebook construction) runs via run_codebook.py on the API because its ~60–75K token structured JSON output exceeds what can be produced reliably in a single conversation session.
Before any other coding step runs, Claude Code (running in the conversation window using the Claude Pro subscription) reads a random sample of 50 participant responses and produces a structured four-section study context document. This step used to run as a Python+API subprocess; moving it into conversation mode eliminates its API charge entirely.
`context_generator`

What it is not: Descriptive, not prescriptive. It tells downstream agents what the study population is like — not which codes to create. Analytical conclusions come from the data.
Classification runs in two sequential sub-steps, both now executed in the Claude Code conversation window using the Claude Pro subscription. Part 1a determines question type cheaply. Part 1b defines the structural codebook for non-thematic questions with full data coverage. Output is written directly to questions-registry.csv in the study data folder.
`question_classifier`

Classification is a high-leverage, low-volume decision. The pipeline makes one call per question, typically 15-25 calls per study, so the cost delta between Sonnet and Opus is trivial (roughly $1-2 extra per study). The failure mode, however, is catastrophic and silent. A thematic question misclassified as categorical collapses all inductive themes into a handful of named buckets and cannot be recovered downstream. A categorical question misclassified as thematic fills the codebook with noisy open-ended clusters that should have been clean counts. A rank-order question misclassified as thematic loses the ordinal structure entirely.
Because the classification drives every downstream branch of the pipeline, and because the error is invisible until a human reviewer catches it in the finished codebook, the strongest available reasoning model is warranted even though the task itself is usually straightforward. Opus is also better at catching the subtle case where the question text looks open-ended but the responses themselves fall cleanly into a small set of categories, or where a nominally closed question receives rich open-ended explanations that deserve thematic coding. Sonnet tends to classify from the question text alone; Opus weighs both the text and how participants actually answered.
The 5-participant sample is retained because it is sufficient for the reasoning model to spot the pattern — going larger would raise cost without materially improving accuracy on a decision this structural.
`extractor_1`

Classification is a structured decision with a finite, well-defined outcome space — the four types are exhaustive and mutually exclusive. It is not an interpretive judgment the way codebook construction is. Two agents classifying the same question would almost always agree; the rare disagreement would be on edge cases better resolved by reading more responses, not by running a second agent. The complexity and cost of dual-agent classification with reconciliation is not justified by the improvement in output quality.
One Sonnet agent processes all participant responses to thematic questions. Before reading any responses, it receives the study context document from Step 1.1 — who was interviewed, how they communicate, and what topics they discuss. This primes the agent with the professional vocabulary and communication style of the participants so it can make better interpretation decisions.
`extractor_1`

Dunivin (2024), Scalable qualitative coding with LLMs, establishes the central principle: LLMs require more precise codebook descriptions than human coders do because they lack the contextual understanding human coders develop through training and discussion. Every guard rule below is a precision instruction the model would otherwise miss.
Reliability target grounding. Dunivin (2024) reports that GPT-4 with chain-of-thought prompting achieved Cohen's κ ≥ 0.79 (excellent agreement) on 3 of 9 codes and κ ≥ 0.6 (substantial) on 8 of 9 codes against human coders. Our HR Leaders BambooHR study achieved an overall weighted κ of 0.909 — above the strongest results in the published literature. This is why we are confident the single-extractor design is sufficient.
Chain-of-thought grounding. Dunivin (2024) finds that requiring the model to reason about each code before assigning it improves coding fidelity. This is the basis for the landscape analysis requirement in Steps 1.4 and 1.5 — clustering and codebook construction agents must write what they observe before proposing structure.
The HR Leaders BambooHR study produced an overall weighted Kappa of 0.909. The four codes that fell below threshold were definition problems, not extraction failures — a second extractor would not have fixed them. Dual extraction was adding methodological complexity without improving downstream reliability. The complexity budget is better spent at the codebook construction step, where boundary-drawing actually matters.
Extraction is the highest-volume call in the entire pipeline. For a typical 180-participant × 13-thematic-question study, that is ~2,340 extraction calls — roughly 10× the number of Opus calls everywhere else in discovery combined. At ~800 output tokens each, that is ~1.9M output tokens just for extraction. Opus is ~5× the cost of Sonnet, so upgrading extraction is the single biggest cost delta you could make to the pipeline.
The HR Leaders BambooHR study hit weighted κ = 0.909 with Sonnet extraction. That is above the strongest published results (Dunivin 2024 reports GPT-4 at κ ≥ 0.79 on 3 of 9 codes). In that study's debrief, the four codes that fell below threshold were definition problems at construction time, not extraction misses. Opus at extraction would not have fixed them.
Extraction is a more mechanical task than construction. It is "find spans that express one idea, write a short label." It is not extended reasoning. Opus's reasoning advantage (GPQA Diamond, etc.) is smallest on pattern-recognition tasks like this. The reasoning-heavy step is codebook construction — which is exactly where we already spend the Opus budget.
The documented failure modes for LLM qualitative extraction are:

1. Misreading sarcasm or ironic tone as literal sentiment.
2. Missing negation, so an explicitly denied problem gets coded as present.
3. Missing hedged or implied meaning behind cautious phrasing.
4. Mishandling compound sentences that pack multiple distinct ideas into one span.
On #1 and #3 (tone and implicature), Opus is genuinely better. On #2 and #4 (structural), the gap is small — Sonnet handles these well with explicit prompt instructions, which the current extractor has. The current extractor prompt includes guard rules for sarcasm, hedging, negation, and compound sentences; if a future study surfaces a concentration of errors in the tone categories specifically, extraction can be upgraded to Opus as a targeted fix rather than a blanket cost increase.
`clusterer`

Clustering per question first reduces cognitive load on the global construction agent. Instead of receiving thousands of raw descriptive labels in one block, it receives pre-organized clusters per question — making the landscape analysis step more tractable.
Batching: One call per question. The clusterer runs once per thematic question, processing all meaning units for that question in a single Opus call. Questions are processed in parallel via thread pool, but within a question there is no internal batching. Input is capped at 500 meaning unit descriptions per question (evenly sampled if more) — comfortably within Opus's context window.
No minimum or maximum on the number of clusters. The prompt explicitly instructs the agent to let the data decide how many clusters to create. If a question's responses contain 7 distinct ideas, the agent should make 7 clusters; if they contain 35, the agent should make 35. Under-clustering is irreversible (distinctions collapsed here cannot be recovered); over-clustering is recoverable downstream. The agent is told to err toward finer-grained clusters.
Receives the study context from Step 1.1. Before reading any meaning unit descriptions, the clusterer is primed with the same who/how/what context document the extractor sees. This helps it recognize when descriptions that look superficially different are referring to the same underlying concept in the participants' shared vocabulary.
Terminology note: At this stage we deliberately use the phrase "meaning unit description" rather than "code." The word "code" is reserved for the final codebook entries produced in Step 1.5. The clusterer is grouping descriptive labels — it is not creating codes.
Why minimums of 5: The clusterer must include at least 5 representative meaning unit descriptions and at least 5 participant IDs (with quotes attached downstream) per cluster. The richer the cluster summary, the better the downstream architect can decide what to merge, split, and define. Opus is configured with a 32,000-token max output here, so the richer cluster summaries fit comfortably without crowding the budget.
`codebook_architect`

Why this step stays on the Anthropic API (not conversation mode): The enhanced output is roughly 60–75K tokens of structured JSON — themes with definitions, inclusion/exclusion criteria, adjacency tests, positive examples, and boundary examples, plus the decisions log and landscape analysis. This exceeds what can be produced reliably in a single sustained conversation session. Running it via run_codebook.py on the API produces it in one validated call; conversation-mode attempts risk dropped fields or broken schemas in the single most-critical file in the entire pipeline.
Why a single Opus agent (vs. the prior dual + reconciler architecture): The previous design used two Sonnet agents (parsimony + distinction) running in parallel and an Opus reconciler making the final calls. That architecture cost three API calls for one decision and added an integration step that could itself introduce errors. A single Opus call with internal parsimony pass + distinction critique chain-of-thought captures the bulk of the divergence-then-reconcile benefit at a fraction of the cost and complexity. Opus is strong enough to hold both lenses simultaneously inside one extended reasoning pass.
After run_codebook.py commits the global codebook, Step 1.6 runs in the Claude Code conversation window (not as a Python+API subprocess). Claude Code performs the per-question validator pass and the dimension architect pass sequentially, assembles the final codebook.json, and pauses for human review. Moving this step into conversation mode eliminates its API cost entirely.
The two reasoning passes Claude Code executes in conversation:
- `validator`
- `dimension_architect`

`codebook-audit-trail.csv` — these are the merge and split decisions the architect flagged as non-obvious, exactly where the instrument is weakest.

A low-cost dry run on 20 stratified participants before committing to the full Phase 2 coding run. Catches soft definitions before they propagate across the full dataset and become expensive arbiter calls.
Pilot calibration runs the full dual-coder application pipeline against a stratified sample of 20 participants, reviews the disagreements, and refines any code definitions that proved soft in practice — all before committing to the full ~180-participant Phase 2 run where definition problems become expensive.
Even with the enhanced Step 1.5 prompt (adjacency tests, boundary examples, banned "and" in names), some definitions only reveal their softness when two independent coders apply them to real responses. Catching those in a 20-participant pilot costs roughly 1/9th of the full run. Catching them only after a full run means re-running the arbiter across hundreds of disagreements, or worse, shipping a weaker final dataset.
- Run `run_coding.py` on the pilot subset. The application pipeline runs dual-coder extraction + Kappa + arbiter against only the 20 pilot participants. Cost is roughly $5–8 instead of ~$50 for the full run.
- Review codes carrying the `definition_ambiguous=true` flag from the arbiter. These are the codes whose definitions need tightening.

The application phase applies the finalized codebook to all participant transcripts using two independent coding agents, calculates inter-rater reliability, and resolves disagreements. Script: `run_coding.py`
Two independent agents each receive the finalized codebook and all participant responses. Each agent independently processes every participant's responses to every thematic question and produces a participant-level output: for each participant, which codes apply.
Independent application by two agents replicates the intercoder reliability design from qualitative research (O'Connor & Joffe, 2020). Disagreements between the two agents flag cases where the codebook definition is ambiguous enough to produce different readings — exactly the cases that need a resolver and may warrant codebook refinement.
- `agent_1` — `inclusion_first`: the "INCLUDE when" criteria are presented before the "EXCLUDE when" criteria in the formatted codebook, subtly biasing the reading toward applying the code when the evidence partially matches.
- `agent_2` — `exclusion_first`: the "EXCLUDE when" criteria are presented before the "INCLUDE when" criteria, subtly biasing the reading toward rejecting the code unless the evidence clearly meets the definition.

Application coding is the highest-volume phase in the pipeline. A typical study is ~180 participants × ~15 thematic questions × 2 agents = ~5,400 coding calls, plus the arbiter calls on disagreements. Running all of that on Opus would cost roughly 5× more and slow the wall-clock significantly. That cost would be hard to justify for a task where Sonnet is already strong.
Application coding is fundamentally a pattern-matching problem, not a novel reasoning problem. The codebook already exists. The inclusion and exclusion criteria, the definition, the adjacency tests, and the three positive and three boundary examples — all the hard thinking has been done upstream by the Opus codebook architect in Step 1.5. The coder's job is to read a participant's words and decide whether they match the criteria. Sonnet's accuracy on that task is close to Opus's, and the HR Leaders BambooHR simulated run hit a weighted κ = 0.909 with this exact setup, well above the Landis & Koch "almost perfect" threshold of 0.81 and well above the client-deliverable threshold set at κ ≥ 0.65. The four codes that fell below threshold in that run were definition problems in the codebook, not coder errors — Opus coders would not have rescued them.
The productive signal in dual coding is disagreement — the arbiter cannot do its job if both coders agree on everything. That disagreement is generated by prompting Agent 1 as inclusive (temperature 0) and Agent 2 as conservative (temperature 0.3), and by flipping the order in which the codebook inclusion and exclusion criteria are presented. Switching one or both agents to Opus would not create more useful disagreement; it would just make both agents more confident. The goal is two coherent but different coding philosophies, not two different raw IQs.
The one failure mode this architecture does not catch is when both Sonnet coders confidently agree on the wrong answer. No disagreement means no arbiter trigger. That risk is a codebook-quality problem, not a model-choice problem — if the definition and the positive and negative examples are sharp enough, two independent Sonnet agents with different temperatures and different emphasis orders will almost always diverge on ambiguous cases. The fix for that failure mode is better definitions in Step 1.5, which is why the codebook architect is required to provide adjacency tests plus exactly 3 positive and 3 boundary examples per code.
Concurrency: Both coding agents run in parallel using a semaphore-controlled thread pool (18 concurrent workers). The API output token rate limit is the binding constraint, not compute.
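The concurrency pattern described here can be sketched as below. The task shape and `call_fn` are placeholders for the real API calls; the cap of 18 matches the worker count above:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 18  # binding constraint is the API output-token rate, not compute

def run_coder_batch(tasks, call_fn):
    """Run coding calls in parallel behind a semaphore cap.
    `call_fn` stands in for the actual per-participant API call."""
    gate = threading.Semaphore(MAX_CONCURRENT)

    def guarded(task):
        with gate:  # at most MAX_CONCURRENT calls in flight
            return call_fn(task)

    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        # map preserves task order, so results line up with inputs
        return list(pool.map(guarded, tasks))
```

With `max_workers` equal to the semaphore count the semaphore is technically redundant, but keeping it makes the rate cap explicit and independent of pool sizing.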
After both agents have coded all participants, Cohen's Kappa is calculated per code and as an overall weighted average at the participant × code level.
Why participant × code, not meaning unit × code: Our downstream analysis asks "what percentage of participants expressed theme X" — a participant-level question. Kappa at the meaning unit level would be influenced by segmentation variability between agents. Participant-level Kappa measures what actually matters.
Source: Landis & Koch, 1977
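At the participant × code level, per-code kappa reduces to Cohen's kappa over two binary vectors (one entry per participant, per agent). A minimal self-contained version:

```python
def cohens_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Cohen's kappa for two binary ratings over the same participants."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: share of participants the two agents code identically.
    po = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each agent's marginal positive rate.
    pa1 = sum(labels_a) / n
    pb1 = sum(labels_b) / n
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    if pe == 1.0:
        return 1.0  # both raters constant and identical
    return (po - pe) / (1 - pe)
```

The overall study-level figure is then a weighted average of these per-code values.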
For any participant × code combination where the two agents disagreed, a third resolver agent reviews both agents' reasoning alongside the participant's actual transcript and the codebook definition, and makes a final determination.
The resolver records its reasoning in the output file alongside the final code assignment. This creates an audit trail for every contested coding decision.
`agent_3` — `balanced`: inclusion and exclusion criteria shown in neutral order so the arbiter reads the rulebook straight rather than with either coder's bias.

Disagreements are where the hard calls cluster. By definition, the arbiter only sees cases where two reasonable coders looked at the same evidence and came to different conclusions — the edge cases involving sarcasm, hedged language, compound statements, and borderline definition fits. This is exactly the shape of task where Opus's reasoning advantage is largest, because the decision requires weighing competing considerations rather than pattern-matching to a clear example.
The arbiter is also low-volume and high-leverage. If the codebook is clean, dual-coder agreement on most items runs 80–90%, meaning the arbiter fires on only the remaining 10–20%. On a 5,400-call study, that is roughly 540–1,080 Opus calls — a rounding error in cost — yet each of those calls directly determines a final code assignment that enters the participant database. Spending more per call on the highest-stakes decisions is exactly the right place to put the token budget.
This mirrors the architecture used on the discovery side. The Step 1.5 codebook architect is Opus because its decisions propagate everywhere downstream. The arbiter is the mirror image on the application side: a small number of decisions that disproportionately determine final output quality. Running the arbiter on Sonnet would save almost nothing and would risk propagating coder-level confusion into the final dataset on exactly the cases that matter most.
The arbiter also carries a second responsibility beyond code assignment: it flags whether the underlying definition was ambiguous. Those flags feed back into codebook refinement and are the primary signal for whether a code needs rewording before the next study. That meta-judgment — "is this disagreement about the evidence, or about the rulebook?" — requires reasoning about the codebook itself, not just applying it.
Coded data is assembled into a flat participant-by-code matrix — the single source of truth for all analysis and reporting downstream. Scripts: build-master-dataset.py and build-frequency-report.py
Takes final_codes.json, codebook.json, and roster.json and assembles a flat CSV with one row per participant.
Column naming is `{qid}_{code_name_snake_case}` = 0 or 1. Keeping `Q5_reporting_analytics_gap` and `Q8_reporting_analytics_gap` as separate columns preserves the question-level distinction; collapsing to a single column per code loses it permanently. With 140 codes across 5 thematic questions, this produces approximately 700 binary columns. This is correct — do not collapse them.
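The flattening step can be sketched as follows, assuming a `{participant: {qid: [code names]}}` shape for final_codes.json — the real file's schema may differ, and `build_matrix` is an illustrative name:

```python
def build_matrix(final_codes: dict, all_columns: list[str]) -> list[dict]:
    """Flatten {participant: {qid: [codes]}} into one row per participant,
    with every {qid}_{code} column present as an explicit 0 or 1."""
    rows = []
    for pid, by_question in final_codes.items():
        # Start every column at 0 so absences are explicit, not missing.
        row = {"participant_id": pid, **{col: 0 for col in all_columns}}
        for qid, codes in by_question.items():
            for code in codes:
                col = f"{qid}_{code}"
                if col in row:
                    row[col] = 1
        rows.append(row)
    return rows
```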
Reads master-participants.csv and codebook.json and produces code frequency tables and cross-tabs by firmographic variable.
Basis variables are identified and validated, block-wise specific MCA compresses the binary code set into interpretable dimensions, and Ward's hierarchical clustering with bootstrap stability testing discovers natural participant groups. Scripts: segment-prep.py and run-segmentation.py
The segment-prep.py script reads final_codes.json and the dimensions section of codebook.json, classifies every variable by Wedel & Kamakura (2000) role, then applies two filters to the binary basis set and groups the survivors into blocks so run-segmentation.py can run specific MCA block by block in Step 4.2.
Binary basis variables with prevalence outside the symmetric 9-91% band are dropped from the clustering input. A code endorsed by 95% or 5% of participants carries almost no discriminating power and will dominate the first MCA dimension as noise. The filter is symmetric on purpose: dropping a 92%-prevalence code is as important as dropping an 8%-prevalence one, because near-universal codes create shared-presence artifacts that mirror the shared-absence artifacts at the other end. This matches the Le Roux & Rouanet (2010) guidance for MCA inputs and Dolnicar et al. (2018) on unstable low-prevalence binary indicators.
Dropped variables are logged in segmentation-validation-report.txt with their prevalence and the reason for exclusion, so the researcher can confirm that nothing important was removed.
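A minimal sketch of the symmetric prevalence band filter, with the matrix assumed as a list of per-participant dicts (`prevalence_filter` is illustrative, not segment-prep.py's actual function):

```python
def prevalence_filter(matrix: list[dict], code_columns: list[str],
                      lo: float = 0.09, hi: float = 0.91):
    """Split binary basis columns into kept vs dropped by the 9-91% band.
    Returns (kept, dropped) as lists of (column, prevalence) pairs."""
    n = len(matrix)
    kept, dropped = [], []
    for col in code_columns:
        p = sum(row[col] for row in matrix) / n
        # Symmetric on purpose: near-universal codes are as harmful as rare ones.
        (kept if lo <= p <= hi else dropped).append((col, round(p, 3)))
    return kept, dropped
```

The `dropped` list, with prevalences attached, is what gets written to the validation report for researcher review.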
For every pair of surviving binary basis variables within the same conceptual block, the script computes Yule's Q (Yule, 1900; Agresti, 2013, ch. 2.4):
Q = (ad - bc) / (ad + bc) where a=both positive, b=first only, c=second only, d=both negative.
Yule's Q ranges from -1 to +1 regardless of marginal frequencies. This is the key property for sparse interview-coded data. Phi (the Pearson correlation between two binary variables) is artificially bounded by the base rates: two 10%-prevalence codes can be nearly perfectly associated and still only reach phi ≈ 0.5, which makes phi thresholds impossible to interpret consistently across variables with different marginals. Yule's Q does not have this problem.
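The statistic itself is a few lines; this sketch mirrors the 2×2 cell definitions above:

```python
def yules_q(x: list[int], y: list[int]) -> float:
    """Yule's Q for two binary vectors: (ad - bc) / (ad + bc)."""
    a = sum(1 for i, j in zip(x, y) if i and j)          # both positive
    b = sum(1 for i, j in zip(x, y) if i and not j)      # first only
    c = sum(1 for i, j in zip(x, y) if not i and j)      # second only
    d = sum(1 for i, j in zip(x, y) if not i and not j)  # both negative
    denom = a * d + b * c
    return 0.0 if denom == 0 else (a * d - b * c) / denom
```

Note that two identical 10%-prevalence codes reach Q = 1.0 despite their sparse marginals — the property phi lacks.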
Pairs with |Q| > 0.85 are flagged for manual review. The script does not automatically merge — the researcher decides: merge the pair into a single composite code, drop the weaker one, or declare the weaker one a supplementary variable that contributes to the MCA projection but not to the dimension definition. This replaces the earlier "80% raw overlap" rule, which had no research grounding and misbehaved under unequal base rates.
Basis variables are grouped into conceptual blocks so Step 4.2 can run specific MCA block by block. Block assignment follows this precedence: (1) an explicit block field on the dimension, (2) the temporal_layer field, (3) the primary source question. In practice this means pain-point codes form one block, evaluation-criteria codes form another, rejection-reason codes form a third, and so on. Block-wise MCA is the Greenacre (2017) and Husson, Lê & Pagès (2017) recommendation for categorical variables with natural conceptual groupings — it prevents a single block from dominating the global dimensions and makes each dimension interpretable as "how participants vary within this conceptual area."
The mapping is written to segmentation-blocks.json and consumed directly by run-segmentation.py.
The `sample_size / 10` heuristic (Wedel & Kamakura, 2000; Dolnicar et al., 2018) constrains the number of dimensions entering the clustering step, not the raw basis count. With N = 180 and 9-12 retained MCA dimensions, the ratio is a comfortable 15-20:1. This is why we no longer cap the pre-MCA basis count — the compression happens during MCA, and the ratio is checked on the compressed output.

After reviewing the first-pass validation report, the researcher authors segmentation-transforms.json per study, listing the merges and reclassifications that resolve the Yule's Q flags. The script is re-run with the transforms file as an optional fifth argument, applies the transforms declaratively, and writes the final basis set. Re-running with the same config file is idempotent — same inputs plus same transforms always produce the same outputs.
Two transform types are supported today:
- `merge_binary_or` — creates a new column (prefixed `seg_`) as the logical OR of two or more existing binary basis components. The new variable goes into the clustering basis; the original components are preserved in `segmentation-profile.csv`.
- `reclassify_basis_to_descriptor` — removes a variable from the clustering basis but keeps it in `segmentation-profile.csv` for post-hoc segment description. Used when a Yule's Q flag reveals a structural stayer/switcher contamination or another latent-variable problem that a merge cannot fix.

Every transform entry carries a free-text `reason` field that is copied verbatim into the provenance log. Nothing about a transform is implicit.
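Applied to an in-memory matrix, the two transform types might look like this. The field names mirror the description above but are illustrative — segmentation-transforms.json's actual schema may differ:

```python
def apply_transforms(matrix: list[dict], transforms: list[dict]):
    """Apply the two supported declarative transforms to a list-of-dicts
    matrix. Returns the matrix plus the variables reclassified out of the
    clustering basis (kept for post-hoc profiling)."""
    descriptors = []
    for t in transforms:
        if t["type"] == "merge_binary_or":
            new_col = "seg_" + t["name"]  # seg_ prefix marks derived variables
            for row in matrix:
                row[new_col] = int(any(row[c] for c in t["components"]))
        elif t["type"] == "reclassify_basis_to_descriptor":
            descriptors.append(t["variable"])  # drop from basis, not from data
        else:
            raise ValueError(f"unknown transform type: {t['type']}")
    return matrix, descriptors
```

Because every transform is a pure function of the inputs, re-running with the same transforms list reproduces the same output — the idempotence property described above.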
`roster.json`, `final_codes.json`, and `codebook.json` are inputs only. Every operation in Step 4.1 is additive or reclassifying. Nothing is deleted from the dataset. This is by design: it prevents accidental contamination of the coded study, and it keeps the second-pass result fully reproducible from the same source files plus the same transforms config.

- Variables dropped by the prevalence filter are kept in `segmentation-profile.csv` with a role of `basis_dropped_by_prevalence_filter`.
- Reclassified variables are kept in `segmentation-profile.csv` with a role of `basis_reclassified_to_descriptor_by_transform`.
- Merged variables carry the `seg_` prefix so they are visually distinguishable from codes assigned by the coders in Phase 2. The original components of every merge are preserved in `segmentation-profile.csv` with a role of `basis_merged_into_seg_variable (component preserved)`.
- `segmentation-provenance.json` records prevalence drops with percentages, raw Yule's Q flags, each transform with its declared rationale, the final basis list, the final blocks mapping, and a `column_roles` dictionary that assigns a role to every column in both CSVs. Any value in either CSV can be traced back to its origin in one lookup.
- `segmentation-ready.csv` and `segmentation-profile.csv` are complements over the same participant set. Every variable that ever existed in the flattened dimensions list appears in exactly one of the two files. The provenance file asserts this.

Dimension retention in MCA is an interpretive decision, not a mechanical one. A "retain 75% of corrected variance" rule will sometimes agree with the analyst and sometimes over- or under-retain. Le Roux & Rouanet (2010, ch. 7), Husson, Lê & Pagès (2017, ch. 3), and Greenacre (2017, ch. 11) all argue the analyst must read the scree, inspect the top contributing variables on each dimension's positive and negative poles, and decide which dimensions carry interpretable structure. That review cannot be automated, so we split it out as its own step.
The run-mca.py script runs block-wise specific MCA with Benzécri correction on every block in segmentation-blocks.json, extracts per-dimension positive and negative pole loadings from the presence ("_1") column coordinates, and produces a review report with a recommendation for every block. The researcher reviews the recommendation, approves or revises it, and Claude writes mca-dimensions.json with the final retention counts plus interpretive labels. Only then does run-segmentation.py run.
The review report is written to mca-review.md; it must also appear in the conversation, even when that produces a long message.

A dimension driven almost entirely by a single binary code — for example, a dimension where one code loads at +2.5 and every other variable sits under ±0.3 — is mathematically a dimension but substantively just a continuous re-expression of that one code. Letting it into clustering re-weights that single code by whatever share of variance the dimension claims. Only human review catches this. The HR Leaders BGR simulated pass flagged exactly this failure mode on the Q15 block, where Dim 2 loaded almost entirely on aggressive_sales_experience; the analyst correctly dropped it.
Without mca-dimensions.json, run-segmentation.py in Step 4.2b refuses to run. The analyst-in-the-loop step is not optional and cannot be skipped by accident.
The method we use is block-wise specific Multiple Correspondence Analysis followed by Ward's hierarchical clustering. The run-segmentation.py script re-fits MCA block by block, applies the Benzécri eigenvalue correction, keeps only the dimensions approved in mca-dimensions.json (from Step 4.2a), standardizes them, then runs Ward's minimum-variance linkage across candidate values of k. The best solution is selected using bootstrap stability combined with silhouette score. The entire pipeline is executed twice — once without and once with log(company size) as a basis variable — so the question "should firmographics define segments?" is answered empirically rather than editorially.
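The dual-run comparison reduces to a small decision function. A sketch using scikit-learn's adjusted_rand_score, with the thresholds the pipeline spec commits to; the function name is ours:

```python
from sklearn.metrics import adjusted_rand_score

def select_run(labels_a, labels_b, size_tiers) -> str:
    """Decide which run to ship. Run A is psychographic-only; Run B adds
    log(company size). Ship A unless B adds real structure."""
    if adjusted_rand_score(labels_a, size_tiers) > 0.60:
        return "A"  # A already tracks company-size tiers; size is redundant
    if adjusted_rand_score(labels_a, labels_b) > 0.80:
        return "A"  # the two solutions are essentially the same; prefer parsimony
    return "B"      # size contributes structure A does not capture
```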
Interview-coded data has three properties that rule out naive approaches: (1) the basis set is almost entirely binary code presence/absence; (2) there are 60-140 codes across N = 150-250 participants; (3) the codes cluster into conceptual groups (pain points, evaluation criteria, rejection reasons) that should not be mixed together during dimension extraction. Block-wise specific MCA is the textbook match for data with all three properties (Le Roux & Rouanet, 2010; Greenacre, 2017; Husson, Lê & Pagès, 2017). Ward's linkage (Ward, 1963) is deterministic, interpretable via the dendrogram, and the natural partner for Euclidean distance on standardized MCA scores.
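The Ward-plus-silhouette candidate sweep on standardized MCA scores can be sketched compactly. A minimal illustration (function names are ours; the production script also weighs bootstrap stability when selecting k):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

def ward_candidates(scores: np.ndarray, max_k: int = 6):
    """Ward's minimum-variance linkage on standardized MCA dimension
    scores; returns labels and silhouette for each candidate k."""
    Z = linkage(scores, method="ward")  # Euclidean distance, Ward linkage
    out = {}
    for k in range(2, max_k + 1):
        labels = fcluster(Z, t=k, criterion="maxclust")
        out[k] = (labels, silhouette_score(scores, labels))
    return out
```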
Methods we considered and rejected:
Block-wise MCA compresses 60-140 binary codes into 9-12 interpretable axes that Ward can cluster cleanly, and the bootstrap tests the whole compression-plus-clustering pipeline end to end.
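A minimal sketch of the bootstrap stability idea, in the spirit of Hennig's clusterboot. For brevity this resamples and re-runs Ward only, whereas the production pipeline re-runs the full MCA-plus-clustering compression inside the loop; the function name is ours:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def bootstrap_jaccard(X: np.ndarray, k: int, B: int = 200, seed: int = 0) -> np.ndarray:
    """Mean Jaccard stability per original cluster: resample participants,
    recluster, and match each original cluster to its most similar
    bootstrap cluster (comparison restricted to resampled points)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    base = fcluster(linkage(X, method="ward"), t=k, criterion="maxclust")
    sums = np.zeros(k)
    for _ in range(B):
        idx = rng.choice(n, size=n, replace=True)
        boot = fcluster(linkage(X[idx], method="ward"), t=k, criterion="maxclust")
        sample = set(idx.tolist())
        for c in range(1, k + 1):
            orig = set(np.flatnonzero(base == c).tolist()) & sample
            best = 0.0
            for b in range(1, k + 1):
                members = set(idx[boot == b].tolist())
                union = len(orig | members)
                if union:
                    best = max(best, len(orig & members) / union)
            sums[c - 1] += best
    return sums / B  # compare against the 0.75 / 0.60 decision thresholds
```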
- The analyst-approved retention counts live in mca-dimensions.json — run-segmentation.py refuses to run without that file.
- The Benzécri correction is λ_corr = (J / (J − 1))² × (λ_raw − 1/J)², where J is the number of active variables in the block. This is not optional in modern MCA practice; raw eigenvalues will make you throw away real structure.
- Ward's linkage is run for every candidate k from 2 to max_k (default 6). Ward is deterministic (same input → same output every time), produces a real dendrogram for interpretability, and is the pairing Husson, Lê & Pagès (2017) specifically recommend for MCA outputs. This pairing is pre-committed — we do not try multiple algorithms and pick whichever produces the prettiest answer.
- Run B adds log(company_size) as a standalone standardized variable alongside the MCA dimensions. We compare the two solutions with the Adjusted Rand Index (Hubert & Arabie, 1985). Decision rule: if Run A's segments already track company-size tiers at ARI > 0.60, size is redundant and we ship Run A. If ARI(A, B) > 0.80, the two solutions are essentially the same and we ship Run A for parsimony. Otherwise Run B is adding real structure and we ship Run B.
- The standard error of the mean bootstrap Jaccard is SE = σ / √B. It does not depend on sample size or number of variables. Hennig's clusterboot defaults to B = 100; Dolnicar et al. (2018) explicitly recommend ≥ 200 for published results; Efron & Tibshirani (1993) place B = 50-200 as the defensible range for standard-error estimation. At B = 200, the 95% band around our mean Jaccard is roughly ±0.02 — an order of magnitude tighter than the 0.75 / 0.60 decision thresholds. Going higher is diminishing returns. Full justification in research/segmentation-framework-notes.md § 6.6.

Before any segment is described to a client, the solution from Step 4.2 must pass three gates: statistical stability, criterion-variable validation, and Kotler's five strategic criteria. Only solutions that clear all three become part of the report.
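The Benzécri correction is a one-liner in practice. A sketch; the expression below is algebraically identical to the formula above, and zeroing eigenvalues at or below the 1/J threshold is the standard convention:

```python
def benzecri_correction(raw_eigenvalues, J: int) -> list[float]:
    """Benzécri-corrected eigenvalues for a block with J active variables.
    Only dimensions whose raw eigenvalue exceeds the average 1/J carry
    structure; the rest are treated as noise and zeroed."""
    out = []
    for lam in raw_eigenvalues:
        if lam > 1.0 / J:
            out.append(((J / (J - 1)) * (lam - 1.0 / J)) ** 2)
        else:
            out.append(0.0)
    return out
```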
Read segmentation-report.txt and confirm:
master-participants.csv as a sensitivity check.

Cross-tab the selected segments against every criterion variable from segmentation-profile.csv — current tool adoption, active evaluation status, willingness to pay, renewal intent, and so on. Criterion variables were deliberately held out of the clustering step (Step 4.1), so this is an honest out-of-sample test of whether the segments have behavioral meaning.
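One criterion cross-tab, expressed as row percentages so segments of different sizes are comparable. A minimal pandas sketch; the function name is ours:

```python
import pandas as pd

def criterion_crosstab(assignments: pd.Series, criterion: pd.Series) -> pd.DataFrame:
    """Cross-tab shipped segment assignments against one held-out
    criterion variable, normalized to row percentages."""
    counts = pd.crosstab(assignments, criterion)
    return counts.div(counts.sum(axis=1), axis=0).round(3)
```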
Even a statistically stable, behaviorally predictive solution must clear five strategic tests before it is worth presenting:
Segment profiling is done by cross-tabbing the shipped segment assignment against the original binary basis variables (plus all descriptor variables). The MCA dimensions were compression intermediates — useful for building reproducible segments, useless for describing them to a client. The deliverable says "Segment 2 mentions reporting gaps 3.1× more often than the overall sample," not "Segment 2 scores high on block-2 dimension 4." Each segment's profile sheet includes:
For each segment, identify 2-3 observable proxies a sales rep can assess without a full research interview — roughly in ascending order of observation cost:
If a segment cannot be identified from any combination of these — if membership truly requires a 45-minute research interview to determine — the segment has failed the accessibility test and must be either merged with a neighbor or demoted from "segment" to "persona note" in the final deliverable.
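The "3.1×" style figures in the profile sheets are per-segment lift: segment prevalence of a binary code divided by overall prevalence. A minimal sketch; column names are illustrative, and the function name is ours:

```python
import pandas as pd

def code_lift(df: pd.DataFrame, segment_col: str, code_col: str) -> pd.Series:
    """Lift of one binary code per segment: how much more (or less)
    often the segment carries the code relative to the whole sample."""
    overall = df[code_col].mean()
    return (df.groupby(segment_col)[code_col].mean() / overall).round(2)
```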
Before writing any section of the client report, review the frequency report: it reveals which themes are prevalent, which are rare, and where the most interesting cross-tab differences appear.
When rejection reason data exists and at least 3 competitors have rejection sample sizes of n ≥ 10, build a competitive vulnerability summary showing each competitor's top 3 rejection reasons with positioning angles for the client's sales team.
Include when rejection reasons show differentiated patterns across competitors and the client's strengths (pricing, implementation speed, ease of use) map to competitors' top rejection reasons.
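A sketch of the eligibility-and-top-reasons computation in pandas. Column names are illustrative (one row per participant/competitor/rejection-reason mention); the real data layout may differ:

```python
import pandas as pd

def vulnerability_summary(rejections: pd.DataFrame, min_n: int = 10, top: int = 3):
    """Top rejection reasons per competitor, restricted to competitors
    with enough distinct rejecting participants to support a claim.
    The section is built only if at least 3 competitors qualify."""
    sizes = rejections.groupby("competitor")["participant_id"].nunique()
    eligible = sizes[sizes >= min_n].index
    out = {}
    for comp in eligible:
        reasons = rejections.loc[rejections["competitor"] == comp, "rejection_reason"]
        out[comp] = reasons.value_counts().head(top).index.tolist()
    return out
```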
The client report is an HTML document built with Astro and deployed to Cloudflare Pages. It is the primary deliverable of the study — interactive, printable, and structured to serve marketing, product, and sales simultaneously.
Avoid break-inside: avoid on whole tables: large tables push entirely to the next page, creating huge white gaps. Let tables break between rows; protect individual rows with break-inside: avoid on tr.

Use this CSS pattern to keep headings with their charts:
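A sketch of the pattern; the class names .chart-description and .chart-figure are illustrative and may differ from the report's actual stylesheet:

```css
/* Heading must stay on the same page as what follows it. */
h2, h3 { break-after: avoid; }

/* Description must not separate from the legend/chart below it. */
.chart-description { break-after: avoid; }

/* The legend and chart themselves never split across pages. */
.chart-figure { break-inside: avoid; }
```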
This creates a chain: heading stays with description, description stays with legend/chart.
Additional print and deployment rules:

- Wrap each cross-tab in <div class="crosstab-section">; apply page: landscape to the wrapper, NOT to .crosstab alone (otherwise the title stays on the portrait page).
- table-layout: fixed; width: 100%, font-size 9px data / 8px headers.
- white-space: normal.
- @page { margin: 16mm; }.
- !important on all background colors + -webkit-print-color-adjust: exact !important on *.
- details.accordion { page-break-before: always; } with first-of-type excluded.
- display: block !important on details and body.
- Run wrangler pages project create [name] --production-branch master before the deploy command.

Pipeline Outputs
All output files go into A6 Data Files - Simulated/ or B6 Data Files - Real/ for new studies. Segmentation outputs go into the A5/B5 folder. (The HR Leaders BambooHR study uses A3/A4/A5 — a pre-convention study that is not being reorganized.)
| File | Created by | Contents |
|---|---|---|
| study-context.json | Discovery 1.1 | Who was interviewed, communication patterns, dominant topics |
| questions-registry.csv | Discovery 1.2 | One row per question with coding type and code count |
| meaning-units-log.csv | Discovery 1.3 | All meaning units with exact quotes and descriptive labels |
| codebook-audit-trail.csv | Discovery 1.5 | Architect's non-obvious merge and split decisions with reasoning |
| codebook-landscape-analysis.txt | Discovery 1.5 | Architect's chain-of-thought landscape analysis |
| codebook.json | After human review | Final approved codebook with all code definitions, criteria, and examples |
| agent-registry.json | Discovery + Application | Full record of every agent used: model, temperature, persona hash, role, run date |
| agent-1-codes.json | Application 2.1 | Raw coding output from Agent 1 |
| agent-2-codes.json | Application 2.1 | Raw coding output from Agent 2 |
| application-coding-detail.json | Application 2.3 | Full per-agent detail with resolver notes for every contested decision |
| reliability.txt | Application 2.2 | Human-readable Kappa report |
| reliability-summary.json | Application 2.2 | Machine-readable Kappa data per code |
| flagged-items.json | Application 2.2 | Codes with Kappa below 0.65 and ambiguous definitions |
| coding-summary.md | Application end | Run summary with participant counts and quality metrics |
| master-participants.csv | Dataset 3.1 (grows) | One row per participant, all coded variables + roster, grows with factor scores and segment assignments |
| frequency-report.html | Dataset 3.2 | Sortable tables, bar charts, collapsible cross-tab sections by firmographic variable |
| frequency-report.csv | Dataset 3.2 | Machine-readable frequencies |
| segmentation-ready.csv | Segmentation 4.1 | Final basis variables for clustering — complement of profile CSV over the same participants (A6 folder) |
| segmentation-profile.csv | Segmentation 4.1 | Every variable NOT in the clustering basis: descriptors, criterion, prevalence-dropped, reclassified, preserved merge components (A6 folder) |
| segmentation-blocks.json | Segmentation 4.1 | Block-to-variable mapping consumed by run-segmentation.py (A6 folder) |
| segmentation-transforms.json | Segmentation 4.1 | Hand-authored per study — declares the merges and basis→descriptor reclassifications applied after the first-pass validation review (A6 folder) |
| segmentation-provenance.json | Segmentation 4.1 | Machine-written audit trail: prevalence drops, Yule's Q flags, transforms applied, column roles for every CSV column (A6 folder) |
| segmentation-validation-report.txt | Segmentation 4.1 | Human-readable prevalence log and Yule's Q review (A6 folder) |
| mca-review.json | Segmentation 4.2a | Machine-readable block-wise MCA results: raw and Benzécri-corrected eigenvalues, variance %, per-dimension positive/negative pole loadings, recommended retention per block (A6 folder) |
| mca-review.md | Segmentation 4.2a | Human-readable MCA review with eigenvalue tables, ASCII scree bars, dimension interpretations, per-block recommendations. Eigenvalues and pole loadings must be presented inline in chat for analyst approval (A6 folder) |
| mca-dimensions.json | Segmentation 4.2a | Analyst-approved per-block retention counts + interpretive labels. Consumed by run-segmentation.py; missing file → Step 4.2b refuses to run (A6 folder) |
| segmentation-assignments.csv | Segmentation 4.2b | participant_id, run_a_segment, run_b_segment, selected_segment, selected_run (A6 folder) |
| segmentation-run-a-dimensions.csv | Segmentation 4.2b | Benzécri-corrected MCA dimension scores from the psychographic-only run (A6 folder) |
| segmentation-run-b-dimensions.csv | Segmentation 4.2b | MCA dimension scores from the run that includes log(company size) (A6 folder) |
| segmentation-report.txt | Segmentation 4.2b | Benzécri variance, silhouette, bootstrap Jaccard stability, ARI comparison, segment sizes (A6 folder) |
Research Citations
Every major design decision in this pipeline traces to published research. These are the sources cited when explaining the methodology to clients or peer reviewers.