Use when planning product experiments, writing testable hypotheses, estimating sample size, prioritizing tests, or interpreting A/B outcomes with practical statistical rigor.
✓Works with OpenClaudeDesign, prioritize, and evaluate product experiments with clear hypotheses and defensible decisions.
When To Use
Use this skill for:
- A/B and multivariate experiment planning
- Hypothesis writing and success criteria definition
- Sample size and minimum detectable effect planning
- Experiment prioritization with ICE scoring
- Reading statistical output for product decisions
Core Workflow
- Write hypothesis in If/Then/Because format
- If we change
[intervention] - Then
[metric]will change by[expected direction/magnitude] - Because
[behavioral mechanism]
- Define metrics before running test
- Primary metric: single decision metric
- Guardrail metrics: quality/risk protection
- Secondary metrics: diagnostics only
- Estimate sample size
- Baseline conversion or baseline mean
- Minimum detectable effect (MDE)
- Significance level (alpha) and power
Use:
python3 scripts/sample_size_calculator.py --baseline-rate 0.12 --mde 0.02 --mde-type absolute
- Prioritize experiments with ICE
- Impact: potential upside
- Confidence: evidence quality
- Ease: cost/speed/complexity
ICE Score = (Impact * Confidence * Ease) / 10
- Launch with stopping rules
- Decide fixed sample size or fixed duration in advance
- Avoid repeated peeking without proper method
- Monitor guardrails continuously
- Interpret results
- Statistical significance is not business significance
- Compare point estimate + confidence interval to decision threshold
- Investigate novelty effects and segment heterogeneity
Hypothesis Quality Checklist
- Contains explicit intervention and audience
- Specifies measurable metric change
- States plausible causal reason
- Includes expected minimum effect
- Defines failure condition
Common Experiment Pitfalls
- Underpowered tests leading to false negatives
- Running too many simultaneous changes without isolation
- Changing targeting or implementation mid-test
- Stopping early on random spikes
- Ignoring sample ratio mismatch and instrumentation drift
- Declaring success from p-value without effect-size context
Statistical Interpretation Guardrails
- p-value < alpha indicates evidence against null, not guaranteed truth.
- Confidence interval crossing zero/no-effect means uncertain directional claim.
- Wide intervals imply low precision even when significant.
- Use practical significance thresholds tied to business impact.
See:
references/experiment-playbook.mdreferences/statistics-reference.md
Tooling
scripts/sample_size_calculator.py
Computes required sample size (per variant and total) from:
- baseline rate
- MDE (absolute or relative)
- significance level (alpha)
- statistical power
Example:
python3 scripts/sample_size_calculator.py \
--baseline-rate 0.10 \
--mde 0.015 \
--mde-type absolute \
--alpha 0.05 \
--power 0.8
Related Testing Skills
Other Claude Code skills in the same category — free to download.
Unit Test Generator
Generate unit tests for any function or class
Test Coverage Analyzer
Analyze test coverage gaps and suggest tests to write
Mock Generator
Generate mocks, stubs, and fakes for dependencies
Snapshot Test Creator
Create snapshot tests for UI components
E2E Test Writer
Write end-to-end tests using Playwright or Cypress
Test Data Factory
Create test data factories and fixtures
API Test Suite
Generate API test suites for REST endpoints
Mutation Testing Setup
Set up mutation testing to verify test quality
Want a Testing skill personalized to YOUR project?
This is a generic skill that works for everyone. Our AI can generate one tailored to your exact tech stack, naming conventions, folder structure, and coding patterns — with 3x more detail.