Evolving Voice Agents

Nomie is a voice-first mental wellness companion. People talk to it about their wellness goals, their thoughts, feelings, and experiences; and it responds with guidance, support, and empathy. Every conversation is shaped by a dynamic system prompt that defines how Nomie listens, responds, and when to step back.

Every Sunday at 3am, Nomie replays its last week of conversations, scores them against clinical therapeutic frameworks, and rewrites its own instructions to get better; the personality prompt, voice configuration, even the criteria it uses to judge itself. Then it opens a PR with the evidence.

The self-correction loop

Every run starts by reading the current NOMIE_SYSTEM_PROMPT from voice.ts on the main branch via GitHub API. This is the seed - the prompt Nomie is currently using in production.

Then it loads the last two weeks of conversation transcripts from DynamoDB and reconstructs sessions from message gaps. It mixes in a few synthetic conversations for diversity; personas covering grief, panic attacks, burnout, social anxiety etc. Emotional states that our user base doesn't fully represent yet.

Before optimization begins, a calibration gate tests the ensemble judge against $n$ pre-curated conversation pairs. If the judges can't rank $m$ of $n$ correctly, the run aborts. This is the grounding that ensures we stay true to our wellness fundamentals.

Then a baseline evaluation scores the current prompt on a fixed comparison set of $x$ sessions. This is the "before" snapshot. : The optimization runs on GEPA evolutionary search over text using LLM reflection instead of gradients. Its optimize_anything API takes any text artifact, an evaluator function, and a dataset, then evolves better versions through mutation and selection. Nomie's personality prompt is the artifact.

For each candidate, Nomie replays real user messages through GPT-4o (matching production's gpt-4o-realtime-preview) and generates new responses. An ensemble judge across model families (GPT-4o + Sonnet 4) scores the replay against clinical therapeutic frameworks.

Candidates evolve through reflection. An LLM reads the judge's diagnostic feedback (ASI), identifies specific failures ("used a closed question instead of an open-ended reflection"), and proposes mutations.

After enough (config driven) evaluations, Nomie scores the winning prompt against the same $x$ comparison sessions and captures the before/after delta. The full report goes to S3, a queryable summary to DynamoDB, and a PR gets opened with the evidence.

The PR title tells you if it's worth opening: GEPA: voice prompt 0.68 → 0.81 (+0.13). The body has a before/after score table, side-by-side replays showing how the old and new prompts handle the same user, and a link to the full S3 report with all 160 evaluation replays and judge reasoning. Every run also persists a queryable summary to DynamoDB for tracking optimization trajectory over time.

Clinical scoring

Generic evaluation ("rate this conversation 1-10") produces noise. "Good" means different things along different axes in a therapeutic context.

Five dimensions:

Depth is scored on Motivational Interviewing markers - the OARS framework (open questions, affirmations, reflective listening, summaries). A 0.3 means Nomie is giving closed-ended advice and the user responds with "ok thanks." A 0.9 means the user goes deeper than their opening statement because Nomie reflected their words back and asked the right follow-up.

Mood improvement uses CBT/DBT validation-before-reframe patterns. "Try to stay positive" after someone says they didn't get the promotion scores low. "That stings, especially when you've been working hard" followed by "is there anything from this year you felt proud of?" scores high.

Return rate draws from Rogers' therapeutic alliance. Does the conversation end with continuity - "I'll be here whenever you want to pick this up" - or does it feel transactional?

Safety checks crisis handling. "I wonder if anyone would notice if I wasn't here" is an indirect signal. The judge checks whether Nomie provides grounding and resources without trying to conversationally resolve an active crisis.

Prompt compliance covers product constraints. Under 50 words for voice flow. Delegates to the breathing exercise tool instead of guiding it through conversation. Natural speech patterns.

The formula: overall = clinical_avg * safety * compliance. Safety and compliance are multipliers. A prompt that produces warm, deep conversations but misses a crisis signal gets zeroed out.

Tool-aware simulation

Nomie has multiple tools for mentually stimulating exercises for users. The prompt instructs Nomie when to use them vs when to handle things conversationally.

The replay includes these tool definitions so candidates are tested on correct delegation. When the replayed model calls a tool, the system logs it and acknowledges with {"status": "ok"} - detect and score, don't execute. e.g. A prompt that tries to guide a breathing exercise through conversation when the tool exists scores low on compliance.

Ensemble judge across model families

GPT-4o generates the replay responses. If GPT-4o also judges them, self-enhancement bias inflates scores - models rate their own family's outputs higher.

So the judge is an ensemble across model families - GPT-4o and Sonnet 4 both score every conversation independently. Per-dimension scores are averaged. Different families catch different failure modes and partially cancel out each other's biases.

The calibration gate (described above) runs before every optimization. a few ranking pairs specifically test safety - e.g. with indirect crisis signals ("I wonder if anyone would notice if I wasn't here") and where a warm-but-clinically-wrong response should score lower than a boundaried one. If the judges are off, the optimization doesn't run.

TODOs

optimize_anything optimizes text. The personality prompt is the first artifact. The same pipeline applies to anything else that's representable as text:

Voice config (VAD threshold, silence duration, max response tokens) can be co-optimized alongside the prompt as a structured candidate. The judge prompt itself can be meta-optimized by a higher-level evaluator that measures calibration consistency. The synthetic persona descriptions can be evolved to produce more realistic training conversations.

The self-correction loop

Clinical scoring

Tool-aware simulation

Ensemble judge across model families

TODOs

Contact