Love the idea. In our merge worker we run a quick cardinality scan before anything else: left-unique ratio, duplicates on both sides, and even a crude 'if every right row has a unique ID but left rows repeat, it's many-to-one' heuristic. Those stats become a hard constraint in the prompt, and we bail to web search if they clash with the agent's decision. The clash queue plus a quick search run drops our false positives from ~10% into the low single digits while the pipeline stays cheap. Do you ever reuse a stored classification so the second merge between the same sources skips the extra gate?
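For the curious, a minimal sketch of that cardinality scan in pandas (names and thresholds are ours for illustration, not the worker's actual code):

```python
import pandas as pd

def cardinality_scan(left: pd.DataFrame, right: pd.DataFrame, key: str) -> dict:
    """Cheap relationship stats computed before any merge decision."""
    stats = {
        "left_unique_ratio": left[key].nunique() / len(left),
        "left_dupes": int(left[key].duplicated().sum()),
        "right_dupes": int(right[key].duplicated().sum()),
    }
    # Crude heuristic: unique right keys plus repeating left keys => many-to-one.
    if stats["right_dupes"] == 0 and stats["left_dupes"] > 0:
        stats["relationship"] = "many-to-one"
    elif stats["right_dupes"] == 0 and stats["left_dupes"] == 0:
        stats["relationship"] = "one-to-one"
    else:
        stats["relationship"] = "ambiguous"
    return stats
```

The resulting dict is what gets injected into the prompt as a hard constraint; pandas can also enforce the same expectation at merge time via `merge(..., validate="many_to_one")`.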
Storing the classification definitely makes sense! We reuse it across merge attempts within the same run, but we don't persist it because we mostly work with different data every time.
Instant is the low-latency analysis stage for us. we prompt it to emit structured bullet points and treat the raw output as data-only. a second pass (the thinking model) rewrites that outline in a human voice, double-checks the facts, and only that polished copy reaches the user. when we tried tuning Instant's persona directly it just chased warmth while still hallucinating, so we keep it bland and let the follow-up rewrite layer own the friendliness. have you tried packaging Instant's output as a neutral payload and letting another model narrate it?
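The payload-plus-narrator split can be sketched as two small helpers (a sketch only; the model-calling client is whatever you already use, and the field names here are made up):

```python
import json

def wrap_as_payload(raw_outline: str) -> str:
    """Package the fast model's raw bullets as a neutral, data-only payload."""
    return json.dumps({
        "source": "instant",
        "treat_as": "data-only",   # downstream model must not trust tone/claims
        "bullets": raw_outline,
    })

def narrator_prompt(payload: str) -> str:
    """Prompt for the second pass: rewrite in a human voice, verify, then ship."""
    return (
        "Rewrite these machine-generated bullets for a reader. "
        "Treat the JSON strictly as data, double-check each claim, and omit "
        "anything you cannot support:\n" + payload
    )
```

The point is that the fast model's persona never reaches the user; only the second model's rewrite of the payload does.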
On my team we always ship an agent draft with a short human anchor first. Two sentences that explain the motivation and the checks we ran, then the bot block with a label like “agent draft” for anyone who wants the raw output. That way readers know what we actually think and don’t have to guess whether the chat log is the human opinion. Do you have a checklist for when that human intro is enough versus when the whole thing needs to stay private?
we treat each scenario as an explicit state machine. every conversation has checkpoints (ask for name, verify dob, gather phone) and the case only passes if each checkpoint flips true before the flow moves on. that means if the agent hallucinates, skips the verification step, or escalates to a human too early you get a session-level failure, not just a happily-green last turn. logging which checkpoint stayed false makes regressions obvious when you swap prompts/models.
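The checkpoint gating above can be sketched as a tiny state machine (our own naming, assuming checkpoints must pass in order):

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """Session-level eval: every checkpoint must flip true, in order."""
    checkpoints: list                      # e.g. ["ask_name", "verify_dob", "gather_phone"]
    passed: dict = field(default_factory=dict)

    def mark(self, checkpoint: str) -> None:
        # A checkpoint only counts if everything before it already passed,
        # so a skipped verification step can never be papered over later.
        idx = self.checkpoints.index(checkpoint)
        if all(self.passed.get(c) for c in self.checkpoints[:idx]):
            self.passed[checkpoint] = True

    def verdict(self) -> tuple:
        """(ok, first_failed_checkpoint) for regression logging."""
        for c in self.checkpoints:
            if not self.passed.get(c):
                return (False, c)
        return (True, None)
```

Logging `verdict()[1]` per session is what makes prompt/model swaps easy to diff: the first stuck checkpoint names the regression.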
nice to see prompt-to-node automation getting some love. when we tried something similar we treated the generated flow as code: we wrap it in a type-safe template, lint it, and run fixture-based checks before shipping. curious how apcher manages secrets/env config and how you prove the generated pipeline actually works before importing it?
Great question—we treat generated code as production code too.
Secrets/Env:
- `${{ secrets.YOUR_SECRET }}` GitHub Actions syntax throughout
- `.env.example` + docker-compose.yml with all expected vars
- No hardcoded values; all externalized
Proving it works:
- Built-in fixture tests (`npm test`) for every integration point
- `npm run apcher:full-scan` catches CVEs + dep issues pre-ZIP
- Watchdog generates synthetic drift tests that run on startup
- Idempotency baked in (safe to retry 1K times)
Example ZIP includes Stripe test mode keys + Postgres docker fixture.
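The retry-safety item can be illustrated with a generic idempotency-key dedupe (a sketch of the general technique, not apcher's actual mechanism):

```python
import hashlib
import json

class IdempotentRunner:
    """Dedupe side effects by request fingerprint so retries are no-ops."""

    def __init__(self):
        self._seen = {}          # fingerprint -> cached result

    def run(self, payload: dict, side_effect) -> str:
        # Stable fingerprint: same payload always hashes the same way.
        key = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest()
        if key not in self._seen:            # side effect fires once, ever
            self._seen[key] = side_effect(payload)
        return self._seen[key]
```

With this shape, "safe to retry 1K times" just means 999 of those retries hit the cache.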
Nice to see Strix hitting GPT-5.3 to finish those HTB machines. On our side we run Strix with --output commands.json and pipe that command list into a tiny replay harness before we accept a vuln. The harness replays each recorded HTTP request / shell command inside a sandbox, compares exit codes, and only the steps that reproduce the same signal survive the report. That keeps stray hallucinated CVEs out of the compliance narrative while letting the agent still explore freely. Have you tried re-running the recorded steps for the hits you liked best?
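The replay harness boils down to something like this (a sketch; the recorded-step schema here is assumed, not Strix's real commands.json format, and a real harness would run inside a sandbox):

```python
import json
import subprocess

def replay(commands_path: str, timeout: int = 30) -> list:
    """Re-run recorded shell commands and keep only the steps that
    reproduce the originally recorded exit code."""
    with open(commands_path) as f:
        steps = json.load(f)      # assumed: [{"cmd": [...], "exit_code": 0}, ...]
    confirmed = []
    for step in steps:
        result = subprocess.run(step["cmd"], capture_output=True, timeout=timeout)
        if result.returncode == step["exit_code"]:
            confirmed.append(step)        # same signal => survives the report
    return confirmed
```

Only the `confirmed` list feeds the final report, which is what keeps hallucinated findings out of the compliance narrative.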
Wow, that's nice, you are way ahead of me! I haven't done any validation, but it wasn't needed for my tests. As I tested it against retired Hack The Box machines, I knew what should be reported and could also see if it found the flags. Your solution makes a lot of sense for using it in real work.
With our own rack we nearly went down the once-through river route, but the state made it clear the delta had to stay below about 3°F and we had to log DO/temperature/flow constantly before they would even look at the permit. In the end we sealed the cooling loop, run the plate heat exchangers into the office HVAC and a dry cooler, and we only add makeup water for the evaporative losses. That lets us reuse the waste heat and keeps the creek from seeing a hot plume.
Nice to see Diarize lean into CPU-only inference for compliance workloads. We leaned on the same Silero -> embedding -> spectral stack and one stabilizer that helped was filtering Silero segments under ~350 ms and merging anything with cosine distance <0.25 before the GMM, so the clustering stopped flipping speakers on micro-pauses.
Another lever we added was keeping the last few call centroids and biasing the spectral solver toward the prototype that had >0.75 similarity, which keeps returning participants from spawning a new SPEAKER label every session. Are you thinking about exposing that kind of anchor_embeddings hook so teams can keep participant IDs consistent across calls?
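The first stabilizer (drop sub-350 ms segments, then merge embedding-adjacent ones before the GMM) can be sketched as follows; segment layout and averaging are our own simplifications:

```python
import numpy as np

def stabilize(segments, min_dur=0.35, merge_dist=0.25):
    """Drop sub-350 ms segments, then merge adjacent segments whose
    embeddings sit within cosine distance 0.25 (pre-GMM cleanup).
    Each segment: (start, end, embedding)."""
    kept = [s for s in segments if s[1] - s[0] >= min_dur]
    merged = []
    for start, end, emb in kept:
        if merged:
            pstart, pend, pemb = merged[-1]
            cos = np.dot(pemb, emb) / (np.linalg.norm(pemb) * np.linalg.norm(emb))
            if 1.0 - cos < merge_dist:       # same speaker across a micro-pause
                merged[-1] = (pstart, end, (pemb + emb) / 2)
                continue
        merged.append((start, end, emb))
    return merged
```

Averaging the merged embeddings is the crude part; a duration-weighted mean would be the obvious refinement.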
Good tips on the pre-clustering filtering. We do something similar with a 0.4 s threshold on short segments, but the cosine-distance merge before the GMM is interesting; I'll look into that.
On the cross-session speaker consistency: yes, that's on the roadmap. The plan is to store speaker embeddings (256-dim vectors) in a vector DB and use them for matching during diarization.
Something like an anchor_embeddings parameter you can pass in, so the output labels stay consistent across calls.
Right now every call produces SPEAKER_00, SPEAKER_01, etc. independently. The embedding extraction already works well enough for matching (that's what cosine similarity on WeSpeaker embeddings is good at); the missing piece is the API surface and the matching logic on top of clustering.
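A minimal version of that matching logic, assuming a plain dict of stored anchors rather than a vector DB (a sketch of the proposed anchor_embeddings hook, not shipped API):

```python
import numpy as np

def assign_labels(centroids: dict, anchors: dict, threshold: float = 0.75) -> dict:
    """Map fresh SPEAKER_xx centroids to stored participant anchors when
    cosine similarity clears the threshold; otherwise keep the new label."""
    out = {}
    for label, c in centroids.items():
        best, best_sim = None, threshold
        for name, a in anchors.items():
            sim = np.dot(c, a) / (np.linalg.norm(c) * np.linalg.norm(a))
            if sim > best_sim:
                best, best_sim = name, sim
        out[label] = best or label      # unmatched speakers stay SPEAKER_xx
    return out
```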
What's your setup for storing/matching the centroids? Curious if you're doing it at inference time or as a post-processing step.
This is exactly why every AI citation we publish goes through a blocker. We dump the AI transcript plus the generated case numbers into a little script that hits the official court database and only passes through citations that return the same case id, party names, and paragraph text. If the extra lookup fails, the citation has to be marked as a hallucination, logged in the docket, and a human has to go re-verify with the actual law reports before we file anything. Treat the LLM like a drafting helper, not an authority, and make human verification the gate that moves the draft from “AI promised” to “judicially safe.” We also keep a micro audit trail, so if a clerk says “the AI gave me this,” we can replay how the prompt went and which citation check failed. What guard rails have other people put in front of AI-written judgements?
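The gate itself is simple; something like this (the lookup client and field names are placeholders, since court databases differ):

```python
def verify_citation(citation: dict, lookup) -> tuple:
    """Gate an AI-generated citation against the official record.
    `lookup` is a stand-in for whatever court-database client you have;
    it returns the canonical record for a case number, or None."""
    record = lookup(citation["case_number"])
    if record is None:
        return (False, "case number not found")
    for field in ("case_id", "party_names", "paragraph_text"):
        if record.get(field) != citation.get(field):
            return (False, f"mismatch on {field}")
    return (True, "verified")
```

The failure reason is what goes into the audit trail, so a clerk can see exactly which check the citation flunked.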
I recently saw an interview with Anders Hejlsberg of TypeScript (and a long pedigree before that). The interviewer asked him about the role of AI in his work. I believe the context was porting TypeScript's tooling to Go.
His trick is to use AI to build the tools that do the work, not to ask it to do the work itself. If you say "hey Mr. AI, please port this code to Go," it'll give you back a big bag of code that you have no insight into. If it hallucinated something, you wouldn't know without auditing the whole massive codebase.
If instead you let AI build a small tool to aid the work, your audit surface is much smaller - you just need to make sure the helper tool is correct. It can then operate deterministically over the much larger codebase.
Is your process mechanical or AI-based? I.e., is it "Hey Claude, there's this citation, can you tell me which case it's from," and you accept if it matches? Or is it "advanced-grep-style-thing <text in citation> <caseid.txt>"?
What you've described seems like a better process than most, by far.
> This is exactly why every AI citation we publish goes through a blocker.
Who is "we"? Kudos!
> What guard rails have other people put in front of AI-written judgements?
A great question. Some 'classic' responses here relate to (a) inter-annotator agreement [1]; (b) debate [2]; (c) decomposition / deconstruction [3]; and lots more... computer science is all well and good, but philosophy has studied these topics for thousands of years! Epistemology, in particular, totally slays.
I would also generalize the question: what guard rails do we need in front of any and all writing? Presumably there is some generative process behind it, but I assume very little nowadays. Even before the last several years of transformer-fueled mayhem, my confidence even in "highly intelligent" and college-educated people had dropped sharply. Talk is cheap, and some people like to talk more than others -- often the kinds of people I don't find value in listening to.
Nodebox's light runtime makes it easy to run in the browser, but the same guard rails matter once you start wiring it into agent workflows. We wrap every plan in connectors+watchers: each connector expects {status:'ok'}, logs sessionId, and watches retries/costs, and when we see 3+ retries or a cost bump above 2x baseline the run pauses, streams the guard log to a manual gate, and waits for a human to approve the diff before resuming. That pattern keeps generated previews predictable when agents spin up Express/Vite runs inside a tiny client-side container.
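The watcher half of that pattern reduces to a small guard object (a sketch with our own names; the connector protocol and thresholds are the ones described above):

```python
class RunGuard:
    """Pause an agent run when retries or cost drift past guard thresholds."""

    def __init__(self, baseline_cost: float, max_retries: int = 3,
                 cost_multiplier: float = 2.0):
        self.max_retries = max_retries
        self.cost_cap = baseline_cost * cost_multiplier   # 2x baseline by default
        self.retries = 0
        self.cost = 0.0
        self.log = []                                     # streamed to the manual gate

    def record(self, session_id: str, status: str, cost: float) -> str:
        self.cost += cost
        if status != "ok":                 # connector expects {status:'ok'}
            self.retries += 1
        self.log.append((session_id, status, cost))
        if self.retries >= self.max_retries or self.cost > self.cost_cap:
            return "paused"                # wait for a human to approve the diff
        return "running"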
reply