You could literally drop this into Claude Code or Codex and point it at a local fork of Zulip and have it build your bimodal version with triage and grazing styles.
I use an LLM behavior test to see if the semantic responses from LLMs using my MCP server match what I expect them to. This is beyond the regex tests, but to see if there's a semantic response that's appropriate. Sometimes the LLMs kick back an unusual response that technically is a no, but effectively is a yes. Different models can behave semantically different too.
If I had a nice CI/CD workflow that was built into GitHub rather than rolling my own that I have running locally, that might just make it a little more automatic and a little easier.
It looks like it does have an MCP Gateway https://github.com/github/gh-aw-mcpg so I may see how well it works with my MCP server. One of the components mine makes are agent elements with my own permissioning, security, memory, and skills. I put explicit programatic hard stops on my agents if they do something that is dangerous or destructive.
As for the domain, this is the same account that has been hosting Github projects for more than a decade. Pretty sure it is legit. Org ID is 9,919 from 2008.
This is where the desire to NOT anthropomorphize LLMs actually gets in the way.
We have mechanisms for ensuring output from humans, and those are nothing like ensuring the output from a compiler. We have checks on people, we have whole industries of people whose whole careers are managing people, to manage other people, to manage other people.
with regards to predictability LLMs essentially behave like people in this manner. The same kind of checks that we use for people are needed for them, not the same kind of checks we use for software.
> The same kind of checks that we use for people are needed for them
Those checks works for people because humans and most living beings respond well to rewards/punishment mechanisms. It’s the whole basis of society.
> not the same kind of checks we use for software.
We do have systems that are non deterministic (computer vision, various forecasting models…). We judge those by their accuracy and the likely of having false positive or false negatives (when it’s a classifier). Why not use those metrics?
LLM code completion compares unfavourably to the (heuristic, nigh-instant) picklist implementations we used to use, both at the low-level (how often does it autocomplete the right thing?) and at the high-level (despite many believing they're more effective, the average programmer is less effective when using AI tools). We need reasons to believe that LLMs are great and do all things, therefore we look for measurements that paint it in a good light (e.g. lines of code written, time to first working prototype, inclination to output Doom source code verbatim).
The reason we're all using (or pretending to use) LLMs now is not because they're good. It's almost entirely unrelated.
> The same kind of checks that we use for people are needed for them...
The whole benefit of computers is that they don't make stupid mistakes like humans do. If you give a computer the ability to make random mistakes all you have done is made the computer shitty. We don't need checks, we need to not deliberately make our computers worse.
The same thing happens when I have a project that I’m leading where I have 3-4 other developers. It’s not deterministic that they will follow my specs completely, correctly and not have subtle bugs.
If they are junior developers working in Java they may just as well build an AbstractFactoryConcurrentSingletonBean because that’s what they learned in school as an LLM would be from training on code it found on the Internet.
I'm looking at it right now as a tool I can hollow out and stuff in my own MCP server that also has personas, skills, an agentic loop, memory, all those pieces. I may even go simpler than that and simply take a look at it's gateway and channels and drag those over and slap them onto the MCP server I have and turn it into an independent application.
It looks far too risky to use, even if I have it sequestered in its own VM. I'm not comfortable with its present state.
Where I think agents become fascinating is when we give cc an interface to something like clawdebot, plus any logging/observability, and tell it to recreate the code base.
Had humans not been doing this already, I would have walked into Samsung with the demo application that was working an hour before my meeting, rather than the android app that could only show me the opening logo.
There are a lot of really bad human developers out there, too.
An embedded page at landr-atlas.com says:
Attention!
MacOS Security Center has identified that your system is under threat.
Please scan your MacOS as soon as possible to avoid more damage.
Don't leave this page until you have undertaken all the suggested steps
by authorised Antivirus.
[OK]
Thank you for the note. It's not a site I used all that often.
Whether you had anything to do with it or not, I have no idea. And, since you didn't follow best practices and tell me directly rather than trying to score points here, there's really no way of knowing whether you're the one who caused the problem in the first place.
I built a new site without Wordpress. That took in less than a day.
I don't imagine you will alter your behavior to align with general best security practices anytime soon.
> Whether you had anything to do with it or not, I have no idea. And, since you didn't follow best practices and tell me directly rather than trying to score points here, there's really no way of knowing whether you're the one who caused the problem in the first place.
Are you actually accusing me (slyly couched in weasel words, but still explicitly) of hacking your wordpress blog, then pointing it out on Hacker News to score points?
Yeah, you have a point /s: there's really no way to tell if I hacked your blog or not, nor any way of knowing whether any statement is true or not if you're nihilistic enough, but you're going to have to take my word that I didn't, and clean up your own mess without shifting the blame to me, or demanding I should have helped you. You're the one who chose to use wordpress, not me. FYI, "general best security practices" include DON'T USE WORDPRESS.
What possible evidence or delusional reasons do you have to imply that I hacked your wordpress blog? Is your security really that lax and password that easy to guess? And even if I did, then why would I post about it publicly or notify you privately? You sound pathologically paranoid and antisocially aggressive to make such baseless accusations out of the blue, to try to shift the blame to me for your own mistakes. That makes me glad I didn't try to contact you directly. Funny thing for you to complain about when you don't even openly publish your contact email address on your blog or hn profile like I do, though.
I think Claude Cowork should come with a requirement or a very heavily structured wizard process to ensure the machine has something like a Time Machine backup or other backups that are done regularly, before it is used by folks.
The failure modes are just too rough for most people to think about until it's too late.
It certainly doesn't seem to have a trouble creating MIT licenses, that's for sure. I've had it insert an MIT license against my express direction instead of the AGPL license.
reply