If AI is the key to compiling natural language into machine code like so many claim, then the AI should output machine code directly.
But of course it doesn't do that because we can't trust it the way we do a traditional compiler. Someone has to validate its output, meaning it most certainly IS meant for humans. Maybe that will change someday, but we're not there yet.
$1,000 a year is maybe $5 per workday. I measure my own usage and am on the way to $6,000 for a full year. I'm still at the stage where I like to look at the code I produce, but I do believe we'll head toward a state of software development where one day we won't need to.
Scroll further down (specifically to the section titled "Wait, $1,000/day per engineer?"). The quote in the quoted article (so from the original source at factory.strongdm.ai) could be read either way, but Simon Willison (the direct link) is absolutely interpreting it as $1,000/dev/day. I also think $1,000/dev/day is the intended meaning in the strongdm article.
All true except that CLI tools are composable and don't pollute your context when run via a script. The missing link for MCP would be a CLI utility to invoke it.
How does the agent know what CLIs/tools it has available? If there's an `mcpcli --help` that dumps the tool calls, we've just moved the problem.
The composition argument is compelling, though. Instead of CLIs, what if the agent could write code where the tools are made available as functions?
> what if the agent could write code where the tools are made available as functions?
Exactly, that would be of great help.
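For illustration, here's a minimal sketch of what that could look like, assuming a hypothetical MCP client object with `list_tools()` and `call_tool()` methods (real MCP SDKs differ in naming and are typically async): each tool becomes a plain Python function the agent can compose in generated code, instead of making one context-visible tool call per step.

```python
# Hypothetical sketch: expose MCP tools as plain Python functions so an agent
# can compose them in generated code. `mcp_client` is an assumed client object
# with list_tools() and call_tool(name, arguments) methods; real MCP SDKs
# differ in naming and are typically async.

def make_tool_functions(mcp_client):
    """Return a dict mapping each tool name to a plain callable."""
    functions = {}
    for tool in mcp_client.list_tools():
        def call(arguments=None, _name=tool.name):
            # Each call goes through the MCP client; only the return value
            # (not the raw protocol traffic) needs to reach the model.
            return mcp_client.call_tool(_name, arguments or {})
        functions[tool.name] = call
    return functions

# Agent-generated code could then compose tools like ordinary functions:
# tools = make_tool_functions(client)
# places = tools["search_places"]({"query": "boxing gyms in Amsterdam"})
# names = [p["name"] for p in places["results"]]
```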
> If there's an `mcpcli --help` that dumps the tool calls, we've just moved the problem.
I see I worded my comment completely wrong... My bad. Indeed, MCP tool definitions should probably be in context. What I dislike about MCP is that the I/O immediately goes into context in the AI agents I've seen.
Example: very early on, when Cursor had just received beta MCP support, I tried a Google Maps MCP from somewhere on the net and asked Cursor "Find me boxing gyms in Amsterdam". The MCP call then dumped a massive HATEOAS-annotated JSON response, causing Cursor to run out of context immediately. If it had been a CLI tool instead, Cursor could have wrapped it in, say, a `jq` filter to keep the context clean(er).
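Something like this hypothetical sketch (the `gmaps` command and JSON field names are made up for illustration): the CLI output is filtered through `jq` before anything reaches the model, so only the fields you care about land in context.

```python
# Hypothetical sketch: run a CLI tool and filter its output with jq before the
# result ever reaches the model's context. The `gmaps` command and the JSON
# field names are made up for illustration.
import subprocess

def search_gyms(city: str) -> str:
    # jq keeps only name and address, so the bulky HATEOAS-annotated response
    # never lands in context.
    pipeline = (
        f"gmaps search 'boxing gyms in {city}' "
        "| jq -c '[.results[] | {name, address}]'"
    )
    result = subprocess.run(pipeline, shell=True, capture_output=True,
                            text=True, check=True)
    return result.stdout.strip()

# print(search_gyms("Amsterdam"))
```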
I mean, what was keeping Cursor from running jq there? It's just a matter of poor integration - which is largely why there was a rethink of "we just made this harder on ourselves, let's accomplish this with skills instead".
Better yet is a system that activates skills in certain situations. I use hooks for this with Claude; it works great. The skill descriptions are "Do not activate unless instructed by guidance."
Example: a Python file is read or written, and guidance is given back (once, with a long cooldown) to activate the global and company-specific Python skills. Claude activates the skills and writes Python to our preference.
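A minimal sketch of that kind of hook script, assuming the hook receives the tool call as JSON on stdin and that whatever it prints is surfaced back as guidance; the exact payload fields, cooldown mechanism, and file paths here are assumptions, not how my setup literally works.

```python
#!/usr/bin/env python3
# Hypothetical hook sketch: nudge the agent to activate Python skills when a
# .py file is touched. Assumes the tool call arrives as JSON on stdin and that
# stdout is fed back as guidance; field names and the cooldown file are
# placeholders.
import json
import sys
import time
from pathlib import Path

COOLDOWN_SECONDS = 3600
MARKER = Path("/tmp/python-skill-guidance.last")

payload = json.load(sys.stdin)
file_path = payload.get("tool_input", {}).get("file_path", "")

if file_path.endswith(".py"):
    last = MARKER.stat().st_mtime if MARKER.exists() else 0
    if time.time() - last > COOLDOWN_SECONDS:
        MARKER.touch()  # start the cooldown so guidance is only given once in a while
        print("Guidance: activate the global and company-specific Python skills "
              "before continuing to edit Python files.")
```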
That depends on how you configure or implement your sandbox. If you let it have internet access as part of the sandbox, then yes, but that is your own choice.
Internet access is required to install third party packages, so given the choice almost no one would disable it for a coding agent sandbox.
In practice, it seems to me that the sandbox is only good enough to limit file system access to a certain project; everything else (code or secret exfiltration, installing vulnerable packages, adding prompt injection attacks for others to run) is fair game if you're in YOLO mode like pi here.
There's no PostCompact hook, unfortunately. You could try with PreCompact, giving back a message saying it's super duper important to re-read X, and hope that survives the compaction.
I use Claude Code to modify policies for Claude Code. (Think of, say, the regex auto-allow/deny rules, but a lot stronger.) I can do that with hot reload of the local development server; it works, but it had better not make any errors.
A setup like you describe would honestly be interesting to see, so long as it can roll back to a previous state. Otherwise the first mistake it makes will likely be its last.
Very interesting! This has a gem in the documentation: using the tool itself as a CI check. I hadn't considered treating unresolved comments (from, say, a person, CodeRabbit, or a similar tool) as a CI status failure. That's an excellent idea for AI-driven PRs.
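A rough sketch of what such a check could look like, using GitHub's GraphQL API to count unresolved review threads; the repo, PR number, and token handling are placeholders, and this is an assumption about one way to wire it up, not how the linked tool does it.

```python
# Hypothetical CI-step sketch: fail when a PR has unresolved review threads
# (whether left by a person or a bot like CodeRabbit). Uses GitHub's GraphQL
# API; repo, PR number, and token handling are placeholders.
import os
import sys
import requests

QUERY = """
query($owner: String!, $name: String!, $number: Int!) {
  repository(owner: $owner, name: $name) {
    pullRequest(number: $number) {
      reviewThreads(first: 100) { nodes { isResolved } }
    }
  }
}
"""

def unresolved_count(owner: str, name: str, number: int) -> int:
    resp = requests.post(
        "https://api.github.com/graphql",
        json={"query": QUERY,
              "variables": {"owner": owner, "name": name, "number": number}},
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        timeout=30,
    )
    resp.raise_for_status()
    threads = resp.json()["data"]["repository"]["pullRequest"]["reviewThreads"]["nodes"]
    return sum(1 for t in threads if not t["isResolved"])

if __name__ == "__main__":
    count = unresolved_count("example-org", "example-repo",
                             int(os.environ.get("PR_NUMBER", "1")))
    if count:
        sys.exit(f"{count} unresolved review comment thread(s); failing the check.")
```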
On a personal note: I hate LLM output being used to advertise a project. If you have something to share, have the decency to type it out yourself or at least redact the nonsense from it.
Lol, I thought it did a reasonably good job, but to each their own - this was the difference between releasing the project with decent documentation so others could use it, or not releasing it and just using it internally. :)
Depends on the build toolchain, but usually you'd hash the dependency file and use that hash as the cache key for a folder in which you keep your dependencies. You can also make a Docker image containing all your dependencies, but downloading and spinning that up usually takes about as long as installing the dependencies.
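The hashing idea, roughly, in Python terms; the `requirements.txt` name and cache layout are placeholders, and CI systems typically provide this natively via cache keys built from a hash of the lock file.

```python
# Sketch of the "hash the dependency file as the cache key" idea. The
# requirements.txt name and the cache directory layout are placeholders.
import hashlib
from pathlib import Path

def dependency_cache_key(dep_file: str = "requirements.txt") -> str:
    digest = hashlib.sha256(Path(dep_file).read_bytes()).hexdigest()
    return f"deps-{digest[:16]}"

def cache_dir_for(dep_file: str = "requirements.txt") -> Path:
    # If this directory already exists, restore the dependencies from cache;
    # otherwise install them here and upload the folder under the same key.
    return Path(".cache") / dependency_cache_key(dep_file)
```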
The article has the headline "AI Misses Nearly One-Third of Breast Cancers, Study Finds".
It also has the following quotes:
1. "The results were striking: 127 cancers, 30.7% of all cases, were missed by the AI system"
2. "However, the researchers also tested a potential solution. Two radiologists reviewed only the diffusion-weighted imaging"
3. "Their findings offered reassurance: DWI alone identified the majority of cancers the AI had overlooked, detecting 83.5% of missed lesions for one radiologist and 79.5% for the other. The readers showed substantial agreement in their interpretations, suggesting the method is both reliable and reproducible."
So, if you are saying that the article is "not about AI performance vs human performance", that's not correct.
The article very clearly makes claims about the performance of AI vs the performance of doctors.
The study doesn't have the ability to state anything about the performance of doctors vs the performance of AI, because of the issues I mentioned. That was my point.
But the study can't state anything about the sensitivity of AI either, because it doesn't compare the sensitivity of AI-based mammography (X-ray) analysis with that of human-reviewed mammography. Instead it compares AI-based mammography against human-read DWI, where the humans knew the results were all true positives. It's both a different task ("diagnose" vs "find a pattern to verify an existing diagnosis") and different data (X-ray vs MRI).
So, I don't think the claims from the article are valid in any way. And the study seems very flawed.
Also, attempting to measure sensitivity without also measuring specificity seems doubly flawed, because there are very big tradeoffs between the two.
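For reference, the standard definitions (not from the article):

sensitivity = TP / (TP + FN), i.e. the fraction of actual cancers that get flagged
specificity = TN / (TN + FP), i.e. the fraction of healthy cases correctly cleared

Pushing the decision threshold to raise one typically lowers the other, which is why reporting sensitivity in isolation says very little.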
Increasing sensitivity while also decreasing specificity can lead to unnecessary amputations. That's a very high cost. Also, studies have apparently shown that high false-positive rates for breast cancer can lead to increased cancer risk because they deter future screening.
Given that I don't have access to the actual study, I have to assume I am missing something. But I don't think it's what you think I'm missing.