Yeah, mid-late last year was one of the worst markets I've seen in my career, but the last couple of months things have really seemed to pick up speed.
I agree it works well. Although as a long-time TDD practitioner it is mildly frustrating that it has taken LLMs to get more people to realise it works!
I've always found TDD frustrating, especially in its Red/Green expression.
I find that my tests become too low-level, as I build up component by component. This hinders large-scale refactorings because my mental planning wants to avoid the extra effort of rewriting the tests to any new interface.
That refactoring can also make some of the tests unnecessary, so it felt like I was going through extra work for a small benefit which wasn't worthwhile at that stage.
I also found that red/green TDD's focus on "confirm that the tests fail before implementing the code to make them pass" (quoting the link) makes me think less about writing tests which aren't expected to fail, but which, if they do fail, indicate serious design problems.
As an example, I once evaluated a software package which was fully developed under TDD. It was a web app which, among other things, allowed anonymous and arbitrary users to download files with a URL like example.com/download?filename=xyz.txt
There were no tests for arbitrary path traversal, and when I tried it out with something like filename=../../config.ini I got access to the server's config file.
Now, it wasn't quite that bad. They required the filename end in only a handful of extensions, like ".pdf". Thing is, the developers didn't check for NUL characters, and their server was written in Java, which passes the string directly to the filesystem API, which in true C style expects NUL-terminated strings. My actual filename was more like "../../config.ini\0.pdf", with the NUL appropriately encoded as a URL path parameter. The Java code checked that the extension was allowed, then passed it to the filesystem call, which interpreted it as "../../config.ini", which gave me access to the system configuration - including poorly hashed admin passwords with almost no preimage resistance that I was able to break after a couple of hours of thinking about the algorithm.
The explicit NUL test is needed as a security test in Java. In Python it's a different class of error as Python's filesystem APIs raise a ValueError if the string contains a NUL.
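As a sketch of what that difference looks like in practice: the filename is the one from my probe above, and the extension check mirrors the one the Java code performed (the Java-side names are from memory, not the actual code):

```python
import os

# The naive extension check the Java code relied on happily accepts
# a filename with an embedded NUL before the ".pdf":
malicious = "../../config.ini\0.pdf"
assert malicious.endswith(".pdf")

# Python's filesystem APIs refuse to pass such a string down to the
# C-level call, raising ValueError instead of silently truncating:
try:
    os.stat(malicious)
    truncated = True   # Java-style behavior: NUL cuts the string short
except ValueError:
    truncated = False  # Python rejects the string outright
assert not truncated
```

So in Python the NUL test documents an exception path rather than guarding against silent truncation, which is why I call it a different class of error.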
I don't at all mean that good and useful software can't be written with TDD, nor that TDD is useless. Rather, it's that Red/Green TDD as a development practice appears to de-emphasize certain types of essential testing which don't fit into the red-green-refactor paradigm, but instead require a larger development methodology outside of TDD.
As for me personally, I'm strongly influenced by rapid prototyping - "spike and stabilize" I believe it's called - where the code goes through possibly several iterations for the API and implementation to stabilize to the point where the overhead of writing automated tests is outweighed by their benefit.
And these tests include tests which should pass, but which check boundary conditions, unexpected input, and the like.
To say nothing of choosing the right way to hash passwords, which doesn't easily fit into any test-based framework. :)
As to the specific tests in the linked-to piece, I think the tests are far from adequate. Consider the following test from the ChatGPT solution:
def test_ignore_in_fenced_code_block(self):
    md = textwrap.dedent("""
        # Real
        ```python
        # Not a header
        ## Also not
        ```
        ## Real too
        """).lstrip("\n")
    self.assertEqual(extract_headers(md),
                     [(1, "Real"), (2, "Real too")])
That's the only test for fenced code blocks. The relevant code is:
_FENCE_RE = re.compile(r"^[ \t]{0,3}(?P<fence>`{3,}|~{3,})(?P<info>.*)$")
...
m_fence = _FENCE_RE.match(line)
if m_fence:
    fence = m_fence.group("fence")
    char = fence[0]
    if not in_fence:
        in_fence = True
        fence_char = char
        fence_len = len(fence)
    else:
        if char == fence_char and len(fence) >= fence_len:
            in_fence = False
            fence_char = None
            fence_len = 0
    i += 1
    continue
if in_fence or is_blockquote(line):
    i += 1
    continue
You can see the code requires fences to start with at least 3 ` or ~ characters, and that the close fence must match. However, there's no test for mismatched fence characters, nor a test for ~ fences, nor a test for mismatched fence lengths.
Regular expressions are really tricky to test correctly. Each term should be interpreted as a branch, and therefore tested, like a test for "``" as not-a-fence, and tests for leading whitespace, like "\t \t~~~~~~" as a fence.
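For illustration, here's what a few of those branch-level checks look like when run directly against the _FENCE_RE pattern quoted above:

```python
import re

# The fence-detection regex from the ChatGPT solution, verbatim.
_FENCE_RE = re.compile(r"^[ \t]{0,3}(?P<fence>`{3,}|~{3,})(?P<info>.*)$")

assert _FENCE_RE.match("``") is None            # too short to be a fence
assert _FENCE_RE.match("~~~") is not None       # tilde fences are accepted
assert _FENCE_RE.match("\t \t~~~~~~") is not None  # up to 3 tabs/spaces allowed
assert _FENCE_RE.match("    ```") is None       # 4 spaces = indented code, not a fence
```

None of these branches are exercised by the single test in the article.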
For that matter, are leading tabs really allowed? https://github.github.com/gfm/#fenced-code-block says "indented no more than three spaces", "A space is U+0020" and "in contexts where whitespace helps to define block structure, tabs behave as if they were replaced by spaces with a tab stop of 4 characters".
Also, the _FENCE_RE can drop the "(?P<info>.*)$" as "info" isn't used, and the ".*" will match up to the end of line so "$" is guaranteed to match.
In the TDD view, there's no reason to add that group in the first place, so why is it there?
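You can check that the group is dead weight by comparing the full pattern against a trimmed one; a quick sanity sketch, not an exhaustive proof:

```python
import re

# Pattern from the article, and the same pattern with the unused
# "info" group (and the now-redundant "$") removed.
full = re.compile(r"^[ \t]{0,3}(?P<fence>`{3,}|~{3,})(?P<info>.*)$")
slim = re.compile(r"^[ \t]{0,3}(?P<fence>`{3,}|~{3,})")

# Both patterns agree on whether a line is a fence, and on the
# captured fence string, for a handful of representative lines.
for line in ["```python", "~~~", "``", "   ```info string", "plain text"]:
    m_full, m_slim = full.match(line), slim.match(line)
    assert (m_full is None) == (m_slim is None)
    if m_full is not None:
        assert m_full.group("fence") == m_slim.group("fence")
```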
My point isn't that the code is right or wrong (though I do think the support for leading tabs is invalid). I'm rather pointing out that the code is incompletely tested, and not what I would expect from Red/Green TDD, because there are code paths which aren't tested.
It's one infinitesimally small data point that can't be expected to move the needle.
Maybe if this becomes the standard response it would. But it seems like a ban would serve the same effect as the standard response because that would also be present in the next training runs.
I'm not sure that's true. While it obviously won't impact the general behavior of the models much, if you get a very similar situation the model will likely regurgitate something similar to this interaction.
Where's the accountability here? Good luck going after an LLM for writing defamatory blog posts.
If you wanted to make people agree that anonymity on the internet is no longer a right people should enjoy this sort of thing is exactly the way to go about it.
There is no accountability (for now, at least)... But if you want it to delete its own blog post defaming you, you'll evidently have better luck asking nicely than by being aggressive. (Which matches my experience with LLMs. As a rule, saccharine politeness works well on them.)
I thought it seemed like a great idea, but I never tried it. In a startup it seemed like an unnecessary source of risk, and in an enterprise it was too much hassle to convince stakeholders to switch from existing IaC products.
In my last company, we _did_ pay for Google Cloud support and when BigQuery jobs started to fail randomly, causing huge trouble producing critical reports, the response was essentially "we are investigating", "we have identified the issue", and "please wait for it to be fixed". Hardly what I would call support. They couldn't care less.
The post is light on details. I'd guess the author ended up hammering the API and they decided it was abuse.
I expect more reports like this. LLM providers are already selling tokens at a loss. If everyone starts to use tmux or orchestrate multiple agents then their loss on each plan is going to get much larger.
Author here, thanks for reading. Yes, naming is tricky. By mono-environment, I mean that there is one _long-lived_ environment to which we deploy software.
I'm constantly surprised by developers who like LLMs because "it's great for boilerplate". Why on earth were you wasting your time writing boilerplate before? These people are supposed to be programmers. Write code to generate the boilerplate or abstract it away.
I suppose the path of least resistance is to ignore the complexity, let the LLM deal with it, instead of stepping back and questioning why the complexity is even there.
> Write code to generate the boilerplate or abstract it away.
That doesn’t make any sense. I want to consider what you’re saying here but I can’t relate to this idea at all. Every project has boilerplate. It gets written once. I don’t know what code you’d write to generate that boilerplate that would be less effort than writing the boilerplate itself…
>Every project has boilerplate. It gets written once.
Agree with you - I think when my colleagues have talked about boilerplate they really mean two kinds of boilerplate: code written once for project setup, like you describe, and then repetitive code. And in the context of LLMs, they're talking about the repetitive code.
Interesting. I don't recall this happening during my studies. In my exams you didn't have time to cheat, leaving for the toilet was strongly discouraged, and if you did leave, an invigilator would stand behind you at the urinal or outside the cubicle.