now you've got two allocations (or at least two separate memory regions, or at least a pointer wasted assuming it's constant) per dynamically-allocated thing.
On the other hand, if you take a more direct mirror of a Pascal string:
You're back to one memory span but can't reslice it.
And of course the worst codebase is when someone uses the first one because they want to keep slicing and someone else uses the second one because they need to save memory / indirections. To support both you end up writing functions that take a separate size and data pointer, and... well, then what's the point?
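For anyone following along, here is a rough sketch of the two layouts being contrasted and of the pointer-plus-length function you end up writing to serve both. All names (`str_view`, `pstr`, `view_slice`, `pstr_new`, `count_byte`) are made up for illustration and are not from any particular codebase.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Layout A (hypothetical name): length stored next to a pointer; bytes live elsewhere. */
struct str_view {
    size_t len;
    char  *data;
};

/* Layout B (hypothetical name): Pascal-style; length and bytes share one allocation. */
struct pstr {
    size_t len;
    char   data[];   /* C99 flexible array member */
};

/* Reslicing layout A is just pointer arithmetic: no copy, no allocation. */
static struct str_view view_slice(struct str_view s, size_t from, size_t to) {
    assert(from <= to && to <= s.len);
    struct str_view out = { to - from, s.data + from };
    return out;
}

/* Layout B cannot alias a substring; any new string needs its own block. */
static struct pstr *pstr_new(const char *bytes, size_t len) {
    assert(bytes != NULL);
    struct pstr *p = malloc(sizeof *p + len);
    if (p == NULL) return NULL;
    p->len = len;
    memcpy(p->data, bytes, len);
    return p;
}

/* The lowest common denominator: works for either layout. */
static size_t count_byte(const char *data, size_t len, char c) {
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (data[i] == c) n++;
    return n;
}
```

A caller holding either kind of string passes `s.data, s.len` or `p->data, p->len` to `count_byte`, which is exactly the "separate size and data pointer" interface the comment is complaining about.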
Most if not all string algorithms will eventually do this anyway:
size_t length = strlen("some string");
It's so common that you might as well memoize it, so the length is always available without repeatedly looping through the string, which is O(N). So many string algorithms call strlen, often multiple times on the same string. I remember GTA V took 6 minutes to parse a goddamn JSON file because of stuff like this, and part of the fix was to store the string lengths.
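A small sketch of the pattern, assuming nothing beyond the standard library; the function names are invented for the example. The first version calls strlen in the loop condition, so unless the optimizer manages to hoist the call, the loop is quadratic in the string length. The second computes the length once.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* As written, strlen(s) is evaluated on every iteration: O(N) work N times. */
static size_t count_spaces_slow(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    for (size_t i = 0; i < strlen(s); i++)
        if (s[i] == ' ') n++;
    return n;
}

/* Memoized: one O(N) scan for the length, then one O(N) pass over the bytes. */
static size_t count_spaces_fast(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    const size_t len = strlen(s);
    for (size_t i = 0; i < len; i++)
        if (s[i] == ' ') n++;
    return n;
}
```

Both return the same count; a string type that carries its length would skip even the single strlen call in the second version.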
you want depends heavily on what you're doing with the string(s) - plus other common variations like len+cap instead of just size, SSO, etc.
So if you can't standardize the data structure, what's the common interface? A function that takes a pointer and a length - which is what we already have. So everyone in this thread appealing to the C standardization process or stdlib to do something wants instead - what, exactly?
I'm not aware of any common string implementation that takes 3-4 words just to put an empty string in your struct, especially if it also still requires external allocation with additional size words once the string gets above a certain size. Java takes 1, Go takes 2, SDS takes 1, libstdc++ takes 3 but doesn't require an external store later, etc.
Delphi uses a pointer to the latter, in addition to keeping the actual strings with a zero at the end. That way a cast to a C-style "string" is free.
In order to allow for the pointer to live on stack and minimize copying, the data is also reference counted and the compiler takes care of inserting the necessary reference counting calls where needed.
Overall it's pretty flexible, but the reference counting means it's not ideal to use shared strings in heavily threaded code. Of course the second a thread modifies a string, a new string is allocated and that thread can happily work on its "own" string.
Anyway, just yet another way of implementing strings.
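A rough C sketch of that general idea, with the caveat that this is not Delphi's actual header layout or runtime API: the header fields and every `rcstr_*` name here are hypothetical. The pointer handed out points at the character data, the bookkeeping sits just before it, and the data is NUL-terminated so it can be passed to C functions as-is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header stored immediately before the character data. */
struct rc_header {
    size_t refs;
    size_t len;
};

/* Returns a pointer to the NUL-terminated bytes; the header precedes them. */
static char *rcstr_new(const char *src) {
    assert(src != NULL);
    size_t len = strlen(src);
    struct rc_header *h = malloc(sizeof *h + len + 1);
    if (h == NULL) return NULL;
    h->refs = 1;
    h->len = len;
    char *data = (char *)(h + 1);
    memcpy(data, src, len + 1);          /* copies the terminator too */
    return data;
}

static struct rc_header *rcstr_header(char *s) {
    return (struct rc_header *)(s - sizeof(struct rc_header));
}

static size_t rcstr_len(char *s) { return rcstr_header(s)->len; }   /* O(1) */

static char *rcstr_retain(char *s) { rcstr_header(s)->refs++; return s; }

static void rcstr_release(char *s) {
    struct rc_header *h = rcstr_header(s);
    if (--h->refs == 0) free(h);
}

/* Copy-on-write: hand back a private copy if anyone else holds a reference. */
static char *rcstr_make_unique(char *s) {
    struct rc_header *h = rcstr_header(s);
    if (h->refs == 1) return s;
    char *copy = rcstr_new(s);
    if (copy != NULL) h->refs--;         /* caller's reference moves to the copy */
    return copy;
}
```

The counter here is a plain `size_t`, so this sketch is single-threaded only. Sharing across threads would need atomic increments and decrements, which is the cost mentioned above.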
> now you've got two allocations (or at least two separate memory regions, or at least a pointer wasted assuming it's constant) per dynamically-allocated thing.
Having a size_t on the stack is hardly an issue; it's what every modern low-level language does. It's fast, convenient, and pretty efficient. It also doesn't require deref'ing to get the length, which is a pretty common use case (e.g. checking if a string is empty, or too big, or something along those lines).
That approach only works if the string isn't being mutated in a way which could change its size, though. Otherwise you need to make sure it has a lexical lifetime (and be very careful with it), or if that's not possible pay the double alloc cost.
I would also be worried about any difficulties that separating the length from the data causes for prefetching and cache-line locality, though.
The second form is also more amenable to SSO; I'm not sure how often that would come up in the kernel but it's saved me a decent chunk of memory in at least one past project. (Still today I'm sometimes frustrated by Go `string` when porting from Java `String`: great, now I don't have to pay a boxing overhead, but if it's often absent/empty my base size is now 2x what it would otherwise be...)
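For readers who haven't met SSO (small string optimization), here is a generic sketch of the idea. It is not tied to either layout discussed in the thread, and the names (`sso_str`, `SSO_CAP`, `sso_init`, and so on) are invented for the example. Strings short enough to fit in the space a pointer would occupy are stored inline; longer ones fall back to the heap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Inline capacity: one pointer's worth of bytes (an arbitrary choice for the sketch). */
#define SSO_CAP sizeof(char *)

struct sso_str {
    size_t len;
    union {
        char *heap;             /* active when len >  SSO_CAP */
        char  small[SSO_CAP];   /* active when len <= SSO_CAP */
    } u;
};

static bool sso_is_small(const struct sso_str *s) { return s->len <= SSO_CAP; }

static const char *sso_data(const struct sso_str *s) {
    return sso_is_small(s) ? s->u.small : s->u.heap;
}

/* Copies len bytes in; only touches the allocator for long strings. */
static bool sso_init(struct sso_str *s, const char *bytes, size_t len) {
    assert(s != NULL && bytes != NULL);
    s->len = len;
    if (len <= SSO_CAP) {
        memcpy(s->u.small, bytes, len);
        return true;
    }
    s->u.heap = malloc(len);
    if (s->u.heap == NULL) { s->len = 0; return false; }
    memcpy(s->u.heap, bytes, len);
    return true;
}

static void sso_free(struct sso_str *s) {
    if (!sso_is_small(s)) free(s->u.heap);
    s->len = 0;
}
```

An empty or short string costs two words and no allocation in this sketch. Real implementations usually pack the discriminator more tightly and reserve room for a terminator.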