
So, if you want to do

  struct bytes { size_t size; unsigned char *data; };
now you've got two allocations (or at least two separate memory regions, or at least a pointer wasted assuming it's constant) per dynamically-allocated thing.

On the other hand, if you take a more direct mirror of a Pascal string:

  struct bytes { size_t size; unsigned char data[]; };
You're back to one memory span but can't reslice it.

And of course the worst codebase is when someone uses the first one because they want to keep slicing and someone else uses the second one because they need to save memory / indirections. To support both you end up writing functions that take a separate size and data pointer, and... well, then what's the point?
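A minimal C99 sketch of the trade-off described above. The names `bytes_ptr`, `bytes_fam`, `fam_new`, and `ptr_slice` are made up for illustration (the thread calls both layouts `struct bytes`) and do not come from any library.

```c
#include <stdlib.h>
#include <string.h>

/* Layout 1: header and data live apart, so a slice is just another header. */
struct bytes_ptr { size_t size; unsigned char *data; };

/* Layout 2: header and data share one allocation (C99 flexible array member). */
struct bytes_fam { size_t size; unsigned char data[]; };

/* One malloc covers both the size field and the payload. Returns NULL on failure. */
struct bytes_fam *fam_new(const void *src, size_t n) {
    struct bytes_fam *b = malloc(sizeof *b + n);
    if (b == NULL) return NULL;
    b->size = n;
    if (n != 0) memcpy(b->data, src, n);
    return b;
}

/* Reslicing layout 1 copies nothing: the result borrows the parent's bytes.
   The caller must keep off + n <= s.size and keep the parent alive. */
struct bytes_ptr ptr_slice(struct bytes_ptr s, size_t off, size_t n) {
    struct bytes_ptr out = { n, s.data + off };
    return out;
}
```

A sub-range of a `bytes_fam` cannot itself be a `bytes_fam` without copying, because the size field has to sit directly in front of the bytes it describes.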



Most if not all string algorithms will eventually do this anyway:

  size_t length = strlen("some string");
It's so common. Might as well memoize it so it's always available, with no need to constantly loop through strings, which is O(N) every time. So many string algorithms call strlen, often multiple times on the same string. I remember GTA V took 6 minutes to parse a goddamn JSON file because of stuff like this and part of the fix was to store the string lengths.

https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...

So is a length variable really such a big deal? It even fits in a register.
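To illustrate the cost being described, here is a small sketch with two hypothetical helpers (`count_char_rescan` and `count_char_known` are names invented for this example). It is not the code from the linked article, only the same general pattern.

```c
#include <stddef.h>
#include <string.h>

/* Calls strlen in the loop condition. Unless the optimizer hoists the call,
   every iteration rescans the string: O(N) per step, O(N^2) overall. */
size_t count_char_rescan(const char *s, char c) {
    size_t hits = 0;
    for (size_t i = 0; i < strlen(s); i++)
        if (s[i] == c) hits++;
    return hits;
}

/* Takes the length as a parameter, so the loop never searches for the
   terminator: O(N) overall. */
size_t count_char_known(const char *s, size_t len, char c) {
    size_t hits = 0;
    for (size_t i = 0; i < len; i++)
        if (s[i] == c) hits++;
    return hits;
}
```

Both functions return the same count; the second just needs the caller to have the length on hand, which is exactly what a stored length provides.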

I understand and agree with your one memory span point. Ideally they should be located as close as possible in memory.

  struct bytes { size_t size; unsigned char data[]; };

  const char *literal = "some text";
  size_t length = strlen(literal);

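  struct bytes *text = malloc(sizeof(*text) + length);  /* header + bytes in one block */
  if (text != NULL) {
    text->size = length;
    memcpy(text->data, literal, length);  /* no terminator stored; size carries the length */
  }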


I'm not anti-length-variable! I'm saying, which of

  struct bytes { size_t size; unsigned char *data; };
  struct bytes { size_t size; unsigned char data[]; };
you want depends heavily on what you're doing with the string(s) - plus other common variations like len+cap instead of just size, SSO, etc.

So if you can't standardize the data structure, what's the common interface? A function that takes a pointer and a length - which is what we already have. So everyone in this thread appealing to the C standardization process or stdlib to do something wants instead - what, exactly?
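A sketch of what that common interface looks like in practice. `span_equal` and `ptr_equals_fam` are hypothetical names, and the two structs are renamed only so that both can coexist in one file.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* The two layouts from the thread, renamed so both can coexist. */
struct bytes_ptr { size_t size; unsigned char *data; };
struct bytes_fam { size_t size; unsigned char data[]; };

/* The common interface: neither layout appears in the signature. */
bool span_equal(const unsigned char *a, size_t an,
                const unsigned char *b, size_t bn) {
    return an == bn && (an == 0 || memcmp(a, b, an) == 0);
}

/* Thin adapter: each layout decays to (pointer, length) at the call site. */
bool ptr_equals_fam(const struct bytes_ptr *p, const struct bytes_fam *f) {
    return span_equal(p->data, p->size, f->data, f->size);
}
```

The function that does the work only ever sees a pointer and a length, which is the interface C already has.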


> then what's the point?
There's no point. Moreover, passing the length separately is annoying and error-prone.

You can have both your models in a single type. That's what my buffer lib achieves: SSO + slicing.

https://github.com/alcover/buffet


It’s clever and probably makes sense in some contexts, but 24 bytes of overhead, not counting the store, feels like a nonstarter for many cases.


You'd help me by explaining those cases. I'm not connected to the 'industry'.

Also, some major implementations are 24 or even 32 bytes. With a generous SSO you catch a lot of strings without overhead.
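For readers unfamiliar with SSO (small-string optimization), a generic sketch follows. It is not the layout used by buffet or by any standard library; `sso_str`, `SSO_INLINE_MAX`, and the functions are hypothetical, and the 15-byte inline capacity is an arbitrary choice.

```c
#include <stdlib.h>
#include <string.h>

enum { SSO_INLINE_MAX = 15 };  /* arbitrary inline capacity for this sketch */

struct sso_str {
    size_t len;
    union {
        char  small[SSO_INLINE_MAX + 1];  /* active when len <= SSO_INLINE_MAX */
        char *heap;                       /* active otherwise */
    } u;
};

/* Copies src into s. Short strings are stored inline with no allocation.
   Returns 0 on success, -1 if a needed allocation fails. */
int sso_init(struct sso_str *s, const char *src) {
    size_t len = strlen(src);
    char *dst = s->u.small;
    if (len > SSO_INLINE_MAX) {
        dst = malloc(len + 1);
        if (dst == NULL) return -1;
        s->u.heap = dst;
    }
    memcpy(dst, src, len + 1);  /* includes the terminator */
    s->len = len;
    return 0;
}

const char *sso_cstr(const struct sso_str *s) {
    return s->len > SSO_INLINE_MAX ? s->u.heap : s->u.small;
}

void sso_free(struct sso_str *s) {
    if (s->len > SSO_INLINE_MAX) free(s->u.heap);
}
```

The struct is larger than a bare pointer-plus-length pair, which is the overhead being debated here; in exchange, any string that fits inline costs no allocation at all.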


I'm not aware of any common string implementation that takes 3-4 words just to put an empty string in your struct especially if it also still requires external allocation with additional size words once the string gets above a certain size. Java takes 1, Go takes 2, SDS takes 1, libstdc++ takes 3 but doesn't require an external store later, etc.


from https://github.com/elliotgoodrich/SSO-23/blob/master/README....

  MSVC   32
  GCC    32
  Clang  24
SDS is not typesafe, no SSO, no slices.


Your library is very nice.


Delphi uses a pointer to the latter, in addition to keeping the actual strings with a zero at the end. That way a cast to a C-style "string" is free.

In order to allow for the pointer to live on stack and minimize copying, the data is also reference counted and the compiler takes care of inserting the necessary reference counting calls where needed.

Overall it's pretty flexible, but the reference counting means it's not ideal to use shared strings in heavily threaded code. Of course the second a thread modifies a string, a new string is allocated and that thread can happily work on its "own" string.

Anyway, just yet another way of implementing strings.
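A rough C sketch of the scheme described above: a header in front of the characters, a pointer handed out to the characters themselves, and a terminator kept at the end. This is not Delphi's actual header layout; `rc_header` and the `rc_*` functions are hypothetical, and the plain counter is deliberately not thread-safe.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical header stored immediately before the characters. */
struct rc_header { size_t refs; size_t len; };

/* Returns a pointer to the characters, not to the header, so the result can
   be passed to anything expecting a NUL-terminated C string. */
char *rc_new(const char *src) {
    size_t len = strlen(src);
    struct rc_header *h = malloc(sizeof *h + len + 1);
    if (h == NULL) return NULL;
    h->refs = 1;
    h->len = len;
    char *chars = (char *)(h + 1);
    memcpy(chars, src, len + 1);  /* copies the terminator too */
    return chars;
}

/* O(1): step back from the characters to the header. */
size_t rc_len(const char *s) {
    return ((const struct rc_header *)(const void *)s - 1)->len;
}

char *rc_retain(char *s) {
    ((struct rc_header *)(void *)s - 1)->refs++;
    return s;
}

/* Frees the whole block once the last reference is dropped. */
void rc_release(char *s) {
    struct rc_header *h = (struct rc_header *)(void *)s - 1;
    if (--h->refs == 0) free(h);
}
```

In Delphi the compiler inserts the retain and release calls; in this sketch they are manual, and sharing across threads would need an atomic counter.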


> now you've got two allocations (or at least two separate memory regions, or at least a pointer wasted assuming it's constant) per dynamically-allocated thing.

Having a size_t on the stack is hardly an issue, it's what every low-level modern language does. It's fast, convenient, and pretty efficient. It also doesn't require deref'ing to get the length, which is a pretty common use case (e.g. checking if a string is empty, or too big, or something along those lines).


That approach only works if the string isn't being mutated in a way which could change its size, though. Otherwise you need to make sure it has a lexical lifetime (and be very careful with it), or if that's not possible pay the double alloc cost.

I would also be worried about any difficulties that separating the length from the data causes for prefetching and cache lines, though.

The second form is also more amenable to SSO; I'm not sure how often that would come up in the kernel, but it's saved me a decent chunk of memory in at least one past project. (Still today I'm sometimes frustrated by Go `string` when porting from Java `String`: great, now I don't have to pay a boxing overhead, but if it's often absent/empty my base size is now 2x what it would otherwise be...)



