So, if you want to do struct bytes { size_t size; unsigned char *data; }; now yo...

matheusmoreira · on Aug 27, 2022

Most if not all string algorithms will eventually do this anyway:

  size_t length = strlen("some string");

It's so common. Might as well memoize it so it's always available with no need to constantly loop through strings which is an O(N) algorithm. So many string algorithms call strlen, often multiple times on the same string. I remember GTA V took 6 minutes to parse a goddamn JSON file because of stuff like this and part of the fix was to store the string lengths.

https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...
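A minimal sketch of the difference, not the code from the linked article. The helper names `count_commas_slow` and `count_commas_cached` are hypothetical. The first re-measures the string on every iteration (quadratic unless the compiler hoists the call); the second measures once and reuses the stored length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: calls strlen in the loop condition, so the string
   may be re-scanned on every iteration: O(N^2) in the worst case. */
size_t count_commas_slow(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (size_t i = 0; i < strlen(s); i++)
        if (s[i] == ',')
            count++;
    return count;
}

/* Same result, but the length is measured once and reused: O(N). */
size_t count_commas_cached(const char *s)
{
    assert(s != NULL);
    const size_t length = strlen(s); /* the memoized length */
    size_t count = 0;
    for (size_t i = 0; i < length; i++)
        if (s[i] == ',')
            count++;
    return count;
}
```

Both functions return the same count; only the amount of scanning differs.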

So is a length variable really such a big deal? It even fits in a register.

I understand and agree with your one memory span point. Ideally they should be located as close as possible in memory.

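  #include <stdlib.h>
  #include <string.h>

  struct bytes { size_t size; unsigned char data[]; };

  const char *literal = "some text";
  size_t length = strlen(literal);

  /* One allocation holds both the size and the data (no NUL terminator). */
  struct bytes *text = malloc(sizeof(*text) + length);
  if (text != NULL) {
    text->size = length;
    memcpy(text->data, literal, length);
  }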

morelisp · on Aug 27, 2022

I'm not anti-length-variable! I'm saying, which of

  struct bytes { size_t size; unsigned char *data; };
  struct bytes { size_t size; unsigned char data[]; };

you want depends heavily on what you're doing with the string(s) - plus other common variations like len+cap instead of just size, SSO, etc.

So if you can't standardize the data structure, what's the common interface? A function that takes a pointer and a length - which is what we already have. So everyone in this thread appealing to the C standardization process or stdlib to do something wants instead - what, exactly?
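To illustrate the "pointer and a length" interface: a rough sketch in which one function serves both layouts. The names `bytes_ptr`, `bytes_inline`, `span_equal`, and `bytes_inline_new` are made up for this example (the two structs are renamed only so they can coexist in one file).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct bytes_ptr    { size_t size; unsigned char *data; };  /* data lives elsewhere */
struct bytes_inline { size_t size; unsigned char data[]; }; /* data follows size */

/* The common interface: it never sees either struct, only spans. */
bool span_equal(const unsigned char *a, size_t a_len,
                const unsigned char *b, size_t b_len)
{
    assert(a != NULL && b != NULL);
    return a_len == b_len && memcmp(a, b, a_len) == 0;
}

/* Hypothetical constructor for the single-allocation layout.
   Returns NULL if allocation fails; the caller frees the result. */
struct bytes_inline *bytes_inline_new(const void *src, size_t size)
{
    assert(src != NULL);
    struct bytes_inline *b = malloc(sizeof(*b) + size);
    if (b == NULL)
        return NULL;
    b->size = size;
    memcpy(b->data, src, size);
    return b;
}
```

A caller passes `x.data, x.size` or `y->data, y->size` and the function cannot tell which layout the bytes came from.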

alcover · on Aug 27, 2022

  > then what's the point?

There's no point. Moreover, passing the length separately is annoying and error-prone.

You can have both of your models in a single type. That's what my buffer lib achieves: SSO + slicing.

https://github.com/alcover/buffet

morelisp · on Aug 27, 2022

It’s clever and probably makes sense in some contexts, but 24 bytes of overhead, not counting the store, feels like a nonstarter for many cases.

alcover · on Aug 27, 2022

You'd help me by explaining those cases. I'm not connected to the 'industry'.

Also, some major implementations are 24 or even 32 bytes. With a generous SSO you catch a lot of strings without overhead.

morelisp · on Aug 27, 2022

I'm not aware of any common string implementation that takes 3-4 words just to put an empty string in your struct, especially if it also still requires an external allocation with additional size words once the string gets above a certain size. Java takes 1, Go takes 2, SDS takes 1, libstdc++ takes 3 but doesn't require an external store later, etc.

alcover · on Aug 27, 2022

from https://github.com/elliotgoodrich/SSO-23/blob/master/README....

  Compiler  Size (bytes)
  MSVC      32
  GCC       32
  Clang     24

SDS is not typesafe and has no SSO or slices.

morelisp · on Aug 27, 2022

Your library is very nice.

magicalhippo · on Aug 27, 2022

Delphi uses a pointer to the latter, in addition to keeping the actual strings with a zero at the end. That way a cast to a C-style "string" is free.

In order to allow for the pointer to live on stack and minimize copying, the data is also reference counted and the compiler takes care of inserting the necessary reference counting calls where needed.

Overall it's pretty flexible, but the reference counting means it's not ideal to use shared strings in heavily threaded code. Of course the second a thread modifies a string, a new string is allocated and that thread can happily work on its "own" string.

Anyway, just yet another way of implementing strings.
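A loose C sketch of that kind of layout, assuming a simplified two-field header (Delphi's real header and its compiler-inserted calls differ; every name here is hypothetical). The handle points at the characters, the header sits just before them, and a NUL follows the data so the handle doubles as a C string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumed header, stored immediately before the characters. */
struct str_header {
    size_t refcount; /* not atomic here: fine for one thread only */
    size_t length;
};

/* Step back from the character pointer to the header. */
struct str_header *str_head(const char *s)
{
    return (struct str_header *)s - 1;
}

/* Returns a pointer to the characters (not the header), or NULL. */
char *str_new(const char *src, size_t length)
{
    assert(src != NULL);
    struct str_header *h = malloc(sizeof(*h) + length + 1);
    if (h == NULL)
        return NULL;
    h->refcount = 1;
    h->length = length;
    char *data = (char *)(h + 1);
    memcpy(data, src, length);
    data[length] = '\0'; /* keeps the cast to a C string free */
    return data;
}

size_t str_length(const char *s) { return str_head(s)->length; } /* O(1) */

void str_retain(char *s) { str_head(s)->refcount++; }

/* Drops one reference and frees the block when none remain. */
void str_release(char *s)
{
    struct str_header *h = str_head(s);
    if (--h->refcount == 0)
        free(h);
}
```

In a threaded program the counter updates would need to be atomic, which is the cost mentioned above for shared strings.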

masklinn · on Aug 27, 2022

> now you've got two allocations (or at least two separate memory regions, or at least a pointer wasted assuming it's constant) per dynamically-allocated thing.

Having a size_t on the stack is hardly an issue, it's what every low-level modern language does. It's fast, convenient, and pretty efficient. It also doesn't require deref'ing to get the length, which is a pretty common use case (e.g. checking if a string is empty, or too big, or something along those lines).
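As a rough illustration of that point, here is a pointer-plus-length pair passed by value. `str_view` and its helpers are hypothetical names, not an existing API. The emptiness and size checks read only the `size` field and never touch the character data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Two words, typically passed in registers or on the stack. */
struct str_view {
    const char *data;
    size_t size;
};

/* Measure once at the boundary with NUL-terminated code. */
struct str_view view_from_cstr(const char *s)
{
    assert(s != NULL);
    struct str_view v = { s, strlen(s) };
    return v;
}

/* Neither check dereferences v.data. */
bool view_is_empty(struct str_view v) { return v.size == 0; }
bool view_fits(struct str_view v, size_t max_size) { return v.size <= max_size; }
```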

morelisp · on Aug 27, 2022

That approach only works if the string isn't being mutated in a way which could change its size, though. Otherwise you need to make sure it has a lexical lifetime (and be very careful with it), or if that's not possible pay the double alloc cost.

I would also be worried about any difficulties that separating the length and data causes for prefetching and cache lines, though.

The second form is also more amenable to SSO; I'm not sure how often that would come up in the kernel, but it's saved me a decent chunk of memory in at least one past project. (Still today I'm sometimes frustrated by Go `string` when porting from Java `String`: great, now I don't have to pay a boxing overhead, but if the string is often absent/empty my base size is now 2x what it would otherwise be...)
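For readers unfamiliar with SSO (small string optimization), a minimal sketch under assumed parameters. The 15-byte inline capacity and all `sso_*` names are invented for illustration and are not taken from any library mentioned in this thread. Short strings live inside the struct; longer ones fall back to a heap allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum { SSO_CAPACITY = 15 }; /* assumed inline capacity, excluding the NUL */

struct sso_string {
    size_t size;
    union {
        char inline_data[SSO_CAPACITY + 1]; /* used when size <= SSO_CAPACITY */
        char *heap_data;                    /* used otherwise */
    } u;
};

bool sso_is_inline(const struct sso_string *s)
{
    return s->size <= SSO_CAPACITY;
}

const char *sso_data(const struct sso_string *s)
{
    return sso_is_inline(s) ? s->u.inline_data : s->u.heap_data;
}

/* Copies size bytes from src; returns false if a needed allocation fails. */
bool sso_init(struct sso_string *s, const char *src, size_t size)
{
    assert(s != NULL && src != NULL);
    char *dst = s->u.inline_data;
    if (size > SSO_CAPACITY) {
        dst = malloc(size + 1);
        if (dst == NULL)
            return false;
        s->u.heap_data = dst;
    }
    memcpy(dst, src, size);
    dst[size] = '\0';
    s->size = size;
    return true;
}

/* Frees the heap block, if any; inline strings own no extra memory. */
void sso_free(struct sso_string *s)
{
    if (!sso_is_inline(s))
        free(s->u.heap_data);
}
```

The trade-off discussed above is visible in the struct itself: every string, even an empty one, occupies the full inline footprint.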