
aktiur · on Aug 29, 2025

The text in this document also directly contradicts what you're saying. Put another way: the presence of a hot journal is how SQLite determines the database might be corrupted.

https://sqlite.org/lockingv3.html#hot_journals

aktiur · on April 26, 2024

It actually depends!

`bytes.decode` (and `str.encode`) have used UTF-8 as a default since at least Python 3.

However, the default encoding used for decoding file names is `sys.getfilesystemencoding()`, which is also UTF-8 on Windows and macOS, but varies with the locale on Linux (specifically with CODESET).

Finally, `open` will directly use `locale.getencoding()`.
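A minimal sketch for checking the three defaults on a given machine. It assumes CPython outside of UTF-8 mode, and it falls back to `locale.getpreferredencoding(False)` because `locale.getencoding()` only exists on Python 3.11 and later. The variable names are illustrative.

```python
import locale
import sys

# Default used by str.encode() / bytes.decode() when no codec is given.
default_codec = sys.getdefaultencoding()

# Codec used to convert file names between str and bytes.
filesystem_codec = sys.getfilesystemencoding()

# Codec open() falls back to in text mode when encoding= is omitted.
# locale.getencoding() is only available on Python 3.11+, so older
# interpreters use locale.getpreferredencoding(False) instead.
if hasattr(locale, "getencoding"):
    open_codec = locale.getencoding()
else:
    open_codec = locale.getpreferredencoding(False)

print("str.encode / bytes.decode:", default_codec)
print("file names:", filesystem_codec)
print("open() without encoding=:", open_codec)
```

Passing `encoding="utf-8"` to `open` explicitly avoids depending on the last value at all.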

aktiur · on April 26, 2024

> In 3.1 it was the default encoding of string (the type str I guess).

No, what was used was whatever `sys.getdefaultencoding()` returned, which was already UTF-8 in 3.1 (I checked the source code).

At that time, the format used for representing `str` objects in memory depended on whether you used a "narrow" (UTF-16) or "wide" (UTF-32) build of Python.

Fortunately, wide and narrow builds were abandoned in Python 3.3 (PEP 393), with a new way of representing strings: current Python will use one byte per character if there is no code point higher than U+00FF (ASCII and Latin-1), UCS-2 (UTF-16 without surrogate pairs) if there is no code point higher than U+FFFF, and UTF-32 otherwise. But that did not exist in 3.1, where you could either use the "narrow" build of Python (that used UTF-16) or the "wide" build (that used UTF-32).

See this article for a good overview of the history of strings in Python: https://tenthousandmeters.com/blog/python-behind-the-scenes-...

_ache_ · on April 26, 2024

Thank you! The documentation was misleading about "default encoding of string".

int_19h · on April 26, 2024

The simple thing to remember is that for all versions of Python going back 12 years, there's no such thing as "default encoding of string". A Python string is defined as a sequence of 32-bit Unicode codepoints, and that is how Python code perceives it in all respects. How it is stored internally is an implementation detail that does not affect you.

Dylan16807 · on April 27, 2024

32 bit specifically?

The most expansive Unicode has ever been was 31 bits, and UTF-8 is also capable of at most 31 bits.

int_19h · on April 27, 2024

You're right, the docs just say "Unicode codepoints", and standard facilities like "\U..." or chr() will refuse anything above U+10FFFF. However I'm not sure that still holds true when third-party native modules are in the picture.
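A quick check of that upper bound, as a sketch using only built-ins. It says nothing about what third-party native modules may do.

```python
import sys

# The largest code point Unicode defines is accepted.
highest = chr(0x10FFFF)
print(hex(ord(highest)), hex(sys.maxunicode))

# One past the end is rejected by chr() with a ValueError.
try:
    chr(0x110000)
except ValueError as exc:
    rejected = True
    print("rejected:", exc)
else:
    rejected = False
```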

aktiur · on April 26, 2024

> strings having an encoding and byte strings being for byte sequences without encodings

You got it kind of backwards. `str` are sequences of unicode codepoints (not UTF-8, which is a specific encoding for unicode codepoints), without reference to any encoding. `bytes` are arbitrary sequences of octets. If you have some `bytes` object that somehow stands for text, you need to know that it is text and what its encoding is to be able to interpret it correctly (by decoding it to `str`).

And if you have a `str` and want to serialize it (for writing or transmitting), you need to choose an encoding, because different encodings will generate different `bytes`.

As an example:

  >>> "évènement".encode("utf-8")
  b'\xc3\xa9v\xc3\xa8nement'
  >>> "évènement".encode("latin-1")
  b'\xe9v\xe8nement'
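The reverse direction shows why you need to know the encoding before decoding. This is a small sketch reusing the same word; the variable names are illustrative.

```python
text = "évènement"
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# Decoding with the codec that produced the bytes round-trips cleanly.
assert utf8_bytes.decode("utf-8") == text
assert latin1_bytes.decode("latin-1") == text

# Decoding UTF-8 bytes as Latin-1 succeeds but yields mojibake,
# because every byte value is a valid Latin-1 character.
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)

# Decoding Latin-1 bytes as UTF-8 fails: 0xE9 announces a multi-byte
# sequence, and the byte that follows is not a continuation byte.
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    decode_failed = True
    print("cannot decode:", exc)
else:
    decode_failed = False
```

The silent failure (wrong text, no exception) is the more dangerous of the two, which is why guessing the codec is risky.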

chrismorgan · on April 27, 2024

> `str` are sequences of unicode codepoints (not UTF-8, which is a specific encoding for unicode codepoints)

It’s worse than that, actually: UTF-8 is a specific encoding for sequences of Unicode scalar values (which means: code points minus the surrogate range U+D800–U+DFFF). Since str is a sequence of Unicode code points, this means you can make strings that cannot be encoded in any standard encoding:

  >>> '\udead'.encode('utf-16')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-16' codec can't encode character '\udead' in position 0: surrogates not allowed
  >>> '\ud83d\ude41'.encode('utf-8')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed

Python 3’s strings are a tragedy. They seized defeat from the jaws of victory.

account42 · on April 29, 2024

Maybe we need another PEP that switches the default to WTF-8 [0] aka UTF-8 but let's ignore that a chunk of code points was reserved as surrogates and just encode them like any other code point.

[0] https://simonsapin.github.io/wtf-8/
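For what it's worth, Python already ships an opt-in approximation of this idea: the `surrogatepass` error handler. The sketch below only covers a lone surrogate. As far as I understand the WTF-8 spec, it additionally requires a paired high and low surrogate to be merged into one four-byte sequence, which `surrogatepass` does not do, so treat this as an illustration and not as a WTF-8 implementation.

```python
lone = "\udead"

# The strict default refuses lone surrogates, as in the tracebacks above.
try:
    lone.encode("utf-8")
except UnicodeEncodeError:
    strict_failed = True
else:
    strict_failed = False

# "surrogatepass" encodes the surrogate with the ordinary three-byte pattern.
encoded = lone.encode("utf-8", "surrogatepass")
print(encoded)

# The same handler is needed to decode those bytes again.
decoded = encoded.decode("utf-8", "surrogatepass")
```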

chrismorgan · on April 29, 2024

My comment was completely unrelated to PEP 686. WTF-8 is emphatically not intended to be used as a file encoding.

lucb1e · on April 26, 2024

> `str` are sequences of unicode codepoints [...] without reference to any encoding

I guess I see it from the programmer's perspective: to handle bytes coming from the disk/network as a string, I need to specify an encoding, so they are (to me) byte sequences with an encoding assigned. Didn't realize strings don't have an encoding in Python's internal string handling but are, instead, something like an array of integers pointing to unicode code points. Not sure if this viewpoint means I am getting it backwards but I can see how that was phrased poorly on my part!

tialaramex · on April 26, 2024

There are two distinct questions here, to which implementations can provide different answers:

1. Interface: How can I interact with "string" values, and which operations can I perform versus which can't be done? Methods and operators provided go here.

2. Representation: What is actually stored (in memory)? Layout goes here.

So you may have understood (1) for Python, but you were badly off on (2). At some level this doesn't matter, but for performance the right choice obviously depends on (2). Most obviously, if the language represents strings as UTF-8 bytes, then "encoding" a string as UTF-8 will be extremely cheap, whereas if the language represents them as UTF-16 code units, the UTF-8 encoding operation will be a little slower.

lucb1e · on April 26, 2024

Alright, but don't leave us hanging: what does Python3 use for (2) that you say I was badly off on? (Or, in actuality, never thought about or meant to make claims about.) Now we still can't make good choices for performance!

https://stackoverflow.com/questions/1838170/what-is-internal... says Python 3.3 picks either a one-, two-, or four-byte representation depending on which is the smallest one that can represent all characters in a string. If you have one character in the string that requires >2 bytes to represent, it'll make every character take 4 bytes in memory such that you can have O(1) lookups on arbitrary offsets. The more you know :)
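This can be observed from Python itself. The sketch below assumes CPython 3.3 or later, since `sys.getsizeof` reports implementation-specific sizes. `bytes_per_char` is a hypothetical helper that differences two string sizes so the fixed object header cancels out.

```python
import sys

def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate storage per character by differencing two string sizes."""
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n

widths = {
    "a": bytes_per_char("a"),    # U+0061, ASCII
    "é": bytes_per_char("é"),    # U+00E9, still fits in one byte
    "€": bytes_per_char("€"),    # U+20AC, needs two bytes
    "😀": bytes_per_char("😀"),  # U+1F600, outside the BMP
}
print(widths)

# A single wide character widens the storage of the whole string.
ascii_size = sys.getsizeof("a" * 1000)
widened_size = sys.getsizeof("a" * 999 + "😀")
print(ascii_size, widened_size)
```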

aktiur · on April 27, 2024

Before Python 3.3, the format used for representing `str` objects in memory depended on whether you used a "narrow" (UTF-16) or "wide" (UTF-32) build of Python.

Fortunately, wide and narrow builds were abandoned in Python 3.3 (PEP 393), with a new way of representing strings: current Python will use one byte per character if there is no code point higher than U+00FF (ASCII and Latin-1), UCS-2 (UTF-16 without surrogate pairs) if there is no code point higher than U+FFFF, and UTF-32 otherwise.

See this article for a good overview of the history of strings in Python: https://tenthousandmeters.com/blog/python-behind-the-scenes-...

samus · on April 26, 2024

Since Java 9, the JRE does something similar: if a string contains only characters in ISO-8859-1 then it is stored as such, else the usual storage format (UTF-16, two bytes per code unit) is used.

tialaramex · on April 26, 2024

Yeah, I started writing about what you found (the answer to (2) for Python) and I realised that's a huge rabbit hole I was venturing down and decided to stop short and post, so, apologies I guess.

aktiur · on July 5, 2021

Does it not come down to the usual "cathedral vs. bazaar" opposition? SQLite, for which Fossil was originally built, lists three people on its developers page and looks pretty much like the definition of a "cathedral", whereas git was built by Linus Torvalds for Linux, which is the prototypical bazaar project.

It makes sense when you have a small team of people that know the project very well to record everything, and they can easily maintain stringent standards, like never committing anything that breaks the tests.

Whereas for a big project that involves thousands of people mailing patches around, some of them first-time contributors, you'd rather make sure that what ends up in the immutable log is clean enough.

richie_adler · on July 6, 2021

> Does it not come down to the usual "cathedral vs. bazaar" opposition?

This is expressly addressed in https://fossil-scm.org/home/doc/trunk/www/fossil-v-git.wiki (fifth row of the first table)

aktiur · on Feb 18, 2021

Ever read a good book, or played a video game, and known that you have to stop now but thought "I'll just read another chapter / play one more turn"?

I would say that's basically how addiction manifests itself: even if you know that you're dealing with substance abuse, come on, one more time won't matter, will it?

And then there's also the comfort aspect: you're getting back from a hard day's work, you're feeling tired and cranky, you do deserve something nice, don't you?

N.B.: I'm not saying that not being able to drop the book or stop playing that game IS addiction, just that substance addiction might feel the same way.

N.B. 2: And I'm not talking here about the medical aspects of withdrawal, because that's not the thing an addict would usually experience (withdrawal would only happen because you're trying to stop or cannot get access to the substance you need).

aktiur · on April 16, 2020

Question regarding Iodine-131: is it still being generated inside the spent fuel / nuclear waste in Chernobyl? From what I could gather, Iodine-131 is mostly a fission byproduct, and is not present in any of the 4 decay chains[1].

If none is still being generated, there should not be any significant quantity left considering it has a half-life of only 8 days.

[1]: https://en.wikipedia.org/wiki/Decay_chain
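The half-life argument can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the rounded 8-day half-life from above, the accident date of 1986-04-26, and the date of this comment as the end point. `fraction_remaining` is an illustrative helper.

```python
from datetime import date
from math import log10

def fraction_remaining(elapsed_days: float, half_life_days: float) -> float:
    """Fraction of the original nuclei left after elapsed_days."""
    return 0.5 ** (elapsed_days / half_life_days)

I131_HALF_LIFE_DAYS = 8.0  # rounded value used in the comment above

# Ten half-lives (80 days) already leave less than a thousandth.
after_80_days = fraction_remaining(80, I131_HALF_LIFE_DAYS)
print(after_80_days)

# Assumed dates: the accident (1986-04-26) and the day of this comment.
elapsed = (date(2020, 4, 16) - date(1986, 4, 26)).days
half_lives = elapsed / I131_HALF_LIFE_DAYS

# The fraction underflows a float, so report its base-10 exponent instead.
exponent = -half_lives * log10(2)
print(f"{half_lives:.0f} half-lives, fraction ~ 10^{exponent:.0f}")
```

So unless something is still producing it, the Iodine-131 from 1986 is gone for all practical purposes.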

Mvandenbergh · on April 16, 2020

Yeah, there's none left; activity is now dominated by caesium-137, which has a 30-year half-life.

egorfine · on April 16, 2020

Iodine is only generated in a working reactor.

aktiur · on Dec 9, 2017

The article also says that 1500 police homicides a year account for 8 to 10 per cent of all homicides. The corresponding figure in Germany is less than 0.3%.

aktiur · on Oct 5, 2017

A humorous take on Nassim Taleb that is actually spot on: http://www.karlremarks.com/2014/11/its-true-if-i-say-it-is-w...

aktiur · on Aug 8, 2017

It could. In France (which has a mixed system), the single payer covers around 70% of prescription costs. And you usually have an additional private plan that will cover the rest (and if you're too poor to buy one of these plans, you can be eligible to get it from the state).