
The more I learn about text representation and Unicode, the more it looks like a complete clusterfuck, and it boggles my mind that somehow all this works almost perfectly while hiding all the complexities from the end user.

I suppose this is inevitable when you're tasked with representing literally every symbol in existence. You couldn't pay me enough to touch this problem with a ten-foot pole (this and text rendering).



It’s not a clusterfuck and IMHO it’s an unfair characterization. It is insanely complicated and shouldn’t be touched except when wearing appropriate hazmat gear.

Writing seems simple — children do it routinely — but like a biological system it evolved over millennia in a ton of different directions. It’s coupled with emotional, practical, and even, yes, moral issues that operate on both deeply personal and social levels. This is hard to capture in software.

Unicode made a couple of hard decisions right up front. I hate them but they were smart, and Unicode would not have survived had they not made them. One was round-trip compatibility with legacy character sets, which meant encoding a lot of redundant characters (English and German “A” have the same code point, but Greek “A” and Russian “A” do not, nor does an “A” that appears in a Japanese code table). The second was abandoning attempts at Han unification, which had its own linguistic, emotional, and political issues.
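
A quick way to see the first point (a sketch in Python): these three "A"s look identical but occupy different code points, while English and German share theirs.

    # Latin "A" (shared by English and German), Greek capital alpha,
    # and Cyrillic capital A look alike but are distinct code points.
    for ch in "AΑА":                  # U+0041, U+0391, U+0410
        print(ch, hex(ord(ch)))
    # A 0x41
    # Α 0x391
    # А 0x410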

People are complicated, and so are their languages, so wrestling the whole thing into a tractable system has been worth the effort.


> abandoning attempts at Han unification

Huh? Han unification happened.


It is quite different from what was originally proposed, but you are right, I should not have phrased it that way.


While the goal and work of Unicode are admirable, I can't help but fear that they're setting themselves up for future problems. Take for example flag emojis [1]. At first it seems "just" complicated. But then it starts to become problematic: what happens when a country changes flags? What happens when a country ceases to exist? Or splits? Or merges into another? What about when there are flag disputes?

Imagine if Unicode has to start dealing with the kind of temporal changes that, for example, the Olson TZ database [2] has to!

[1] https://shkspr.mobi/blog/2019/06/quirks-and-limitations-of-e...

[2] https://en.wikipedia.org/wiki/Tz_database


This is already not an issue. Unicode doesn't assign a separate codepoint to any flag. Each flag is represented by a two-character ISO code using regional indicator symbols (such as IN for the Indian flag).

https://en.wikipedia.org/wiki/Regional_Indicator_Symbol
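
As an illustration, a sketch in Python (the offset just maps ASCII letters onto the regional indicator block starting at U+1F1E6):

    # Build a flag emoji from a two-letter ISO 3166-1 code by mapping
    # each ASCII letter onto its regional indicator symbol.
    RI_BASE = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A

    def flag(iso_code):
        return "".join(chr(RI_BASE + ord(c) - ord("A")) for c in iso_code.upper())

    print(flag("IN"))  # two regional indicator symbols; fonts render the Indian flag
    print(flag("ZZ"))  # equally valid code points, but no country: rendering is up to the vendor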


That's part of the point. Are we prepared to track those changes across time? What if there's an article written today with (Unicode Hong Kong flag) or (Unicode Crimean flag)? Those articles might mean to express something in a context where HK is a certain independent entity, or Crimea is Ukrainian. What if that article is displayed with a Chinese and Russian flag 10 years from now?


Technically they already have a flag dispute, over Taiwan, as I recall. Thankfully for the Unicode Consortium, they’ve managed to leave the implementation problems that it causes to the vendors.


As someone who used to work on country-list-related things: the existence of Taiwan as a country flag codepoint at all would be an issue for China. China will complain if you include Taiwan, and Taiwan will complain if omitted, so it's not fun appeasing both sides.


I imagine so, but it exists.


Agreed, just pointing it out. In our system, we had to display 'Taiwan, Province of China' for the Chinese users, and 'Taiwan' to everyone else, though that was just UI and the backend treated it identically.


Usually it works perfectly. Sometimes it doesn’t. I’m occasionally stunned by how such a fundamental thing as text representation can be ruined by obscure encoding issues. For example, there is absolutely no way to be certain of the character encoding of binary string data unless it is stored as metadata somewhere. Unicode attempts to solve this with the Byte Order Mark (BOM). If present, it tells us that a string is Unicode-encoded, and whether it’s big-endian or little-endian. However, the BOM is optional, so you can never know for sure that a string is Unicode.
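
A minimal sketch in Python of what BOM sniffing buys you and what it doesn't (the byte patterns are the standard Unicode BOMs; everything else is guesswork):

    # Sniff a Unicode byte order mark. If none is present, the encoding
    # simply cannot be determined from the bytes alone.
    BOMS = [
        (b"\xef\xbb\xbf", "utf-8-sig"),
        (b"\xff\xfe\x00\x00", "utf-32-le"),   # must be checked before utf-16-le
        (b"\x00\x00\xfe\xff", "utf-32-be"),
        (b"\xff\xfe", "utf-16-le"),
        (b"\xfe\xff", "utf-16-be"),
    ]

    def sniff_bom(data: bytes):
        for bom, name in BOMS:
            if data.startswith(bom):
                return name
        return None  # no BOM: could be UTF-8, Win-1252, Shift-JIS, ...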

One example of how this is a huge clusterfuck is that until recently, Windows Notepad opened and saved everything with the Win-1252 encoding (labeled "ANSI" in the app). The web and the other popular OSes, on the other hand, are standardized around UTF-8. So if you download a .txt file without a BOM from the web or another OS and open it in Notepad, characters that looked right in your browser can come out wrong in Notepad.
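
The classic symptom, sketched in Python: the same bytes read under the two assumptions.

    data = "café".encode("utf-8")     # b'caf\xc3\xa9'
    print(data.decode("utf-8"))       # café   (what the browser showed)
    print(data.decode("cp1252"))      # cafÃ©  (what old Notepad showed)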

There are smart algorithms out there that can detect character encoding pretty well, but none of them are perfect (as far as I know).
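
For instance, the third-party chardet package makes a statistical guess with a confidence score; it proves nothing, which is rather the point:

    import chardet  # third-party: pip install chardet

    guess = chardet.detect("Überraschungsmenü für die Straßenbahn".encode("utf-8"))
    print(guess)  # a dict like {'encoding': ..., 'confidence': ..., 'language': ...}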

The Win-1252 default and the fact that most computer users have no idea about character encoding have caused all sorts of headaches for me with the reporting software I work on.


I wouldn't call that an obscure encoding issue, but an absolutely fundamental one. Absent metadata, you can never be sure that some text (or actually any data) is in a specific encoding; at best, you can be sure that it is not in some specific encoding. As an (indirect) illustration, see polyglots (programs that are valid in multiple programming languages simultaneously):

https://en.wikipedia.org/wiki/Polyglot_(computing)
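
A small Python sketch of that asymmetry: a failed decode rules an encoding out, but a successful decode rules nothing in.

    blob = b"\xff\xfe\x41\x00"
    try:
        blob.decode("utf-8")          # 0xFF can never start a UTF-8 sequence
    except UnicodeDecodeError:
        print("definitely not UTF-8")
    print(blob.decode("utf-16"))      # 'A' (the codec consumes the BOM), plausible
    print(blob.decode("latin-1"))     # 'ÿþA\x00', equally "valid"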


You’re right. I just meant obscure from the end-user perspective. It’s not clear to uneducated users what’s wrong, only that their text looks weird.


It’s often the simple, fundamental things (text, time, images) that we think are easy to implement, but in fact are incredibly complex under the hood. It’s often the human “decentralization” that causes all the quirks and oddities that make things difficult to get right. Text encoding is a good example, date and time another one. Both actually have much more in common than you would think.


Perhaps it looks like it works "almost perfectly" because you're only using English (and similar Western languages)? The problems that arise with Asian text are numerous -- and they do frequently hit end users.


Historically, computing has been the backbone of bureaucracies for a very long time, and as bureaucracies do, they make people bend to their rules, and thus to the rules of computing. I'm German; the German alphabet is exactly the same as the English one, except we have four extra letters: äöü, the friendly umlauts, and ß. Since a lot of older computing systems did not handle these (7-bit ASCII or mainframe character sets), computing bent the language instead: Jägerstraße => Jaegerstrasse. A lot of unixish software doesn't handle spaces in names and such; people bowed to that as well.

The idea that computers should support cultures, and not the other way around, is pretty recent.


I'm not German, but afaik the spelling reform of 1996 that introduced ss as an always-alternative for ß was mainly aimed at simplification and unification. Do you have any support for your statement that it was because of insufficient support by IT systems?


Such an always-alternative doesn't exist. ß was changed to ss after short vowels; that is all. I think there is a rule to always use ss instead of ß (and ae instead of ä, etc.) when ß is not available, but that wasn't introduced in 1996; it is way older and less relevant today than it used to be.

Gruß, stkdump


> computing bent the language instead. Jägerstraße => Jaegerstrasse.

Um, no. The words were originally written that way. Ä, ö, ü and ß actually developed from ligatures for ae, oe, ue and ss, long before computers were a thing.


What's your point exactly? Umlauts as we know them had been used for a few hundred years (hard to pinpoint, because öäü evolved in "casual" handwriting, not in printing or books) before computing came along, so they were clearly how the language worked. The motive force for using AE and SS in computing was clearly that computers commonly didn't support the real characters, not that people thought suddenly writing like this again a couple hundred years later would be fun.


Originally ß was just a ligature for ss (ſs), but it has since developed its own meaning: ß indicates that the preceding vowel is long. Busse and Buße are pronounced differently and mean different things. The conversion ß -> ss destroys information that was present in the original orthography.


> The more I learn about X and Y, the more it looks like a complete clusterfuck, and it boggles my mind that somehow all this works almost perfectly while hiding all the complexities from the end user.

I modified your first sentence to make it more generic and applicable to many other things in software.


As a frequent user of Unicode, for Chinese and Japanese, I sorta go along with it, but there's no arguing that we'd be closer to flying cars and jetpack commutes if computers just used ASCII.

Barring that, we could all have used UTF-8, but Windows really screwed that up, and none of the arguments for 16-bit alignment over 8-bit alignment for processing really hold water.


Microsoft implemented UTF-16 before UTF-8 was even invented, and certainly before it came into widespread use.


Technically speaking, no: they didn't have UTF-16 but rather UCS-2. UTF-8 was fully specified in 1992 and named UTF-8 in 1993, while UTF-16 wasn't specified until 1996; Windows NT 3.1 shipped with UCS-2 in 1993.
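
Roughly the difference, sketched in Python: UCS-2 stops at U+FFFF, while UTF-16 reaches the rest of the code space through surrogate pairs.

    # U+1F600 lies outside the Basic Multilingual Plane, so UCS-2 cannot
    # represent it at all; UTF-16 spends two code units (a surrogate pair).
    s = "\U0001F600"                        # grinning face emoji
    print(s.encode("utf-16-be").hex())      # d83dde00  (surrogates D83D DE00)
    print(len(s.encode("utf-16-be")) // 2)  # 2 code units for 1 code point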



