
there doesn't seem to be any reasoning or evidence in that post supporting "uppercase is the better option", just that uppercase produces a larger number of word classes, which might be correct or incorrect

tchrist explains in that thread why neither uppercase nor lowercase is the best option:

> Mapping to lowercase doesn’t work for Unicode data, only for ASCII. You should be mapping to Unicode foldcase here, not lowercase. Otherwise yours is a Sisyphean task, since lowercase of Σίσυφος is σίσυφος, while lowercase of its uppercase, ΣΊΣΥΦΟΣ, is the correct σίσυφοσ, which is indeed the foldcase of all of those. Do you now understand why Unicode has a separate map? The casemappings are too complex for blindly mapping to anything not designed for that explicit purpose, and hence the presence of a 4th casemap in the Unicode casing tables: uppercase, titlecase, lowercase, foldcase.

of course 'σίσυφοσ' is not correct as a written word but if you were to encounter it then you should clearly consider it equivalent to 'σίσυφος'
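
A minimal Python sketch of the foldcase point in the quote, assuming `str.casefold()` as the fold map and an NFC normalization step added only so that composed and decomposed accents compare equal. The `fold` helper is a hypothetical name, not something from the linked thread:

```python
import unicodedata


def fold(word: str) -> str:
    """Normalize to NFC, then apply Unicode full case folding."""
    return unicodedata.normalize("NFC", word).casefold()


# Title case, lowercase, uppercase, and the "incorrect" medial-sigma spelling.
spellings = ["Σίσυφος", "σίσυφος", "ΣΊΣΥΦΟΣ", "σίσυφοσ"]

# Collect the distinct folded forms; one form means all spellings match.
folded = {fold(w) for w in spellings}
all_equivalent = len(folded) == 1
print(folded, all_equivalent)
```

Folding maps final sigma (ς) to medial sigma (σ), so the folded string is a comparison key only, not something to display.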



> there doesn't seem to be any reasoning or evidence in that post supporting "uppercase is the better option", just that uppercase produces a larger number of word classes, which might be correct or incorrect

this sentence appears to be nonsense. the code doesn't check "word classes"; it case-folds two characters and compares them.


character classes then


it doesn't check character classes either. It literally takes two characters, then uppercases both and compares, then lowercases both and compares. I have no idea where you are getting that it has anything to do with word or character classes. It doesn't.
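
A rough Python sketch of the comparison as described above, not the linked program itself. `compare_both_ways` is a hypothetical name, and the two character pairs are illustrative picks:

```python
def compare_both_ways(a: str, b: str) -> tuple[bool, bool]:
    """Return (equal after uppercasing, equal after lowercasing)."""
    return (a.upper() == b.upper(), a.lower() == b.lower())


# Final sigma vs. medial sigma: both uppercase to Σ, but they are
# distinct lowercase letters.
sigma_pair = compare_both_ways("ς", "σ")

# KELVIN SIGN (U+212A) vs. "k": the Kelvin sign lowercases to "k",
# but it is its own uppercase, while "k" uppercases to ASCII "K".
kelvin_pair = compare_both_ways("\u212a", "k")

print(sigma_pair, kelvin_pair)
```

Each map catches a pair the other misses, which is why neither result alone settles which map is "more accurate".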


by 'word class' i meant 'a set of words that are considered equivalent by whatever your equivalency relation is'

similarly for 'character class'

cf. https://en.wikipedia.org/wiki/Equivalence_class

what i thought the linked program did was that it counted how many of those there were

now on looking at it further i can see that it doesn't seem to be doing that but i don't have any idea what it does do

however, it definitely doesn't take into account the information you would need to learn anything about which candidate equivalency relation is better, which is something you'd need to examine at the word level at least, considering examples like größte, Σίσυφος, and the notoriously fatal sıkışınca/sikişinca pair
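
A small Python sketch of the word-level examples, assuming Python's locale-independent `str.upper()` and `str.casefold()`. The `nfc` helper is a hypothetical name, and a Turkish-aware library could behave differently:

```python
import unicodedata


def nfc(word: str) -> str:
    """Normalize to NFC so accented letters have one representation."""
    return unicodedata.normalize("NFC", word)


dotless = nfc("sıkışınca")
dotted = nfc("sikişinca")

# Uppercasing sends both dotless ı and dotted i to "I", merging two
# different Turkish words.
merged_by_upper = dotless.upper() == dotted.upper()

# Default case folding leaves dotless ı alone, so the words stay apart.
merged_by_fold = dotless.casefold() == dotted.casefold()

# ß uppercases to "SS", so lowercasing the uppercase form does not
# round-trip, while the folded forms still agree.
word = nfc("größte")
roundtrip_ok = word.upper().lower() == word
fold_ok = word.upper().casefold() == word.casefold()

print(merged_by_upper, merged_by_fold, roundtrip_ok, fold_ok)
```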


> doesn't take into account the information you would need to learn anything about which candidate equivalency relation is better

OK, no one said it did that. It's purely comparing characters, which is and always was what I said it was doing. And somehow it took 5 comments before you even decided to actually read the answer. Maybe next time you should start by actually reviewing and understanding what you are commenting on, before making multiple comments.


you cited it to support your proposition, 'in regards to accuracy, uppercase is the better option'

i reviewed it sufficiently to see that it's irrelevant to the question of whether that's true or not, and to pull the actually right answer out of the thread, and quote it above


> you cited it to support your proposition, 'in regards to accuracy, uppercase is the better option'

which is true

> i reviewed it sufficiently

good joke



