A Capital Error

Phil Factor, 2019-02-11

There was a time, in the late seventies, when we jeered at Unix users. In CP/M, we had a modern operating system. It couldn't do that much, but it would run the payroll and accounting systems of a business, do stock control and a lot of other commercial tasks. I wrote many database-driven applications using it for City-of-London stock-brokers (KSAM/ISAM in those days).

Unix, by contrast, had problems that limited its commercial appeal. The licensing problems were not fixed until Linus and his team rewrote it. Unix was originally written by US university geeks who had little idea of the complexities of internationalisation. It had a binary collation because that was easiest with ASCII. This meant that a frog was a different thing to a Frog. A Unix geek could, we suspected, call his three sons john, John and jOhn, and be confident that they were unique identifiers. The idea of producing a commercial software product that could be supplied to all cultures, nationalities and languages never occurred to them.

The CP/M operating system was also written from scratch by a university lecturer and some of his friends and students, but the problem of accommodating the most common languages and cultures was fixed in the early Eighties, mostly funded by Xerox, who wanted to introduce a range of CP/M-based word processors around Europe. The experience and the solutions soon spread to the new MS-DOS.

While doing some Linux-based development work recently, I was taken aback by hitting that same old 1970s US-Academic-geek culture. It was like coming face-to-face with a velociraptor. The extinct lives. What is the virtue of a binary collation? Capital and lowercase are just two ways of writing the same character. To say they are different is as fat-headed as saying that italic or bold makes them different. What is the reason for saying that an accented character is necessarily different? In some cultures they are, and in some they aren't. The French seem to apply them nowadays with the same abandon as salad dressing. Internationalisation is a messy problem. I remember once doing a big internationalisation project and sitting back in my chair with satisfaction, only to be informed that the Semitic-based Middle-Eastern countries wrote backwards from right to left.
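To make the complaint concrete, here is a minimal sketch in Python of the difference between a binary comparison and a case-insensitive or accent-insensitive one. The helper names (`case_insensitive_key`, `accent_insensitive_key`) are invented for illustration, and the accent-stripping approach is a deliberate simplification: a real collation is locale-aware, and whether an accent matters depends on the language.

```python
import unicodedata


def case_insensitive_key(s: str) -> str:
    # casefold() is a more aggressive lower() intended for caseless matching.
    return s.casefold()


def accent_insensitive_key(s: str) -> str:
    # Decompose each character into base letter + combining marks (NFD),
    # then drop the combining marks. This is a crude approximation of an
    # accent-insensitive collation, not a locale-aware one.
    decomposed = unicodedata.normalize("NFD", s.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


sons = ["john", "John", "jOhn"]

# Binary comparison: three distinct identifiers.
distinct_binary = len(set(sons))

# Case-insensitive comparison: one name, written three ways.
distinct_caseless = len({case_insensitive_key(n) for n in sons})

words = ["frog", "Frog", "apple", "Zebra"]

# Binary sort order puts every capital letter before every lowercase one.
binary_sorted = sorted(words)

# Sorting on the casefolded key keeps frog and Frog together.
caseless_sorted = sorted(words, key=case_insensitive_key)

print(distinct_binary, distinct_caseless)
print(binary_sorted)
print(caseless_sorted)
print(accent_insensitive_key("Élan") == accent_insensitive_key("elan"))
```

Under a binary ordering, Zebra sorts before apple, which is rarely what anyone reading the list would expect.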

While we're on the topic of Unix nonsenses, what about the bizarre idea that indices start at zero rather than one? This is a ghastly error that makes a mockery of the zero concept, and of the vernacular understanding of sequence. Ah, here is john, my zero'th son. John, my son number 1, was born a year later.

What about databases? Fortunately, in SQL Server they pretty well nailed the collation problem. However, MongoDB still ships with a binary collation, though now you can impose something more sensible on the data. I still get tripped up with regular expressions, though, which ignore collation and so are case-sensitive by default.
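The regular-expression trap is easy to reproduce. The sketch below uses Python's standard `re` module purely as a stand-in; it is not MongoDB's regex engine, though the default of case-sensitive matching is common to most regex implementations. The sample text is made up.

```python
import re

text = "A Frog is still a frog."

# Default behaviour: the pattern matches only the exact case given.
case_sensitive = re.findall(r"frog", text)

# Case-insensitivity has to be requested explicitly, with a flag...
with_flag = re.findall(r"frog", text, flags=re.IGNORECASE)

# ...or with an inline modifier at the start of the pattern.
with_inline = re.findall(r"(?i)frog", text)

print(case_sensitive)
print(with_flag)
print(with_inline)
```

Whatever collation the data carries, the pattern match pays no attention to it, so the flag has to be remembered every time.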

It seems that anything originating in Unix/Linux is infected with this silliness of binary collation and the zero first index. It is so entrenched that people think that there is method in the madness. Actually, not: it is just madness.

Phil Factor




