Re: [Pan-users] Annoying ' in posts


From: Steven D'Aprano
Subject: Re: [Pan-users] Annoying ' in posts
Date: Sun, 23 Sep 2012 14:03:48 +1000
User-agent: Mozilla/5.0 (X11; Linux i686; rv:10.0.6esrpre) Gecko/20120717 Thunderbird/10.0.6

On 23/09/12 04:29, Paul Crawford wrote:

> What I hate about unicode was the idea of adopting 16-bit characters and
> thus breaking so much byte-orientated code that was written, tested, and
> integrated over the history of computing.

You make it sound like the Unicode Consortium hacked into people's computers
and changed their existing 8-bit ASCII files into 16-bit UCS-2 files. I'm
pretty sure that never happened.

The actual problem was with the application writers, for failing to
distinguish between file formats correctly.

If people failed to distinguish PNG files from GIF files (say, they decided
to keep the .gif file extension for PNG files), would you blame PNG for
breaking the code? Would you insist that all progress in creating better
image formats should cease, that 8-bit colour is enough for everybody? That
people should just stick to "plain images"?

Of course not. You would recognise that the fault was in the people who
stupidly decided that there was no need to distinguish between GIF and PNG,
and the programs that were already broken because they made certain
assumptions about the data they would be given but didn't cope well when
those assumptions were violated.

The problem you describe predates Unicode. The same problem occurs with
the older "code page" standards such as Latin-1, so-called "ANSI text",
and dozens of other encodings that predate Unicode, some of which were
multibyte.


The actual problem was two-fold:

* The writers of ASCII text editors and ASCII-only tools foolishly
  believed that there was such a thing as "plain text". Historically,
  that is understandable. That some people continue to think so is
  unforgivable willful ignorance.

* The writers of text editors foolishly had no mechanism for accurately
  determining the format used. On Unix, they assumed there was only one
  text format. On Windows, they used the same .txt file extension for
  all text files, regardless of format.


The second point is equivalent to insisting that all image files (JPEGs, TIFFs,
GIFs, PNGs, and dozens of others) should either have no file extension at
all, or all should use (say) ".bmp". That's fine, *if* you write your
program to detect formats you can't deal with and gracefully decline to
handle them. But people didn't do this, because they had this idea that
they were dealing with "plain text" instead of dozens of different formats.
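
To make "detect formats you can't deal with and gracefully decline" concrete,
here is a minimal sketch in Python of what an ASCII-only tool could do. The
function name read_ascii_text is made up for illustration, and the byte-order
mark check is only one cheap heuristic, not a complete format detector.

```python
def read_ascii_text(data: bytes) -> str:
    """Decode bytes as strict 7-bit ASCII, or refuse with a clear error.

    Hypothetical helper for illustration: it never guesses and never
    silently mangles bytes it does not understand.
    """
    # A UTF-16 byte-order mark means this is certainly not ASCII,
    # so say so explicitly rather than reporting a stray byte.
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        raise ValueError("input looks like UTF-16, not ASCII; refusing to guess")
    try:
        # Strict error handling is the default: any byte >= 0x80 raises.
        return data.decode("ascii")
    except UnicodeDecodeError as err:
        bad = data[err.start]
        raise ValueError(
            f"byte 0x{bad:02x} at offset {err.start} is not ASCII; "
            "refusing to guess the encoding"
        ) from err
```

A tool built this way still handles only ASCII, but it fails loudly on
Latin-1, UTF-16, or anything else instead of corrupting it.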

"Plain text" is one of the most pernicious, harmful, and *idiotic* memes
in computing, about up there with the idea that you only need two digits
to specify the year.

There has *never* been such a thing as "plain text" -- ASCII post-dates
older text formats such as the BCD codes that grew into EBCDIC, and there
have *always* been multiple single-byte text formats. To say nothing of
different conventions for line endings.

Adding multi-byte Unicode didn't create the problem. It just made the
problem obvious to those who were ignorant of it because they hardly
ever interchanged "plain text" files between (say) Unix and DOS, or
Windows and Macintosh, or IBM mainframes and Commodore home computers.

(And when they did, *both* sides grumbled that the *other* side didn't
know what "plain text" was.)

That's at least six different "plain texts" right there:

- ASCII with \n line endings
- ASCII with \r\n line endings and ^Z end-of-file marker
- "extended ASCII" with any of dozens of different code pages
- Mac 8-bit "extended ASCII" character set (MacRoman)
- EBCDIC
- PETSCII


People got away with these wrong-headed assumptions for so long because,
before the Internet, folks hardly ever interchanged text with users of
different languages and formats. But that was then, this is now, and
interchanging text in different languages and formats is all we do on the
Internet. Every file has a filename, every webpage is text.

Unicode is the solution to these dozens of incompatible text formats; it
is not the cause. The sooner people stop pining for a Golden Age of "good
ol' plain ASCII text" that never existed, and start using Unicode, the
better off we'll all be.
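
In practice, "using Unicode" mostly means naming the encoding every time text
crosses a file or network boundary rather than trusting a platform default.
A small sketch in Python, assuming UTF-8 as the interchange format (my choice
for the example; the helper name roundtrip is hypothetical):

```python
import os
import tempfile


def roundtrip(text: str, encoding: str = "utf-8") -> str:
    """Write text to a temporary file and read it back, naming the
    encoding and the line-ending convention explicitly both times."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # newline="\n" writes line endings exactly as given.
        with open(path, "w", encoding=encoding, newline="\n") as f:
            f.write(text)
        # newline="" disables line-ending translation on read.
        with open(path, "r", encoding=encoding, newline="") as f:
            return f.read()
    finally:
        os.remove(path)
```

Because nothing is left to a default, the text survives the round trip
unchanged whatever the local code page or line-ending convention is.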



--
Steven


