octave-bug-tracker

From: Michael Leitner
Subject: [Octave-bug-tracker] [bug #57107] regexp functions fail on ISO-8859 input
Date: Thu, 24 Oct 2019 02:58:43 -0400 (EDT)
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0

Follow-up Comment #5, bug #57107 (project octave):

Please don't do sniffing, and definitely not by default. First, think about
the performance: even if I only want to read the first few bytes of a file,
fopen alone would have to read in quite a large chunk, which I would never
use, just to detect the encoding. And how much would you read in, the whole
file? Furthermore, there is no reliable way to detect
the encoding. Yes, you could probably quite easily recognise Latin-script
languages written in UTF-8 or UTF-16, but for everything else you would also
need knowledge of the language, of which characters it uses and with what
frequency, in order to distinguish between non-Latin-script languages written
in two-byte encodings. And how would you distinguish for instance between
German or a Nordic language written in ISO 8859-1 and an Eastern European
language written in ISO 8859-2, again by the histogram of characters above
127? Further, it would break the principle of least surprise: what if my file
consists of a list of given names of a sample of people taken in England? If
fopen reads the first 1024 bytes to decide on the encoding, it will probably
fall back to its default, be it one of the ISO-8859 variants or UTF-8 (as
probably no byte will be above 127). However, later in the file an expatriate
"Jürgen" might well appear, which is then misread. That alone would not yet be
much of a problem, but the "Jürgen" could also appear within the first 1024
bytes, in which case it would be interpreted differently.
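
To make the two objections above concrete, here is a minimal Python sketch.
The function sniff_encoding is hypothetical: it is not anything Octave
implements or proposes, it merely mimics a sniffer that looks at the first
1024 bytes, then falls back to a default. The names and the filler data are
made up for illustration.

```python
# Hypothetical sniffer (an assumption for illustration, not Octave code):
# it inspects only the first `window` bytes of the data.
def sniff_encoding(data: bytes, window: int = 1024) -> str:
    head = data[:window]
    if all(b < 128 for b in head):
        return "ascii"  # no high bytes seen: fall back to a default
    try:
        head.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8 (or a sequence cut off by the window): guess Latin-1.
        return "iso-8859-1"


# The same list of names, stored as ISO 8859-1, with "Jürgen" first or last.
filler = "Oliver\n" * 300  # 2100 ASCII bytes, more than the window
early = ("J\u00fcrgen\n" + filler).encode("iso-8859-1")
late = (filler + "J\u00fcrgen\n").encode("iso-8859-1")

guess_early = sniff_encoding(early)  # high byte lies inside the window
guess_late = sniff_encoding(late)    # high byte lies beyond the window

# A single byte that is valid in both ISO 8859-1 and ISO 8859-2 but
# denotes different letters; neither decode raises an error.
raw = b"\xe8"
as_latin1 = raw.decode("iso-8859-1")  # "è"
as_latin2 = raw.decode("iso-8859-2")  # "č"
```

With this sketch, guess_early is "iso-8859-1" while guess_late is "ascii",
although both byte strings use the same encoding. The byte 0xE8 decodes
without error under both ISO 8859 variants, so nothing short of knowledge
about the language tells the two apart.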

I am a latecomer to this issue of making Octave encoding-aware. Actually, I
have always been perfectly happy with the previous situation -- Octave had no
idea of encodings, it read the bytes as they came in the file and fed them to
the terminal emulator, which took care of how they are displayed. The only
issue in this sense could have been that the number of bytes is not
necessarily equal to the number of displayed characters. But I do not see that
this would be a problem unless you do manual positioning of characters of a
fixed-width font, say in a plot -- the much more frequent problem of e.g. how
large a string buffer to allocate is a no-brainer.
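
As a small illustration of the byte count versus character count point, the
following Python sketch (not Octave code, and only an example) encodes the
same name in two encodings and compares the lengths.

```python
name = "J\u00fcrgen"  # "Jürgen", six characters

utf8_bytes = name.encode("utf-8")         # "ü" takes two bytes in UTF-8
latin1_bytes = name.encode("iso-8859-1")  # "ü" takes one byte in ISO 8859-1

n_chars = len(name)           # 6
n_utf8 = len(utf8_bytes)      # 7
n_latin1 = len(latin1_bytes)  # 6
```

A program that only passes the bytes through needs a buffer of n_utf8 or
n_latin1 bytes and never has to know which encoding it is dealing with.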

Could you please point me to a write-up of what is planned in this regard? Do
you propose to give the "t" and "b" in the mode string of fopen a meaning,
whereas today (on Linux) they are irrelevant (is this what Matlab does)? If
"b" then keeps the current behaviour, I could live with that; I would only
have to use it consistently, where today I choose between "t" and "b"
depending on whether the file will contain text or binary data.

    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?57107>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/



