[Octave-bug-tracker] [bug #57107] regexp functions fail on ISO-8859 inpu

From:	Andrew Janke
Subject:	[Octave-bug-tracker] [bug #57107] regexp functions fail on ISO-8859 input
Date:	Thu, 24 Oct 2019 08:18:14 -0400 (EDT)
User-agent:	Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:69.0) Gecko/20100101 Firefox/69.0

Follow-up Comment #7, bug #57107 (project octave):

> First, think about the performance: even if I would like to just read in the
> first few bytes of a file, the fopen alone would have to read in quite a large
> chunk in order to detect the encoding, which I would never use. And how much
> would you read in, the whole file?

There would be little or no performance effect. A modern OS does all file I/O
in blocks or pages: when you read one byte, what actually happens is that the
whole first block (typically 4K) is brought in from disk to cache and held
there until you read more. And then I/O is usually further buffered by the
language runtime layered on top. Sniffing would be done within that first
block or buffer.
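
As a rough sketch of what first-block sniffing could look like (Python used
purely for illustration; `sniff_encoding`, `BLOCK_SIZE` and the candidate list
are hypothetical names and choices, not anything in Octave or ICU4C):

```python
import codecs
import io

BLOCK_SIZE = 4096  # assumed block size; real block and page sizes vary


def sniff_encoding(stream, candidates=("utf-8", "iso-8859-1")):
    """Guess an encoding from the first block only, then rewind the stream."""
    start = stream.tell()
    head = stream.read(BLOCK_SIZE)
    stream.seek(start)  # rewind so the caller still sees every byte

    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"

    # Order matters: ISO-8859-1 accepts any byte sequence, so it can only
    # ever serve as the last-resort guess.
    for name in candidates:
        decoder = codecs.getincrementaldecoder(name)(errors="strict")
        try:
            # final=False tolerates a multi-byte character that happens to
            # be cut off at the end of the block.
            decoder.decode(head, final=False)
        except UnicodeDecodeError:
            continue
        return name
    return None


sample = io.BytesIO(b"caf\xe9 au lait")  # ISO-8859-1 bytes
print(sniff_encoding(sample))            # prints iso-8859-1
```

Nothing beyond the first block is read, and the stream position is restored,
so later reads behave as if no sniffing had happened.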

> And further, there is no reliable way to detect the encoding.

This is true. But there are some heuristics; ICU4C provides some which have an
okay reputation. Sniffing would only be a convenience for casual users; the
real answer to all of these scenarios is that you have to actually know and
specify the file encoding to get correct, reliable behavior.
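
To illustrate why heuristics can only ever guess: the same bytes are valid in
several single-byte encodings and mean something different in each. A minimal
Python sketch (the sample bytes are made up for the example):

```python
raw = b"caf\xe9"  # four bytes with no encoding label attached

as_latin1 = raw.decode("iso-8859-1")    # Western European reading
as_cyrillic = raw.decode("iso-8859-5")  # Cyrillic reading of the same bytes

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # A lone 0xE9 byte is not a complete UTF-8 sequence.
    utf8_ok = False

print(as_latin1 == as_cyrillic)  # prints False
print(utf8_ok)                   # prints False
```

Both single-byte decodings succeed without error, so only the caller can say
which one is right.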

But you've got a good point: sniffing introduces variability and
unpredictability into your code's behavior, and could well make Octave I/O
both harder to use and harder for maintainers to debug user issues with. It
could even introduce variability between different versions or builds of
Octave, if they were built with different versions of the library that
provides the sniffing algorithms.

> Actually, I have always been perfectly happy with the previous situation --
> octave had no idea of encodings, it read the bytes as they came in the file
> and fed them to the terminal emulator, which cared about how they are
> displayed.

That scenario is fine, as long as your input files are in the same encoding as
your terminal. (And you don't want to do pattern matching on non-basic-English
character classes, or do text processing on non-UTF-8 input data, like OP for
this bug report does.) Being encoding-aware allows you to work with
international data where your input files are not in your current locale's
encoding, or are in multiple encodings. That is useful if you're working with,
say, census data or energy data that comes from multiple countries or
continents, or a spreadsheet that your Japanese colleague sent you.
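
A small illustration of the pattern-matching point, with Python's `re` module
standing in for Octave's regexp functions (the two differ in detail, so this
shows only the general principle, not Octave's actual behavior):

```python
import re

raw = b"M\xfcller"  # "Müller" encoded as ISO-8859-1

# Byte-oriented matching: \w only knows ASCII word characters, so the
# name is split at the 0xFC byte.
print(re.findall(rb"\w+", raw))   # prints [b'M', b'ller']

# Encoding-aware matching: decode first, then \w knows that "ü" is a letter.
text = raw.decode("iso-8859-1")
print(re.findall(r"\w+", text))   # prints ['Müller']
```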

If we make a good choice with the default encoding selection - e.g. taking it
from your locale definition, assuming your locale is correctly configured -
your scenario will continue to work with no code changes.
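
Roughly, the selection rule could look like this sketch (Python for
illustration only; `resolve_encoding` is a hypothetical helper, not a proposed
Octave API):

```python
import codecs
import locale


def resolve_encoding(requested=None):
    """Use the caller's encoding if given, else fall back to the locale's."""
    name = requested or locale.getpreferredencoding(False)
    # codecs.lookup() validates the name and returns its canonical spelling.
    return codecs.lookup(name).name


print(resolve_encoding("UTF-8"))  # prints utf-8
print(resolve_encoding())         # depends on how your locale is configured
```

Code that never passes an encoding keeps getting the locale's, which is what
lets existing scripts run unchanged.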

> Do you propose to make the "t" and "b" in the mode string of fopen have a
> meaning, while today (on linux) they are irrelevant (is this what Matlab
> does)?

If we stay compatible with Matlab, the "b" and "t" modes would only have an
effect on Windows, where the "t" mode enables translation between Windows CRLF
line endings and Unix-style LF endings. "t" mode has no other effect, and "b"
mode would keep the current behavior. On Linux (and in portable code), they
would remain irrelevant.
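
The read side of that behavior can be modeled in a few lines (a Python sketch
of the rule as described above; `read_translated` and its `on_windows` flag
are hypothetical, and this is not how Octave or Matlab implement it):

```python
def read_translated(data: bytes, mode: str, on_windows: bool) -> bytes:
    """Model how "t" and "b" affect bytes read from a file."""
    if on_windows and "t" in mode:
        # Text mode on Windows: CRLF line endings become plain LF.
        return data.replace(b"\r\n", b"\n")
    # "b" mode, or any mode on Linux: bytes pass through untouched.
    return data


data = b"one\r\ntwo\r\n"
print(read_translated(data, "rt", on_windows=True))   # prints b'one\ntwo\n'
print(read_translated(data, "rb", on_windows=True))   # prints b'one\r\ntwo\r\n'
print(read_translated(data, "rt", on_windows=False))  # prints b'one\r\ntwo\r\n'
```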

> Please, can you point me to a write-up of what is planned in this regard?

The discussion has been over at https://savannah.gnu.org/bugs/index.php?55452
and on the octave-maintainers mailing list:
* https://lists.gnu.org/archive/html/octave-maintainers/2019-03/msg00019.html
* https://lists.gnu.org/archive/html/octave-maintainers/2018-04/msg00155.html
* https://lists.gnu.org/archive/html/octave-maintainers/2019-01/msg00311.html
* https://lists.gnu.org/archive/html/octave-maintainers/2018-05/msg00137.html

    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?57107>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/
