[Octave-bug-tracker] [bug #57107] regexp functions fail on ISO-8859 input

From:	Andrew Janke
Subject:	[Octave-bug-tracker] [bug #57107] regexp functions fail on ISO-8859 input
Date:	Wed, 23 Oct 2019 17:49:44 -0400 (EDT)
User-agent:	Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:69.0) Gecko/20100101 Firefox/69.0

Follow-up Comment #4, bug #57107 (project octave):

> There is at least one modern OS that still uses 8bit encodings by default:
Windows 10 and its predecessors. 

Good point. Windows is weird because it has both Unicode and legacy code page
APIs. And both Octave and Matlab are Unicode-enabled. I guess this gets into
the semantics of what the "default encoding" is. But you're right.

> But I now see that this bug is marked as affecting GNU/Linux.

My initial testing shows it affects Mac as well.

> Matlab's internal encoding is 16bit wide (maybe UCS-2).

Yep, it's UCS-2. (Though it also generally passes through UTF-16 surrogate
pair code units unmolested, so UTF-16 data will generally work too, as long as
you're not trying to do character counts.)
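
To illustrate the character-count caveat, here is a minimal sketch in Python
(chosen only as a neutral illustration; it is not Matlab or Octave code, and
the variable names are made up for this example). It shows that a code point
outside the Basic Multilingual Plane occupies two UTF-16 code units, so
counting 16-bit units over-counts characters:

```python
# U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs a surrogate pair.
text = "a\U0001F600"

# Encode as big-endian UTF-16 without a BOM and split into 16-bit code units.
raw = text.encode("utf-16-be")
code_units = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

num_code_points = len(text)       # Python counts code points
num_code_units = len(code_units)  # a UCS-2 style count sees 16-bit units instead

# Surrogate code units fall in the range 0xD800-0xDFFF.
is_surrogate = [0xD800 <= unit <= 0xDFFF for unit in code_units]

print(num_code_points, num_code_units, [hex(unit) for unit in code_units])
```

A UCS-2 style length here would report 3 "characters" for a 2-character
string, which is the kind of mismatch meant above.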

> Maybe it reads the non-UTF-8 bytes as is and they "happen" to map the
Unicode code points (for a western encoded file). 

Nope. Matlab's fopen() opens files with an "encoding" attribute (see
https://savannah.gnu.org/bugs/index.php?55452), and when you do text or
char-oriented I/O (depending on what read/write function you call, and what
you pass for the "precision" argument for low-level I/O functions), it
transcodes the input to UCS-2/UTF-16.

It just so happens that for ISO-8859-1 in particular, the byte values from
128 to 255 (the ones that are not valid UTF-8 on their own) map to the Unicode
code points with the same values, which in UTF-16 are represented by code
units with the same numeric values. So the transcoding operation there is a
no-op, except for bit width. But that won't work for Octave, because Octave's
internal encoding is UTF-8, where each of those code points takes two bytes.
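
The ISO-8859-1 coincidence can be checked directly. This is a small Python
sketch (again just an illustration, not Matlab or Octave code; all names are
invented for the example) that decodes every byte from 128 to 255 as
ISO-8859-1 and compares the result with its UTF-16 and UTF-8 forms:

```python
# Every byte value from 128 to 255, treated as ISO-8859-1 text.
high_bytes = bytes(range(128, 256))
decoded = high_bytes.decode("iso-8859-1")

# Each byte decodes to the code point with the same numeric value.
same_code_points = [ord(ch) for ch in decoded] == list(high_bytes)

# In UTF-16 each of those code points is a single code unit with that same value.
utf16 = decoded.encode("utf-16-le")
utf16_units = [int.from_bytes(utf16[i:i + 2], "little") for i in range(0, len(utf16), 2)]
same_utf16_units = utf16_units == list(high_bytes)

# In UTF-8 each of those code points takes two bytes, so the bytes must change.
utf8 = decoded.encode("utf-8")
utf8_length = len(utf8)

# The raw ISO-8859-1 bytes are not valid UTF-8 on their own.
try:
    high_bytes.decode("utf-8")
    raw_is_valid_utf8 = True
except UnicodeDecodeError:
    raw_is_valid_utf8 = False

print(same_code_points, same_utf16_units, utf8_length, raw_is_valid_utf8)
```

So widening to 16 bits is enough for a UTF-16 based interpreter, while a UTF-8
based one has to rewrite the bytes.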

> I am not sure whether we should do something similar and transcode from a
default 8bit encoding if we detect that a source contains invalid UTF-8.

I think Octave should do transcoding. I dunno about "detecting" that the
source contains invalid UTF-8, though, purely for Matlab compatibility: I
don't think Matlab sniffs the input contents to detect the encoding. But maybe
sniffing would be an advantage that's worth losing compatibility for? On
Matlab, to be portable and properly internationalized, you pretty much have to
explicitly force the encoding from your code when you do I/O. And that would
still work on Octave even if it sniffed in the default case.
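
For concreteness, here is a rough Python sketch of what "sniff in the default
case, but let an explicit encoding win" could look like. The function
`decode_text` and its parameters are hypothetical names made up for this
illustration; this is not how Octave or Matlab implement anything, and the
ISO-8859-1 fallback is just an assumed default:

```python
from typing import Optional, Tuple


def decode_text(data: bytes, encoding: Optional[str] = None,
                fallback: str = "iso-8859-1") -> Tuple[str, str]:
    """Return (text, encoding_used). Hypothetical helper for illustration only."""
    if encoding is not None:
        # Explicit encoding from the caller: no sniffing at all.
        return data.decode(encoding), encoding
    try:
        # Default case: accept the input as UTF-8 if it is valid UTF-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Otherwise transcode from the assumed default 8-bit encoding.
        return data.decode(fallback), fallback


latin1_bytes = "café".encode("iso-8859-1")  # b'caf\xe9', invalid as UTF-8
utf8_bytes = "café".encode("utf-8")         # b'caf\xc3\xa9'

sniffed_latin1 = decode_text(latin1_bytes)
sniffed_utf8 = decode_text(utf8_bytes)
forced = decode_text(latin1_bytes, encoding="iso-8859-1")
print(sniffed_latin1, sniffed_utf8, forced)
```

Code that forces the encoding gets the same answer with or without sniffing,
so portable Matlab-style code would not be affected by it.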

Diagnostic: the 4-argout version of fopen returns the encoding. (Not supported
in Octave. (Yet.))


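fid = fopen('foo.txt');
% Matlab only: the 4th output is the encoding the file was opened with.
[filename, permission, machinefmt, encoding] = fopen(fid);
fclose(fid);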


Thought: Since Matlab is so Windows-focused, I wonder if it just opens all
files as ISO-8859-x by default, regardless of OS?

    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?57107>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/
