Re: XeTeX encoding problem
From: Masamichi HOSODA
Subject: Re: XeTeX encoding problem
Date: Mon, 18 Jan 2016 23:12:56 +0900 (JST)
> If I understand correctly, you are changing the category codes of the
> Unicode characters when writing out to an auxiliary file, but only for
> those Unicode characters that are defined. This leads the Unicode
> character to be written out as a UTF-8 sequence. For the regular
> output, the definitions given with \DeclareUnicodeCharacter are used
> instead of trying to get a glyph for the Unicode character from a
> font. If there's no definition given, then the character must be in
> the font.
>
> I don't know why you did it this way; maybe you could explain? Or if
> my explanation above is incorrect, could you correct it?
>
> There is a potential problem with changing the category codes of the
> Unicode characters, in that any tokens that have already been read in
> won't be affected, depending on the implementation. For example, with
>
> @chapter é,
>
> whether this works depends on whether the argument "é" was read before
> or after the category codes changed. It would be less fragile to keep
> the characters as active but make them expand to a token with category
> code "other".
>
> Using the character definitions built in to texinfo.tex with
> \DeclareUnicodeCharacter may give less good results than using the
> glyphs from a proper Unicode font.
Thank you for your comments.
I've updated the patch.

I want the following:
- UTF-8 auxiliary files.
- Handling of Unicode filenames (image files and include files).
- Handling of Unicode PDF bookmark strings.

For this purpose, my previous patch changed the catcodes.
The patch attached to this mail uses a different method:
it redefines the replacement macros for the Unicode characters instead.
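
The idea of keeping a character active while making it expand to a token of category code "other" can be sketched in plain TeX as follows. This is only an illustration, not part of the patch: it assumes a Unicode-capable engine (xetex or luatex with the plain format) and a UTF-8 source file, and the names \makeactivethru and \thruTmp are hypothetical.

```tex
% Minimal sketch; run with "xetex" or "luatex" on a UTF-8 encoded file.
% \makeactivethru and \thruTmp are hypothetical names for this example.
\def\makeactivethru#1{%
  % Step 1: build a catcode-12 ("other") token for code point #1 (hex).
  % \uppercase maps "." to that code point but keeps its category code.
  \begingroup
    \uccode`\.="#1\relax
    \uppercase{\endgroup \def\thruTmp{.}}%
  % Step 2: make the character active for all later input.
  \catcode"#1=\active
  % Step 3: define the active character to expand to the "other" token.
  \begingroup
    \uccode`\~="#1\relax
    \uppercase{\endgroup \edef~}{\thruTmp}%
}

\makeactivethru{00E9}

% The active character expands to an ordinary character token here,
% so it is written out as a character rather than as a macro name.
\immediate\write16{é}
\bye
```

Because the character stays active, a later redefinition also affects arguments that were tokenized earlier, which is the robustness point raised in the quoted message.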
--- texinfo.tex.org 2016-01-15 07:41:42.861186100 +0900
+++ texinfo.tex 2016-01-18 23:04:55.714317700 +0900
@@ -9428,45 +9428,18 @@
\global\righthyphenmin = #3\relax
}
-% Get input by bytes instead of by UTF-8 codepoints for XeTeX and LuaTeX,
-% otherwise the encoding support is completely broken.
-\ifx\XeTeXrevision\thisisundefined
-\else
-\XeTeXdefaultencoding "bytes" % For subsequent files to be read
-\XeTeXinputencoding "bytes" % Effective in texinfo.tex only
-% Unfortunately, there seems to be no corresponding XeTeX command for
-% output encoding. This is a problem for auxiliary index and TOC files.
-% The only solution would be perhaps to write out @U{...} sequences in
-% place of UTF-8 characters.
-\fi
+\newif\iftxinativeunicodecapable
-\ifx\luatexversion\thisisundefined
+\ifx\XeTeXrevision\thisisundefined
+ \ifx\luatexversion\thisisundefined
+ \txinativeunicodecapablefalse
+ \else
+ \txinativeunicodecapabletrue
+ \fi
\else
-\directlua{
-local utf8_char, byte, gsub = unicode.utf8.char, string.byte, string.gsub
-local function convert_char (char)
- return utf8_char(byte(char))
-end
-
-local function convert_line (line)
- return gsub(line, ".", convert_char)
-end
-
-callback.register("process_input_buffer", convert_line)
-
-local function convert_line_out (line)
- local line_out = ""
- for c in string.utfvalues(line) do
- line_out = line_out .. string.char(c)
- end
- return line_out
-end
-
-callback.register("process_output_buffer", convert_line_out)
-}
+ \txinativeunicodecapabletrue
\fi
-
% Helpers for encodings.
% Set the catcode of characters 128 through 255 to the specified number.
%
@@ -9491,13 +9464,6 @@
%
\def\documentencoding{\parseargusing\filenamecatcodes\documentencodingzzz}
\def\documentencodingzzz#1{%
- % Get input by bytes instead of by UTF-8 codepoints for XeTeX,
- % otherwise the encoding support is completely broken.
- % This settings is for the document root file.
- \ifx\XeTeXrevision\thisisundefined
- \else
- \XeTeXinputencoding "bytes"
- \fi
%
% Encoding being declared for the document.
\def\declaredencoding{\csname #1.enc\endcsname}%
@@ -9526,10 +9492,12 @@
\latninechardefs
%
\else \ifx \declaredencoding \utfeight
- \setnonasciicharscatcode\active
- % since we already invoked \utfeightchardefs at the top level
- % (below), do not re-invoke it, then our check for duplicated
- % definitions triggers. Making non-ascii chars active is enough.
+ \iftxinativeunicodecapable
+ \nativeunicodechardefs
+ \else
+ \setnonasciicharscatcode\active
+ \utfeightchardefs
+ \fi
%
\else
\message{Ignoring unknown document encoding: #1.}%
@@ -9859,7 +9827,7 @@
\catcode`\;=12
\catcode`\!=12
\catcode`\~=13
- \gdef\DeclareUnicodeCharacter#1#2{%
+ \gdef\DeclareUnicodeCharacterUTFviii#1#2{%
\countUTFz = "#1\relax
%\wlog{\space\space defining Unicode char U+#1 (decimal \the\countUTFz)}%
\begingroup
@@ -9917,6 +9885,23 @@
\uppercase{\gdef\UTFviiiTmp{#2#3#4}}}
\endgroup
+\def\DeclareUnicodeCharacterNative#1#2{%
+ \catcode"#1=\active
+ \begingroup
+ \uccode`\~="#1\relax
+ \uppercase{\gdef~}{#2}%
+ \endgroup}
+
+\def\DeclareUnicodeCharacterNativeThru#1#2{%
+ \catcode"#1=\active
+ \begingroup
+ \uccode`\.="#1\relax
+ \uppercase{\endgroup \def\UTFNativeTmp{.}}%
+ \begingroup
+ \uccode`\~="#1\relax
+ \uppercase{\endgroup \edef~}{\UTFNativeTmp}%
+}
+
% https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_M
% U+0000..U+007F = https://en.wikipedia.org/wiki/Basic_Latin_(Unicode_block)
% U+0080..U+00FF = https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block)
@@ -9931,7 +9916,7 @@
% We won't be doing that here in this simple file. But we can try to at
% least make most of the characters not bomb out.
%
-\def\utfeightchardefs{%
+\def\unicodechardefs{%
\DeclareUnicodeCharacter{00A0}{\tie}
\DeclareUnicodeCharacter{00A1}{\exclamdown}
\DeclareUnicodeCharacter{00A2}{{\tcfont \char162}}% 0242=cent
@@ -10601,7 +10586,26 @@
\global\mathchardef\checkmark="1370 % actually the square root sign
\DeclareUnicodeCharacter{2713}{\ensuremath\checkmark}
-}% end of \utfeightchardefs
+}% end of \unicodechardefs
+
+\def\utfeightchardefs{
+ \let\DeclareUnicodeCharacter\DeclareUnicodeCharacterUTFviii
+ \unicodechardefs
+}
+
+\def\nativeunicodechardefs{
+ \iftxinativeunicodecapable
+ \let\DeclareUnicodeCharacter\DeclareUnicodeCharacterNative
+ \unicodechardefs
+ \fi
+}
+
+\def\nativeunicodechardefsthru{
+ \iftxinativeunicodecapable
+ \let\DeclareUnicodeCharacter\DeclareUnicodeCharacterNativeThru
+ \unicodechardefs
+ \fi
+}
% US-ASCII character definitions.
\def\asciichardefs{% nothing need be done
@@ -10610,6 +10614,9 @@
% Latin1 (ISO-8859-1) character definitions.
\def\nonasciistringdefs{%
+ \iftxinativeunicodecapable
+ \nativeunicodechardefsthru
+ \else
\setnonasciicharscatcode\active
\def\defstringchar##1{\def##1{\string##1}}%
%
@@ -10652,13 +10659,9 @@
\defstringchar^^f4\defstringchar^^f5\defstringchar^^f6\defstringchar^^f7%
\defstringchar^^f8\defstringchar^^f9\defstringchar^^fa\defstringchar^^fb%
\defstringchar^^fc\defstringchar^^fd\defstringchar^^fe\defstringchar^^ff%
+ \fi
}
-
-% define all the unicode characters we know about, for the sake of @U.
-\utfeightchardefs
-
-
% Make non-ASCII characters printable again for compatibility with
% existing Texinfo documents that may use them, even without declaring a
% document encoding.