bug-gnu-utils

Re: column numbers for non-ASCII characters in error messages


From: Bruno Haible
Subject: Re: column numbers for non-ASCII characters in error messages
Date: Sun, 19 Dec 2010 00:59:16 +0100
User-agent: KMail/1.9.9

Hi Ben,

> The GNU coding standards say how to calculate column numbers for
> ASCII characters:
> 
>     Line numbers should start from 1 at the beginning of the file, and
>     column numbers should start from 1 at the beginning of the line.  (Both
>     of these conventions are chosen for compatibility.)  Calculate column
>     numbers assuming that space and all ASCII printing characters have
>     equal width, and assuming tab stops every 8 columns.
> 
> They don't say how to calculate them for non-ASCII characters.

The intent of this part of the coding standards is that users can find the
denoted position by moving the cursor around, in an editor that displays
column numbers (such as 'emacs', 'vi', and 'kate'), and that the Emacs
'compile' mode can automatically highlight the denoted column.

Emacs (with M-x column-number-mode) displays column numbers that increase
by 1 for single-width characters (e.g. Cyrillic or Greek letters), by 2
for Hanzi characters, and by 0 for zero-width combining characters.
Its column numbers start at 0, not 1.

vi displays two column numbers when Ctrl-G is pressed: first the byte
count, then the screen column, measured the same way as in Emacs. Both
start at 1.

kate displays a character count, starting at 1.

The Emacs 'compile' mode, implemented in emacs/lisp/progmodes/compile.el,
by default uses screen columns, but can also be customized to use
character counts; see variable 'compilation-error-screen-columns'.

> So far, I've thought of the following ways:
> 
>         * Byte offset from beginning of line.

This notion would require users to be aware of the encoding: The
same string has different byte counts in UTF-8 and, say, ISO-8859-15.
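
To illustrate, here is a small Python sketch (the helper name byte_column
and the sample line are made up for this example) that reports the 1-based
byte column of the same character position under two encodings:

```python
def byte_column(line, char_index, encoding):
    """Return the 1-based byte column of line[char_index] in the given encoding."""
    prefix = line[:char_index]            # text before the reported position
    return len(prefix.encode(encoding)) + 1


line = "naïve café x"
pos = line.index("x")                     # character index 11

# "ï" and "é" take 2 bytes each in UTF-8 but 1 byte each in ISO-8859-15.
print(byte_column(line, pos, "utf-8"))        # 14
print(byte_column(line, pos, "iso-8859-15"))  # 12
```

The same position in the same line gets two different column numbers, so a
reader of the error message would have to know which encoding the tool
assumed.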

>         * Display width from beginning of line, with double-wide
>           characters counting as two positions and combining
>           characters (e.g. combining accents) counting as zero
>           positions.

Yes, this is the definition that most text editors, including the
Emacs M-x compile mode, use.
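
A minimal sketch of this definition in Python, using only the standard
unicodedata module. It is an approximation, not what Emacs or wcwidth()
actually do: it treats East Asian Wide and Fullwidth characters as 2
columns, nonspacing and enclosing marks as 0, everything else as 1, and
assumes tab stops every 8 columns. The function names are made up for
this example.

```python
import unicodedata


def char_width(ch):
    """Approximate number of screen columns occupied by one character."""
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0                      # combining marks take no column
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2                      # double-wide (e.g. Hanzi)
    return 1                          # everything else: single-width


def display_column(line, char_index, tab_width=8):
    """Return the 1-based screen column of line[char_index]."""
    col = 0                           # columns consumed before the position
    for ch in line[:char_index]:
        if ch == "\t":
            col = (col // tab_width + 1) * tab_width   # jump to next tab stop
        else:
            col += char_width(ch)
    return col + 1


print(display_column("漢字x", 2))      # 5: two double-wide characters precede x
print(display_column("e\u0301x", 2))  # 2: the combining accent adds no column
print(display_column("\tx", 1))       # 9: a tab advances to the next tab stop
```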

>         * Grapheme clusters (user-visible characters) from
>           beginning of line, as specified in Unicode Standard
>           Annex #29 "Unicode Text Segmentation".

Grapheme clusters are a notion used for rendering of text. In many
situations this definition yields the same value as the previous one,
but it is in general harder to compute (it requires script-dependent
processing). The difference matters only for complex scripts such as
the Indic ones (Devanagari etc.), and these scripts most often don't use
fixed-width fonts.

For this reason, the second definition is the one that is normally
agreed upon. Note, however, that it is ambiguous: For "ambiguous width"
Unicode characters, the width may depend on the terminal emulator or
on the locale. But this is not a big problem in practice, because such
characters occur rarely.
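
To make the ambiguity concrete, here is a small Python sketch. The
"ambiguous width" characters are those whose East Asian Width property is
"A"; the ambiguous_wide flag below is a hypothetical stand-in for whatever
the terminal emulator or locale decides, and the sketch ignores combining
characters and tabs for brevity.

```python
import unicodedata


def line_width(text, ambiguous_wide=False):
    """Sum of column widths, with a switch for ambiguous-width characters."""
    total = 0
    for ch in text:
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            total += 2
        elif eaw == "A":
            # Hypothetical policy switch: 2 columns in a CJK-style
            # environment, 1 column otherwise.
            total += 2 if ambiguous_wide else 1
        else:
            total += 1
    return total


print(unicodedata.east_asian_width("§"))       # A
print(line_width("a§"))                        # 2
print(line_width("a§", ambiguous_wide=True))   # 3
```

A column number computed after such a character can therefore differ by
one between two environments, even with the same definition of width.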

This notion of width, measured by screen columns, is implemented by
  - the POSIX function wcwidth(),
  - the gnulib function mbswidth() (applicable to multibyte strings),
  - the gnulib or libunistring function u8_strwidth() (applicable to UTF-8
    strings).

> (If there is a better place to ask this question, let me know.)

We don't have a general internationalization list here. For 10 years
we had <address@hidden>, but this list is dead now.

Bruno


