[groff] 16/23: Support CJK fonts encoded in UTF-16 (2/6).


From:	G. Branden Robinson
Subject:	[groff] 16/23: Support CJK fonts encoded in UTF-16 (2/6).
Date:	Thu, 21 Nov 2024 14:47:49 -0500 (EST)

gbranden pushed a commit to branch master
in repository groff.

commit 6692471f0a31f00b052cec9b223ed963a130edc1
Author: TANAKA Takuji <ttk@t-lab.opal.ne.jp>
AuthorDate: Fri Dec 29 13:56:37 2023 +0000

    Support CJK fonts encoded in UTF-16 (2/6).
    
    * src/include/font.h (class font): Declare private member variable
      `wch`, a pointer to an existing list type `font_char_metric`.  Declare
      private member function `get_font_wchar_metric()` to access it.
    
    * src/libs/libgroff/font.cpp (struct font_char_metric): Add members
      `next` (a pointer to the struct's own type) and `end_code` of type
      `int`.
    
      (glyph_to_ucs_codepoint): New function returns UCS code point from a
      (non-composite) `glyph` object, or -1 if invalid.
    
      (font::font): Constructor initializes `wch` member variable to null
      pointer.
    
      (font::~font): Destructor frees storage allocated in `font::load()`
      for `special_device_coding` member of `wcp` struct, and that of `wcp`
      itself.
    
      (font::contains): If `glyph_to_ucs_codepoint()` returns a valid value
      for the glyph, populate its wide character metrics and return true.
    
      (font::get_font_wchar_metric): New function obtains font metrics of
      input character by Unicode code point.
    
      (font::get_width, font::get_height, font::get_depth)
      (font::get_italic_correction, font::get_left_italic_correction)
      (font::get_subscript_correction, font::get_character_type)
      (font::get_code, font::get_special_device_encoding): If
      `glyph_to_ucs_codepoint()` returns a valid value for the glyph,
      populate its wide character metrics and return the appropriate
      parameter based on them.
    
      (font::get_width): Add conditional guard when computing width for a
      glyph from a "Unicode font"; use the computation only if the device
      description file ("DESC") didn't declare "unscaled_charwidths".
    
      (font::load): Recognize new directive in font description files:
      "charset-range", which works like the existing "charset" directive
      except that the glyph descriptions use a `name` of the form
      "uFFFF..uFFFF" (where "FFFF" is a hexadecimal digit sequence), and
      apply the metrics identically to all glyphs in the designated range.
    
      (font::load): When processing glyph descriptions in "charset" section
      and the device has declared the "unicode" directive, stop scaling the
      width of the glyph by what `wcwidth()` returns for it.  (Does this fix
      Savannah #44018?)
---
 ChangeLog                  |  45 +++++++++++
 src/include/font.h         |   4 +
 src/libs/libgroff/font.cpp | 194 ++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 233 insertions(+), 10 deletions(-)
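
The "charset-range" handling and the lookup performed by
`font::get_font_wchar_metric()` can be pictured with the reduced sketch
below.  It is an illustration under stated assumptions, not the code
from this commit: `range_metric`, `add_range()`, `find_range()`, and
`free_ranges()` are hypothetical names, only the width is stored, and
the backwards-range check (left as a TODO in the patch) is filled in
here as one possible policy.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for groff's `font_char_metric`; only the
// members needed for a range lookup are kept.
struct range_metric {
  int code;            // first code point of the range
  int end_code;        // last code point of the range (inclusive)
  int width;
  range_metric *next;  // singly linked list, newest entry first
};

// Parse a name such as "u4E00..u9FFF" and prepend a node to `head`.
// Returns false if the name is not a well-formed, forward range.
static bool add_range(range_metric *&head, const char *name, int width)
{
  assert(name != nullptr);
  unsigned int start = 0, end = 0;  // "%X" expects `unsigned int *`
  if (std::sscanf(name, "u%X..u%X", &start, &end) != 2)
    return false;
  if (end < start)  // assumption: reject backwards ranges outright
    return false;
  head = new range_metric{static_cast<int>(start),
                          static_cast<int>(end), width, head};
  return true;
}

// Linear search over the list, in the manner of
// `font::get_font_wchar_metric()`.
static const range_metric *find_range(const range_metric *head, int uc)
{
  for (const range_metric *p = head; p != nullptr; p = p->next)
    if (p->code <= uc && uc <= p->end_code)
      return p;
  return nullptr;
}

// Release every node, as the destructor must do for the real list.
static void free_ranges(range_metric *head)
{
  while (head != nullptr) {
    range_metric *next = head->next;
    delete head;
    head = next;
  }
}
```

Because new ranges are prepended, a later range in the font description
file shadows an earlier overlapping one in this sketch; the lookup cost
grows with the number of ranges, which is small for typical CJK blocks.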

diff --git a/ChangeLog b/ChangeLog
index 7a7da1f62..cb309aead 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,48 @@
+2024-11-20  TANAKA Takuji <ttk@t-lab.opal.ne.jp>
+
+       Support CJK fonts encoded in UTF-16 (2/6).
+
+       * src/include/font.h (class font): Declare private member
+       variable `wch`, a pointer to an existing list type
+       `font_char_metric`.  Declare private member function
+       `get_font_wchar_metric()` to access it.
+       * src/libs/libgroff/font.cpp (struct font_char_metric): Add
+       members `next` (a pointer to the struct's own type) and
+       `end_code` of type `int`.
+       (glyph_to_ucs_codepoint): New function returns UCS code point
+       from a (non-composite) `glyph` object, or -1 if invalid.
+       (font::font): Constructor initializes `wch` member variable to
+       null pointer.
+       (font::~font): Destructor frees storage allocated in
+       `font::load()` for `special_device_coding` member of `wcp`
+       struct, and that of `wcp` itself.
+       (font::contains): If `glyph_to_ucs_codepoint()` returns a valid
+       value for the glyph, populate its wide character metrics and
+       return true.
+       (font::get_font_wchar_metric): New function obtains font metrics
+       of input character by Unicode code point.
+       (font::get_width, font::get_height, font::get_depth)
+       (font::get_italic_correction, font::get_left_italic_correction)
+       (font::get_subscript_correction, font::get_character_type)
+       (font::get_code, font::get_special_device_encoding): If
+       `glyph_to_ucs_codepoint()` returns a valid value for the glyph,
+       populate its wide character metrics and return the appropriate
+       parameter based on them.
+       (font::get_width): Add conditional guard when computing width
+       for a glyph from a "Unicode font"; use the computation only if
+       the device description file ("DESC") didn't declare
+       "unscaled_charwidths".
+       (font::load): Recognize new directive in font description files:
+       "charset-range", which works like the existing "charset"
+       directive except that the glyph descriptions use a `name` of the
+       form "uFFFF..uFFFF" (where "FFFF" is a hexadecimal digit
+       sequence), and apply the metrics identically to all glyphs in
+       the designated range.
+       (font::load): When processing glyph descriptions in "charset"
+       section and the device has declared the "unicode" directive,
+       stop scaling the width of the glyph by what `wcwidth()` returns
+       for it.  (Does this fix Savannah #44018?)
+
 2024-11-20  TANAKA Takuji <ttk@t-lab.opal.ne.jp>
 
        Support CJK fonts encoded in UTF-16 (1/6).
diff --git a/src/include/font.h b/src/include/font.h
index 9742a383a..e2537ef12 100644
--- a/src/include/font.h
+++ b/src/include/font.h
@@ -295,6 +295,7 @@ private:
                        // font (if !is_unicode) or for just some characters
                        // (if is_unicode).  The indices of this array are
                        // font-specific, found as values in ch_index[].
+  font_char_metric *wch;// Metrics for wide characters.
   int ch_used;
   int ch_size;
   font_widths_cache *widths_cache;     // A cache of scaled character
@@ -334,6 +335,9 @@ private:
                                           const char *,        // file
                                           int);                // lineno
 
+  // Get font metric for wide characters indexed by Unicode code point.
+  font_char_metric *get_font_wchar_metric(int);
+
 protected:
   font(const char *);  // Initialize a font with the given name.
 
diff --git a/src/libs/libgroff/font.cpp b/src/libs/libgroff/font.cpp
index 4ec4f19db..27c213209 100644
--- a/src/libs/libgroff/font.cpp
+++ b/src/libs/libgroff/font.cpp
@@ -1,4 +1,4 @@
-/* Copyright (C) 1989-2021 Free Software Foundation, Inc.
+/* Copyright (C) 1989-2024 Free Software Foundation, Inc.
      Written by James Clark (jjc@jclark.com)
 
 This file is part of groff.
@@ -47,6 +47,8 @@ struct font_char_metric {
   int italic_correction;
   int subscript_correction;
   char *special_device_coding;
+  struct font_char_metric *next;
+  int end_code;
 };
 
 struct font_kern_list {
@@ -163,6 +165,18 @@ void text_file::fatal(const char *format,
     fatal_with_file_and_line(path, lineno, format, arg1, arg2, arg3);
 }
 
+static int glyph_to_ucs_codepoint(glyph *g)
+{
+  const char *nm = glyph_to_name(g);
+  if (nm != 0 /* nullptr */) {
+    if (valid_unicode_code_sequence(nm) && (strchr(nm, '_') == 0)) {
+      char *ignore;
+      return static_cast<int>(strtol(nm + 1, &ignore, 16));
+    }
+  }
+  return -1;
+}
+
 int glyph_to_unicode(glyph *g)
 {
   const char *nm = glyph_to_name(g);
@@ -212,7 +226,7 @@ font::font(const char *s) : ligatures(0),
   kern_hash_table(0 /* nullptr */),
   space_width(0), special(false), internalname(0 /* nullptr */),
   slant(0.0), zoom(0), ch_index(0 /* nullptr */), nindices(0),
-  ch(0 /* nullptr */), ch_used(0), ch_size(0),
+  ch(0 /* nullptr */), wch(0 /* nullptr */), ch_used(0), ch_size(0),
   widths_cache(0 /* nullptr */)
 {
   name = new char[strlen(s) + 1];
@@ -244,6 +258,13 @@ font::~font()
     widths_cache = widths_cache->next;
     delete tem;
   }
+  struct font_char_metric *wcp, *nwcp;
+  for (wcp = wch; wcp != 0 /* nullptr */; wcp = nwcp) {
+    nwcp = wcp->next;
+    if (wcp->special_device_coding)
+      delete[] wcp->special_device_coding;
+    delete wcp;
+  }
 }
 
 static int scale_round(int n, int x, int y)
@@ -326,6 +347,12 @@ bool font::contains(glyph *g)
   // Explicitly enumerated glyph?
   if (idx < nindices && ch_index[idx] >= 0)
     return true;
+  int uc = glyph_to_ucs_codepoint(g);
+  if (uc > 0) {
+    font_char_metric *wcp = get_font_wchar_metric(uc);
+    if (wcp != 0 /* nullptr */)
+      return true;
+  }
   if (is_unicode) {
     // Unicode font
     // ASCII or Unicode character, or groff glyph name that maps to Unicode?
@@ -357,6 +384,17 @@ font_widths_cache::~font_widths_cache()
   delete[] width;
 }
 
+struct font_char_metric *font::get_font_wchar_metric(int uc)
+{
+  struct font_char_metric *wcp;
+  for (wcp = wch; wcp != 0 /* nullptr */; wcp = wcp->next) {
+    if (wcp->code <= uc && uc <= wcp->end_code) {
+      return wcp;
+    }
+  }
+  return 0 /* nullptr */;
+}
+
 int font::get_width(glyph *g, int point_size)
 {
   int idx = glyph_to_index(g);
@@ -371,6 +409,13 @@ int font::get_width(glyph *g, int point_size)
     else
       real_size = int(point_size * double(zoom) / 1000.0 + .5);
   }
+  int uc = glyph_to_ucs_codepoint(g);
+  font_char_metric *wcp = 0 /* nullptr */;
+  if (uc > 0)
+    wcp = get_font_wchar_metric(uc);
+  if (wcp != 0 && !(idx < nindices && ch_index[idx] >= 0)) {
+    return scale(wcp->width, point_size);
+  }
   if (idx < nindices && ch_index[idx] >= 0) {
     // Explicitly enumerated glyph
     int i = ch_index[idx];
@@ -403,7 +448,7 @@ int font::get_width(glyph *g, int point_size)
     // Unicode font
     int width = 24; // XXX: Add a request to override this.
     int w = wcwidth(get_code(g));
-    if (w > 1)
+    if (w > 1 && !font::use_unscaled_charwidths)
       width *= w;
     if (real_size == unitwidth || font::use_unscaled_charwidths)
       return width;
@@ -422,6 +467,13 @@ int font::get_height(glyph *g, int point_size)
     // Explicitly enumerated glyph
     return scale(ch[ch_index[idx]].height, point_size);
   }
+  int uc = glyph_to_ucs_codepoint(g);
+  font_char_metric *wcp = 0 /* nullptr */;
+  if (uc > 0)
+    wcp = get_font_wchar_metric(uc);
+  if (wcp != 0 /* nullptr */) {
+    return scale(wcp->height, point_size);
+  }
   if (is_unicode) {
     // Unicode font
     return 0;
@@ -438,6 +490,13 @@ int font::get_depth(glyph *g, int point_size)
     // Explicitly enumerated glyph
     return scale(ch[ch_index[idx]].depth, point_size);
   }
+  int uc = glyph_to_ucs_codepoint(g);
+  font_char_metric *wcp = 0 /* nullptr */;
+  if (uc > 0)
+    wcp = get_font_wchar_metric(uc);
+  if (wcp != 0 /* nullptr */) {
+    return scale(wcp->depth, point_size);
+  }
   if (is_unicode) {
     // Unicode font
     return 0;
@@ -454,6 +513,13 @@ int font::get_italic_correction(glyph *g, int point_size)
     // Explicitly enumerated glyph
     return scale(ch[ch_index[idx]].italic_correction, point_size);
   }
+  int uc = glyph_to_ucs_codepoint(g);
+  font_char_metric *wcp = 0 /* nullptr */;
+  if (uc > 0)
+    wcp = get_font_wchar_metric(uc);
+  if (wcp != 0 /* nullptr */) {
+    return scale(wcp->italic_correction, point_size);
+  }
   if (is_unicode) {
     // Unicode font
     return 0;
@@ -465,11 +531,18 @@ int font::get_italic_correction(glyph *g, int point_size)
 int font::get_left_italic_correction(glyph *g, int point_size)
 {
   int idx = glyph_to_index(g);
-  assert(idx >= 0);
+  assert(idx >= 0 /* nullptr */);
   if (idx < nindices && ch_index[idx] >= 0) {
     // Explicitly enumerated glyph
     return scale(ch[ch_index[idx]].pre_math_space, point_size);
   }
+  int uc = glyph_to_ucs_codepoint(g);
+  font_char_metric *wcp = 0 /* nullptr */;
+  if (uc > 0 )
+    wcp = get_font_wchar_metric(uc);
+  if (wcp != 0 /* nullptr */) {
+    return scale(wcp->pre_math_space, point_size);
+  }
   if (is_unicode) {
     // Unicode font
     return 0;
@@ -486,6 +559,13 @@ int font::get_subscript_correction(glyph *g, int point_size)
     // Explicitly enumerated glyph
     return scale(ch[ch_index[idx]].subscript_correction, point_size);
   }
+  int uc = glyph_to_ucs_codepoint(g);
+  font_char_metric *wcp = 0 /* nullptr */;
+  if (uc > 0)
+    wcp = get_font_wchar_metric(uc);
+  if (wcp != 0 /* nullptr */) {
+    return scale(wcp->subscript_correction, point_size);
+  }
   if (is_unicode) {
     // Unicode font
     return 0;
@@ -560,6 +640,13 @@ int font::get_character_type(glyph *g)
     // Explicitly enumerated glyph
     return ch[ch_index[idx]].type;
   }
+  int uc = glyph_to_ucs_codepoint(g);
+  font_char_metric *wcp = 0 /* nullptr */;
+  if (uc > 0)
+    wcp = get_font_wchar_metric(uc);
+  if (wcp != 0 /* nullptr */) {
+    return wcp->type;
+  }
   if (is_unicode) {
     // Unicode font
     return 0;
@@ -576,6 +663,13 @@ int font::get_code(glyph *g)
     // Explicitly enumerated glyph
     return ch[ch_index[idx]].code;
   }
+  int uc = glyph_to_ucs_codepoint(g);
+  font_char_metric *wcp = 0 /* nullptr */;
+  if (uc > 0)
+    wcp = get_font_wchar_metric(uc);
+  if (wcp != 0 /* nullptr */) {
+    return uc;
+  }
   if (is_unicode) {
     // Unicode font
     // ASCII or Unicode character, or groff glyph name that maps to Unicode?
@@ -610,6 +704,12 @@ const char *font::get_special_device_encoding(glyph *g)
     // Explicitly enumerated glyph
     return ch[ch_index[idx]].special_device_coding;
   }
+  int uc = glyph_to_ucs_codepoint(g);
+  font_char_metric *wcp = 0 /* nullptr */;
+  if (uc > 0)
+    wcp = get_font_wchar_metric(uc);
+  if (wcp != 0 /* nullptr */)
+    return wcp->special_device_coding;
   if (is_unicode) {
     // Unicode font
     return 0;
@@ -877,7 +977,8 @@ bool font::load(bool load_header_only)
     else if (strcmp(p, "special") == 0) {
       special = true;
     }
-    else if (strcmp(p, "kernpairs") != 0 && strcmp(p, "charset") != 0) {
+    else if (strcmp(p, "kernpairs") != 0 && strcmp(p, "charset") != 0 &&
+             strcmp(p, "charset-range") != 0) {
       char *directive = p;
       p = strtok(0 /* nullptr */, "\n");
       handle_unknown_font_command(directive, trim_arg(p), t.path,
@@ -923,6 +1024,84 @@ bool font::load(bool load_header_only)
        add_kern(g1, g2, n);
       }
     }
+    // TODO: Rename this directive to "ranged-charset".
+    else if (strcmp(directive, "charset-range") == 0) {
+      if (load_header_only)
+       return true;
+      saw_charset_directive = true;
+      bool had_range = false;
+      for (;;) {
+       if (!t.next_line()) {
+         directive = 0 /* nullptr */;
+         break;
+       }
+       char *nm = strtok(t.buf, WS);
+       assert(nm != 0 /* nullptr */);
+       p = strtok(0 /* nullptr */, WS);
+       if (0 /* nullptr */ == p) {
+         directive = nm;
+         break;
+       }
+       int start_code = 0;
+       int end_code = 0;
+       int nrange = sscanf(nm, "u%X..u%X", &start_code, &end_code);
+       // TODO: Check for backwards range: end_code < start_code.
+       if (2 == nrange) {
+         had_range = true;
+         font_char_metric *wcp = new font_char_metric;
+         wcp->code = start_code;
+         wcp->end_code = end_code;
+         wcp->height = 0;
+         wcp->depth = 0;
+         wcp->pre_math_space = 0;
+         wcp->italic_correction = 0;
+         wcp->subscript_correction = 0;
+         int nparms = sscanf(p, "%d,%d,%d,%d,%d,%d",
+                             &wcp->width, &wcp->height, &wcp->depth,
+                             &wcp->italic_correction,
+                             &wcp->pre_math_space,
+                             &wcp->subscript_correction);
+         if (nparms < 1) {
+           t.error("missing or invalid width for character range '%1'",
+                   nm);
+           return false;
+         }
+         p = strtok(0 /* nullptr */, WS);
+         if (0 /* nullptr */ == p) {
+           t.error("missing character type for '%1'", nm);
+           return false;
+         }
+         int type;
+         if (sscanf(p, "%d", &type) != 1) {
+           t.error("invalid character type for '%1'", nm);
+           return false;
+         }
+         if ((type < 0) || (type > 255)) {
+           t.error("character type '%1' out of range for '%2'", type,
+                   nm);
+           return false;
+         }
+         wcp->type = type;
+
+         p = strtok(0 /* nullptr */, WS);
+         if ((0 /* nullptr */ == p) || (strcmp(p, "--") == 0)) {
+           wcp->special_device_coding = 0 /* nullptr */;
+         }
+         else {
+           wcp->special_device_coding = new char[strlen(p) + 1];
+           strcpy(wcp->special_device_coding, p);
+         }
+         wcp->next = wch;
+         wch = wcp;
+         p = 0 /* nullptr */;
+       }
+      }
+      // TODO: Parallelize wording of "charset"'s diagnostic.
+      if (!had_range) {
+       t.error("no glyphs described after 'charset-range' directive");
+       return false;
+      }
+    }
     else if (strcmp(directive, "charset") == 0) {
       if (load_header_only)
        return true;
@@ -997,11 +1176,6 @@ bool font::load(bool load_header_only)
            t.error("invalid code '%1' for character '%2'", p, nm);
            return false;
          }
-         if (is_unicode) {
-           int w = wcwidth(metric.code);
-           if (w > 1)
-             metric.width *= w;
-         }
          p = strtok(0 /* nullptr */, WS);
          if ((0 /* nullptr */ == p) || (strcmp(p, "--") == 0)) {
            metric.special_device_coding = 0;
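
As a reading aid for `glyph_to_ucs_codepoint()`, the following sketch
shows the name-to-code-point step on plain strings.  It is a simplified
illustration, not groff's code: `looks_like_single_code_point()` and
`name_to_code_point()` are hypothetical helpers, and the validity check
is an assumption standing in for groff's own
`valid_unicode_code_sequence()`, which does more (the patch's extra
`strchr(nm, '_')` test suggests it also accepts composite names).

```cpp
#include <cctype>
#include <cstdlib>
#include <cstring>

// Hypothetical, simplified check: 'u' followed by four to six
// uppercase hexadecimal digits and nothing else.  Composite names
// such as "u0041_0301" fail because of the '_' separator.
static bool looks_like_single_code_point(const char *nm)
{
  if (nm == nullptr || nm[0] != 'u')
    return false;
  std::size_t n = std::strlen(nm + 1);
  if (n < 4 || n > 6)
    return false;
  for (const char *p = nm + 1; *p != '\0'; p++) {
    bool is_digit = std::isdigit(static_cast<unsigned char>(*p)) != 0;
    bool is_upper_hex = (*p >= 'A' && *p <= 'F');
    if (!is_digit && !is_upper_hex)
      return false;
  }
  return true;
}

// Return the code point named by `nm`, or -1 if `nm` does not name a
// single code point.
static int name_to_code_point(const char *nm)
{
  if (!looks_like_single_code_point(nm))
    return -1;
  // Skip the leading 'u' and read the rest as hexadecimal.
  return static_cast<int>(std::strtol(nm + 1, nullptr, 16));
}
```

The callers added by this patch test `uc > 0`, so both the -1 failure
value and code point 0 are treated as "no wide character metrics".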