bug-gnu-emacs

bug#72165: 31.0.50; Intermittent crashing with recent emacs build


From: Dima Kogan
Subject: bug#72165: 31.0.50; Intermittent crashing with recent emacs build
Date: Thu, 18 Jul 2024 00:25:14 -0700

Thank you very much for replying, Eli.


> So when you say that "anecdotally, the 2024/04/30 build has been very
> stable", what exactly do you mean? It sounds like both that build and
> the one from 2024/07/09 crash in the same way, so why do you consider
> the April one "very stable"?

Sorry, I wasn't clear. I've been using the April build for many months,
and hadn't seen any crashes at all until today. Today I tried to debug
the mu4e modeline problem, and saw it crash. Then I updated to the
latest build (2024/07/09) hoping it would be fixed, and kept seeing
crashes as I continued to debug. So whatever the problem is, it
started in April or earlier.

Here are some notes about the mu4e problem, which may be correlated with
this crash. I'm hazy on the details, so there's no bug report yet, but
I've at least pinpointed the mechanism.

truncate-string-to-width() in international/mule-util.el has:

  (condition-case nil
      (while (< column end-column)
        (setq last-column column
              last-idx idx
              ch (aref str idx)
              column (+ column (char-width ch))
              idx (1+ idx)))
    (args-out-of-range (setq idx str-len)))

The intent is that idx may reach or exceed length(str), at which point
(aref str idx) signals args-out-of-range and the condition-case catches
it. But under some (probably over-specified) conditions this reliably
does not happen:

- mu4e is running, with multiple mail contexts; it shows the selected
  context in its modeline, which eventually calls
  truncate-string-to-width()
- I have some remote file opened with TRAMP
- I run (shell-command) from the remote buffer

In at least this scenario, args-out-of-range errors from the above (aref
...) are uncaught (100% of the time with my config), and appear in the
*Messages* buffer. I was debugging this by tweaking and re-evaluating my
local copy of truncate-string-to-width() and other related functions in
the *scratch* buffer, while looking at the *Messages* buffer in another
window. Will get back to this in a sec.

Here's what I see in the core dump:

  (gdb) p current_thread->m_current_buffer->text->z
  $22 = 32192

  (gdb) p current_thread->m_current_buffer->text->z_byte
  $23 = 32178

  (gdb) p current_thread->m_current_buffer->pt
  $24 = 32192

  (gdb) p current_thread->m_current_buffer->pt_byte
  $25 = 32178

So that tells me that the failing condition isn't the one gdb flagged,
but the one immediately after:

  if (BYTEPOS (opoint) < CHARPOS (opoint))
    emacs_abort ();

Compiler optimizations could be responsible for the discrepancy. Am I
understanding correctly that this check makes sure that BYTEPOS >=
CHARPOS, which must always be true because every Emacs character
occupies at least 1 byte?
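
To spell out the invariant I have in mind, here is a minimal standalone sketch. It is not Emacs code: count_chars and bytepos_ge_charpos are hypothetical helpers, and plain UTF-8 stands in for Emacs's internal multibyte representation. Every character starts with exactly one non-continuation byte, so the byte count can never be smaller than the character count.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Emacs: count the characters in a
   NUL-terminated UTF-8 string.  Continuation bytes have the bit
   pattern 10xxxxxx and do not start a new character.  */
size_t
count_chars (const char *s)
{
  assert (s != NULL);
  size_t n = 0;
  for (; *s; s++)
    if (((unsigned char) *s & 0xC0) != 0x80)
      n++;
  return n;
}

/* Return 1 if the byte length is >= the character length.  This
   mirrors the BYTEPOS >= CHARPOS sanity check in spirit only.  */
int
bytepos_ge_charpos (const char *s)
{
  assert (s != NULL);
  return strlen (s) >= count_chars (s);
}
```

A pure-ASCII string gives equal counts; any multibyte character makes the byte count strictly larger. A byte position smaller than the matching character position, as in the dump, cannot come from valid text.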

The buffer name:

  (gdb) p current_thread->m_current_buffer->name_
  $26 = XIL(0x7fc685b24c1c)

  (gdb) xstring
  $27 = (struct Lisp_String *) 0x7fc685b24c18
  "*Messages*"

I confirm that the text is our own text:

  (gdb) p &current_thread->m_current_buffer->own_text
  $43 = (struct buffer_text *) 0x7fc685a107e0

  (gdb) p current_thread->m_current_buffer->text
  $44 = (struct buffer_text *) 0x7fc685a107e0

The full structure:

  (gdb) p current_thread->m_current_buffer->own_text
  $45 = {
    beg                        = 0x561d7100f800 ...
    z                          = 32192,
    z_byte                     = 32178,
    gpt                        = 32191,
    gpt_byte                   = 32177,
    gap_size                   = 1313,
    modiff                     = 69879,
    chars_modiff               = 69879,
    save_modiff                = 1,
    overlay_modiff             = 10,
    compact                    = 53392,
    beg_unchanged              = 0,
    end_unchanged              = 0,
    unchanged_modified         = 69373,
    overlay_unchanged_modified = 6,
    intervals                  = 0x0,
    markers                    = 0x561d6da79bc0,
    inhibit_shrinking          = false,
    redisplay                  = true
  }

Looks like gpt and gpt_byte show the same inconsistency as z and z_byte:
in both pairs the byte position is smaller than the character position.
Looking at the definitions in buffer.h, I guess the above means that the
gap starts at offset gpt_byte-1 = 32176.
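
As a sanity check on the numbers, here is a small sketch that applies the "byte position >= char position" rule to the four values from the dump. The struct is a simplified stand-in I made up for illustration (field names copied from the gdb output, not the real struct buffer_text), and positions_consistent is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the fields of interest; NOT the real
   struct buffer_text from buffer.h.  */
struct text_positions
{
  ptrdiff_t z, z_byte;       /* end of text: char and byte position */
  ptrdiff_t gpt, gpt_byte;   /* gap start: char and byte position */
};

/* Return 1 if no byte position is smaller than its char position and
   the gap start does not lie past the end of the text.  */
int
positions_consistent (const struct text_positions *t)
{
  assert (t != NULL);
  return t->z_byte >= t->z
         && t->gpt_byte >= t->gpt
         && t->gpt <= t->z
         && t->gpt_byte <= t->z_byte;
}
```

Fed the dump's values (z = 32192, z_byte = 32178, gpt = 32191, gpt_byte = 32177), the check fails, and both pairs are off by the same amount: the character position exceeds the byte position by 14 in each case.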

Let's look at the last bit of the buffer:

  (gdb) printf "%.2200s\n", &current_thread->m_current_buffer->text->beg[30000]
  share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-mime-parts.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-modeline.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-notification.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-obsolete.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-org.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-pkg.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-query-items.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-search.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-server.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-speedbar.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-thread.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-update.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-vars.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-view.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-window.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-actions.el
  0 matching files marked
  Error during redisplay: (eval (mu4e--modeline-string) t) signaled (args-out-of-range "" 0) [3 times]
  Error during redisplay: (eval (mu4e--modeline-string) t) signaled (args-out-of-range #("<fastmail>" 1 9 (face mu4e-context-face help-echo "mu4e context: fastmail")) 10)
  Error during redisplay: (eval (mu4e--modeline-string) t) signaled (args-out-of-range "" 0) [5 times]
  Error during redisplay: (eval (mu4e--modeline-string) t) signaled (args-out-of-range #("<fastmail>" 1 9 (face mu4e-context-face help-echo "mu4e context: fastmail")) 10)
  QuitError during redisplay: (eval (mu4e--modeline-string) t) signaled (args-out-of-range "" 0)
  Error during redisplay: (eval (mu4e--modeline-string) t) signaled (args-out-of-range "" 0) [2 times]
  Error during redisplay: (eval (mu4e--modeline-string) t) signaled (args-out-of-range #("<fastmail>" 1 9 (face mu4e-context-face help-echo "mu4e context: fastmail")) 10)
  Error during redisplay: (eval (mu4e--modeline-string) t) signaled (args-out-of-range "" 0) [5 times]

This particular message ("Error during redisplay") appears (I think)
when I remove the (condition-case ...) wrapper above to let the
(aref ...) fail. It wouldn't crash most of the time. Also, I'm not at
all confident that this is the only scenario where it crashes, but maybe.

Let's look just at the last little bit, to count the bytes:

  (gdb) printf "%.200s\n", &current_thread->m_current_buffer->text->beg[32000]
  mail>" 1 9 (face mu4e-context-face help-echo "mu4e context: fastmail")) 10)
  Error during redisplay: (eval (mu4e--modeline-string) t) signaled (args-out-of-range "" 0) [5 times]

I asked for at most 200 bytes (up to offset 32200). I got exactly 176
bytes, so the text ends at offset 32176, where the gap supposedly
begins. That makes sense. Let's look a bit past the end, INTO the gap:

  (gdb) x /3cb &current_thread->m_current_buffer->text->beg[32176]
  0x561d710175b0:       0 '\000'        0 '\000'        114 'r'

So we have two trailing \0 bytes. Past them:

  (gdb) printf "%.200s\n", &current_thread->m_current_buffer->text->beg[32178]
  rror during redisplay: (eval (mu4e--modeline-string) t) signaled (args-out-of-range "" 0)
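
To make the layout I'm assuming concrete, here is a minimal gap-buffer sketch. It reflects my reading of buffer.h (byte positions are 1-based, and positions at or after gpt_byte are shifted past the gap in storage), but the struct and both helpers are hypothetical names for illustration, not Emacs's own definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal gap-buffer model; NOT the real struct buffer_text.  */
struct gap_text
{
  ptrdiff_t gpt_byte;   /* 1-based byte position where the gap starts */
  ptrdiff_t gap_size;   /* gap length in bytes */
};

/* Map a 1-based byte position to a 0-based offset from the start of
   storage, skipping over the gap when the position lies after it.  */
ptrdiff_t
storage_offset (const struct gap_text *t, ptrdiff_t bytepos)
{
  assert (t != NULL && bytepos >= 1);
  return (bytepos - 1) + (bytepos >= t->gpt_byte ? t->gap_size : 0);
}

/* Return 1 if the 0-based storage offset falls inside the gap.  */
int
offset_in_gap (const struct gap_text *t, ptrdiff_t offset)
{
  assert (t != NULL);
  ptrdiff_t gap_start = t->gpt_byte - 1;
  return offset >= gap_start && offset < gap_start + t->gap_size;
}
```

With gpt_byte = 32177 and gap_size = 1313 from the dump, offsets 32176 through 33488 would be gap. Under that reading, the bytes printed at beg[32176] and beg[32178] are leftover gap contents rather than live text, assuming the dumped values can be trusted at all.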

Theory: there's a race condition between error handling that ends up
writing to *Messages* and the logic that aggregates duplicated messages
into things like [5 times]. People usually don't have lots of errors
happening, and they usually don't stare at the *Messages* buffer, so
this is easily missed. Anything more you would suggest?
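
To illustrate the kind of aggregation I mean, here is a toy message log that collapses consecutive duplicates into "[N times]" by rewriting the last line in place. This is only a sketch of the idea, not how Emacs actually implements it: struct msg_log and log_message are made-up names, and the fixed-size buffers are an assumption to keep it short.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy message log; NOT Emacs's implementation.  Zero-initialize it
   before first use.  */
struct msg_log
{
  char text[4096];     /* whole log, NUL-terminated */
  size_t last_start;   /* offset where the last line starts */
  char last_msg[256];  /* last message, without any count suffix */
  int repeat;          /* consecutive occurrences of last_msg */
};

/* Append MSG as a new line, or, if it repeats the previous message,
   rewrite the previous line with an updated "[N times]" suffix.
   Output is silently truncated if the log buffer fills up.  */
void
log_message (struct msg_log *log, const char *msg)
{
  assert (log != NULL && msg != NULL);
  assert (strlen (msg) < sizeof log->last_msg);
  if (log->repeat > 0 && strcmp (msg, log->last_msg) == 0)
    {
      /* Duplicate: overwrite text that is already in the log.  */
      log->repeat++;
      snprintf (log->text + log->last_start,
                sizeof log->text - log->last_start,
                "%s [%d times]\n", msg, log->repeat);
    }
  else
    {
      log->last_start = strlen (log->text);
      log->repeat = 1;
      strcpy (log->last_msg, msg);
      snprintf (log->text + log->last_start,
                sizeof log->text - log->last_start, "%s\n", msg);
    }
}
```

The duplicate path is the interesting one: it modifies text that is already in the buffer instead of appending. If something else wrote to the buffer between the bookkeeping and the rewrite, cached positions like last_start would go stale, which is the sort of inconsistency I'm speculating about.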

I saw a crash roughly once every 20 minutes, so reproducing it is
probably possible, but not quick or easy. Does it make sense to try to
fix the (condition-case) problem first, since that's easily reproducible?

Thank you




