[Sks-devel] Re: sks recon cores when it claims "reconciliation complete"


From: Yaron Minsky
Subject: [Sks-devel] Re: sks recon cores when it claims "reconciliation complete"
Date: Mon, 26 Jan 2004 15:08:19 -0500 (EST)
User-agent: SquirrelMail/1.4.2-1

Hi Chris.  First off, a small thing.  This kind of message should probably
go to address@hidden  (And of course, when sending there, messages
shouldn't be encrypted.)  I'm forwarding this message there in the hopes
of getting feedback from others.

Speaking of which, has anyone had experience running SKS on FreeBSD?  I'm
wondering if Chris's problems are unique.

I haven't seen this particular error before, and the error report is
unhappily hard to track down.  The error happens between lines 177 and 180
of reconserver.ml.  It might be useful to stick a few more plerror
instructions in there to see precisely where the error is.
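
A minimal sketch of that kind of instrumentation, assuming nothing about the
actual code in reconserver.ml.  The names trace, traced and suspect_step are
hypothetical; trace only stands in for plerror, whose real signature is not
shown here.  The point is to flush after every message, so the last line
printed before a segfault really is the last thing that ran.

```ocaml
(* Hypothetical stand-in for SKS's plerror: print to stderr and flush at
   once, so the message is not stuck in a buffer if the process segfaults. *)
let trace fmt =
  Printf.ksprintf (fun s -> prerr_string (s ^ "\n"); flush stderr) fmt

(* Hypothetical placeholder for one of the calls in the suspect lines. *)
let suspect_step x = x * 2

(* Bracket a call with "before" and "after" messages. *)
let traced label f x =
  trace "before %s" label;
  let result = f x in
  trace "after %s" label;
  result

let () =
  let y = traced "suspect_step" suspect_step 21 in
  trace "result = %d" y
```

If a "before" line shows up in the log with no matching "after" line, the
bracketed call is where the process died.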

For the most part, segfaults in OCaml come from two places: the native
marshalling code, when you get the type wrong (that doesn't occur at all in
my code, so we can ignore that case), and the interfacing code between OCaml
and C.  There are two places this might happen in SKS.  The first is the
interface to Berkeley DB, and the second is numerix, the large-integer
arithmetic package that SKS relies on.  The hashconvert function, which
occurs in the critical lines in question, could be the source of the
problem, so adding printout statements could help track it down.

One ugly possibility is that the error is somewhere else entirely, and the
crash shows up there just because that's when the GC happens to do a big
collection that runs over the memory in question.  If that's the case, it
will be harder to track down what's going on.
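
One way to probe for that case, offered as a sketch rather than a tested
recipe for SKS: force a full major collection right after each suspect call,
so that a corrupted heap is more likely to fail near its cause.  The names
checked and suspect_step are hypothetical; Gc.full_major is from the standard
library.

```ocaml
(* Hypothetical placeholder for a call that goes through C bindings. *)
let suspect_step x = x + 1

(* Run f, then force a full major collection.  If f left the heap in a bad
   state, the crash is more likely to happen here than at some unrelated
   later allocation. *)
let checked label f x =
  let result = f x in
  prerr_endline ("forcing GC after " ^ label);
  Gc.full_major ();
  prerr_endline ("heap survived " ^ label);
  result

let () = ignore (checked "suspect_step" suspect_step 1)
```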

The bytecode won't really help you here, since core dumps don't generate
stacktraces.  I'm not sure why the bytecode isn't working for you.  Can
you run the ocaml interpreter?  You can invoke it by typing "ocaml" at the
command line.  Also, try doing "ocaml unix.cma", and then do something
like "Unix.dup;;" and see if it throws a gasket.
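
The same check can be written as a small script, which also calls Unix.dup
instead of only naming it.  This is a sketch; the file name check_dup.ml and
the function dup_works are made up for illustration.

```ocaml
(* Save as check_dup.ml (hypothetical name) and run with:
     ocaml unix.cma check_dup.ml
   If the Unix library cannot be loaded or Unix.dup fails here, the problem
   is likely in the OCaml installation rather than in SKS. *)
let dup_works () =
  try
    let fd = Unix.dup Unix.stdout in   (* duplicate the stdout descriptor *)
    Unix.close fd;                     (* release the copy again *)
    true
  with Unix.Unix_error (err, fn, _) ->
    prerr_endline (fn ^ ": " ^ Unix.error_message err);
    false

let () =
  print_endline (if dup_works () then "Unix.dup works" else "Unix.dup failed")
```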

y

> Thanks for the pointer to "sks cleandb" that did the trick. anyway i'm now
> syncing with a few other machines, and whenever recon claims to have
> completed it cores.
>
> pyxis:ttyp9# tail log.recon
> 2004-01-26 10:12:45 Reconciliation complete           <-- core dump
> 2004-01-26 10:13:17 Opening log                   <-- i restart sks recon
> 2004-01-26 10:13:17 sks_recon, SKS version 1.0.6
> 2004-01-26 10:13:17 Copyright Yaron Minsky 2002-2003
> 2004-01-26 10:13:17 Licensed under GPL.  See COPYING file for details
> 2004-01-26 10:13:17 Opening PTree database
> 2004-01-26 10:13:17 Setting up PTree data structure
> 2004-01-26 10:13:18 PTree setup complete
> 2004-01-26 10:13:18 Initiating catchup
> 2004-01-26 10:13:22 Fetching filters
> 2004-01-26 10:13:26 Starting event loop
> 2004-01-26 10:14:35 Recon partner: <ADDR_INET 213.141.74.169:11370>
> 2004-01-26 10:14:35 Initiating reconciliation
>
> I don't know much about debugging ocaml, I assume that "alloc_small" is
> some sort of ocaml intrinsic? I find a bunch of things that call it in the
> sks source, but no definition thereof...
>
> pyxis:ttypa# gdb sks sks.core
> GNU gdb 4.16.1
> Copyright 1996 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "i386-unknown-openbsd3.4"...
> Core was generated by `sks'.
> Program terminated with signal 11, Segmentation fault.
> Reading symbols from /usr/lib/libz.so.3.0...done.
> Reading symbols from /usr/local/lib/libdb.so.4.2...done.
> Reading symbols from /usr/lib/libm.so.1.0...done.
> Reading symbols from /usr/lib/libc.so.30.3...done.
> Reading symbols from /usr/libexec/ld.so...done.
> #0  0x1c0e3f6b in alloc_small ()
> (gdb) bt
> #0  0x1c0e3f6b in alloc_small ()
> #1  0x1c0ed7f0 in alloc_custom ()
> #2  0x1c0bec68 in sx_split ()
> #3  0x1c085aac in Numerix__fun_2305 ()
> #4  0x60cf7fe0 in ?? ()
> Cannot access memory at address 0x10cf7fe0.
>
> objdump -S says this was going on....
> 1c0e3f64 <alloc_small>:
> 1c0e3f64:       55                      push   %ebp
> 1c0e3f65:       89 e5                   mov    %esp,%ebp
> 1c0e3f67:       83 ec 0c                sub    $0xc,%esp
> 1c0e3f6a:       57                      push   %edi
> 1c0e3f6b:       56                      push   %esi
>                                                       *CORE*
> 1c0e3f6c:       53                      push   %ebx
> 1c0e3f6d:       8b 75 08                mov    0x8(%ebp),%esi
> 1c0e3f70:       8b 7d 0c                mov    0xc(%ebp),%edi
> 1c0e3f73:       8d 1c b5 04 00 00 00    lea    0x4(,%esi,4),%ebx
> 1c0e3f7a:       a1 bc ad 05 3c          mov    0x3c05adbc,%eax
> 1c0e3f7f:       29 d8                   sub    %ebx,%eax
> 1c0e3f81:       a3 bc ad 05 3c          mov    %eax,0x3c05adbc
> 1c0e3f86:       3b 05 c0 ad 05 3c       cmp    0x3c05adc0,%eax
> 1c0e3f8c:       73 12                   jae    1c0e3fa0 <alloc_small+0x3c>
> 1c0e3f8e:       01 d8                   add    %ebx,%eax
> 1c0e3f90:       a3 bc ad 05 3c          mov    %eax,0x3c05adbc
> 1c0e3f95:       e8 f2 f6 ff ff          call   1c0e368c <minor_collection>
> 1c0e3f9a:       29 1d bc ad 05 3c       sub    %ebx,0x3c05adbc
> 1c0e3fa0:       8b 15 bc ad 05 3c       mov    0x3c05adbc,%edx
> 1c0e3fa6:       c1 e6 0a                shl    $0xa,%esi
> 1c0e3fa9:       8d 84 3e 00 03 00 00    lea    0x300(%esi,%edi,1),%eax
> 1c0e3fb0:       89 02                   mov    %eax,(%edx)
> 1c0e3fb2:       a1 bc ad 05 3c          mov    0x3c05adbc,%eax
> 1c0e3fb7:       83 c0 04                add    $0x4,%eax
> 1c0e3fba:       5b                      pop    %ebx
> 1c0e3fbb:       5e                      pop    %esi
> 1c0e3fbc:       5f                      pop    %edi
> 1c0e3fbd:       c9                      leave
> 1c0e3fbe:       c3                      ret
> 1c0e3fbf:       90                      nop
>
> (gdb) info registers
> eax            0x7      7
> ecx            0x0      0
> edx            0x4      4
> ebx            0x5      5
> esp            0xcf7fe000       0xcf7fe000
> ebp            0xcf7fe010       0xcf7fe010
> esi            0x3c059460       1006998624
> edi            0x8      8
> eip            0x1c0e3f6b       0x1c0e3f6b
> eflags         0x10292  66194
> cs             0x2b     43
> ss             0x33     51
> ds             0x33     51
> es             0x33     51
> fs             0x33     51
> gs             0x33     51
>
>
> I'll try run the bytecode version with backtrace turned on and see if that
> gets me any further. or not...
>
> pyxis:ttypa# ocamlrun bin/sks.bc help
> Fatal error: unknown C primitive `unix_dup'
>
> I'll see if the cvs code helps any.
>
> OS: OpenBSD 3.4-current i386
> DB: 4.2.52
> ML: Ocaml 3.07
> CC: "gcc version 2.95.3 20010125 (prerelease, propolice)"
>



|--------/            Yaron M. Minsky              \--------|
|--------\ http://www.cs.cornell.edu/home/yminsky/ /--------|

Open PGP --- KeyID B1FFD916
Fingerprint: 5BF6 83E1 0CE3 1043 95D8 F8D5 9F12 B3A9 B1FF D916




