gnustep-dev
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: NSObject.m threading bug(s).


From: Chris Ball
Subject: Re: NSObject.m threading bug(s).
Date: Fri, 10 Sep 2010 16:21:26 -0700

On Fri, 10 Sep 2010 22:42:46 +0100
David Chisnall <address@hidden> wrote:

> 
> On 10 Sep 2010, at 22:22, Chris Ball wrote:
 -- trim --

> > Potential bug #0:
 -- more trimming  --

> > Now it has been a long time since I have done any assembly programming and I
> > have never done any for mips/power but I don't really see anything at all
> > atomic about this.  If one thread gets through the load and then switches to
> > another thread which then goes through all the steps of load 
> > increment/decrement
> > store, when we get back to the first thread, we are essentially b0rked.
> 
> The load and store are interlocked.  This is how all atomic ops are 
> implemented on RISC platforms (including ARM).  The load puts the address in 
> a reservation buffer.  Every store to that address (by any CPU, in a 
> cache-cohernect system) sets a flag.  The interlinked store then writes to 
> the address in the reservation buffer if and only if the flag is not set.  
> The branch happens if the store failed - the load-modify-retry cycle happens 
> again until nothing else touches the memory in the middle.

I had no idea.
 
> > Bug #1 ('round about line 425):
> > 
> > #if     defined(GSATOMICREAD)
> >      int       result;
> > 
> >      result = 
> > GSAtomicDecrement((gsatomic_t)&(((obj)anObject)[-1].retained));
> >      if (result < 0)
> >        {
> >          if (result != -1)
> >            {
> >              [NSException raise: NSInternalInconsistencyException
> >                format: @"NSDecrementExtraRefCount() decremented too far"];
> >            }
> >          /* The counter has become negative so it must have been zero.
> >           * We reset it and return YES ... in a correctly operating
> >           * process we know we can safely reset back to zero without
> >           * worrying about atomicity, since there can be no other
> >           * thread accessing the object (or its reference count would
> >           * have been greater than zero)
> >           */
> >          (((obj)anObject)[-1].retained) = 0;
> >          return YES;
> >        }
> > #else   /* GSATOMICREAD */
> > 
> > 
> > For architectures like Intel where bug #0 isn't relevant we definitely have 
> > a
> > problem here.  GSAtomicDecrement can be atomic but the two if's following it
> > make this whole bit a critical section which requires some locks which 
> > pretty
> > much defeats the purpose of the GSATOMICLOCK stuff.  
> > 
> > Specifically one thread can get the result to -2 before the thread that 
> > noticed
> > that the result was -1 reset the result to zero.  
> 
> This will only ever happen in the case of an object being released after its 
> retain count hits 0.  This has undefined behaviour (in a single threaded 
> system, it will probably segfault), so we are free to do whatever we want 
> here, including some complete nonsense or crashing.
>

Whilst I understand what you are saying, my practical experience with our code
suggests there are subtleties perhaps within GNUstep that make this assumption
unsafe.  

Thread 1: 
        result = GSAtomicDecrement((gsatomic_t)&(((obj)anObject)[-1].retained));
where result = -1

now switch to Thread 2:
        result = GSAtomicDecrement((gsatomic_t)&(((obj)anObject)[-1].retained));
        if (result < 0)
          {
            if (result != -1)
              {
                [NSException raise: NSInternalInconsistencyException
                  format: @"NSDecrementExtraRefCount() decremented too far"];
              }

Death due to exception. 

Other variations of course exist, ie thread one can get through the first two
if's and switch to the other thread just after resetting the count to zero, now
thread one will return YES as will thread 2.

Further exercises are left to the reader.

If your contention is that only one thread is managing the retain count and
releasing an object, then you don't need an atomic decrement at all but that
clearly isn't true given that any thread can do retain/release an otherwise
autoreleased object.

I should point out that this isn't academic, we have a home rolled database
that has used GNUstep since around 2002 and processes between 500 million and a
billion messages per day and before I disabled this GSATOMIC stuff we would
process tops around 30 million messages before the system fell over in one
weird way or another  (this on a 32 core machine).  In our application the place
I most consistently found the code to fail is where we are transferring messages
via a queue from one thread to another so we have an autorelease message
created in thread 1, which is then retained for transfer to the queue (it is
retained again inside the queue because the queue itself is threadsafe) then
the other thread pulls the message from the queue (queue releases the message)
we do stuff with it and then call release (at which point we don't care about
the message anymore) and presumably thread one reaps the autoreleased memory.

Now I'm not proud, this system all told, is around 300k lines of code, of course
it has bugs, having said this the program doesn't leak the queued messages (we
don't have terrabytes of ram so we would notice) and I would expect wild double
releases in our code to have at least caused a few otherwise unexplained
application crashes over the years given that the place that fails is called
once per message, actually I wouldn't expect the system to last more than a
minute or two tops given the memory churn that the system goes through.

We played with a couple simple gnustep programs using the Intel thread checker
tool (http://software.intel.com/en-us/intel-thread-checker/) and disabling
GSATOMIC stopped quite a few strange errors in the GNUStep internals from
occurring for what it's worth.

> > Thank goodness for the NSInteralInconsistencyException because this would 
> > have
> > been a bear to find otherwise.  Although very random/crashy behaviour seemed
> > about as likely as getting the exception.
> 
> The exception is there for debugging.  We don't guarantee catching 
> double-free bugs using this mechanism, because if it happened in a 
> non-concurrent system we'd already have free'd the memory that the retain 
> count is stored in, giving a segfault.

There is no double free bugs that I'm aware of and since the actual memory 
release happens outside of the locks in the non-GSATOMIC version of the code
double frees could certainly still cause problems if they are present.

> 
> > At some level I suppose this can be half interesting because if there is a
> > properly atomic increment available you only really need locks for the
> > decrement half of the operation, but then again if we don't have real atomic
> > increment/decrement for a couple supported architectures it maybe isn't 
> > worth
> > the effort.
> 
> On any modern compiler (clang or gcc 4.2 or later), we are using the GCC 
> built-in functions for atomic ops, so that asm is not used.  It's only needed 
> for a couple of platforms that don't have a recent compiler on a few archs.

That's cool and if what you said above about memory reservations is true then
it shouldn't be an issue.

        Chris.


> David
> 
> -- Send from my Jacquard Loom

-- 



reply via email to

[Prev in Thread] Current Thread [Next in Thread]