bug-gmp
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

New version of sqrtrem.c


From: Rick Coupland
Subject: New version of sqrtrem.c
Date: Wed, 25 Jul 2001 19:14:43 -0700

I am attaching a new version of sqrtrem.c. This uses a new algorithm which I developed. For large operands, it is significantly faster than the current version. I developed this a while back for use in another project I was working on. I would now like to contribute
this to the GMP library.

Below is a timing comparison between the current 3.1.1 version and my new version. These
times were measured on a 750 MHz Athlon.

Original 3.1.1 version of sqrtrem.c
Small (001-002 words):  0.461207 microseconds per call
Medium (003-008 words):  2.0125 microseconds per call
Large (009-500 words):  233.403 microseconds per call


New version of sqrtrem.c
Small (001-002 words):  0.224138 microseconds per call
Medium (003-008 words):  1.4625 microseconds per call
Large (009-500 words):  111.81 microseconds per call


The algorithm is explained in the source code.

Please contact me with any questions or problems with the source code.


Rick Coupland
/* mpn_sqrtrem (root_ptr, rem_ptr, op_ptr, op_size)

   Write the square root of {OP_PTR, OP_SIZE} at ROOT_PTR.
   Write the remainder at REM_PTR, if REM_PTR != NULL.
   Return the size of the remainder.
   (The size of the root is always half of the size of the operand.)

   OP_PTR and ROOT_PTR may not point to the same object.
   OP_PTR and REM_PTR may point to the same object.

   If REM_PTR is NULL, only the root is computed and the return value of
   the function is 0 if OP is a perfect square, and *any* non-zero number
   otherwise.

Copyright (C) 1993, 1994, 1996, 1997, 1998, 1999, 2000 Free Software 
Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Library General Public License as published by
the Free Software Foundation; either version 2 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Library General Public
License for more details.

You should have received a copy of the GNU Library General Public License
along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
MA 02111-1307, USA. */

/* This code is just correct if "unsigned char" has at least 8 bits.  It
   doesn't help to use CHAR_BIT from limits.h, as the real problem is
   the static arrays.  */

#include "gmp.h"
#include "gmp-impl.h"
#include "longlong.h"

#define MP_LIMB_HI_BIT (1<<(BITS_PER_MP_LIMB-1))

/* Square root algorithm:

   1. Shift OP (the input) to the left an even number of bits s.t. there
      are an even number of words and either (or both) of the most
      significant bits are set.  This way, sqrt(OP) has exactly half as
      many words as OP, and has its most significant bit set.  This step
      scales OP by 2^(2*i).

   2. Calculate the square root of the high order word using a hardware
      sqrt instruction (if available) or a table lookup & Newton's method
      approximation.

   3. Calculate the sqrt of the two high order words using a single
      Newton's method iteration on the result from (2).  If the original
      OP had 2 or fewer significant words scale the result & return.

   4. Enter the main loop which doubles the number of significant words
      of the square root on each iteration.  This is done by dividing
      high order words of the remainder by the partial square root
      and then recalculating the remainder.  In general, this method
      will require the same number of iterations as Newton's method but
      is faster for large operands because the size of the division
      performed in each iteration is smaller.

   5. Scale the result by dividing by 2^i and recalculate the remainder.
*/

/* Define this macro for IEEE P854 machines with a fast sqrt instruction.  */
#if defined __GNUC__ && ! defined __SOFT_FLOAT

#if defined (__sparc__) && BITS_PER_MP_LIMB == 32
#define SQRT(a) \
  ({                                                                    \
    double __sqrt_res;                                                  \
    asm ("fsqrtd %1,%0" : "=f" (__sqrt_res) : "f" (a));                 \
    __sqrt_res;                                                         \
  })
#endif

#if defined (__HAVE_68881__)
#define SQRT(a) \
  ({                                                                    \
    double __sqrt_res;                                                  \
    asm ("fsqrtx %1,%0" : "=f" (__sqrt_res) : "f" (a));                 \
    __sqrt_res;                                                         \
  })
#endif

#if defined (__hppa) && BITS_PER_MP_LIMB == 32
#define SQRT(a) \
  ({                                                                    \
    double __sqrt_res;                                                  \
    asm ("fsqrt,dbl %1,%0" : "=fx" (__sqrt_res) : "fx" (a));            \
    __sqrt_res;                                                         \
  })
#endif

#if defined (_ARCH_PWR2) && BITS_PER_MP_LIMB == 32
#define SQRT(a) \
  ({                                                                    \
    double __sqrt_res;                                                  \
    asm ("fsqrt %0,%1" : "=f" (__sqrt_res) : "f" (a));                  \
    __sqrt_res;                                                         \
  })
#endif

#if 0
#if defined (__i386__) || defined (__i486__)
#define SQRT(a) \
  ({                                                                    \
    double __sqrt_res;                                                  \
    asm ("fsqrt" : "=t" (__sqrt_res) : "0" (a));                        \
    __sqrt_res;                                                         \
  })
#endif
#endif

#endif

#ifndef SQRT

/*
** Table for initial approximation of the square root.  This is
** indexed with bits 0-7 of the operand for which the square root is
** calculated, where bits 0-1 are the most significant non-zero bit
** pair.  Because the minimum value of bits 0-7 is 0x40, the actual
** index is calculated by subtracting 0x40 from the value of bits 0-7.
*/
static const unsigned char srtab[192] = {
  0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87,
  0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f,
  0x8f, 0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96,
  0x96, 0x97, 0x98, 0x99, 0x9a, 0x9b, 0x9b, 0x9c,
  0x9d, 0x9e, 0x9f, 0x9f, 0xa0, 0xa1, 0xa2, 0xa3,
  0xa3, 0xa4, 0xa5, 0xa6, 0xa7, 0xa7, 0xa8, 0xa9,
  0xaa, 0xaa, 0xab, 0xac, 0xad, 0xad, 0xae, 0xaf,
  0xaf, 0xb0, 0xb1, 0xb2, 0xb2, 0xb3, 0xb4, 0xb5,
  0xb5, 0xb6, 0xb7, 0xb7, 0xb8, 0xb9, 0xb9, 0xba,
  0xbb, 0xbb, 0xbc, 0xbd, 0xbd, 0xbe, 0xbf, 0xbf,
  0xc0, 0xc1, 0xc1, 0xc2, 0xc3, 0xc3, 0xc4, 0xc5,
  0xc5, 0xc6, 0xc7, 0xc7, 0xc8, 0xc9, 0xc9, 0xca,
  0xcb, 0xcb, 0xcc, 0xcc, 0xcd, 0xce, 0xce, 0xcf,
  0xcf, 0xd0, 0xd1, 0xd1, 0xd2, 0xd3, 0xd3, 0xd4,
  0xd4, 0xd5, 0xd6, 0xd6, 0xd7, 0xd7, 0xd8, 0xd9,
  0xd9, 0xda, 0xda, 0xdb, 0xdb, 0xdc, 0xdd, 0xdd,
  0xde, 0xde, 0xdf, 0xdf, 0xe0, 0xe1, 0xe1, 0xe2,
  0xe2, 0xe3, 0xe3, 0xe4, 0xe5, 0xe5, 0xe6, 0xe6,
  0xe7, 0xe7, 0xe8, 0xe8, 0xe9, 0xea, 0xea, 0xeb,
  0xeb, 0xec, 0xec, 0xed, 0xed, 0xee, 0xee, 0xef,
  0xef, 0xf0, 0xf1, 0xf1, 0xf2, 0xf2, 0xf3, 0xf3,
  0xf4, 0xf4, 0xf5, 0xf5, 0xf6, 0xf6, 0xf7, 0xf7,
  0xf8, 0xf8, 0xf9, 0xf9, 0xfa, 0xfa, 0xfb, 0xfb,
  0xfc, 0xfc, 0xfd, 0xfd, 0xfe, 0xfe, 0xff, 0xff,
};
#endif


mp_size_t
#if __STDC__
mpn_sqrtrem (mp_ptr root_ptr, mp_ptr rem_ptr, mp_srcptr op_ptr, mp_size_t 
op_size)
#else
mpn_sqrtrem (root_ptr, rem_ptr, op_ptr, op_size)
     mp_ptr root_ptr;
     mp_ptr rem_ptr;
     mp_srcptr op_ptr;
     mp_size_t op_size;
#endif
{
  /* R (root result) */
  mp_ptr rp;                    /* Ptr to LSW of partial square root */
  mp_ptr qp;                    /* Ptr to LSW of quotient */
  mp_size_t rsize;              /* The size in words of partial root */

  /* TT (temporary for numerator/remainder) */
  mp_ptr tp;                    /* Pointer to allocated tmp space */
  mp_ptr ntp;                   /* Ptr to LSW of current remainder */
  mp_size_t remsize;            /* Size of current remainder */
  mp_size_t ocnt;               /* Unused words of operand */
  mp_ptr prod;                  /* Ptr to product in main loop */

  unsigned cnt;                 /* Leading zero bits in high order limb of op */
  mp_limb_t iop[4];             /* Initial operand bits, left justified */
  mp_limb_t op_hi, op_h2;       /* High order limbs of operand */
  mp_limb_t sr0;                /* High order word of square root */
  mp_limb_t q, r;               /* Temp quotient & remainder */
  mp_limb_t p_hi, p_lo;         /* Used square of high order word of root */
  mp_size_t n_obits;            /* Number of significant bits in op */
  long i, t, t2;                /* Temp */
  mp_limb_t w;                  /* Temporary */

  TMP_DECL (marker);

  /* If OP is zero, both results are zero.  */
  if (op_size == 0)
    return 0;

  count_leading_zeros (cnt, op_ptr[op_size - 1]);
  cnt &= ~1;

  /* Allocate space for temp storage of operand & remainder */
  if (op_size > 2)
    tp = (mp_ptr)TMP_ALLOC((op_size + 2) * BYTES_PER_MP_LIMB) + 2;
  else {
    iop[0] = 0;
    tp = iop + 1;
  }
  if (cnt)
    (void)mpn_lshift(tp, op_ptr, op_size, cnt);
  else
    MPN_COPY(tp, op_ptr, op_size);

  op_hi = tp[op_size - 1];
  op_h2 = tp[op_size - 2];

/*
** First calculate square root of the 2 most significant limbs
** (left justified) of the operand.  This gives us the most
** significant limb (left justified) of the square root.
*/
/* Is there a fast sqrt instruction defined for this machine?  */
#ifdef SQRT
  {
    sr0 = SQRT (op_hi * MP_BASE_AS_DOUBLE + op_h2);
    /*
    ** If t_high0,,t_high1 is big, the result in sr0 might have
    ** become incorrect due to overflow in the conversion from double to
    ** mp_limb_t above.  It will typically be zero in that case, but might be
    ** a small number on some machines.  The most significant bit of
    ** sr0 should be set, so that bit is a good overflow indication.
    */
    if ((mp_limb_signed_t) sr0 >= 0)
      sr0 = ~(mp_limb_t)0;
  }
#else

  /*
  ** We handle cases with the high order limb of the normalized operand
  ** consisting of all 1's separately.  This is to avoid an overflow
  ** in divide operation of the newtons method approx.
  */
  if (op_hi == (mp_limb_t)~0) {
    /* Just plug-in the high order limb of the square root for this case */
    sr0 = (mp_limb_t)~0;
  }
  else {
    /*
    ** If high order limb of operand is too large, a first Newtons method
    ** approx will overflow.  For this case, the high order limb of the
    ** operand + 1 gives a good enough first approximation for the square
    ** root of the two high order limbs of the operand.
    */
    if (op_hi >= (mp_limb_t)(~((1 << BITS_PER_MP_LIMB/2) - 1))) {
      /* Use the high order limb of the operand plus 1 as first approx */
      sr0 = op_hi + 1;
    }
    else {
      /*
      ** Otherwise, get first approximation from lookup table to get
      ** 1st 8 bits of square root.  Then do small precision Newtons
      ** iterations to get 1st 1/2 limb of square root.
      */
      sr0 = srtab[(op_hi >> (BITS_PER_MP_LIMB - 8)) - 0x40]
            << ((BITS_PER_MP_LIMB - 16)/2);
      sr0 = (sr0 + op_hi/sr0);

# if BITS_PER_MP_LIMB > 32
      i = 32;
      do {
        sr0 >>= 1;
        sr0 = (sr0 + op_hi/sr0);
      } while ((i <<= 1) < BITS_PER_MP_LIMB);
# endif

      if (op_size == 1) {
        sr0 >>= (cnt/2 + 1);
        if ((mp_limb_signed_t)(w = op_ptr[0] - sr0*sr0) < 0) {
          sr0 -= 1;
          w += ((sr0 << 1) + 1);
        }
        *root_ptr = sr0;
        if (rem_ptr)
          rem_ptr[0] = w;
        TMP_FREE (marker);
        return (w != 0);
      }

      sr0 <<= ((BITS_PER_MP_LIMB / 2) - 1);
    }

    /* Do one more Newtons method iteration to get high order limb of root */
    udiv_qrnnd(q, r, op_hi, op_h2, sr0);
    sr0 = ((sr0 + q) >> 1) | MP_LIMB_HI_BIT;
  }
#endif

  /*
  ** Square our calculated square root of the 2 high order words.  If it
  ** is too large decrement the root by 1 & recalculate the product.
  */
  umul_ppmm(p_hi, p_lo, sr0, sr0);
  if (p_hi > op_hi || (p_hi == op_hi && p_lo > op_h2))
  {
    sr0 -= 1;
    umul_ppmm(p_hi, p_lo, sr0, sr0);
  }

  /* If the operand was less than 2 full words, we are done */
  if ((t = 2 - op_size) > 0 || (!t && cnt))
  {
    /* Right justify the root & return it */
    *root_ptr = (sr0 >>= ((BITS_PER_MP_LIMB*t + cnt)/2));

    /* Calculate the remainder */
    /*  We know that the remainder must be <= 2*root. */
    w = op_ptr[0] - sr0*sr0;
    if (rem_ptr)
      rem_ptr[0] = w;
    TMP_FREE (marker);
    return (w != 0);
  }

  /* Calculate remainder & store in tp */
  ntp = tp + op_size - 2;
  sub_ddmmss(ntp[1], ntp[0], op_hi, op_h2, p_hi, p_lo);

  /* Setup root ptr & size */
  rp = root_ptr + (op_size - 1)/2;
  rsize = 1;
  rp[0] = sr0;

  /* Return if operand was exactly 2 full limbs */
  if (op_size == 2) {
    remsize = ntp[1] ? 2 : (ntp[0] ? 1 : 0);
    if (remsize && rem_ptr)
      MPN_COPY(rem_ptr, ntp, remsize);
    TMP_FREE (marker);
    return remsize;
  }

  tp[-1] = mpn_rshift(tp, tp, op_size, 1);
  tp[-2] = 0;
  remsize = ntp[0] ? 1 : 0;

  /* Remaining unprocessed words of operand */
  ocnt = op_size - 2;

  /* Allocate temp storage for q^2 */
  prod = (mp_ptr)TMP_ALLOC(((op_size + 1)/2) * BYTES_PER_MP_LIMB);

  /* Main loop */
  for (i = 1; ocnt > 0; ) {
    if (2*i >= ocnt)
      i = (ocnt + 1)/2;
    else if (2*(i - 1) > (ocnt - 2*i))
      i = (ocnt + 3)/4;
    ocnt -= (t = 2*i);
    ntp -= i;
    qp = rp - i;
    if ( mpn_divrem(qp, 0, ntp, rsize + i, rp, rsize) ) {
      /* Only happens if {ntp, rsize} == {rp, rsize} */
      memset(qp, -1, BYTES_PER_MP_LIMB * i);
      memset(ntp + i, 0, BYTES_PER_MP_LIMB * rsize);
      if (mpn_add_n(ntp, ntp, rp, rsize)) {
        ntp[rsize] = (mp_limb_t)1;
        remsize += 1;
      }
    }
    ntp -= i;
    remsize += i;
    rsize += i;
    mpn_mul_n(prod, qp, qp, i);
    rp = qp;
    w = mpn_rshift(prod, prod, t, 1);
    if ((ntp[-1] ^= w) & w)
      prod[0] += 1;

    if ( mpn_sub_n(ntp, ntp, prod, t) )
      if (remsize <= t || mpn_sub_1(ntp + t, ntp + t, remsize - t, 
(mp_limb_t)1)) {
        /*
        ** The calculated new words of the root at {qp, i} were too big by 1.
        ** Decrement qp by one and adjust the remainder.
        */
        mpn_decr_u(qp, 1);
        mpn_add_n(ntp, ntp, rp, rsize);
        if ( !((ntp[-1]  ^= MP_LIMB_HI_BIT) & MP_LIMB_HI_BIT) )
          mpn_incr_u(ntp, 1);
      }
    remsize = rsize;
    i <<= 1;
    /*
    **  The following will speed convergence for some operands but appears to
    **  not be worth the overhead on average.
    **
    MPN_NORMALIZE(ntp, remsize);
    i = rsize + 2*(rsize - remsize);
    */
  }
  if (ntp[remsize] = mpn_lshift(ntp - 1, ntp - 1, remsize + 1, 1))
    remsize += 1; 

  /* 
  ** We computed the square root of OP * 2**(2*i)
  ** Scale the square root by dividing by 2*i and adjust the remainder.
  */
  if (t = (op_size & 1) * BITS_PER_MP_LIMB + cnt) {
    t >>= 1;
    r = rp[0] & ((1 << t) - 1);
    prod[rsize] = mpn_lshift(prod, rp, rsize, 1);
    prod[0] -= r;
    mpn_addmul_1(ntp, prod, rsize + 1, r);
    mpn_rshift(rp, rp, rsize, t);
    remsize = rsize + 1 - (op_size & 1);
  }

  /* Find the actual remainder size */
  MPN_NORMALIZE(tp, remsize);

  /* If caller specified a remainder pointer, return remainder */
  if (rem_ptr && remsize)
  {
    if (cnt)
      mpn_rshift(rem_ptr, tp, remsize, cnt);
    else
      MPN_COPY(rem_ptr, tp, remsize);
  }

  TMP_FREE (marker);
  return remsize;
}

reply via email to

[Prev in Thread] Current Thread [Next in Thread]