bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: sort --random-sort


From: Paul Eggert
Subject: Re: sort --random-sort
Date: Sat, 03 Dec 2005 00:46:39 -0800
User-agent: Gnus/5.1007 (Gnus v5.10.7) Emacs/21.4 (gnu/linux)

Frederik Eaton <address@hidden> writes:

> I still think what I sent is ready to apply.

I took a look at the patch quoted in that email and have some comments
and suggestions.

It's more conservative to minimize the changes to the ISAAC code.  The
simplest thing to do for now is to put it into a rand-isaac.c file,
which is included both by shred.c and by sort.c.  This results in a
trivial change to shred.c (remove a block of code and replace it by an
#include of a copy of that code) and greater confidence that we
haven't messed up the random-number generator somehow.  I didn't quite
follow all the ISAAC_WORDS stuff, but since the patch you sent uses
the same value for both sort and shred, we might as well leave it a
constant, to minimize the number of changes to the code.

Eventually, once we factor in /dev/random / /dev/urandom stuff, this
can be revisited, but in the meantime it's better to minimize changes
to the code.

As a documentation issue, I'd feel more comfortable calling this a
"hashed" sort rather than a "random" sort.  Using the word "random"
has the wrong connotations, since the sort doesn't generate a random
permutation of the input.  So I would prefer option names like
--hash-sort and -h.

When two different keys hash to the same value, the comparison should
fall back on byte-by-byte comparison of the keys; otherwise the user
can't rely on identical keys sorting next to each other, which is one
of the desired properties here.

I don't see why get_hash needs to extract a subset of data from the mm
buffer.  Why not just compare the contents of the entire buffer?  Is
there something nonrandom about that?  If so, perhaps we should go the
whole hog and use the "correct" way to get at it.  (*)

We needn't xstrdup optarg; the original is not going away.

The new files need proper copyright notices, headers, and authorship info.

The usual style is to use NULL rather than 0 for the null pointer.

Here's a proposed patch embodying the above suggestions.  I'm still a
bit puzzled about the part marked (*) above, though.

2005-12-03  Frederik Eaton <address@hidden>
       and  Paul Eggert  <address@hidden>

        * doc/coreutils.texi (sort invocation): Add -h or --hash-sort
        and --seed, plus an example illustrating -h.
        * src/Makefile.am (sort_LDADD): Add $(LIB_GETHRXTIME).
        * src/shred.c: Include rand-isaac.c.  Move the ISAAC code into it.
        * src/sort.c: Include rand-isaac.c.
        (struct keyfield): New member hash.
        (rand_state): New static variable.
        (usage, short_options, long_options): Support -h, --hash-sort, --seed.
        (SEED_OPTION): New constant.
        * src/rand-isaac.c: New file, taken from the code moved out of
        shred.c, but with the following changes:
        Include gethrxtime.h.
        (isaac_seed_start): Initialize mm even if
        AVOID_USED_UNINITIALIZED_WARNINGS is 0, since sort needs this
        to get reproducible results with a user-specified seed.

Index: doc/coreutils.texi
===================================================================
RCS file: /fetish/cu/doc/coreutils.texi,v
retrieving revision 1.297
diff -p -u -r1.297 coreutils.texi
--- doc/coreutils.texi  26 Nov 2005 07:52:43 -0000      1.297
+++ doc/coreutils.texi  3 Dec 2005 08:41:29 -0000
@@ -3396,6 +3396,16 @@ To compare such strings numerically, use
 Reverse the result of comparison, so that lines with greater key values
 appear earlier in the output instead of later.
 
address@hidden -h
address@hidden --hash-sort
address@hidden -h
address@hidden --hash-sort
address@hidden hash sort
+Sort by hashing the input keys and then sorting the hash values.  This
+is much like a random shuffle of the input keys, except that keys with
+the same value sort together.  Normally the hash function is chosen
+at random, but this can be overridden with the @option{--seed} option.
+
 @end table
 
 Other options are:
@@ -3529,6 +3539,17 @@ This option can be useful in conjunction
 reliably handle arbitrary file names (even those containing blanks
 or other special characters).
 
address@hidden address@hidden
address@hidden --seed
address@hidden seed for hash
+Use data from @var{string} to choose the hash function used by the
address@hidden option.  This option can be used to reproduce
+results of earlier invocations of @command{sort} with
address@hidden  However, results are not necessarily
+reproducible across different @command{sort} implementations (e.g.,
address@hidden on little-endian versus big-endian architectures, or
+from one version of @command{sort} to the next).
+
 @end table
 
 Historical (BSD and System V) implementations of @command{sort} have
@@ -3695,6 +3716,16 @@ by the sort operation.
 @c printf 'c\n\nb\n\na\n'|perl -0pe 's/\n\n/\n\0/g'|sort -z|perl -0pe 
's/\0/\n/g'
 @c @end example
 
address@hidden
+Shuffle a list of directories, but preserve the order of files within
+each directory.  For instance, one could use this to generate a music
+playlist in which albums are shuffled but the songs of each album are
+played in order.
+
address@hidden
+ls */* | sort -t / -k 1,1h -k 2,2
address@hidden example
+
 @end itemize
 
 
Index: src/Makefile.am
===================================================================
RCS file: /fetish/cu/src/Makefile.am,v
retrieving revision 1.59
diff -p -u -r1.59 Makefile.am
--- src/Makefile.am     23 Oct 2005 15:23:56 -0000      1.59
+++ src/Makefile.am     3 Dec 2005 08:41:29 -0000
@@ -69,7 +69,7 @@ shred_LDADD = $(LDADD) $(LIB_GETHRXTIME)
 vdir_LDADD = $(LDADD) $(LIB_CLOCK_GETTIME)
 
 ## If necessary, add -lm to resolve use of pow in lib/strtod.c.
-sort_LDADD = $(LDADD) $(POW_LIB)
+sort_LDADD = $(LDADD) $(POW_LIB) $(LIB_GETHRXTIME)
 
 # for get_date and gettime
 date_LDADD = $(LDADD) $(LIB_CLOCK_GETTIME)
Index: src/shred.c
===================================================================
RCS file: /fetish/cu/src/shred.c,v
retrieving revision 1.117
diff -p -u -r1.117 shred.c
--- src/shred.c 24 Sep 2005 13:40:37 -0000      1.117
+++ src/shred.c 3 Dec 2005 08:41:29 -0000
@@ -110,6 +110,8 @@
 #include "quotearg.h"          /* For quotearg_colon */
 #include "quote.h"             /* For quotearg_colon */
 
+#include "rand-isaac.c"
+
 #define DEFAULT_PASSES 25      /* Default */
 
 /* How many seconds to wait before checking whether to output another
@@ -227,387 +229,6 @@ to be recovered later.\n\
 }
 
 /*
- * --------------------------------------------------------------------
- *     Bob Jenkins' cryptographic random number generator, ISAAC.
- *     Hacked by Colin Plumb.
- *
- * We need a source of random numbers for some of the overwrite data.
- * Cryptographically secure is desirable, but it's not life-or-death
- * so I can be a little bit experimental in the choice of RNGs here.
- *
- * This generator is based somewhat on RC4, but has analysis
- * (http://ourworld.compuserve.com/homepages/bob_jenkins/randomnu.htm)
- * pointing to it actually being better.  I like it because it's nice
- * and fast, and because the author did good work analyzing it.
- * --------------------------------------------------------------------
- */
-
-/* Size of the state tables to use.  (You may change ISAAC_LOG) */
-#define ISAAC_LOG 8
-#define ISAAC_WORDS (1 << ISAAC_LOG)
-#define ISAAC_BYTES (ISAAC_WORDS * sizeof (uint32_t))
-
-/* RNG state variables */
-struct isaac_state
-  {
-    uint32_t mm[ISAAC_WORDS];  /* Main state array */
-    uint32_t iv[8];            /* Seeding initial vector */
-    uint32_t a, b, c;          /* Extra index variables */
-  };
-
-/* This index operation is more efficient on many processors */
-#define ind(mm, x) \
-  (* (uint32_t *) ((char *) (mm) \
-                  + ((x) & (ISAAC_WORDS - 1) * sizeof (uint32_t))))
-
-/*
- * The central step.  This uses two temporaries, x and y.  mm is the
- * whole state array, while m is a pointer to the current word.  off is
- * the offset from m to the word ISAAC_WORDS/2 words away in the mm array,
- * i.e. +/- ISAAC_WORDS/2.
- */
-#define isaac_step(mix, a, b, mm, m, off, r) \
-( \
-  a = ((a) ^ (mix)) + (m)[off], \
-  x = *(m), \
-  *(m) = y = ind (mm, x) + (a) + (b), \
-  *(r) = b = ind (mm, (y) >> ISAAC_LOG) + x \
-)
-
-/*
- * Refill the entire R array, and update S.
- */
-static void
-isaac_refill (struct isaac_state *s, uint32_t r[/* ISAAC_WORDS */])
-{
-  uint32_t a, b;               /* Caches of a and b */
-  uint32_t x, y;               /* Temps needed by isaac_step macro */
-  uint32_t *m = s->mm; /* Pointer into state array */
-
-  a = s->a;
-  b = s->b + (++s->c);
-
-  do
-    {
-      isaac_step (a << 13, a, b, s->mm, m, ISAAC_WORDS / 2, r);
-      isaac_step (a >> 6, a, b, s->mm, m + 1, ISAAC_WORDS / 2, r + 1);
-      isaac_step (a << 2, a, b, s->mm, m + 2, ISAAC_WORDS / 2, r + 2);
-      isaac_step (a >> 16, a, b, s->mm, m + 3, ISAAC_WORDS / 2, r + 3);
-      r += 4;
-    }
-  while ((m += 4) < s->mm + ISAAC_WORDS / 2);
-  do
-    {
-      isaac_step (a << 13, a, b, s->mm, m, -ISAAC_WORDS / 2, r);
-      isaac_step (a >> 6, a, b, s->mm, m + 1, -ISAAC_WORDS / 2, r + 1);
-      isaac_step (a << 2, a, b, s->mm, m + 2, -ISAAC_WORDS / 2, r + 2);
-      isaac_step (a >> 16, a, b, s->mm, m + 3, -ISAAC_WORDS / 2, r + 3);
-      r += 4;
-    }
-  while ((m += 4) < s->mm + ISAAC_WORDS);
-  s->a = a;
-  s->b = b;
-}
-
-/*
- * The basic seed-scrambling step for initialization, based on Bob
- * Jenkins' 256-bit hash.
- */
-#define mix(a,b,c,d,e,f,g,h) \
-   (       a ^= b << 11, d += a, \
-   b += c, b ^= c >>  2, e += b, \
-   c += d, c ^= d <<  8, f += c, \
-   d += e, d ^= e >> 16, g += d, \
-   e += f, e ^= f << 10, h += e, \
-   f += g, f ^= g >>  4, a += f, \
-   g += h, g ^= h <<  8, b += g, \
-   h += a, h ^= a >>  9, c += h, \
-   a += b                        )
-
-/* The basic ISAAC initialization pass.  */
-static void
-isaac_mix (struct isaac_state *s, uint32_t const seed[/* ISAAC_WORDS */])
-{
-  int i;
-  uint32_t a = s->iv[0];
-  uint32_t b = s->iv[1];
-  uint32_t c = s->iv[2];
-  uint32_t d = s->iv[3];
-  uint32_t e = s->iv[4];
-  uint32_t f = s->iv[5];
-  uint32_t g = s->iv[6];
-  uint32_t h = s->iv[7];
-
-  for (i = 0; i < ISAAC_WORDS; i += 8)
-    {
-      a += seed[i];
-      b += seed[i + 1];
-      c += seed[i + 2];
-      d += seed[i + 3];
-      e += seed[i + 4];
-      f += seed[i + 5];
-      g += seed[i + 6];
-      h += seed[i + 7];
-
-      mix (a, b, c, d, e, f, g, h);
-
-      s->mm[i] = a;
-      s->mm[i + 1] = b;
-      s->mm[i + 2] = c;
-      s->mm[i + 3] = d;
-      s->mm[i + 4] = e;
-      s->mm[i + 5] = f;
-      s->mm[i + 6] = g;
-      s->mm[i + 7] = h;
-    }
-
-  s->iv[0] = a;
-  s->iv[1] = b;
-  s->iv[2] = c;
-  s->iv[3] = d;
-  s->iv[4] = e;
-  s->iv[5] = f;
-  s->iv[6] = g;
-  s->iv[7] = h;
-}
-
-#if 0 /* Provided for reference only; not used in this code */
-/*
- * Initialize the ISAAC RNG with the given seed material.
- * Its size MUST be a multiple of ISAAC_BYTES, and may be
- * stored in the s->mm array.
- *
- * This is a generalization of the original ISAAC initialization code
- * to support larger seed sizes.  For seed sizes of 0 and ISAAC_BYTES,
- * it is identical.
- */
-static void
-isaac_init (struct isaac_state *s, uint32_t const *seed, size_t seedsize)
-{
-  static uint32_t const iv[8] =
-  {
-    0x1367df5a, 0x95d90059, 0xc3163e4b, 0x0f421ad8,
-    0xd92a4a78, 0xa51a3c49, 0xc4efea1b, 0x30609119};
-  int i;
-
-# if 0
-  /* The initialization of iv is a precomputed form of: */
-  for (i = 0; i < 7; i++)
-    iv[i] = 0x9e3779b9;                /* the golden ratio */
-  for (i = 0; i < 4; ++i)      /* scramble it */
-    mix (iv[0], iv[1], iv[2], iv[3], iv[4], iv[5], iv[6], iv[7]);
-# endif
-  s->a = s->b = s->c = 0;
-
-  for (i = 0; i < 8; i++)
-    s->iv[i] = iv[i];
-
-  if (seedsize)
-    {
-      /* First pass (as in reference ISAAC code) */
-      isaac_mix (s, seed);
-      /* Second and subsequent passes (extension to ISAAC) */
-      while (seedsize -= ISAAC_BYTES)
-       {
-         seed += ISAAC_WORDS;
-         for (i = 0; i < ISAAC_WORDS; i++)
-           s->mm[i] += seed[i];
-         isaac_mix (s, s->mm);
-       }
-    }
-  else
-    {
-      /* The no seed case (as in reference ISAAC code) */
-      for (i = 0; i < ISAAC_WORDS; i++)
-       s->mm[i] = 0;
-    }
-
-  /* Final pass */
-  isaac_mix (s, s->mm);
-}
-#endif
-
-/* Start seeding an ISAAC structire */
-static void
-isaac_seed_start (struct isaac_state *s)
-{
-  static uint32_t const iv[8] =
-    {
-      0x1367df5a, 0x95d90059, 0xc3163e4b, 0x0f421ad8,
-      0xd92a4a78, 0xa51a3c49, 0xc4efea1b, 0x30609119
-    };
-  int i;
-
-#if 0
-  /* The initialization of iv is a precomputed form of: */
-  for (i = 0; i < 7; i++)
-    iv[i] = 0x9e3779b9;                /* the golden ratio */
-  for (i = 0; i < 4; ++i)      /* scramble it */
-    mix (iv[0], iv[1], iv[2], iv[3], iv[4], iv[5], iv[6], iv[7]);
-#endif
-  for (i = 0; i < 8; i++)
-    s->iv[i] = iv[i];
-
-  /* Enable the following memset if you're worried about used-uninitialized
-     warnings involving code in isaac_refill from tools like valgrind.
-     Since this buffer is used to accumulate pseudo-random data, there's
-     no harm, and maybe even some benefit, in using it uninitialized.  */
-#if AVOID_USED_UNINITIALIZED_WARNINGS
-  memset (s->mm, 0, sizeof s->mm);
-#endif
-
-  /* s->c gets used for a data pointer during the seeding phase */
-  s->a = s->b = s->c = 0;
-}
-
-/* Add a buffer of seed material */
-static void
-isaac_seed_data (struct isaac_state *s, void const *buffer, size_t size)
-{
-  unsigned char const *buf = buffer;
-  unsigned char *p;
-  size_t avail;
-  size_t i;
-
-  avail = sizeof s->mm - (size_t) s->c;        /* s->c is used as a write 
pointer */
-
-  /* Do any full buffers that are necessary */
-  while (size > avail)
-    {
-      p = (unsigned char *) s->mm + s->c;
-      for (i = 0; i < avail; i++)
-       p[i] ^= buf[i];
-      buf += avail;
-      size -= avail;
-      isaac_mix (s, s->mm);
-      s->c = 0;
-      avail = sizeof s->mm;
-    }
-
-  /* And the final partial block */
-  p = (unsigned char *) s->mm + s->c;
-  for (i = 0; i < size; i++)
-    p[i] ^= buf[i];
-  s->c = size;
-}
-
-
-/* End of seeding phase; get everything ready to produce output. */
-static void
-isaac_seed_finish (struct isaac_state *s)
-{
-  isaac_mix (s, s->mm);
-  isaac_mix (s, s->mm);
-  /* Now reinitialize c to start things off right */
-  s->c = 0;
-}
-#define ISAAC_SEED(s,x) isaac_seed_data (s, &(x), sizeof (x))
-
-/*
- * Get seed material.  16 bytes (128 bits) is plenty, but if we have
- * /dev/urandom, we get 32 bytes = 256 bits for complete overkill.
- */
-static void
-isaac_seed (struct isaac_state *s)
-{
-  isaac_seed_start (s);
-
-  { pid_t t = getpid ();   ISAAC_SEED (s, t); }
-  { pid_t t = getppid ();  ISAAC_SEED (s, t); }
-  { uid_t t = getuid ();   ISAAC_SEED (s, t); }
-  { gid_t t = getgid ();   ISAAC_SEED (s, t); }
-
-  {
-    xtime_t t = gethrxtime ();
-    ISAAC_SEED (s, t);
-  }
-
-  {
-    char buf[32];
-    int fd = open ("/dev/urandom", O_RDONLY | O_NOCTTY);
-    if (fd >= 0)
-      {
-       read (fd, buf, 32);
-       close (fd);
-       isaac_seed_data (s, buf, 32);
-      }
-    else
-      {
-       fd = open ("/dev/random", O_RDONLY | O_NONBLOCK | O_NOCTTY);
-       if (fd >= 0)
-         {
-           /* /dev/random is more precious, so use less */
-           read (fd, buf, 16);
-           close (fd);
-           isaac_seed_data (s, buf, 16);
-         }
-      }
-  }
-
-  isaac_seed_finish (s);
-}
-
-/* Single-word RNG built on top of ISAAC */
-struct irand_state
-{
-  uint32_t r[ISAAC_WORDS];
-  unsigned int numleft;
-  struct isaac_state *s;
-};
-
-static void
-irand_init (struct irand_state *r, struct isaac_state *s)
-{
-  r->numleft = 0;
-  r->s = s;
-}
-
-/*
- * We take from the end of the block deliberately, so if we need
- * only a small number of values, we choose the final ones which are
- * marginally better mixed than the initial ones.
- */
-static uint32_t
-irand32 (struct irand_state *r)
-{
-  if (!r->numleft)
-    {
-      isaac_refill (r->s, r->r);
-      r->numleft = ISAAC_WORDS;
-    }
-  return r->r[--r->numleft];
-}
-
-/*
- * Return a uniformly distributed random number between 0 and n,
- * inclusive.  Thus, the result is modulo n+1.
- *
- * Theory of operation: as x steps through every possible 32-bit number,
- * x % n takes each value at least 2^32 / n times (rounded down), but
- * the values less than 2^32 % n are taken one additional time.  Thus,
- * x % n is not perfectly uniform.  To fix this, the values of x less
- * than 2^32 % n are disallowed, and if the RNG produces one, we ask
- * for a new value.
- */
-static uint32_t
-irand_mod (struct irand_state *r, uint32_t n)
-{
-  uint32_t x;
-  uint32_t lim;
-
-  if (!++n)
-    return irand32 (r);
-
-  lim = -n % n;                        /* == (2**32-n) % n == 2**32 % n */
-  do
-    {
-      x = irand32 (r);
-    }
-  while (x < lim);
-  return x % n;
-}
-
-/*
  * Fill a buffer with a fixed pattern.
  *
  * The buffer must be at least 3 bytes long, even if
Index: src/sort.c
===================================================================
RCS file: /fetish/cu/src/sort.c,v
retrieving revision 1.328
diff -p -u -r1.328 sort.c
--- src/sort.c  7 Oct 2005 18:48:28 -0000       1.328
+++ src/sort.c  3 Dec 2005 08:41:30 -0000
@@ -77,6 +77,8 @@ double strtod ();
 # define DEFAULT_TMPDIR "/tmp"
 #endif
 
+#include "rand-isaac.c"
+
 /* Exit statuses.  */
 enum
   {
@@ -149,6 +151,7 @@ struct keyfield
                                   point, but no exponential notation. */
   bool general_numeric;                /* Flag for general, numeric comparison.
                                   Handle numbers in exponential notation. */
+  bool hash;                   /* Hash keys before sorting.  */
   bool month;                  /* Flag for comparison by month name. */
   bool reverse;                        /* Reverse the sense of comparison. */
   struct keyfield *next;       /* Next keyfield to try. */
@@ -254,6 +257,10 @@ static bool unique;
 /* Nonzero if any of the input files are the standard input. */
 static bool have_read_stdin;
 
+/* The ISAAC state resulting from the user-specified seed, or from
+   random data taken from the environment.  */
+static struct isaac_state rand_state;
+
 /* List of key field comparisons to be tried.  */
 static struct keyfield *keylist;
 
@@ -300,6 +307,7 @@ Ordering options:\n\
 "), stdout);
       fputs (_("\
   -g, --general-numeric-sort  compare according to general numerical value\n\
+  -h, --hash-sort             compare hashes of keys\n\
   -i, --ignore-nonprinting    consider only printable characters\n\
   -M, --month-sort            compare (unknown) < `JAN' < ... < `DEC'\n\
   -n, --numeric-sort          compare according to string numerical value\n\
@@ -313,6 +321,7 @@ Other options:\n\
   -k, --key=POS1[,POS2]     start a key at POS1, end it at POS2 (origin 1)\n\
   -m, --merge               merge already sorted files; do not sort\n\
   -o, --output=FILE         write result to FILE instead of standard output\n\
+      --seed=STRING         seed hash function with STRING\n\
   -s, --stable              stabilize sort by disabling last-resort 
comparison\n\
   -S, --buffer-size=SIZE    use SIZE for main memory buffer\n\
 "), stdout);
@@ -353,7 +362,14 @@ native byte values.\n\
   exit (status);
 }
 
-static char const short_options[] = "-bcdfgik:mMno:rsS:t:T:uy:z";
+/* For long options that have no equivalent short option, use a
+   non-character as a pseudo short option, starting with CHAR_MAX + 1.  */
+enum
+{
+  SEED_OPTION = CHAR_MAX + 1
+};
+
+static char const short_options[] = "-bcdfghik:mMno:rsS:t:T:uy:z";
 
 static struct option const long_options[] =
 {
@@ -367,6 +383,7 @@ static struct option const long_options[
   {"merge", no_argument, NULL, 'm'},
   {"month-sort", no_argument, NULL, 'M'},
   {"numeric-sort", no_argument, NULL, 'n'},
+  {"hash-sort", no_argument, NULL, 'h'},
   {"output", required_argument, NULL, 'o'},
   {"reverse", no_argument, NULL, 'r'},
   {"stable", no_argument, NULL, 's'},
@@ -375,6 +392,7 @@ static struct option const long_options[
   {"temporary-directory", required_argument, NULL, 'T'},
   {"unique", no_argument, NULL, 'u'},
   {"zero-terminated", no_argument, NULL, 'z'},
+  {"seed", required_argument, NULL, SEED_OPTION},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
   {NULL, 0, NULL, 0},
@@ -1179,6 +1197,20 @@ keycompare (const struct line *a, const 
       size_t lenb = limb <= textb ? 0 : limb - textb;
 
       /* Actually compare the fields. */
+
+      if (key->hash)
+       {
+         struct isaac_state diga = rand_state;
+         struct isaac_state digb = rand_state;
+         isaac_seed_data (&diga, texta, lena);
+         isaac_seed_data (&digb, textb, lenb);
+         isaac_seed_finish (&diga);
+         isaac_seed_finish (&digb);
+         diff = memcmp (diga.mm, digb.mm, sizeof diga.mm);
+         if (diff)
+           goto not_equal;
+       }
+
       if (key->numeric | key->general_numeric)
        {
          char savea = *lima, saveb = *limb;
@@ -2069,6 +2101,9 @@ set_ordering (const char *s, struct keyf
        case 'g':
          key->general_numeric = true;
          break;
+       case 'h':
+         key->hash = true;
+         break;
        case 'i':
          /* Option order should not matter, so don't let -i override
             -d.  -d implies -i, but -i does not imply -d.  */
@@ -2109,6 +2144,8 @@ main (int argc, char **argv)
   int c = 0;
   bool checkonly = false;
   bool mergeonly = false;
+  char const *seed = NULL;
+  bool need_rand = false;
   size_t nfiles = 0;
   bool posixly_correct = (getenv ("POSIXLY_CORRECT") != NULL);
   bool obsolete_usage = (posix2_version () < 200112);
@@ -2187,7 +2224,8 @@ main (int argc, char **argv)
   gkey.sword = gkey.eword = SIZE_MAX;
   gkey.ignore = NULL;
   gkey.translate = NULL;
-  gkey.numeric = gkey.general_numeric = gkey.month = gkey.reverse = false;
+  gkey.numeric = gkey.general_numeric = false;
+  gkey.hash = gkey.month = gkey.reverse = false;
   gkey.skipsblanks = gkey.skipeblanks = false;
 
   files = xnmalloc (argc, sizeof *files);
@@ -2259,6 +2297,7 @@ main (int argc, char **argv)
        case 'd':
        case 'f':
        case 'g':
+       case 'h':
        case 'i':
        case 'M':
        case 'n':
@@ -2339,6 +2378,10 @@ main (int argc, char **argv)
          outfile = optarg;
          break;
 
+       case SEED_OPTION:
+         seed = optarg;
+         break;
+
        case 's':
          stable = true;
          break;
@@ -2413,26 +2456,44 @@ main (int argc, char **argv)
        }
     }
 
+  need_rand = gkey.hash;
   /* Inheritance of global options to individual keys. */
   for (key = keylist; key; key = key->next)
-    if (! (key->ignore || key->translate
-          || (key->skipsblanks | key->reverse
-              | key->skipeblanks | key->month | key->numeric
-              | key->general_numeric)))
-      {
-       key->ignore = gkey.ignore;
-       key->translate = gkey.translate;
-       key->skipsblanks = gkey.skipsblanks;
-       key->skipeblanks = gkey.skipeblanks;
-       key->month = gkey.month;
-       key->numeric = gkey.numeric;
-       key->general_numeric = gkey.general_numeric;
-       key->reverse = gkey.reverse;
-      }
+    {
+      need_rand |= key->hash;
+      if (! (key->ignore || key->translate
+            || (key->skipsblanks | key->reverse
+                | key->skipeblanks | key->month | key->numeric
+                | key->general_numeric
+                | key->hash)))
+       {
+         key->ignore = gkey.ignore;
+         key->translate = gkey.translate;
+         key->skipsblanks = gkey.skipsblanks;
+         key->skipeblanks = gkey.skipeblanks;
+         key->month = gkey.month;
+         key->numeric = gkey.numeric;
+         key->general_numeric = gkey.general_numeric;
+         key->hash = gkey.hash;
+         key->reverse = gkey.reverse;
+       }
+    }
+
+  if (need_rand)
+    {
+      if (seed)
+       {
+         isaac_seed_start (&rand_state);
+         isaac_seed_data (&rand_state, seed, strlen (seed));
+       }
+      else
+       isaac_seed (&rand_state);
+    }
 
   if (!keylist && (gkey.ignore || gkey.translate
                   || (gkey.skipsblanks | gkey.skipeblanks | gkey.month
-                      | gkey.numeric | gkey.general_numeric)))
+                      | gkey.numeric | gkey.general_numeric
+                      | gkey.hash)))
     insertkey (&gkey);
   reverse = gkey.reverse;
 
--- /dev/null   2005-09-24 22:00:15.000000000 -0700
+++ src/rand-isaac.c    2005-12-02 22:49:16.000000000 -0800
@@ -0,0 +1,395 @@
+/* Bob Jenkins's cryptographic random number generator, ISAAC.
+
+   Copyright (C) 1999-2005 Free Software Foundation, Inc.
+   Copyright (C) 1997, 1998, 1999 Colin Plumb.
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 2, or (at your option)
+   any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program; if not, write to the Free Software Foundation,
+   Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+
+   Written by Colin Plumb.  */
+
+/*
+ * --------------------------------------------------------------------
+ * We need a source of random numbers for some data.
+ * Cryptographically secure is desirable, but it's not life-or-death
+ * so I can be a little bit experimental in the choice of RNGs here.
+ *
+ * This generator is based somewhat on RC4, but has analysis
+ * <http://burtleburtle.net/bob/rand/isaacafa.html>
+ * pointing to it actually being better.  I like it because it's nice
+ * and fast, and because the author did good work analyzing it.
+ * --------------------------------------------------------------------
+ */
+
+#include "gethrxtime.h"
+
+/* Size of the state tables to use.  ISAAC_LOG should be at least 3,
+   and smaller values give less security.  */
+#define ISAAC_LOG 8
+#define ISAAC_WORDS (1 << ISAAC_LOG)
+#define ISAAC_BYTES (ISAAC_WORDS * sizeof (uint32_t))
+
+/* RNG state variables */
+struct isaac_state
+  {
+    uint32_t mm[ISAAC_WORDS];  /* Main state array */
+    uint32_t iv[8];            /* Seeding initial vector */
+    uint32_t a, b, c;          /* Extra index variables */
+  };
+
+/* This index operation is more efficient on many processors */
+#define ind(mm, x) \
+  (* (uint32_t *) ((char *) (mm) \
+                  + ((x) & (ISAAC_WORDS - 1) * sizeof (uint32_t))))
+
+/*
+ * The central step.  This uses two temporaries, x and y.  mm is the
+ * whole state array, while m is a pointer to the current word.  off is
+ * the offset from m to the word ISAAC_WORDS/2 words away in the mm array,
+ * i.e. +/- ISAAC_WORDS/2.
+ */
+#define isaac_step(mix, a, b, mm, m, off, r) \
+( \
+  a = ((a) ^ (mix)) + (m)[off], \
+  x = *(m), \
+  *(m) = y = ind (mm, x) + (a) + (b), \
+  *(r) = b = ind (mm, (y) >> ISAAC_LOG) + x \
+)
+
+/*
+ * Refill the entire R array, and update S.
+ */
+static void
+isaac_refill (struct isaac_state *s, uint32_t r[/* ISAAC_WORDS */])
+{
+  uint32_t a, b;               /* Caches of a and b */
+  uint32_t x, y;               /* Temps needed by isaac_step macro */
+  uint32_t *m = s->mm; /* Pointer into state array */
+
+  a = s->a;
+  b = s->b + (++s->c);
+
+  do
+    {
+      isaac_step (a << 13, a, b, s->mm, m, ISAAC_WORDS / 2, r);
+      isaac_step (a >> 6, a, b, s->mm, m + 1, ISAAC_WORDS / 2, r + 1);
+      isaac_step (a << 2, a, b, s->mm, m + 2, ISAAC_WORDS / 2, r + 2);
+      isaac_step (a >> 16, a, b, s->mm, m + 3, ISAAC_WORDS / 2, r + 3);
+      r += 4;
+    }
+  while ((m += 4) < s->mm + ISAAC_WORDS / 2);
+  do
+    {
+      isaac_step (a << 13, a, b, s->mm, m, -ISAAC_WORDS / 2, r);
+      isaac_step (a >> 6, a, b, s->mm, m + 1, -ISAAC_WORDS / 2, r + 1);
+      isaac_step (a << 2, a, b, s->mm, m + 2, -ISAAC_WORDS / 2, r + 2);
+      isaac_step (a >> 16, a, b, s->mm, m + 3, -ISAAC_WORDS / 2, r + 3);
+      r += 4;
+    }
+  while ((m += 4) < s->mm + ISAAC_WORDS);
+  s->a = a;
+  s->b = b;
+}
+
+/*
+ * The basic seed-scrambling step for initialization, based on Bob
+ * Jenkins' 256-bit hash.
+ */
+#define mix(a,b,c,d,e,f,g,h) \
+   (       a ^= b << 11, d += a, \
+   b += c, b ^= c >>  2, e += b, \
+   c += d, c ^= d <<  8, f += c, \
+   d += e, d ^= e >> 16, g += d, \
+   e += f, e ^= f << 10, h += e, \
+   f += g, f ^= g >>  4, a += f, \
+   g += h, g ^= h <<  8, b += g, \
+   h += a, h ^= a >>  9, c += h, \
+   a += b                        )
+
+/* The basic ISAAC initialization pass.  */
+static void
+isaac_mix (struct isaac_state *s, uint32_t const seed[/* ISAAC_WORDS */])
+{
+  int i;
+  uint32_t a = s->iv[0];
+  uint32_t b = s->iv[1];
+  uint32_t c = s->iv[2];
+  uint32_t d = s->iv[3];
+  uint32_t e = s->iv[4];
+  uint32_t f = s->iv[5];
+  uint32_t g = s->iv[6];
+  uint32_t h = s->iv[7];
+
+  for (i = 0; i < ISAAC_WORDS; i += 8)
+    {
+      a += seed[i];
+      b += seed[i + 1];
+      c += seed[i + 2];
+      d += seed[i + 3];
+      e += seed[i + 4];
+      f += seed[i + 5];
+      g += seed[i + 6];
+      h += seed[i + 7];
+
+      mix (a, b, c, d, e, f, g, h);
+
+      s->mm[i] = a;
+      s->mm[i + 1] = b;
+      s->mm[i + 2] = c;
+      s->mm[i + 3] = d;
+      s->mm[i + 4] = e;
+      s->mm[i + 5] = f;
+      s->mm[i + 6] = g;
+      s->mm[i + 7] = h;
+    }
+
+  s->iv[0] = a;
+  s->iv[1] = b;
+  s->iv[2] = c;
+  s->iv[3] = d;
+  s->iv[4] = e;
+  s->iv[5] = f;
+  s->iv[6] = g;
+  s->iv[7] = h;
+}
+
+#if 0 /* Provided for reference only; not used in this code */
+/*
+ * Initialize the ISAAC RNG with the given seed material.
+ * Its size MUST be a multiple of ISAAC_BYTES, and may be
+ * stored in the s->mm array.
+ *
+ * This is a generalization of the original ISAAC initialization code
+ * to support larger seed sizes.  For seed sizes of 0 and ISAAC_BYTES,
+ * it is identical.
+ */
+static void
+isaac_init (struct isaac_state *s, uint32_t const *seed, size_t seedsize)
+{
+  static uint32_t const iv[8] =
+  {
+    0x1367df5a, 0x95d90059, 0xc3163e4b, 0x0f421ad8,
+    0xd92a4a78, 0xa51a3c49, 0xc4efea1b, 0x30609119};
+  int i;
+
+# if 0
+  /* The initialization of iv is a precomputed form of: */
+  for (i = 0; i < 7; i++)
+    iv[i] = 0x9e3779b9;                /* the golden ratio */
+  for (i = 0; i < 4; ++i)      /* scramble it */
+    mix (iv[0], iv[1], iv[2], iv[3], iv[4], iv[5], iv[6], iv[7]);
+# endif
+  s->a = s->b = s->c = 0;
+
+  for (i = 0; i < 8; i++)
+    s->iv[i] = iv[i];
+
+  if (seedsize)
+    {
+      /* First pass (as in reference ISAAC code) */
+      isaac_mix (s, seed);
+      /* Second and subsequent passes (extension to ISAAC) */
+      while (seedsize -= ISAAC_BYTES)
+       {
+         seed += ISAAC_WORDS;
+         for (i = 0; i < ISAAC_WORDS; i++)
+           s->mm[i] += seed[i];
+         isaac_mix (s, s->mm);
+       }
+    }
+  else
+    {
+      /* The no seed case (as in reference ISAAC code) */
+      for (i = 0; i < ISAAC_WORDS; i++)
+       s->mm[i] = 0;
+    }
+
+  /* Final pass */
+  isaac_mix (s, s->mm);
+}
+#endif
+
+/* Start seeding an ISAAC structire */
+static void
+isaac_seed_start (struct isaac_state *s)
+{
+  static uint32_t const iv[8] =
+    {
+      0x1367df5a, 0x95d90059, 0xc3163e4b, 0x0f421ad8,
+      0xd92a4a78, 0xa51a3c49, 0xc4efea1b, 0x30609119
+    };
+  int i;
+
+#if 0
+  /* The initialization of iv is a precomputed form of: */
+  for (i = 0; i < 7; i++)
+    iv[i] = 0x9e3779b9;                /* the golden ratio */
+  for (i = 0; i < 4; ++i)      /* scramble it */
+    mix (iv[0], iv[1], iv[2], iv[3], iv[4], iv[5], iv[6], iv[7]);
+#endif
+  for (i = 0; i < 8; i++)
+    s->iv[i] = iv[i];
+
+  memset (s->mm, 0, sizeof s->mm);
+
+  /* s->c gets used for a data pointer during the seeding phase */
+  s->a = s->b = s->c = 0;
+}
+
+/* Add a buffer of seed material */
+static void
+isaac_seed_data (struct isaac_state *s, void const *buffer, size_t size)
+{
+  unsigned char const *buf = buffer;
+  unsigned char *p;
+  size_t avail;
+  size_t i;
+
+  avail = sizeof s->mm - (size_t) s->c;        /* s->c is used as a write 
pointer */
+
+  /* Do any full buffers that are necessary */
+  while (size > avail)
+    {
+      p = (unsigned char *) s->mm + s->c;
+      for (i = 0; i < avail; i++)
+       p[i] ^= buf[i];
+      buf += avail;
+      size -= avail;
+      isaac_mix (s, s->mm);
+      s->c = 0;
+      avail = sizeof s->mm;
+    }
+
+  /* And the final partial block */
+  p = (unsigned char *) s->mm + s->c;
+  for (i = 0; i < size; i++)
+    p[i] ^= buf[i];
+  s->c = size;
+}
+
+
+/* End of seeding phase; get everything ready to produce output. */
+static void
+isaac_seed_finish (struct isaac_state *s)
+{
+  isaac_mix (s, s->mm);
+  isaac_mix (s, s->mm);
+  /* Now reinitialize c to start things off right */
+  s->c = 0;
+}
+#define ISAAC_SEED(s,x) isaac_seed_data (s, &(x), sizeof (x))
+
+/*
+ * Get seed material.  16 bytes (128 bits) is plenty, but if we have
+ * /dev/urandom, we get 32 bytes = 256 bits for complete overkill.
+ */
+static void
+isaac_seed (struct isaac_state *s)
+{
+  isaac_seed_start (s);
+
+  { pid_t t = getpid ();   ISAAC_SEED (s, t); }
+  { pid_t t = getppid ();  ISAAC_SEED (s, t); }
+  { uid_t t = getuid ();   ISAAC_SEED (s, t); }
+  { gid_t t = getgid ();   ISAAC_SEED (s, t); }
+
+  {
+    xtime_t t = gethrxtime ();
+    ISAAC_SEED (s, t);
+  }
+
+  {
+    char buf[32];
+    int fd = open ("/dev/urandom", O_RDONLY | O_NOCTTY);
+    if (fd >= 0)
+      {
+       read (fd, buf, 32);
+       close (fd);
+       isaac_seed_data (s, buf, 32);
+      }
+    else
+      {
+       fd = open ("/dev/random", O_RDONLY | O_NONBLOCK | O_NOCTTY);
+       if (fd >= 0)
+         {
+           /* /dev/random is more precious, so use less */
+           read (fd, buf, 16);
+           close (fd);
+           isaac_seed_data (s, buf, 16);
+         }
+      }
+  }
+
+  isaac_seed_finish (s);
+}
+
+/* Single-word RNG built on top of ISAAC */
+struct irand_state
+{
+  uint32_t r[ISAAC_WORDS];
+  unsigned int numleft;
+  struct isaac_state *s;
+};
+
+static void
+irand_init (struct irand_state *r, struct isaac_state *s)
+{
+  r->numleft = 0;
+  r->s = s;
+}
+
+/*
+ * We take from the end of the block deliberately, so if we need
+ * only a small number of values, we choose the final ones which are
+ * marginally better mixed than the initial ones.
+ */
+static uint32_t
+irand32 (struct irand_state *r)
+{
+  if (!r->numleft)
+    {
+      isaac_refill (r->s, r->r);
+      r->numleft = ISAAC_WORDS;
+    }
+  return r->r[--r->numleft];
+}
+
+/*
+ * Return a uniformly distributed random number between 0 and n,
+ * inclusive.  Thus, the result is modulo n+1.
+ *
+ * Theory of operation: as x steps through every possible 32-bit number,
+ * x % n takes each value at least 2^32 / n times (rounded down), but
+ * the values less than 2^32 % n are taken one additional time.  Thus,
+ * x % n is not perfectly uniform.  To fix this, the values of x less
+ * than 2^32 % n are disallowed, and if the RNG produces one, we ask
+ * for a new value.
+ */
+static uint32_t
+irand_mod (struct irand_state *r, uint32_t n)
+{
+  uint32_t x;
+  uint32_t lim;
+
+  if (!++n)
+    return irand32 (r);
+
+  lim = -n % n;                        /* == (2**32-n) % n == 2**32 % n */
+  do
+    {
+      x = irand32 (r);
+    }
+  while (x < lim);
+  return x % n;
+}




reply via email to

[Prev in Thread] Current Thread [Next in Thread]