Re: [Duplicity-talk] Caching for pwd and grp operations

From:	Steve Atwell
Subject:	Re: [Duplicity-talk] Caching for pwd and grp operations
Date:	Thu, 8 Nov 2012 11:32:03 -0800

On Thu, Nov 8, 2012 at 2:01 AM, Lluís Batlle i Rossell <address@hidden> wrote:
> Name resolution is cached by the glibc (even between multiple processes, with
> nscd).
>
> Is it that duplicity does not use the usual name resolution mechanisms?

Duplicity uses Python's pwd and grp modules, which are written in C
and wrap the normal libc getpw* and getgr* functions.  But there are a
couple of problems when you start working with very large group and
password maps.

1)  As the map file gets very large, searching it to get the entry you
want starts to take a little bit longer.  So it's not uncommon for
programs to cache this information.  E.g., GNU tar keeps a cache to
speed up all the lookups it needs to do.  (It's true that you can use
nscd to mitigate some of this, but nscd has some reliability issues.)

2)  The bigger problem is that Python's grp module parses the member
list string into a Python list every time you get a group entry.  If
you have a very large group (tens of thousands of members), doing this
string parsing gets really expensive when you're doing it over and
over.  This isn't something nscd can help with, because it just caches
the unparsed string.

Here's what happens when you add a group with 40k members to /etc/group:

>>> f = open('/etc/group', 'a')
>>> s = ','.join(['user%d' % i for i in xrange(40000)])
>>> f.write('big:x:10000:%s\n' % s)
>>> f.write('small:x:10001:user1,user2\n')
>>> f.close()

Now let's see how long group lookups take:

>>> import timeit
>>> timeit.timeit('import grp; grp.getgrnam("big")', number=10000)
33.63645005226135
>>> timeit.timeit('import grp; grp.getgrnam("small")', number=10000)
7.10194993019104

That's a pretty big difference.  (Even the "small" group lookup is
expensive here, because every time we scan /etc/group, we have to read
past the 380k "big" group line.)  But the difference gets even bigger
if you turn on nscd:

>>> timeit.timeit('import grp; grp.getgrnam("big")', number=10000)
28.031965017318726
>>> timeit.timeit('import grp; grp.getgrnam("small")', number=10000)
0.017841100692749023
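
As a sanity check on the 380k figure, the size of the member string
can be computed without touching /etc/group.  This is only a sketch in
Python 3 syntax (range instead of xrange); the variable names are mine
and it measures the string, not the lookup itself:

```python
# Rebuild the same member list that was appended to /etc/group above.
members = ','.join('user%d' % i for i in range(40000))

# Length in bytes of the member field alone (names plus commas).
size = len(members)
print(size)            # 388889
print(size // 1024)    # 379, i.e. roughly 380k

# Splitting on commas is a rough stand-in for what the grp module has
# to do for every lookup of the "big" group: build a 40000-item list.
parsed = members.split(',')
print(len(parsed))     # 40000
```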

So once you have large groups, you really can't afford to parse them
every time.  Duplicity doesn't need the member list, just the mapping
between name and gid, but you can't turn off member parsing in the grp
module.  :-(  That means you need a cache for group lookups.  And once
you have a cache for group lookups, you might as well have a cache for
passwd lookups too.
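
To make that concrete, here is a minimal sketch of the kind of cache I
mean.  The class name IdNameCache and the two instances at the bottom
are made up for illustration; this is not duplicity's actual code.  It
assumes all we ever need is the id -> name mapping, so it stores only
the name and throws the rest of the entry (including the parsed member
list) away:

```python
import grp
import pwd


class IdNameCache(object):
    """Memoize id -> name lookups so each id is resolved at most once."""

    def __init__(self, lookup):
        # lookup is any callable that returns a name for an id and
        # raises KeyError when the id is unknown.
        self._lookup = lookup
        self._names = {}

    def name(self, ident):
        if ident not in self._names:
            try:
                self._names[ident] = self._lookup(ident)
            except KeyError:
                # Remember misses too, so unknown ids are not re-scanned.
                self._names[ident] = None
        return self._names[ident]


# Hypothetical module-level caches, one per database.
group_names = IdNameCache(lambda gid: grp.getgrgid(gid).gr_name)
user_names = IdNameCache(lambda uid: pwd.getpwuid(uid).pw_name)
```

The first group_names.name(gid) call for a huge group still pays the
full parsing cost once; every later call for the same gid is a dict
lookup.  The cache never expires entries, which seems fine for a
single backup run but would be wrong for a long-lived process.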

-- 
Steve Atwell <address@hidden>
