bug-coreutils

bug#9500: [PATCH]: use posix_fallocate where supported


From: Pádraig Brady
Subject: bug#9500: [PATCH]: use posix_fallocate where supported
Date: Wed, 23 Nov 2011 00:49:11 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:6.0) Gecko/20110816 Thunderbird/6.0

On 09/14/2011 03:46 PM, Pádraig Brady wrote:
> On 09/14/2011 03:06 PM, Eric Blake wrote:
>> On 09/13/2011 11:55 PM, Kelly Anderson wrote:
>>> Hi,
>>>
>>> I put together a patch 2 or 3 years ago (back when posix_fallocate was
>>> first introduced in glibc).
>>
>> Thanks for the effort.  However, this has been discussed in the past, and 
>> the consensus was that we should first write a patch to gnulib that provides 
>> a posix_fallocate() stub for all platforms, so that coreutils can 
>> unconditionally call posix_fallocate, rather than making coreutils have to 
>> use #ifdef.  Among other things, a gnulib module would make it possible to 
>> emulate posix_fallocate() even on older glibc where it is missing or broken.
>>
> 
> Also we probably want fallocate() for this use case
> rather than posix_fallocate() in any case,
> as we don't want to fall back to writing zeros.
> 
> Also I had a whole lot of fallocate() things to try
> once the fiemap() stuff landed, but unfortunately
> that doesn't work reliably on all file systems
> and is currently restricted to sparse files.
> So I need to dig out my notes on how to apply
> fallocate() to files with holes and "empty portions" again.

I thought a little about this today.

fallocate() is a feature to quickly allocate space in a file system.
It's useful for 3 things as far as I can see:

  1. Improved file layout for subsequent access
  2. Immediate indication of ENOSPC
  3. Efficient writing of NUL portions
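
A minimal sketch of point 2, assuming the Linux-specific fallocate() call from glibc. The helper name try_preallocate is made up for illustration and is not anything in coreutils. It reserves space up front and hands back the errno value, so the caller can tell "no space" apart from "this file system cannot preallocate".

```c
#define _GNU_SOURCE             /* fallocate() is a GNU/Linux extension */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical helper: try to reserve LEN bytes from the start of FD.
   Returns 0 on success, otherwise the errno value set by fallocate().
   ENOSPC means the space is not there; EOPNOTSUPP means the file
   system cannot preallocate and the caller should just carry on.  */
static int
try_preallocate (int fd, off_t len)
{
  assert (len >= 0);
  if (len == 0)
    return 0;                   /* nothing to reserve */
  if (fallocate (fd, 0, 0, len) == 0)
    return 0;
  return errno;
}
```

With mode 0 the file size is extended to LEN if it was smaller, and the reserved range reads back as NULs without any zeros having been written.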

Note 1. is somewhat moot with newer file systems that do "delayed allocation".
So what do we need to consider when using fallocate on the destination file?
Considering just cp for the moment, its inputs impacting this are the options:

  --sparse={auto,always,never}
  Note that with no --sparse option specified we behave as with --sparse=auto,
  where we try to detect holes based on st_size vs st_blocks

The other significant input is the construction of the source file.
Now data in a file can generally be classed into 4 types:

  Data:  normal data
  Zero:  normal data containing only NULs
  Hole:  unallocated data containing only NULs
  Empty: allocated data containing only NULs

  One can have any of the above types at any point in the file.
  Also 'Empty' is special in that it can extend beyond the apparent size.
  In fact this tail allocation is common on XFS for performance reasons.

An important factor is how well we can distinguish the above data classes.
There are currently three possible identification options:

  Heuristics
    This is used by default to see if holes might be present.
    The test is simply whether st_size is larger than the space
    accounted for by the allocated st_blocks.
    Note, this can fail for example in the case where there is
    a tail allocation not accounted for in the size like:

      +-----------+---+
      | D | E | H | E |
      +-----------+---+

    Traditionally when a sparse source is detected we check input blocks
    for all zeros and create a 'Hole' in the destination instead.
    This is inefficient as it requires reading all the NUL data
    and verifying that it is in fact NUL.
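
    A sketch of that size-versus-blocks comparison, assuming st_blocks is
    counted in 512-byte units as on Linux. The names looks_sparse and
    BLOCK_UNIT are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: st_blocks is in 512-byte units, as on Linux.  */
enum { BLOCK_UNIT = 512 };

/* Hypothetical helper: true if fewer blocks are allocated than the
   apparent size would need, i.e. the file probably contains holes.  */
static bool
looks_sparse (int64_t st_size, int64_t st_blocks)
{
  assert (st_size >= 0 && st_blocks >= 0);
  return st_blocks < st_size / BLOCK_UNIT;
}
```

    Taking each cell in the diagram above as 4096 bytes, the apparent size
    is 12288 but 24 units are allocated (D, E and the tail E), so
    looks_sparse (12288, 24) is false even though a hole is present.
    Without the tail allocation, looks_sparse (12288, 16) is true.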

  SEEK_HOLE
    Available on Linux since 3.1

    'Empty' is treated like a 'Hole' which at least
    allows 'Empty' portions to be processed quickly by `cp`.

    We lose the ability to copy the allocation from src to dst.
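
    A rough sketch of walking a file with SEEK_DATA/SEEK_HOLE, to show
    what the interface reports. The helper names are hypothetical, and a
    real copy loop would read and write each [data, hole) range rather
    than just add up its length:

```c
#define _GNU_SOURCE             /* SEEK_DATA and SEEK_HOLE */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: total bytes in FD that lseek() reports as data.
   Returns -1 on any error other than running out of data.  */
static off_t
count_data_bytes (int fd)
{
  off_t total = 0;
  off_t pos = 0;

  for (;;)
    {
      off_t data = lseek (fd, pos, SEEK_DATA);
      if (data < 0)
        return errno == ENXIO ? total : -1;  /* ENXIO: no data past POS */

      off_t hole = lseek (fd, data, SEEK_HOLE);
      if (hole < 0)
        return -1;

      assert (hole > data);     /* guards against looping forever */
      total += hole - data;     /* a copy would process [data, hole) here */
      pos = hole;
    }
}

/* Convenience wrapper taking a file name instead of a descriptor.  */
static off_t
count_data_bytes_in_file (const char *path)
{
  int fd = open (path, O_RDONLY);
  if (fd < 0)
    return -1;
  off_t n = count_data_bytes (fd);
  close (fd);
  return n;
}
```

    Whether an 'Empty' range shows up as data or as a hole here is up to
    the file system, and a file system with no real support reports the
    whole file as data.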

  fiemap
    Available on Linux since around 2.6.28

    Gives greater control by distinguishing Hole and Empty,
    thus allowing us to both efficiently copy and maintain allocation.

    Requires sync on ext4, xfs

    Code already done and used (with sync) for sparse files

    Note that not being able to use fiemap with non-sparse files
    means that we need to read() the empty extents, which is
    inefficient, especially in --sparse=always mode.
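
    As an illustration of how fiemap separates the classes: an extent
    flagged FIEMAP_EXTENT_UNWRITTEN is allocated but holds no data
    ('Empty'), and any gap between extents is a 'Hole'. This sketch only
    classifies extents that have already been fetched. The helper names
    and the region_class enum are hypothetical, and the ioctl call itself
    (FS_IOC_FIEMAP, with FIEMAP_FLAG_SYNC for the sync mentioned above)
    is left out:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/fiemap.h>       /* struct fiemap_extent, FIEMAP_EXTENT_* */

enum region_class { REGION_DATA, REGION_EMPTY };

/* Hypothetical helper: classify one extent returned by the ioctl.
   Ranges not covered by any extent are holes, see hole_before().  */
static enum region_class
classify_extent (const struct fiemap_extent *fe)
{
  assert (fe != NULL);
  return (fe->fe_flags & FIEMAP_EXTENT_UNWRITTEN) ? REGION_EMPTY
                                                  : REGION_DATA;
}

/* Bytes of 'Hole' between the end of the previous extent and FE.  */
static uint64_t
hole_before (uint64_t prev_end, const struct fiemap_extent *fe)
{
  assert (fe != NULL);
  return fe->fe_logical > prev_end ? fe->fe_logical - prev_end : 0;
}
```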


So given the above info, what functionality might the use
of fallocate() make available to cp?

Exact copy from source to dest:

  Copying the source layout would mean that one could, for example,
  create a backup copy of a large db file, which could then be used
  without worrying about fragmentation or ENOSPC issues.

  There is the argument that this might be better as a higher level
  file operation anyway, and perhaps `cp --reflink` might cover
  this use case on some file systems at least.

  fiemap gives us the most control, allowing us to copy even tail
  allocations from source to destination. But the sync issue
  makes it unusable in general at present, and its use is currently
  restricted to sparse files, where it avoids reading
  'Empty' and 'Hole' portions.

Copying sparse files

  It's worth noting again the caveat mentioned above: we
  might not recognise some sparse files due to tail allocation.

  Given that we use fiemap (with sync) for sparse files at present,
  we can augment the fiemap copying code to use fallocate where appropriate.
  So, depending on the options, the operations would be:
    --sparse=auto   => 'Empty' -> 'Empty'
    --sparse=always => 'Empty' -> 'Hole'  && discard tail allocation
    --sparse=never  => 'Hole'  -> 'Empty'
  Perhaps the first case could be simplified to initially doing:
    fallocate(dest, blocks*blocksize)
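
  The per-option mapping above could be written down as a small lookup.
  This is only a sketch: the enum and function names are hypothetical,
  and passing every other class through unchanged is my assumption,
  since the list above says nothing about 'Data' or 'Zero'.

```c
#include <assert.h>

enum sparse_mode { SPARSE_AUTO, SPARSE_ALWAYS, SPARSE_NEVER };
enum region { DATA, ZERO, HOLE, EMPTY };

/* Hypothetical helper: which class a source region becomes in the
   destination.  Only the three transitions listed above are encoded;
   everything else is assumed to be copied as-is.  */
static enum region
dest_region (enum sparse_mode mode, enum region src)
{
  switch (mode)
    {
    case SPARSE_AUTO:
      return src;                        /* 'Empty' -> 'Empty' */
    case SPARSE_ALWAYS:
      return src == EMPTY ? HOLE : src;  /* 'Empty' -> 'Hole' */
    case SPARSE_NEVER:
      return src == HOLE ? EMPTY : src;  /* 'Hole' -> 'Empty' */
    }
  assert (!"unknown sparse mode");
  return src;
}
```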

Copying normal files

  Note that using SEEK_HOLE for this case would only help
  to avoid reading 'Hole' and, more likely, 'Empty' portions,
  and should not affect the use of fallocate(dest).

  So assuming we initially did:

    if ! --sparse=always
      fallocate(dest, st_size)

  That would throw away any tail allocation in the source,
  which is probably OK as noted above. In fact we might always
  discard tail allocation for consistency, unless we can use fiemap
  for all cases.

I'll cook something up on this soon.

cheers,
Pádraig.




