bug-coreutils

bug#18681: cp Specific fail example


From: Linda Walsh
Subject: bug#18681: cp Specific fail example
Date: Sun, 19 Oct 2014 23:20:00 -0700
User-agent: Thunderbird



Bob Proulx wrote:
Linda Walsh wrote:
Bob Proulx wrote:
Also consider that if cp were to acquire all of the enhancements
that have been requested for cp as time has gone by then cp would
be just as featureful (bloated!) as rsync and likely just as slow
as rsync too.
        Nope... rsync is slow because it does everything over a client-
server model -- even when it is local.  So everything is written through
a pipe... that's why it can't come close to cp -- and why cp would never
be so slow -- I can't imagine it using a pipe to copy a file anywhere!

The client-server structure of rsync is required for copying between
systems.  Saying that cp lacks it isn't a fair comparison, since cp
would have it too if cp were to add every requested feature.
---
        cp was designed for local->local copy.
        rsync was designed for local->remote synchronization (thus
'r(emote) sync').  Comparing code quality between a Java->native-code
compiler and a compiler developed for a native platform is entirely
fair -- both started out with different design goals, so
each ends up with pluses and minuses that are an effect of that goal.  If you
claim comparing such effects isn't fair, then it's not fair to compare
any algorithm with another, because algorithms inherently have
their pluses and minuses and are often chosen for use in a particular
situation because of those pluses and minuses.

        So let's compare using 'cp' with rsync for copying a remote file.
The choice of tools depends on the quality of the remote connection, but
in most remote connections, today, reliability isn't usually an issue, as
they flow over TCP, and file-transfer protocols like NFS or CIFS also have
checks to allow users to reconnect after an interruption (like a machine
reboot).
Depending on timeout settings, 'cp' already has a recover-after-interruption
ability when used with NFS or CIFS.  CIFS doesn't tolerate a system reboot
in the middle of a copy, whereas NFS can recover from one if the client
uses hard mounts.  But on a local network, I regularly use 'cp' with
CIFS and it does a faster job than rsync -- over a reliable local network.


I am sure that if I search the archives I
would find a request to add client-server structure to cp to support
copying from system to system. :-)
----
        We are comparing where the tools are at, _not_ where they _could_
have been had previous algorithm choices been ignored.  We are talking
about a local->local copy (in the base note), so glossing over the slowness
of rsync in doing such is entirely fair.  If you want some level of
recovery after an interrupt, NFS is a better choice for a local network --
client connections can continue even after a server reboot.  But if we
are talking local->local reliability, the simple, closest solution would be
SMB/CIFS.

Using a 1GB file as an example (and throwing in a 'dd' for comparison):

time rsync 1G ishtar:/home/law/1G
20.13sec 1.29usr 2.68sys (19.73% cpu)
time cp 1G /h/.
6.94sec 0.01usr 1.10sys (16.16% cpu)
time dd if=1G of=/h/1G bs=256M oflag=direct
4+0 records in
4+0 records out
1073741824 bytes (1.1 GB) copied, 3.4694 s, 309 MB/s
3.50sec 0.00usr 0.51sys (14.64% cpu)

Here again, we see rsync doing the same job as cp
in about 3x the time.

For a single file over a local net 'dd' is a better bet.
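The shape of the benchmark above can be reproduced locally with a scaled-down sketch (file size, block sizes, and paths here are invented for the demo, not the original 1G test; 'oflag=direct' is dropped since not every filesystem supports it):

```shell
# Make a 64 MiB test file, then copy it with cp and with dd,
# roughly mirroring the 1G comparison above.
dd if=/dev/zero of=testfile bs=1M count=64 2>/dev/null

time cp testfile copy-by-cp

# dd with a large block size, in the spirit of 'bs=256M' above
time dd if=testfile of=copy-by-dd bs=16M 2>/dev/null

# both copies should be byte-identical to the source
cmp testfile copy-by-cp && cmp testfile copy-by-dd && echo identical
```

On a local disk the cp and dd times are usually close; the large gap in the original numbers comes from rsync's extra machinery, not from raw copying speed.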


Now I will proactively agree that it would be nice if rsync detected
that it was all running locally and didn't fork and instead ran
everything in one process like cp does.  But I could see that coming
to rsync at some time in the future.  It is an often requested
feature.
---
        For many years.


This is something to consider every time someone asks for a
creeping feature to cp.  Especially if they say they want the feature
in cp because it is faster than rsync.  The natural progression is
that cp would become rsync.
        Not even!  Note: cp already has a comparison function
built in that it uses during "cp -u"...

I am not convinced of the robustness of 'cp -u ...' interrupt, repeat,
interrupt repeat.  It wasn't intended for that mode.
---
        Neither is rsync in its default mode.  It compares
timestamps and size, nothing more.  I'd be suspicious of either
rsync's OR cp's chances in such a situation.  But USUALLY, people
don't interrupt a copy many times -- or even once -- so cp is usually
faster...
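The timestamp-and-nothing-more check of 'cp -u' is easy to see with a toy case (file and directory names here are made up for the demo):

```shell
# 'cp -u' copies only when the source is newer than the destination.
mkdir -p src dst
echo "version 1" > src/f
echo "old copy"  > dst/f
touch -d '2020-01-01' dst/f   # make the destination older
cp -u src/f dst/f             # source is newer -> copied
grep -q "version 1" dst/f && echo updated

echo "version 2" > src/f
touch -d '2020-01-01' src/f   # now make the source older
cp -u src/f dst/f             # destination is newer -> skipped
grep -q "version 1" dst/f && echo skipped
```

Note that no contents are compared at all: the second copy is skipped purely because the destination mtime is newer.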


Is there any code path that could leave a new file in the target area
that would avoid copy?  Not sure.  Newer meets the -u test but isn't
an exact copy if the time stamp were older in the original.  But with
rsync I know it will correct for this during a subsequent run.
---
        Not necessarily.  It doesn't do checksumming by default.  Certainly,
if you used rsync with '-u', rsync will not be much better at recovery,
since target files with more recent timestamps may be left in the
target dir.  I don't think rsync or cp traps a control-C abort to clean up
target files.



built in that it uses during "cp -u"... but it doesn't go through
pipes.  It used to use larger buffer sizes, or maybe tell posix
to pre-alloc the destination space, dunno, but it used to be
faster...  I can't say for certain, but it seems to be using...

Often the data sizes we work with grow larger over time making the
same task feel slower because we are actually dealing with more data
now.
---
        I was comparing copy times with the same files,
not from years ago to now.


        Another reason rsync is so slow -- it uses
a relatively small I/O size (1-4k, last I looked).  I've asked them
to increase it, but going through a pipe it won't help a lot.

Nod.  Rsync was designed for the network use case.  It could benefit
with some tuning for the local case.  A topic for the rsync list.
---
Been there, done that.  Still comparing current-to-current, not
hypotheticals.



        Also in rsync, they've added the posix calls to reserve
space in the target location for a file being copied in.
Specifically, this is to lower disk fragmentation (does
cp do anything like that?  It's been a while since I looked).

I don't know.  It would be worth a look.
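The preallocation being discussed is the posix_fallocate()/fallocate() family; from the shell, the same hint can be issued with the fallocate(1) utility (whether this matches the exact call either tool makes is an assumption of this sketch):

```shell
# Preallocate the full destination size before writing, so the
# filesystem can hand out contiguous extents (fewer fragments).
SIZE=$((8 * 1024 * 1024))           # 8 MiB, an arbitrary demo size
fallocate -l "$SIZE" prealloc-demo  # issues fallocate(2) on Linux
stat -c '%s bytes allocated' prealloc-demo
```

A copy tool that knows the source size up front can make the same call on the destination before writing the first byte.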

The advantage of rsync is that it can be interrupted and restarted and
the restarted process will efficiently avoid doing work that is
already done.  An interrupted and restarted cp will perform the same
work again from start to finish.
        I wouldn't trust that it would.  If you interrupt it at exactly
the wrong time, I'd be afraid some file might get set with the right
data but the wrong meta info (ACLs, primarily).

The design of rsync is to copy the file to a temporary name beside the
intended target.  After the copy, the timestamps are set.  After the
timestamps are set, the file is renamed into place.  An interrupt
that happens before that rename time will cause the temporary file to
be removed.  An interrupt that happens after the rename is, well,
after that, and the copy is already done.  Since rename on the local
file system is atomic, this is guaranteed to function robustly.  (As
long as you aren't using a buggy file system that changes the order of
operations.  That isn't cool.  But of course it was famously seen in
ext4 for a while.  Fortunately sanity has prevailed and ext4 doesn't
do that for this operation anymore.  Okay to use now.)
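The temp-file-plus-rename scheme described above can be sketched in a few lines of shell (the function name and temp naming are illustrative; rsync's actual temp naming differs):

```shell
# Copy into a temporary name beside the target, set timestamps,
# then rename into place.  rename(2) within one filesystem is atomic,
# so readers see either the old file or the complete new one -- never
# a half-written copy.
safe_cp() {
    src=$1 dst=$2
    tmp=$dst.tmp.$$                      # temp file beside the target
    cp "$src" "$tmp" || { rm -f "$tmp"; return 1; }
    touch -r "$src" "$tmp"               # carry over the source timestamp
    mv "$tmp" "$dst"                     # atomic rename into place
}

echo "payload" > original
safe_cp original installed && echo done
```

An interrupt before the mv leaves only the temp file, which a cleanup pass (or rsync's own signal handling) can remove; an interrupt after the mv leaves a complete, correctly-stamped target.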

If I am doing a simple copy from A to B then I use 'cp -av A B'.  If I
am doing it the second time then I will use rsync to avoid repeating
previously done work 'rsync -av A B'.
        Wouldn't cp -auv A B do the same?

Do I have to go look at the source code to verify that it doesn't? :-(
---
        My timing says cp is 20x faster for that 1G file case.  It also
shows that rsync doesn't use a tmp file in the update case:
 time cp -au 1G /h
0.03sec 0.00usr 0.03sys (79.47% cpu)
 time rsync -au 1G ishtar:/home/law/1G
0.60sec 0.06usr 0.09sys (25.12% cpu)


I assume it doesn't without looking.  I assume cp copies in place.  I
assume that cp does not make a temporary file off to the side and
rename it into place once it is done and has set the timestamps.
---
        I assume rsync doesn't either -- if it is comparing against
a file already in place, for it to transfer the whole file... nope.  I
assume that cp copies to the named destination directly and updates
the timestamps afterward.  That creates a window of time when the file
is in place but has not had the timestamp placed on it yet.

Which means that if the cp is interrupted on a large file that it will
have started the copy but will not have finished it at the moment that
it is interrupted.  The new file will be in place with a new
timestamp.  The second run with cp -u will avoid overwriting the file
because the timestamp is newer.  However the contents of the file will
be incomplete, or at least not matching the source copy at the time of
the second copy.

If my assumptions in the above are wrong please correct me.  I will
learn something.  But the operating model would need to be the same
portably across all portable systems covered by posix before I would
consider it actually safe to use.
---
        The same happens in rsync -- no tmp file is involved.  It compares
time stamps and doesn't copy.
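The hazard Bob describes -- a partial copy carrying a newer timestamp, which then fools 'cp -u' -- is easy to stage (file names are invented for the demo):

```shell
# Simulate an interrupted cp: the destination holds only a prefix of
# the source, but carries a newer timestamp than the source.
printf 'complete contents\n' > source
touch -d '2020-01-01' source        # source looks old
printf 'complete' > dest            # truncated, "interrupted" copy;
                                    # dest mtime is now, i.e. newer

cp -u source dest                   # -u sees dest as newer -> skips
cmp -s source dest || echo "still incomplete"
```

The second run silently leaves the truncated file in place, which is exactly why a timestamp-only check is not a safe restart mechanism for either tool.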



If I want progress indication...  If I want placement of backup files
in a particular directory...  If I want other fancy features that are
provided by rsync then it is worth it to use rsync.
...trimmed simple benchmark...
 $ time cp -a coreutils junk/
By default cp -a transfers acls and ext-attrs and preserves
hard links.   Rsync doesn't do any of that by default.
You need to  use "-aHAX" to compare them ...

Good catch.  :-)

You have to call them
out as 'extra' with rsync, so the above test may not be what it seems.
Though if you don't use ACLs (which I do), then maybe the above
is almost reasonable.  Still... you should use -aHAX.

I didn't have any hard links, ACLs, or extended attributes in the test
case, so it shouldn't matter for the above.

Is your rsync newer? i.e. does it have the posix-pre-alloc
hints?... Mine has a pre-alloc patch, but I think that was
suse-added and not the one in the mainline code.  Not sure.

rsync --version
rsync  version 3.1.0  protocol version 31
    64-bit files, 64-bit inums, 64-bit timestamps, 64-bit long ints,
    socketpairs, hardlinks, symlinks, IPv6, batchfiles, inplace,
    append, ACLs, xattrs, iconv, symtimes, prealloc, SLP

I happened to run that test on Debian Sid and it is 3.1.1.  However
Debian Stable, which I have most widely deployed, has 3.0.9.  So you
are both ahead of and behind me at the same time. :-)

        Throw a few TB of copies at rsync -- where all the data
won't fit in memory...  It also, I'm told, has problems with
hardlinks, ACLs, and xattrs slowing it down, so it may be a
matter of usage...

I have had problems running rsync with -H for large data sets.  Bad
enough that I recommend against it.  Don't do it!  I don't know
anything about -A and -X.  But rsync -a is fine for very large data
sets.
----
        But then you can't compare to 'cp', which does handle
that case.


(don't ya just love performance talk?)

Except that we should have moved all of this to the discussion list.
---
:-( 'discussion list'? -- bug-coreutils? (don't know about others)...

'sides, I didn't bring up rsync, all I added was
"If rsync wasn't so slow at local I/O...*sigh*.... "


It's good for when you need "diffs", but not as a general replacement
for 'cp'.





