[ Back to Kevin's Homepage | Back to Ramblings and Brain Farts ]
Why is cpio better than tar? A number of reasons.
find . -type f -name '*.sh' -print | cpio -o | gzip >sh.cpio.gz
find . -type f -name '*.sh' -print >/tmp/includeme tar -cf - . -I /tmp/includeme | gzip >sh.tar.gz
find . -type f -name '*.sh' -print >/tmp/includeme tar -cf - . --files-from=/tmp/includeme | gzip >sh.tar.gz
A couple of specific notes here: for large lists of files, you can't put find in reverse quotes; the command-line length will be overrun; you must use an intermediate file. Separate find and tar commands are inherently slower, since the actions are done serially.
Consider this more complex case where you want a tree completely packaged up, but some files in one tar, and the remaining files in another.
find . -depth -print >/tmp/files egrep '\.sh$' /tmp/files | cpio -o | gzip >with.cpio.gz egrep -v '\.sh$' /tmp/files | cpio -o | gzip >without.cpio.gz
find . -depth -print >/tmp/files egrep '\.sh$' /tmp/files >/tmp/with tar -cf - . -I /tmp/with | gzip >with.tar.gz tar -cf - . /tmp/without | gzip >without.tar.gz ## ^^-- no there's no missing argument here. It's just empty that way
find . -depth -print >/tmp/files egrep '\.sh$' /tmp/files >/tmp/with tar -cf - . -I /tmp/with | gzip >with.tar.gz tar -cf - . -X /tmp/without | gzip >without.tar.gz
Again, some notes: Separate find and tar commands are inherently slower. Creating more intermediate files creates more clutter. gnutar feels a little cleaner, but the command-line options are inherently incompatible!
find . -depth -print >/tmp/files split /tmp/files for F in /tmp/files?? ; do cat $F | cpio -o | ssh destination "cd /target && cpio -idum" & done
Note that it would help if you could split the input into even sized pieces. I created a utility called 'npipe' to do this. npipe would read lines from stdin, and create N output pipes and feed the lines to them as each line was consumed. This way, if the first entry was a large file that took 10 minutes to transfer and the rest were small files that took 2 minutes to transfer, you wouldn't get stalled waiting for the large file plus another dozen small files queued up behind it. This way you end up splitting by demand, not strictly by number of lines or bytes in the list of files. Similar functionality could be accomplished with gnu-xargs' parallel forking capability, except that puts arguments on the command-line instead of streaming them to stdin.
find . -depth -print >/tmp/files npipe -4 /tmp/files 'cpio -o | ssh destination "cd /target && cpio -idum"'
How is this faster? Why not use NFS? Why not use rsync? NFS is inherently very slow, but more importantly, the use of any single tool is inherently single threaded. rsync reads in the source tree and writes to the destination tree one file at a time. If you have a multi processor machine (at the time I was using 16cpu's per machine), parallel writing became very important. I speeded the copy of a 8GB tree down to 30 minutes; that's 4.6MB/sec! Sure it sounds slow since a 100Mbit network can easily do 5-10MB/sec, but it's the inode creation time that makes it slow; there were easily 500,000 files in this tree. So if inode creation is the bottleneck, then I needed to parallelize that operation. By comparison, copying the files in a single-threaded manner would take 4 hours. That's 8x faster!
A secondary reason that this was faster is that parallel tcp pipes are less vulnerable to a lost packet here and there. If one pipe gets stalled because of a lost packet, the others will generally not be affected. I'm not really sure how much this made a difference, but for finely multi-threaded kernels, this can again be more efficient since the workload can be spread across all those idle cpu's
In my experience, cpio does an overall better job than tar, as well as being more argument portable (arguments don't change between versions of cpio!), though it may not be found on some systems (not installed by default on RedHat), but then again Solaris doesn't come with gzip by default either.