
Why is cpio better than tar?

Why is cpio better than tar? A number of reasons.

  1. cpio preserves hard links, which is important if you're using it for backups.
  2. cpio doesn't have that annoying filename length limitation. Sure, gnutar has a "hack" that allows you to use longer filenames (it stores the real name in an extra pseudo-file entry inside the archive), but it's inherently not portable to non-GNU tars.
  3. By default, cpio preserves timestamps.
  4. When scripting, it has much better control over which files are and are not copied, since you must explicitly list the files you want copied. For example, which of the following is easier to read and understand?
    find . -type f -name '*.sh' -print | cpio -o | gzip >sh.cpio.gz
    

    or on Solaris:
    find . -type f -name '*.sh' -print >/tmp/includeme
    tar -cf - -I /tmp/includeme | gzip >sh.tar.gz
    

    or with gnutar:
    find . -type f -name '*.sh' -print >/tmp/includeme
    tar -cf - --files-from=/tmp/includeme | gzip >sh.tar.gz
    

    A couple of specific notes here: for large lists of files, you can't put find in backquotes, because the command-line length limit will be overrun; you must use an intermediate file. Separate find and tar commands are also inherently slower, since the two steps run serially instead of as a pipeline.
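
To make the command-line length point concrete, here is a minimal sketch that compares the size of a file list against the system limit reported by getconf ARG_MAX. The helper name fits_on_command_line and the "keep half in reserve" margin are assumptions for illustration, not something from the original setup.

```shell
# fits_on_command_line LIST [LIMIT]
# Hypothetical helper: succeeds if the names in LIST could plausibly be
# passed as arguments to a single command. LIMIT defaults to ARG_MAX.
fits_on_command_line() {
    list=$1
    limit=${2:-$(getconf ARG_MAX)}
    # Arithmetic expansion strips any padding some wc versions print.
    bytes=$(($(wc -c <"$list")))
    # The environment counts against the same limit, so keep half in reserve.
    [ "$bytes" -lt $((limit / 2)) ]
}
```

A script could then decide up front whether backquotes are even an option:

    fits_on_command_line /tmp/includeme || echo "too long: feed the list through a pipe or a file"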

    Consider this more complex case, where you want a tree completely packaged up, but with some files in one archive and the remaining files in another.

    find . -depth -print >/tmp/files
    egrep    '\.sh$' /tmp/files | cpio -o | gzip >with.cpio.gz
    egrep -v '\.sh$' /tmp/files | cpio -o | gzip >without.cpio.gz
    

    or under Solaris:
    find . -depth -print >/tmp/files
    egrep    '\.sh$' /tmp/files >/tmp/with
    tar -cf  - -I /tmp/with | gzip >with.tar.gz
    tar -cfX - /tmp/with .  | gzip >without.tar.gz
    ##     ^-- the X modifier takes an exclude file; its argument follows the archive name
    

    or with gnutar:
    find . -depth -print >/tmp/files
    egrep    '\.sh$' /tmp/files >/tmp/with
    tar -cf - -T /tmp/with   | gzip >with.tar.gz
    tar -cf - . -X /tmp/with | gzip >without.tar.gz
    

    Again, some notes: separate find and tar commands are inherently slower. Creating more intermediate files creates more clutter. gnutar feels a little cleaner, but its command-line options are incompatible with those of Solaris tar!
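
When a tree is split across two archives like this, it is worth checking that every name lands in exactly one of the two lists. The following is a rough sketch of such a check; partition_list is a hypothetical helper name, and the file paths are whatever you pass in.

```shell
# partition_list PATTERN LIST WITH WITHOUT
# Hypothetical helper: split LIST into names matching PATTERN (WITH)
# and names not matching it (WITHOUT), then verify nothing was lost.
partition_list() {
    pattern=$1 list=$2 with=$3 without=$4
    # grep exits 1 when nothing matches; that is not an error here.
    grep -E    "$pattern" "$list" >"$with"    || true
    grep -E -v "$pattern" "$list" >"$without" || true
    # Every name must appear in exactly one of the two halves.
    total=$(($(wc -l <"$list")))
    split_total=$(( $(wc -l <"$with") + $(wc -l <"$without") ))
    [ "$total" -eq "$split_total" ]
}
```

The two output lists can then be fed to cpio -o exactly as in the example above, for instance:

    partition_list '\.sh$' /tmp/files /tmp/with /tmp/without && cpio -o </tmp/with | gzip >with.cpio.gz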

  5. If you need to copy a lot of files from one machine to another in a hurry across a busy network, you can run multiple cpio processes in parallel. For example:
    find . -depth -print >/tmp/files
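    split /tmp/files /tmp/files    # second argument is the output prefix: /tmp/filesaa, /tmp/filesab, ...
    for F in /tmp/files?? ; do
    	cpio -o <"$F" | ssh destination "cd /target && cpio -idum" &
    	done
    wait    # block until every background copy has finished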
    

    Note that it helps if you can split the input into evenly sized pieces. I wrote a utility called 'npipe' to do this. npipe reads lines from stdin, creates N output pipes, and feeds each pipe a new line as soon as the previous one has been consumed. This way, if the first entry is a large file that takes 10 minutes to transfer and the rest are small files that take 2 minutes to transfer, you don't get stalled waiting for the large file plus another dozen small files queued up behind it. You end up splitting by demand, not strictly by number of lines or bytes in the list of files. Similar functionality could be accomplished with GNU xargs' parallel forking capability, except that xargs puts arguments on the command line instead of streaming them to stdin.

    find . -depth -print >/tmp/files
    npipe -4 /tmp/files 'cpio -o | ssh destination "cd /target && cpio -idum"'
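
npipe itself is not shown here, so the following is only a rough approximation of the same demand-driven idea using GNU xargs (the -d and -P options are GNU extensions). The function name parallel_feed and the batch size are assumptions. It works around the command-line issue by turning each batch of arguments back into lines on the worker's stdin. Unlike npipe, it starts a fresh worker for every batch rather than keeping N pipes open.

```shell
# parallel_feed JOBS BATCH LIST WORKER
# Hypothetical helper: run up to JOBS copies of WORKER at once, each fed
# BATCH lines of LIST on stdin. xargs starts the next batch as soon as a
# slot frees up, so slow batches do not hold up the rest of the list.
parallel_feed() {
    njobs=$1 batch=$2 list=$3 worker=$4
    WORKER=$worker xargs -d '\n' -n "$batch" -P "$njobs" \
        sh -c 'printf "%s\n" "$@" | sh -c "$WORKER"' sh <"$list"
}
```

With the file list from above, a call might look like:

    parallel_feed 4 200 /tmp/files 'cpio -o | ssh destination "cd /target && cpio -idum"'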
    

    How is this faster? Why not use NFS? Why not use rsync? NFS is inherently very slow, but more importantly, any single tool is inherently single-threaded. rsync reads the source tree and writes to the destination tree one file at a time. If you have a multiprocessor machine (at the time I was using 16 CPUs per machine), parallel writing becomes very important. I got the copy of an 8GB tree down to 30 minutes; that's about 4.6MB/sec! Sure, it sounds slow, since a 100Mbit network can easily do 5-10MB/sec, but it's the inode creation time that makes it slow; there were easily 500,000 files in this tree. Since inode creation was the bottleneck, that was the operation I needed to parallelize. By comparison, copying the files in a single-threaded manner would have taken 4 hours. That's 8x faster!
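
The arithmetic behind those figures can be checked with a few lines of awk. This sketch assumes 8GB means 8 * 1024 MB and takes 500,000 as the file count, which the text only gives as a rough lower bound; copy_rates is a made-up function name.

```shell
# copy_rates: recompute throughput and file-creation rates from the
# figures quoted in the text (8GB, ~500,000 files, 30 minutes vs 4 hours).
copy_rates() {
    awk 'BEGIN {
        mb = 8 * 1024; files = 500000      # tree size in MB, approximate file count
        par = 30 * 60; ser = 4 * 60 * 60   # elapsed seconds: parallel, single-threaded
        printf "parallel: %.1f MB/s, %.0f files/s\n", mb / par, files / par
        printf "serial:   %.1f MB/s, %.0f files/s\n", mb / ser, files / ser
        printf "speedup:  %.0fx\n", ser / par
    }'
}
copy_rates
# parallel: 4.6 MB/s, 278 files/s
# serial:   0.6 MB/s, 35 files/s
# speedup:  8x
```

Seen this way, the interesting number is files per second rather than MB/s: the parallel run created roughly 278 files per second against roughly 35 for the single-threaded copy.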

    A secondary reason that this was faster is that parallel TCP pipes are less vulnerable to a lost packet here and there. If one pipe gets stalled because of a lost packet, the others will generally not be affected. I'm not really sure how much of a difference this made, but on finely multi-threaded kernels it can again be more efficient, since the workload is spread across all those idle CPUs.

In my experience, cpio does an overall better job than tar, and its arguments are more portable (they don't change between versions of cpio!). It may not be found on some systems (it's not installed by default on RedHat), but then again, Solaris doesn't come with gzip by default either.


created - 2001.09.22 kjw
last modified - 2001.09.22 kjw