ANN: Tar.jl

I’d like to announce Tar.jl, a pure Julia package for reading and writing POSIX TAR files for the purpose of transmitting trees of files and directories from one system to another. The README should hopefully explain the design and usage clearly. If you need to record and send file trees between systems, please give it a try!

Ooo this is something to celebrate!

Could you add examples of how to use it to the README?

How would one do, for example, the following:

# transfer /home/julia to dest_machine
cd /home; tar -czpf - julia | ssh dest_machine 'cd /home; tar -xzpvf -'

I also miss a motivation section. Is the point to be able to do the standard things, at least at a basic level (for example, without restoring timestamps), on Windows?

If I understand correctly, permissions (-p) would not be supported.

Did you read the motivation part in the readme? I think it describes the use case extensively (intersystem file transfer).

You would do something like this:

using Tar
tarball = Tar.create("/home/julia")
# compress by your favorite means
# send to destination system somehow

On the destination system:

using Tar
# decompress tarball first
Tar.extract(tarball, "/home/julia")

The readme is mostly motivation, so I’m a bit unclear on what’s missing there.
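
For a concrete version of the two steps above, here is a minimal sketch that runs both halves on one machine. It assumes Tar.jl is loadable, and it uses throwaway temporary directories instead of /home/julia; the helper name `copy_tree` is made up for illustration and is not part of the package.

```julia
using Tar

# Hypothetical helper: pack `src` into a temporary tarball, then unpack it
# into `dest`. `dest` must not exist yet, or must be an empty directory.
function copy_tree(src::AbstractString, dest::AbstractString)
    tarball = Tar.create(src)      # returns the path of a temporary tarball
    try
        Tar.extract(tarball, dest) # in real use this runs on the other system
    finally
        rm(tarball; force=true)    # clean up the temporary tarball
    end
    return dest
end

# Demo tree in a temporary directory.
src = mktempdir()
mkpath(joinpath(src, "docs"))
write(joinpath(src, "docs", "hello.txt"), "hello from Tar.jl\n")

dest = joinpath(mktempdir(), "copy")
copy_tree(src, dest)
print(read(joinpath(dest, "docs", "hello.txt"), String))
```

In a real transfer the compress, send, and decompress steps would sit between the `Tar.create` and `Tar.extract` calls.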

It’s very nice indeed! Just curious - what’s the motivation for creating this package? Is GNU tar not nice enough? :smile:

it’s stated in the README

I showed a one-liner that needs no additional disk space for the tarball and runs tar + compress and decompress + untar in parallel. It also preserves timestamps and permissions. I think it is the standard, clever way to transfer a file tree, and it implements the steps you left as “somehow” in your comments in an elegant way.

I read the README and didn’t find any good reason not to prefer the proven old solution.

I was also curious whether there is some clever, simple (one-liner or two-liner) way to use tarball::IO to write a tarball over the network and read (and extract) it on the destination machine; something as elegant as the POSIX solution I showed.

One possible reason I saw was for people who need to use inferior OSes without ssh and tar utilities.

Second possible reason could be performance gain. Did you do performance comparison? Is there some benefit?

Feel free to continue using command-line tar. It’s quite possible to use Tar.jl to send data directly over a network connection (open a socket/http/whatever and pass the I/O handle to create), but based on your stance here, I suspect this package isn’t for you.

If you mean this:

Unlike the tar command line tool, which was originally designed to archive data in order to restore it back to the same system or to a replica thereof, the Tar package is designed for using the TAR format to transfer trees of files and directories from one system to another.

it seems like a misunderstanding of the Unix philosophy, where (simplified) everything is a file.

As I showed, that philosophy yields an elegant solution: connect processes through pipes and send the data/tarball to the destination machine, where other processes decompress and extract it without using additional space for the tarball. I don’t think this was unanticipated when the tar utility was created! Piping “files” between processes is a very old (and very powerful) basic paradigm of the Unix/POSIX ecosystem.

Is there a practical use case that you can share? Maybe you can describe how this package helps?

GNU tar isn’t available everywhere, e.g. not on Windows, macOS, or FreeBSD. On Windows it’s common for no tar command to be present at all. BinaryBuilder + artifacts makes it possible to install a consistent tar binary on any system, but there’s a bootstrapping problem: artifacts are transmitted as tarballs, so what do you use to unpack the artifact tarball containing the tar binary? This bootstrapping problem will be exacerbated by the Pkg protocol, which will be opt-in in Julia 1.4 (by doing export JULIA_PKG_SERVER=pkg.julialang.org in your shell) and opt-out in Julia 1.5: with this protocol, all resources are transferred as tarballs, including registries, packages and artifacts, so having a reliable, portable way to create and extract tarballs becomes rather important.

Even when you have a reasonable tar program, it’s remarkably hard to coax it to behave in a way that is “right” for the data transfer use case. For example, this is the line to unpack data correctly in Pkg’s PlatformEngines:

All of the surrounding code is just to try to figure out how to invoke some kind of tar-equivalent program in a cross-platform way. And on any new or slightly different system, it’s likely to break. We’re very eager to get rid of all of that code and replace it with portable, pure-Julia implementations that we know will work anywhere that Julia works. There are three pieces we need to get there:

  1. Ability to download over HTTP/S — provided by HTTP.jl.
  2. Ability to decompress files — potentially provided for .gz files by Inflate.jl, although we may want to use bzip2 instead since it has better compression for the kinds of files we’re sending.
  3. Ability to unpack tarballs — provided by Tar.jl

So Tar.jl is part of the plan to replace the nasty and brittle PlatformEngines code with pure Julia code for downloading and unpacking tarballs.

There’s one last, fairly non-obvious reason for Tar.jl to exist: to be able to compute and apply diffs to file trees. Part of the Pkg protocol plan linked above is to be able to send diffs from the previous state of a registry, package or artifact instead of having to send the whole thing. The bsdiff program is pretty good for computing and sending very small binary diffs. However, bsdiff only computes a diff for a single file. How do you compute diffs of trees of files? One simple approach is to generate uncompressed tarballs of the before and after trees and then compute and send a diff of those two tarballs. However, the client then needs to be able to reproduce the exact same before tarball from an on-disk file tree in order to apply the diff. If the server and client have tar programs that behave differently, these tarballs might be quite different. Even different versions of GNU tar might produce very different tarballs. Enter Tar.jl: it guarantees that if two file trees with the same content are turned into tarballs, those tarballs will be bit-for-bit identical. So if the server and the client are both using Tar.jl to generate tarballs, we can compute and apply tarball diffs of trees, which will allow us to efficiently update registries, packages and artifacts.
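
To make the bit-for-bit claim concrete, here is a small sketch that builds the same content twice in two separate directories, at different times, and compares the resulting tarballs. The file names and the `make_tree` helper are invented for the example; only `Tar.create` comes from the package.

```julia
using Tar

# Hypothetical helper: build a small tree with fixed contents in a fresh
# temporary directory.
function make_tree()
    root = mktempdir()
    mkpath(joinpath(root, "src"))
    write(joinpath(root, "README.md"), "# demo\n")
    write(joinpath(root, "src", "demo.jl"), "f(x) = 2x\n")
    return root
end

tree_a = make_tree()
sleep(1)              # so the two trees get different file timestamps
tree_b = make_tree()

tarball_a = Tar.create(tree_a)
tarball_b = Tar.create(tree_b)

# Compare the raw bytes of the two uncompressed tarballs.
identical = read(tarball_a) == read(tarball_b)
println("tarballs identical: ", identical)
```

If the guarantee described above holds, the comparison is true even though the two trees live at different paths and were written at different times, which is what makes diffing the tarballs workable.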

As a side benefit, this means that one more step of the Pkg resource chain is perfectly reproducible: given a particular resource version (registry, package, artifact), identified by its content hash, it will always produce the same uncompressed tarball. If we standardize the compression as well, then we will have perfect reproducibility of the final tarball. If, in addition, we start doing reproducible builds in BinaryBuilder, then the whole build chain will be reproducible from end to end: a build will always produce the same final tarball.

Again, if you are not convinced to use this, feel free not to.

If you did want to send a tarball over any kind of IO object, here’s how you would do it:

io = # connect to other end any way you want
Tar.create(source, io)

This streams the tarball to io directly without writing it to disk. On the receiving side:

io = # the other side of that connection
Tar.extract(io, destination)

This extracts the tarball from io without writing it to disk. The io object could, for example, be a TranscodingStream wrapping an HTTPS connection.
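
As a self-contained illustration of the same pattern, the sketch below uses an in-memory IOBuffer as a stand-in for the connection, so both ends run in one process. That stand-in is an assumption made for the example; with a real socket or HTTP stream the `seekstart` call would not be needed, and the two halves would run on different machines.

```julia
using Tar

# Sending side: a small tree to transfer.
source = mktempdir()
write(joinpath(source, "data.txt"), "streamed\n")

io = IOBuffer()           # stand-in for a socket, pipe, or HTTP body
Tar.create(source, io)    # writes the tarball straight into io
seekstart(io)             # rewind the buffer so it can be read back

# Receiving side: the destination must not exist yet or must be empty.
destination = joinpath(mktempdir(), "received")
Tar.extract(io, destination)

print(read(joinpath(destination, "data.txt"), String))
```

Compression would be layered on by wrapping the IO object on each side, as with the TranscodingStream mentioned above.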

Regarding the invocation of tar in PlatformEngines, in just the past month, we have realized that we needed to add the -m flag because without it, systems with clock skew were complaining that tarballs were being extracted with time stamps in the future. Then we realized that we needed the --no-same-owner flag because when Julia was running as root, tar was trying to change the owners/groups of the contents to whatever it was on the origin system, which was generally total nonsense on the new system, and causing all kinds of problems. Are we done adding flags to tar to get it to behave the way we want? Hopefully :grimacing: Do these flags work with non-GNU tar commands? Maybe. The defaults of command-line tar programs are just totally wrong for just sending a tree of files to another unrelated system.

And PlatformEngines is the worst—just read that file I linked, it’s such a nasty rat’s nest. I’m looking forward to getting rid of it so much.
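
For contrast with the -m and --no-same-owner flags above, here is a short sketch that lists what a Tar.jl tarball records for each entry. It assumes the headers returned by `Tar.list` expose `path`, `type`, `mode`, and `size` fields, which is my understanding of the package's header type; the file name is made up.

```julia
using Tar

# Build a tiny tree and pack it.
src = mktempdir()
write(joinpath(src, "hello.txt"), "hi\n")
tarball = Tar.create(src)

# Tar.list returns one header per entry in the tarball.
headers = Tar.list(tarball)
for h in headers
    println(rpad(h.path, 12),
            " type=", h.type,
            " mode=", string(h.mode, base=8),
            " size=", h.size)
end
```

As far as I know, owner, group, and timestamp are not part of what gets listed or restored, so there is nothing to switch off on extraction.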

It’s perhaps worth explicitly noting that Tar.jl is using a proven standard for tar files… and files created by Tar.jl can be opened by other POSIX compliant tar implementations. It’s just a minimal implementation (in less than 1k loc) that solves some specific pain points and provides some nice advantages as Stefan describes above.

The API is also quite a bit nicer for programmatic use than calling tar from Julia.

All I know is that I used this on a Windows VM and it worked perfectly for my needs (made a zip). It was far easier to grok than ZipFiles (no offense, ZipFiles!) and had no blemishes. More packages of this quality and utility, please!

Having read that file is the very reason I came across this thread. I wanted to know whether the pieces needed to bootstrap the unpacking of variously compressed Artifacts with a pure-Julia implementation were available. It seems to me that once the BinaryWrappers can be acquired and unpacked, it will then be possible to use those wrapped binaries to unpack other artifacts compressed with different compression types.