Move ASDF.jl to JuliaIO?

Would it make sense to move the ASDF package to JuliaIO? If so, how would I proceed?

-erik

I can’t comment on whether you should move the package, but if you do decide to do so, I put together a checklist of steps when transferring a package to a github organization: https://github.com/JuliaRobotics/administration/wiki/PackageTransferChecklist (you would just need to replace “JuliaRobotics” everywhere with the appropriate organization).

2 Likes

Sure, that’d be great :slight_smile: I see you’re already a member, but you’ll likely need admin rights?

I seem to have admin rights already, but I’d like to know the opinion of others first.

-erik

Seems like a good idea to me—having key packages in orgs provides a nice safety net since other people can do things like fix bugs and tag new releases without the original author being obliged to do anything. And it doesn’t prevent you from continuing maintenance as you care to.

1 Like

ASDF was initially conceived as a format for astronomy data, so I’d suggest moving it to JuliaAstro, where we already have FITSIO.jl, the package for reading the format that ASDF is meant to replace. Or do you think the format can also be used outside astronomy?

Also, are you thinking about having a native Julia implementation in the long term? That could be a good GSoC project in the future; JuliaAstro applies every year under the umbrella of the OpenAstronomy organization.

4 Likes

I am using the ASDF format outside astronomy. I see it as a potential replacement for HDF5 (ASDF is a simpler format with fewer features). Nothing in the ASDF format references astronomy, except that there is a standard way to translate FITS to ASDF. I thus prefer JuliaIO over JuliaAstro.

I have a C++ implementation of ASDF. It would be straightforward to translate this into native Julia if there were a suitably powerful YAML library available. Unfortunately, there doesn’t seem to be one. (ASDF metadata are stored as YAML.)
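
For illustration, reading just the metadata with a generic YAML parser could look roughly like this. This is an untested sketch assuming YAML.jl; it may well trip over ASDF’s custom YAML tags and directives, which is exactly the limitation I mean.

```julia
# Rough sketch (not the ASDF.jl implementation): extract the YAML header of an
# ASDF file and parse it with YAML.jl. An ASDF file begins with a YAML document
# that ends with a line containing only "...", followed by optional binary blocks.
using YAML

function read_asdf_tree(filename::AbstractString)
    header = IOBuffer()
    open(filename, "r") do io
        for line in eachline(io)
            println(header, line)
            line == "..." && break   # end of the YAML document
        end
    end
    # YAML.jl may reject ASDF's custom tags; this is the missing piece.
    return YAML.load(String(take!(header)))
end
```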

-erik

2 Likes

Well, the paper that introduced the format is titled “ASDF: A new data format for astronomy” and was published in the journal Astronomy and Computing. The ASDF standard does mention astronomy several times.

Of course there is no restriction of any kind, and the format can also be used outside astronomy, just like the FITS format (for example, FITS has been considered by the Vatican Library for preserving images).

When moving something to an organization, the main considerations in my view are maintainers and discoverability.
That the format is for astronomy doesn’t count for that much - almost every IO package in JuliaIO is actually domain specific, but they still live in JuliaIO and not in JuliaTheDomain…
If there were a lively community of maintainers with a strong interest in keeping the format going, I’d say that would be a good argument for moving it to JuliaAstro.

I like that most IO libraries are by now consistently part of JuliaIO, and considering that @schnetter himself doesn’t seem to focus on using ASDF only for astronomy, JuliaIO seems like a good place to me ;)

1 Like

Like BioJulia/YAML.jl? :grimacing:

Wherever the package ends up, I’ll be happy to contribute as much as I can :slightly_smiling_face: The ASDF format, together with the FITS format, will be used by the James Webb Space Telescope, and it would be great to have good support for it in Julia.

1 Like

Oooh. And there I thought that FITS was so entrenched that it would never go away. I guess an ASDF ↔ FITS converter in Julia would be convenient…

There are quite a few “advanced” ASDF features that ASDF.jl does not yet support. I’d be happy to discuss, in particular if your usage model is different from mine. Mine is: (1) data are generated and written in one go, (2) files do not change afterwards (no incremental modifications), (3) analysis often looks only at small fractions of a file.

Regarding BioJulia/YAML.jl: “(Dumping Julia objects to YAML has not yet been implemented.)” So no writer yet…

On the other hand, I’ve heard that every JSON file is a legal YAML file; maybe that would provide a workaround? Also, writing is much easier than reading because you can choose the format, e.g. quoting every string.
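
A minimal sketch of that workaround, assuming JSON.jl (the metadata tree here is made up, and ASDF’s custom YAML tags would be lost this way):

```julia
# JSON-as-YAML workaround: emit the metadata tree as pretty-printed JSON,
# which a YAML parser can read back (JSON is essentially a subset of YAML).
using JSON

metadata = Dict(
    "asdf_library" => Dict("name" => "ASDF.jl", "version" => "0.1.0"),
    "data"         => Dict("shape" => [100, 100], "datatype" => "float64"),
)

open("tree.yaml", "w") do io
    JSON.print(io, metadata, 2)   # 2-space indentation
end
```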

Well, to be honest I don’t have a usage model, as I’ve never used ASDF so far :joy: I heard about this format for the first time a couple of days ago (regarding its use by the JWST) and then saw your message with perfect timing. I’m probably most interested in reading ASDF files, but of course it would be great to have a feature-rich package in the end.

What advantages (speed, features, memory…) does it have over other alternatives? How does it compare to HDF5, Feather, TileDB, Parquet, or Arrow?

1 Like

We are currently using HDF5, and we explored ASDF as an alternative. I am not familiar with the other four libraries/formats you mention. Our use case is handling large datasets (GByte to TByte) in numerical calculations.

Our starting point was that HDF5 is sometimes unreasonably slow, and it was difficult for us to find out why. It might well be that our read or write access patterns are inefficient, and that a change in control flow would make things faster, but we found that the HDF5 library is too much of a black box for us to understand its performance behaviour.

Having implemented ASDF myself, I understand its performance characteristics. It basically forces the writer to emit all metadata first, and then (sequentially) the content of the datasets. It turns out that, if we use HDF5 in the same way, it is similarly efficient. Maybe HDF5 is slower by a factor of two, maybe it isn’t – benchmarking I/O on an otherwise busy HPC system has lots of noise.
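
For concreteness, the write pattern I mean looks roughly like this with HDF5.jl (an untested sketch with made-up dataset names and sizes, not our actual code):

```julia
# "ASDF-style" write pattern with HDF5.jl: emit the small metadata first,
# then the large datasets sequentially, in a single pass.
using HDF5

h5open("simulation.h5", "w") do file
    # metadata first (small, cheap to write)
    write(file, "iteration", 42)
    write(file, "time", 1.25)
    # then the bulk data, one dataset after another
    write(file, "density", rand(Float64, 256, 256, 256))
    write(file, "pressure", rand(Float64, 256, 256, 256))
end
```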

The size of ASDF and HDF5 files is very similar.

The main differences then are (and this is also described on the ASDF format web pages):

  • ASDF stores metadata in a human-readable form (YAML), which is often convenient

  • ASDF is a much simpler format, which has a theoretical advantage if you are left without a reader ten years from now and have to reverse-engineer it.

  • HDF5 is much better known and has a much larger user community.

  • HDF5 is structured like a file system (or a database for arrays of floating-point data), and one can thus modify existing files, i.e. add, remove, or change datasets. ASDF is write-once.

  • HDF5 has many more options for storing and laying out attributes and datasets on disk. Most of the time, people don’t care, though.

  • HDF5 can store datasets in chunks, which can greatly increase read performance if you access only a small subset of a dataset (see the sketch below).
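
To make that last point concrete, here is a hedged HDF5.jl sketch; the dataset name, sizes, and chunk shape are made up, and the exact create_dataset signature may differ between HDF5.jl versions:

```julia
# Create a chunked dataset, then read back only a small hyperslab of it.
using HDF5

h5open("chunked.h5", "w") do file
    dset = create_dataset(file, "data", datatype(Float64), dataspace((4096, 4096));
                          chunk = (256, 256))
    dset[:, :] = rand(Float64, 4096, 4096)
end

# Later, touch only one chunk's worth of data instead of the whole array:
patch = h5open("chunked.h5", "r") do file
    file["data"][1:256, 1:256]
end
```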

A few comments on the other formats:

  • From the Feather web pages: “Feather is not designed for long-term data storage”.
  • TileDB: looks very interesting; I’d give it a close look before deciding on ASDF.
  • Parquet and Arrow: do not mention multi-dimensional arrays in their documentation, so they might be targeting different use cases.

-erik

6 Likes

I looked at TileDB. It falls short in two respects:

  • There are no provisions for having links inside a file (e.g. hard links, soft links)
  • Arrays are broken into fragments, and each fragment is stored as a file. This creates a lot of files in the file system.

The second point is a no-go argument for me. On the systems I use (large scale HPC systems with parallel file systems), file metadata operations are very expensive. Creating many files is not a good idea.

I’m sure the on-disk format could be amended to avoid that. Even putting all the files into a tar archive might work, as TileDB’s files are immutable…

-erik

1 Like

Thanks a lot for reviewing data storage formats.

FYI, Feather files are Arrow memory on disk, so both share the same limitation. Given Arrow’s emphasis on a columnar memory format, I guess it is geared toward tabular data.

I thought this was because Arrow has not reached 1.0 yet. But I just found that the Arrow devs recently commented that:

The feather format will probably never be stable. For long term storage it is better to use a format like Apache Parquet, which is supported by pyarrow in Python and arrow in R.

If you do store data as Feather, there will always be a way to migrate the files to the Parquet format (e.g. using pyarrow) if there is a breaking change.

(Just noting it for anyone who has the same impression.)

Regarding the file format zoo, did you also look at netCDF? I found it quite nice as on-disk storage for the xarray Python package, which provides N-dimensional labeled arrays.