Package request: STAR file parser and writer

Hello there,

First of all, I hope package requests are appropriate here. I will try my best to make this request as constructive as can be. If requests are not appropriate, I apologize, and I won’t take it personally if the discussion gets closed.

I think it would be very beneficial to have a Julia package implementing reading and writing of files in the STAR format. One application of this format is to store image metadata information when processing cryogenic electron microscopy data, and it’s common to have STAR files with hundreds of thousands of lines, sometimes millions of lines. In this context, these files contain a lot of information, only a small subset of which is exposed to users of the software that generates these files; many interesting things can be done with programmatic access to these files (statistics, visualizations, etc.). Since these files store tabular data, it would make sense for the parser to produce an output that could be readily turned into a DataFrame (from DataFrames.jl), like what CSV.jl does.

There is a Python package to read STAR files into pandas dataframes and write such dataframes out to STAR files: starfile. So, I anticipate people will recommend using this package through PyCall.jl. This is probably possible (I have not tried), but it would be far from ideal since it entails managing a Python installation. The good part, though, is that this package’s license (3-clause BSD) allows drawing inspiration from it as much as needed.

I would try to build a Julia package to read/write STAR files myself, but I am much too ignorant about too many things for this to be a tractable project: I don’t know enough computer science to know how to implement a parser, enough Python to understand how the starfile package works, nor enough Julia to implement all this. So, if anybody would like to take on this project, I will be very happy to help: I can help design an API; I can provide STAR files for testing purposes, including large ones (a couple hundred MB) to test for performance; maybe I can even do some coding if you can walk me through the logic of the implementation like I’m 5 and give me pointers (happy to read documentation any time, if it helps accomplish this).

Thank you in advance!

Unless there are particular advantages to this format, I would just use a pre-existing library in another language to convert it to a more widely used format like Parquet, CSV, JSON, HDF5, etc.

Try the registered package CrystalInfoFramework.jl. It reads and writes CIF files, which are the commonly used subset of STAR files - CIF files are just STAR files with no nested loops and restricted line lengths. If you do try it out, it would be great to get your feedback on what is missing from your point of view.

The get_loop method returns a DataFrame. As far as performance goes, on my i5 laptop a 500K mmCIF file from the PDB takes ~1s to read in.

@cjdoris I know there are ways to read these files by first converting them to a different format outside of Julia, but the point of a Julia package would precisely be that one doesn’t have to use such workarounds.

Regarding your suggestions, I know CSV would not work because STAR files can contain multiple tables (so they cannot map to a single CSV file). HDF5 would probably work (it’s designed to store multiple datasets). Not sure about JSON, but it would probably work too since it can store nested structures. I don’t know the other formats you mentioned.

@frtps thank you for pointing out this package! I will test that on my files with multiple tables. If it doesn’t work, I will ask the maintainer if compatibility with the complete STAR format (including multiple tables in a single file) would be within the scope of this package and reasonable to implement.

At the risk of straying off topic, let me just clarify that both CIF and STAR are happy with as many tables as you want in a single file. The “advantage” of STAR is that tables can be multi-level (so imagine potentially each cell having a table attached to it and so on to arbitrary depth). Although I have coded readers for this in the past, I avoid doing it now as I’m not aware of anybody who actually takes advantage of it. And it makes my head hurt.

Ah ok, this distinction was unclear to me until I read your message. I started a discussion at the repository, and it looks like reading the files I am interested in is relatively straightforward with this package, so that’s good news.

BioStructures.jl also has a mmCIF reader (MMCIFDict) and writer and can read multiple data blocks with readmultimmcif.

I saw it, but thought at first that it’s a very large dependency simply to read files. But that’s not too much of a problem actually. How easy is it to turn the result of readmultimmcif into a DataFrame?

The result is a Dict{String, MMCIFDict} where the key is the data token and the value is the MMCIFDict corresponding to that data token.

An MMCIFDict has String field names as keys and a Vector{String} as the value for every field, whatever the underlying data type.

Regarding data frames, it depends how you want to turn that into tabular data. You could do it for the atom records for instance by passing the columns into a data frame constructor. In fact there is another function in BioStructures to do that, see https://biojulia.net/BioStructures.jl/stable/documentation/#Reading-PDB-files, but it might not suit your generic needs.
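As a rough illustration of passing the columns into a constructor, here is a minimal sketch using only Base Julia. It assumes the dictionary behaves like a plain Dict with String keys and Vector{String} values, as described above. The `raw` dictionary is made-up sample data standing in for an MMCIFDict, and `columntable` and `parsecolumn` are hypothetical helper names, not part of BioStructures.

```julia
# Hypothetical sample data standing in for an MMCIFDict:
# String keys, Vector{String} values for every field.
raw = Dict(
    "_atom_site.id"          => ["1", "2", "3"],
    "_atom_site.type_symbol" => ["N", "C", "O"],
    "_atom_site.Cartn_x"     => ["10.5", "11.25", "9.0"],
    "_cell.length_a"         => ["50.0"],  # other category, ignored below
)

# Hypothetical helper: parse a column as Int, then Float64, when every value
# parses; otherwise return the strings unchanged.
function parsecolumn(values::Vector{String})
    for T in (Int, Float64)
        parsed = tryparse.(T, values)
        any(isnothing, parsed) || return Vector{T}(parsed)
    end
    return values
end

# Hypothetical helper: gather all fields whose key starts with `prefix` into a
# NamedTuple of equal-length vectors (a column table), sorted by key.
function columntable(d::AbstractDict, prefix::AbstractString)
    ks = sort!([k for k in keys(d) if startswith(k, prefix)])
    isempty(ks) && throw(ArgumentError("no keys start with $prefix"))
    nrows = length(d[first(ks)])
    all(k -> length(d[k]) == nrows, ks) ||
        throw(ArgumentError("columns under $prefix differ in length"))
    # Strip the prefix to get short column names.
    names = Tuple(Symbol(k[ncodeunits(prefix)+1:end]) for k in ks)
    cols = Tuple(parsecolumn(d[k]) for k in ks)
    return NamedTuple{names}(cols)
end

atoms = columntable(raw, "_atom_site.")
```

A NamedTuple of vectors is a Tables.jl-compatible column table, so with DataFrames.jl installed, `DataFrame(atoms)` should accept it directly. Note that columns containing placeholder values such as "?" or "." would stay as strings in this sketch.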

This is very helpful. I did some tests, and it looks like BioStructures mostly addresses my question. Thank you!