ANN: Feather.jl v0.4.0 (lazy edition)

Hello all, it’s finally time to announce something we’ve been working on and using extensively for a good while now: the new Feather.jl. (We still have to fix our docs; see latest for the v0.4.0 docs.)

For those of you who are not aware, feather is a lightweight binary format for tabular data. The most exciting thing about this release is that it now supports fully lazy loading of tables via memory mapping. So, for example:

using Feather, DataFrames, Statistics  # Statistics provides `mean` used below

df = Feather.read("a_table_that_is_3_GB.feather")  # this happens almost instantly

# this *only* reads the first few rows; no matter how big your file is, this operation should take approximately the same amount of time
head(df)

# the lazy loading is completely seamless. if you perform an operation, *only* the data you are asking for gets touched
# the following *only* reads in data from column_A
mean(df.column_A) 

# the following *only* reads in rows 100 to 110 from column_B
mean(df.column_B[100:110])

# or *only* rows 100 through 110 of the entire dataframe
df[100:110, :]

# but you are still just working with ordinary DataFrames
typeof(df) == DataFrame

(Note that memory mapping is disabled by default on Windows, so there you should do Feather.read(filename, use_mmap=true) to enable it.)
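Here is a minimal sketch of how you might handle this in cross-platform code; the use_mmap keyword is exactly as described above, while the file name and the Sys.iswindows() branch are just illustrative:

using Feather

filename = "a_table_that_is_3_GB.feather"
# explicitly opt in to memory mapping on Windows, take the default elsewhere
df = Sys.iswindows() ? Feather.read(filename, use_mmap=true) : Feather.read(filename)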

One caveat is that the feather metadata does not yet support tables larger than 4 GB (ultimately these will have to be spread across multiple files). Fortunately, multi-file support should be relatively simple to implement, so if the upstream format developers don’t come up with a standard we’ll eventually have our own solution.
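For anyone who runs into the 4 GB limit before a standard appears, here is a rough sketch of the kind of multi-file workaround alluded to above. The helper names, chunk size, and file naming scheme are purely illustrative, not part of the package:

using DataFrames, Feather

# write `df` to several feather files of at most `rows_per_chunk` rows each,
# as a crude way of keeping every individual file below the 4 GB limit
function write_chunked(df::DataFrame, basename::AbstractString; rows_per_chunk::Int=1_000_000)
    paths = String[]
    for (i, start) in enumerate(1:rows_per_chunk:nrow(df))
        stop = min(start + rows_per_chunk - 1, nrow(df))
        path = string(basename, "_", i, ".feather")
        Feather.write(path, df[start:stop, :])
        push!(paths, path)
    end
    return paths
end

# read each chunk (lazily) and concatenate back into one DataFrame
read_chunked(paths) = reduce(vcat, Feather.read.(paths))

Note that vcat-ing the chunks will most likely materialize the columns, so if you want to stay lazy you would operate on the chunked DataFrames individually.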


Good stuff! Just curious why you decided to disable memory mapping on Windows by default.

See here.

As memory mapping is now a central feature of feather, I suspect this may have to be revisited at some point.
