Pandas to DataFrame conversion with nested arrays takes very long


#1

I want to analyze experimental data that I gathered with a Python program. The data is saved as a pickled pandas DataFrame. Each row is a trial, and there are about 20,000 of them. There are about 40 columns. Most of them just hold time stamps, so plain floats, but some columns hold a whole array in each cell (time series data from a motion tracker and an eye tracker, around 500x9 floats per array) or simple Python objects (definitions of rigid bodies with a couple of floats each, so nothing too big).

Loading these files from disk (about 3.5 GB) using Pandas.jl takes only a couple of seconds. I then tried to convert the result to a DataFrame, because I assumed that type-stable computations on my columns need Julia arrays rather than wrapped pandas Series. But the call

df = DataFrame(pandas_df)

takes almost half an hour to complete, which wipes out the time savings I could get in later computations. I don’t know what causes this, because I don’t know how to time the internals of the conversion. (I do end up with the correct output, though: arrays of arrays for the special columns, arrays of PyObjects for the rigid bodies, and arrays of floats, ints, or strings for the rest.)
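For reference, here is a minimal sketch of how the conversion could be timed one column at a time using only Base Julia. The names `time_columns`, `getcol`, and `convert_col` are hypothetical, and the toy `Dict` is only a stand-in for the real table. How to pull a single column out of the Pandas.jl object is an assumption left to the caller and is not shown here.

```julia
# Hypothetical helper: time a conversion function on each column separately.
# `getcol` and `convert_col` are supplied by the caller; nothing here is Pandas.jl API.
function time_columns(names, getcol, convert_col)
    timings = Dict{String,Float64}()
    for name in names
        col = getcol(name)
        # @elapsed returns the wall-clock seconds spent evaluating the expression
        timings[String(name)] = @elapsed convert_col(col)
    end
    return timings
end

# Toy stand-in data (an assumption, not the real file):
# one flat float column and one column holding a 500x9 array per row.
toy = Dict(
    "timestamp" => rand(100),
    "tracker"   => Any[rand(500, 9) for _ in 1:100],
)

timings = time_columns(keys(toy), name -> toy[name], col -> collect(col))

# Print the slowest columns first.
for (name, t) in sort(collect(timings); by = last, rev = true)
    println(rpad(name, 12), round(t; digits = 4), " s")
end
```

The first call to each conversion also includes compilation time, so it is worth running the measurement twice before trusting the numbers. If one group of columns (for example, the nested arrays) dominates, that would narrow down where the half hour goes.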

Any advice is appreciated! (Aside from "store your data differently"; that ship has sailed :wink: )