Pandas to DataFrame conversion with nested arrays takes very long

jules · March 9, 2019, 11:35am

I want to analyze experimental data that I gathered with a Python program. The data is saved as a pickled Pandas DataFrame. Each row is a trial, and there are about 20,000 of them. There are about 40 columns. Most of them hold time stamps, so plain floats, but some columns hold a whole array in each cell (time series data from a motion and eye tracker, around 500x9 floats per array) or simple Python objects (definitions of rigid bodies, each with a couple of floats, so nothing too big).

Loading these files from disk (about 3.5GB) using Pandas.jl takes only a couple of seconds. I then tried to convert them to a Julia DataFrame, because I assumed that type-stable computations on my columns need Julia arrays, not wrapped pandas Series. But the call

df = DataFrame(pandas_df)

takes almost half an hour to complete, which wipes out the time I could save during later computations. I don’t know what causes this, and I don’t know how to time the internals of the conversion. (The output is correct, though: arrays of arrays for the special columns, arrays of PyObjects for the rigid bodies, and arrays of floats, ints, or strings for the rest.)
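
One way to see where the time goes might be to time the conversion one column at a time instead of all at once. The sketch below is only an illustration under stated assumptions: `time_columns` and `narrow` are hypothetical helpers, not part of Pandas.jl or DataFrames.jl, and the `columns` dictionary is small stand-in data with the layout described above (one float column, one column holding a 500x9 array per cell). For the real data, `narrow` would be replaced by whatever single-column conversion is being measured.

```julia
# Hypothetical helper: time an arbitrary per-column conversion function.
# `convert_column` takes a column name and performs the conversion to measure.
function time_columns(convert_column, column_names)
    timings = Dict{String,Float64}()
    for name in column_names
        # @elapsed returns the seconds spent evaluating the expression
        timings[String(name)] = @elapsed convert_column(name)
    end
    return timings
end

# Stand-in data (an assumption, not the real file): one plain float column
# and one loosely typed column whose cells each hold a 500x9 array.
columns = Dict(
    "timestamp" => rand(100),
    "tracker"   => Any[rand(500, 9) for _ in 1:100],
)

# Stand-in conversion: narrow a column to a concretely typed vector.
narrow(name) = identity.(columns[name])

timings = time_columns(narrow, keys(columns))

# Print the slowest columns first.
for (name, seconds) in sort(collect(timings); by = last, rev = true)
    println(rpad(name, 12), round(seconds; digits = 4), " s")
end
```

The first call for each column type includes Julia's compilation time, so running the loop twice and reading the second set of timings should give a fairer picture.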

Any advice is appreciated! (Aside from "store your data differently"; that ship has sailed.)
