Sharing a Python dataframe in Julia

Hi there,
I have a small dataframe in a Python script (300 rows, each row probably about 1k). I am new to Julia, so I am uncertain how to share the memory space between the Python script and a Julia program.

I was going to look at Feather, but I wonder whether there is a better way.

Thank you

It’s unclear to me what exactly you mean by “share the memory space between the python script and a julia program”. But if all you need is to load the data in Julia, you can save it to disk in some format in the Python script and then load it from the Julia code.

If the dataframe is small, I think saving it to CSV and then reading it in Julia using CSV.jl could suffice (there are probably higher-performance storage formats out there, but if you only need to do this once for a small dataframe it shouldn’t matter much).
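
A minimal sketch of the Python half of that round trip, using only the standard library so it runs anywhere. The `columns` dict is a hypothetical stand-in for the small dataframe, and `write_columns_csv` / `read_columns_csv` are illustrative helpers, not part of any package. With pandas the write step would be `df.to_csv(path, index=False)`, and on the Julia side CSV.jl's `CSV.read(path, DataFrame)` would load the file.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for the small dataframe: a dict of equal-length columns.
columns = {
    "col1": [1, 2],
    "col2": ["test", "try_again"],
    "col3": [12.232, 178.21],
}


def write_columns_csv(columns, path):
    """Write a dict of columns to a CSV file with a header row."""
    names = list(columns)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(names)
        # zip(*cols) turns the column lists into rows.
        writer.writerows(zip(*(columns[n] for n in names)))


def read_columns_csv(path):
    """Read a CSV file back into a dict of columns (all values as strings)."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        names = next(reader)
        rows = list(reader)
    return {n: [row[i] for row in rows] for i, n in enumerate(names)}


csv_path = os.path.join(tempfile.mkdtemp(), "df.csv")
write_columns_csv(columns, csv_path)
round_trip = read_columns_csv(csv_path)
```

Note that CSV carries no type information: everything comes back as text here, and a reader has to infer or be told the column types.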

I mean exactly “share the memory space”, so I don’t want to save it to disk, but thank you for your suggestion.
I would like to examine the creation and modification of the dataframe in Python 3+, then use something like Feather to access it inside a Julia program. At first I would just read it, but later on I want to see how effective it is when I modify it in both programs. I would then move the Julia code to a different machine to see what happens.

I see. So, as I understand it, you want to be able to read/write to the dataframe simultaneously from both Python and Julia, or something like this. Is this correct?

I’m not an expert on dataframes, but from looking at the Feather.jl docs it seems to me that Feather is just a binary format for saving dataframes to disk, so I am not sure how it would be conceptually different from saving to disk as CSV (it might of course be faster, but you are still saving to disk). Am I missing something?

I haven’t used it, but I believe PyCall allows for no-copy/shared memory for basic types. See
https://github.com/JuliaPy/PyCall.jl/blob/master/README.md#arrays-and-pyarray

Sharing an entire dataframe is probably going to be more difficult as I doubt the dataframes will be using the same memory layout.

@orialb I am new to Julia, so it’s VERY possible I misread the Feather.jl docs, as I rushed through them. Thank you for taking a look with a more careful eye. I suppose that I could create a RAM disk, BUT that’s REALLY not what I want to do. I was hoping to just share the memory space.

Hi @robsmith11
Thank you for taking an interest. I did look at PyCall but was worried about the dataframe memory layout as well. I am also looking at PyJulia (https://github.com/JuliaPy/pyjulia), but its experimental status worries me a little.

Why do you want to use shared memory? Are your performance requirements really that high? Passing dataframes back and forth with PyCall should be fairly fast.

If you insist on shared memory, I’d suggest trying to work with plain vectors rather than a dataframe.

To share non-trivial data structures between languages, you would need to design them specifically with interoperability in mind. It’s highly unlikely that something like a dataframe would have the same memory layout for any two languages/packages.
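
To make the "plain vectors" suggestion concrete, here is a minimal sketch, assuming Python 3.8+ and using the standard-library `multiprocessing.shared_memory` module. It places one float64 column in a named shared-memory block and reads it back through a second handle, the way a second process would. The `values` list is made up for illustration. Whether and how a Julia process could map the same block (for example through its Mmap standard library, on a platform where the block is exposed as a file) is an assumption and is not shown.

```python
import struct
from multiprocessing import shared_memory

values = [12.232, 178.21, 3.5]   # hypothetical float64 column
fmt = f"<{len(values)}d"         # little-endian float64, one per value
nbytes = struct.calcsize(fmt)    # 8 bytes per value

# Producer side: create a named block and write the raw bytes into it.
shm = shared_memory.SharedMemory(create=True, size=nbytes)
struct.pack_into(fmt, shm.buf, 0, *values)

# Consumer side (normally another process): attach to the block by name.
peer = shared_memory.SharedMemory(name=shm.name)
seen = list(struct.unpack_from(fmt, peer.buf, 0))

# A write through one handle is visible through the other: same memory.
struct.pack_into("<d", peer.buf, 0, 99.0)
first_after_write = struct.unpack_from("<d", shm.buf, 0)[0]

# Clean up: close both handles, then remove the block.
peer.close()
shm.close()
shm.unlink()
```

This only works because the column is a flat run of fixed-size values with an agreed byte order. A dataframe object with its index, column metadata, and string columns has no such agreed layout, which is the point made above.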

I want to get an idea of how the two environments work together. It’s NOT a speed issue, more a matter of getting to know Julia. I agree that vectors would be easier. I was always going to use PyCall but wanted to know if there is anything else out there.
Thanks

To my knowledge, the memory layouts of pandas and Julia dataframes are different.
You can use PyJulia (in Python) to create a Julia dataframe as follows:

import pandas as pd
from julia import Julia
jl = Julia()
jl.eval("using DataFrames")
df = pd.DataFrame([{'col1': 1, 'col2': 'test', 'col3': 12.232}, {'col1': 2, 'col2': 'try_again', 'col3': 178.21}, ])
df_julia = jl.DataFrame(df.to_dict(orient='list'))

This works fine for small amounts of data. For large amounts of data you should pass NumPy arrays (which have the same memory layout as Julia arrays for C datatypes; I tested this with GB-sized data). However, you need to prepare/parse your data from and to strictly typed arrays on both the Python and Julia sides.
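
As a rough illustration of what "strictly typed arrays" means on the Python side, the sketch below uses the standard-library `array` module as a stand-in for NumPy, so it runs without extra packages. It parses the numeric fields of some hypothetical records into fixed-width columns and takes a zero-copy `memoryview` of one of them. NumPy arrays expose their data through the same buffer protocol. The string column is left out on purpose, since variable-length strings do not fit a fixed-width C layout.

```python
from array import array

# Hypothetical records, in the same shape as the example above.
records = [
    {"col1": 1, "col2": "test", "col3": 12.232},
    {"col1": 2, "col2": "try_again", "col3": 178.21},
]

# Parse the numeric fields into strictly typed, contiguous columns.
col1 = array("q", (r["col1"] for r in records))  # signed 64-bit integers
col3 = array("d", (r["col3"] for r in records))  # 64-bit floats

# A memoryview exposes the column's buffer without copying it.
view = memoryview(col3)

# Writing through the view changes the original column: same memory.
view[0] = 1.5
```

After the last line `col3[0]` is `1.5`, because `view` and `col3` refer to the same bytes. Converting through `to_dict` instead builds new Python objects for every cell.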

The other way around (calling Python from Julia) works as well with PyCall (actually PyJulia is based on PyCall).

@lungben
Thank you for your excellent guidance. I will certainly look into this more. I liked this presentation given at PyData 2016, as Mr Lattner spent some time on the diagnostics of the memory layouts.

Since 2016 things have moved on a tad, and ursalabs.org Apache Arrow is certainly interesting, as is NVIDIA RAPIDS. Thank you so much for your code sample and for sharing your experiences.

Thanks for the video link, very interesting!
Now I finally understand the motivation of column-major array ordering.
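
For anyone else reading, a small sketch of what column-major ordering means for a table held in one flat buffer. Julia arrays are column-major, while NumPy defaults to row-major (C order). The function names and the 2x3 table here are made up for illustration.

```python
def row_major_offset(i, j, nrows, ncols):
    """Flat index of element (i, j) when rows are stored one after another."""
    return i * ncols + j


def col_major_offset(i, j, nrows, ncols):
    """Flat index of element (i, j) when columns are stored one after another."""
    return j * nrows + i


# Hypothetical 2x3 table: two rows, three columns.
table = [[1, 2, 3],
         [4, 5, 6]]
nrows, ncols = 2, 3

row_major = [0] * (nrows * ncols)
col_major = [0] * (nrows * ncols)
for i in range(nrows):
    for j in range(ncols):
        row_major[row_major_offset(i, j, nrows, ncols)] = table[i][j]
        col_major[col_major_offset(i, j, nrows, ncols)] = table[i][j]

# In the column-major buffer, each column occupies a contiguous slice.
second_column = col_major[1 * nrows:2 * nrows]
```

In the column-major buffer each column is a contiguous slice, which suits dataframe-style work where operations usually run over one column at a time.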