Sharing a python dataframe in Julia

anon69491625 · January 27, 2020, 10:57am

Hi there,
I have a small dataframe in a Python script (300 rows, each row probably about 1k). I am new to Julia, so I am uncertain how to share the memory space between the Python script and a Julia program.

I was going to look at feather but wonder if there is a better way?

thank you

orialb · January 27, 2020, 2:25pm

It’s unclear to me what exactly you mean by “share the memory space between the python script and a julia program”. But if all you need is to load the data in Julia, I guess you can just save it to disk in some format from the Python script and then load it from the Julia code?

If you say the dataframe is small I think saving it to csv, and then reading in Julia using CSV.jl could suffice (probably there are higher performance storage formats out there, but if you need to do this one time for a small dataframe it shouldn’t matter much).
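
A minimal sketch of the Python half of that suggestion, assuming the table is small enough that a text format is fine. It uses only the standard library so it runs without extra packages; the sample rows and the file name `shared_df.csv` are made up for illustration. With pandas, the write step would simply be `df.to_csv(path, index=False)`.

```python
import csv
import os
import tempfile

# Hypothetical sample data standing in for the ~300-row dataframe.
rows = [
    {"col1": 1, "col2": "test", "col3": 12.232},
    {"col1": 2, "col2": "try_again", "col3": 178.21},
]

def write_csv(records, path):
    """Write a list of dicts to a CSV file with a header row."""
    fieldnames = list(records[0].keys())
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)

def read_csv(path):
    """Read the file back as a list of dicts (all values come back as strings)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# The file name is an assumption; any location both programs can reach works.
csv_path = os.path.join(tempfile.gettempdir(), "shared_df.csv")
write_csv(rows, csv_path)
round_trip = read_csv(csv_path)
print(round_trip[0])
```

On the Julia side, reading the file should be something along the lines of `CSV.read(path, DataFrame)` with CSV.jl and DataFrames.jl loaded; check the CSV.jl docs for the version you have installed.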

anon69491625 · January 27, 2020, 3:01pm

I mean exactly “share the memory space”, so I don’t want to save it to disk, but thank you for your suggestion.
I would like to examine the creation and modification of the dataframe in Python 3+, then use something like Feather to access it inside a Julia program. At first I would just read it, but later on I want to see how effective it is when I modify it in both programs. I would then move the Julia code to a different machine to see what happens.

orialb · January 27, 2020, 3:12pm

I see. So as I understand it, you want to be able to read/write to the dataframe simultaneously from both Python and Julia, or something like this. Is this correct?

I’m not an expert on dataframes, but from looking at the Feather.jl docs it seems to me that Feather is just a binary format for saving dataframes to disk, so I am not sure how it would be conceptually different from saving to disk as CSV (it might of course be faster, but you are still saving it to disk). Am I missing something?

robsmith11 · January 27, 2020, 3:23pm

I haven’t used it, but I believe PyCall allows for no-copy/shared memory for basic types. See
https://github.com/JuliaPy/PyCall.jl/blob/master/README.md#arrays-and-pyarray

Sharing an entire dataframe is probably going to be more difficult as I doubt the dataframes will be using the same memory layout.

anon69491625 · January 27, 2020, 3:51pm

@orialb I am new to Julia, so it’s VERY possible I misread the Feather.jl docs as I rushed through them. Thank you for taking a look with a more careful eye. I suppose that I could create a RAM disk, BUT that’s REALLY not what I want to do. I was hoping to just share the memory space.

anon69491625 · January 27, 2020, 3:58pm

hi @robsmith11
thank you for taking an interest. I did look at PyCall but was worried about the dataframe memory layout as well. I am also looking at PyJulia (JuliaPy/pyjulia on GitHub, the Python interface to Julia), but the experimental category worries me a little.

robsmith11 · January 27, 2020, 4:20pm

Why do you want to use shared memory? Are your performance requirements really that high? Passing dataframes back and forth with PyCall should be fairly fast.

If you insist on shared memory, I’d suggest trying to work with plain vectors rather than a dataframe.

To share non-trivial data structures between languages, you would need to design them specifically with interoperability in mind. It’s highly unlikely that something like a dataframe would have the same memory layout for any two languages/packages.
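
To illustrate why plain vectors are the easy case: a flat buffer of one C type has only one sensible layout, so two consumers can look at the same bytes without a copy. The sketch below stays inside Python and uses the standard library's `array` and `memoryview` as a stand-in for a second consumer. It is not PyCall's mechanism, just the same idea in miniature, and the variable names are illustrative.

```python
from array import array

# A contiguous vector of C doubles: the kind of "basic type" that can be
# shared without copying.
prices = array("d", [12.232, 178.21, 3.5])

# A memoryview exposes the same underlying buffer; no data is copied.
view = memoryview(prices)

# Writing through the view changes the original array, because both names
# refer to the same memory.
view[0] = 99.0

# The layout is fully described by element format, element size and count.
layout = (view.format, view.itemsize, len(view))
print(prices[0], layout)

# A row of mixed types has no such flat layout: each element is a reference
# to a separate Python object, so there is nothing uniform to share.
row = [1, "test", 12.232]
kinds = [type(x).__name__ for x in row]
print(kinds)
```

A dataframe is closer to the second case than the first, which is why each library ends up with its own internal representation.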

anon69491625 · January 27, 2020, 6:05pm

I want to get an idea of how the two environments work together. It’s NOT a speed issue, more a matter of getting to know Julia. I agree that vectors would be easier. I was always going to use PyCall but want to know if there is anything else out there.
thanks

lungben · January 27, 2020, 6:19pm

To my knowledge, the memory layouts of pandas and Julia dataframes are different.
You can use PyJulia (in Python) for creating a Julia dataframe as follows:

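import pandas as pd
from julia import Julia
jl = Julia()  # start the embedded Julia runtime
jl.eval("using DataFrames")  # load DataFrames.jl on the Julia side
df = pd.DataFrame([
    {'col1': 1, 'col2': 'test', 'col3': 12.232},
    {'col1': 2, 'col2': 'try_again', 'col3': 178.21},
])
# Pass the columns as a dict of lists; the data is copied, not shared.
df_julia = jl.DataFrame(df.to_dict(orient='list'))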

This works fine for small amounts of data. For large amounts of data you should pass NumPy arrays (which have the same memory layout as Julia arrays for C datatypes - I tested this for GB-sized data). However, you need to prepare / parse your data from and to strictly typed arrays on both the Python and the Julia side.
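
The "strictly typed arrays" preparation might look something like the sketch below, which splits row-oriented records into one homogeneous column per field. It uses the standard library's `array` type as a stand-in for NumPy arrays so that it runs without extra packages. `to_typed_columns` and the type-code mapping are hypothetical helpers, not part of pandas, NumPy, or PyJulia. String columns are left as plain lists, since they have no fixed-width C representation.

```python
from array import array

# Hypothetical mapping from Python types to array type codes
# ('q' = signed 64-bit integer, 'd' = 64-bit float).
TYPE_CODES = {int: "q", float: "d"}

def to_typed_columns(records):
    """Turn a list of row dicts into a dict of columns.

    Numeric columns become contiguous typed arrays; anything else
    (e.g. strings) stays a plain Python list.
    """
    columns = {}
    for name in records[0]:
        values = [rec[name] for rec in records]
        kinds = {type(v) for v in values}
        if len(kinds) == 1 and next(iter(kinds)) in TYPE_CODES:
            columns[name] = array(TYPE_CODES[kinds.pop()], values)
        else:
            columns[name] = values
    return columns

records = [
    {"col1": 1, "col2": "test", "col3": 12.232},
    {"col1": 2, "col2": "try_again", "col3": 178.21},
]

cols = to_typed_columns(records)
for name, col in cols.items():
    print(name, type(col).__name__, list(col))
```

Each numeric column can then be handed over on its own. With pandas and NumPy, the analogous step would be something like `df["col3"].to_numpy()` per column.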

The other way around (calling Python from Julia) works as well with PyCall (actually PyJulia is based on PyCall).

anon69491625 · January 27, 2020, 7:30pm

@lungben
thank you for your excellent guidance. I will certainly look into this more. I liked this presentation given at PyData 2016, as Mr Lattner spent some time on the diagnostics of the memory layouts.

Things have moved on a tad since 2016: Apache Arrow (ursalabs.org) is certainly interesting, as is NVIDIA RAPIDS. Thank you so much for your code sample and for sharing your experiences.

lungben · January 27, 2020, 8:07pm

Thanks for the video link, very interesting!
Now I finally understand the motivation of column-major array ordering.
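
For anyone wondering what that ordering means in practice, here is a small sketch (plain Python, function names made up) of how the same 2x3 table flattens in row-major order (C, NumPy's default) versus column-major order (Julia, Fortran). In column-major order the values of one column sit next to each other in memory, which suits column-wise work on tabular data.

```python
def flatten_row_major(table):
    """Flatten a list of rows one row after another (C order)."""
    return [value for row in table for value in row]

def flatten_column_major(table):
    """Flatten a list of rows one column after another (Fortran/Julia order)."""
    n_cols = len(table[0])
    return [row[j] for j in range(n_cols) for row in table]

def column_major_offset(i, j, n_rows):
    """Offset of element (i, j) in the column-major buffer (0-based)."""
    return j * n_rows + i

table = [
    [1, 2, 3],
    [4, 5, 6],
]

row_major = flatten_row_major(table)
col_major = flatten_column_major(table)
print(row_major)
print(col_major)

# Element (1, 2) of the table is found at offset 2 * 2 + 1 in column-major order.
print(col_major[column_major_offset(1, 2, n_rows=2)])
```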
