Converting Pandas Dataframe returned from PyCall to Julia DataFrame

clibassi · July 13, 2020, 7:43pm

I’ve been trying to pull some baseball data into a dataframe and there’s a very nice Python package called pybaseball for conveniently doing this. So I’ve been playing with using PyCall to pull this data in - but am running into an issue with converting the resulting pyobject into a DataFrame. Any help? The below code should be runnable for anyone, if you pip install pybaseball for your copy of Python:

using PyCall
pyb = pyimport("pybaseball")
df = pyb.statcast(start_dt="2017-06-24", end_dt="2017-06-27")
# Now what? df2 = DataFrame(df) does not work

lungben · July 13, 2020, 8:20pm

Pandas DataFrames and Julia DataFrames.jl cannot directly be converted to my knowledge.
However, the underlying data structures, Numpy arrays and Julia Arrays, can be passed very efficiently with PyCall.

using PyCall
using DataFrames

pd = pyimport("pandas")
df= pd.read_csv("test_data.csv")

function pd_to_df(df_pd)
    df= DataFrame()
    for col in df_pd.columns
        df[!, col] = getproperty(df_pd, col).values
    end
    df
end

df_julia = pd_to_df(df)

Performance is very good - 0.1s for 450k rows and 20 columns on my quite weak machine.

clibassi · July 14, 2020, 3:24pm

Hmm - so for some reason this code was giving me a data type error - but a friend tried to sort it out and came up with this modification of your code, which works for me:

function pd_to_df(df_pd)
    colnames = map(Symbol, df_pd.columns)
    df = DataFrame(Any[Array(df_pd[c].values) for c in colnames], colnames)
end

df_julia = pd_to_df(df)

pdeffebach · July 14, 2020, 3:27pm

It would probably be very easy to implement the Tables.jl interface for a pandas dataframe in julia.

lungben · July 14, 2020, 4:49pm

I tried it with the NYC Taxi data set and it worked fine.

download("https://nyc-tlc.s3.amazonaws.com/trip+data/green_tripdata_2019-12.csv", 
    "test_data.csv")

It looks like a data type inference problem in your data set.
By casting all columns to data type Any you got rid of this problem, but performance will be bad.

pdeffebach · July 14, 2020, 5:13pm

Note that DataFrames takes care of this in the construction so that internally the columns are stored as Vector{AbstractVector}, which will have the same behavior as Vector{Any} in terms of inference for the most part.

This doesn’t cause performance problems unless your function acts on a DataFrame as a whole and there are no function barriers taking in the columns themselves.

lungben · July 14, 2020, 5:20pm

Thanks, good to know!

clibassi · July 18, 2020, 12:18am

Interestingly, my friend’s solution works for a CSV file I was using, but when I tried to use it with the pybaseball call, it didn’t work. I believe they are both returning pandas dataframe, which is curious. Anyone have thoughts on why that could be?

Tamas_Papp · July 18, 2020, 11:12am

This is difficult to answer in general without an MWE (that includes at least a mock dataset reproducing the issue).

clibassi · July 18, 2020, 2:50pm

Sure thing, pybaseball is free and provides publicly available baseball data so anyone should be able to recreate this example. I’ve also added the things that aren’t working that I have tried.

# Be sure to install pybaseball for the version of Python your PyCall is attached to with 'pip install pybaseball'
using PyCall
pyb = pyimport("pybaseball")

# Pull the Statcast Data Between 6/24/2017 and 6/26/2017 
df = pyb.statcast(start_dt="2017-06-24", end_dt="2017-06-26")

Things that haven’t worked:

#Traditional DataFrame conversion
using DataFrames
df2 = DataFrame(df)

#lungben's suggested solution 
using DataFrames

function pd_to_df(df_pd)
    df= DataFrame()
    for col in df_pd.columns
        df[!, col] = getproperty(df_pd, col).values
    end
    df
end

df_julia = pd_to_df(df)

#Alteration of lungben's code
using DataFrames
function pd_to_df(df_pd)
    colnames = map(Symbol, df_pd.columns)
    df = DataFrame(Any[Array(df_pd[c].values) for c in colnames], colnames)
end

df_julia = pd_to_df(df)

hhaensel · March 31, 2021, 11:34pm

I encountered a bug (or feature?) in the indexing of pandas dataframes.
df_pd["T"] will refer to the transposed dataframe. If by chance the dataframe contains a column named ‘T’, the conversion will throw an error.

EDIT: It’s a feature! Pandas dataframes has a so-called accessor property which calls the transpose method.

Here’s a workaround for non-mixed data:

function pd_to_df(df_pd)
  colnames = Symbol.(df_pd.columns)
  df = DataFrame(Any[pdf.values[:, i] for i in 1:size(pdf.values, 2)], colnames)
end

and here’s another workaround that can handle columns named T and different column types (not sure, whether it is the most clever solution, though …)

function pd_to_df(df_pd)
  colnames = convert(Vector{Symbol}, df_pd.columns)
  :T in colnames && df_pd.rename(columns=Dict(:T=>:T_no_transpose), inplace=true)
  vv = (Vector{T} where T)[]
  for c in replace(colnames, :T => :T_no_transpose)
    v =  df_pd[c].values
    if v isa PyObject
      T = Vector{typeof(df_pd[c][1])}
      v = convert(T,  df_pd[c])
    elseif !isa(df_pd[c][1], PyObject)
      v = [df_pd[c][i] for i in 1:length(df_pd[c])]
    elseif v[1] isa PyObject
      PyAny.(v)
    else
      v
    end
    push!(vv, v)
  end
  df = DataFrame(vv, colnames)
  :T in colnames && df_pd.rename(columns=Dict(:T_no_transpose => :T), inplace=true)
  df
end

hhaensel · March 15, 2022, 10:05am

I saw that this attracted some attention. I’d like add some comments

replacing df_pd[c] by get(df_pd, c) removes the need of renaming
defining colnames = PyAny.(df_pd.columns.values) makes it possible to use indexing by other key types than strings, e.g. Int
Here’s the modified conversion function:

function pd_to_df(df_pd)
    colnames = PyAny.(df_pd.columns.values)
    vv = Vector[]
    for c in colnames
      col = get(df_pd, c)
      v =  col.values
      if v isa PyObject
        T = Vector{typeof(col[1])}
        v = convert(T,  col)
      elseif !isa(col[1], PyObject)
        v = [col[i] for i in 1:length(col)]
      elseif v[1] isa PyObject
        PyAny.(v)
      else
        v
      end
      push!(vv, v)
    end
    DataFrame(vv, Symbol.(colnames))
  end

Integer-based indexing is also possible with the following definitions:

slice(args...) = pycall(pybuiltin("slice"), PyObject, args...)
slice(u::UnitRange) = slice(u.start - 1, u.stop)
slice(sr::StepRange) = slice(sr.start - 1, sr.stop, sr.step)
slice(::Colon) = slice(nothing)
slice(i::Integer) = i - 1
import Base.getindex
getindex(p::PyObject, i::Union{<:Integer, Colon, UnitRange, StepRange}...) = get(p, slice.(i))

Then you can do

pdf = pd.DataFrame(Dict(:a => [1,2], 0 => [2, 3], :T => [4, "5"]))
pdf.iloc[:,2]

lawless-m · March 15, 2022, 2:31pm

anything wrong with this?

julia> using PyCall, CSV, DataFrames
julia> pyb = pyimport("pybaseball")
julia> pdf = pyb.statcast(start_dt="2017-06-24", end_dt="2017-06-27", parallel=false) # it crashed for me if it was true
 
julia> df = CSV.File(IOBuffer(pdf.to_csv())) |> DataFrame
15949×93 DataFrame
   Row │ Column1  pitch_type  game_date   release_speed  release_pos_x  release_pos_z  player_name  batter  pitcher  events     descript ⋯ 
       │ Int64    String3?    Date        Float64?       Float64?       Float64?       String31     Int64   Int64    String31?  String31 ⋯
───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     1 │    3323  CU          2017-06-27           79.7          -1.34           5.41  Bush, Matt   608070   456713  field_out  hit_into ⋯
     2 │    3485  FF          2017-06-27           98.1          -1.35           5.42  Bush, Matt   429665   456713  field_out  hit_into  
     3 │    3622  FF          2017-06-27           98.3          -1.29           5.45  Bush, Matt   429665   456713  missing    ball      
     4 │    3637  FC          2017-06-27           90.9          -1.46           5.25  Bush, Matt   429665   456713  missing    called_s  
     5 │    3784  SI          2017-06-27           99.1          -1.37           5.34  Bush, Matt   488726   456713  single     hit_into ⋯
   ⋮   │    ⋮         ⋮           ⋮             ⋮              ⋮              ⋮             ⋮         ⋮        ⋮         ⋮           ⋮ ⋱ 
 15946 │    3978  FS          2017-06-24           82.6          -1.73           6.38  Faria, Jake  452234   607188  missing    ball      
 15947 │    4026  SL          2017-06-24           85.9          -1.63           6.37  Faria, Jake  452234   607188  missing    swinging  
 15948 │    4173  FF          2017-06-24           91.9          -1.47           6.59  Faria, Jake  452234   607188  missing    called_s  
 15949 │    4244  FF          2017-06-24           92.4          -1.57           6.57  Faria, Jake  452234   607188  missing    ball     ⋯                                                                                                         

83 columns and 15940 rows omitted

icweaver · March 15, 2022, 3:47pm

It looks like we can get most of the way with the new PythonCall.jl package with a Tables.jl compatible interface as well:

(@v1.7) pkg> activate --temp
  Activating new project at `/tmp/jl_cjhpZ1`

(jl_cjhpZ1) pkg> add CondaPkg

(jl_cjhpZ1) julia> using CondaPkg

(jl_cjhpZ1) pkg> conda add --pip pybaseball

(jl_cjhpZ1) pkg> add PythonCall

(jl_cjhpZ1) julia> using PythonCall # Should auto resolve and add pybaseball

(jl_cjhpZ1) julia> @py import pybaseball as pyb
This is a large query, it may take a moment to complete
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.29s/it]
Python DataFrame:
     pitch_type  game_date  release_speed  ...  spin_axis  delta_home_win_exp delta_run_exp
1206         SL 2017-06-26           83.8  ...        142               0.001        -0.416
1227         FF 2017-06-26           92.7  ...        198                 0.0           0.0
1278         SL 2017-06-26           83.1  ...         99                 0.0         0.087
1308         SL 2017-06-26           84.4  ...        124                 0.0           0.0
1324         SL 2017-06-26           83.6  ...        130                 0.0           0.0
...         ...        ...            ...  ...        ...                 ...           ...
3785         FF 2017-06-24           91.8  ...        182               0.022        -0.216
3978         FS 2017-06-24           82.6  ...        256                 0.0         0.043
4026         SL 2017-06-24           85.9  ...        119                 0.0        -0.062
4173         FF 2017-06-24           91.9  ...        192                 0.0        -0.046
4244         FF 2017-06-24           92.4  ...        193                 0.0         0.036

[11434 rows x 92 columns]

(jl_cjhpZ1) julia> df_py = pyb.statcast(start_dt="2017-06-24", end_dt="2017-06-26")

(jl_cjhpZ1) julia> tbl = PyTable(df_py)
11434×92 PyPandasDataFrame
     pitch_type  game_date  release_speed  ...  spin_axis  delta_home_win_exp delta_run_exp
1206         SL 2017-06-26           83.8  ...        142               0.001        -0.416
1227         FF 2017-06-26           92.7  ...        198                 0.0           0.0
1278         SL 2017-06-26           83.1  ...         99                 0.0         0.087
1308         SL 2017-06-26           84.4  ...        124                 0.0           0.0
1324         SL 2017-06-26           83.6  ...        130                 0.0           0.0
...         ...        ...            ...  ...        ...                 ...           ...
3785         FF 2017-06-24           91.8  ...        182               0.022        -0.216
3978         FS 2017-06-24           82.6  ...        256                 0.0         0.043
4026         SL 2017-06-24           85.9  ...        119                 0.0        -0.062
4173         FF 2017-06-24           91.9  ...        192                 0.0        -0.046
4244         FF 2017-06-24           92.4  ...        193                 0.0         0.036

[11434 rows x 92 columns]

I think this can usually be converted to a DataFrame by just doing:

(jl_cjhpZ1) julia> using DataFrames

(jl_cjhpZ1) julia> df = DataFrame(tbl)

but here it looks to throw an out of bounds datetime error that may be related to this: python - pandas out of bounds nanosecond timestamp after offset rollforward plus adding a month offset - Stack Overflow

Sorry if this is the wrong place to ping you @cjdoris, but would this be something that could (or should) be handled on PythonCall.jl’s end in its datetime conversions?

cjdoris · March 15, 2022, 5:21pm

I’ve fixed that issue on the main branch, your example works now. I also tweaked conversion rules for missing data so the column types are more useful now.

icweaver · March 15, 2022, 5:45pm

Wow, thanks for the quick fix!!


(jl_kYnoWZ) julia> df = DataFrame(tbl)
11434×92 DataFrame
   Row │ pitch_type  game_date            release_speed  release_pos_x  release_pos_z  player_name   batter  pitcher  events        descrip ⋯
       │ String?     DateTime             Float64?       Float64?       Float64?       String        Int64   Int64    String?       String  ⋯
───────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     1 │ SL          2017-06-26T00:00:00           83.8          -1.43           5.61  Goody, Nick   518902   580792  strikeout     swingin ⋯
     2 │ FF          2017-06-26T00:00:00           92.7          -0.94           5.75  Goody, Nick   518902   580792  missing       foul
     3 │ SL          2017-06-26T00:00:00           83.1          -1.58           5.64  Goody, Nick   518902   580792  missing       blocked

hhaensel · March 21, 2022, 7:28am

I agree, PythonCall is great! And I use it for most of my cases since I discovered it.

There is still a point where also PythonCall fails; when the columns of the DataFrames are not strings. But this can be easily worked around:

julia> using PythonCall, DataFrames

julia> @py import pandas as pd

julia> pdf = pd.DataFrame(pydict(Dict(:a => [1,2], 0 => [2, 3], :T => [4, "5"])))
Python DataFrame:
   a  0  T
0  1  2  4
1  2  3  5

julia> DataFrame(pdf)
ERROR: name of column '<py NULL>' is not a string
Stacktrace:
 [1] error(s::String)
   @ Base .\error.jl:33
 [2] keys(df::PyPandasDataFrame)
   @ PythonCall C:\Users\<xxx>\.julia\packages\PythonCall\XgP8G\src\pywrap\PyPandasDataFrame.jl:40
 [3] columns(df::PyPandasDataFrame)
   @ PythonCall C:\Users\<xxx>\.julia\packages\PythonCall\XgP8G\src\pywrap\PyPandasDataFrame.jl:123
 [4] columns(x::Py)
   @ PythonCall C:\Users\<xxx>\.julia\packages\PythonCall\XgP8G\src\pywrap\PyTable.jl:17
 [5] DataFrame(x::Py; copycols::Nothing)
   @ DataFrames C:\Users\<xxx>\.julia\packages\DataFrames\MA4YO\src\other\tables.jl:58
 [6] DataFrame(x::Py)
   @ DataFrames C:\Users\<xxx>\.julia\packages\DataFrames\MA4YO\src\other\tables.jl:49
 [7] top-level scope
   @ REPL[6]:1

Solution:

julia> pdf2 = pdf.copy();

julia> pdf2.columns = string.(pdf2.columns);

julia> DataFrame(pdf2)
2×3 DataFrame
 Row │ a      0      T
     │ Int64  Int64  Any
─────┼───────────────────
   1 │     1      2  4
   2 │     2      3  5

Maybe I should file an issue …

Kirby_Zhang · May 25, 2022, 12:58am

PythonCall works great. Fast, no copying. Supports more types than the original PyCall.
For numpy backed pandas DataFrame wrapped around PyTable, you can access the columns directly, no copying.

Kirby_Zhang · May 27, 2022, 6:43am

I found you need to wrap PyArray around the columns still in Py object form, then there is no copying. Wrapping in PyTable (which implements Table.jl interface) will result in large memory allocations.

Topic		Replies	Views
Converting Pandas DataFrame to Julia DataFrame? General Usage	19	8835	July 8, 2021
Pyjulia - accessing data in a Julia Dataframe that was called from Python General Usage	5	1630	July 16, 2020
Problem Reading Python Pandas object into Julia General Usage question	9	1754	November 28, 2018
Pyarrow conversion with PythonCall General Usage dataframes , pythoncall , arrow	11	564	May 27, 2023
Cant add or create new column for my data csv General Usage question	1	71	January 20, 2025

Converting Pandas Dataframe returned from PyCall to Julia DataFrame

Related topics