Converting Pandas Dataframe returned from PyCall to Julia DataFrame

I’ve been trying to pull some baseball data into a dataframe, and there’s a very nice Python package called pybaseball for conveniently doing this. So I’ve been playing with using PyCall to pull this data in, but I’m running into an issue converting the resulting PyObject into a DataFrame. Any help? The code below should be runnable for anyone who has pip-installed pybaseball for the copy of Python that PyCall uses:

using PyCall
pyb = pyimport("pybaseball")
df = pyb.statcast(start_dt="2017-06-24", end_dt="2017-06-27")
# Now what? df2 = DataFrame(df) does not work

To my knowledge, pandas DataFrames cannot be converted directly to DataFrames.jl DataFrames.
However, the underlying data structures, numpy arrays and Julia Arrays, can be passed very efficiently with PyCall.
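For example, a plain numeric numpy array comes back as an ordinary Julia array (a quick sketch, assuming numpy is installed for the Python that PyCall is linked against):

using PyCall
np = pyimport("numpy")

a = np.arange(5.0)   # automatic conversion: `a` is a Julia Vector{Float64}

The same idea, applied column by column to a whole pandas DataFrame: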

using PyCall
using DataFrames

pd = pyimport("pandas")
df = pd.read_csv("test_data.csv")

function pd_to_df(df_pd)
    df = DataFrame()
    for col in df_pd.columns
        # .values is the column's numpy array; PyCall converts it to a Julia array
        df[!, col] = getproperty(df_pd, col).values
    end
    df
end

df_julia = pd_to_df(df)

Performance is very good - 0.1s for 450k rows and 20 columns on my quite weak machine.


Hmm, for some reason this code was giving me a data type error, but a friend tried to sort it out and came up with this modification of your code, which works for me:

function pd_to_df(df_pd)
    colnames = map(Symbol, df_pd.columns)
    df = DataFrame(Any[Array(df_pd[c].values) for c in colnames], colnames)
end

df_julia = pd_to_df(df)

It would probably be very easy to implement the Tables.jl interface for a pandas dataframe in Julia.
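For illustration, here is an untested sketch of what that could look like; PandasTable is just a made-up wrapper name, and it assumes the column data converts cleanly to Julia arrays:

using PyCall, Tables

# Hypothetical wrapper around a PyCall pandas DataFrame (not an existing package type)
struct PandasTable
    obj::PyObject
end

Tables.istable(::Type{PandasTable}) = true
Tables.columnaccess(::Type{PandasTable}) = true

function Tables.columns(t::PandasTable)
    names = [Symbol(c) for c in t.obj.columns]
    # get(obj, key) is Python __getitem__, so columns whose names clash with
    # attributes (e.g. "T") are still looked up as columns
    cols = [get(t.obj, String(n)).values for n in names]
    return NamedTuple{Tuple(names)}(Tuple(cols))
end

With that, any Tables.jl sink should accept it, e.g. DataFrame(PandasTable(df_pd)).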

I tried it with the NYC Taxi data set and it worked fine.

download("https://nyc-tlc.s3.amazonaws.com/trip+data/green_tripdata_2019-12.csv", 
    "test_data.csv")

It looks like a data type inference problem in your data set.
By casting all columns to data type Any you got rid of this problem, but performance will be bad.

Note that DataFrames.jl takes care of this during construction: internally, the columns are stored in a Vector{AbstractVector}, which behaves like Vector{Any} in terms of inference for the most part.

This doesn’t cause performance problems unless your function acts on a DataFrame as a whole and there are no function barriers taking in the columns themselves.
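A small illustration of the function-barrier point (the function and column names here are made up for the example):

using DataFrames

df = DataFrame(x = rand(10^6))

# Acts on the DataFrame as a whole: the concrete element type of df.x is not
# known inside this function, so the loop pays for dynamic dispatch.
function sum_column_slow(df)
    s = 0.0
    for v in df.x
        s += v
    end
    return s
end

# Function barrier: the column is passed as an argument, so this method is
# compiled for the concrete Vector{Float64} and the loop is fast.
function sum_column(col::AbstractVector)
    s = 0.0
    for v in col
        s += v
    end
    return s
end

sum_column_fast(df) = sum_column(df.x)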


Thanks, good to know!

Interestingly, my friend’s solution works for a CSV file I was using, but when I tried to use it with the pybaseball call, it didn’t work. I believe they both return a pandas dataframe, which is curious. Anyone have thoughts on why that could be?

This is difficult to answer in general without an MWE (that includes at least a mock dataset reproducing the issue).

Sure thing; pybaseball is free and provides publicly available baseball data, so anyone should be able to recreate this example. I’ve also added the things I have tried that aren’t working.

# Be sure to install pybaseball for the version of Python your PyCall is attached to with 'pip install pybaseball'
using PyCall
pyb = pyimport("pybaseball")

# Pull the Statcast Data Between 6/24/2017 and 6/26/2017 
df = pyb.statcast(start_dt="2017-06-24", end_dt="2017-06-26")

Things that haven’t worked:

#Traditional DataFrame conversion
using DataFrames
df2 = DataFrame(df)

#lungben's suggested solution 
using DataFrames

function pd_to_df(df_pd)
    df = DataFrame()
    for col in df_pd.columns
        df[!, col] = getproperty(df_pd, col).values
    end
    df
end

df_julia = pd_to_df(df)

#Alteration of lungben's code
using DataFrames
function pd_to_df(df_pd)
    colnames = map(Symbol, df_pd.columns)
    df = DataFrame(Any[Array(df_pd[c].values) for c in colnames], colnames)
end

df_julia = pd_to_df(df)

I encountered a bug (or feature?) in the indexing of pandas dataframes.
df_pd["T"] will refer to the transposed dataframe. If by chance the dataframe contains a column named ‘T’, the conversion will throw an error.

EDIT: It’s a feature! Pandas dataframes have a so-called accessor property which calls the transpose method.
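A quick way to see the clash from PyCall (throwaway example, with a column deliberately named "T"):

using PyCall
pd = pyimport("pandas")

df_pd = pd.DataFrame(Dict("T" => [1, 2], "x" => [3, 4]))
df_pd["T"]   # string indexing is attribute access in PyCall, so this is the transpose, not the "T" column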

Here’s a workaround for non-mixed data:

function pd_to_df(df_pd)
  colnames = Symbol.(df_pd.columns)
  df = DataFrame(Any[df_pd.values[:, i] for i in 1:size(df_pd.values, 2)], colnames)
end

And here’s another workaround that can handle columns named T and different column types (not sure whether it is the cleverest solution, though …):

function pd_to_df(df_pd)
  colnames = convert(Vector{Symbol}, df_pd.columns)
  :T in colnames && df_pd.rename(columns=Dict(:T => :T_no_transpose), inplace=true)
  vv = (Vector{T} where T)[]
  for c in replace(colnames, :T => :T_no_transpose)
    v = df_pd[c].values
    if v isa PyObject
      T = Vector{typeof(df_pd[c][1])}
      v = convert(T, df_pd[c])
    elseif !isa(df_pd[c][1], PyObject)
      v = [df_pd[c][i] for i in 1:length(df_pd[c])]
    elseif v[1] isa PyObject
      v = PyAny.(v)  # convert each PyObject element so a plain Julia vector is pushed
    end
    push!(vv, v)
  end
  df = DataFrame(vv, colnames)
  :T in colnames && df_pd.rename(columns=Dict(:T_no_transpose => :T), inplace=true)
  df
end

I saw that this attracted some attention. I’d like to add some comments:

  • replacing df_pd[c] by get(df_pd, c) removes the need for renaming
  • defining colnames = PyAny.(df_pd.columns.values) makes it possible to index by key types other than strings, e.g. Int
    Here’s the modified conversion function:
function pd_to_df(df_pd)
    colnames = PyAny.(df_pd.columns.values)
    vv = Vector[]
    for c in colnames
      col = get(df_pd, c)
      v = col.values
      if v isa PyObject
        T = Vector{typeof(col[1])}
        v = convert(T, col)
      elseif !isa(col[1], PyObject)
        v = [col[i] for i in 1:length(col)]
      elseif v[1] isa PyObject
        v = PyAny.(v)  # convert each PyObject element so a plain Julia vector is pushed
      end
      push!(vv, v)
    end
    DataFrame(vv, Symbol.(colnames))
end

Integer-based indexing is also possible with the following definitions:

slice(args...) = pycall(pybuiltin("slice"), PyObject, args...)
slice(u::UnitRange) = slice(u.start - 1, u.stop)
slice(sr::StepRange) = slice(sr.start - 1, sr.stop, sr.step)
slice(::Colon) = slice(nothing)
slice(i::Integer) = i - 1
import Base.getindex
getindex(p::PyObject, i::Union{<:Integer, Colon, UnitRange, StepRange}...) = get(p, slice.(i))

Then you can do

pdf = pd.DataFrame(Dict(:a => [1,2], 0 => [2, 3], :T => [4, "5"]))
pdf.iloc[:,2]

Anything wrong with this?

julia> using PyCall, CSV, DataFrames
julia> pyb = pyimport("pybaseball")
julia> pdf = pyb.statcast(start_dt="2017-06-24", end_dt="2017-06-27", parallel=false) # it crashed for me if it was true
 
julia> df = CSV.File(IOBuffer(pdf.to_csv())) |> DataFrame
15949×93 DataFrame
   Row │ Column1  pitch_type  game_date   release_speed  release_pos_x  release_pos_z  player_name  batter  pitcher  events     descript ⋯ 
       │ Int64    String3?    Date        Float64?       Float64?       Float64?       String31     Int64   Int64    String31?  String31 ⋯
───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     1 │    3323  CU          2017-06-27           79.7          -1.34           5.41  Bush, Matt   608070   456713  field_out  hit_into ⋯
     2 │    3485  FF          2017-06-27           98.1          -1.35           5.42  Bush, Matt   429665   456713  field_out  hit_into  
     3 │    3622  FF          2017-06-27           98.3          -1.29           5.45  Bush, Matt   429665   456713  missing    ball      
     4 │    3637  FC          2017-06-27           90.9          -1.46           5.25  Bush, Matt   429665   456713  missing    called_s  
     5 │    3784  SI          2017-06-27           99.1          -1.37           5.34  Bush, Matt   488726   456713  single     hit_into ⋯
   ⋮   │    ⋮         ⋮           ⋮             ⋮              ⋮              ⋮             ⋮         ⋮        ⋮         ⋮           ⋮ ⋱ 
 15946 │    3978  FS          2017-06-24           82.6          -1.73           6.38  Faria, Jake  452234   607188  missing    ball      
 15947 │    4026  SL          2017-06-24           85.9          -1.63           6.37  Faria, Jake  452234   607188  missing    swinging  
 15948 │    4173  FF          2017-06-24           91.9          -1.47           6.59  Faria, Jake  452234   607188  missing    called_s  
 15949 │    4244  FF          2017-06-24           92.4          -1.57           6.57  Faria, Jake  452234   607188  missing    ball     ⋯                                                                                                         

83 columns and 15940 rows omitted

It looks like we can get most of the way with the new PythonCall.jl package, which comes with a Tables.jl-compatible interface, as well:

(@v1.7) pkg> activate --temp
  Activating new project at `/tmp/jl_cjhpZ1`

(jl_cjhpZ1) pkg> add CondaPkg

(jl_cjhpZ1) julia> using CondaPkg

(jl_cjhpZ1) pkg> conda add --pip pybaseball

(jl_cjhpZ1) pkg> add PythonCall

(jl_cjhpZ1) julia> using PythonCall # Should auto resolve and add pybaseball

(jl_cjhpZ1) julia> @py import pybaseball as pyb

(jl_cjhpZ1) julia> df_py = pyb.statcast(start_dt="2017-06-24", end_dt="2017-06-26")
This is a large query, it may take a moment to complete
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.29s/it]
Python DataFrame:
     pitch_type  game_date  release_speed  ...  spin_axis  delta_home_win_exp delta_run_exp
1206         SL 2017-06-26           83.8  ...        142               0.001        -0.416
1227         FF 2017-06-26           92.7  ...        198                 0.0           0.0
1278         SL 2017-06-26           83.1  ...         99                 0.0         0.087
1308         SL 2017-06-26           84.4  ...        124                 0.0           0.0
1324         SL 2017-06-26           83.6  ...        130                 0.0           0.0
...         ...        ...            ...  ...        ...                 ...           ...
3785         FF 2017-06-24           91.8  ...        182               0.022        -0.216
3978         FS 2017-06-24           82.6  ...        256                 0.0         0.043
4026         SL 2017-06-24           85.9  ...        119                 0.0        -0.062
4173         FF 2017-06-24           91.9  ...        192                 0.0        -0.046
4244         FF 2017-06-24           92.4  ...        193                 0.0         0.036

[11434 rows x 92 columns]

(jl_cjhpZ1) julia> tbl = PyTable(df_py)
11434×92 PyPandasDataFrame
     pitch_type  game_date  release_speed  ...  spin_axis  delta_home_win_exp delta_run_exp
1206         SL 2017-06-26           83.8  ...        142               0.001        -0.416
1227         FF 2017-06-26           92.7  ...        198                 0.0           0.0
1278         SL 2017-06-26           83.1  ...         99                 0.0         0.087
1308         SL 2017-06-26           84.4  ...        124                 0.0           0.0
1324         SL 2017-06-26           83.6  ...        130                 0.0           0.0
...         ...        ...            ...  ...        ...                 ...           ...
3785         FF 2017-06-24           91.8  ...        182               0.022        -0.216
3978         FS 2017-06-24           82.6  ...        256                 0.0         0.043
4026         SL 2017-06-24           85.9  ...        119                 0.0        -0.062
4173         FF 2017-06-24           91.9  ...        192                 0.0        -0.046
4244         FF 2017-06-24           92.4  ...        193                 0.0         0.036

[11434 rows x 92 columns]

I think this can usually be converted to a DataFrame by just doing:

(jl_cjhpZ1) julia> using DataFrames

(jl_cjhpZ1) julia> df = DataFrame(tbl)

but here it seems to throw an out-of-bounds datetime error that may be related to this Stack Overflow question: "pandas out of bounds nanosecond timestamp after offset rollforward plus adding a month offset".

Sorry if this is the wrong place to ping you @cjdoris, but would this be something that could (or should) be handled on PythonCall.jl’s end in its datetime conversions?


I’ve fixed that issue on the main branch; your example works now. I also tweaked the conversion rules for missing data, so the column types are more useful now.
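For anyone who wants to try it before the next release, the main branch can be installed with the usual Pkg syntax:

pkg> add PythonCall#main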


Wow, thanks for the quick fix!!


(jl_kYnoWZ) julia> df = DataFrame(tbl)
11434×92 DataFrame
   Row │ pitch_type  game_date            release_speed  release_pos_x  release_pos_z  player_name   batter  pitcher  events        descrip ⋯
       │ String?     DateTime             Float64?       Float64?       Float64?       String        Int64   Int64    String?       String  ⋯
───────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
     1 │ SL          2017-06-26T00:00:00           83.8          -1.43           5.61  Goody, Nick   518902   580792  strikeout     swingin ⋯
     2 │ FF          2017-06-26T00:00:00           92.7          -0.94           5.75  Goody, Nick   518902   580792  missing       foul
     3 │ SL          2017-06-26T00:00:00           83.1          -1.58           5.64  Goody, Nick   518902   580792  missing       blocked

I agree, PythonCall is great! I’ve been using it for most of my cases since I discovered it.

There is still one case where PythonCall fails: when the column names of the DataFrame are not strings. But this can easily be worked around:

julia> using PythonCall, DataFrames

julia> @py import pandas as pd

julia> pdf = pd.DataFrame(pydict(Dict(:a => [1,2], 0 => [2, 3], :T => [4, "5"])))
Python DataFrame:
   a  0  T
0  1  2  4
1  2  3  5

julia> DataFrame(pdf)
ERROR: name of column '<py NULL>' is not a string
Stacktrace:
 [1] error(s::String)
   @ Base .\error.jl:33
 [2] keys(df::PyPandasDataFrame)
   @ PythonCall C:\Users\<xxx>\.julia\packages\PythonCall\XgP8G\src\pywrap\PyPandasDataFrame.jl:40
 [3] columns(df::PyPandasDataFrame)
   @ PythonCall C:\Users\<xxx>\.julia\packages\PythonCall\XgP8G\src\pywrap\PyPandasDataFrame.jl:123
 [4] columns(x::Py)
   @ PythonCall C:\Users\<xxx>\.julia\packages\PythonCall\XgP8G\src\pywrap\PyTable.jl:17
 [5] DataFrame(x::Py; copycols::Nothing)
   @ DataFrames C:\Users\<xxx>\.julia\packages\DataFrames\MA4YO\src\other\tables.jl:58
 [6] DataFrame(x::Py)
   @ DataFrames C:\Users\<xxx>\.julia\packages\DataFrames\MA4YO\src\other\tables.jl:49
 [7] top-level scope
   @ REPL[6]:1

Solution:

julia> pdf2 = pdf.copy();

julia> pdf2.columns = string.(pdf2.columns);

julia> DataFrame(pdf2)
2×3 DataFrame
 Row │ a      0      T
     │ Int64  Int64  Any
─────┼───────────────────
   1 │     1      2  4
   2 │     2      3  5

Maybe I should file an issue …

PythonCall works great: it’s fast, avoids copying, and supports more types than the original PyCall.
For a numpy-backed pandas DataFrame wrapped in PyTable, you can access the columns directly with no copying.


I found you still need to wrap the columns (while they are in Py object form) in PyArray; then there is no copying. Wrapping in PyTable (which implements the Tables.jl interface) will result in large memory allocations.
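For reference, roughly what that looks like (assuming df_py is the statcast DataFrame from the PythonCall example above; the column name is just an example):

using PythonCall

col_py = df_py["release_speed"]   # still a Py object (a pandas Series)
col = PyArray(col_py.values)      # no-copy wrapper around the underlying numpy array
length(col), eltype(col)          # behaves like an ordinary Julia AbstractVector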