I’ve been trying to pull some baseball data into a dataframe and there’s a very nice Python package called pybaseball for conveniently doing this. So I’ve been playing with using PyCall to pull this data in - but am running into an issue with converting the resulting pyobject into a DataFrame. Any help? The below code should be runnable for anyone, if you pip install pybaseball for your copy of Python:
using PyCall
pyb = pyimport("pybaseball")
df = pyb.statcast(start_dt="2017-06-24", end_dt="2017-06-27")
# Now what? df2 = DataFrame(df) does not work
1 Like
Pandas DataFrames and Julia DataFrames.jl cannot directly be converted to my knowledge.
However, the underlying data structures, Numpy arrays and Julia Arrays, can be passed very efficiently with PyCall.
using PyCall
using DataFrames
pd = pyimport("pandas")
df= pd.read_csv("test_data.csv")
function pd_to_df(df_pd)
df= DataFrame()
for col in df_pd.columns
df[!, col] = getproperty(df_pd, col).values
end
df
end
df_julia = pd_to_df(df)
Performance is very good - 0.1s for 450k rows and 20 columns on my quite weak machine.
4 Likes
Hmm - so for some reason this code was giving me a data type error - but a friend tried to sort it out and came up with this modification of your code, which works for me:
function pd_to_df(df_pd)
colnames = map(Symbol, df_pd.columns)
df = DataFrame(Any[Array(df_pd[c].values) for c in colnames], colnames)
end
df_julia = pd_to_df(df)
1 Like
It would probably be very easy to implement the Tables.jl interface for a pandas dataframe in julia.
I tried it with the NYC Taxi data set and it worked fine.
download("https://nyc-tlc.s3.amazonaws.com/trip+data/green_tripdata_2019-12.csv",
"test_data.csv")
It looks like a data type inference problem in your data set.
By casting all columns to data type Any you got rid of this problem, but performance will be bad.
Note that DataFrames takes care of this in the construction so that internally the columns are stored as Vector{AbstractVector}
, which will have the same behavior as Vector{Any}
in terms of inference for the most part.
This doesn’t cause performance problems unless your function acts on a DataFrame as a whole and there are no function barriers taking in the columns themselves.
1 Like
Interestingly, my friend’s solution works for a CSV file I was using, but when I tried to use it with the pybaseball call, it didn’t work. I believe they are both returning pandas dataframe, which is curious. Anyone have thoughts on why that could be?
This is difficult to answer in general without an MWE (that includes at least a mock dataset reproducing the issue).
Sure thing, pybaseball is free and provides publicly available baseball data so anyone should be able to recreate this example. I’ve also added the things that aren’t working that I have tried.
# Be sure to install pybaseball for the version of Python your PyCall is attached to with 'pip install pybaseball'
using PyCall
pyb = pyimport("pybaseball")
# Pull the Statcast Data Between 6/24/2017 and 6/26/2017
df = pyb.statcast(start_dt="2017-06-24", end_dt="2017-06-26")
Things that haven’t worked:
#Traditional DataFrame conversion
using DataFrames
df2 = DataFrame(df)
#lungben's suggested solution
using DataFrames
function pd_to_df(df_pd)
df= DataFrame()
for col in df_pd.columns
df[!, col] = getproperty(df_pd, col).values
end
df
end
df_julia = pd_to_df(df)
#Alteration of lungben's code
using DataFrames
function pd_to_df(df_pd)
colnames = map(Symbol, df_pd.columns)
df = DataFrame(Any[Array(df_pd[c].values) for c in colnames], colnames)
end
df_julia = pd_to_df(df)
I encountered a bug (or feature?) in the indexing of pandas dataframes.
df_pd["T"]
will refer to the transposed dataframe. If by chance the dataframe contains a column named ‘T’, the conversion will throw an error.
EDIT: It’s a feature! Pandas dataframes has a so-called accessor property which calls the transpose method.
Here’s a workaround for non-mixed data:
function pd_to_df(df_pd)
colnames = Symbol.(df_pd.columns)
df = DataFrame(Any[pdf.values[:, i] for i in 1:size(pdf.values, 2)], colnames)
end
and here’s another workaround that can handle columns named T and different column types (not sure, whether it is the most clever solution, though …)
function pd_to_df(df_pd)
colnames = convert(Vector{Symbol}, df_pd.columns)
:T in colnames && df_pd.rename(columns=Dict(:T=>:T_no_transpose), inplace=true)
vv = (Vector{T} where T)[]
for c in replace(colnames, :T => :T_no_transpose)
v = df_pd[c].values
if v isa PyObject
T = Vector{typeof(df_pd[c][1])}
v = convert(T, df_pd[c])
elseif !isa(df_pd[c][1], PyObject)
v = [df_pd[c][i] for i in 1:length(df_pd[c])]
elseif v[1] isa PyObject
PyAny.(v)
else
v
end
push!(vv, v)
end
df = DataFrame(vv, colnames)
:T in colnames && df_pd.rename(columns=Dict(:T_no_transpose => :T), inplace=true)
df
end
3 Likes
I saw that this attracted some attention. I’d like add some comments
- replacing
df_pd[c]
by get(df_pd, c)
removes the need of renaming
- defining
colnames = PyAny.(df_pd.columns.values)
makes it possible to use indexing by other key types than strings, e.g. Int
Here’s the modified conversion function:
function pd_to_df(df_pd)
colnames = PyAny.(df_pd.columns.values)
vv = Vector[]
for c in colnames
col = get(df_pd, c)
v = col.values
if v isa PyObject
T = Vector{typeof(col[1])}
v = convert(T, col)
elseif !isa(col[1], PyObject)
v = [col[i] for i in 1:length(col)]
elseif v[1] isa PyObject
PyAny.(v)
else
v
end
push!(vv, v)
end
DataFrame(vv, Symbol.(colnames))
end
Integer-based indexing is also possible with the following definitions:
slice(args...) = pycall(pybuiltin("slice"), PyObject, args...)
slice(u::UnitRange) = slice(u.start - 1, u.stop)
slice(sr::StepRange) = slice(sr.start - 1, sr.stop, sr.step)
slice(::Colon) = slice(nothing)
slice(i::Integer) = i - 1
import Base.getindex
getindex(p::PyObject, i::Union{<:Integer, Colon, UnitRange, StepRange}...) = get(p, slice.(i))
Then you can do
pdf = pd.DataFrame(Dict(:a => [1,2], 0 => [2, 3], :T => [4, "5"]))
pdf.iloc[:,2]
anything wrong with this?
julia> using PyCall, CSV, DataFrames
julia> pyb = pyimport("pybaseball")
julia> pdf = pyb.statcast(start_dt="2017-06-24", end_dt="2017-06-27", parallel=false) # it crashed for me if it was true
julia> df = CSV.File(IOBuffer(pdf.to_csv())) |> DataFrame
15949×93 DataFrame
Row │ Column1 pitch_type game_date release_speed release_pos_x release_pos_z player_name batter pitcher events descript ⋯
│ Int64 String3? Date Float64? Float64? Float64? String31 Int64 Int64 String31? String31 ⋯
───────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 3323 CU 2017-06-27 79.7 -1.34 5.41 Bush, Matt 608070 456713 field_out hit_into ⋯
2 │ 3485 FF 2017-06-27 98.1 -1.35 5.42 Bush, Matt 429665 456713 field_out hit_into
3 │ 3622 FF 2017-06-27 98.3 -1.29 5.45 Bush, Matt 429665 456713 missing ball
4 │ 3637 FC 2017-06-27 90.9 -1.46 5.25 Bush, Matt 429665 456713 missing called_s
5 │ 3784 SI 2017-06-27 99.1 -1.37 5.34 Bush, Matt 488726 456713 single hit_into ⋯
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
15946 │ 3978 FS 2017-06-24 82.6 -1.73 6.38 Faria, Jake 452234 607188 missing ball
15947 │ 4026 SL 2017-06-24 85.9 -1.63 6.37 Faria, Jake 452234 607188 missing swinging
15948 │ 4173 FF 2017-06-24 91.9 -1.47 6.59 Faria, Jake 452234 607188 missing called_s
15949 │ 4244 FF 2017-06-24 92.4 -1.57 6.57 Faria, Jake 452234 607188 missing ball ⋯
83 columns and 15940 rows omitted
3 Likes
It looks like we can get most of the way with the new PythonCall.jl package with a Tables.jl compatible interface as well:
(@v1.7) pkg> activate --temp
Activating new project at `/tmp/jl_cjhpZ1`
(jl_cjhpZ1) pkg> add CondaPkg
(jl_cjhpZ1) julia> using CondaPkg
(jl_cjhpZ1) pkg> conda add --pip pybaseball
(jl_cjhpZ1) pkg> add PythonCall
(jl_cjhpZ1) julia> using PythonCall # Should auto resolve and add pybaseball
(jl_cjhpZ1) julia> @py import pybaseball as pyb
This is a large query, it may take a moment to complete
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00, 1.29s/it]
Python DataFrame:
pitch_type game_date release_speed ... spin_axis delta_home_win_exp delta_run_exp
1206 SL 2017-06-26 83.8 ... 142 0.001 -0.416
1227 FF 2017-06-26 92.7 ... 198 0.0 0.0
1278 SL 2017-06-26 83.1 ... 99 0.0 0.087
1308 SL 2017-06-26 84.4 ... 124 0.0 0.0
1324 SL 2017-06-26 83.6 ... 130 0.0 0.0
... ... ... ... ... ... ... ...
3785 FF 2017-06-24 91.8 ... 182 0.022 -0.216
3978 FS 2017-06-24 82.6 ... 256 0.0 0.043
4026 SL 2017-06-24 85.9 ... 119 0.0 -0.062
4173 FF 2017-06-24 91.9 ... 192 0.0 -0.046
4244 FF 2017-06-24 92.4 ... 193 0.0 0.036
[11434 rows x 92 columns]
(jl_cjhpZ1) julia> df_py = pyb.statcast(start_dt="2017-06-24", end_dt="2017-06-26")
(jl_cjhpZ1) julia> tbl = PyTable(df_py)
11434×92 PyPandasDataFrame
pitch_type game_date release_speed ... spin_axis delta_home_win_exp delta_run_exp
1206 SL 2017-06-26 83.8 ... 142 0.001 -0.416
1227 FF 2017-06-26 92.7 ... 198 0.0 0.0
1278 SL 2017-06-26 83.1 ... 99 0.0 0.087
1308 SL 2017-06-26 84.4 ... 124 0.0 0.0
1324 SL 2017-06-26 83.6 ... 130 0.0 0.0
... ... ... ... ... ... ... ...
3785 FF 2017-06-24 91.8 ... 182 0.022 -0.216
3978 FS 2017-06-24 82.6 ... 256 0.0 0.043
4026 SL 2017-06-24 85.9 ... 119 0.0 -0.062
4173 FF 2017-06-24 91.9 ... 192 0.0 -0.046
4244 FF 2017-06-24 92.4 ... 193 0.0 0.036
[11434 rows x 92 columns]
I think this can usually be converted to a DataFrame by just doing:
(jl_cjhpZ1) julia> using DataFrames
(jl_cjhpZ1) julia> df = DataFrame(tbl)
but here it looks to throw an out of bounds datetime error that may be related to this: python - pandas out of bounds nanosecond timestamp after offset rollforward plus adding a month offset - Stack Overflow
Sorry if this is the wrong place to ping you @cjdoris, but would this be something that could (or should) be handled on PythonCall.jl’s end in its datetime conversions?
2 Likes
I’ve fixed that issue on the main branch, your example works now. I also tweaked conversion rules for missing data so the column types are more useful now.
2 Likes
Wow, thanks for the quick fix!!
(jl_kYnoWZ) julia> df = DataFrame(tbl)
11434×92 DataFrame
Row │ pitch_type game_date release_speed release_pos_x release_pos_z player_name batter pitcher events descrip ⋯
│ String? DateTime Float64? Float64? Float64? String Int64 Int64 String? String ⋯
───────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ SL 2017-06-26T00:00:00 83.8 -1.43 5.61 Goody, Nick 518902 580792 strikeout swingin ⋯
2 │ FF 2017-06-26T00:00:00 92.7 -0.94 5.75 Goody, Nick 518902 580792 missing foul
3 │ SL 2017-06-26T00:00:00 83.1 -1.58 5.64 Goody, Nick 518902 580792 missing blocked
I agree, PythonCall is great! And I use it for most of my cases since I discovered it.
There is still a point where also PythonCall fails; when the columns of the DataFrames are not strings. But this can be easily worked around:
julia> using PythonCall, DataFrames
julia> @py import pandas as pd
julia> pdf = pd.DataFrame(pydict(Dict(:a => [1,2], 0 => [2, 3], :T => [4, "5"])))
Python DataFrame:
a 0 T
0 1 2 4
1 2 3 5
julia> DataFrame(pdf)
ERROR: name of column '<py NULL>' is not a string
Stacktrace:
[1] error(s::String)
@ Base .\error.jl:33
[2] keys(df::PyPandasDataFrame)
@ PythonCall C:\Users\<xxx>\.julia\packages\PythonCall\XgP8G\src\pywrap\PyPandasDataFrame.jl:40
[3] columns(df::PyPandasDataFrame)
@ PythonCall C:\Users\<xxx>\.julia\packages\PythonCall\XgP8G\src\pywrap\PyPandasDataFrame.jl:123
[4] columns(x::Py)
@ PythonCall C:\Users\<xxx>\.julia\packages\PythonCall\XgP8G\src\pywrap\PyTable.jl:17
[5] DataFrame(x::Py; copycols::Nothing)
@ DataFrames C:\Users\<xxx>\.julia\packages\DataFrames\MA4YO\src\other\tables.jl:58
[6] DataFrame(x::Py)
@ DataFrames C:\Users\<xxx>\.julia\packages\DataFrames\MA4YO\src\other\tables.jl:49
[7] top-level scope
@ REPL[6]:1
Solution:
julia> pdf2 = pdf.copy();
julia> pdf2.columns = string.(pdf2.columns);
julia> DataFrame(pdf2)
2×3 DataFrame
Row │ a 0 T
│ Int64 Int64 Any
─────┼───────────────────
1 │ 1 2 4
2 │ 2 3 5
Maybe I should file an issue …
PythonCall works great. Fast, no copying. Supports more types than the original PyCall.
For numpy backed pandas DataFrame wrapped around PyTable, you can access the columns directly, no copying.
1 Like
I found you need to wrap PyArray around the columns still in Py object form, then there is no copying. Wrapping in PyTable (which implements Table.jl interface) will result in large memory allocations.