Converting Pandas Dataframe returned from PyCall to Julia DataFrame

It looks like we can get most of the way there with the new PythonCall.jl package, which also provides a Tables.jl-compatible interface:

(@v1.7) pkg> activate --temp
  Activating new project at `/tmp/jl_cjhpZ1`

(jl_cjhpZ1) pkg> add CondaPkg

(jl_cjhpZ1) julia> using CondaPkg

(jl_cjhpZ1) pkg> conda add --pip pybaseball

(jl_cjhpZ1) pkg> add PythonCall

(jl_cjhpZ1) julia> using PythonCall # Should auto resolve and add pybaseball

(jl_cjhpZ1) julia> @py import pybaseball as pyb

(jl_cjhpZ1) julia> df_py = pyb.statcast(start_dt="2017-06-24", end_dt="2017-06-26")
This is a large query, it may take a moment to complete
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.29s/it]
Python DataFrame:
     pitch_type  game_date  release_speed  ...  spin_axis  delta_home_win_exp delta_run_exp
1206         SL 2017-06-26           83.8  ...        142               0.001        -0.416
1227         FF 2017-06-26           92.7  ...        198                 0.0           0.0
1278         SL 2017-06-26           83.1  ...         99                 0.0         0.087
1308         SL 2017-06-26           84.4  ...        124                 0.0           0.0
1324         SL 2017-06-26           83.6  ...        130                 0.0           0.0
...         ...        ...            ...  ...        ...                 ...           ...
3785         FF 2017-06-24           91.8  ...        182               0.022        -0.216
3978         FS 2017-06-24           82.6  ...        256                 0.0         0.043
4026         SL 2017-06-24           85.9  ...        119                 0.0        -0.062
4173         FF 2017-06-24           91.9  ...        192                 0.0        -0.046
4244         FF 2017-06-24           92.4  ...        193                 0.0         0.036

[11434 rows x 92 columns]

(jl_cjhpZ1) julia> tbl = PyTable(df_py)
11434×92 PyPandasDataFrame
     pitch_type  game_date  release_speed  ...  spin_axis  delta_home_win_exp delta_run_exp
1206         SL 2017-06-26           83.8  ...        142               0.001        -0.416
1227         FF 2017-06-26           92.7  ...        198                 0.0           0.0
1278         SL 2017-06-26           83.1  ...         99                 0.0         0.087
1308         SL 2017-06-26           84.4  ...        124                 0.0           0.0
1324         SL 2017-06-26           83.6  ...        130                 0.0           0.0
...         ...        ...            ...  ...        ...                 ...           ...
3785         FF 2017-06-24           91.8  ...        182               0.022        -0.216
3978         FS 2017-06-24           82.6  ...        256                 0.0         0.043
4026         SL 2017-06-24           85.9  ...        119                 0.0        -0.062
4173         FF 2017-06-24           91.9  ...        192                 0.0        -0.046
4244         FF 2017-06-24           92.4  ...        193                 0.0         0.036

[11434 rows x 92 columns]

I think this can usually be converted to a DataFrame by just doing:

(jl_cjhpZ1) julia> using DataFrames

(jl_cjhpZ1) julia> df = DataFrame(tbl)

but here it appears to throw an out-of-bounds datetime error, which may be related to this Stack Overflow question: "pandas out of bounds nanosecond timestamp after offset rollforward plus adding a month offset".
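For context on where such a bound comes from: pandas' default datetime columns use nanosecond resolution stored in a 64-bit integer, which only covers roughly 1677-09-21 to 2262-04-11. Below is a minimal sketch of that range using only Julia's Dates standard library. It assumes (unconfirmed) that the failure here has the same cause as in the linked question; `ns_to_datetime` and `fits_in_ns` are hypothetical helper names for illustration, not part of PythonCall.jl or pandas.

```julia
using Dates

# Assumption: the failing column is backed by a nanosecond-resolution datetime,
# i.e. each value is an Int64 count of nanoseconds since the Unix epoch.
const UNIX_EPOCH = DateTime(1970, 1, 1)

# Julia's DateTime has millisecond resolution, so truncate the nanoseconds.
ns_to_datetime(ns::Int64) = UNIX_EPOCH + Millisecond(div(ns, 1_000_000))

# Bounds of the Int64 nanosecond range. (typemin(Int64) itself is reserved
# for NaT, which makes no difference at millisecond resolution.)
const NS_MIN = ns_to_datetime(typemin(Int64))
const NS_MAX = ns_to_datetime(typemax(Int64))

# Hypothetical helper: true if `dt` fits in a nanosecond-resolution timestamp.
fits_in_ns(dt::DateTime) = NS_MIN <= dt <= NS_MAX

println("representable range: ", NS_MIN, " to ", NS_MAX)
println(fits_in_ns(DateTime(2017, 6, 24)))  # a date from the query above
println(fits_in_ns(DateTime(3000, 1, 1)))   # far outside the range
```

The dates in this query are well inside that range, so if the error really is an out-of-bounds one, a sentinel or missing value in one of the datetime columns seems a more likely culprit than the game dates themselves, but that is a guess.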

Sorry if this is the wrong place to ping you @cjdoris, but would this be something that could (or should) be handled on PythonCall.jl’s end in its datetime conversions?
