DataFrame vs. Pandas (vs. Excel...), e.g. to refer to previous row

Palli · April 23, 2020, 4:18pm

Hi,

A.
Is there any tutorial covering not just DataFrames, but the whole ecosystem around it, e.g. for those coming from Python/Pandas?

Or maybe just some cheatsheet?

B.
I have a specific problem. In Excel it’s simple to refer to a previous row in a formula [EDIT: e.g. as I’m doing for year-on-year growth]. I didn’t want to just give up and fall back on Pandas, but it looks like I need to copy a column (I know how), then shift it down (I’m not sure how, or whether there’s a function like Pandas’ shift available), and then calculate a new column based on the copy and the original.

I realized that I can just use Pandas.jl, and it also felt like cheating, but maybe that’s just what people do? Or am I just overlooking some function in DataFramesMeta, or some window function package?

C.
Even just knowing what packages Pandas roughly maps to would be helpful; I have a feeling it may be Query.jl and more too. I just know Pandas is popular in the Python world and, like many of their packages, big, while in Julia land packages are smaller and need to be used together.

kevbonham · April 23, 2020, 4:36pm

I think the docs for DataFrames are pretty good, and there’s a wikibook with a more tutorial-like feel for DataFrames that I think stays updated. w/r/t the ecosystem, I assume you’re talking about something like Query.jl, which I don’t use much, but knowing the people involved, I imagine the docs there are pretty good as well. I’m not aware of cheat sheets like those that you describe, though I always found pandas bewildering and DataFrames much clearer, so I wouldn’t ever have gone searching for such a thing. If it doesn’t exist, someone should do it for sure!

As to (B), it’s quite challenging to know how to help without a better description of the problem, preferably with a MWE (see here for some more tips on how to write up your question in a way that will make it easier for us to help). It’s easy to refer to a previous row if you know the index of the current row (df[i-1, :]), but not knowing if you’re doing something in a loop, or want a function that operates on a row, or what, it’s tough to give more guidance.

For (C), I think DataFrames is the best analogue to Pandas. Lots of people like Query (I think that’s more analogous to dplyr, though as I said I don’t use it much). DataFramesMeta was also quite popular, though my impression is that recent API changes to DataFrames itself are making that package increasingly obsolete. And there are a bunch of others.

On a side note: there’s also no harm in using Pandas.jl if that’s what you’re comfortable with! Do what works, I say. When you find something that doesn’t work there, or is clunky, and you want to branch out, definitely come here for help.

Palli · April 23, 2020, 4:54pm

Thanks for answering. It’s basically “YoY growth”, or a lag operator. I don’t have an MWE or any code for it yet, or I wouldn’t be asking (I take that back: in Excel the formula is “=O17/O5-1”). I didn’t want to write a loop; it seemed like there should already be a declarative way or a function for this. And yes, there is shift in Pandas.

I can see how I can do something like “i-1”, or in my case “i-12”, with a loop (and need to avoid referring out of bounds, starting one year into the dest column).
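The loop described above could be sketched in plain Julia as follows. This is only an illustration under assumptions: the data is made up, and `monthly` and `yoy_growth` are hypothetical names, not functions from DataFrames or any other package.

```julia
# Hypothetical data: 24 monthly values growing 1% per month.
monthly = [100.0 * 1.01^(m - 1) for m in 1:24]

# Year-on-year growth, x[i] / x[i - lag] - 1 (the Excel formula =O17/O5-1).
# The first `lag` entries have no earlier value to compare with, so they stay `missing`.
function yoy_growth(x::AbstractVector{<:Real}, lag::Integer = 12)
    out = Vector{Union{Missing, Float64}}(missing, length(x))
    for i in (lag + 1):length(x)   # start one year in, so i - lag is never out of bounds
        out[i] = x[i] / x[i - lag] - 1
    end
    return out
end

growth = yoy_growth(monthly)
```

Here `growth[13]` is the first non-missing entry; each later entry compares a month with the same month one year earlier.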

Palli · April 23, 2020, 5:25pm

As an example for a tutorial: DataFrames doesn’t have a shift function, but Pandas.jl has one, and the two can be used together.

In Python/Pandas, where this works:

```python
df.shift(periods=1)
```

in Julia you can do:

```julia
using Pandas

df = shift(df, periods=1)
```

Note: as with df.shift in Python, which returns a new DataFrame rather than modifying df, in Julia df is not changed unless assigned to, as above. [Maybe an in-place shift! should also be implemented in the wrapper?]

In my case, shift is a means to an end, not all of what I actually wanted to do. I can expand the existing DF tutorial with such an example, but if I’m overlooking a way to do this in DF without the help of Pandas, please let me know. It might still be good to have at least a pointer to Pandas in the DF tutorial.

tbeason · April 23, 2020, 5:39pm

ShiftedArrays has a nice lag function that works nicely with DataFrames. I think there was a movement to put it into the package directly, but not sure if it happened. Anyway, that’s what I use (extensively!).
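A short sketch of how that might look for year-on-year growth, assuming the DataFrames and ShiftedArrays packages are installed. The column names and data are invented for illustration, and `lag` is imported explicitly because, depending on the ShiftedArrays version, it may not be exported.

```julia
using DataFrames
using ShiftedArrays: lag

# Hypothetical monthly data: 24 values growing 1% per month.
df = DataFrame(month = 1:24, value = [100.0 * 1.01^(m - 1) for m in 1:24])

# lag(v, 12) shifts the column down by 12 rows; rows without a value
# 12 months earlier are filled with `missing` (the default).
df.yoy = df.value ./ lag(df.value, 12) .- 1
```

The first 12 entries of `df.yoy` are `missing`, which sidesteps the out-of-bounds problem without an explicit loop.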

kevbonham · April 23, 2020, 5:39pm

Quoting from the thing I linked above:

Please read: make it easier to help you

Post quoted code by enclosing code blocks in triple-backticks ````` :
```julia
function f(x, y)
    x + y
end
```
…

Do your best to make your example self-contained (“minimal working example”, MWE ), so that it runs (or gets to the error that you want help with) as is. This means including package loading (e.g. using ThatPackage ) and any data that the code operates on. If your data is large or proprietary, generate example data if possible and include that.

Simplify your code to the smallest piece of code that still shows your problem. This step takes the most effort but is the most important for fixing your problem. Short, simple examples tend to get answers quickly.

The stuff you put is a start, but because I don’t know what your df is, it can’t run. Sometimes it’s as easy as doing `df = DataFrame(a=repeat([:x,:y], 5), b=rand(10))`. But it’s important to have code that runs, and a clear indication of what you want.

lungben · April 23, 2020, 7:30pm

This may be helpful for you:

