DataFrame vs. Pandas (vs. Excel...), e.g. to refer to previous row

Hi,

A.
Is there a tutorial that covers not just DataFrames but the whole ecosystem around it, e.g. for those coming from Python/Pandas?

Or maybe just some cheatsheet?

B.
I have a specific problem. In Excel it’s simple to refer to a previous row in a formula [EDIT: e.g. I’m doing year-on-year growth], and I didn’t want to just give up and try out Pandas. But it looks like I need to copy a column (I know how), then shift it down (I’m not sure how, or whether there’s a function available like shift in Pandas), and then calculate a new column based on the copy and the original.

I realized that I can just use Pandas.jl. It felt like cheating, but maybe that’s just what people do? Or am I overlooking some function in DataFramesMeta, or some window-function package?

C.
Even just knowing which packages Pandas roughly maps to would be helpful; I have a feeling it may be Query.jl and more. I just know Pandas is popular in the Python world and, like many of their packages, big, while in Julia land packages are smaller and need to be used together.

I think the docs for DataFrames are pretty good, and there’s a wikibook with a more tutorial-like feel for DataFrames that I think stays updated. With respect to the ecosystem, I assume you’re talking about something like Query.jl, which I don’t use much, but knowing the people involved, I imagine the docs there are pretty good as well. I’m not aware of cheat sheets like those that you describe, though I always found pandas bewildering and DataFrames much clearer, so I wouldn’t ever have gone searching for such a thing. If it doesn’t exist, someone should do it for sure!

As to (B), it’s quite challenging to know how to help without a better description of the problem, preferably with a MWE (see here for some more tips on how to write up your question in a way that will make it easier for us to help). It’s easy to refer to a previous row if you know the index of the current row (df[i-1, :]), but not knowing if you’re doing something in a loop, or want a function that operates on a row, or what, it’s tough to give more guidance.

For (C), I think DataFrames is the best analogue to pandas. Lots of people like Query (I think that’s more analogous to dplyr, though as I said I don’t use it much). DataFramesMeta was also quite popular, though my impression is that recent API changes to DataFrames itself are making that package increasingly obsolete. And there are a bunch of others.

On a side note - There’s also no harm in using Pandas.jl if that’s what you’re comfortable with! Do what works, I say. When you find something that doesn’t work there, or is clunky, and you want to branch out, definitely come here for help :wink:


Thanks for answering. It’s basically “YoY growth” or a lag operator. I don’t have an MWE or any code for it yet, or I wouldn’t be asking (though in Excel the formula is “=O17/O5-1”). I didn’t want to write a loop; it seemed like there should already be a declarative way or a function for this. And yes, there is shift in Pandas.

I can see how I can do something like “i-1”, or in my case “i-12”, with a loop (and need to avoid referring out of bounds, starting one year into the dest column).
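That loop could look roughly like the following. It is only a minimal sketch in plain Julia, assuming a 1-based vector of monthly values; `sales`, `yoy_growth` and `periods` are names made up for illustration, and the data is invented.

```julia
# Made-up monthly series: 24 months of values, purely for illustration.
sales = collect(100.0:123.0)

# Year-on-year growth via an explicit loop (hypothetical helper).
# Rows without a value `periods` rows earlier stay `missing`,
# which avoids indexing out of bounds at the start of the column.
function yoy_growth(v::Vector{<:Real}, periods::Integer = 12)
    out = Vector{Union{Missing, Float64}}(missing, length(v))
    for i in (periods + 1):length(v)
        # Same arithmetic as the Excel formula =O17/O5-1
        out[i] = v[i] / v[i - periods] - 1
    end
    return out
end

growth = yoy_growth(sales)
```

The first 12 entries of `growth` are `missing`; from row 13 on, each entry compares a month with the same month one year earlier.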

As an example for a tutorial: DataFrames doesn’t have a shift function, but Pandas.jl does, and the two can be used together.

In Python/Pandas where this works:

df.shift(periods=1)

in Julia you can do:

using Pandas

df = shift(df, periods=1)

Note that in Julia df is not changed unless it is assigned to, as above. (The same holds in Python: pandas’ shift returns a new DataFrame rather than modifying df in place.) [Maybe a mutating shift! would still be worth adding to the wrapper?]

In my case, shift is a means to an end, not what I actually wanted to do in full. I can expand the existing DF tutorial with such, but if I’m overlooking a way to do this in DF without the help of Pandas, please let me know. Might still be good to have at least a pointer to Pandas in the DF tutorial.

ShiftedArrays has a nice lag function that works nicely with DataFrames. I think there was a movement to put it into the package directly, but not sure if it happened. Anyway, that’s what I use (extensively!).
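For the year-on-year case, that might look something like this. It is a sketch under assumptions: the column names and data are made up, and `lag` is called with its module prefix because whether it is exported may depend on the ShiftedArrays version.

```julia
using DataFrames
import ShiftedArrays

# Made-up monthly data: 24 rows, purely for illustration.
df = DataFrame(month = 1:24, sales = collect(100.0:123.0))

# lag(v, 12) shifts the column down by 12 rows and pads the start with `missing`.
# The broadcast is the same arithmetic as the Excel formula =O17/O5-1,
# and `missing` propagates through it for the first year.
df.yoy = df.sales ./ ShiftedArrays.lag(df.sales, 12) .- 1
```

No loop or manual bounds check is needed here, since the padded `missing` values take care of the first 12 rows.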


Quoting from the thing I linked above:

The stuff you put is a start, but because I don’t know what your df is, it can’t run. Sometimes, it’s as easy as doing df = DataFrame(a=repeat([:x,:y], 5), b=rand(10)). But it’s important to have code that runs, and a clear indication of what you want.


This may be helpful for you:
