A.
Is there any tutorial covering not just DataFrames but the whole ecosystem around it, e.g. for those coming from Python/Pandas?
Or maybe just some cheatsheet?
B.
I have a specific problem. In Excel it’s simple to refer to a previous row in a formula [EDIT: e.g. as I’m doing for year-on-year growth]. I didn’t want to just give up and switch to Pandas, but it looks like I need to copy a column (I know how), shift the copy down (I’m not sure how, or whether there’s a function like Pandas’ shift available), and then calculate a new column from the copy and the original.
I realized that I can just use Pandas.jl, which felt like cheating, but maybe that’s just what people do? Or am I overlooking some function in DataFramesMeta, or some window-function package?
C.
Even just knowing what packages Pandas roughly maps to would be helpful; I have a feeling it may be Query.jl and more. I just know Pandas is popular in the Python world and, like many of their packages, big, while in Julia land packages are smaller and need to be used together.
I think the docs for DataFrames are pretty good, and there’s a wikibook with a more tutorial-like feel for DataFrames that I think stays updated. w/r/t the ecosystem, I assume you’re talking about something like Query.jl, which I don’t use much, but knowing the people involved, I imagine the docs there are pretty good as well. I’m not aware of cheat sheets like those that you describe, though I always found pandas bewildering and DataFrames much clearer, so I wouldn’t ever have gone searching for such a thing. If it doesn’t exist, someone should do it for sure!
As to (B), it’s quite challenging to know how to help without a better description of the problem, preferably with a MWE (see here for some more tips on how to write up your question in a way that will make it easier for us to help). It’s easy to refer to a previous row if you know the index of the current row (df[i-1, :]), but not knowing if you’re doing something in a loop, or want a function that operates on a row, or what, it’s tough to give more guidance.
For (C), I think DataFrames is the best analogue to Pandas. Lots of people like Query (I think that’s more analogous to dplyr, though as I said I don’t use it much). DataFramesMeta was also quite popular, though my impression is that recent API changes to DataFrames itself are making that package increasingly obsolete. And there are a bunch of others.
On a side note: there’s also no harm in using Pandas.jl if that’s what you’re comfortable with! Do what works, I say. When you find something that doesn’t work there, or is clunky, and you want to branch out, definitely come here for help.
Thanks for answering. It’s basically “YoY growth”, or a lag operator. I don’t have an MWE or any code for it yet, or I wouldn’t be asking (though I take that back in part: in Excel the formula is “=O17/O5-1”). I didn’t want to write a loop; it seemed like there should be a declarative way or an existing function. And yes, there is shift in Pandas.
I can see how I can do something like “i-1”, or in my case “i-12”, with a loop (and need to avoid referring out of bounds, starting one year into the dest column).
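For reference, the loop described above might look like the following minimal sketch in plain Julia, with no packages. The series `sales` and the function name `yoy_growth` are made up for illustration; the calculation mirrors the Excel formula “=O17/O5-1” with a 12-row lag. The last two lines show a loop-free variant that builds the shifted copy by hand.

```julia
# Hypothetical monthly series: 24 values, i.e. two years of data.
sales = collect(100.0:1.0:123.0)

# Year-on-year growth with a 12-row lag, mirroring the Excel formula =O17/O5-1.
# The first `lag` entries have no value one year earlier, so they stay `missing`.
function yoy_growth(x::AbstractVector{<:Real}; lag::Int = 12)
    growth = Vector{Union{Missing, Float64}}(missing, length(x))
    for i in (lag + 1):length(x)   # start one year in, to stay in bounds
        growth[i] = x[i] / x[i - lag] - 1
    end
    return growth
end

growth = yoy_growth(sales)

# Loop-free variant: pad a shifted copy with `missing`, then broadcast.
lagged = [fill(missing, 12); sales[1:end-12]]
growth_broadcast = sales ./ lagged .- 1
```

Both versions leave the first 12 entries as `missing`, which plays the role of the empty cells Excel would show for the first year.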
As an example for a tutorial: DataFrames doesn’t have a shift function, but Pandas.jl does, and the two can be used together.
In Python/Pandas where this works:
df.shift(periods=1)
in Julia you can do:
using Pandas
df = shift(df, periods=1)
Note that shift returns a new DataFrame and leaves df unchanged unless the result is assigned, as above; as far as I know that is also how df.shift behaves in Python. [Maybe a mutating shift! should also be implemented in the wrapper?]
In my case, shift is a means to an end, not the whole of what I wanted to do. I can expand the existing DF tutorial with an example like this, but if I’m overlooking a way to do this in DF without the help of Pandas, please let me know. It might still be good to have at least a pointer to Pandas in the DF tutorial.
ShiftedArrays has a lag function that works nicely with DataFrames. I think there was a movement to put it into the package directly, but I’m not sure if it happened. Anyway, that’s what I use (extensively!).
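A rough sketch of how that could look for the year-on-year case, assuming DataFrames and ShiftedArrays are both installed. The data frame and its column names (`month`, `sales`, `sales_lag12`, `yoy`) are invented for illustration, and `lag` is written as `ShiftedArrays.lag` so the call does not depend on whether the installed version exports it.

```julia
using DataFrames
using ShiftedArrays

# Hypothetical data: 24 monthly observations.
df = DataFrame(month = 1:24, sales = collect(100.0:1.0:123.0))

# lag(v, 12) pairs each row with the value 12 rows earlier;
# the first 12 rows have nothing to refer to and become `missing`.
# `collect` turns the lazy shifted view into an ordinary Vector.
df.sales_lag12 = collect(ShiftedArrays.lag(df.sales, 12))

# Same calculation as the Excel formula =O17/O5-1, applied to the whole column.
df.yoy = df.sales ./ df.sales_lag12 .- 1
```

The lagged column is kept here only to make the intermediate step visible; the growth column could just as well be computed in one line from `df.sales` and the lagged view.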
The stuff you put is a start, but because I don’t know what your df is, it can’t run. Sometimes, it’s as easy as doing df = DataFrame(a=repeat([:x,:y], 5), b=rand(10)). But it’s important to have code that runs, and a clear indication of what you want.