Pandas equivalent library

Hi, I am a newcomer to Julia. Is there a library that provides functionality similar to pandas?
I have been struggling to read from Excel spreadsheets.

Thanks


Hi and welcome to Julia.
You’re probably looking for: https://github.com/JuliaData/DataFrames.jl


For reading Excel files, try https://github.com/felipenoris/XLSX.jl (I had good results for my files), https://github.com/queryverse/ExcelReaders.jl (uses Python), or https://github.com/aviks/Taro.jl (uses Java). All of those can return DataFrames, which you can use to work with the data.


Thanks for the suggestions

There is already a Julia version of Pandas:
https://github.com/JuliaPy/Pandas.jl

no, that’s a wrapper, just use DataFrames.jl ffs


“Just use DataFrames” is not good advice. There is substantial functionality that Pandas has that DataFrames.jl does not and will never have. Anything related to time series for example. There is TimeSeries.jl, but it does not have support for the elaborate time handling and resampling that Pandas does. And it also does not support some of the features of DataFrames without a conversion between the two.

This whole part of the ecosystem is a mess. Unlike with the more mathematical portions of the language, you don’t have a nice type hierarchy with good generic functions that all interact in nice ways.

Julia is more than capable of having a suite of libraries that combine to make something far better than Pandas, but right now, it does not.

Please ask about this on a sample-by-sample basis and the community will help you.

Funny you should say that, for me one of the biggest boons of working in Julia is that I can just use DataFrames with the (imho excellent) Dates standard library for all my dated data, which was a refreshing change from the issues I always ran into when working with dates in Python - I’m the person that asked this question almost six years ago:

which is quite representative of the things I was fighting with frequently.

But in any case there’s generally little productive to be gained from fundamentalist discussions about whether some library/language is better than another. As @jling says it’s probably best to ask specific questions about elaborate time handling tasks which pandas can do that can’t be done in DataFrames to discuss what can be done about it.


Yes, datetime handling in Pandas is a complete mess. There is the Python datetime type (flexible but slow), the Pandas datetime type (nanosecond precision, which limits the representable range to roughly the years 1677 to 2262) and the possibility of using NumPy datetime types (with different precisions).
And in some cases it is not trivial to convert one to the other.
The Julia datetime handling is much better: one Date type and one DateTime type in the standard library, which can be used everywhere.
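
To make the "very limited date range" point concrete, here is a small back-of-the-envelope sketch in plain Python (standard library only, no Pandas). It assumes a timestamp is stored as a signed 64-bit count of nanoseconds since the Unix epoch; the names `EPOCH`, `MAX_NS` and `representable_bounds` are just illustrative, and the results are truncated to microseconds because that is all Python's own datetime can hold.

```python
from datetime import datetime, timedelta

# Assumption: timestamps are signed 64-bit nanosecond counts from 1970-01-01.
EPOCH = datetime(1970, 1, 1)
MAX_NS = 2**63 - 1  # largest value a signed 64-bit integer can hold


def representable_bounds():
    """Return (earliest, latest) datetimes, truncated to microseconds."""
    span = timedelta(microseconds=MAX_NS // 1000)
    return EPOCH - span, EPOCH + span


earliest, latest = representable_bounds()
print(earliest.year, latest.year)  # 1677 2262

# Roughly how many years fit on each side of the epoch.
years_each_way = MAX_NS / 1e9 / (365.25 * 24 * 3600)
print(round(years_each_way, 1))
```

So a nanosecond-resolution 64-bit timestamp covers a bit under 300 years on either side of 1970, which is why historical or long-horizon dates can be a problem.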

Another weakness of Pandas is the handling of missing data: there is only a float NaN and no equivalent for integers. pd.NA should change this, but it is still experimental and very recent, so it is not available in many client installations.
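
A rough illustration of why "only float NaN" hurts for integer data, again in plain Python with the standard library only (this is not Pandas code, and `int_has_nan` and `with_missing_as_float` are made-up helper names). The assumption is that a column with a gap has to be stored as 64-bit floats so the gap can be NaN.

```python
import math

nan = float("nan")


def int_has_nan():
    """Check whether NaN can be represented as an int (it cannot)."""
    try:
        int(nan)
    except ValueError:
        return False
    return True


def with_missing_as_float(values):
    """Mimic a float-NaN column: None becomes NaN, integers become floats."""
    return [nan if v is None else float(v) for v in values]


big = 2**53 + 1  # first integer a 64-bit float cannot represent exactly
col = with_missing_as_float([1, None, big])

print(int_has_nan())       # False
print(math.isnan(col[1]))  # True
print(int(col[2]) == big)  # False: the value changed on the way through float
```

In other words, one missing entry is enough to turn the integers into floats, and large values (IDs, for example) can silently change.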

See also this thread: Time-period-based time series moving windows in Julia?

I’ll try to interpret what I think @MaxHayden means:

Pandas (“PANel Data AnalysiS” - a kind of time-series analysis), being originally a finance library, has tons of built-in time-series functionality: Time Series / Date functionality — pandas 0.17.1 documentation

DataFrames.jl doesn’t and won’t have this built in, which is fine – it’s not a time-series library and multiple dispatch means it shouldn’t need to be built-in.

But TimeSeries.jl, which provides time-series functionality for tabular data,

  • is missing a number of tools from pandas, like upsampling

  • provides functions that don’t work directly on arbitrary Tables.jl-interface tables. It has to convert into its own custom TimeArray type first.

These are good points, imho. Having a variety of types, all with different tradeoffs, is a good thing (that’s why we have so many Array types), but it’s very inefficient (both for programmers and computers) if you have to convert between them from one operation to the next.


Well, I think you actually gave good advice about how to solve a problem by suggesting using Dates from the standard library.

My complaint was just about the nature of the response “just use X ffs” when X doesn’t solve the problem. It’s hard to find all of the relevant bits to do the thing. And “ffs” is not exactly a welcoming attitude.

I assume you could find all the parts to do all the things somehow or another, but the official docs don’t give you the same guidance for finding it.

As for specifics, pandas is well documented and I referenced specific features that it has that neither DataFrames nor TimeSeries has out of the box. If someone is searching for a Pandas replacement, that’s probably the stuff they are looking for.


Yes. What jzr is saying is correct. Maybe I could have said it better. But it’s certainly a documentation problem when you can’t do a quick search to find the right libraries and types.

Since this is what comes up on Google for it, maybe someone more knowledgeable can suggest a mix of libraries for doing those things or link to a good resource that does.

I think upsampling can probably be done with DataFrames.leftjoin , but we can nail it down better if you post a new topic with a specific example of Pandas time-series code that you want to reproduce in DataFrames.
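
To show the idea only, here is a language-agnostic sketch in plain Python (standard library only; it is not DataFrames.jl code, and `daily_grid` and `left_join_on_grid` are hypothetical helpers). It assumes observations every other day that should be upsampled to a daily grid: build the finer grid first, then left-join the observations onto it so that unobserved days come out as missing.

```python
from datetime import date, timedelta


def daily_grid(start, stop):
    """All dates from start to stop inclusive, one per day."""
    n_days = (stop - start).days
    return [start + timedelta(days=i) for i in range(n_days + 1)]


def left_join_on_grid(grid, observations):
    """Left join: keep every grid date, attach the value if one was observed."""
    return [(d, observations.get(d)) for d in grid]


# Observations every second day (made-up sample data).
observed = {
    date(2021, 1, 1): 1.0,
    date(2021, 1, 3): 2.0,
    date(2021, 1, 5): 3.0,
}

grid = daily_grid(date(2021, 1, 1), date(2021, 1, 5))
upsampled = left_join_on_grid(grid, observed)

for day, value in upsampled:
    print(day, value)  # unobserved days carry None
```

The gaps would then be filled in a second step (forward fill, interpolation, etc.), depending on what the analysis needs.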

Fwiw I agree with you about the tone of the comment and “ffs” not being a great way of conveying a point, although it should be seen in the context of someone bringing up Pandas.jl in response to a two year old question asking rather generically for a pandas equivalent in Julia, for which DataFrames.jl was already accepted as an answer (ironically that probably wasn’t a really good answer actually as the OP asked about reading from Excel, which is an example of something pandas does which DataFrames doesn’t do…)

In any case I’m glad to hear that using Dates already gets you some way towards filling in perceived gaps in DataFrames functionality. As @jzr says, leftjoining data onto a new DataFrame with a finer time resolution could be one way of upsampling, but again a more concrete question (ideally with an M(non-)WE and a desired output) would help with answering this more precisely. Apart from that you only referenced “elaborate time handling” as an example of what’s missing in DataFrames, but again a more concrete example would be helpful.

In my view these limited examples do not warrant the conclusion that “this whole part of the ecosystem is a mess”, but both DataFrames and Dates (within the constraints of being a standard library) are actively developed so I’m sure people would be happy to consider suggestions for improving any shortcomings.


I intend to start a new thread with some specific examples once I have some code for it. But I’m currently evaluating what is and isn’t worth trying on this go round. I’d really like to just unify everything inside of Julia instead of having it scattered between Python, R, and Fortran. But attempts with previous versions have shown that it would be too much work for just a rewrite. But every version of Julia is better and closer to what I need.

And this time, I’ve got some new Monte Carlo stuff I need to do that seems like a good fit for Julia. But I’m not sure if it’s worth putting much effort into just ingesting and preprocessing the data instead of actually solving the problem. So, I’ll probably put off the questions re: Pandas stuff until later in the project.

As for “mess”, it’s hard to find documentation and everything that comes up on Google gives bad advice to the tune of “use X which does not do the thing you asked about”. That’s messy. So is having incompatible types like DataFrame and TimeSeries that each solve part of the problem.

But the bottom line is that “give me a specific example” isn’t a good solution to the question of “what is the general workflow for financial panel data in Julia? What are the relevant libraries? And how good of a replacement for Pandas is it?”

I can’t literally provide you with examples of every function in the Pandas API any better than the Pandas docs. So unless someone has a guide or a tutorial or a conversion cheat sheet, I’m going to have to put in a lot of my own effort to figure it out and get things to the point where I can ask an intelligent specific question beyond, “this is the type of work I do, what are the tools and workflow?”

So, that’s probably going to get put on the back burner while I focus on solving a new problem and then having a more compelling reason to move the rest over beyond “it would sure be nice”.

This is a general difference in design principle between Pandas and DataFrames, but also between most Python and Julia packages:
Python packages are often rather large and include lots of functionality, including some outside the core purpose of the package. Julia packages, on the other hand, are more focused on their core functionality and rely on other packages and the composability of the language for peripheral functionality.

In particular, Pandas includes functionality for reading and writing different file formats, e.g. CSV, JSON, XLSX, SQL etc. DataFrames.jl does not include this functionality but instead utilizes external libraries via the Tables.jl interface.
The Julia approach has many advantages; in particular, it allows better extensibility and re-use of existing functionality. But it is slightly more difficult to use as a beginner: e.g. for reading a CSV file you have to install and load two libraries (CSV and DataFrames) and combine them with the right syntax, instead of just doing pd.read_csv(file). This particular example is quite well documented imho, but this is not the case for all common library combinations.
