[ANN] A new lightning fast package for data manipulation in pure Julia

Oscar_Smith · March 29, 2022, 3:43pm

Julia is generally very memory efficient. It has around 200 MB of overhead to launch, but it gives you very good tools to write memory-efficient data manipulation code. (The one exception is that String currently has unfortunately high overhead for small strings.)

StefanKarpinski · March 29, 2022, 6:33pm

I want to point out to observers of this thread that there seems to be some funny business going on here and in other discussions related to InMemoryDatasets. Specifically, there appear to be a sizable number of people in these threads who are masking their location and there are indications that they may be coming from a common location/network. This does not appear to be straight up sock puppetry (a la Henning Rousseau)—posters seem to (mostly) actually be different people—but there does seem to be some sort of hidden agenda here. My guess is that the goal is to make it appear that there is more widespread dissatisfaction with DataFrames in the Julia community than there actually is. I feel I have to post this warning so that participants in the conversations are not taken in by any deception.

To those people in the thread who are doing this—first of all, welcome! You mostly appear to be new to the Julia community and you come bearing code, which is great! However, please consider taking a different approach here. First, stop trying to start flame wars with DataFrames developers—that is not cool. Also, please stop trying to appear to be independent people who just happen to all be fed up with DataFrames. It’s fine if you all work together and are collectively frustrated with DataFrames. That’s absolutely ok—just don’t be deceptive about it. It is also totally fine to have developed a fork of DataFrames and compete with it. The MIT license very much allows that and if you think you can do better, by all means, give it a try.

(One thing that does need to be fixed is the license copyright notice: InMemoryDatasets appears to be derived from the DataFrames code base and the MIT license does require keeping the copyright notice intact, so if you could fix that, that would put the project on legally sound footing.)

Assuming that I’m correct about the “company that is collectively interested in improving on DataFrames” interpretation of what’s going on, my suggestion would be: take a beat, reset the conversation, be direct about working together and that you have created and are promoting InMemoryDatasets as an open source alternative to DataFrames. Maybe I’m wrong about my interpretation and if so, feel free to let me know here or privately what’s actually going on.

johnmyleswhite · March 29, 2022, 7:24pm

I think it would be good to give people a shot to give alternative explanations since I can imagine several hypothetical reasons for this beyond a desire to seem larger in size than appropriate.

StefanKarpinski · March 29, 2022, 8:14pm

Yeah, I’m absolutely open to something else being up here—feel free to DM me—but something odd is going on and it seemed like people ought to be aware.

StefanKarpinski · March 29, 2022, 8:45pm

A post was split to a new topic: Julia PR team?

rafael.guerra · March 29, 2022, 8:48pm

The new InMemoryDatasets users have used anything but “subtle ways”.

StefanKarpinski · March 29, 2022, 9:07pm

I really want to make sure that this does not become accusatory and that we don’t pile on. Nothing that has been done is terrible and it’s really exciting to have people interested enough in data wrangling in Julia to take a crack at a new package like this. It’s a lot of work and it’s generously contributed for anyone to use. Who amongst us hasn’t gotten a little vehement in our defense of Julia against its alternatives? Pointing out differences between similar packages can surely easily go the same way without ill intentions. We’re not sure what the motivation is for the funky accounts, but please, please, please let’s give people the benefit of the doubt.

jar1 · March 29, 2022, 9:11pm

Perhaps it’s best to leave it at that (just the facts) and avoid speculating about intentions. That way anybody from the relevant group has an opportunity to explain if they choose, without feeling defensive.

goerch · March 29, 2022, 9:20pm

Yeah, but allow me one remark: I don’t want to read any more posts about the benefits of competition vs. cooperation (which we could discuss elsewhere). If we are going Darwinian in an open source forum, I’m out.

monopolynomial · March 29, 2022, 10:16pm

I think these kinds of posts shouldn’t be included in a public discussion, because they do more harm to the community than good. Including the author, there are fewer than 30 people involved in this topic, and calling that sizable is a little rash. I also searched for topics about InMemoryDatasets and have found 7 so far, one of which is this announcement and two of which are mine.

goerch · March 29, 2022, 10:47pm

Sorry to object: I’m thankful for this kind of transparency.

On the one hand, I can imagine the frustration of developers like @sl-solution at having improvement PRs rejected by mature libraries for compatibility reasons, which leads to these kinds of new developments.

On the other hand this is a bad sign, which needs to be addressed:

StefanKarpinski · March 29, 2022, 10:57pm

This is a very commonly misunderstood detail of the MIT license. Seems like a mistake.

Mason · March 29, 2022, 11:12pm

Yeah, that’s a very common mistake that I’ve made myself before. I think so long as people are gracious about fixing these things when brought to their attention and it’s not some clear pattern of bad behaviour, it’s best that we all assume that MIT license violations are accidental.

pursultani · March 29, 2022, 11:14pm

Awesome! Right package in the wrong language.

johnmyleswhite · March 29, 2022, 11:25pm

If you don’t mind: could you help me understand why you joined a Discourse forum for a language you think is bad exactly 1 minute before posting a comment about a very specific package? What was the specific chain of events that led to that happening? It seems like a remarkable coincidence.

Henrique_Becker · March 30, 2022, 1:42am

Sincerely, the most insulting thing about this comment is not the puerile attack on the language but how it genuinely underestimates the community’s efficiency at spotting a troll on sight.

pursultani · March 30, 2022, 3:43am

This comment is intended for the original author of this post, within a very specific context which I’m not obligated to share with you. Please don’t make any further assumptions.

I would appreciate it if you could point me to specific guidelines that ban such an “insult” to the language.

DNF · March 30, 2022, 5:18am

Who said it was banned?

Calling someone out for poor behavior can be done regardless of whether that behavior is specifically banned.

And poor behavior is not ok just because you have a hidden agenda, as you yourself admit.

lawless-m · March 30, 2022, 7:54am

https://discourse.julialang.org/guidelines

goerch · March 30, 2022, 8:03am

From the first reactions here I indeed conclude this to be a significant improvement. I’d also expect that the author put some thought into which language to use. So where do you think he misjudged?

Granted. But if you are not interested in discussing this publicly you could use Discourse PM instead.
