You can define your own categorical and categorical! methods; they are essentially one-liners, so they are easy to add to your own code if you like them.
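For illustration, here is a rough sketch of what such one-liners could look like. The names to_categorical and to_categorical! are hypothetical (chosen so as not to add methods to functions owned by CategoricalArrays.jl), and the sketch assumes both DataFrames.jl and CategoricalArrays.jl are installed.

```julia
using DataFrames, CategoricalArrays

# Hypothetical helpers, not part of DataFrames.jl:
# convert the selected columns to categorical, keeping the column names.
to_categorical(df::AbstractDataFrame, cols) =
    transform(df, cols .=> categorical, renamecols = false)
to_categorical!(df::AbstractDataFrame, cols) =
    transform!(df, cols .=> categorical, renamecols = false)

df = DataFrame(a = ["x", "y", "x"], b = 1:3)
df2 = to_categorical(df, [:a])   # returns a new data frame, df is untouched
to_categorical!(df, :a)          # modifies df in place
```

Both versions accept a single column name or a vector of names, because broadcasting treats a Symbol as a scalar.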
We have just made the 0.22.6 patch release. Hopefully it is the last release before 1.0.
We decided to make the 0.22.6 release to deprecate some outstanding functionality that should be removed in 1.0 but was missed in the 0.22 release process. This functionality borders on being an error, but since it worked in the past, we decided to make a release that lets users go through a deprecation cycle (although we assume that the deprecated methods are most likely not used). You can read the release announcement here.
Here is a brief summary of the decisions that motivated the 0.22.6 release:
most of the convert methods for types defined in the DataFrames.jl package are deprecated, because they were inconsistent with the convert contract from Julia Base. They were left over from the old style in which constructors fell back to convert. The only conversions that remain are from DataFrameRow and GroupKey to NamedTuple and from SubDataFrame to DataFrame.
we have decided not to make AbstractDataFrame an iterator of rows. Still, some functions, like filter, will work as if it were an iterator of rows, for user convenience. If you need an iterator of rows, use the eachrow or Tables.namedtupleiterator functions. This has additionally led to the deprecation of map on GroupedDataFrame, as it currently produces a vector, while in the future we might want it to return some other object (for the same reason broadcasting over a GroupedDataFrame object is also disallowed; we have now made this consistent).
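A small sketch of the two points above, assuming DataFrames.jl and Tables.jl are both installed; the data frame and its column names are made up for the example.

```julia
using DataFrames, Tables

df = DataFrame(id = 1:3, name = ["a", "b", "c"])

# Explicit row iteration: eachrow yields DataFrameRow views into df
ids = [row.id for row in eachrow(df)]

# Tables.namedtupleiterator yields one NamedTuple per row
nts = collect(Tables.namedtupleiterator(df))

# The conversions that remain after the convert deprecations
nt = NamedTuple(df[1, :])    # DataFrameRow -> NamedTuple
sub = view(df, 1:2, :)       # a SubDataFrame
copied = DataFrame(sub)      # SubDataFrame -> DataFrame (a copy)
```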
Therefore we are heading towards the 1.0 release with three inconsistencies in the design that I am aware of (unfortunately, these have to be learned as exceptions):
filter accepts :col => predicate where predicate is applied per row, as opposed to all other places in the DataFrames.jl ecosystem, where it works on whole columns (we decided to leave this inconsistency for user convenience);
df[row, cols] produces a DataFrameRow, which is a view; normally a copy should be produced, but making a copy is inefficient, and changing this now would be breaking; we believe that leaving this inconsistency will not lead to bugs in code, and in practice it is more convenient than making a copy;
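To make these exceptions concrete, here is a minimal sketch (the data frame is made up for the example):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3, 4], b = [10, 20, 30, 40])

# filter: the predicate is called once per row, with a single value of :a
filtered = filter(:a => x -> x > 2, df)

# combine (like select and transform): the function receives the whole column
summed = combine(df, :a => sum)

# df[row, cols] is a view: writing through the DataFrameRow changes df
row = df[1, :]
row.a = 100
```

After the last line df.a[1] is 100, because row refers to the data stored in df rather than to a copy.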
Is it possible to have a filter that operates on whole columns? Or at least a function with the same functionality but a different name? If such a function could also accept a DataFrame as its first argument, it would be even better.
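For what it is worth, whole-column filtering can already be written with a Boolean mask and row indexing. Below is a sketch with a hypothetical helper, filter_by_column, which is not part of DataFrames.jl; it takes the data frame first, as asked.

```julia
using DataFrames

# Hypothetical helper: f receives the whole column and must return a Bool vector
filter_by_column(df::AbstractDataFrame, col, f) = df[f(df[!, col]), :]

df = DataFrame(a = [1, 5, 2, 8])

# Keep the rows where :a is above the column mean (the mean is 4 here)
above_mean = filter_by_column(df, :a, a -> a .> sum(a) / length(a))
```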
No. This is something we discussed, but we decided to leave all changes that involve multi-threading out of the 0.22 branch. I encourage you to use ] add DataFrames#main, as it should be safe to test.
Therefore you can treat ] add DataFrames#main as a way to beta test the package before the 1.0 release, which we would highly appreciate; any issues you report are welcome.
When testing, please keep two things in mind:
test with --depwarn=yes enabled to make sure you do not miss any deprecated functionality
also test with -t auto (or -t N) enabled, as we have added multi-threading support to DataFrames.jl
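As a sketch of how this setup could be checked from within Julia: start the session with julia --depwarn=yes -t auto and run the snippet below. The function names are hypothetical, and Base.JLOptions is an internal, undocumented API used here only to inspect the depwarn setting (assuming 1 means "yes").

```julia
using Pkg

# Equivalent of `] add DataFrames#main`. It is only defined here, not called,
# because it needs network access and changes the active environment.
install_dataframes_main() = Pkg.add(name = "DataFrames", rev = "main")

# Report whether the session was started with the recommended flags
function test_setup()
    return (threads = Threads.nthreads(),                # > 1 if -t took effect
            depwarn = Base.JLOptions().depwarn == 1)     # true for --depwarn=yes
end

println(test_setup())
```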
As this release brings a lot of internal redesign we would highly appreciate both correctness and performance testing to ensure we do not have any regressions.
In general yes. But some operations are still single-threaded. The major blocker to making all operations multi-threaded is that Dict from Julia Base is not thread-safe for writing (we would prefer to use the standard implementation for this functionality).
How about petitioning Julia Base to create a new data structure called ThreadSafeDict to replace Dict? Then you could easily fix the problem by finding and replacing every Dict with ThreadSafeDict.
We have already discussed this. The issue is not that simple. In general there are many standard structures and functions in Julia Base that could potentially be multi-threaded but are currently single-threaded.
Regarding Dict: the issue is not that we want just a ThreadSafeDict, as such a dictionary would have to use locking to make sure that parallel reading and writing is safe. What we need is lock-free parallel writing to a dictionary data structure (i.e. for cases where writing happens in parallel and reading happens in parallel, but we know that writing and reading cannot happen at the same time). Most likely the Julia Base developers consider such a use case too specific to include in Julia Base. We will see.
In general we are going to focus on the performance of DataFrames.jl after 1.0, so in the worst case we will write our own implementation if it is needed. It will just take time (note how long it took to stabilize the DataFrames.jl API enough to warrant its 1.0 release).
Out of curiosity, can’t you just use a Vector{Dict} of length nthreads? Each thread can write to its own dictionary and you can read from the whole Vector if needed. Of course, you still need some sort of conflict resolution, but that really depends on your case; I do not think it is possible to invent rules generic enough to satisfy everyone.
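A minimal sketch of this idea, using only Julia Base (it assumes Julia 1.5 or later for mergewith). It is not how DataFrames.jl is implemented; parallel_counts is a hypothetical function, and the conflict resolution chosen here (summing counts) is just one example. The sketch uses one Dict per chunk of the input rather than one per thread id, so that no two tasks ever write to the same Dict.

```julia
# Count occurrences of each item, with every task writing to its own Dict
function parallel_counts(items::AbstractVector{T};
                         ntasks::Int = Threads.nthreads()) where {T}
    isempty(items) && return Dict{T,Int}()
    # Split the input into at most ntasks contiguous chunks
    chunks = collect(Iterators.partition(items, cld(length(items), ntasks)))
    partials = [Dict{T,Int}() for _ in chunks]
    Threads.@threads for i in eachindex(chunks)
        d = partials[i]              # only this iteration touches partials[i]
        for x in chunks[i]
            d[x] = get(d, x, 0) + 1
        end
    end
    # Conflict resolution: the same key may appear in several partial
    # dictionaries, so merge them by summing the counts
    return mergewith(+, partials...)
end

counts = parallel_counts(["a", "b", "a", "c", "a"])
```

The merge step is single-threaded and touches every key once more, which is part of the cost of this approach.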
df[:, :col] .= value. The point is that df[:, :col] .= value has to stick to the eltype of :col, while df.col .= value will allow assigning any value to :col (at the cost of allocating).
Note that this will be an issue starting with the 1.7 release of Julia, as currently the language does not allow for this distinction.
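A short sketch of the difference, written with the explicit : and ! row selectors so that it does not depend on the Julia version (the data frame is made up for the example):

```julia
using DataFrames

df = DataFrame(col = [1, 2, 3])

# In-place broadcast: the existing vector is reused, so the eltype stays Int
df[:, :col] .= 10

# A value that cannot be converted to Int is rejected by the in-place form
threw = try
    df[:, :col] .= "a"
    false
catch
    true
end

# Replacing broadcast: a new column is allocated, so the eltype may change
df[!, :col] .= "a"
```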
Ah - sorry. Clearly it should be : not !, as ! is for replacement (and in target API df.col .= value and df[!, :col] .= value will be the same). I have edited my comment. (too many things happen in parallel)