There is also this issue about a fast path that checks whether the data frame is sorted (what we would actually need is contiguous blocks, not necessarily sorted data, but that is harder to do), as this is a common case and https://github.com/h2oai/db-benchmark has specific tests for it.
I have updated the DataFrames package to version 0.15. When I run the following code in Juno/Jupyter, it throws warnings like the ones below.
Can someone explain this to me and tell me what to correct in the code?
using MLDataUtils #MLDataPattern,
@views begin
X,y=undersample((X_train,y_train),shuffle=false);
# X_resamp,y_resamp=oversample((X_train,y_train),fraction=0.2);
X = convert(Array{Float64}, X);
y = convert(Array{Int64}, y);
X_tst= convert(Array{Float64}, X_test);
y_tst = convert(Array{Int64}, y_test);
X_train0 = convert(Array{Float64}, X_train);
y_train0 = convert(Array{Int64}, y_train);
end
┌ Warning: Indexing into a return value of columns on SubDataFrame will return a view of column value
│   caller = iterate at abstractarray.jl:838 [inlined]
└ @ Core .\abstractarray.jl:838
┌ Warning: Indexing into a return value of columns on SubDataFrame will return a view of column value
│   caller = iterate at abstractarray.jl:838 [inlined]
└ @ Core .\abstractarray.jl:838
You do not have to correct this. The warnings will go away in DataFrames 0.16.0 which should be released relatively soon.
The reason is that we want to introduce an optimization: when you access a column of a `SubDataFrame`, it is not copied; instead a view is returned, which should be more efficient (especially for large data sets).
However, this change is breaking: for example, when you mutate what you got, you will also mutate the source. Actually, this is what should have happened in the past (this is the contract DataFrames.jl promised: not to copy on column read), but the implementation was different.
In 99% of use cases this should not matter, so you can ignore the warnings.
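To make the view semantics described above concrete, here is a minimal sketch. It assumes DataFrames.jl 0.16 or later (where the new behavior is in place); the data frame and the names `df`, `sdf` and `col` are made up for illustration.

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3, 4])   # made-up example data
sdf = view(df, 1:2, :)             # SubDataFrame over the first two rows

col = sdf.x                        # a view into df.x, not a copy
col[1] = 100                       # writes through to the parent data frame

parent_changed = df.x[1] == 100    # the source column was mutated
```

If you need an independent vector, calling `copy(sdf.x)` before mutating should avoid touching the parent.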
Thanks. One more question: using first(df, n) instead of head(df, 6) gives this warning.
first(data,6)
┌ Warning: In the future eachcol will have names argument set to false by default
│   caller = getmaxwidths(::DataFrame, ::UnitRange{Int64}, ::UnitRange{Int64}, ::Symbol) at show.jl:105
└ @ DataFrames C:\Users\chatura\.julia\packages\DataFrames\5Rg4Y\src\abstractdataframe\show.jl:105
This deprecation is already fixed in 0.15.1, as it was non-breaking, so please update to the latest release of the DataFrames.jl package.
Thanks for your quick responses.
A lot of these ideas are already implemented in FastGroupBy.jl; I just need to update it for Julia v1 and we can start benchmarking! I think InternedString.jl will yield HUGE improvements for string grouping, because once the strings are interned, we can group by performing a radix sort on the pointers (which are just UInt64), which is super fast! In fact, I have created a faster radix sort than data.table's in SortingLab.jl; again, I need to update it to Julia v1!
Can't wait to release my updated benchmarks on Julia v1, but before that there is Advent of Code, work, setting up a personal website, disk.frame… I wish I could land a full-time job to focus on the data ecosystem in Julia. Willing to take a pay cut.
I have updated https://github.com/bkamins/Julia-DataFrames-Tutorial to DataFrames 0.15.2. You will see a bunch of deprecation warnings there (not that many, but still a bit disturbing).
They will go away in DataFrames 0.16, but I released it now to highlight the upcoming changes in the underlying mechanics (mainly how `getindex`, `view` and `eachcol` work).
DataFrames.jl 0.16 has been released. You can find the list of changes in the release notes here: Release Version 0.16.0 · JuliaData/DataFrames.jl · GitHub. The key change is finishing the deprecation period for `getindex` and `view` methods.
The tutorial at https://github.com/bkamins/Julia-DataFrames-Tutorial has been updated.
DataFrames.jl 0.17 has been released (if you are interested in these updates, please subscribe to this thread for the future).
The release notes give the details of the changes. The most important changes are:
- Improved performance of split-apply-combine functions (thanks to @nalimilan).
- `view` is now fully functional and fast in all cases (we always remember the parent `DataFrame` without creating copies).
- Improved showing of data frames related types.
- Added `push!` for `DataFrameRow`.
- `DataFrame` constructor accepts `SubDataFrame` and `DataFrameRow` objects.
My tutorial https://github.com/bkamins/Julia-DataFrames-Tutorial has been updated to include the new functionality.
A small, but relevant update. With this version of DataFrames.jl we also re-export a new version of Missings.jl, which introduces the `passmissing` function. We would welcome feedback on this functionality (as it might eventually be considered for inclusion into Julia Base).
The `passmissing` function wraps any Julia function `f` (typically `missing` "unaware") accepting positional arguments so that if any of them is `missing` it returns `missing` and otherwise calls `f` with the passed arguments. Here is an example showing you the difference:
julia> string(missing, " ", missing)
"missing missing"
julia> passmissing(string)(missing, " ", missing)
missing
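The wrapping behavior described above can be sketched in a few lines of plain Julia. This is only an illustration of the idea, not the actual Missings.jl implementation; `my_passmissing` is a hypothetical name chosen to avoid clashing with the real function.

```julia
# Hypothetical helper illustrating the idea behind passmissing:
# return `missing` if any positional argument is missing, otherwise call f.
my_passmissing(f) = (args...) -> any(ismissing, args) ? missing : f(args...)

lifted = my_passmissing(string)

a = lifted(missing, " ", missing)   # short-circuits to missing
b = lifted("x", " ", "y")           # falls through to string("x", " ", "y")
```

Note that this sketch, like the description above, handles positional arguments only.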
We have released DataFrames.jl version 0.17.1. Here you can read release notes.
Apart from polishing corner cases, the thing that a typical user might be interested in is that two convenience functions for working with `GroupedDataFrame` were introduced:
- `groupvars`: gives you a vector of names of columns in the parent data frame used for grouping
- `groupindices`: gives you a vector of row group indices in the parent data frame
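As a rough illustration of `groupindices`, here is a small sketch on made-up data. Only `groupindices` is called; `groupvars` is left out of the runnable code because its availability depends on the DataFrames.jl version you have installed (it is described above for 0.17.1).

```julia
using DataFrames

df = DataFrame(g = ["a", "b", "a", "b"], v = 1:4)   # made-up example data
gd = groupby(df, :g)

# One entry per row of the parent data frame: the number of the group
# that the row belongs to.
idx = groupindices(gd)

# Rows 1 and 3 share a group, rows 2 and 4 share another one.
same_group_13 = idx[1] == idx[3]
same_group_24 = idx[2] == idx[4]
different_groups = idx[1] != idx[2]
```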
https://github.com/bkamins/Julia-DataFrames-Tutorial has been updated to reflect the changes in the release.
We have released DataFrames.jl version 0.18.0. Here you can read release notes.
This is a big release. The main highlights are:
- functions that create a new `DataFrame` copy passed columns by default; this can be overridden by the `copycols` keyword argument or by using the `DataFrame!` function, which does not copy passed columns; this means that we take a safety-first approach and you have to opt in for speed when creating a new `DataFrame`; we believe that this is a good choice as even experienced users of the package were getting caught by the old behavior (and it is especially targeted at making the usage less error-prone for entry-level users);
- in order to simplify the work with data frames after the change of default `DataFrame` construction there were significant improvements in `append!`, `push!` and `vcat` methods to make the workflow using them smoother;
- following requests we have thinned the dependency list of DataFrames.jl, which should improve its loading times (the change is mildly breaking, so if anyone experiences serious problems because of this please report an issue in DataFrames.jl);
- there is a long list of finished deprecation periods for functions and new deprecation messages; the reason is that we want to clean up the package before the 1.0 release (also the list of deprecated features got quite long over the years; we will continue with this process in the coming releases with the objective to have no deprecated functionalities in the 1.0 release);
- there are significant improvements in various `show` method implementations;
- optimized methods for `PooledArrays` in split-apply-combine were added.
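The copy-by-default behavior from the first item can be checked with a short sketch. It uses only the `copycols` keyword argument mentioned above; the vector `v` and the names `safe` and `fast` are made up for illustration.

```julia
using DataFrames

v = [1, 2, 3]                               # made-up source vector

safe = DataFrame(x = v)                     # default: the column is copied
fast = DataFrame(x = v, copycols = false)   # opt in: the column is reused as is

v[1] = 99                                   # mutate the source vector

safe_unaffected = safe.x[1] == 1            # the copy did not change
fast_aliased = fast.x === v                 # the same vector object is shared
```

The trade-off is the one described in the list: the default costs an extra allocation, while `copycols = false` avoids it at the price of aliasing.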
I will update Julia-DataFrames-Tutorial soon (and add a link here when it is updated).
In the coming release the next big thing (hopefully) will be making `setindex!` and broadcasting behavior of data frames consistent with Julia 1.0 standards.
Update
I have updated the tutorial here.
There were some minor rough edges reported (here and here) so we will soon make a patch release.
A patch release 0.18.1 of DataFrames.jl has just been registered. The release makes the `DataFrame` constructor behave correctly when passed some objects that meet the Tables.jl interface (most importantly database connections). Thanks to @cadoubs and @quinnj for helping with the issue.
Thank you for your fast analysis and patch. I validated the update on my application with MySQL.jl.
You did a very nice job.
I have just stumbled on an old question on StackOverflow that lacked a good answer for over two years. It is related to a conversion from JSON to a columnar data structure (looking at the number of stars on the question I guess it is a pattern that is needed quite often).
Thanks to the new `vcat` functionality proposed by pdeffebach and @oxinabox in DataFrames 0.18 we can do such things cleanly.
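As a rough sketch of the kind of pattern meant here (not the StackOverflow answer itself), records with different sets of fields can be stacked into one data frame. The example assumes a DataFrames.jl version in which `vcat` accepts the `cols` keyword argument; the two single-row data frames stand in for made-up parsed JSON records.

```julia
using DataFrames

# Two made-up records with partially different fields,
# e.g. as obtained after parsing JSON objects.
rec1 = DataFrame(a = [1], b = [2])
rec2 = DataFrame(a = [3], c = [4])

# Take the union of the columns; cells absent from a record become missing.
res = vcat(rec1, rec2, cols = :union)
```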
DataFrames.jl v0.18.2 patch release was just merged. Here you can read about the changes it introduces.
DataFrames.jl v0.19.0 has just been released. It is a major release towards DataFrames.jl 1.0 (we cannot get there yet as we have to go through a deprecation cycle).
The number of changes is significant and includes:
API changes:
- allow `Regex` indexing of columns
- allow `Not` from InvertedIndices.jl indexing of rows and columns
- add `!` indexing of rows of `AbstractDataFrame`
- deprecate indexing with column or columns only (like `df[:a]` or `df[1:2]`)
- define target rules for `getindex`, `getproperty`, `setindex!`, and `setproperty!` for `AbstractDataFrame` and `DataFrameRow` (in this release old behavior is deprecated; in the next release it will get replaced by target functionality)
- add indexing using `CartesianIndex{2}` for `AbstractDataFrame`
- full support of broadcasting for `AbstractDataFrame`
- support for broadcasting assignment for `DataFrameRow`
- `keys(::DataFrameRow)` now returns a `Tuple` of column names
- added `get` and `map` methods for `DataFrameRow`
- `categorical!` now accepts columns that contain `missing` values
- `get` and `haskey` for `AbstractDataFrame` are deprecated now
- `empty!` for `DataFrame` is deprecated now
- add `hasproperty` for `AbstractDataFrame`
Fixes:
- improved showing of `DataFrameRow` with zero columns
- fix `combine` with aggregation when `skipmissing=true`
Minor changes:
- improvements in error messages and types of thrown exceptions on error
- various documentation improvements
- improved `getindex` speed for vector of `Bool` indexing
- remove InteractiveUtils.jl dependency
The major change is the change of indexing rules and full support for broadcasting. Here are the details. In general, the design had to resolve a tension between ease of use, flexibility, safety and consistency.
Here are the major highlights:
- you can use `Not` and `Regex` for column indexing
- `df[col]` is now `df[!, col]` and gets/replaces a column in a data frame "as is"
- `df[:, col]` will always get a copy of a column/set a column in place
- `df[cols]` is now `df[!, cols]` and gets a new data frame without copying of columns
- `df[:, cols]` gets a new data frame with copying of columns
- `df.col` is the same as `df[!, col]` for consistency with Base, indicating that it gives you "as is" access to the property of the data frame (i.e. it gives you the column without copying and replaces the column)
- data frames can take part in broadcasting
- you can perform broadcasting assignment to `AbstractDataFrame` and `DataFrameRow`; as a special rule: using `df[!, col]` syntax you can create a new column/replace an old one using broadcasting (something which is non-standard in regular broadcasting, which is always in-place).
In summary, `!` indicates "an unsafe" operation. The reason is that people were often tricked by getting columns of a data frame, mutating them (e.g. resizing or sorting), and in consequence corrupting the source data frame. Now we hope that `!` will serve them as a warning that this is not a safe operation (as opposed to `:` indexing, which always makes a copy).
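The difference between `!` and `:` when reading a single column can be checked directly. This is a minimal sketch on made-up data, assuming a DataFrames.jl version where the target indexing rules are in place.

```julia
using DataFrames

df = DataFrame(x = [3, 1, 2])        # made-up example data

copied = df[:, :x]                   # `:` returns a fresh copy of the column
sort!(copied)                        # safe: the data frame is untouched
unchanged_after_copy = df.x == [3, 1, 2]

raw = df[!, :x]                      # `!` returns the stored column "as is"
same_object = raw === df.x           # df.col is the same as df[!, col]
sort!(raw)                           # unsafe: this reorders the column in df
changed_after_raw = df.x == [1, 2, 3]
```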
Here are the new rules at work:
julia> df = DataFrame(x1=1:3, x2=2:4, y='a':'c')
3Γ3 DataFrame
β Row β x1 β x2 β y β
β β Int64 β Int64 β Char β
βββββββΌββββββββΌββββββββΌβββββββ€
β 1 β 1 β 2 β 'a' β
β 2 β 2 β 3 β 'b' β
β 3 β 3 β 4 β 'c' β
julia> select(df, r"x")
3Γ2 DataFrame
β Row β x1 β x2 β
β β Int64 β Int64 β
βββββββΌββββββββΌββββββββ€
β 1 β 1 β 2 β
β 2 β 2 β 3 β
β 3 β 3 β 4 β
julia> select(df, Not(r"x"))
3Γ1 DataFrame
β Row β y β
β β Char β
βββββββΌβββββββ€
β 1 β 'a' β
β 2 β 'b' β
β 3 β 'c' β
julia> df[Not(1), Not(1)]
2Γ2 DataFrame
β Row β x2 β y β
β β Int64 β Char β
βββββββΌββββββββΌβββββββ€
β 1 β 3 β 'b' β
β 2 β 4 β 'c' β
julia> df .+ 1
3Γ3 DataFrame
β Row β x1 β x2 β y β
β β Int64 β Int64 β Char β
βββββββΌββββββββΌββββββββΌβββββββ€
β 1 β 2 β 3 β 'b' β
β 2 β 3 β 4 β 'c' β
β 3 β 4 β 5 β 'd' β
julia> df .+= ones(Int, size(df))
3Γ3 DataFrame
β Row β x1 β x2 β y β
β β Int64 β Int64 β Char β
βββββββΌββββββββΌββββββββΌβββββββ€
β 1 β 2 β 3 β 'b' β
β 2 β 3 β 4 β 'c' β
β 3 β 4 β 5 β 'd' β
julia> df[!, :z] .= 1
3-element Array{Int64,1}:
1
1
1
julia> df
3Γ4 DataFrame
β Row β x1 β x2 β y β z β
β β Int64 β Int64 β Char β Int64 β
βββββββΌββββββββΌββββββββΌβββββββΌββββββββ€
β 1 β 2 β 3 β 'b' β 1 β
β 2 β 3 β 4 β 'c' β 1 β
β 3 β 4 β 5 β 'd' β 1 β
This is great stuff! Is there anywhere you have listed the goals for 1.0, or roughly what is going to be materially different from the structure of a DataFrame now?