I need to manage multiple data series (each is a sequence of values, like a column in a data frame). Each series has an associated transformation that is applied automatically as data moves in and out of storage. For example, the series could be used to build input vectors for a model, where raw data must be transformed before the model uses it, e.g. by normalization or by applying a cosine function to periodic data.
I want the transformations to happen automatically. In particular, when I assign values to indices of a series, or append values to it, the values should be transformed before being stored. When I extract values from a series, the transformed values are returned. Of course, there should be an option to assign already transformed values to a series (no automatic transformation in that case). It would be great if the inverse transformation, where one exists, were also supported.
I can certainly implement such a library (I already have such code in Matlab and now want to convert my work to Julia). I just wonder whether such a library or functionality already exists, so that I don't need to create my own.
Thanks.
There's https://github.com/JuliaArrays/MappedArrays.jl. I am not a DataFrames expert, but the following works to take the modulo 2π of any value before storing it:
julia> using MappedArrays, DataFrames
julia> A = mappedarray(identity, x->mod(x, 2π), rand(10))
10-element mappedarray(identity, getfield(Main, Symbol("##13#14"))(), ::Array{Float64,1}) with eltype Float64:
0.6462246044535898
0.3893724745260221
0.10819312044797025
0.7456662437717823
0.2259602265381362
0.03472737190390074
0.4670647981623812
0.924346515776455
0.5413998577384473
0.5462417188978359
julia> df = DataFrame!(Any[A], [:A])
10×1 DataFrame
│ Row │ A         │
│     │ Float64   │
├─────┼───────────┤
│ 1   │ 0.646225  │
│ 2   │ 0.389372  │
│ 3   │ 0.108193  │
│ 4   │ 0.745666  │
│ 5   │ 0.22596   │
│ 6   │ 0.0347274 │
│ 7   │ 0.467065  │
│ 8   │ 0.924347  │
│ 9   │ 0.5414    │
│ 10  │ 0.546242  │
julia> df[5,1] = 10
10
julia> df
10×1 DataFrame
│ Row │ A         │
│     │ Float64   │
├─────┼───────────┤
│ 1   │ 0.646225  │
│ 2   │ 0.389372  │
│ 3   │ 0.108193  │
│ 4   │ 0.745666  │
│ 5   │ 3.71681   │
│ 6   │ 0.0347274 │
│ 7   │ 0.467065  │
│ 8   │ 0.924347  │
│ 9   │ 0.5414    │
│ 10  │ 0.546242  │
julia> mod(10, 2π)
3.7168146928204138
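The same pattern seems to cover the other requirements in the question (an inverse transformation, and writing values that are already transformed). Below is a minimal sketch, assuming `parent` returns the array that a MappedArray wraps; the `to_stored` / `to_raw` pair and its constants are hypothetical, chosen only for illustration.

```julia
using MappedArrays

# Hypothetical transform pair (not from the thread):
# normalize with mean 10 and standard deviation 2.
to_stored(x) = (x - 10.0) / 2.0   # applied when a value is written
to_raw(z) = z * 2.0 + 10.0        # inverse of to_stored

storage = zeros(3)                # holds the transformed values

# Writes go through to_stored; reads return the stored values unchanged.
A = mappedarray(identity, to_stored, storage)
A[1] = 14.0                       # stores (14 - 10) / 2 = 2.0

# Already-transformed values can bypass the transform via the wrapped array.
parent(A)[2] = -1.0

# A second wrapper over the same storage reads values back in raw units.
raw = mappedarray(to_raw, to_stored, storage)

println(A[1])     # 2.0
println(raw[2])   # 8.0
```

Both wrappers share `storage`, so a write through either one is visible through the other.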
If you've been doing this in Matlab, I think you'll be pleasantly surprised by the performance of MappedArrays:
julia> foo(A) = @inbounds A[2]
foo (generic function with 1 method)
julia> @code_native foo(A)
.text
; ┌ @ REPL[29]:1 within `foo'
; │┌ @ MappedArrays.jl:161 within `getindex'
; ││┌ @ REPL[29]:1 within `getproperty'
movq (%rdi), %rax
; ││└
; ││ @ array.jl:729 within `getindex'
movq (%rax), %rax
vmovsd 8(%rax), %xmm0 # xmm0 = mem[0],zero
; │└
retq
nopl (%rax)
; └
Awesome. This is exactly what I need. Just one typo in your example: df = DataFrame!(Any[A], [:A]). DataFrame! does not exist; it should have been DataFrame.
julia> using DataFrames
help?> DataFrame!
search: DataFrame! DataFrame DataFrames DataFrameRow SubDataFrame
DataFrame!(args...; kwargs...)
Equivalent to DataFrame(args...; copycols=false, kwargs...).
If kwargs contains the copycols keyword argument an error is thrown.
Examples
≡≡≡≡≡≡≡≡≡≡
```jldoctest
julia> df1 = DataFrame(a=1:3)
3×1 DataFrame
│ Row │ a     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │
│ 3   │ 3     │

julia> df2 = DataFrame!(df1)

julia> df1.a === df2.a
true
```
(v1) pkg> st DataFrames
Status `~/.julia/environments/v1/Project.toml`
[34da2185] Compat v2.1.0
[a93c6f00] DataFrames v0.18.3
[e1d29d7a] Missings v0.4.1
[2913bbd2] StatsBase v0.30.0
[bd369af6] Tables v0.2.5
[de0858da] Printf
[3fa0cd96] REPL
[10745b16] Statistics
[4ec0a83e] Unicode
You could alternatively use copycols=false. If you don't, I'm not sure the underlying MappedArray gets used in the DataFrame (it makes a copy of the data), and currently copy returns a plain Array:
julia> a = mappedarray(x->x^2, 1:3)
3-element mappedarray(getfield(Main, Symbol("##3#4"))(), ::UnitRange{Int64}) with eltype Int64:
1
4
9
julia> copy(a)
3-element Array{Int64,1}:
1
4
9
That might be viewed as a bug; perhaps we should create another MappedArray? Thoughts?
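For the copycols=false route, here is a minimal sketch, assuming a DataFrames version whose keyword constructor accepts `copycols` (the positional `DataFrame!` form used earlier depends on the installed version). The variable names are illustrative.

```julia
using MappedArrays, DataFrames

storage = zeros(10)
A = mappedarray(identity, x -> mod(x, 2π), storage)

# copycols=false asks DataFrame to use A itself as the column, not a copy.
df = DataFrame(A = A, copycols = false)

# The assignment goes through A's setindex!, so mod(10, 2π) is what gets stored.
df[5, :A] = 10

println(df.A === A)                  # whether the column is the MappedArray itself
println(storage[5] ≈ mod(10, 2π))    # whether the transform ran on assignment
```

Checking `df.A === A` is a quick way to confirm that no copy was made when the frame was built.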
I think DataFrame! was added recently. For me:
julia> using DataFrames
help?> DataFrame!
search: DataFrame DataFrames DataFrameRow SubDataFrame GroupedDataFrame AbstractDataFrame
Couldn't find DataFrame!
Perhaps you meant DataFrames, DataFrame, DataFrameRow or SubDataFrame
No documentation found.
Binding DataFrame! does not exist.
My DataFrames has version number v0.17.1.
What is the bug you mentioned at the end of your message? Is copy(a) supposed to be a MappedArray, not an Array?
It's not obvious that it is a bug, or which it should be. For most array types, copy creates an Array. copy(1:3) returns another UnitRange, though. This is in the territory of "what does copy actually mean?" and that's really a social decision.
I think it makes more sense to have copy(x) return the same data type as x (so it's really a copy of x). If one wants to convert a MappedArray to a normal array, probably collect(x) or a type conversion makes sense. Just my 2c though.
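Whatever copy ends up returning, an independent copy that keeps the mapping can be built by hand: copy the wrapped data and wrap it again. A minimal sketch, assuming `parent` returns the underlying data of a MappedArray; `sq` is a hypothetical stand-in for the mapping function.

```julia
using MappedArrays

sq(x) = x^2
a = mappedarray(sq, [1, 2, 3])

# Copy the underlying data, then wrap the copy with the same function.
b = mappedarray(sq, copy(parent(a)))

# collect materializes the mapped values into a plain Array.
c = collect(a)

println(parent(b) !== parent(a))   # true: b has its own storage
println(c)                         # [1, 4, 9]
```

This only works when the caller knows which function to re-apply, which is part of why a general rule for copy is hard to state.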
Truong Nghiem
I think it makes more sense to have copy(x) return the same data type as x
I'm not sure that makes sense as a general rule (to me it seems copy(view(a, 2:5)) should not return a view), and more fundamentally there isn't a way to ensure this. similar works in some cases but if you nest views, e.g., reshape(view(mappedarray(x->x^2, a), 2:5), 2, 2), then similar won't help you create the same type unless someone has written an insanely-specialized method.
Nevertheless preserving the input/output functions might be appropriate in the case of MappedArray. I wish we had some general principles to guide this decision.
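The Base behaviors mentioned in this thread can be checked directly. A small sketch using only Base Julia; the variable names are illustrative.

```julia
r = 1:3
v = view(collect(1:6), 2:5)

# copy of a range is still a range.
println(copy(r) isa UnitRange)    # true

# copy and similar of a view both produce a plain Array, not another view.
println(copy(v) isa Array)        # true
println(similar(v) isa Array)     # true

# collect always materializes into an Array.
println(collect(r) isa Array)     # true
```

So within Base itself, copy preserves the type in some cases and not in others.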