Any Julia library for managing data series with automatic transformation?

I need to manage multiple data series (each is a sequence of values, like a column in a data frame). Each series has an associated transformation that is applied automatically as data move in and out of storage. For example, the series could be used to build input vectors for a model where raw data must be transformed before use, e.g. by normalization or by applying a cosine function to periodic data.

I want the transformations to happen automatically. In particular, when I assign values to indices of a series, or append values to it, the values should be transformed automatically before being stored. When I extract values from a series, the transformed values are returned. Of course, there should be an option to assign already-transformed values to a series (no automatic transformation in this case). It would be great if the inverse transformation, when it exists, were also supported.

I can certainly implement such a library myself (I already have such code in Matlab and now want to port my work to Julia). I am just wondering whether such a library or functionality already exists, so that I don’t need to write my own.

Thanks.

There’s https://github.com/JuliaArrays/MappedArrays.jl. I am not a DataFrames expert, but the following works to take the modulo 2π of any value before storing it:

julia> using MappedArrays, DataFrames

julia> A = mappedarray(identity, x->mod(x, 2π), rand(10))
10-element mappedarray(identity, getfield(Main, Symbol("##13#14"))(), ::Array{Float64,1}) with eltype Float64:
 0.6462246044535898 
 0.3893724745260221 
 0.10819312044797025
 0.7456662437717823 
 0.2259602265381362 
 0.03472737190390074
 0.4670647981623812 
 0.924346515776455  
 0.5413998577384473 
 0.5462417188978359 

julia> df = DataFrame!(Any[A], [:A])
10×1 DataFrame
│ Row │ A         │
│     │ Float64   │
├─────┼───────────┤
│ 1   │ 0.646225  │
│ 2   │ 0.389372  │
│ 3   │ 0.108193  │
│ 4   │ 0.745666  │
│ 5   │ 0.22596   │
│ 6   │ 0.0347274 │
│ 7   │ 0.467065  │
│ 8   │ 0.924347  │
│ 9   │ 0.5414    │
│ 10  │ 0.546242  │

julia> df[5,1] = 10
10

julia> df
10×1 DataFrame
│ Row │ A         │
│     │ Float64   │
├─────┼───────────┤
│ 1   │ 0.646225  │
│ 2   │ 0.389372  │
│ 3   │ 0.108193  │
│ 4   │ 0.745666  │
│ 5   │ 3.71681   │
│ 6   │ 0.0347274 │
│ 7   │ 0.467065  │
│ 8   │ 0.924347  │
│ 9   │ 0.5414    │
│ 10  │ 0.546242  │

julia> mod(10, 2π)
3.7168146928204138
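
The same pattern might cover the normalization case from the question. The following is only a sketch: the constants and the names `to_z`, `from_z`, `store`, `series` and `rawview` are made up for illustration, and it relies only on `mappedarray(f, finv, A)` applying `f` on reads and `finv` on writes. Writing to the backing array directly is one way to assign already-transformed values, and a second wrapper over the same storage gives access to the inverse transformation.

```julia
using MappedArrays

# Assumed normalization constants (hypothetical values for illustration).
const μ = 20.0
const σ = 10.0
to_z(x)   = (x - μ) / σ    # forward transformation
from_z(z) = z * σ + μ      # inverse transformation

store = zeros(3)           # backing storage; holds transformed values

# Writes go through to_z, reads return what is stored.
series = mappedarray(identity, to_z, store)
series[1] = 30.0           # stored as (30 - 20) / 10 = 1.0

# Already-transformed values: write to the backing array directly.
store[2] = -0.5

# A second wrapper over the same storage applies the inverse on reads.
rawview = mappedarray(from_z, to_z, store)
println(series[1])         # 1.0
println(rawview[2])        # 15.0
```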

If you’ve been doing this in Matlab, I think you’ll be pleasantly surprised by the performance of MappedArrays:

julia> foo(A) = @inbounds A[2]
foo (generic function with 1 method)

julia> @code_native foo(A)
	.text
; ┌ @ REPL[29]:1 within `foo'
; │┌ @ MappedArrays.jl:161 within `getindex'
; ││┌ @ REPL[29]:1 within `getproperty'
	movq	(%rdi), %rax
; │└└
; │┌ @ array.jl:729 within `getindex'
	movq	(%rax), %rax
	vmovsd	8(%rax), %xmm0          # xmm0 = mem[0],zero
; │└
	retq
	nopl	(%rax)
; └

:smile:


Awesome. This is exactly what I need. Just one typo in your example: df = DataFrame!(Any[A], [:A]). DataFrame! does not exist; it should have been DataFrame.

julia> using DataFrames

help?> DataFrame!
search: DataFrame! DataFrame DataFrames DataFrameRow SubDataFrame

  DataFrame!(args...; kwargs...)

  Equivalent to DataFrame(args...; copycols=false, kwargs...).

  If kwargs contains the copycols keyword argument an error is thrown.

  Examples
  ––––––––––

  julia> df1 = DataFrame(a=1:3)
  3×1 DataFrame
  │ Row │ a     │
  │     │ Int64 │
  ├─────┼───────┤
  │ 1   │ 1     │
  │ 2   │ 2     │
  │ 3   │ 3     │

  julia> df2 = DataFrame!(df1)

  julia> df1.a === df2.a
  true

(v1) pkg> st DataFrames
    Status `~/.julia/environments/v1/Project.toml`
  [34da2185] Compat v2.1.0
  [a93c6f00] DataFrames v0.18.3
  [e1d29d7a] Missings v0.4.1
  [2913bbd2] StatsBase v0.30.0
  [bd369af6] Tables v0.2.5
  [de0858da] Printf 
  [3fa0cd96] REPL 
  [10745b16] Statistics 
  [4ec0a83e] Unicode 

You could alternatively use copycols=false. If you don’t, I’m not sure the underlying MappedArray gets used in the DataFrame (it makes a copy of the data), and currently copy of a MappedArray returns a plain Array:

julia> a = mappedarray(x->x^2, 1:3)
3-element mappedarray(getfield(Main, Symbol("##3#4"))(), ::UnitRange{Int64}) with eltype Int64:
 1
 4
 9

julia> copy(a)
3-element Array{Int64,1}:
 1
 4
 9

That might be viewed as a bug; perhaps copy should create another MappedArray instead? Thoughts?

I think DataFrame! was added recently. For me:

julia> using DataFrames

help?> DataFrame!
search: DataFrame DataFrames DataFrameRow SubDataFrame GroupedDataFrame AbstractDataFrame

Couldn't find DataFrame!
Perhaps you meant DataFrames, DataFrame, DataFrameRow or SubDataFrame
  No documentation found.

  Binding DataFrame! does not exist.

My DataFrames has version number v0.17.1.

What is the bug you mentioned at the end of your message? Is copy(a) supposed to be a MappedArray, not an Array?

It’s not obvious that it is a bug, or which it should be. For most array types, copy creates an Array. copy(1:3) returns another UnitRange, though. This is in the territory of “what does copy actually mean?” and that’s really a social decision.
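
To make the contrast concrete, here is a small check using only Base types (the variable names are arbitrary):

```julia
a = collect(1:6)

# Copying a view materializes it as a plain Array...
v = view(a, 2:5)
println(typeof(copy(v)))      # a Vector, not a SubArray

# ...whereas copying a range gives back a range.
println(typeof(copy(1:3)))    # a UnitRange

# The copy of the view is independent of the parent array.
c = copy(v)
c[1] = 100
println(a[2])                 # still 2
```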


I think it makes more sense to have copy(x) return the same data type as x (so it’s really a copy of x). If one wants to convert a MappedArray to a normal array, probably collect(x) or a type conversion makes sense. Just my 2c though.

Truong Nghiem: “I think it makes more sense to have copy(x) return the same data type as x”

I’m not sure that makes sense as a general rule (to me it seems copy(view(a, 2:5)) should not return a view), and more fundamentally there isn’t a way to ensure this. similar works in some cases, but if you nest views, e.g., reshape(view(mappedarray(x->x^2, a), 2:5), 2, 2), then similar won’t help you create the same type unless someone has written an insanely-specialized method.
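
A Base-only illustration of the nesting problem, leaving MappedArrays out of it (names are arbitrary): `similar` on a reshaped view hands back a plain `Matrix`, not another reshaped view.

```julia
a = collect(1.0:8.0)

# A wrapper around a wrapper: a reshaped view of a.
nested = reshape(view(a, 2:5), 2, 2)
println(nested isa Array)          # false: it is a lazy wrapper type

# similar allocates fresh storage of matching eltype and shape,
# but it does not reproduce the wrapper type.
s = similar(nested)
println(s isa Matrix{Float64})     # true
println(size(s) == size(nested))   # true
```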

Nevertheless, preserving the input/output functions might be appropriate in the case of MappedArray. I wish we had some general principles to guide this decision.
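
As a rough sketch of what “preserving the functions” could mean, here is a toy wrapper. `ToyMapped` is a hypothetical stand-in written for this post, not the MappedArrays.jl type, and its `copy` method is just one possible choice: duplicate the storage, keep both functions.

```julia
# Toy stand-in for a mapped vector (hypothetical; not MappedArrays.jl).
struct ToyMapped{T,F,Finv,A<:AbstractVector} <: AbstractVector{T}
    f::F        # applied on reads
    finv::Finv  # applied on writes
    data::A     # backing storage
end

# Element type is guessed from the first element (assumes non-empty data).
ToyMapped(f, finv, data::AbstractVector) =
    ToyMapped{typeof(f(first(data))),typeof(f),typeof(finv),typeof(data)}(f, finv, data)

Base.size(A::ToyMapped) = size(A.data)
Base.IndexStyle(::Type{<:ToyMapped}) = IndexLinear()
Base.getindex(A::ToyMapped, i::Int) = A.f(A.data[i])
Base.setindex!(A::ToyMapped, v, i::Int) = (A.data[i] = A.finv(v); A)

# One possible meaning of copy: new storage, same f and finv.
Base.copy(A::ToyMapped) = ToyMapped(A.f, A.finv, copy(A.data))

a = ToyMapped(identity, x -> mod(x, 2π), zeros(3))
b = copy(a)
b[1] = 10                      # still wrapped modulo 2π on the copy
println(b isa ToyMapped)       # true
println(b[1] ≈ mod(10, 2π))    # true
println(a[1])                  # 0.0, the original is untouched
```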
