How does Transducers.jl process a table?

How can I use Transducers.jl to get the mean of each column of a table?

using Transducers, Tables, Statistics
N = 1_000_000
a = [2randn(N÷2) .+ 6; randn(N÷2)]
b = [3randn(N÷2); 2randn(N÷2)]
c = randn(N)
d = c .+ 0.6randn(N)
table = (; a, b, c, d);   # NamedTuple of four Vector{Float64} columns

# table_df = DataFrame(table)
row_data = Tables.rows(table)

aa = Transducers.foldl(right, row_data |> Map(mean) |> collect)    

aa is a single number, such as 0.4.
I want to get something like this:

a     b     c     d
0.3   0.4   0.5   0.6

I don’t know the functions of the Transducers package well, but to get what you are looking for, you should probably access the table as columns and stop at Map, without letting foldl intervene:

col_data = Tables.columns(table)

collect(Map(mean), col_data)

Indeed, Tables is probably not even needed:

table = (; a, b, c, d); 
(; zip(keys(table),collect(Map(mean), table))...)

I probably did not understand the question, but FWIW, simply doing:

map(mean, table)

# results in:
(a = 2.9984519617857526, b = 0.0008151385421462024, c = -0.00045218977614473727, d = -3.069694507042777e-5)

I understood that the request was related to how Map processes tables (in the form of namedtuples) …

row_data = Tables.rows(table)
col_data = Tables.columns(table)

Transducers.foldl(right, col_data |> Map(mean) |> collect)    # 0.002434656840427542
Transducers.collect(Map(mean), col_data)   # is ok, returns:
# [2.998240292251592, 3.2144679602343106e-5, 0.0010958016920552937, 0.002434656840427542]

julia> map(mean, table)   # is ok
|                   a |                       b |                       c |                      d |
| ------------------- | ----------------------- | ----------------------- | ---------------------- |
|   2.998240292251592 |   3.2144679602343106e-5 |   0.0010958016920552937 |   0.002434656840427542 |

I want to use Transducers.foldl to do this.
Why doesn’t foldl() give the same result as collect(Map(mean), col_data)?

My real intention is this:
Suppose my table is large and has many rows. I want to calculate the mean of each column incrementally.
aa = Transducers.foldl(right, row_data |> Map(mean) |> collect)

What you’re doing (calculating the mean of each column) is completely columnar, so you don’t want to iterate over rows, for both clarity and performance reasons. Don’t use Transducers.jl when you can simply do it in columnar fashion.

Using foldl() with right, you are left with only the mean of the last column.
If you really want to do this, maybe accumulate() better suits your needs:

aa = Transducers.foldl(right, col_data |> Map(mean) )    
aa = Transducers.accumulate(right, col_data |> Map(mean) )    
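To illustrate the difference between the two calls, here is a minimal sketch that uses only Base and Statistics. The `right` function below is a local stand-in that I am assuming behaves like the one in Transducers.jl (return the right-hand argument), and the small three-column table is made up for the example.

```julia
using Statistics

# Local stand-in for `right` (assumption: same behavior as Transducers.right):
# ignore the left argument and return the right one.
right(l, r) = r

# Tiny made-up table with the same NamedTuple-of-vectors shape as in the question.
table = (a = [1.0, 2.0, 3.0], b = [10.0, 20.0, 30.0], c = [-1.0, 0.0, 1.0])

# Iterating a NamedTuple yields its values, so this gives one mean per column.
col_means = [mean(col) for col in table]   # [2.0, 20.0, 0.0]

# foldl with `right` throws away everything except the final element.
last_only = foldl(right, col_means)        # 0.0, the mean of column c

# accumulate with `right` keeps every intermediate result.
every_step = accumulate(right, col_means)  # [2.0, 20.0, 0.0]
```

With `right` as the reducing function, accumulate simply reproduces the input, which is why collecting the means directly is the more natural way to write it.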

xs |> Map(f) |> collect and collect(Map(f), xs) are both equivalent to

ys = []
for x in xs
    push!(ys, f(x))
end
ys

So, row_data |> Map(mean) |> collect is computing

v = [
    mean((a[1], b[1], c[1], d[1])),
    mean((a[2], b[2], c[2], d[2])),
    ...
    mean((a[end], b[end], c[end], d[end])),
]

Then, since foldl(right, v) is equivalent to v[end], foldl(right, row_data |> Map(mean) |> collect) returns a single number: mean((a[end], b[end], c[end], d[end])).
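The explanation above can be checked on a tiny example. This is only a sketch with Base and Statistics: `right` is again a local stand-in assumed to match Transducers.right, and the two short columns are made up.

```julia
using Statistics

right(l, r) = r   # local stand-in, assumed to behave like Transducers.right

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

# Row-wise view of the table: one NamedTuple per row.
rows = [(a = a[i], b = b[i]) for i in eachindex(a)]

# Mean across the fields of each row, not down each column.
v = [mean(values(row)) for row in rows]   # [2.5, 3.5, 4.5]

# Folding with `right` leaves only the last row's mean.
result = foldl(right, v)                  # 4.5
```

So the single number is the mean across the last row, which has nothing to do with any column mean.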

As others said, I think using “columnar” functions is the best way to do this. That said, if you really want to do this in row-wise fashion (e.g., the input does not fit in memory), you can use DataTools.jl (JuliaFolds/DataTools.jl on GitHub):

julia> using Transducers, Tables

julia> using DataTools: oncol, averaging

julia> foldxl(oncol(a = averaging, b = averaging, c = averaging, d = averaging), Tables.rowtable(table))
(a = 2.999957367033291, b = 0.00019357701607686137, c = 0.0009838445182253483, d = -0.00026117276633180807)
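For intuition about what a row-wise fold has to carry along, here is a rough sketch in plain Base Julia. It is not how DataTools.jl implements `averaging`; the helpers `update` and `column_means` are hypothetical names introduced only for this illustration, and the rows are made up.

```julia
# Running state: the number of rows seen so far and the per-column sums.
# `update` is a hypothetical helper; it folds one row into the state.
update(state, row) = (n = state.n + 1, sums = map(+, state.sums, row))

# `column_means` is also hypothetical. `rows` is any iterable of NamedTuples
# whose field names equal `colnames` (same names, same order).
function column_means(rows, colnames)
    zeros_nt = NamedTuple{colnames}(ntuple(_ -> 0.0, length(colnames)))
    final = foldl(update, rows; init = (n = 0, sums = zeros_nt))
    return map(s -> s / final.n, final.sums)
end

rows = [(a = 1.0, b = 10.0), (a = 2.0, b = 20.0), (a = 3.0, b = 60.0)]
means = column_means(rows, (:a, :b))   # (a = 2.0, b = 30.0)
```

Each row is visited once and only the count and the sums are kept, so `rows` could just as well be a lazy iterator over data that does not fit in memory. Summing and dividing at the end is the simplest choice; a running-mean update would be an alternative if the sums could grow too large.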