Hi,
I have some survey data which are organized hierarchically and I would like to transform them into a rectangular dataset. For example using JuliaDB:
t = table([1,1,1,2,2,2], [1,2,2,2,4,5],[10,20,10,5,20,10], names=[:ID, :Activity,:Duration])
Table with 6 rows, 3 columns:
ID Activity Duration
──────────────────────
1 1 10
1 2 20
1 2 10
2 2 5
2 4 20
2 5 10
Each activity can happen more than once per ID and some activities are not reported if the duration is zero. I would like to create a new variable for all possible Activities (which are 100) and sum the duration to something like this:
Table with 2 rows, 6 columns:
ID x_1 x_2 x_3 x_4 x_5
───────────────────────────
1 10 30 0 0 0
2 0 5 0 20 10
I have tried the following:
duration = select(t,:Duration)
activity = select(t,:Activity)
for i in 1:5
@eval $(Symbol("x_$i")) = zeros(Int,size(activity,1))
end
for j=1:size(activity,1)
for i=1:5
if activity[j]==i
@eval $(Symbol("x_$i[$j]")) =duration[$j]
else
@eval $(Symbol("x_$i[$j]"))=0
end
end
end
and then use groupreduce
using summation by ID
and push the arrays into a new table.
However, the loop doesn’t work and all the x
Arrays
remain zero. Other variations such as
@eval $(Symbol("x_$i[$j]")) =duration[j]
ERROR: UndefVarError: j not defined
or
@eval $(Symbol("x_$i[$j]")) = $(Symbol("duration[$j]"))
ERROR: UndefVarError: duration[1] not defined
produce errors or return zeros.
Is there a way to make this work or a more efficient way of doing this without creating so many columns? For some activities of interest, I did it manually but since I would like to obtain the information for all the 100 activities it is not a feasible strategy.
The tricky part is that I need the information for the number of individuals that didn’t do some activities. The total duration per activity is straightforward and can be done without the above loops simply by groupreduce
.
Thank you very much in advance!