JuliaDB.flatten equivalent in DataFrames

Hello,
is there a way to produce in DataFrames the same result of JuliaDB.flatten

using DataFrames
using JuliaDB
d1 = DataFrame(cat = ["a","b"],
               vec = [[1,2],[3,4]])
t1 = table(d1)

t2 = flatten(t1,:vec)

flatten returns a table with 4 rows and 2 columns where the first column is repeated to flatten the values in the second

Thanks for the question. There is no default function in DataFrames to do this. However here is a function that will do that for you with dataframes

function flatten(df, datavar)
	N = nrow(df)
	lengths = Vector{Int64}(undef, nrow(df))

	for i in 1:N
	    lengths[i] = length(df[i, datavar])
	end

	new_N = sum(lengths)

	
	new_df = similar(df[!, Not(datavar)], new_N)
	n = names(new_df)

	counter = 1
	for i in 1:N
		for j in 1:lengths[i]
			for name in n
				new_df[counter, name] = df[i, name]
			end
			new_df
			counter += 1
		end
	end

	new_df[!, datavar] = reduce(vcat, df[!, datavar])

	return new_df
end

This is probably a function that should be added to DataFrames.

Not sure how efficient it is, but how about something like:

df = DataFrame(cat=["a", "b"], vec=[[1,2],[3,4]])
df2 = DataFrame([String, Int], [:cat, :vec])

for i in Iterators.flatten(Iterators.product.(Ref.(df.cat), df.vec))
    push!(df2, i)
end

There’s also the Query.jl way of doing it:

using DataFrames, Query

df = DataFrame(cat=["a", "b"], vec=[[1,2],[3,4]])
df2 = @from i in df begin
    @from j in i.vec
    @select {cat=i.cat, vec=j}
    @collect DataFrame
end

Can I ask what your use-case is? I’m curious to know what goal you are trying to accomplish.

1 Like

I am analyzing behavioral data collected in a MatLab structure where different fields have different lengths. Some field is a single value like the subject ID, some have value for each trial and others have a value for each event. I am trying to have these data in 2 formats: 1 where each row is a trial and 2 where each row is an single event. The idea was to produce first the data frame of the trials repeating the session values for each row, then produce the event data frame flattening the columns containing the events.

Thanks for the suggestions, I will try them and get back with the performances

Thanks for the suggestions, I tried to generalize your approach to DataFrames with multiple columns, but I am not proficient enough neither on Iterators or Query, apparently.

If you want to give an example df with extra columns that you’re having trouble generalizing to, I’d be happy to help. That said, if the function that @pdeffebach wrote works for you, it’s likely more efficient and definitely more general (thanks so much for immediately contributing it upstream to DataFrames.jl by the way; I love seeing that kind of quick contribution).

I just thought it might be useful to show how you could handle these kind of problems concisely with builtin tools (or Query.jl).