JuliaDB.flatten equivalent in DataFrames

Dario_Sarra · November 13, 2019, 5:09pm

Hello,
is there a way to produce in DataFrames the same result of JuliaDB.flatten

using DataFrames
using JuliaDB
d1 = DataFrame(cat = ["a","b"],
               vec = [[1,2],[3,4]])
t1 = table(d1)

t2 = flatten(t1,:vec)

flatten returns a table with 4 rows and 2 columns where the first column is repeated to flatten the values in the second

pdeffebach · November 13, 2019, 6:16pm

Thanks for the question. There is no default function in DataFrames to do this. However here is a function that will do that for you with dataframes

function flatten(df, datavar)
	N = nrow(df)
	lengths = Vector{Int64}(undef, nrow(df))

	for i in 1:N
	    lengths[i] = length(df[i, datavar])
	end

	new_N = sum(lengths)

	
	new_df = similar(df[!, Not(datavar)], new_N)
	n = names(new_df)

	counter = 1
	for i in 1:N
		for j in 1:lengths[i]
			for name in n
				new_df[counter, name] = df[i, name]
			end
			new_df
			counter += 1
		end
	end

	new_df[!, datavar] = reduce(vcat, df[!, datavar])

	return new_df
end

This is probably a function that should be added to DataFrames.

non-Jedi · November 13, 2019, 6:17pm

Not sure how efficient it is, but how about something like:

df = DataFrame(cat=["a", "b"], vec=[[1,2],[3,4]])
df2 = DataFrame([String, Int], [:cat, :vec])

for i in Iterators.flatten(Iterators.product.(Ref.(df.cat), df.vec))
    push!(df2, i)
end

There’s also the Query.jl way of doing it:

using DataFrames, Query

df = DataFrame(cat=["a", "b"], vec=[[1,2],[3,4]])
df2 = @from i in df begin
    @from j in i.vec
    @select {cat=i.cat, vec=j}
    @collect DataFrame
end

pdeffebach · November 14, 2019, 2:43am

Can I ask what your use-case is? I’m curious to know what goal you are trying to accomplish.

Dario_Sarra · November 14, 2019, 10:17am

I am analyzing behavioral data collected in a MatLab structure where different fields have different lengths. Some field is a single value like the subject ID, some have value for each trial and others have a value for each event. I am trying to have these data in 2 formats: 1 where each row is a trial and 2 where each row is an single event. The idea was to produce first the data frame of the trials repeating the session values for each row, then produce the event data frame flattening the columns containing the events.

Dario_Sarra · November 14, 2019, 10:24am

Thanks for the suggestions, I will try them and get back with the performances

Dario_Sarra · November 14, 2019, 11:49am

Thanks for the suggestions, I tried to generalize your approach to DataFrames with multiple columns, but I am not proficient enough neither on Iterators or Query, apparently.

non-Jedi · November 14, 2019, 1:01pm

If you want to give an example df with extra columns that you’re having trouble generalizing to, I’d be happy to help. That said, if the function that @pdeffebach wrote works for you, it’s likely more efficient and definitely more general (thanks so much for immediately contributing it upstream to DataFrames.jl by the way; I love seeing that kind of quick contribution).

I just thought it might be useful to show how you could handle these kind of problems concisely with builtin tools (or Query.jl).

Topic		Replies	Views
Flattening YFinance.jl JSON result into a DataFrame New to Julia question	22	1299	April 11, 2023
Flatten in case column contains string and array Data question	6	718	May 17, 2021
JuliaDB flatten multiple columns General Usage question	7	720	May 31, 2019
How to create JuliaDB table from DataFrame? General Usage	2	1792	April 15, 2018
Transforming DataFrame from column of vector General Usage dataframes	3	245	December 1, 2022

JuliaDB.flatten equivalent in DataFrames

Related topics