Converting NamedTuple to DataFrame seems expensive?

I have a huge NamedTuple, and

DataFrame(large_namedtuple)

takes a long time (around 12 s) and uses quite a bit of RAM.

The eventual target for me is a DataFrame, but I created the NamedTuple so that users can choose the sink they want. Is it better to skip the NamedTuple and create the DataFrame directly? That would force a dependency on DataFrames.jl onto a package that I am only a potential contributor to, so it would be good to avoid if possible.

Any good solutions?

Use copycols = false in the DataFrame constructor.
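Assuming the input is a NamedTuple of column vectors (the names below are invented stand-ins for the real data), a minimal sketch of what that looks like:

```julia
using DataFrames

# Hypothetical stand-in for the real data: a NamedTuple of column vectors.
nt = (a = rand(10), b = rand(10))

# copycols = false reuses the existing vectors as the DataFrame's columns
# instead of copying each one, so construction is cheap in time and memory.
df = DataFrame(nt; copycols = false)

# Caveat: the DataFrame now aliases the original vectors, so mutating
# either one mutates the other.
```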

If you can avoid a dependency on data frames that would be best.


Huge in what sense? Many columns?


Hundreds of columns, 26 million rows.

Does it make the DataFrame immutable?

As far as I can tell, no. I think the issue you are thinking of is CSV.read, which used to return an immutable AbstractArray type that would cause problems with copycols = false.


@xiaodai, it is a lot easier to help you if you provide more concrete information. Are you passing one named tuple with hundreds of fields, where each field is a vector with one element per row? Or are you passing a vector of named tuples? Or an iterator of named tuples?
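For reference, a sketch of the three shapes being asked about (data invented for illustration); the first already has materialized columns, while the other two require iterating rows to build them:

```julia
using DataFrames

# 1. One NamedTuple whose fields are column vectors:
nt_of_vectors = (x = [1, 2, 3], y = [4.0, 5.0, 6.0])
DataFrame(nt_of_vectors)  # columns already exist, fast to wrap

# 2. A vector of NamedTuples, one per row:
rows = [(x = 1, y = 4.0), (x = 2, y = 5.0), (x = 3, y = 6.0)]
DataFrame(rows)           # must iterate rows and assemble columns

# 3. An iterator of NamedTuples, rows produced lazily:
DataFrame((x = i, y = i / 2) for i in 1:3)
```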


Firstly, I create hundreds of vectors using multi-threading, so I have an (unnamed) tuple of hundreds of materialized vectors.

Then I create names for them with a NamedTuple. Come to think of it, I can just create the DataFrame from the tuple of vectors and then set the names, so I can skip the NamedTuple step entirely.
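A sketch of that idea, assuming the threaded step produces a plain tuple of vectors (the data and names here are invented for illustration):

```julia
using DataFrames

# Stand-in for the vectors produced by the multi-threaded step.
cols = (rand(5), rand(5), rand(5))

# Build the DataFrame directly from a vector of columns plus names,
# skipping the intermediate NamedTuple. copycols = false reuses the
# existing vectors instead of copying them.
colnames = [:a, :b, :c]
df = DataFrame(collect(AbstractVector, cols), colnames; copycols = false)
```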