I have a huge NamedTuple, and `DataFrame(large_namedtuple)` takes a long time (around 12s) and uses quite a bit of RAM.
The eventual target for me is a DataFrame, but I created the named tuple so that users can choose the sink they want. Is it better to skip the named tuple and just create the DataFrame directly? That would force a dependency on DataFrames onto a package that I am only a potential contributor to, so it would be good to avoid that if possible.
Any good solutions?
Use `copycols = false` in the `DataFrame` constructor.

If you can avoid a dependency on DataFrames, that would be best.
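A minimal sketch of the suggestion, assuming the named tuple holds column vectors (the names and values here are made up for illustration):

```julia
using DataFrames

# Hypothetical stand-in for the large named tuple of column vectors
nt = (a = [1, 2, 3], b = ["x", "y", "z"])

# copycols = false tells the constructor to reuse the existing vectors
# instead of copying each one, avoiding the extra allocations and copy time
df = DataFrame(nt; copycols = false)

# The DataFrame column is the very same vector, not a copy
df.a === nt.a
```

The trade-off is aliasing: mutating `nt.a` afterwards also mutates `df.a`, so this is safest when the named tuple is discarded once the DataFrame is built.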
Huge in what sense? Many columns?
Does that make the DataFrame immutable?
As far as I can tell, no. I think the issue you are thinking of is `CSV.read`, which used to return an immutable `AbstractArray` type that would cause problems with `copycols = false`.
@xiaodai, it is a lot easier to help you if you provide more concrete information. Are you passing one named tuple with hundreds of fields, where each field is a vector with one element per row? Or are you passing a vector of named tuples? Or an iterator of named tuples?
Firstly, I create hundreds of vectors using multi-threading, so I have an (unnamed) tuple of hundreds of materialized vectors. Then I give them names via a NamedTuple. Come to think of it, I can just create the DataFrame from the tuple and then assign the names, which skips the NamedTuple step.
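That idea could look something like the following sketch, assuming the tuple holds the materialized column vectors (the vectors and names here are invented placeholders):

```julia
using DataFrames

# Hypothetical stand-in for the (unnamed) tuple of materialized vectors
vecs = ([1, 2, 3], [4.0, 5.0, 6.0])
colnames = [:a, :b]

# Build the DataFrame directly from the vectors plus a separate list of
# names, skipping the intermediate NamedTuple entirely; copycols = false
# again reuses the vectors instead of copying them
df = DataFrame(collect(AbstractVector, vecs), colnames; copycols = false)
```

`collect(AbstractVector, vecs)` turns the tuple into the `Vector{AbstractVector}` form that this `DataFrame` constructor method accepts.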