I am having an issue with a transform operation on a DataFrame that is read from a .csv file using CSV. I don't have this issue when my DataFrame has few rows, because importing from the .csv file then uses SentinelArrays. However, when I have a larger DataFrame, PooledArrays are used.

The problem is the following: I have a column with missing values. I would like to replace each missing value with a unique symbol using gensym(). The transformation works as expected on the freshly created DataFrame, but when I save it as a .csv and read it into a new DataFrame, all missing values are replaced with the same gensym() symbol. Any thoughts here? If I try to use map instead of transform, the same issue occurs.

julia> using CSV, DataFrames
julia> x=DataFrame(A=vcat("C",repeat([missing],1000)),B=3)
1001×2 DataFrame
  Row │ A        B
      │ String?  Int64
──────┼────────────────
    1 │ C            3
    2 │ missing      3
    3 │ missing      3
    4 │ missing      3
    5 │ missing      3
  ⋮   │    ⋮       ⋮
  995 │ missing      3
  996 │ missing      3
  997 │ missing      3
  998 │ missing      3
  999 │ missing      3
 1000 │ missing      3
 1001 │ missing      3
      955 rows omitted
julia> CSV.write("x.csv",x)
"x.csv"
julia> transform(x,:A => ByRow(i -> ismissing(i) ? gensym() : i) => :A)
1001×2 DataFrame
  Row │ A       B
      │ Any     Int64
──────┼──────────────
    1 │ C           3
    2 │ ##1754      3
    3 │ ##1755      3
    4 │ ##1756      3
    5 │ ##1757      3
  ⋮   │   ⋮       ⋮
  995 │ ##2747      3
  996 │ ##2748      3
  997 │ ##2749      3
  998 │ ##2750      3
  999 │ ##2751      3
 1000 │ ##2752      3
 1001 │ ##2753      3
     955 rows omitted
julia> y=CSV.read("x.csv",DataFrame);
julia> transform(y,:A => ByRow(i -> ismissing(i) ? gensym() : i) => :A)
1001×2 DataFrame
  Row │ A       B
      │ Any     Int64
──────┼──────────────
    1 │ C           3
    2 │ ##2755      3
    3 │ ##2755      3
    4 │ ##2755      3
    5 │ ##2755      3
  ⋮   │   ⋮       ⋮
  995 │ ##2755      3
  996 │ ##2755      3
  997 │ ##2755      3
  998 │ ##2755      3
  999 │ ##2755      3
 1000 │ ##2755      3
 1001 │ ##2755      3
     955 rows omitted
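
One possible explanation, which is a guess and not something confirmed above: a pooled column stores each distinct value only once, so an elementwise operation may end up being applied once per distinct value instead of once per row, and every missing would then share a single gensym() result. Under that assumption, a workaround that might be worth trying is to iterate over the column explicitly, so that gensym() is called once per row regardless of how the column is stored. This is only a minimal sketch: fill_missing_unique is a hypothetical helper name, and the small DataFrame stands in for the one read back from x.csv.

```julia
using DataFrames

# Hypothetical helper (not from the post): walk the column one element at a
# time so gensym() is called once per missing entry, whatever the storage type.
fill_missing_unique(col) = [ismissing(v) ? gensym() : v for v in col]

# Small stand-in for the column that would be read back from x.csv.
df = DataFrame(A = vcat("C", repeat([missing], 5)), B = 3)

# Passing the function without ByRow hands it the whole column vector.
result = transform(df, :A => fill_missing_unique => :A)

# Check that every replacement symbol is distinct.
println(allunique(result.A))
```

The same call should apply unchanged to the DataFrame returned by CSV.read, since the comprehension only relies on iterating the column. Whether the pooled column is really the cause would still need to be checked against the installed CSV and PooledArrays versions.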