Converting DataValues.DataValue{String} to String

Hi all, just continuing from Read CSV and change rows later - #6 by pdeffebach

I’m trying to convert a column like this:

using CSV
using DataFrames
using Query

file = """
"X1", "X2", "X3", "x4", "Splits"
"5674012","530489692","batch_145322","10/31/2019 15:00:13",
"5674012","530489702","batch_145323","10/31/2019 15:00:32","9b4e08e5"
"5674012","530489728","batch_145327","10/31/2019 15:01:56","b036aa66,b036aa67,b036aa68"
"""

io = IOBuffer(file)
df = CSV.File(io; header = true, delim = ',') |>
    DataFrame |>
    @mutate(Splits = !ismissing(_.Splits) ? split(_.Splits, ',') : String[]) |>
    DataFrame
show(df)

This gives me an error ERROR: LoadError: MethodError: no method matching split(::DataValues.DataValue{String}, ::Char). Ok, I got it, some conversion is needed.

But this works (just trying to use string joining to test the behaviour):

io = IOBuffer(file)
df = CSV.File(io; header = true, delim = ',') |>
   DataFrame |>
   @mutate(Splits = _.Splits * "some suffix") |>
   DataFrame

So I’m confused that * operator handles that object correctly.

How to work with the DataValue{String} in my case, please?

split(::DataValues.DataValue{String}, ::Char)
is not defined, because DataValue{String} is not AbstractString as you can see here:

ERROR: MethodError: no method matching split(::DataValues.DataValue{String}, ::Char)
Closest candidates are:
  split(::T, ::AbstractChar; limit, keepempty) where T<:AbstractString at strings/util.jl:321

But you can convert DataValue{String} to a String split(string(_.Splits), ','):

julia> df = CSV.File(io; header = true, delim = ',') |>
           DataFrame |>
           @mutate(Splits = !ismissing(_.Splits) ? split(string(_.Splits), ',') : String[]) |>
           DataFrame
3Γ—5 DataFrame
β”‚ Row β”‚ X1      β”‚ X2        β”‚ X3           β”‚ x4                  β”‚ Splits                                                      β”‚
β”‚     β”‚ Int64   β”‚ Int64     β”‚ String       β”‚ String              β”‚ Array{SubString{String},1}                                  β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 5674012 β”‚ 530489692 β”‚ batch_145322 β”‚ 10/31/2019 15:00:13 β”‚ ["DataValue{String}()"]                                     β”‚
β”‚ 2   β”‚ 5674012 β”‚ 530489702 β”‚ batch_145323 β”‚ 10/31/2019 15:00:32 β”‚ ["DataValue{String}(\"9b4e08e5\")"]                         β”‚
β”‚ 3   β”‚ 5674012 β”‚ 530489728 β”‚ batch_145327 β”‚ 10/31/2019 15:01:56 β”‚ ["DataValue{String}(\"b036aa66", "b036aa67", "b036aa68\")"] β”‚
1 Like

We should just add a split method to DataValues.jl that handles this case (that is why the * case works, we have that method defined in DataValues.jl).

The two canonical ways to go from DataValue{T} to T are x[] (assuming x is of type DataValue{T}) or get(x). With get you can also specify a default value that should be returned in case x has no value: get(x, "something").

Before the broadcasting revamp I also had lifting via the . operator working, so in that case you could have just written split.(_.Splits, ','), but I never found the time to reenable that for Julia 1.x… I should probably look into that again.

1 Like

This would probably work, but DataValue conversion to string adds it’s type as well. How it behaves:

julia> df = CSV.File(io; header = true, delim = ',') |>
           DataFrame |>
           @mutate(Splits = split(string(_.Splits), 'a')) |>
           DataFrame
3Γ—5 DataFrame
β”‚ Row β”‚ X1      β”‚ X2        β”‚ X3           β”‚ x4                  β”‚ Splits                                                                           β”‚
β”‚     β”‚ Int64   β”‚ Int64     β”‚ String       β”‚ String              β”‚ Array{SubString{String},1}                                                       β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 5674012 β”‚ 530489692 β”‚ batch_145322 β”‚ 10/31/2019 15:00:13 β”‚ ["D", "t", "V", "lue{String}()"]                                                 β”‚
β”‚ 2   β”‚ 5674012 β”‚ 530489702 β”‚ batch_145323 β”‚ 10/31/2019 15:00:32 β”‚ ["D", "t", "V", "lue{String}(\"9b4e08e5\")"]                                     β”‚
β”‚ 3   β”‚ 5674012 β”‚ 530489728 β”‚ batch_145327 β”‚ 10/31/2019 15:01:56 β”‚ ["D", "t", "V", "lue{String}(\"b036", "", "66,b036", "", "67,b036", "", "68\")"]

See the last column.

@davidanthoff I already tried get(x) before, but it was throwing exceptions.

Simplified example:

julia> df = CSV.File(io; header = true, delim = ',') |>
           DataFrame |>
           @mutate(Splits = split(get(_.Splits), ',')) |>
           DataFrame
ERROR: DataValues.DataValueException()
Stacktrace:
 [1] get at C:\Users\u\.julia\packages\DataValues\N7oeL\src\scalar\core.jl:78 [inlined]
 [2] #104 at C:\Users\u\.julia\packages\Query\AwBtd\src\query_translation.jl:58 [inlined]
 [3] iterate at C:\Users\u\.julia\packages\QueryOperators\g4G21\src\enumerable\enumerable_map.jl:25 [inlined]
 [4] iterate at C:\Users\u\.julia\packages\Tables\okt7x\src\tofromdatavalues.jl:45 [inlined]
 [5] iterate at .\iterators.jl:139 [inlined]
 [6] iterate at .\iterators.jl:138 [inlined]
 [7] buildcolumns at C:\Users\u\.julia\packages\Tables\okt7x\src\fallbacks.jl:126 [inlined]
 [8] columns at C:\Users\u\.julia\packages\Tables\okt7x\src\fallbacks.jl:237 [inlined]
 [9] DataFrame(::QueryOperators.EnumerableMap{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,Array{SubString{String},1}}},QueryOperators.EnumerableIterable{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,DataValues.DataValue{String}}},Tables.DataValueRowIterator{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,DataValues.DataValue{String}}},Tables.Schema{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,Union{Missing, String}}},Tables.RowIterator{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Array{Int64,1},Array{Int64,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{Union{Missing, String},1}}}}}},var"#104#106"}; copycols::Bool) at C:\Users\u\.julia\packages\DataFrames\S3ZFo\src\other\tables.jl:40
 [10] DataFrame(::QueryOperators.EnumerableMap{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,Array{SubString{String},1}}},QueryOperators.EnumerableIterable{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,DataValues.DataValue{String}}},Tables.DataValueRowIterator{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,DataValues.DataValue{String}}},Tables.Schema{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,Union{Missing, String}}},Tables.RowIterator{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Array{Int64,1},Array{Int64,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{Union{Missing, String},1}}}}}},var"#104#106"}) at C:\Users\u\.julia\packages\DataFrames\S3ZFo\src\other\tables.jl:31
 [11] |>(::QueryOperators.EnumerableMap{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,Array{SubString{String},1}}},QueryOperators.EnumerableIterable{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,DataValues.DataValue{String}}},Tables.DataValueRowIterator{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,DataValues.DataValue{String}}},Tables.Schema{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,Union{Missing, String}}},Tables.RowIterator{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Array{Int64,1},Array{Int64,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{Union{Missing, String},1}}}}}},var"#104#106"}, ::Type{T} where T) at .\operators.jl:823
 [12] top-level scope at REPL[211]:100:

Anyway, when I specify the default value, it works as expected.

julia> df = CSV.File(io; header = true, delim = ',') |>
           DataFrame |>
           @mutate(Splits = split(get(_.Splits, ""), ',')) |>
           DataFrame
3Γ—5 DataFrame
β”‚ Row β”‚ X1      β”‚ X2        β”‚ X3           β”‚ x4                  β”‚ Splits                               β”‚
β”‚     β”‚ Int64   β”‚ Int64     β”‚ String       β”‚ String              β”‚ Array{SubString{String},1}           β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 5674012 β”‚ 530489692 β”‚ batch_145322 β”‚ 10/31/2019 15:00:13 β”‚ [""]                                 β”‚
β”‚ 2   β”‚ 5674012 β”‚ 530489702 β”‚ batch_145323 β”‚ 10/31/2019 15:00:32 β”‚ ["9b4e08e5"]                         β”‚
β”‚ 3   β”‚ 5674012 β”‚ 530489728 β”‚ batch_145327 β”‚ 10/31/2019 15:01:56 β”‚ ["b036aa66", "b036aa67", "b036aa68"] β”‚

Guys, thanks for help. I’ll probably return with new simple questions later. Please bear with me :slight_smile:

That suggests that you have some rows with missing values in your data, in which case just converting won’t work, but the get variant where you tell it what value to use if a value is missing does work.

1 Like

Ok, got it.

One more question please. I have bunch of CSV files. Some of them have all columns with proper data, but some of them are missing.

So e.g. file 1 with missing data at [1,5]:

contents = """
"X1", "X2", "X3", "x4", "Splits"
"5674012","530489692","batch_145322","10/31/2019 15:00:13",
"5674012","530489702","batch_145323","10/31/2019 15:00:32","9b4e08e5"
"5674012","530489728","batch_145327","10/31/2019 15:01:56","b036aa66,b036aa67,b036aa68"
"""

file 2:

contents = """
"X1", "X2", "X3", "x4", "Splits"
"5674012","530489692","batch_145322","10/31/2019 15:00:13","somethinghere"
"5674012","530489702","batch_145323","10/31/2019 15:00:32","9b4e08e5"
"5674012","530489728","batch_145327","10/31/2019 15:01:56","b036aa66,b036aa67,b036aa68"
"""

How should I handle such case if I’d like to use this code

df = CSV.File(io; header = true, delim = ',') |>
           DataFrame |>
           @mutate(Splits = split(get(_.Splits, ""), ',')) |>
           DataFrame

(this get(_.Splits, "") fails for file 2, because type of values in last column is String whereas it’s :DataValues.DataValue{String} in file 1)

I feel that I might go in wrong direction because checking of types of the DataFrame’s column doesn’t feel too natural…

Ok, so I solved it like this:

df[ismissing.(df[!, :Splits]), :Splits] .= ""
df[!, :Splits] = convert.(String, df[!, :Splits])
df = df |> 
     @mutate(Splits = length(_.Splits) > 0 ? split(_.Splits, ';') : String[]) |> 
     DataFrame

This manual replacement and conversion to given type feels kinda dirty. Not happy with that, but it works.

I’m also worried about the performance impact when reading large CSVs…

With the updated version of DataFrames we the following should work with both files, though to be fair the ByRow(t -> passmissing(split)(t, ','))) is a bit arcane. passmissing is a helper function to return a missing value if any of the values is missing. You can then do another pass of the data to make missings what you want them to be, an empty string array I think.

It’s just

julia> contents = """
       "X1", "X2", "X3", "x4", "Splits"
       "5674012","530489692","batch_145322","10/31/2019 15:00:13","somethinghere"
       "5674012","530489702","batch_145323","10/31/2019 15:00:32","9b4e08e5"
       "5674012","530489728","batch_145327","10/31/2019 15:01:56","b036aa66,b036aa67,b036aa68"
       """;

julia> io = IOBuffer(contents);

julia> df = CSV.File(io; header = true, delim = ',') |>
                  DataFrame;

julia> transform(df, [:Splits] => ByRow(t -> passmissing(split)(t, ',')))
3Γ—6 DataFrame. Omitted printing of 1 columns
β”‚ Row β”‚ X1      β”‚ X2        β”‚ X3           β”‚ x4                  β”‚ Splits                     β”‚
β”‚     β”‚ Int64   β”‚ Int64     β”‚ String       β”‚ String              β”‚ String                     β”‚
β”œβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ 1   β”‚ 5674012 β”‚ 530489692 β”‚ batch_145322 β”‚ 10/31/2019 15:00:13 β”‚ somethinghere              β”‚
β”‚ 2   β”‚ 5674012 β”‚ 530489702 β”‚ batch_145323 β”‚ 10/31/2019 15:00:32 β”‚ 9b4e08e5                   β”‚
β”‚ 3   β”‚ 5674012 β”‚ 530489728 β”‚ batch_145327 β”‚ 10/31/2019 15:01:56 β”‚ b036aa66,b036aa67,b036aa68 β”‚

EDIT: Broadcasting is probably simpler. You can do

julia> transform(df, [:Splits] => t -> passmissing(split).(t, ','))

Thank you @pdeffebach . This is something I’ll try probably later. Meanwhile I battle with Query @mutate changes other column type :slight_smile: I’m just curious what’s going on, that’s why I post a new question.