Converting DataValues.DataValue{String} to String

stej · May 7, 2020, 8:41am

Hi all, just continuing from "Read CSV and change rows later" (#6 by pdeffebach).

I’m trying to convert a column like this:

using CSV
using DataFrames
using Query

file = """
"X1", "X2", "X3", "x4", "Splits"
"5674012","530489692","batch_145322","10/31/2019 15:00:13",
"5674012","530489702","batch_145323","10/31/2019 15:00:32","9b4e08e5"
"5674012","530489728","batch_145327","10/31/2019 15:01:56","b036aa66,b036aa67,b036aa68"
"""

io = IOBuffer(file)
df = CSV.File(io; header = true, delim = ',') |>
    DataFrame |>
    @mutate(Splits = !ismissing(_.Splits) ? split(_.Splits, ',') : String[]) |>
    DataFrame
show(df)

This gives me an error: ERROR: LoadError: MethodError: no method matching split(::DataValues.DataValue{String}, ::Char). OK, I get it: some conversion is needed.

But this works (just trying to use string joining to test the behaviour):

io = IOBuffer(file)
df = CSV.File(io; header = true, delim = ',') |>
   DataFrame |>
   @mutate(Splits = _.Splits * "some suffix") |>
   DataFrame

So I’m confused about why the * operator handles that object correctly.

How should I work with the DataValue{String} in my case, please?

oheil · May 7, 2020, 9:20am

split(::DataValues.DataValue{String}, ::Char)
is not defined, because DataValue{String} is not an AbstractString, as you can see here:

ERROR: MethodError: no method matching split(::DataValues.DataValue{String}, ::Char)
Closest candidates are:
  split(::T, ::AbstractChar; limit, keepempty) where T<:AbstractString at strings/util.jl:321

But you can convert DataValue{String} to a String with string, i.e. split(string(_.Splits), ','):

julia> df = CSV.File(io; header = true, delim = ',') |>
           DataFrame |>
           @mutate(Splits = !ismissing(_.Splits) ? split(string(_.Splits), ',') : String[]) |>
           DataFrame
3×5 DataFrame
│ Row │ X1      │ X2        │ X3           │ x4                  │ Splits                                                      │
│     │ Int64   │ Int64     │ String       │ String              │ Array{SubString{String},1}                                  │
├─────┼─────────┼───────────┼──────────────┼─────────────────────┼─────────────────────────────────────────────────────────────┤
│ 1   │ 5674012 │ 530489692 │ batch_145322 │ 10/31/2019 15:00:13 │ ["DataValue{String}()"]                                     │
│ 2   │ 5674012 │ 530489702 │ batch_145323 │ 10/31/2019 15:00:32 │ ["DataValue{String}(\"9b4e08e5\")"]                         │
│ 3   │ 5674012 │ 530489728 │ batch_145327 │ 10/31/2019 15:01:56 │ ["DataValue{String}(\"b036aa66", "b036aa67", "b036aa68\")"] │

davidanthoff · May 7, 2020, 9:56am

We should just add a split method to DataValues.jl that handles this case (that is why the * case works, we have that method defined in DataValues.jl).

The two canonical ways to go from DataValue{T} to T are x[] (assuming x is of type DataValue{T}) or get(x). With get you can also specify a default value that should be returned in case x has no value: get(x, "something").
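A small sketch of those access patterns, run outside of Query on hand-built values. It assumes the DataValues.jl package is installed; the names present and absent are only for illustration and stand in for what Query passes as _.Splits:

```julia
using DataValues

# Hand-built stand-ins for the values Query hands to @mutate.
present = DataValue("b036aa66,b036aa67")   # wraps a String
absent  = DataValue{String}()              # no value, like the empty CSV field

unwrapped = get(present)                   # plain String; present[] does the same
fallback  = get(absent, "")                # the default is returned because there is no value
parts     = split(get(present, ""), ',')   # split now sees an ordinary String

println(unwrapped)
println(isna(absent))
println(parts)
```

Calling get(absent) without a default throws a DataValueException, so the two-argument form is the safer choice whenever a column may contain empty fields.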

Before the broadcasting revamp I also had lifting via the . operator working, so in that case you could have just written split.(_.Splits, ','), but I never found the time to reenable that for Julia 1.x… I should probably look into that again.

stej · May 7, 2020, 11:44am

This would probably work, but converting a DataValue to a string adds its type as well. Here is how it behaves:

julia> df = CSV.File(io; header = true, delim = ',') |>
           DataFrame |>
           @mutate(Splits = split(string(_.Splits), 'a')) |>
           DataFrame
3×5 DataFrame
│ Row │ X1      │ X2        │ X3           │ x4                  │ Splits                                                                           │
│     │ Int64   │ Int64     │ String       │ String              │ Array{SubString{String},1}                                                       │
├─────┼─────────┼───────────┼──────────────┼─────────────────────┼──────────────────────────────────────────────────────────────────────────────────┤
│ 1   │ 5674012 │ 530489692 │ batch_145322 │ 10/31/2019 15:00:13 │ ["D", "t", "V", "lue{String}()"]                                                 │
│ 2   │ 5674012 │ 530489702 │ batch_145323 │ 10/31/2019 15:00:32 │ ["D", "t", "V", "lue{String}(\"9b4e08e5\")"]                                     │
│ 3   │ 5674012 │ 530489728 │ batch_145327 │ 10/31/2019 15:01:56 │ ["D", "t", "V", "lue{String}(\"b036", "", "66,b036", "", "67,b036", "", "68\")"] │

See the last column.

stej · May 7, 2020, 11:49am

@davidanthoff I already tried get(x) before, but it was throwing exceptions.

Simplified example:

julia> df = CSV.File(io; header = true, delim = ',') |>
           DataFrame |>
           @mutate(Splits = split(get(_.Splits), ',')) |>
           DataFrame
ERROR: DataValues.DataValueException()
Stacktrace:
 [1] get at C:\Users\u\.julia\packages\DataValues\N7oeL\src\scalar\core.jl:78 [inlined]
 [2] #104 at C:\Users\u\.julia\packages\Query\AwBtd\src\query_translation.jl:58 [inlined]
 [3] iterate at C:\Users\u\.julia\packages\QueryOperators\g4G21\src\enumerable\enumerable_map.jl:25 [inlined]
 [4] iterate at C:\Users\u\.julia\packages\Tables\okt7x\src\tofromdatavalues.jl:45 [inlined]
 [5] iterate at .\iterators.jl:139 [inlined]
 [6] iterate at .\iterators.jl:138 [inlined]
 [7] buildcolumns at C:\Users\u\.julia\packages\Tables\okt7x\src\fallbacks.jl:126 [inlined]
 [8] columns at C:\Users\u\.julia\packages\Tables\okt7x\src\fallbacks.jl:237 [inlined]
 [9] DataFrame(::QueryOperators.EnumerableMap{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,Array{SubString{String},1}}},QueryOperators.EnumerableIterable{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,DataValues.DataValue{String}}},Tables.DataValueRowIterator{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,DataValues.DataValue{String}}},Tables.Schema{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,Union{Missing, String}}},Tables.RowIterator{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Array{Int64,1},Array{Int64,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{Union{Missing, String},1}}}}}},var"#104#106"}; copycols::Bool) at C:\Users\u\.julia\packages\DataFrames\S3ZFo\src\other\tables.jl:40
 [10] DataFrame(::QueryOperators.EnumerableMap{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,Array{SubString{String},1}}},QueryOperators.EnumerableIterable{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,DataValues.DataValue{String}}},Tables.DataValueRowIterator{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,DataValues.DataValue{String}}},Tables.Schema{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,Union{Missing, String}}},Tables.RowIterator{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Array{Int64,1},Array{Int64,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{Union{Missing, String},1}}}}}},var"#104#106"}) at C:\Users\u\.julia\packages\DataFrames\S3ZFo\src\other\tables.jl:31
 [11] |>(::QueryOperators.EnumerableMap{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,Array{SubString{String},1}}},QueryOperators.EnumerableIterable{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,DataValues.DataValue{String}}},Tables.DataValueRowIterator{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,DataValues.DataValue{String}}},Tables.Schema{(:X1, :X2, :X3, :x4, :Splits),Tuple{Int64,Int64,String,String,Union{Missing, String}}},Tables.RowIterator{NamedTuple{(:X1, :X2, :X3, :x4, :Splits),Tuple{Array{Int64,1},Array{Int64,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{String,1},WeakRefStrings.StringArray{Union{Missing, String},1}}}}}},var"#104#106"}, ::Type{T} where T) at .\operators.jl:823
 [12] top-level scope at REPL[211]:100:

Anyway, when I specify the default value, it works as expected.

julia> df = CSV.File(io; header = true, delim = ',') |>
           DataFrame |>
           @mutate(Splits = split(get(_.Splits, ""), ',')) |>
           DataFrame
3×5 DataFrame
│ Row │ X1      │ X2        │ X3           │ x4                  │ Splits                               │
│     │ Int64   │ Int64     │ String       │ String              │ Array{SubString{String},1}           │
├─────┼─────────┼───────────┼──────────────┼─────────────────────┼──────────────────────────────────────┤
│ 1   │ 5674012 │ 530489692 │ batch_145322 │ 10/31/2019 15:00:13 │ [""]                                 │
│ 2   │ 5674012 │ 530489702 │ batch_145323 │ 10/31/2019 15:00:32 │ ["9b4e08e5"]                         │
│ 3   │ 5674012 │ 530489728 │ batch_145327 │ 10/31/2019 15:01:56 │ ["b036aa66", "b036aa67", "b036aa68"] │
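
One detail in the output above: row 1 holds [""] rather than an empty array, because splitting an empty string yields a single empty field. If an empty array is preferred for the row without a value, Base's keepempty keyword is one option. A minimal sketch with plain strings, outside of Query:

```julia
# Splitting an empty string gives one empty field, not zero fields.
with_empty = split("", ',')

# keepempty = false drops empty fields, so the empty string yields an empty array.
without_empty = split("", ','; keepempty = false)

# Input without empty fields is unaffected by the keyword.
ids = split("b036aa66,b036aa67,b036aa68", ','; keepempty = false)

println(length(with_empty))
println(length(without_empty))
println(ids)
```

Keep in mind that keepempty = false also drops empty fields in the middle of a string, such as the one in "a,,b", which may or may not be what you want.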

stej · May 7, 2020, 11:56am

Guys, thanks for the help. I’ll probably return with new simple questions later. Please bear with me.

davidanthoff · May 7, 2020, 12:41pm

That suggests that you have some rows with missing values in your data, in which case just converting won’t work, but the get variant where you tell it what value to use if a value is missing does work.

stej · May 7, 2020, 12:57pm

Ok, got it.

One more question, please. I have a bunch of CSV files. Some of them have proper data in every column, but some of them have missing values.

So e.g. file 1 with missing data at [1,5]:

contents = """
"X1", "X2", "X3", "x4", "Splits"
"5674012","530489692","batch_145322","10/31/2019 15:00:13",
"5674012","530489702","batch_145323","10/31/2019 15:00:32","9b4e08e5"
"5674012","530489728","batch_145327","10/31/2019 15:01:56","b036aa66,b036aa67,b036aa68"
"""

file 2:

contents = """
"X1", "X2", "X3", "x4", "Splits"
"5674012","530489692","batch_145322","10/31/2019 15:00:13","somethinghere"
"5674012","530489702","batch_145323","10/31/2019 15:00:32","9b4e08e5"
"5674012","530489728","batch_145327","10/31/2019 15:01:56","b036aa66,b036aa67,b036aa68"
"""

How should I handle such a case if I’d like to use this code:

df = CSV.File(io; header = true, delim = ',') |>
           DataFrame |>
           @mutate(Splits = split(get(_.Splits, ""), ',')) |>
           DataFrame

(This get(_.Splits, "") fails for file 2, because the type of the values in the last column is String, whereas it’s DataValues.DataValue{String} in file 1.)

I feel that I might be going in the wrong direction, because checking the types of the DataFrame’s columns doesn’t feel too natural…
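
One way to avoid explicit type checks is to let dispatch pick the right behaviour. The sketch below is only an illustration: split_or_empty is a hypothetical helper, and it is meant to be applied to the DataFrame column directly (where values are String or missing), not inside @mutate, where Query wraps values in DataValue. Plain vectors stand in for the Splits column of each file:

```julia
# Hypothetical helper: one method per kind of input, so no manual type checks.
split_or_empty(s::AbstractString) = isempty(s) ? String[] : String.(split(s, ','))
split_or_empty(::Missing) = String[]

# Stand-ins for the Splits column of file 1 (has a missing) and file 2 (all strings).
file1_col = Union{Missing,String}[missing, "9b4e08e5", "b036aa66,b036aa67,b036aa68"]
file2_col = ["somethinghere", "9b4e08e5", "b036aa66,b036aa67,b036aa68"]

# The same call works for both columns.
splits1 = map(split_or_empty, file1_col)
splits2 = map(split_or_empty, file2_col)

println(splits1)
println(splits2)
```

With a DataFrame this would presumably be used as df.Splits = map(split_or_empty, df.Splits), regardless of whether the file contained empty fields.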

stej · May 11, 2020, 11:24am

Ok, so I solved it like this:

df[ismissing.(df[!, :Splits]), :Splits] .= ""
df[!, :Splits] = convert.(String, df[!, :Splits])
df = df |> 
     @mutate(Splits = length(_.Splits) > 0 ? split(_.Splits, ';') : String[]) |> 
     DataFrame

This manual replacement and conversion to a given type feels kinda dirty. I’m not happy with that, but it works.

I’m also worried about the performance impact when reading large CSVs…
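
For what it's worth, the first two lines of that workaround can probably be collapsed into a single broadcast of Base's coalesce, which replaces missing with a default and narrows the element type in the same step. A minimal sketch on a plain vector standing in for df.Splits (the variable names are illustrative):

```julia
# Stand-in for df.Splits as read from file 1.
col = Union{Missing,String}[missing, "9b4e08e5", "b036aa66,b036aa67,b036aa68"]

# Replace missing with ""; the element type of the result narrows to String.
filled = coalesce.(col, "")

# A column that is already all strings (file 2) passes through unchanged.
clean = coalesce.(["somethinghere", "9b4e08e5"], "")

println(filled)
println(eltype(filled))
println(clean)
```

On a DataFrame this would be df.Splits = coalesce.(df.Splits, ""). It allocates one new column, so the cost should be a single pass over the data; measure on your own files before drawing conclusions about large CSVs.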

pdeffebach · May 11, 2020, 11:57am

With the updated version of DataFrames, the following should work with both files, though to be fair the ByRow(t -> passmissing(split)(t, ',')) is a bit arcane. passmissing (from Missings.jl) wraps a function so that it returns missing if any of its arguments is missing. You can then do another pass over the data to turn the missings into whatever you want them to be; an empty string array, I think.

It’s just

julia> contents = """
       "X1", "X2", "X3", "x4", "Splits"
       "5674012","530489692","batch_145322","10/31/2019 15:00:13","somethinghere"
       "5674012","530489702","batch_145323","10/31/2019 15:00:32","9b4e08e5"
       "5674012","530489728","batch_145327","10/31/2019 15:01:56","b036aa66,b036aa67,b036aa68"
       """;

julia> io = IOBuffer(contents);

julia> df = CSV.File(io; header = true, delim = ',') |>
                  DataFrame;

julia> transform(df, [:Splits] => ByRow(t -> passmissing(split)(t, ',')))
3×6 DataFrame. Omitted printing of 1 columns
│ Row │ X1      │ X2        │ X3           │ x4                  │ Splits                     │
│     │ Int64   │ Int64     │ String       │ String              │ String                     │
├─────┼─────────┼───────────┼──────────────┼─────────────────────┼────────────────────────────┤
│ 1   │ 5674012 │ 530489692 │ batch_145322 │ 10/31/2019 15:00:13 │ somethinghere              │
│ 2   │ 5674012 │ 530489702 │ batch_145323 │ 10/31/2019 15:00:32 │ 9b4e08e5                   │
│ 3   │ 5674012 │ 530489728 │ batch_145327 │ 10/31/2019 15:01:56 │ b036aa66,b036aa67,b036aa68 │

EDIT: Broadcasting is probably simpler. You can do

julia> transform(df, [:Splits] => t -> passmissing(split).(t, ','))
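
To make the two passes concrete, here is a sketch on a plain vector with a missing entry, standing in for the Splits column of the first file. It assumes the Missings.jl package is available; the names col, parts and cleaned are illustrative:

```julia
using Missings  # provides passmissing

# Stand-in for the Splits column of the file with an empty field.
col = Union{Missing,String}[missing, "9b4e08e5", "b036aa66,b036aa67,b036aa68"]

# First pass: split where there is a value, propagate missing otherwise.
parts = passmissing(split).(col, ',')

# Second pass: turn the remaining missings into empty arrays.
cleaned = [ismissing(p) ? SubString{String}[] : p for p in parts]

println(parts)
println(cleaned)
```

The second pass is only needed if downstream code cannot deal with missing; otherwise keeping missing in the column preserves the distinction between "no value" and "empty list".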

stej · May 11, 2020, 3:51pm

Thank you, @pdeffebach. This is something I’ll probably try later. Meanwhile, I’m battling with a separate problem, "Query @mutate changes other column type". I’m just curious what’s going on there, which is why I posted it as a new question.
