Split and flatten a field from CSV and use as DataFrame

stej · May 26, 2020, 7:32pm

Hi all, my MWE is this one with output:

julia> people = DataFrame(name=["john", "marry", "tim"], goodat=["gardening", "teaching,negotiation", ""])
3×2 DataFrame
│ Row │ name   │ goodat               │
│     │ String │ String               │
├─────┼────────┼──────────────────────┤
│ 1   │ john   │ gardening            │
│ 2   │ marry  │ teaching,negotiation │
│ 3   │ tim    │                      │

julia> transform!(people, :goodat => ByRow(p -> split(p, ',')) => :goodatarray)
3×3 DataFrame
│ Row │ name   │ goodat               │ goodatarray                 │
│     │ String │ String               │ Array{SubString{String},1}  │
├─────┼────────┼──────────────────────┼─────────────────────────────┤
│ 1   │ john   │ gardening            │ ["gardening"]               │
│ 2   │ marry  │ teaching,negotiation │ ["teaching", "negotiation"] │
│ 3   │ tim    │                      │ [""]                        │

julia> neededSkils = DataFrame(field=["teaching", "gardening"])
2×1 DataFrame
│ Row │ field     │
│     │ String    │
├─────┼───────────┤
│ 1   │ teaching  │
│ 2   │ gardening │

julia> goodat = people[!, :goodatarray] |>
                   Iterators.flatten |>
                   @map(_) |>
                   DataFrame
4×3 DataFrame
│ Row │ string               │ offset │ ncodeunits │
│     │ String               │ Int64  │ Int64      │
├─────┼──────────────────────┼────────┼────────────┤
│ 1   │ gardening            │ 0      │ 9          │
│ 2   │ teaching,negotiation │ 0      │ 8          │
│ 3   │ teaching,negotiation │ 9      │ 11         │
│ 4   │                      │ 0      │ 0          │

julia> DataFrames.innerjoin(neededSkils, goodat, on = :field => :string)
1×3 DataFrame
│ Row │ field     │ offset │ ncodeunits │
│     │ String    │ Int64  │ Int64      │
├─────┼───────────┼────────┼────────────┤
│ 1   │ gardening │ 0      │ 9          │

This example is really artificial, but let’s say there are

people with skills and
needed skills at market.

I need to join these two DataFrames to find what needs could be fulfilled.

What’s wrong? the DataFrames.innerjoin should find gardening and teaching, but found only gardening.

It’s obvious that goodat is constructed in a bad way. I expected it will be just flattened array of strings.

What is an idiomatic way how to do that?

For faster review, this is the same example without output

using Query
using DataFrames

people = DataFrame(name=["john", "marry", "tim"], goodat=["gardening", "teaching,negotiation", ""])
transform!(people, :goodat => ByRow(p -> split(p, ',')) => :goodatarray)

neededSkils = DataFrame(field=["teaching", "gardening"])

goodat = people[!, :goodatarray] |> 
            Iterators.flatten |>
            @map(_) |> 
            DataFrame

DataFrames.innerjoin(neededSkils, goodat, on = :field => :string)

pdeffebach · May 26, 2020, 7:40pm

There is a flatten function in DataFrames that does what you want.

julia> people
3×3 DataFrame
│ Row │ name   │ goodat               │ goodatarray                 │
│     │ String │ String               │ Array{SubString{String},1}  │
├─────┼────────┼──────────────────────┼─────────────────────────────┤
│ 1   │ john   │ gardening            │ ["gardening"]               │
│ 2   │ marry  │ teaching,negotiation │ ["teaching", "negotiation"] │
│ 3   │ tim    │                      │ [""]                        │

julia> flatten(people, :goodatarray)
4×3 DataFrame
│ Row │ name   │ goodat               │ goodatarray │
│     │ String │ String               │ SubString…  │
├─────┼────────┼──────────────────────┼─────────────┤
│ 1   │ john   │ gardening            │ gardening   │
│ 2   │ marry  │ teaching,negotiation │ teaching    │
│ 3   │ marry  │ teaching,negotiation │ negotiation │
│ 4   │ tim    │                      │             │

After you flatten, the join should work as you expected.

stej · May 27, 2020, 6:51am

Thanks a lot. This really solves my problem. Totally missed that function in documentation…

Output:

julia> DataFrames.innerjoin(neededSkils, flatten(people, :goodatarray), on = :field => :goodatarray)
2×3 DataFrame
│ Row │ field     │ name   │ goodat               │
│     │ String    │ String │ String               │
├─────┼───────────┼────────┼──────────────────────┤
│ 1   │ teaching  │ marry  │ teaching,negotiation │
│ 2   │ gardening │ john   │ gardening            │

stej · May 27, 2020, 6:53am

Besides that I see the types are different

julia> neededSkils
2×1 DataFrame
│ Row │ field     │
│     │ String    │
├─────┼───────────┤
│ 1   │ teaching  │
│ 2   │ gardening │

julia> flatten(people, :goodatarray)
4×3 DataFrame
│ Row │ name   │ goodat               │ goodatarray │
│     │ String │ String               │ SubString…  │
├─────┼────────┼──────────────────────┼─────────────┤
│ 1   │ john   │ gardening            │ gardening   │
│ 2   │ marry  │ teaching,negotiation │ teaching    │
│ 3   │ marry  │ teaching,negotiation │ negotiation │
│ 4   │ tim    │                      │             │

Is there any implicit conversion between String and SubString or how does that work?

pdeffebach · May 27, 2020, 1:05pm

Thankfully it “just works” in this scenario. In DataFrames, rows are matched by the hash of their ids. Fortunately it looks like the developers decided to make Strings and SubStrings have the same hash.

julia> s = "a string"
"a string"

julia> a = "a a string"
"a a string"

julia> t = chop(a, head = 2, tail = 0)
"a string"

julia> typeof(t)
SubString{String}

julia> typeof(s)
String

julia> hash(t) === hash(s)
true

stej · May 27, 2020, 2:44pm

@pdeffebach I appreciate your answers

If anyone knows some good reads about such things like equality, type conversions, etc., please answer this post.

Topic		Replies	Views
Split dataframe row into multiple rows Data dataframes	8	1795	May 1, 2022
Transforming DataFrame from column of vector General Usage dataframes	3	246	December 1, 2022
JuliaDB.flatten equivalent in DataFrames Data dataframes	7	1148	November 14, 2019
Expand row to multiple rows Data	11	1293	December 15, 2020
Flatten in case column contains string and array Data question	6	719	May 17, 2021

Split and flatten a field from CSV and use as DataFrame

Related topics