Split and flatten a field from CSV and use as DataFrame

Hi all, my MWE is this one with output:

julia> people = DataFrame(name=["john", "marry", "tim"], goodat=["gardening", "teaching,negotiation", ""])
3×2 DataFrame
│ Row │ name   │ goodat               │
│     │ String │ String               │
├─────┼────────┼──────────────────────┤
│ 1   │ john   │ gardening            │
│ 2   │ marry  │ teaching,negotiation │
│ 3   │ tim    │                      │

julia> transform!(people, :goodat => ByRow(p -> split(p, ',')) => :goodatarray)
3×3 DataFrame
│ Row │ name   │ goodat               │ goodatarray                 │
│     │ String │ String               │ Array{SubString{String},1}  │
├─────┼────────┼──────────────────────┼─────────────────────────────┤
│ 1   │ john   │ gardening            │ ["gardening"]               │
│ 2   │ marry  │ teaching,negotiation │ ["teaching", "negotiation"] │
│ 3   │ tim    │                      │ [""]                        │

julia> neededSkils = DataFrame(field=["teaching", "gardening"])
2×1 DataFrame
│ Row │ field     │
│     │ String    │
├─────┼───────────┤
│ 1   │ teaching  │
│ 2   │ gardening │

julia> goodat = people[!, :goodatarray] |>
                   Iterators.flatten |>
                   @map(_) |>
                   DataFrame
4×3 DataFrame
│ Row │ string               │ offset │ ncodeunits │
│     │ String               │ Int64  │ Int64      │
├─────┼──────────────────────┼────────┼────────────┤
│ 1   │ gardening            │ 0      │ 9          │
│ 2   │ teaching,negotiation │ 0      │ 8          │
│ 3   │ teaching,negotiation │ 9      │ 11         │
│ 4   │                      │ 0      │ 0          │

julia> DataFrames.innerjoin(neededSkils, goodat, on = :field => :string)
1×3 DataFrame
│ Row │ field     │ offset │ ncodeunits │
│     │ String    │ Int64  │ Int64      │
├─────┼───────────┼────────┼────────────┤
│ 1   │ gardening │ 0      │ 9          │

This example is really artificial, but let’s say there are

  1. people with skills and
  2. needed skills at market.

I need to join these two DataFrames to find what needs could be fulfilled.

What’s wrong? the DataFrames.innerjoin should find gardening and teaching, but found only gardening.

It’s obvious that goodat is constructed in a bad way. I expected it will be just flattened array of strings.

What is an idiomatic way how to do that?


For faster review, this is the same example without output

using Query
using DataFrames

people = DataFrame(name=["john", "marry", "tim"], goodat=["gardening", "teaching,negotiation", ""])
transform!(people, :goodat => ByRow(p -> split(p, ',')) => :goodatarray)

neededSkils = DataFrame(field=["teaching", "gardening"])

goodat = people[!, :goodatarray] |> 
            Iterators.flatten |>
            @map(_) |> 
            DataFrame

DataFrames.innerjoin(neededSkils, goodat, on = :field => :string)

There is a flatten function in DataFrames that does what you want.

julia> people
3×3 DataFrame
│ Row │ name   │ goodat               │ goodatarray                 │
│     │ String │ String               │ Array{SubString{String},1}  │
├─────┼────────┼──────────────────────┼─────────────────────────────┤
│ 1   │ john   │ gardening            │ ["gardening"]               │
│ 2   │ marry  │ teaching,negotiation │ ["teaching", "negotiation"] │
│ 3   │ tim    │                      │ [""]                        │

julia> flatten(people, :goodatarray)
4×3 DataFrame
│ Row │ name   │ goodat               │ goodatarray │
│     │ String │ String               │ SubString…  │
├─────┼────────┼──────────────────────┼─────────────┤
│ 1   │ john   │ gardening            │ gardening   │
│ 2   │ marry  │ teaching,negotiation │ teaching    │
│ 3   │ marry  │ teaching,negotiation │ negotiation │
│ 4   │ tim    │                      │             │

After you flatten, the join should work as you expected.

1 Like

Thanks a lot. This really solves my problem. Totally missed that function in documentation…

Output:

julia> DataFrames.innerjoin(neededSkils, flatten(people, :goodatarray), on = :field => :goodatarray)
2×3 DataFrame
│ Row │ field     │ name   │ goodat               │
│     │ String    │ String │ String               │
├─────┼───────────┼────────┼──────────────────────┤
│ 1   │ teaching  │ marry  │ teaching,negotiation │
│ 2   │ gardening │ john   │ gardening            │

Besides that I see the types are different

julia> neededSkils
2×1 DataFrame
│ Row │ field     │
│     │ String    │
├─────┼───────────┤
│ 1   │ teaching  │
│ 2   │ gardening │

julia> flatten(people, :goodatarray)
4×3 DataFrame
│ Row │ name   │ goodat               │ goodatarray │
│     │ String │ String               │ SubString…  │
├─────┼────────┼──────────────────────┼─────────────┤
│ 1   │ john   │ gardening            │ gardening   │
│ 2   │ marry  │ teaching,negotiation │ teaching    │
│ 3   │ marry  │ teaching,negotiation │ negotiation │
│ 4   │ tim    │                      │             │

Is there any implicit conversion between String and SubString or how does that work?

Thankfully it “just works” in this scenario. In DataFrames, rows are matched by the hash of their ids. Fortunately it looks like the developers decided to make Strings and SubStrings have the same hash.

julia> s = "a string"
"a string"

julia> a = "a a string"
"a a string"

julia> t = chop(a, head = 2, tail = 0)
"a string"

julia> typeof(t)
SubString{String}

julia> typeof(s)
String

julia> hash(t) === hash(s)
true
2 Likes

@pdeffebach I appreciate your answers :bowing_man:

If anyone knows some good reads about such things like equality, type conversions, etc., please answer this post.