Extracting hashtags from text: Flattening in Query.jl

Hi everyone! I’m currently trying to extract hashtags from text. While I’ve figured out a working solution, I can’t shake the feeling that there’s a more elegant one. I’m used to the Tidyverse in R, which is why I’m using Queryverse/Query.jl.

In the end, each hashtag should be in its own row (“unnesting” in tidyverse). Here’s what I’ve got so far:

using Queryverse

df = DataFrame(text = ["This is the #best #thing #ever", "I #love #Julia"])

df_tags = df |>
@mutate(matches = collect(eachmatch(r"#[\wÄäÖöÜüß]+", _.text))) |>
@mutate(tags = map(x -> x.match, _.matches)) |>
@select(-:matches) |>
DataFrame

df_tags = flatten(df_tags, :tags)

These are my questions:

  1. Is there a way to combine the two @mutate statements into one? In R, this was a single line: mutate(tags = str_extract_all(text, '#[\\wÄäÖöÜüß]+')) %>%
  2. Is there a way to integrate the flattening into the piping sequence? I have tried something like @map(x -> flatten(x, :tags), _) or @map(flatten(_, :tags)), but neither works. More generally: how do I apply a non-Query.jl function to the whole dataframe?

I’d be really glad if anyone could help me get a better understanding of Julia.

I can’t comment on the Queryverse stuff, but as for the regex, I believe r"#\p{L}+" would pick up a hash followed by any Unicode letters (unlike \w, it won’t match digits or underscores). Unless the only non-ASCII characters allowed are ÄäÖöÜüß?
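
To make the difference between the two patterns concrete, here is a minimal sketch in plain Julia. The sample string is made up for illustration. It relies on Julia's default regex behavior, where \w is Unicode-aware, so the explicit umlauts in the original character class are probably redundant.

```julia
# Hypothetical sample text, chosen to show where the two patterns differ.
text = "Grüße aus #München, #julia_lang und #2024"

# \p{L} matches Unicode letters only: no digits, no underscore.
letters_only = [m.match for m in eachmatch(r"#\p{L}+", text)]
# letters_only == ["#München", "#julia"]

# \w matches letters, digits and underscore, and is Unicode-aware by default in Julia.
word_chars = [m.match for m in eachmatch(r"#\w+", text)]
# word_chars == ["#München", "#julia_lang", "#2024"]
```

Which one is preferable depends on whether hashtags such as #2024 or #julia_lang should count as matches.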


I don’t use Queryverse, but I guess you could put the flatten call in there by making another anonymous function out of it like |> x -> flatten(x, :tags).
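
As a rough sketch of that idea, using plain DataFrames.jl rather than Query.jl (so it only shows the piping part, not the Queryverse macros), and assuming a small helper called get_tags that is introduced here for illustration:

```julia
using DataFrames

df = DataFrame(text = ["This is the #best #thing #ever", "I #love #Julia"])

# Illustrative helper: all hashtags found in one string.
get_tags(s) = [m.match for m in eachmatch(r"#\p{L}+", s)]

# Build a column of tag vectors, then pipe the result into an anonymous
# function that calls flatten on the whole data frame.
df_tags = transform(df, :text => ByRow(get_tags) => :tags) |>
    (d -> flatten(d, :tags))
```

The parentheses around the anonymous function are not strictly required here, but they make it obvious where the function body ends if further |> steps are appended.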

With regard to the @mutate calls, you can get your matches directly with [x.match for x in eachmatch(r"#\p{L}+", _.text)], instead of collecting first and then extracting. If you come from R you might not know this list comprehension syntax yet.
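
For anyone new to comprehensions, a small sketch in plain Julia showing that the two-step version and the comprehension give the same result (the variable names are only for illustration):

```julia
text = "This is the #best #thing #ever"

# Two steps: collect the RegexMatch objects, then pull out each matched substring.
two_step = map(m -> m.match, collect(eachmatch(r"#\p{L}+", text)))

# One step: iterate over the matches directly inside a comprehension.
one_step = [m.match for m in eachmatch(r"#\p{L}+", text)]

# Both are ["#best", "#thing", "#ever"].
```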

Finally, this is how I would write this: just plain DataFrames.jl plus a helper package called Chain.jl, which lets you use any function in the pipe without making anonymous functions first (no matter whether the piped thing is the first, second, etc. argument), and without needing the |> symbol.

using DataFrames
using Chain


df = DataFrame(text = ["This is the #best #thing #ever", "I #love #Julia"])

get_tags(s) = [x.match for x in eachmatch(r"#\p{L}+", s)]

@chain df begin
    transform(:text => ByRow(get_tags) => :tags)
    flatten(:tags)
end

Hi everyone!

Excuse the late reply; it took me some time to play around with the code. Thank you both for your input. I’d never seen that \p{L} syntax before, and it’ll be very useful in the future.

While experimenting with your code, @jules, I discovered DataFramesMeta.jl, which gives me a familiar grammar coming from R and, from what I’ve gathered, integrates better with the wider Julia ecosystem than Queryverse.jl. Splitting the eachmatch call into its own function is a great tip, and I’ll have a closer look at the list comprehension syntax.

This is what I ended up with, for reference:

using DataFrames
using DataFramesMeta

df = DataFrame(text = ["This is the #best #thing #ever", "I #love #Julia"])

get_tags(s) = [x.match for x in eachmatch(r"#\p{L}+", s)]

df_tags = @linq df |>
transform(tags = get_tags.(:text)) |>
flatten(:tags)

Thanks again!
