Dropping rows from DataFrame

I have a DataFrame read in from a json file, and the first step it to select just a few columns and then strip out any rows where one of the columns (a string) starts with ‘<’. I have tried a couple of things, ultimately ending with

function filter_captures(df)
@linq df |>
select(:user, :term, :policy, :time) |>
where(:term .startswith(r"[1].*))
end

the where term I can’t work out. I see I can broadcast over :term column, but I want to broadcast where the first character is NOT ‘<’ (or alternatively where it IS a ‘*’ or a letter). I’m unclear how broadcast is supposed to work with a function like startswith. Any ideas?

thanks in advance


  1. ^< ↩︎

Hi, a while ago I posted a solved quite similar question here. Maybe this helps.

Thanks for the reply Mike. I did look at that, but it doesn’t deal with the essential issue, which is how you broadcast a function that takes arguments (including the content of the item). I have filled in the startswith() function now, but startswith takes two arguments, I don’t understand what the first one should be, or whether you can broadcast such a function at all.

I am not sure if I really understand what you mean with “broadcast”. But the problem of your initial question (as I understood it) should work with:

using DataFrames

function filter_captures!(df)
    filter!(row -> !startswith(row.A, "<"),  df)
end

df = DataFrame(A = ["<X", "*X", "Y", "<Y"], B = ["C", "C", "C", "D"])
print(df, "\n\n")

filter_captures!(df)
print(df)

gives

4×2 DataFrame
│ Row │ A      │ B      │
│     │ String │ String │
├─────┼────────┼────────┤
│ 1   │ <X     │ C      │
│ 2   │ *X     │ C      │
│ 3   │ Y      │ C      │
│ 4   │ <Y     │ D      │

2×2 DataFrame
│ Row │ A      │ B      │
│     │ String │ String │
├─────┼────────┼────────┤
│ 1   │ *X     │ C      │
│ 2   │ Y      │ C      │

Not sure about @linq, but you can just use normal syntax:

julia> using DataFrames

julia> df = DataFrame(term = rand([string.('a':'z'); "<"], 100), b = rand(100));

julia> df[startswith.(df.term, "<"), :]
2×2 DataFrame
│ Row │ term   │ b        │
│     │ String │ Float64  │
├─────┼────────┼──────────┤
│ 1   │ <      │ 0.918707 │
│ 2   │ <      │ 0.942731 │

and negate with .!

That did it! I went down the @linq route, and the examples showed that the where clause could broadcast .< over a column, but it got me nowhere. I found the function would compile if I changed to

where(:term, :startswith.(:term, r"[1].*"))

But that fails on run with an indecipherable error about ‘&’. But your simpler solution works. Thanks very much!


  1. < ↩︎

got it, at least in part. My original problem was looking at the example code for DataFrames which broadcasts using .<, which I thought meant I should use .startswith(), when in fact it should be startswith.(). What you have there makes sense, I was clearly making it too complicated. Thanks very much.

you can also write:

filter(:term => startswith("<"), df)

Shorter and shorter. Now I need to go away and work out exactly how that works. Thanks!

In 0.22 that will be soon released small changes will be introduced, but in this case nothing changes.

To answer your original question with DataFramesMeta

julia> df = DataFrame(a = ["xy", "xz", "yz"])
3×1 DataFrame
│ Row │ a      │
│     │ String │
├─────┼────────┤
│ 1   │ xy     │
│ 2   │ xz     │
│ 3   │ yz     │

julia> @where(df, startswith.(:a, r"x") .== false)
1×1 DataFrame
│ Row │ a      │
│     │ String │
├─────┼────────┤
│ 1   │ yz     │

got it, so within @linq it becomes

@linq df |>
select(:user, :term, :policy, :time) |>
where(startswith.(:term, r"<") .== false)

correct? So now I have several approaches, very useful.

yeah that sounds right