Dropping rows from DataFrame

I have a DataFrame read in from a json file, and the first step it to select just a few columns and then strip out any rows where one of the columns (a string) starts with ‘<’. I have tried a couple of things, ultimately ending with

function filter_captures(df)
@linq df |>
select(:user, :term, :policy, :time) |>
where(:term .startswith(r"[1].*))
end

the where term I can’t work out. I see I can broadcast over :term column, but I want to broadcast where the first character is NOT ‘<’ (or alternatively where it IS a ‘*’ or a letter). I’m unclear how broadcast is supposed to work with a function like startswith. Any ideas?

thanks in advance


  1. ^< ↩︎

Hi, a while ago I posted a solved quite similar question here. Maybe this helps.

Thanks for the reply Mike. I did look at that, but it doesn’t deal with the essential issue, which is how you broadcast a function that takes arguments (including the content of the item). I have filled in the startswith() function now, but startswith takes two arguments, I don’t understand what the first one should be, or whether you can broadcast such a function at all.

I am not sure if I really understand what you mean with “broadcast”. But the problem of your initial question (as I understood it) should work with:

using DataFrames

function filter_captures!(df)
    filter!(row -> !startswith(row.A, "<"),  df)
end

df = DataFrame(A = ["<X", "*X", "Y", "<Y"], B = ["C", "C", "C", "D"])
print(df, "\n\n")

filter_captures!(df)
print(df)

gives

4×2 DataFrame
│ Row │ A      │ B      │
│     │ String │ String │
├─────┼────────┼────────┤
│ 1   │ <X     │ C      │
│ 2   │ *X     │ C      │
│ 3   │ Y      │ C      │
│ 4   │ <Y     │ D      │

2×2 DataFrame
│ Row │ A      │ B      │
│     │ String │ String │
├─────┼────────┼────────┤
│ 1   │ *X     │ C      │
│ 2   │ Y      │ C      │
1 Like

Not sure about @linq, but you can just use normal syntax:

julia> using DataFrames

julia> df = DataFrame(term = rand([string.('a':'z'); "<"], 100), b = rand(100));

julia> df[startswith.(df.term, "<"), :]
2×2 DataFrame
│ Row │ term   │ b        │
│     │ String │ Float64  │
├─────┼────────┼──────────┤
│ 1   │ <      │ 0.918707 │
│ 2   │ <      │ 0.942731 │

and negate with .!

1 Like

That did it! I went down the @linq route, and the examples showed that the where clause could broadcast .< over a column, but it got me nowhere. I found the function would compile if I changed to

where(:term, :startswith.(:term, r"[1].*"))

But that fails on run with an indecipherable error about ‘&’. But your simpler solution works. Thanks very much!


  1. < ↩︎

got it, at least in part. My original problem was looking at the example code for DataFrames which broadcasts using .<, which I thought meant I should use .startswith(), when in fact it should be startswith.(). What you have there makes sense, I was clearly making it too complicated. Thanks very much.

you can also write:

filter(:term => startswith("<"), df)
1 Like

Shorter and shorter. Now I need to go away and work out exactly how that works. Thanks!

In 0.22 that will be soon released small changes will be introduced, but in this case nothing changes.

2 Likes

To answer your original question with DataFramesMeta

julia> df = DataFrame(a = ["xy", "xz", "yz"])
3×1 DataFrame
│ Row │ a      │
│     │ String │
├─────┼────────┤
│ 1   │ xy     │
│ 2   │ xz     │
│ 3   │ yz     │

julia> @where(df, startswith.(:a, r"x") .== false)
1×1 DataFrame
│ Row │ a      │
│     │ String │
├─────┼────────┤
│ 1   │ yz     │
1 Like

got it, so within @linq it becomes

@linq df |>
select(:user, :term, :policy, :time) |>
where(startswith.(:term, r"<") .== false)

correct? So now I have several approaches, very useful.

yeah that sounds right