Bug with filtering DataFrames in combination with ShiftedArrays

PeterDM · May 24, 2025, 9:21pm

It seems filtering DataFrames is not compatible with ShiftedArrays in a way one would expect it to be:

using DataFrames
using ShiftedArrays

issquare(n::Integer)=(n==round(sqrt(n))^2)

N=20
numbers=collect(1:N)
successors=ShiftedArrays.lead(numbers)
predecessors=ShiftedArrays.lag(numbers)

Test=DataFrame([numbers,successors,predecessors],[:n,:s,:p])
GoodFiltered=filter!( row->issquare(row.n), Test )

This gives the correct result, for example containing a row with n=9, s=10, p=8
However, when I slightly vary the code and replace the last two lines of code with

Test=DataFrame([numbers],[:n])
Test.s=ShiftedArrays.lead(Test.n)
Test.p=ShiftedArrays.lag(Test.n)
BadFiltered=filter!( row->issquare(row.n), Test )

This gives the wrong result, for example containing a row with n=9, s=16, p=4

It is very unexpected to me, there might be some explanation but it certainly feels like this should not be the way filter! works, especially since filter without ! works well as before.

Thanks for any help!

Dan · May 24, 2025, 9:41pm

It’s an issue with copycols behaviour. For example:

Test=DataFrame([numbers],[:n])
Test.p=copy(ShiftedArrays.lag(Test.n))
Test.s=copy(ShiftedArrays.lead(Test.n))
BadFiltered=filter!( row->issquare(row.n), Test )

works okay. The first version has a default copycols named parameter set to true. So, to get the error in the first case:

julia> Test=DataFrame([numbers,successors,predecessors],[:n,:s,:p]; 
  copycols=false)   # get bad behaviour without copy

julia> filter!( row->issquare(row.n), Test )
4×3 DataFrame
 Row │ n      s        p       
     │ Int64  Int64?   Int64?  
─────┼─────────────────────────
   1 │     1        4  missing 
   2 │     4        9        1
   3 │     9       16        4
   4 │    16  missing        9

The bad behaviour is caused by modifying memory which is aliased by several vectors (efficient but dangerous in this case).

PeterDM · May 26, 2025, 8:05am

Thanks a lot, Dan! I figured there would be an explanation. Like you say, it is somewhat dangerous. I would have preferred filter! to detect that columns are only aliases, or at least that the documentation warns for this. Anyhow, case closed.

Benny · May 26, 2025, 1:21pm

If you’re going to delete or insert rows in-place, then it’s better to have your table eagerly evaluate and own all its entries, that’s why copycols defaults to true. Even without aliased columns, deleting or inserting elements to uncopied columns from external references makes it easy to corrupt a DataFrame, cause row mismatches, or access memory unsafely. This isn’t particular to DataFrames, deleting or inserting elements can cause the same weird behaviors among aliased vectors:

julia> x = [1,2]; y = view(x, :); pop!(x); (x,y)
([1], [1, 2])

julia> push!(x, 3); (x,y)
([1, 3], [1, 3])

As for why uncopied columns are allowed in DataFrames at all, it’s intended for sharing simpler setindex!, which is why copycols=false still copies some types like ranges into new Vectors. If you need to filter these, the non-inplace filter is available to handle all the copying, at the cost of the aliasing.

That’s the thing, it ~~does use a conservative heuristic Base.mightalias (exact test is undecideable IIRC) to~~ entirely skips over more blatantly aliased columns. If it didn’t, filter! (and the underlying deleteat!) would just error on many wrappers of vectors, including ShiftedVector in your case, and result in a corrupted DataFrame with columns of different lengths. In this case that’s arguably better than a silent mismatch of rows, but there are also cases where filter! skipping the aliased columns is not an issue, like unshifted columns intended just for properties to access the same data. There isn’t a strictly known condition for filter! having unwanted behavior.

PeterDM · May 26, 2025, 1:54pm

Thanks Benny. I see your point, but then I have some more questions:

Apparently, Base.mightalias(v,ShiftedArrays.lag(v)) returns false, and not true as I would have expected according to your explanation. Edit: After some tries I found that Base.mightalias(v,parent(ShiftedArrays.lag(v))) does return true…
I acknowledge that filter! would want to check this to avoid corrupted tables, but instead of skipping the columns, why not throw an exception in that case (it was suggested in the discussion i found on the discourse discussion “whats-the-aliasing-story-in-julia” if i interpret correctly… Perhaps it is like you say only occasionally that it does not work properly, and sometimes the wanted behaviour is indeed to skip columns. I personally find it a bit dangerous to just assume skipping is the best thing to do and not talk about it in docs.

The undecidability is intriguing, but as tempting as it is, I won’t go into this…

Benny · May 26, 2025, 2:29pm

Good factcheck, Base.mightalias is an internal function because it only checks for some Base types, falling back to an often wrong false otherwise. It indeed doesn’t work on ShiftedArrays because it was never implemented for it. I misread its use in DataFrames, the actual alias check is just a much simpler check of whether deleting entries of the first column had reduced the length of the others:

    for i in 2:length(cols)
        col = cols[i]
        # this check is done to handle column aliases
        if length(col) == n
            deleteat!(col, drop)
        end
    end

It would error if the columns weren’t aliasing each other because deleteat! is intentionally not implemented for ShiftedArrays, and it wouldn’t need to error in some very useful cases of aliasing (ShiftedArray(x) defaulting to 0 shift would be entirely safe). The “What’s the aliasing story” thread also had proponents of hard errors or automatic handling, but that unfairly punishes some entirely proper use cases or adds an uncertain heuristic every single call. Not wasting electricity on the same check in a hot loop justifies the existence of functions that don’t perform that check, and semantically ensuring unaliased inputs or intermediates (for example, like using the non-inplace filter for DataFrames) is much better than relying on an uncertain heuristic. That’s what made me take the side of “document this”, and the copycols keyword section doesn’t really say why there’s a true/false distinction. It’s easy to run into these issues if people overlook array aliasing (ShiftedArrays does openly say it’s lazy) or go overboard with preallocating memory.

PeterDM · May 26, 2025, 2:33pm

After careful consideration, I have to agree with your point of view! Thanks!

Topic		Replies	Views
`filter` on a `DataFrame` is not allocating a new object as expected Data dataframes	7	449	February 19, 2022
Using previous row values to create values for a new column New to Julia dataframes	4	2651	March 20, 2021
Frustrated using DataFrames New to Julia dataframes , data_structures	97	10538	April 22, 2022
Query.jl extremely slow General Usage	6	739	February 23, 2019
DataFrames: obtaining the subset of rows by a set of values New to Julia dataframes	45	24035	April 27, 2024

Bug with filtering DataFrames in combination with ShiftedArrays

Related topics