It seems filtering DataFrames is not compatible with ShiftedArrays in a way one would expect it to be:
using DataFrames
using ShiftedArrays
issquare(n::Integer)=(n==round(sqrt(n))^2)
N=20
numbers=collect(1:N)
successors=ShiftedArrays.lead(numbers)
predecessors=ShiftedArrays.lag(numbers)
Test=DataFrame([numbers,successors,predecessors],[:n,:s,:p])
GoodFiltered=filter!( row->issquare(row.n), Test )
This gives the correct result, for example containing a row with n=9, s=10, p=8
However, when I slightly vary the code and replace the last two lines of code with
Test=DataFrame([numbers],[:n])
Test.s=ShiftedArrays.lead(Test.n)
Test.p=ShiftedArrays.lag(Test.n)
BadFiltered=filter!( row->issquare(row.n), Test )
This gives the wrong result, for example containing a row with n=9, s=16, p=4
It is very unexpected to me, there might be some explanation but it certainly feels like this should not be the way filter! works, especially since filter without ! works well as before.
Thanks a lot, Dan! I figured there would be an explanation. Like you say, it is somewhat dangerous. I would have preferred filter! to detect that columns are only aliases, or at least that the documentation warns for this. Anyhow, case closed.
If you’re going to delete or insert rows in-place, then it’s better to have your table eagerly evaluate and own all its entries, that’s why copycols defaults to true. Even without aliased columns, deleting or inserting elements to uncopied columns from external references makes it easy to corrupt a DataFrame, cause row mismatches, or access memory unsafely. This isn’t particular to DataFrames, deleting or inserting elements can cause the same weird behaviors among aliased vectors:
As for why uncopied columns are allowed in DataFrames at all, it’s intended for sharing simpler setindex!, which is why copycols=false still copies some types like ranges into new Vectors. If you need to filter these, the non-inplace filter is available to handle all the copying, at the cost of the aliasing.
That’s the thing, it does use a conservative heuristic Base.mightalias (exact test is undecideable IIRC) to entirely skips over more blatantly aliased columns. If it didn’t, filter! (and the underlying deleteat!) would just error on many wrappers of vectors, including ShiftedVector in your case, and result in a corrupted DataFrame with columns of different lengths. In this case that’s arguably better than a silent mismatch of rows, but there are also cases where filter! skipping the aliased columns is not an issue, like unshifted columns intended just for properties to access the same data. There isn’t a strictly known condition for filter! having unwanted behavior.
Thanks Benny. I see your point, but then I have some more questions:
Apparently, Base.mightalias(v,ShiftedArrays.lag(v)) returns false, and not true as I would have expected according to your explanation. Edit: After some tries I found that Base.mightalias(v,parent(ShiftedArrays.lag(v))) does return true…
I acknowledge that filter! would want to check this to avoid corrupted tables, but instead of skipping the columns, why not throw an exception in that case (it was suggested in the discussion i found on the discourse discussion “whats-the-aliasing-story-in-julia” if i interpret correctly… Perhaps it is like you say only occasionally that it does not work properly, and sometimes the wanted behaviour is indeed to skip columns. I personally find it a bit dangerous to just assume skipping is the best thing to do and not talk about it in docs.
The undecidability is intriguing, but as tempting as it is, I won’t go into this…
Good factcheck, Base.mightalias is an internal function because it only checks for some Base types, falling back to an often wrong false otherwise. It indeed doesn’t work on ShiftedArrays because it was never implemented for it. I misread its use in DataFrames, the actual alias check is just a much simpler check of whether deleting entries of the first column had reduced the length of the others:
for i in 2:length(cols)
col = cols[i]
# this check is done to handle column aliases
if length(col) == n
deleteat!(col, drop)
end
end
It would error if the columns weren’t aliasing each other because deleteat! is intentionally not implemented for ShiftedArrays, and it wouldn’t need to error in some very useful cases of aliasing (ShiftedArray(x) defaulting to 0 shift would be entirely safe). The “What’s the aliasing story” thread also had proponents of hard errors or automatic handling, but that unfairly punishes some entirely proper use cases or adds an uncertain heuristic every single call. Not wasting electricity on the same check in a hot loop justifies the existence of functions that don’t perform that check, and semantically ensuring unaliased inputs or intermediates (for example, like using the non-inplace filter for DataFrames) is much better than relying on an uncertain heuristic. That’s what made me take the side of “document this”, and the copycols keyword section doesn’t really say why there’s a true/false distinction. It’s easy to run into these issues if people overlook array aliasing (ShiftedArrays does openly say it’s lazy) or go overboard with preallocating memory.