Dealing with NaNs

@sijo, the example provided had only 10 rows, for which I get:

julia> using DataFrames, BenchmarkTools

julia> @btime filter(AsTable(:) => row->!any(isnan(x) for x in row), $df);
  5.400 μs (48 allocations: 4.34 KiB)

julia> @btime filter(row->!any(isnan(x) for x in row), $df);
  3.575 μs (41 allocations: 2.17 KiB)

julia> @btime filter((:) => (values...)->!any(isnan.(values)), $df);
  11.100 μs (59 allocations: 2.73 KiB)

But for such small data frames, performance doesn’t really matter, does it?

I just realized that the columns of that data frame have an abstract element type (Real). Here are timings with Float64 columns and many rows (1,000,000):

julia> df = DataFrame(x=rand((-1.0,1.0,NaN),1000000), y=rand((-2.0,2.0,NaN),1000000), z=rand((-3.0,3.0,NaN),1000000));

julia> @btime filter(AsTable(:) => row->!any(isnan(x) for x in row), $df);
  16.694 ms (52 allocations: 32.06 MiB)

julia> @btime filter(row->!any(isnan(x) for x in row), $df);
  449.876 ms (10556281 allocations: 202.47 MiB)

julia> @btime filter((:) => (values...)->!any(isnan.(values)), $df);
  4.534 ms (29 allocations: 9.17 MiB)
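To check whether a data frame has abstractly typed columns like the Real ones mentioned above, and to narrow them to Float64, something along these lines should work. The frame df_abs and its values are made up for illustration; it is not the frame from the original example.

```julia
using DataFrames

# Hypothetical frame whose columns have the abstract element type Real
df_abs = DataFrame(x = Real[1.0, NaN, 2.0], y = Real[NaN, 3.0, 4.0])

# Inspect the element type of every column (both are Real here)
eltype.(eachcol(df_abs))

# Convert each column to a concretely typed Float64 vector
df_f64 = mapcols(col -> Float64.(col), df_abs)

# Now every column has element type Float64
eltype.(eachcol(df_f64))
```

With concrete column types the compiler can specialize the isnan checks, so timings measured on Real columns are probably not representative.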

And for a small dataset with many columns (10 rows, 1000 columns; here the compilation time is much longer):

julia> df = DataFrame([rand((-1.0,1.0,NaN)) for _ in 1:10, _ in 1:1000], :auto);

julia> @btime filter(AsTable(:) => row->!any(isnan(x) for x in row), $df);
  1.299 ms (12607 allocations: 1.16 MiB)

julia> @btime filter(row->!any(isnan(x) for x in row), $df);
  190.694 μs (2081 allocations: 145.98 KiB)

julia> @btime filter((:) => (values...)->!any(isnan.(values)), $df);
[compilation takes too long, I didn't wait]
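Since the row-wise variants above either allocate heavily with many rows or compile slowly with many columns, one alternative that might be worth trying is to build the row mask column by column. This is a different technique from the three filter calls above, I have not benchmarked it here, and drop_nan_rows is a hypothetical helper name. It assumes every column holds floating-point numbers and contains no missing values.

```julia
using DataFrames

# Hypothetical helper: keep only the rows that have no NaN in any column.
# Assumes all columns are numeric and contain no `missing`.
function drop_nan_rows(df::AbstractDataFrame)
    keep = trues(nrow(df))          # start by keeping every row
    for col in eachcol(df)
        keep .&= .!isnan.(col)      # drop rows where this column is NaN
    end
    return df[keep, :]              # returns a new data frame
end

# Small made-up example
df_small = DataFrame(a = [1.0, NaN, 3.0], b = [NaN, 2.0, 4.0])
clean = drop_nan_rows(df_small)
```

Each loop iteration works on a whole column through broadcasting, so nothing has to be compiled for a 1000-field row type.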