CSV.jl: findmax and argmax are still broken, please help

findmax and argmax are still broken with CSV.jl, please help

Test code
using CSV
using DataFrames
using StableRNGs
using Test

function write_data(dims::Dims{3} = (16, 2, 8))
    rng = StableRNG(1230)
    data = rand(rng, 0:9, dims)
    mkpath("test_data")
    for i in 1:size(data, 3)
        CSV.write(
              "test_data/test_$i.csv",
              DataFrame(data[:,:,i], [:x,:y])
        )
    end
    return nothing
end

function read_data(n_files = 8)
    return CSV.read(
        ["test_data/test_$i.csv" for i in 1:n_files],
        DataFrame
    )
end

write_data()
df = read_data()
sq(x) = x^2
@testset "Is CSV.jl broken?" begin
    @testset "findmax" begin
        @test findmax(df.x) == findmax(collect(df.x))
        @test findmax(df.y) == findmax(collect(df.y))
        @test findmax(sq, df.x) == findmax(sq, collect(df.x))
        @test findmax(sq, df.y) == findmax(sq, collect(df.y))
    end
    @testset "argmax" begin
        @test argmax(df.x) == argmax(collect(df.x))
        @test argmax(df.y) == argmax(collect(df.y))
        @test argmax(sq, df.x) == argmax(sq, collect(df.x))
        @test argmax(sq, df.y) == argmax(sq, collect(df.y))
    end
    @testset "findmin" begin
        @test findmin(df.x) == findmin(collect(df.x))
        @test findmin(df.y) == findmin(collect(df.y))
        @test findmin(sq, df.x) == findmin(sq, collect(df.x))
        @test findmin(sq, df.y) == findmin(sq, collect(df.y))
    end
    @testset "argmin" begin
        @test argmin(df.x) == argmin(collect(df.x))
        @test argmin(df.y) == argmin(collect(df.y))
        @test argmin(sq, df.x) == argmin(sq, collect(df.x))
        @test argmin(sq, df.y) == argmin(sq, collect(df.y))
    end
end

When I run the above test code I get the following results.

Test Summary:     | Pass  Fail  Total  Time
Is CSV.jl broken? |   10     6     16  1.5s
  findmax         |          4      4  1.2s
  argmax          |    2     2      4  0.1s
  findmin         |    4            4  0.1s
  argmin          |    4            4  0.1s

Here are the failing tests.

findmax: Test Failed at /home/mkitti/blah/csv_is_broken/test_csv.jl:31
  Expression: findmax(df.x) == findmax(collect(df.x))
   Evaluated: (9, 105) == (9, 9)

findmax: Test Failed at /home/mkitti/blah/csv_is_broken/test_csv.jl:32
  Expression: findmax(df.y) == findmax(collect(df.y))
   Evaluated: (9, 120) == (9, 8)

findmax: Test Failed at /home/mkitti/blah/csv_is_broken/test_csv.jl:33
  Expression: findmax(sq, df.x) == findmax(sq, collect(df.x))
   Evaluated: (81, 105) == (81, 9)

findmax: Test Failed at /home/mkitti/blah/csv_is_broken/test_csv.jl:34
  Expression: findmax(sq, df.y) == findmax(sq, collect(df.y))
   Evaluated: (81, 120) == (81, 8)

argmax: Test Failed at /home/mkitti/blah/csv_is_broken/test_csv.jl:37
  Expression: argmax(df.x) == argmax(collect(df.x))
   Evaluated: 105 == 9

argmax: Test Failed at /home/mkitti/blah/csv_is_broken/test_csv.jl:38
  Expression: argmax(df.y) == argmax(collect(df.y))
   Evaluated: 120 == 8

The problem was previously posted here:

I know the issue is in SentinelArrays.jl:

I think I have a fix:

However, it has been a month, and it is still broken. Please help me fix this.

8 Likes

Sorry if this is obvious, but how can people help?

2 Likes

Looks like somebody just needs to press the merge button, but apparently nobody with that power is reading this. I know that my comment does not help in any way but I understand the frustration, and this is a considerable bug in a widely used package.

1 Like
  1. Review the pull request
  2. Find downstream bugs resulting from this
  3. Write tests for downstream packages including CSV.jl. What else could be broken and how do we systematically find them?
  4. Figure out what process defects exist leading to this situation
  5. Join the JuliaData GitHub organization and volunteer to help maintain these packages

I could find someone to escalate my privileges to merge my own pull request, but that is an act of desperation and a last resort. I should not be merging my own PRs for a package as critical as this.

7 Likes

Write tests for downstream packages including CSV.jl. What else could be broken and how do we systematically find them?

Interestingly, JuliaHub does not list any direct dependents of CSV.jl. It may be one of those packages that mostly ends up in interactive use and scripts and not packages, and this may have contributed to the bug not being found. On the flip side, there may be a large number of scripts/analyses with partially incorrect results beceause of it.

Join the JuliaData GitHub organization and volunteer to help maintain these packages

I’m happy to become a member and help with maintenance. Perhaps merging JuliaData with JuliaIO would also help, as they seem to have large overlap in interest and the latter org has more members already.

That’s odd, as I know that my own SynthControl.jl depends on CSV.jl currently (and somewhat unnecessarily, but that’s a different story…) - indeed JuliaHub shows CSV in the list of dependencies, so it looks like something’s off with JuliaHub’s dependency functionality?

Does it use SentinelArrays.jl somewhere? :sweat_smile:

Jokes aside: JuliaHub does not show any dependents of SentinelArrays.jl which of course is also wrong since CSV.jl depends on it. So I used the snippet from here to find the direct dependents:

julia> collect(keys(reverse_deps("SentinelArrays")))
15-element Vector{String}:
 "CSV"
 "Parquet"
 "ReadStatTables"
 "Arrow"
 "TableOperations"
 "Avro"
 "ChunkedCSV"
 "Dagger"
 "ImpedanceSpectroscopy"
 "UnROOT"
 "Parquet2"
 "DTables"
 "DataFrames"
 "ChunkedBase"
 "ChunkedJSONL"

I find that in total there are 2989 indirect dependents of SentinelArrays.jl.

Code
julia> using Pkg
julia> const ctx = Pkg.Types.Context();

julia> function reverse_deps(name) # from the linked topic
           nuuid = Pkg.Types.registered_uuid(ctx.registries, name)
           dd = Dict{String,Base.UUID}()
           isnothing(nuuid) && "$name not known"
           for reg in Pkg.Registry.reachable_registries()
               for (uuid, regpkg) in reg
                   pkg = Pkg.Registry.registry_info(regpkg)
                   deps = reduce(merge, values(pkg.deps), init=Dict{String,Base.UUID}())
                   x = get(deps, name, nothing)
                   if !isnothing(x) && x == nuuid
                       dd[regpkg.name] = uuid
                   end
       
               end
           end
           dd
       end
julia> function recursive_dependents(rootname)
       todo = reverse_deps(rootname)
       alldeps = empty(todo)
       while length(todo) > 0
           name, uuid = pop!(todo)
           name in keys(alldeps) && continue
           alldeps[name] = uuid
           new = reverse_deps(name)
           todo = merge!(todo, new)
       end
       return alldeps
       end
julia> all = recursive_dependents("SentinelArrays")
Dict{String, Base.UUID} with 2989 entries:
  "TopoPlots"                     => UUID("2bdbdf9c-dbd8-403f-947b-1a4e0dd41a7a")
  "Foresight"                     => UUID("c6e1034d-1844-4b96-abf2-c5824a5aa8c3")
  "BloqadeGates"                  => UUID("9b473568-af80-491c-8f51-e251334462c1")
  "BifurcationKit"                => UUID("0f109fa4-8a5d-4b75-95aa-f515264e7665")
  "DrelTools"                     => UUID("fc805444-c5ec-4e06-bf74-6216be16ab28")
  "Ai4EComponentLib"              => UUID("9901752a-13b3-47c7-8fb4-ce17182539d9")
  "Lectionary"                    => UUID("6244d477-9cf0-4e07-8f6b-f8ee26bc0292")
  "PowerModelsONM"                => UUID("25264005-a304-4053-a338-565045d392ac")
  "Avalon"                        => UUID("d462d321-1585-4840-9f3e-06f5c1b6ea25")
  "GigaSOM"                       => UUID("a03a9c34-069e-5582-a11c-5c984cab887c")
  "ODMXMLTools"                   => UUID("2456a17b-6ca2-4f51-9342-f0287e829718")
  "RestrictedBoltzmannMachines"   => UUID("12e6b396-7db5-4506-8cb6-664a4fe1e50e")
  "Treebanks"                     => UUID("64f30800-8bfd-42c1-8d67-a0fe2fc2f5f5")
  "Trixi2Vtk"                     => UUID("bc1476a1-1ca6-4cc3-950b-c312b255ff95")
  "BeastUtils"                    => UUID("af749cb1-4ea1-40dc-96cd-a9902a2c5721")
  "AlgebraicDynamics"             => UUID("5fd6ff03-a254-427e-8840-ba658f502e32")
  "PosteriorPlots"                => UUID("196f2941-2d58-45ba-9f13-43a2532b2fa8")
  "JqData"                        => UUID("4296d2f8-d960-4a51-97e7-b097944e4d59")
  "FeatherLib"                    => UUID("409f5150-fb84-534f-94db-80d1e10f57e1")
  "RateTables"                    => UUID("d40fb65e-c2ee-4113-9e14-cb96ca0acb32")
  "IntervalConstraintProgramming" => UUID("138f1668-1576-5ad7-91b9-7425abbf3153")
  "ScTenifold"                    => UUID("fcca1770-266e-4af5-8612-876362e279e9")
  "Simulate"                      => UUID("7fe0908f-881e-4672-99a3-35ccdc95dcfc")
  "Solaris"                       => UUID("c31bc6c7-79d6-424e-a85f-5b42ee086500")
  "RvSpectMLPlots"                => UUID("6ad363e8-653f-4efd-a04b-f033e69a984c")
  "DynamicalSystemsBase"          => UUID("6e36e845-645a-534a-86f2-f5d4aa5a06b4")
  "CurrentPopulationSurvey"       => UUID("d0231af6-78e7-4e9e-9602-dcd107aeccbe")
  "QSimulator"                    => UUID("84ebd660-cff8-11e8-33d2-5334dd07f13c")
  ⋮                               => ⋮

1 Like