Curious CSV round-trip problem

TimG · August 18, 2024, 1:44pm

(Apologies for a long post. There is a long set-up for this issue!)

This website provides data on all the grants to good causes that the National Lottery has made since it began 30 years ago. It is possible to download this data as a CSV file by clicking the link on the web page, but it is also possible to make an HTTP get call to https://nationallottery.dcms.gov.uk/api/v1/grants/csv-export/.

The files obtained by these two methods differ slightly. So for example:

f = open("grants-direct.csv", "r")
dcontent = read(f, String)
close(f)
f = open("grants-http.csv", "r")
hcontent = read(f, String)
close(f)
println("Direct file has length $(length(dcontent))")
println("HTTPget file has length $(length(hcontent))")

shows

Direct file has length 323619656
HTTPget file has length 322912529

The files are ordered differently, so direct comparisons aren’t straightforward. However, if I read each file line by line, sort and compare, I find they are reported to be identical:

d = [x for x in eachline("grants-direct.csv")]
h = [x for x in eachline("grants-http.csv")]
sort!(d)
sort!(h)
if (d == h) # true!
    println("Two arrays ARE identical")
else
    println("Two arrays are NOT identical")
end

Two arrays ARE identical

If I read the two files into a dataframe using CSV read, they also report as identical.

    direct = CSV.read("grants-direct.csv", DataFrame)
    HTTPget = CSV.read("grants-http.csv", DataFrame)
sh = sort(HTTPget, :Identifier)
sd = sort(direct, :Identifier)
if sd == sh # Also true!
    println("Two dataframes ARE identical")
else
    println("Two dataframes are NOT identical")
end

comp = sh .== sd
counter = 0
for (i, col) = enumerate(eachcol(comp)) # All columns identical!
    if !all(col)
        global counter += 1
        println(names(comp)[i], sum(col)) 
    end
end
if counter == 0
    println("All dataframe columns ARE identical")
else
    println("All dataframe columns are NOT identical")
end

counter=0
for (i, row) = enumerate(eachrow(comp)) # All rows identical
    if !all(row)
        global counter += 1
        println("Row $i is not identical") 
    end
end

if counter == 0
    println("All dataframe rows ARE identical")
else
    println("All dataframe rows are NOT identical")
end

Two dataframes ARE identical
All dataframe columns ARE identical
All dataframe rows ARE identical

So far so good! Now the tricky part.

fixdcms!(HTTPget)
fixdcms!(direct)

CSV.write("fixeddirect.csv", direct)
test = CSV.read("fixeddirect.csv", DataFrame)
println("Reading fixeddirect.csv succeeded!")
CSV.write("sortedfixedHTTPget.csv", sort(HTTPget, :Identifier))
test = CSV.read("sortedfixedHTTPget.csv", DataFrame)
println("Reading sortedfixedHTTPget.csv succeeded!")
CSV.write("fixedHTTPget.csv", HTTPget)
test = CSV.read("fixedHTTPget.csv", DataFrame)
println("Reading fixedHTTPget.csv succeeded!")

I am applying a cumbersome set of updates to the fields that record the administrative geography where the grant was awarded. Boundaries and names change over time but the database doesn’t get updated.

fixdcms

function fixdcms!(dcms)
    rename!(dcms, # rename columns to (slightly) more convenient names
        "Amount Awarded" => :Amount_Awarded,
        "Award Date" => :Award_Date,
        "Recipient Org:Identifier" => :Recipient_Org_Identifier,
        "Recipient Org:Name" => :Recipient_Org_Name,
        "Recipient Org:Ward" => :Recipient_Org_Ward,
        "Recipient Org:UK Constituency" => :Recipient_Org_UK_Constituency,
        "Recipient Org:Local Authority" => :Recipient_Org_Local_Authority,
        "Recipient Org:Region" => :Recipient_Org_Region,
        "Funding Org:Identifier" => :Funding_Org_Identifier,
        "Funding Org:Name" => :Funding_Org_Name,
        "Good Cause Area" => :Good_Cause_Area,
        "Last Modified" => :Last_Modified
    )

    println()
    print("Removing duplicate regions... ")
    @time begin
        for (i, region) in enumerate(dcms.Recipient_Org_Region)
            if region == "Yorkshire and Humberside"
                dcms.Recipient_Org_Region[i] = "Yorkshire and The Humber"
            elseif region == "Eastern"
                dcms.Recipient_Org_Region[i] = "East of England"
            end
        end
    end


    println()
    print("Removing duplicate Local Authorities... ")
    @time begin
        dcms.Recipient_Org_Local_Authority .= titlecase.(dcms.Recipient_Org_Local_Authority, wordsep=c -> occursin(c, " -/("))
        for (i, LA) in enumerate(dcms.Recipient_Org_Local_Authority)

            if LA in ["County Of Herefordshire", "Herefordshire County Of", "Herefordshire"]
                dcms.Recipient_Org_Local_Authority[i] = "Herefordshire, County Of"
            elseif LA in ["Cardiff Councik", "Cardiff Council", "Caerdydd", "Cardiff Bay"]
                dcms.Recipient_Org_Local_Authority[i] = "Cardiff"
            elseif LA in ["Caerphil", "C'philly", "Caer"]
                dcms.Recipient_Org_Local_Authority[i] = "Caerphilly"
            elseif LA in ["London Borough Of Hammersmith", "London Borough Of Hammersmith And Fulham"]
                dcms.Recipient_Org_Local_Authority[i] = "Hammersmith And Fulham"
            elseif LA in ["Shetland Island", "Shetland"]
                dcms.Recipient_Org_Local_Authority[i] = "Shetland Islands"
            elseif LA in ["Aylesbury Vale", "South Bucks", "Chiltern", "Wycombe"]
                dcms.Recipient_Org_Local_Authority[i] = "Buckinghamshire"
            elseif LA in ["Corby", "East Northamptonshire", "Kettering", "Wellingborough"]
                dcms.Recipient_Org_Local_Authority[i] = "North Northamptonshire"
            elseif LA in ["Sedgemoor", "West Somerset", "Taunton Deane", "South Somerset", "Mendip", "Somerset West And Taunton"]
                dcms.Recipient_Org_Local_Authority[i] = "Somerset"
            elseif LA in ["Shepway"]
                dcms.Recipient_Org_Local_Authority[i] = "Folkestone and Hythe"
            elseif LA in ["Forest Heath", "St Edmundsbury"]
                dcms.Recipient_Org_Local_Authority[i] = "West Suffolk"
            elseif LA in ["Daventry", "Northampton", "South Northamptonshire"]
               dcms.Recipient_Org_Local_Authority[i] = "West Northamptonshire"
            elseif LA in ["Weymouth And Portland", "West Dorset", "North Dorset", "Purbeck", "East Dorset"]
                dcms.Recipient_Org_Local_Authority[i] = "Dorset"
            elseif LA in ["Barrow-In-Furness", "Eden", "South Lakeland"]
                dcms.Recipient_Org_Local_Authority[i] = "Westmorland And Furness"
            elseif LA in ["Craven", "Hambleton", "Harrogate", "Richmondshire", "Ryedale", "Scarborough", "Selby"]
                dcms.Recipient_Org_Local_Authority[i] = "North Yorkshire"
            elseif LA in ["Bristol City Of", "Bristol", "City Of Bristol"]
                dcms.Recipient_Org_Local_Authority[i] = "Bristol, City Of"
            elseif LA in ["North Down And Ards", "Ards"]
                dcms.Recipient_Org_Local_Authority[i] = "Ards And North Down"
            elseif LA in ["Abertawe"]
                dcms.Recipient_Org_Local_Authority[i] = "Swansea"
            elseif LA in ["Antrim", "Newtownabbey"]
                dcms.Recipient_Org_Local_Authority[i] = "Antrim And Newtownabbey"
            elseif LA in ["Neath & Port Talbot"]
                dcms.Recipient_Org_Local_Authority[i] = "Neath Port Talbot"
            elseif LA in ["Allerdale", "Copeland", "Carlisle"]
                dcms.Recipient_Org_Local_Authority[i] = "Cumberland"
            elseif LA in ["Rhondda", "Rhondda Cynon Taff", "Rhondda Cynon Taff County Borough Council", "Rct", "R.c.t.", "Pontypridd"]
                dcms.Recipient_Org_Local_Authority[i] = "Rhondda Cynon Taf"
            elseif LA in ["Armagh", "Craigavon", "Armagh, Banbridge And Craigavon"]
                dcms.Recipient_Org_Local_Authority[i] = "Armagh City, Banbridge And Craigavon"
            elseif LA in ["Derry", "Derry And Strabane"]
                dcms.Recipient_Org_Local_Authority[i] = "Derry City And Strabane"
            elseif LA in ["Fermanagh", "Omagh"]
                dcms.Recipient_Org_Local_Authority[i] = "Fermanagh And Omagh"
            elseif LA in ["Flint"]
                dcms.Recipient_Org_Local_Authority[i] = "Flintshire"
            elseif LA in ["Suffolk Coastal", "Waveney"]
                dcms.Recipient_Org_Local_Authority[i] = "East Suffolk"
            elseif LA in ["Lisburn", "Castlereagh"]
                dcms.Recipient_Org_Local_Authority[i] = "Lisburn And Castlereagh"
            elseif LA in ["Down", "Newry And Mourne"]
                dcms.Recipient_Org_Local_Authority[i] = "Newry, Mourne And Down"
            elseif LA in ["Eilean Siar", "Comhairle Nan Eilean Siar (Western Isles Council)"]
                dcms.Recipient_Org_Local_Authority[i] = "Na H-Eileanan Siar"
            elseif LA in ["Conwy County Borough Council"]
                dcms.Recipient_Org_Local_Authority[i] = "Conwy"
            elseif LA in ["Birming"]
                dcms.Recipient_Org_Local_Authority[i] = "Birmingham"
            elseif LA in ["Coleraine"]
                dcms.Recipient_Org_Local_Authority[i] = "Causeway Coast And Glens"
            elseif LA in ["Cookstown", "Dungannon"]
                dcms.Recipient_Org_Local_Authority[i] = "Mid Ulster"
            elseif LA in ["Ballymena", "Carrickfergus", "Larne"]
                dcms.Recipient_Org_Local_Authority[i] = "Mid And East Antrim"
            elseif LA in ["Gwynedd Council", "Bangor Music"]
                dcms.Recipient_Org_Local_Authority[i] = "Gwynedd"
            elseif LA in ["Blaenau", "Gwent"]
                dcms.Recipient_Org_Local_Authority[i] = "Blaenau Gwent"
            elseif LA in ["Bournemouth", "Christchurch", "Poole"]
                dcms.Recipient_Org_Local_Authority[i] = "Bournemouth, Christchurch And Poole"
            elseif LA in ["City Of Westminster"]
                dcms.Recipient_Org_Local_Authority[i] = "Westminster"
            elseif LA in ["Mid Glam"]
                dcms.Recipient_Org_Local_Authority[i] = "Mid Glamorgan"
            elseif LA in ["S Glam"]
                dcms.Recipient_Org_Local_Authority[i] = "South Glamorgan"
            elseif LA in ["Monmouth"]
                dcms.Recipient_Org_Local_Authority[i] = "Monmouthshire"
            elseif LA in ["Isle Of Anglesey County Council", "Ynys Mon"]
                dcms.Recipient_Org_Local_Authority[i] = "Isle Of Anglesey"
            elseif LA in ["Pembrokeshire County Council", "Pembrokeshire/Ceredigion", "Sir Benfro"]
                dcms.Recipient_Org_Local_Authority[i] = "Pembrokeshire"
            elseif LA in ["Cardiganshire", "Cered"]
                dcms.Recipient_Org_Local_Authority[i] = "Ceredigion"
            elseif LA in ["Carms", "Sir Gâr"]
                dcms.Recipient_Org_Local_Authority[i] = "Carmarthenshire"
            elseif LA in ["Isle Of Anglesey County Council"]
                dcms.Recipient_Org_Local_Authority[i] = "Isle Of Anglesey"
            elseif LA in ["Stratford On Avon", "Stratford Upon Avon"]
                dcms.Recipient_Org_Local_Authority[i] = "Stratford-On-Avon"
            elseif LA in ["Kingston Upon Hull"]
                dcms.Recipient_Org_Local_Authority[i] = "Kingston Upon Hull, City Of"
            end

            if dcms.Recipient_Org_Local_Authority[i] == "Kent"
                dcms.Recipient_Org_Region[i] = "South East"
            elseif dcms.Recipient_Org_Local_Authority[i] == "London"
                dcms.Recipient_Org_Region[i] = "London"
            elseif dcms.Recipient_Org_Local_Authority[i] == "Mid Glamorgan"
                dcms.Recipient_Org_Region[i] = "Wales"
            elseif dcms.Recipient_Org_Local_Authority[i] == "South Glamorgan"
                dcms.Recipient_Org_Region[i] = "Wales"
            elseif dcms.Recipient_Org_Local_Authority[i] == "Surrey"
                dcms.Recipient_Org_Region[i] = "South East"
            end
            if dcms.Recipient_Org_UK_Constituency[i] == "Ruislip, Northwood and Pinner"
                dcms.Recipient_Org_Region[i] = "London"
            end

        end
    end


    println()
    print("Removing duplicate Constituencies... ")
    @time begin
        for (i, Const) in enumerate(dcms.Recipient_Org_UK_Constituency)
            if Const in ["Ealing Southall"]
                dcms.Recipient_Org_UK_Constituency[i] = "Ealing, Southall"
            elseif Const in ["Enfield Southgate"]
                dcms.Recipient_Org_UK_Constituency[i] = "Enfield, Southgate"
            elseif Const in ["Berwick upon Tweed"]
                dcms.Recipient_Org_UK_Constituency[i] = "Berwick-upon-Tweed"
            elseif Const in ["Liverpool Riverside"]
                dcms.Recipient_Org_UK_Constituency[i] = "Liverpool, Riverside"
            elseif Const in ["Liverpool Walton"]
                dcms.Recipient_Org_UK_Constituency[i] = "Liverpool, Walton"
            elseif Const in ["Liverpool Wavertree"]
                dcms.Recipient_Org_UK_Constituency[i] = "Liverpool, Wavertree"
            elseif Const in ["Manchester Gorton"]
                dcms.Recipient_Org_UK_Constituency[i] = "Manchester, Gorton"
            elseif Const in ["Manchester Withington"]
                dcms.Recipient_Org_UK_Constituency[i] = "Manchester, Withington"
            elseif Const in ["Caithness Sutherland and Easter Ross", "Caithness, Sutherland and Ross"]
                dcms.Recipient_Org_UK_Constituency[i] = "Caithness, Sutherland and Easter Ross"
            elseif Const in ["Dumfriesshire"]
                dcms.Recipient_Org_UK_Constituency[i] = "Dumfriesshire, Clydesdale and Tweeddale"
            elseif Const in ["Dunfermline"]
                dcms.Recipient_Org_UK_Constituency[i] = "Dunfermline and West Fife"
            elseif Const in ["Inverness and Nairn"]
                dcms.Recipient_Org_UK_Constituency[i] = "Inverness, Nairn, Badenoch and Strathspey"
            elseif Const in ["Linlithgow"]
                dcms.Recipient_Org_UK_Constituency[i] = "Linlithgow and East Falkirk"
            elseif Const in ["Ross Skye and Lochaber", "Skye, Lochaber and Badenoch"]
                dcms.Recipient_Org_UK_Constituency[i] = "Ross, Skye and Lochaber"
            elseif Const in ["Paisley"]
                dcms.Recipient_Org_UK_Constituency[i] = "Paisley and Renfrewshire North"
            elseif Const in ["Falkirk West"]
                dcms.Recipient_Org_UK_Constituency[i] = "Falkirk"
            elseif Const in ["Kilmarnock and Irvine Valley"]
                dcms.Recipient_Org_UK_Constituency[i] = "Kilmarnock and Loudoun"
            elseif Const in ["Brighton Kemptown"]
                dcms.Recipient_Org_UK_Constituency[i] = "Brighton, Kemptown"
            elseif Const in ["Brighton Pavilion"]
                dcms.Recipient_Org_UK_Constituency[i] = "Brighton, Pavilion"
            elseif Const in ["Southampton Itchen"]
                dcms.Recipient_Org_UK_Constituency[i] = "Southampton, Itchen"
            elseif Const in ["Ynys Mon", "Ynys MÒ´n", "Ynys M��n", "Ynys MÃ´n", "Ynys MÃƒÂ´n"]
                dcms.Recipient_Org_UK_Constituency[i] = "Ynys Môn"
            elseif Const in ["Birmingham Edgbaston"]
                dcms.Recipient_Org_UK_Constituency[i] = "Birmingham, Edgbaston"
            elseif Const in ["Birmingham Hall Green"]
                dcms.Recipient_Org_UK_Constituency[i] = "Birmingham, Hall Green"
            elseif Const in ["Birmingham Ladywood"]
                dcms.Recipient_Org_UK_Constituency[i] = "Birmingham, Ladywood"
            elseif Const in ["Birmingham Perry Barr"]
                dcms.Recipient_Org_UK_Constituency[i] = "Birmingham, Perry Barr"
            elseif Const in ["Birmingham Selly Oak"]
                dcms.Recipient_Org_UK_Constituency[i] = "Birmingham, Selly Oak"
            elseif Const in ["Sheffield Hallam"]
                dcms.Recipient_Org_UK_Constituency[i] = "Sheffield, Hallam"
            elseif Const in ["Sheffield Heeley"]
                dcms.Recipient_Org_UK_Constituency[i] = "Sheffield, Heeley"
            elseif Const in ["Lewisham Deptford"]
                dcms.Recipient_Org_UK_Constituency[i] = "Lewisham, Deptford"
            end
        end
    end


    println()
    print("Removing duplicate Wards... ")
    @time begin
        dcms.Recipient_Org_Ward .= replace.(dcms.Recipient_Org_Ward, "St." => "St", "&" => "and", "-" => " ",)
        for (i, Ward) in enumerate(dcms.Recipient_Org_Ward)
            if Ward in ["Unknown", "NULL", "FAILED"]
                dcms.Recipient_Org_Ward[i] = "Not Derived"
            end
        end
    end
    println()
    print("Removing \"Not Derived\" where possible ... ")
    @time begin

        sort!(dcms, :Recipient_Org_Ward)
        for i = 2:nrow(dcms)
            if dcms.Recipient_Org_Ward[i] == "Not Derived" || dcms.Recipient_Org_Ward[i-1] == "Not Derived"
            else
                if dcms.Recipient_Org_Local_Authority[i] == "Not Derived" && dcms.Recipient_Org_Ward[i] == dcms.Recipient_Org_Ward[i-1]
                    dcms.Recipient_Org_Local_Authority[i] = dcms.Recipient_Org_Local_Authority[i-1]
                end
            end
        end
        sort!(dcms, :Recipient_Org_Local_Authority)
        for i = 2:nrow(dcms)
            if dcms.Recipient_Org_Local_Authority[i] == "Not Derived" || dcms.Recipient_Org_Local_Authority[i-1] == "Not Derived"
            elseif dcms.Recipient_Org_Region[i] == "Not Derived" && dcms.Recipient_Org_Local_Authority[i] == dcms.Recipient_Org_Local_Authority[i-1]
                dcms.Recipient_Org_Region[i] = dcms.Recipient_Org_Region[i-1]
            end
        end

        sort!(dcms, :Recipient_Org_UK_Constituency)
        for i in findall(l -> l == "Not Derived", dcms.Recipient_Org_Region)
            j = 0
            if dcms.Recipient_Org_UK_Constituency[i] !== "Not Derived"
                while dcms.Recipient_Org_Region[i] == "Not Derived" && (dcms.Recipient_Org_UK_Constituency[i] == dcms.Recipient_Org_UK_Constituency[i+j] || dcms.Recipient_Org_UK_Constituency[i] == dcms.Recipient_Org_UK_Constituency[i-j])
                    j += 1
                    if dcms.Recipient_Org_Region[i+j] !== "Not Derived" && dcms.Recipient_Org_UK_Constituency[i+j] == dcms.Recipient_Org_UK_Constituency[i]
                        dcms.Recipient_Org_Region[i] = dcms.Recipient_Org_Region[i+j]
                    elseif dcms.Recipient_Org_Region[i-j] !== "Not Derived" && dcms.Recipient_Org_UK_Constituency[i-j] == dcms.Recipient_Org_UK_Constituency[i]
                        dcms.Recipient_Org_Region[i] = dcms.Recipient_Org_Region[i-j]
                    end
                end
            end
        end
    end
end

Having made this update, the ability to roundtrip the dataframe now depends on it’s source. The data from the directly downloaded file will roundtrip, but the file from HTTP get won’t. More strangely, merely sorting the HTTP dataframe somehow allows it to make the round trip.

Reading fixeddirect.csv succeeded!
Reading sortedfixedHTTPget.csv succeeded!
┌ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 210003. Filling remaining columns with `missing`
└ @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:586
┌ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 210003. Filling remaining columns with `missing`
└ @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:586
┌ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 210003. Filling remaining columns with `missing`
└ @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:586
┌ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 210003. Filling remaining columns with `missing`
└ @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:586
ERROR: LoadError: TaskFailedException

    nested task error: CSV.Error("thread = 2 fatal error, encountered an invalidly quoted field while parsing around row = 175539, col = 3: \"\"I will undertake a research trip hosted by Michele Bryd-McPhee curator of ‘Ladies of Hip-Hop Festival’ in New York City in March and July 2018 with 3 fundamental areas of enquiry; \n\", error=INVALID: OK | QUOTED | EOF | INVALID_QUOTED_FIELD , check your `quotechar` arguments or manually fix the field in the file itself")
    Stacktrace:
     [1] fatalerror(buf::Vector{UInt8}, pos::Int64, len::Int64, code::Int16, row::Int64, col::Int64)
       @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:590
     [2] parsevalue!(::Type{String}, buf::Vector{UInt8}, pos::Int64, len::Int64, row::Int64, rowoffset::Int64, i::Int64, col::CSV.Column, ctx::CSV.Context)
       @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:798
     [3] parserow
       @ C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:640 [inlined]
     [4] parsefilechunk!(ctx::CSV.Context, pos::Int64, len::Int64, rowsguess::Int64, rowoffset::Int64, columns::Vector{CSV.Column}, ::Type{Tuple{}})
       @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:550
     [5] multithreadparse(ctx::CSV.Context, pertaskcolumns::Vector{Vector{CSV.Column}}, rowchunkguess::Int64, i::Int64, rows::Vector{Int64}, wholecolumnslock::ReentrantLock)
       @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:360
     [6] (::CSV.var"#34#39"{CSV.Context, Vector{Vector{CSV.Column}}, Int64, Int64, Vector{Int64}, ReentrantLock})()
       @ CSV C:\Users\TGebbels\.julia\packages\WorkerUtilities\ey0fP\src\WorkerUtilities.jl:384
Stacktrace:
  [1] sync_end(c::Channel{Any})
    @ Base .\task.jl:448
  [2] macro expansion
    @ .\task.jl:480 [inlined]
  [3] CSV.File(ctx::CSV.Context, chunking::Bool)
    @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:240
  [4] File
    @ C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:227 [inlined]
  [5] #File#32
    @ C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:223 [inlined]
  [6] CSV.File(source::String)
    @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\file.jl:162
  [7] read(source::String, sink::Type; copycols::Bool, kwargs::@Kwargs{})
    @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\CSV.jl:117
  [8] read(source::String, sink::Type)
    @ CSV C:\Users\TGebbels\.julia\packages\CSV\cwX2w\src\CSV.jl:113
  [9] top-level scope
    @ c:\Users\TGebbels...\Documents\DCMS Database\CompareCsv.jl:361
 [10] include(fname::String)
    @ Base.MainInclude .\client.jl:489
 [11] run(debug_session::VSCodeDebugger.DebugAdapter.DebugSession, error_handler::VSCodeDebugger.var"#3#4"{String})
    @ VSCodeDebugger.DebugAdapter c:\Users\TGebbels\.vscode\extensions\julialang.language-julia-1.105.2\scripts\packages\DebugAdapter\src\packagedef.jl:126
 [12] startdebugger()
    @ VSCodeDebugger c:\Users\TGebbels\.vscode\extensions\julialang.language-julia-1.105.2\scripts\packages\VSCodeDebugger\src\VSCodeDebugger.jl:45
 [13] top-level scope
    @ c:\Users\TGebbels\.vscode\extensions\julialang.language-julia-1.105.2\scripts\debugger\run_debugger.jl:12
 [14] include(mod::Module, _path::String)
    @ Base .\Base.jl:495
 [15] exec_options(opts::Base.JLOptions)
    @ Base .\client.jl:318
 [16] _start()
    @ Base .\client.jl:552
in expression starting at c:\Users\TGebbels\...\Documents\DCMS Database\CompareCsv.jl:361

Furthermore, if I make one of a number of small changes to fixdcms! the problem goes away. For example, if I comment out the very final conditional element in the function:

#                   dcms.Recipient_Org_Region[i] = dcms.Recipient_Org_Region[i-j]

Then the resultant dataset will roundtrip just fine.
Or, I can leave that line in, but comment out the elseif blocks that assign an LA name that includes a comma like this one, for example:

#         elseif LA in ["Armagh", "Craigavon", "Armagh, Banbridge And Craigavon"]
 #            dcms.Recipient_Org_Local_Authority[i] = "Armagh City, Banbridge And Craigavon"

Again, the resultant dataset will round-trip just fine. Either of these works even though the other is left in the code. I don’t need to take them both out. Just removing one is sufficient.

I understand that the underlying issue is almost certainly in the difference between the two files. I’m not sure how to pin this down, though.

But my question is really about how this difference gets read into two ostensibly identical dataframes, processed and written back out again without a squeak of complaint, but one results in a csv file that CSV.read cannot read back and yet a simple sort fixes the issue.

dmbates · August 18, 2024, 3:21pm

I think you may be running into character-encoding issues. When you read a large data file it is always best to check the extrema of the columns, which are the most likely locations of problematic cases. I downloaded the grants.csv file from the URL you gave and summarized it

julia> df = CSV.read("/Users/dmbates/Downloads/grants.csv", DataFrame; normalizenames=true, downcast=true);

julia> describe(df, :min, :mean, :max, :nunique, :eltype)
16×6 DataFrame
 Row │ variable                       min                                mean     max                                nunique  eltype   
     │ Symbol                         Any                                Union…   Any                                Union…   DataType 
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Identifier                     000406                                      YH201117526                        701580   String
   2 │ Title                          ! Remember ! 1914: It was Their …           ☑ Other                            416193   String
   3 │ Description                    ! (Hollywood Blues)                         \uf0fc     Musicians, Dancers, S…  595850   String
   4 │ Currency                       GBP                                         GBP                                1        String3
   5 │ Amount_Awarded                 1                                  62331.6  600000000                                   Int32
   6 │ Award_Date                     1994-09-28                                  2024-07-16                         8547     Date
   7 │ Recipient_Org_Identifier       10025                                       tnlcomfund-org-ÉireÓgAnCharraigM…  347044   String
   8 │ Recipient_Org_Name             !Vamos! Festival Ltd                        ÒJÁ GROUP CIC                      295421   String
   9 │ Recipient_Org_Ward             Aaran                                       london                             9462     String
  10 │ Recipient_Org_UK_Constituency  Aberavon                                    York Outer                         686      String
  11 │ Recipient_Org_Local_Authority  0                                           York                               498      String
  12 │ Recipient_Org_Region           East Midlands                               Yorkshire and The Humber           15       String31
  13 │ Funding_Org_Identifier         GB-CHC-1034245                              GB-GOR-PC390                       12       String15
  14 │ Funding_Org_Name               Arts Council England                        UK Sport                           20       String
  15 │ Good_Cause_Area                Arts                                        Sport                              8        String
  16 │ Last_Modified                  2021-12-09                                  2024-08-15                         123      Date

Notice the peculiar extrema for the Title and Description columns.

TimG · August 21, 2024, 10:58am

Yes, I recognise the data you highlight.

What I don’t understand is why only one dataframe results in an unreadable CSV file when both dataframes ostensibly contain identically the same data.

I cannot identify any differences between the two dataframes, as my code demonstrates, and they both contain identically the same extrema as you have highlighted.

But only one dataframe causes a problem.

mrufsvold · August 21, 2024, 11:30am

Could it have something to do with Julia transforming the characters?

https://docs.julialang.org/en/v1/stdlib/Unicode/#Unicode.julia_chartransform

nalimilan · August 31, 2024, 7:20pm

Can you have a look at row 210003 (as reported in the error message) and possibly post it here? How does it differ between the two files?

I’m surprised it’s possible to get a different behavior when calling the same code on data frames that are considered equal. I wonder whether this may reflect a bug in CSV.jl. Can you try replacing CSV.read(..., DataFrame) with DataFrame(CSV.File(..., ntasks=1))?

TimG · September 1, 2024, 8:12am

I tried CSV.File with ntasks=1 and the problematic file read without error. If I don’t limit ntasks, however, I get the same error as before.

This is line 210003 as cut and pasted from VSCode:

360G-CreativeScotland-7824,Grant to University of Dundee,"Towards the commissioning and exhibition at the Cooper Gallery of 'Jerusalem Syndrome', a new project by Nathan Coley",GBP,10888,2003-11-03,360G-CreativeScotland-University-of-Dundee,University of Dundee,West End,Dundee West,Dundee City,Scotland,GB-GOR-OT964,Scottish Arts Council,Arts,2022-05-10

The other file is ordered differently, so line 210003 relates to a different grant entirely. The same grant is at line 191699 and looks like this:

360G-CreativeScotland-7824,Grant to University of Dundee,"Towards the commissioning and exhibition at the Cooper Gallery of 'Jerusalem Syndrome', a new project by Nathan Coley",GBP,10888,2003-11-03,360G-CreativeScotland-University-of-Dundee,University of Dundee,West End,Dundee West,Dundee City,Scotland,GB-GOR-OT964,Scottish Arts Council,Arts,2022-05-10

I tried to investigate the difference in size of the original downloaded files which is equivalent to exactly one byte per line. I basically did a frequency count of characters. It looks to me like the file downloaded from the link has correct line termination (CRLF) whereas the HTTP file only has LF characters at the end of each line. This accounts for the size discrepancy.

Both the sorted and the unsorted files from the HTTP.get request have this issue, but the sorted one loads. This suggests the problem arises only if there is an unlucky juxtaposition of two rows in the file which causes problems with threaded activity in CSV.File. This impression is reinforced because, if I download a new file today (which has additional rows in it), this issue doesn’t arise.

If this is the fundamental cause, I’m still puzzled how the issue propagates through a DataFrame.

nalimilan · September 2, 2024, 7:52am

Interesting. Can you file an issue against CSV.jl on GitHub? There’s probably a bug when the cut point to attribute parts of the file to tasks is in a particular position.

As you say it’s still intriguing that the difference subsists after reading and writing a data frame. Why isn’t the order the same in the two files? I thought data frames were identical before writing them.

BTW, both LF and CRLF are correct, the former is just the Linux convention while the other is the Windows convention.

TimG · September 2, 2024, 8:07am

Cannot round-trip a file (read, write, read) in some circumstances · Issue #1140 · JuliaData/CSV.jl (github.com)

I thought only CRLF was a valid line ending based on this document, but maybe that’s wrong. The document does acknowledge the existence of many different informal conventions.

Why isn’t the order the same in the two files? I thought data frames were identical before writing them.

I don’t know. that’s just how they come. I sorted the dataframes before comparing them and then they were identical.

nalimilan · September 2, 2024, 8:50am

Ah OK. This RFC isn’t really respected, and AFAIK most software supports both LF and CRLF.

This is another mystery. If you open the original CSV files in a text editor, are rows in the same order?

TimG · September 2, 2024, 9:06am

No, they are definitely in a different order when they arrive.

nalimilan · September 2, 2024, 9:38am

OK, then this is probably just a difference in how the website serves them. And one order triggers the bug in CSV.jl but not the other, even if both files have the same (valid) content in a different order.

TimG · September 2, 2024, 10:17am

Yes. And this also implies that there is actually nothing mysterious going on with DataFrames. It is just the row order tripping up the threading in CSV.jl somehow.

I think this is the answer to the question I posed in the OP. Thank you very much @nalimilan for getting me here!

Topic		Replies	Views
CSV won't read tab separated file General Usage csv	23	650	March 4, 2024
CSV.jl writing quoted strings General Usage question , csv	14	95	December 19, 2024
Using CSV.read() to import data from a data input file into a DataFrame General Usage question , dataframes , csv	27	6762	March 1, 2022
Read file with CSV.read New to Julia	8	19791	September 9, 2019
CSV Reading (rewrite in C?) Internals & Design	50	5068	October 1, 2018

Curious CSV round-trip problem

Related topics