CSV won't read tab separated file

I have a large text file that describes itself as

--
-- PostgreSQL database dump
--

-- Dumped from database version 12.16
-- Dumped by pg_dump version 12.16

It contains several blocks of tab-separated data which represent the data from each database table. I’ve copied one of these blocks (using the VSCode editor) into a stand-alone file. This contains 15 tab-separated columns and 761,065 rows. First row is a header. I cannot coax CSV into reading this file.

using CSV, DataFrames, StringEncodings

entries = ["id", "identifier", "title", "description", "currency", "amount_awarded", "award_date", "last_modified", "funding_org_id", "good_cause_area_id", "local_authority_id", "recipient_org_id", "region_id", "uk_constituency_id", "ward_id"]
    # Names of DF columns

df1 = DataFrame([name => String[] for name in entries])
for line in eachline("newex.csv")
    list = findall(r"\t", line)
    l = length(list)
    if l !== 14  # Check that all lines have exactly 14 tab separators
        println(l)
    end
    row = String[]
    append!(row, [chop(line[1:first(list[1])], head=0, tail=1)], [chop(line[first(list[x-1]):first(list[x])], head=1, tail=1) for x = 2:l], [chop(line[first(list[l]):end], head=1, tail=0)])
    push!(df1, row)
end
deleteat!(df1, 1) # first row of datafile contains headers.
println(describe(df1))

#df2 = CSV.File(open("newex.csv", enc"ISO-8859-1")) |> DataFrame
df2 = CSV.read("newex.csv", DataFrame)
println(describe(df2))

My β€œbrute force and ignorance” method produces a viable dataframe, but CSV.read fails. I’ve tried specifying delim="\t" and have experimented with StringEncodings but nothing I’ve tried makes any difference.
Obviously, I’ve managed to read the datafile, so this post is out of curiosity more than anything.

Here is the output of running the above script.

15Γ—7 DataFrame
 Row β”‚ variable            mean     min                            median   max                                nmissing  eltype   
     β”‚ Symbol              Nothing  String                         Nothing  String                             Int64     DataType 
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 β”‚ id                           100                                     999999                                    0  String
   2 β”‚ identifier                   000406                                  YH201117526                               0  String
   3 β”‚ title                        " " (working title)                     β˜‘ Other                                   0  String
   4 β”‚ description                  ! (Hollywood Blues)                     \uf0fc     Musicians, Dancers, S…         0  String
   5 β”‚ currency                     GBP                                     GBP                                       0  String
   6 β”‚ amount_awarded               0                                       999999                                    0  String
   7 β”‚ award_date                   1995-01-06                              2023-12-20                                0  String
   8 β”‚ last_modified                2021-12-09 12:53:26.999321+00           2024-01-11 13:07:00.623853+00             0  String
   9 β”‚ funding_org_id               1                                       9                                         0  String
  10 β”‚ good_cause_area_id           1                                       8                                         0  String
  11 β”‚ local_authority_id           1                                       \\N                                       0  String
  12 β”‚ recipient_org_id             1                                       99999                                     0  String
  13 β”‚ region_id                    1                                       \\N                                       0  String
  14 β”‚ uk_constituency_id           1                                       \\N                                       0  String
  15 β”‚ ward_id 

Then I get literally thousands of warnings from CSV as it tries to read each line of the file:

β”Œ Warning: thread = 3 warning: only found 15 / 16 columns around data row: 548111. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 3 warning: only found 15 / 16 columns around data row: 548112. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 3 warning: only found 15 / 16 columns around data row: 548113. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 3 warning: only found 15 / 16 columns around data row: 548114. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 3 warning: only found 15 / 16 columns around data row: 548115. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 3 warning: only found 15 / 16 columns around data row: 548116. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 3 warning: only found 15 / 16 columns around data row: 548117. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 3 warning: only found 15 / 16 columns around data row: 548118. Filling remaining columns with `missing`
β”” @ CSV C:\U
...
β”Œ Warning: thread = 3: too many warnings, silencing any further warnings
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:591
β”Œ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 195709. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 4 warning: only found 15 / 16 columns around data row: 195709. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 195710. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 195710. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 195711. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 195711. Filling remaining columns with `missing`
...
β”Œ Warning: thread = 1: too many warnings, silencing any further warnings
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:591
β”Œ Warning: thread = 1: too many warnings, silencing any further warnings
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:591
β”Œ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 601498. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 601499. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 601500. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 601501. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 601502. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 1 warning: only found 15 / 16 columns around data row: 601503. Filling remaining columns with `missing`
...
β”Œ Warning: thread = 1: too many warnings, silencing any further warnings
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:591
β”Œ Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601514. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601515. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601516. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601517. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601518. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601519. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601520. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601521. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
...
β”Œ Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601595. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601596. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601597. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601598. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 4: too many warnings, silencing any further warnings
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:591

And then the crash.

ERROR: LoadError: UndefRefError: access to undefined reference
Stacktrace:
  [1] getindex
    @ .\essentials.jl:13 [inlined]
  [2] iterate
    @ .\array.jl:945 [inlined]
  [3] _foldl_impl
    @ .\reduce.jl:60 [inlined]
  [4] foldl_impl
    @ .\reduce.jl:48 [inlined]
  [5] mapfoldl_impl
    @ .\reduce.jl:44 [inlined]
  [6] _mapreduce_dim
    @ .\reducedim.jl:362 [inlined]
  [7] mapreduce
    @ .\reducedim.jl:357 [inlined]
  [8] _sum
    @ .\reducedim.jl:1015 [inlined]
  [9] sum
    @ .\reducedim.jl:1011 [inlined]
 [10] _simple_count
    @ .\reduce.jl:1357 [inlined]
 [11] _count
    @ .\reducedim.jl:442 [inlined]
 [12] count
    @ .\reducedim.jl:440 [inlined]
 [13] #43
    @ .\none:0 [inlined]
 [14] MappingRF
    @ .\reduce.jl:100 [inlined]
 [15] _foldl_impl(op::Base.MappingRF{SentinelArrays.var"#43#44"{typeof(ismissing)}, Base.BottomRF{typeof(Base.add_sum)}}, init::Base._InitialValue, itr::Vector{Vector{String}})      
    @ Base .\reduce.jl:58
 [16] foldl_impl
    @ .\reduce.jl:48 [inlined]
 [17] mapfoldl_impl
    @ .\reduce.jl:44 [inlined]
 [18] mapfoldl
    @ .\reduce.jl:175 [inlined]
 [19] mapreduce
    @ .\reduce.jl:307 [inlined]
 [20] sum
    @ .\reduce.jl:535 [inlined]
 [21] sum
    @ .\reduce.jl:564 [inlined]
 [22] count
    @ C:\Users\TGebbels\.julia\packages\SentinelArrays\1kRo4\src\chainedvector.jl:787 [inlined]
 [23] (::DataFrames.var"#76#84"{Vector{Any}})(col::SentinelArrays.ChainedVector{String, Vector{String}})
    @ DataFrames C:\Users\TGebbels\.julia\packages\DataFrames\58MUJ\src\abstractdataframe\abstractdataframe.jl:723
 [24] iterate
    @ .\generator.jl:47 [inlined]
 [25] collect_to!(dest::Vector{Dict{Symbol, Any}}, itr::Base.Generator{DataFrames.DataFrameColumns{DataFrame}, DataFrames.var"#76#84"{Vector{Any}}}, offs::Int64, st::Int64)
    @ Base .\array.jl:892
 [26] collect_to_with_first!
    @ .\array.jl:870 [inlined]
 [27] collect(itr::Base.Generator{DataFrames.DataFrameColumns{DataFrame}, DataFrames.var"#76#84"{Vector{Any}}})
    @ Base .\array.jl:844
 [28] map
    @ .\abstractarray.jl:3313 [inlined]
 [29] _describe(df::DataFrame, stats::Vector{Any})
    @ DataFrames C:\Users\TGebbels\.julia\packages\DataFrames\58MUJ\src\abstractdataframe\abstractdataframe.jl:712
 [30] describe(df::DataFrame; cols::Function)
    @ DataFrames C:\Users\TGebbels\.julia\packages\DataFrames\58MUJ\src\abstractdataframe\abstractdataframe.jl:668
 [31] top-level scope
    @ c:\Users\TGebbels\OneDrive - xxxxxxxx\Documents\DCMS Database\TestCSVread.jl:37

I don’t have a mechanism to share the datafile but I would if I could.
Here are the first few lines:

id	identifier	title	description	currency	amount_awarded	award_date	last_modified	funding_org_id	good_cause_area_id	local_authority_id	recipient_org_id	region_id	uk_constituency_id	ward_id
99402	DCMS-tnlcomfund-0023093253	Activity Adventure Trai	The organisation will use the funding to purchase and construct an Activity Adventure Trail. This will enable the organisation to provide a safe outdoor recreational facility which the area lacks currently to tackle the identified risk of isolation and anti social behaviour amongst the local children.	GBP	0	2015-07-30	2022-06-07 13:24:35.117395+00	5	8	306	63764	3	317	6032
99403	DCMS-tnlcomfund-0023093271	The SPHERE Project	The organisation will use the funding to run a series of activity sessions focussing on physical mental and sexual health. This will enable the organisation to tackle the ongoing health and substance misuse issues that arise as a result of deprivation in the area.	GBP	0	2015-07-30	2022-06-07 13:24:35.129602+00	5	8	76	46015	3	91	11
101347	DCMS-tnlcomfund-0023093273	Crookston Environmental and Outdoor Project	Crookston Environmental and Outdoor Project	GBP	0	2015-08-12	2022-06-07 13:24:35.135671+00	5	8	205	64760	1	545	3978
97860	DCMS-tnlcomfund-0023093278	A4A - Paulsgrove and Wymering Connect	The organisation will use the funding to work with over 40 local agencies to create an online directory giving residents information about local activities and services. This will enable the group to provide easy access to all local residents with updated programmes of activities and information about accessing these services to improve wellbeing and community engagement in the area.	GBP	0	2015-07-16	2022-06-07 13:24:35.141859+00	5	8	141	62886	9	391	767
112863	DCMS-tnlcomfund-0023207672	The Engage Programme	The project will provide practical life skills and training to teenage boys who lack positive male role models in their lives. This will improve the young people’s self-esteem resilience and motivation.	GBP	0	2017-10-12	2022-06-07 15:10:41.356337+00	5	8	366	68266	7	602	5357

I ran the script with newex.csv being the first few lines you provided.

The result is

15Γ—7 DataFrame
 Row β”‚ variable            mean      min                                median      max                                nmissing  eltype
     β”‚ Symbol              Union…    Any                                Any         Any                                Int64     DataType
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 β”‚ id                  102175.0  97860                              99403.0     112863                                    0  Int64
   2 β”‚ identifier                    DCMS-tnlcomfund-0023093253                     DCMS-tnlcomfund-0023207672                0  String31
   3 β”‚ title                         A4A - Paulsgrove and Wymering Co…              The SPHERE Project                        0  String
   4 β”‚ description                   Crookston Environmental and Outd…              The project will provide practic…         0  String
   5 β”‚ currency                      GBP                                            GBP                                       0  String3
   6 β”‚ amount_awarded      0.0       0                                  0.0         0                                         0  Int64
   7 β”‚ award_date                    2015-07-16                         2015-07-30  2017-10-12                                0  Date
   8 β”‚ last_modified                 2022-06-07 13:24:35.117395+00                  2022-06-07 15:10:41.356337+00             0  String31
   9 β”‚ funding_org_id      5.0       5                                  5.0         5                                         0  Int64
  10 β”‚ good_cause_area_id  8.0       8                                  8.0         8                                         0  Int64
  11 β”‚ local_authority_id  218.8     76                                 205.0       366                                       0  Int64
  12 β”‚ recipient_org_id    61138.2   46015                              63764.0     68266                                     0  Int64
  13 β”‚ region_id           4.6       1                                  3.0         9                                         0  Int64
  14 β”‚ uk_constituency_id  389.2     91                                 391.0       602                                       0  Int64
  15 β”‚ ward_id             3229.0    11                                 3978.0      6032                                      0  Int64

Maybe the error was due to corrupted data in the rest of the file?


Here is the result of versioninfo()

Julia Version 1.10.1
Commit 7790d6f0641 (2024-02-13 20:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 10 Γ— Apple M2 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 1 default, 0 interactive, 1 GC (on 6 virtual cores)

My simplistic approach can successfully read the whole file.
Some of these fields contain β€œ\\N” to indicate missing values. (eg Local Authority, Region and Constituency).

I process these lines like this:

    df3.local_authority_id = [row.local_authority_id == "\\N" ? missing : parse(Int64, row.local_authority_id) for row in eachrow(df1)]
    df3.region_id = [row.region_id == "\\N" ? missing : parse(Int64, row.region_id) for row in eachrow(df1)]
    df3.uk_constituency_id = [row.uk_constituency_id == "\\N" ? missing : parse(Int64, row.uk_constituency_id) for row in eachrow(df1)]
    df3.ward_id = [row.ward_id == "\\N" ? missing : parse(Int64, row.ward_id) for row in eachrow(df1)]

Again, perhaps, not elegant, but effective.

Might it be these β€œ\\N” values that are confusing CSV?

That wouldn’t surprise me, have you tried telling CSV about them?

    β€’  missingstring: either a nothing, String, or Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default,
       missingstring="", which means only an empty field (two consecutive delimiters) is considered missing

So, I tried

df2 = CSV.read("newex.csv", missingstring="\\N", DataFrame)

and

df2 = CSV.read("newex.csv", DataFrame; missingstring=["\\N", ""])

but still got the same warnings and errors as above.

If I try

df2 = CSV.read("newex.csv", missingstring="\N", DataFrame)

as suggested here, I get

df2 = CSV.read("newex.csv", missingstring="\N", DataFrame)
#                                          β””β”˜ ── invalid escape sequence
Stacktrace:
 [1] top-level scope
   @ c:\Users\TGebbels\OneDrive - xxxxxxxxxxx\Documents\DCMS Database\TestCSVread.jl:36
in expression starting at c:\Users\TGebbels\OneDrive - xxxxxxxxxx\Documents\DCMS Database\TestCSVread.jl:36

If there are actually two backslashes in the raw data you need missingstring="\\\\N"

I don’t think so, because my method above worked with "\\N".

Just to be sure, I tried

df2 = CSV.read("newex.csv", DataFrame; missingstring=["\\\\N", ""])

but the results were the same.

I edited your file as follows:

image

Then:

julia> CSV.read(f, DataFrame)
5Γ—15 DataFrame
 Row β”‚ id      identifier                  title                              description                        currency  amount_awarded  award_date  last_modified                  funding_org_id  good_cause_ β‹―
     β”‚ Int64   String31                    String                             String                             String3   Int64           Dates.Date  String31                       Int64           Int64       β‹―
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 β”‚  99402  DCMS-tnlcomfund-0023093253  Activity Adventure Trai            The organisation will use the fu…  GBP                    0  2015-07-30  2022-06-07 13:24:35.117395+00               5              β‹―
   2 β”‚  99403  DCMS-tnlcomfund-0023093271  \\\\N                              The organisation will use the fu…  GBP                    0  2015-07-30  2022-06-07 13:24:35.129602+00               5
   3 β”‚ 101347  DCMS-tnlcomfund-0023093273  Crookston Environmental and Outd…  Crookston Environmental and Outd…  GBP                    0  2015-08-12  2022-06-07 13:24:35.135671+00               5
   4 β”‚  97860  DCMS-tnlcomfund-0023093278  A4A - Paulsgrove and Wymering Co…  The organisation will use the fu…  GBP                    0  2015-07-16  2022-06-07 13:24:35.141859+00               5
   5 β”‚ 112863  DCMS-tnlcomfund-0023207672  The Engage Programme               The project will provide practic…  GBP                    0  2017-10-12  2022-06-07 15:10:41.356337+00               5              β‹―
                                                                                                                                                                                                  6 columns omitted

julia> CSV.read(f, DataFrame; missingstring = "\\N")
5Γ—15 DataFrame
 Row β”‚ id      identifier                  title                              description                        currency  amount_awarded  award_date  last_modified                  funding_org_id  good_cause_ β‹―
     β”‚ Int64   String31                    String                             String                             String3   Int64           Dates.Date  String31                       Int64           Int64       β‹―
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 β”‚  99402  DCMS-tnlcomfund-0023093253  Activity Adventure Trai            The organisation will use the fu…  GBP                    0  2015-07-30  2022-06-07 13:24:35.117395+00               5              β‹―
   2 β”‚  99403  DCMS-tnlcomfund-0023093271  \\\\N                              The organisation will use the fu…  GBP                    0  2015-07-30  2022-06-07 13:24:35.129602+00               5
   3 β”‚ 101347  DCMS-tnlcomfund-0023093273  Crookston Environmental and Outd…  Crookston Environmental and Outd…  GBP                    0  2015-08-12  2022-06-07 13:24:35.135671+00               5
   4 β”‚  97860  DCMS-tnlcomfund-0023093278  A4A - Paulsgrove and Wymering Co…  The organisation will use the fu…  GBP                    0  2015-07-16  2022-06-07 13:24:35.141859+00               5
   5 β”‚ 112863  DCMS-tnlcomfund-0023207672  The Engage Programme               The project will provide practic…  GBP                    0  2017-10-12  2022-06-07 15:10:41.356337+00               5              β‹―
                                                                                                                                                                                                  6 columns omitted

julia> CSV.read(f, DataFrame; missingstring = "\\\\N")
5Γ—15 DataFrame
 Row β”‚ id      identifier                  title                              description                        currency  amount_awarded  award_date  last_modified                  funding_org_id  good_cause_ β‹―
     β”‚ Int64   String31                    String?                            String                             String3   Int64           Dates.Date  String31                       Int64           Int64       β‹―
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 β”‚  99402  DCMS-tnlcomfund-0023093253  Activity Adventure Trai            The organisation will use the fu…  GBP                    0  2015-07-30  2022-06-07 13:24:35.117395+00               5              β‹―
   2 β”‚  99403  DCMS-tnlcomfund-0023093271  missing                            The organisation will use the fu…  GBP                    0  2015-07-30  2022-06-07 13:24:35.129602+00               5
   3 β”‚ 101347  DCMS-tnlcomfund-0023093273  Crookston Environmental and Outd…  Crookston Environmental and Outd…  GBP                    0  2015-08-12  2022-06-07 13:24:35.135671+00               5
   4 β”‚  97860  DCMS-tnlcomfund-0023093278  A4A - Paulsgrove and Wymering Co…  The organisation will use the fu…  GBP                    0  2015-07-16  2022-06-07 13:24:35.141859+00               5
   5 β”‚ 112863  DCMS-tnlcomfund-0023207672  The Engage Programme               The project will provide practic…  GBP                    0  2017-10-12  2022-06-07 15:10:41.356337+00               5              β‹―
                                                                                                                                                                                                  6 columns omitted

Unless you want to share the whole file there’s little we can do to help you. Given the first few lines of your file work you should probably just read the first n lines, increasing n until you hit the error to work out where your file is corrupted.

$DEITY help me for saying this. pg_dump creates a file that can be read by another Postgres database. Could you create a local Postgres database and read this file in?
Then us Julia to extract the values from the database?

I don’t have Postgres. The file was produced by a 3rd party.
I have read the file. I’m just wondering why CSV can’t.

I’d like to share the file but am unsure of a mechanism.
Tim

It looks like line 334 is where the problem first arises:

println("limit=333")
CSV.read("newex.csv", DataFrame; limit=333)#; missingstring="\\\\N")
println("limit=334")
CSV.read("newex.csv", DataFrame; limit=334)#; missingstring="\\\\N")

This produces:

limit=333
limit=334
β”Œ Warning: thread = 4 warning: only found 8 / 15 columns around data row: 496. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 1 warning: only found 8 / 15 columns around data row: 515. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 4 warning: only found 8 / 15 columns around data row: 515. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β”Œ Warning: thread = 1 warning: only found 8 / 15 columns around data row: 515. Filling remaining columns with `missing`
β”” @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
ERROR: LoadError: TaskFailedException

    nested task error: BoundsError: attempt to access 334-element Vector{UInt32} at index [335]
    Stacktrace:
     [1] setindex!
       @ .\array.jl:1021 [inlined]
     [2] checkpooled!(::Type{Union{Missing, String3}}, pertaskcolumns::Vector{Vector{CSV.Column}}, col::CSV.Column, j::Int64, ntasks::Int64, nrows::Int64, ctx::CSV.Context)
       @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:514
     [3] multithreadpostparse(ctx::CSV.Context, ntasks::Int64, pertaskcolumns::Vector{Vector{CSV.Column}}, rows::Vector{Int64}, finalrows::Int64, j::Int64, col::CSV.Column)
       @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:433
     [4] (::CSV.var"#35#40"{CSV.Context, Int64, Vector{Vector{CSV.Column}}, Vector{Int64}, Int64, Int64, CSV.Column})()
       @ CSV C:\Users\TGebbels\.julia\packages\WorkerUtilities\ey0fP\src\WorkerUtilities.jl:384

...and 2 more exceptions.

Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base .\task.jl:448
 [2] macro expansion
   @ .\task.jl:480 [inlined]
 [3] CSV.File(ctx::CSV.Context, chunking::Bool)
   @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:282
 [4] File
   @ C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:227 [inlined]
 [5] #File#32
   @ C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:223 [inlined]
 [6] read(source::String, sink::Type; copycols::Bool, kwargs::@Kwargs{limit::Int64})
   @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\CSV.jl:117
 [7] top-level scope
   @ c:\Users\TGebbels\OneDrive - xxxxxx\Documents\DCMS Database\TestCSVread.jl:39
in expression starting at c:\Users\TGebbels\OneDrive - xxxxxx\Documents\DCMS Database\TestCSVread.jl:39

This is line 334 from the datafile:

98021	DCMS-tnlcomfund-0023093356	Community Project	The group will use the funding to make improvements to the flooring and purchase of tables and chairs for the frequently used meeting room. This will improve community buildings and increase usage of facilities.	GBP	0	2015-07-23	2022-06-07 13:24:35.23891+00	5	8	176	62912	8	235	137

Also share line 333, which could trigger the problem.

Upload the file to gitub, dropbox, google drive, or whatever service you like and then post a link here

I think it is unlikely that the problem is with line 334 because it has progressed to at least line 515.

Line 333

505	360G-ACE-01-L090504154L-13	Grant to Ambit Magazine	Ambit Tour of the South West	GBP	4951	2005-03-01	2022-04-13 13:56:52.789849+00	1	1	45	35	2	56	305

Github says:

Yowza, that’s a big file. Try again with a file smaller than 25MB.

It’s 244MB.

https://www.google.com/search?q=large+file+sharing+service

And add ntasks=1 to CSV.read to avoid multitasking. This would allow pinpointing the offending line.

CSV.read("newex.csv", DataFrame; limit=333, ntasks=1)

Google wouldn’t let me upload it as a CSV file so I changed the file type to .txt - still the same file, though.

newex.txt