I have a large text file that describes itself as
--
-- PostgreSQL database dump
--
-- Dumped from database version 12.16
-- Dumped by pg_dump version 12.16
It contains several blocks of tab-separated data which represent the data from each database table. Iβve copied one of these blocks (using the VSCode editor) into a stand-alone file. This contains 15 tab-separated columns and 761,065 rows. First row is a header. I cannot coax CSV into reading this file.
using CSV, DataFrames, StringEncodings
entries = ["id", "identifier", "title", "description", "currency", "amount_awarded", "award_date", "last_modified", "funding_org_id", "good_cause_area_id", "local_authority_id", "recipient_org_id", "region_id", "uk_constituency_id", "ward_id"]
# Names of DF columns
df1 = DataFrame([name => String[] for name in entries])
for line in eachline("newex.csv")
list = findall(r"\t", line)
l = length(list)
if l !== 14 # Check that all lines have exactly 14 tab separators
println(l)
end
row = String[]
append!(row, [chop(line[1:first(list[1])], head=0, tail=1)], [chop(line[first(list[x-1]):first(list[x])], head=1, tail=1) for x = 2:l], [chop(line[first(list[l]):end], head=1, tail=0)])
push!(df1, row)
end
deleteat!(df1, 1) # first row of datafile contains headers.
println(describe(df1))
#df2 = CSV.File(open("newex.csv", enc"ISO-8859-1")) |> DataFrame
df2 = CSV.read("newex.csv", DataFrame)
println(describe(df2))
My βbrute force and ignoranceβ method produces a viable dataframe, but CSV.read fails. Iβve tried specifying delim="\t"
and have experimented with StringEncodings but nothing Iβve tried makes any difference.
Obviously, Iβve managed to read the datafile, so this post is out of curiosity more than anything.
Here is the output of running the above script.
15Γ7 DataFrame
Row β variable mean min median max nmissing eltype
β Symbol Nothing String Nothing String Int64 DataType
ββββββΌββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
1 β id 100 999999 0 String
2 β identifier 000406 YH201117526 0 String
3 β title " " (working title) β Other 0 String
4 β description ! (Hollywood Blues) \uf0fc Musicians, Dancers, Sβ¦ 0 String
5 β currency GBP GBP 0 String
6 β amount_awarded 0 999999 0 String
7 β award_date 1995-01-06 2023-12-20 0 String
8 β last_modified 2021-12-09 12:53:26.999321+00 2024-01-11 13:07:00.623853+00 0 String
9 β funding_org_id 1 9 0 String
10 β good_cause_area_id 1 8 0 String
11 β local_authority_id 1 \\N 0 String
12 β recipient_org_id 1 99999 0 String
13 β region_id 1 \\N 0 String
14 β uk_constituency_id 1 \\N 0 String
15 β ward_id
Then I get literally thousands of warnings from CSV as it tries to read each line of the file:
β Warning: thread = 3 warning: only found 15 / 16 columns around data row: 548111. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 3 warning: only found 15 / 16 columns around data row: 548112. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 3 warning: only found 15 / 16 columns around data row: 548113. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 3 warning: only found 15 / 16 columns around data row: 548114. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 3 warning: only found 15 / 16 columns around data row: 548115. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 3 warning: only found 15 / 16 columns around data row: 548116. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 3 warning: only found 15 / 16 columns around data row: 548117. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 3 warning: only found 15 / 16 columns around data row: 548118. Filling remaining columns with `missing`
β @ CSV C:\U
...
β Warning: thread = 3: too many warnings, silencing any further warnings
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:591
β Warning: thread = 1 warning: only found 15 / 16 columns around data row: 195709. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 4 warning: only found 15 / 16 columns around data row: 195709. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 1 warning: only found 15 / 16 columns around data row: 195710. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 1 warning: only found 15 / 16 columns around data row: 195710. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 1 warning: only found 15 / 16 columns around data row: 195711. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 1 warning: only found 15 / 16 columns around data row: 195711. Filling remaining columns with `missing`
...
β Warning: thread = 1: too many warnings, silencing any further warnings
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:591
β Warning: thread = 1: too many warnings, silencing any further warnings
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:591
β Warning: thread = 1 warning: only found 15 / 16 columns around data row: 601498. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 1 warning: only found 15 / 16 columns around data row: 601499. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 1 warning: only found 15 / 16 columns around data row: 601500. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 1 warning: only found 15 / 16 columns around data row: 601501. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 1 warning: only found 15 / 16 columns around data row: 601502. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 1 warning: only found 15 / 16 columns around data row: 601503. Filling remaining columns with `missing`
...
β Warning: thread = 1: too many warnings, silencing any further warnings
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:591
β Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601514. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601515. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601516. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601517. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601518. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601519. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601520. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601521. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
...
β Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601595. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601596. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601597. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 4 warning: only found 15 / 16 columns around data row: 601598. Filling remaining columns with `missing`
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:586
β Warning: thread = 4: too many warnings, silencing any further warnings
β @ CSV C:\Users\TGebbels\.julia\packages\CSV\aoJqo\src\file.jl:591
And then the crash.
ERROR: LoadError: UndefRefError: access to undefined reference
Stacktrace:
[1] getindex
@ .\essentials.jl:13 [inlined]
[2] iterate
@ .\array.jl:945 [inlined]
[3] _foldl_impl
@ .\reduce.jl:60 [inlined]
[4] foldl_impl
@ .\reduce.jl:48 [inlined]
[5] mapfoldl_impl
@ .\reduce.jl:44 [inlined]
[6] _mapreduce_dim
@ .\reducedim.jl:362 [inlined]
[7] mapreduce
@ .\reducedim.jl:357 [inlined]
[8] _sum
@ .\reducedim.jl:1015 [inlined]
[9] sum
@ .\reducedim.jl:1011 [inlined]
[10] _simple_count
@ .\reduce.jl:1357 [inlined]
[11] _count
@ .\reducedim.jl:442 [inlined]
[12] count
@ .\reducedim.jl:440 [inlined]
[13] #43
@ .\none:0 [inlined]
[14] MappingRF
@ .\reduce.jl:100 [inlined]
[15] _foldl_impl(op::Base.MappingRF{SentinelArrays.var"#43#44"{typeof(ismissing)}, Base.BottomRF{typeof(Base.add_sum)}}, init::Base._InitialValue, itr::Vector{Vector{String}})
@ Base .\reduce.jl:58
[16] foldl_impl
@ .\reduce.jl:48 [inlined]
[17] mapfoldl_impl
@ .\reduce.jl:44 [inlined]
[18] mapfoldl
@ .\reduce.jl:175 [inlined]
[19] mapreduce
@ .\reduce.jl:307 [inlined]
[20] sum
@ .\reduce.jl:535 [inlined]
[21] sum
@ .\reduce.jl:564 [inlined]
[22] count
@ C:\Users\TGebbels\.julia\packages\SentinelArrays\1kRo4\src\chainedvector.jl:787 [inlined]
[23] (::DataFrames.var"#76#84"{Vector{Any}})(col::SentinelArrays.ChainedVector{String, Vector{String}})
@ DataFrames C:\Users\TGebbels\.julia\packages\DataFrames\58MUJ\src\abstractdataframe\abstractdataframe.jl:723
[24] iterate
@ .\generator.jl:47 [inlined]
[25] collect_to!(dest::Vector{Dict{Symbol, Any}}, itr::Base.Generator{DataFrames.DataFrameColumns{DataFrame}, DataFrames.var"#76#84"{Vector{Any}}}, offs::Int64, st::Int64)
@ Base .\array.jl:892
[26] collect_to_with_first!
@ .\array.jl:870 [inlined]
[27] collect(itr::Base.Generator{DataFrames.DataFrameColumns{DataFrame}, DataFrames.var"#76#84"{Vector{Any}}})
@ Base .\array.jl:844
[28] map
@ .\abstractarray.jl:3313 [inlined]
[29] _describe(df::DataFrame, stats::Vector{Any})
@ DataFrames C:\Users\TGebbels\.julia\packages\DataFrames\58MUJ\src\abstractdataframe\abstractdataframe.jl:712
[30] describe(df::DataFrame; cols::Function)
@ DataFrames C:\Users\TGebbels\.julia\packages\DataFrames\58MUJ\src\abstractdataframe\abstractdataframe.jl:668
[31] top-level scope
@ c:\Users\TGebbels\OneDrive - xxxxxxxx\Documents\DCMS Database\TestCSVread.jl:37
I donβt have a mechanism to share the datafile but I would if I could.
Here are the first few lines:
id identifier title description currency amount_awarded award_date last_modified funding_org_id good_cause_area_id local_authority_id recipient_org_id region_id uk_constituency_id ward_id
99402 DCMS-tnlcomfund-0023093253 Activity Adventure Trai The organisation will use the funding to purchase and construct an Activity Adventure Trail. This will enable the organisation to provide a safe outdoor recreational facility which the area lacks currently to tackle the identified risk of isolation and anti social behaviour amongst the local children. GBP 0 2015-07-30 2022-06-07 13:24:35.117395+00 5 8 306 63764 3 317 6032
99403 DCMS-tnlcomfund-0023093271 The SPHERE Project The organisation will use the funding to run a series of activity sessions focussing on physical mental and sexual health. This will enable the organisation to tackle the ongoing health and substance misuse issues that arise as a result of deprivation in the area. GBP 0 2015-07-30 2022-06-07 13:24:35.129602+00 5 8 76 46015 3 91 11
101347 DCMS-tnlcomfund-0023093273 Crookston Environmental and Outdoor Project Crookston Environmental and Outdoor Project GBP 0 2015-08-12 2022-06-07 13:24:35.135671+00 5 8 205 64760 1 545 3978
97860 DCMS-tnlcomfund-0023093278 A4A - Paulsgrove and Wymering Connect The organisation will use the funding to work with over 40 local agencies to create an online directory giving residents information about local activities and services. This will enable the group to provide easy access to all local residents with updated programmes of activities and information about accessing these services to improve wellbeing and community engagement in the area. GBP 0 2015-07-16 2022-06-07 13:24:35.141859+00 5 8 141 62886 9 391 767
112863 DCMS-tnlcomfund-0023207672 The Engage Programme The project will provide practical life skills and training to teenage boys who lack positive male role models in their lives. This will improve the young peopleβs self-esteem resilience and motivation. GBP 0 2017-10-12 2022-06-07 15:10:41.356337+00 5 8 366 68266 7 602 5357