Best way to import a CSV with a repeated structure

Aside any comment on whoever published data with this layout, I have to import a CSV like this, where the data of interest is repeated on each day of the year:

Does CSV.jl has options to load the (small) file at once ? Or should I go for a manual implementation using readline ? Or perhaps I should CSV-load the file piece by piece in a daily loop, accounting for the different row indexes… at least it is constant… ?

Another option would be to write a function that reads the file and outputs it as a single table, by including a column for the date, then read the CSV as usual.

You could try:

using CSV, DataFrames, Dates

lines = readlines(file)
ix = findall(contains.(lines, "Données"))
dates = Date.(last.(split.(lines[ix], "du ")))
header = string.(split(lines[ix[1]+1]))
df = DataFrame()
for i in ix
   append!(df, CSV.read(IOBuffer(join(lines[i+2:i+25],"\n")), delim=" ", ignorerepeated=true, DataFrame, header=header))
end
insertcols!(df, 1, "Jour" => repeat(dates, inner=24))
2 Likes

Suppose your file looks like:

$ cat in.txt 
Day 1
1	10
2	20
3	15
4	20

Day 2
1	25
2	30
3	40
4	35

Day 3
1	40
2	45
3	30
4	15

A little Unix shell magic can be used to help Julia parse this as follows:

julia> using CSV, DataFrames

julia> CSV.read(pipeline(`awk 'BEGIN { day = "0"; print "day\ttime\tvalue" } /^Day/ { day = $2} /^[^D]/ {print day,"\t" $0}' in.txt`), DataFrame; delim='\t')
12×3 DataFrame
 Row │ day    time   value 
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      1     10
   2 │     1      2     20
   3 │     1      3     15
   4 │     1      4     20
   5 │     2      1     25
   6 │     2      2     30
   7 │     2      3     40
   8 │     2      4     35
   9 │     3      1     40
  10 │     3      2     45
  11 │     3      3     30
  12 │     3      4     15

The key useful info is the way awk tool is used to separate the important data lines from the intra-file headers, and to process the data in the header to enable adding it later to each line separately. This is the awk command on its own:

awk 'BEGIN { day = "0"; print "day\ttime\tvalue" } /^Day/ { day = $2} /^[^D]/ {print day,"\t" $0}' in.txt

Hopefully the awk logic is clear enough.

P.S. screenshots are annoying in a post. Try to insert copy-pasteable stuff

Tahnk you. Indeed I should have posted a minimal syntetic example. I posted a screenshot as I didn’t expect a complete solution, only on which path to take.

At the end I used the CSV.read options header and limit, but I am surprised that CSV has a select option for the columns but not one for the rows.

Here, quinnj explains why it is not worth filtering rows.