Best way to import a CSV with a repeated structure

sylvaticus · September 13, 2024, 3:44pm

Aside any comment on whoever published data with this layout, I have to import a CSV like this, where the data of interest is repeated on each day of the year:

Does CSV.jl has options to load the (small) file at once ? Or should I go for a manual implementation using readline ? Or perhaps I should CSV-load the file piece by piece in a daily loop, accounting for the different row indexes… at least it is constant… ?

dlakelan · September 13, 2024, 4:10pm

Another option would be to write a function that reads the file and outputs it as a single table, by including a column for the date, then read the CSV as usual.

rafael.guerra · September 14, 2024, 6:06pm

You could try:

using CSV, DataFrames, Dates

lines = readlines(file)
ix = findall(contains.(lines, "Données"))
dates = Date.(last.(split.(lines[ix], "du ")))
header = string.(split(lines[ix[1]+1]))
df = DataFrame()
for i in ix
   append!(df, CSV.read(IOBuffer(join(lines[i+2:i+25],"\n")), delim=" ", ignorerepeated=true, DataFrame, header=header))
end
insertcols!(df, 1, "Jour" => repeat(dates, inner=24))

Dan · September 14, 2024, 8:20pm

Suppose your file looks like:

$ cat in.txt 
Day 1
1	10
2	20
3	15
4	20

Day 2
1	25
2	30
3	40
4	35

Day 3
1	40
2	45
3	30
4	15

A little Unix shell magic can be used to help Julia parse this as follows:

julia> using CSV, DataFrames

julia> CSV.read(pipeline(`awk 'BEGIN { day = "0"; print "day\ttime\tvalue" } /^Day/ { day = $2} /^[^D]/ {print day,"\t" $0}' in.txt`), DataFrame; delim='\t')
12×3 DataFrame
 Row │ day    time   value 
     │ Int64  Int64  Int64 
─────┼─────────────────────
   1 │     1      1     10
   2 │     1      2     20
   3 │     1      3     15
   4 │     1      4     20
   5 │     2      1     25
   6 │     2      2     30
   7 │     2      3     40
   8 │     2      4     35
   9 │     3      1     40
  10 │     3      2     45
  11 │     3      3     30
  12 │     3      4     15

The key useful info is the way awk tool is used to separate the important data lines from the intra-file headers, and to process the data in the header to enable adding it later to each line separately. This is the awk command on its own:

awk 'BEGIN { day = "0"; print "day\ttime\tvalue" } /^Day/ { day = $2} /^[^D]/ {print day,"\t" $0}' in.txt

Hopefully the awk logic is clear enough.

P.S. screenshots are annoying in a post. Try to insert copy-pasteable stuff

sylvaticus · September 15, 2024, 8:44am

Tahnk you. Indeed I should have posted a minimal syntetic example. I posted a screenshot as I didn’t expect a complete solution, only on which path to take.

At the end I used the CSV.read options header and limit, but I am surprised that CSV has a select option for the columns but not one for the rows.

TimG · September 15, 2024, 11:31am

Here, quinnj explains why it is not worth filtering rows.

Topic		Replies	Views
Read text file containing proper CSV data chunks Data dataframes , csv	11	562	December 20, 2023
Read file with CSV.read New to Julia	8	19791	September 9, 2019
Read vector from data file Data csv , io	8	739	January 18, 2024
Alternative to DataFrame Readtable to read large data files with headers Data	17	4042	November 12, 2018
Read CSV and change rows later New to Julia question , dataframes , csv	12	2100	May 7, 2020

Best way to import a CSV with a repeated structure

Related topics