Tidying up a csv file (follow-up to question 53261/4)

Hi there,

I have a .csv file (or, equivalently, I can have it as a .xlsx or .ods file), with the typical following format (suppressing many more intermediate rows):

Vínculo	Nome	Número de identificação	Estado	Iniciado em	Completo	Tempo utilizado	Avaliar/10,00	Q. 1 /0,00	Q. 2 /0,50	Q. 3 /0,50	Q. 4 /0,50	Q. 5 /0,50
- Aluno	JOSÉ DA SILVA	1890	Finalizada	10 janeiro 2021  15:00 	10 janeiro 2021  17:00 	2 horas	7,50	-	0,50	0,00	0,50	0,50
- Aluno	JOÃO NIGUÉM	1543	Finalizada	10 janeiro 2021  15:00 	10 janeiro 2021  16:59 	1 hora 59 minutos	0,50	-	0,00	0,00	0,00	0,00
- Aluno	MARCOS DA SILVA	1329	Finalizada	10 janeiro 2021  15:00 	10 janeiro 2021  16:45 	1 hora 44 minutos	8,00	-	0,50	0,00	0,50	0,50
- Aluno	BRUNA NINGUÉM	1111	Finalizada	10 janeiro 2021  17:58 	10 janeiro 2021  18:00 	1 minuto 19 segundos	0,00	-	-	-	-	-
Média geral							6,49	-	0,35	0,21	0,34	0,34

I would like, with Julia and any appropriate packages, to:
(1) delete the original columns 1, 2, 3, 4, 7, 8 and 9, originally entitled “Vínculo”, “Nome”, “Número de identificação”, “Estado”, “Tempo utilizado”, “Avaliar/10,00” and “Q. 1 /0,00”, and keep the remaining ones.
(2) delete the last row.
(3) read the original columns 5 and 6, originally entitled “Iniciado em” and “Completo”, as DateTime objects, after first replacing the (Portuguese standard) date/times, such as “10 janeiro 2021 15:00”, with (international standard) “2021-01-10 15:00”. I had problems with this step, mainly possibly because of the Portuguese to international standard conversion of the date and time format.
(4) read the original columns 10 up to 13, originally entitled “Q.2/0,50” up to “Q.5/0,50”, as floating point numbers, after first replacing the (Portuguese standard) decimal commas, such as “7,50”, with the English standard decimal dot, such as “7.50”. Also the “-” strings in those columns should be interpreted (converted/replaced) as missing type values. I could not make much progress with this, particularly the comma to dot and then string to float conversion, and then the missing value part as well.
(5) finally, rename the remaining 6 columns from: “Iniciado em”, “Completo”, “Q.2/0,50”, … “Q.5/0,50” to: “Start”, “End”, “Q1”, … “Q4”.

I guess I am able to deal with items (1), (2) and (5) reasonably well, at least with the packages XLSX.jl and DataFrames.jl, but the items (3) and (4) seem trickier. @Mattriks in question How to convert string column of dataframe to float with missing values? has induced me to use the CSV.jl package. I am willing to do this, but now I would like to know whether I could deal with all the issues described above, discarding XLSX.jl altogether, using only DateTimes.jl and CSV.jl.
Thanks for any help.

Regarding (3): You can use the Dates.DateLocale function to define a new Locale for the DateTime parsing.
For german this would look like that:

months = ["Januar"
          "Februar"
          "März"
          "April"
          "Mai"
          "Juni"
          "Juli"
          "August"
          "September"
          "Oktober"
          "November"
          "Dezember"]

months_abbrev = ["Jan"
               "Feb"
               "Mar"
               "Apr"
               "Mai"
               "Jun"
               "Jul"
               "Aug"
               "Sep"
               "Okt"
               "Nov"
               "Dez"]

days = ["Montag", "Dienstag", "Mittwoch", "Donnerstag", "Freitag", "Samstag", "Sonntag"]
days_abbrev = ["Mo", "Di", "Mi", "Do", "Fr", "Sa", "So"]

Dates.LOCALES["german"] = Dates.DateLocale(months, months_abbrev, days, days_abbrev)

Then you can use dateformat to parse your dates.
For your use case this should be

dform = Dates.DateFormat("dd U yyyy")
date = Dates.Date(somedatestring, dform)

See Dates · The Julia Language for the details for the dateformat construction.

1 Like

Untested, as I don’t have your file of course, but roughly I think you’re after:

# (2), (4)
df = CSV.read("myfile.csv", DataFrame; missingstring = "-", decimal = ",")[1:end-1, :]

# (1)
select!(df, Not([1:4; 7:9]))

# (5)
rename!(df, ["Start", "End", "Q1", "Q2", "Q3", "Q4"])

and a solution for (3) is given above.

@nilshg I tried your suggestion for items (2), (4) and got the error:

julia > df = CSV.read("P1_notas_orig.csv", DataFrame; missingstring="-", decimal=",")[1:end-1,:]
ERROR: TypeError: in keyword argument decimal, expected Union{Char, UInt8}, got a value of type String
Stacktrace:
 [1] read(::String, ::Type{T} where T; copycols::Bool, kwargs::Base.Iterators.Pairs{Symbol,String,Tuple{Symbol,Symbol},NamedTuple{(:missingstring, :decimal),Tuple{String,String}}}) at /home/orca/.julia/packages/CSV/la2cd/src/CSV.jl:45
 [2] top-level scope at REPL[17]:1

Sorry typo on my part (that’s what you get when you just write code without running it!) - it should be decimal = ',' (i.e. single rather than double quotes)

@nilshg I tried your last suggestion and now the csv file was read, but the original last columns, named “Q.2/0,50”, “Q.3/0,50”, etc were treated as of type Array{Union{Missing, String},1} instead of Array{Union{Missing, Float64}, 1}, as I expected. Furthermore, the comma’s were not replaced by the dot’s, in the decimal separators…

Very hard to say much more without having the actual file, or at least some excerpt. Both missingstring and decimal work as expected:

julia> using CSV, DataFrames

julia> df = DataFrame(a = ["0,5","-", "0,2"])
3×1 DataFrame
 Row │ a      
     │ String 
─────┼────────
   1 │ 0,5
   2 │ -
   3 │ 0,2

julia> s = String(take!(CSV.write(IOBuffer(), df)))
"a\n\"0,5\"\n-\n\"0,2\"\n"

julia> CSV.read(IOBuffer(s), DataFrame)
3×1 DataFrame
 Row │ a      
     │ String 
─────┼────────
   1 │ 0,5
   2 │ -
   3 │ 0,2

julia> CSV.read(IOBuffer(s), DataFrame, missingstring = "-")
3×1 DataFrame
 Row │ a       
     │ String? 
─────┼─────────
   1 │ 0,5
   2 │ missing 
   3 │ 0,2

julia> CSV.read(IOBuffer(s), DataFrame, missingstring = "-", decimal = ',')
3×1 DataFrame
 Row │ a
     │ Float64?  
─────┼───────────
   1 │       0.5
   2 │ missing   
   3 │       0.2

although of course there can always be bugs and/or edge cases, especially with files that have some unusual encodings, non-english characters etc.

I uploaded and shared (a representative excerpt of) the real file, called P1_notas_teste.csv to my public area in Dropbox. I have changed the names and identification numbers to preserve anonymity. The corresponding link is: Dropbox - P1_notas_teste.csv - Simplify your life

I hope this helps.

Thanks again.

When I run your commands upon that file which I uploaded, I get:

julia> df = CSV.read("P1_notas_teste.csv", DataFrame; missingstring = "-", decimal = ',')[1:end-1, :]
┌ Warning: thread = 1 warning: only found 28 / 29 columns around data row: 8. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/la2cd/src/file.jl:603
┌ Warning: thread = 1 warning: only found 21 / 29 columns around data row: 9. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/la2cd/src/file.jl:603
┌ Warning: thread = 1 warning: only found 28 / 29 columns around data row: 10. Filling remaining columns with `missing`
└ @ CSV ~/.julia/packages/CSV/la2cd/src/file.jl:603
8×29 DataFrame. Omitted printing of 21 columns
│ Row │ Vínculo │ Nome               │ Número de identificação │ Estado     │ Iniciado em             │ Completo                │ Tempo utilizado      │ Avaliar/10,00 │
│     │ String  │ String             │ String                  │ String     │ String                  │ String                  │ String               │ String        │
├─────┼─────────┼────────────────────┼─────────────────────────┼────────────┼─────────────────────────┼─────────────────────────┼──────────────────────┼───────────────┤
│ 1   │ - Aluno │ AMANDA LOBATO      │ 139018123               │ Finalizada │ 10 janeiro 2021  15:00  │ 10 janeiro 2021  17:00  │ 2 horas              │ 7,50          │
│ 2   │ - Aluno │ BRUNO MATOS        │ 113039321               │ Finalizada │ 10 janeiro 2021  15:00  │ 10 janeiro 2021  16:59  │ 1 hora 59 minutos    │ 0,50          │
│ 3   │ - Aluno │ CAIO LEITÃO        │ 187777777               │ Finalizada │ 10 janeiro 2021  15:00  │ 10 janeiro 2021  16:45  │ 1 hora 44 minutos    │ 8,00          │
│ 4   │ - Aluno │ DRÁU7ZIO VARELA    │ 118034567               │ Finalizada │ 10 janeiro 2021  15:00  │ 10 janeiro 2021  16:52  │ 1 hora 52 minutos    │ 7,50          │
│ 5   │ - Aluno │ EDUARDO DA SILVA   │ 114148900               │ Finalizada │ 10 janeiro 2021  15:00  │ 10 janeiro 2021  16:14  │ 1 hora 14 minutos    │ 8,00          │
│ 6   │ - Aluno │ FABIANA ESTOJO     │ 116197913               │ Finalizada │ 10 janeiro 2021  15:00  │ 10 janeiro 2021  16:11  │ 1 hora 10 minutos    │ 6,50          │
│ 7   │ - Aluno │ GUSTAVO OCEANO     │ 118039876               │ Finalizada │ 10 janeiro 2021  17:21  │ 10 janeiro 2021  18:00  │ 39 minutos           │ 4,00          │
│ 8   │ - Aluno │ RICARDO MAR ABERTO │ 111219765               │ Finalizada │ 10 janeiro 2021  17:58  │ 10 janeiro 2021  18:00  │ 1 minuto 19 segundos │ 0,00          │

Then:

julia> select!(df, Not([1:4; 7:9]))
8×22 DataFrame. Omitted printing of 12 columns
│ Row │ Iniciado em             │ Completo                │ Q. 2 /0,50 │ Q. 3 /0,50 │ Q. 4 /0,50 │ Q. 5 /0,50 │ Q. 6 /0,50 │ Q. 7 /0,50 │ Q. 8 /0,50 │ Q. 9 /0,50 │
│     │ String                  │ String                  │ String?    │ String?    │ String     │ String     │ String     │ String     │ String     │ String?    │
├─────┼─────────────────────────┼─────────────────────────┼────────────┼────────────┼────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│ 1   │ 10 janeiro 2021  15:00  │ 10 janeiro 2021  17:00  │ 0,50       │ 0,00       │ 0,50       │ 0,50       │ 0,00       │ 0,00       │ 0,50       │ 0,50       │
│ 2   │ 10 janeiro 2021  15:00  │ 10 janeiro 2021  16:59  │ 0,00       │ 0,00       │ 0,00       │ 0,00       │ 0,00       │ 0,00       │ 0,00       │ 0,00       │
│ 3   │ 10 janeiro 2021  15:00  │ 10 janeiro 2021  16:45  │ 0,50       │ 0,00       │ 0,50       │ 0,50       │ 0,50       │ 0,50       │ 0,00       │ 0,50       │
│ 4   │ 10 janeiro 2021  15:00  │ 10 janeiro 2021  16:52  │ 0,00       │ 0,00       │ 0,50       │ 0,50       │ 0,00       │ 0,50       │ 0,50       │ 0,50       │
│ 5   │ 10 janeiro 2021  15:00  │ 10 janeiro 2021  16:14  │ 0,50       │ 0,00       │ 0,50       │ 0,50       │ 0,00       │ 0,50       │ 0,50       │ 0,50       │
│ 6   │ 10 janeiro 2021  15:00  │ 10 janeiro 2021  16:11  │ 0,50       │ 0,00       │ 0,00       │ 0,50       │ 0,50       │ 0,50       │ 0,50       │ 0,00       │
│ 7   │ 10 janeiro 2021  17:21  │ 10 janeiro 2021  18:00  │ missing    │ 0,50       │ 0,00       │ 0,50       │ 0,50       │ 0,50       │ 0,00       │ missing    │
│ 8   │ 10 janeiro 2021  17:58  │ 10 janeiro 2021  18:00  │ missing    │ missing    │ #undef     │ #undef     │ #undef     │ #undef     │ #undef     │ missing    │

The weird thing is that, in the last row, I get some expected missing and unexpected #undef values as well :frowning:

Thanks for sharing. It seems to me that you just have a malformed csv here, just opening this in a text editor I get

Vínculo,Nome,Número de identificação,Estado,Iniciado em,Completo,Tempo utilizado,"Avaliar/10,00","Q. 1 /0,00 (...)"

for the first row, which to me suggests that something went wrong in creating the file as it looks like values from the first row (10,00, 0,00) have somehow crept into the header row (I’m assuming the header should be Q. 1 and not Q .1 /0,00. There isn’t much that CSV.jl or any other CSV reader can do about that, you’ll have to look into how that file is created in the first place.