CSV woes and SubString documentation

anon67531922 · December 23, 2017, 5:57am

Hi,

I have a small csv file with 8 columns and 48 rows. The first two columns are strings and the remaining columns are floats.

My goal: For each row, extract the first two strings and parse the remaining floats into a vector and create an instance of a custom type

MyType(s1::String, s2::String, v::Vector{Float64})

At first, it seemed like a natural candidate for CSV and Query. I know my coding ability sucks, but I can almost write it by hand faster than my code is reading that small file. It is taking 6 seconds to read and parse using CSV.read and Query. Digging around, I found a comment that DataFrames are slow if you are manipulating rows, which is what I was doing.

I then used DataFrames.stack to turn my rows into columns, but that was even slower (~8 seconds).

My latest attempt is to use Base.readcsv. This reads my CSV into a 2d array, which seems like something I can work with, but when I access an element of the first two columns, the type is SubString{String}. I can’t find any documentation for SubString and I need a String. How can I get a String from a SubString{String}?

Any ideas would be appreciated. Thanks

PS: I know the first time to run any function is slow, but I will only ever run this function once since I only need to load the data to memory once.

rdeits · December 23, 2017, 6:08am

It will be a lot easier to help if you can provide a working example of your data and the code you have so far. Can you post a reproducible example?

How can I get a String from a SubString{String}?

convert(String, x)
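A minimal sketch of that conversion, using a made-up line of text rather than the poster's actual file. It assumes the SubString values come from splitting a string, which is one common way to end up with them; the String constructor is shown next to convert as an equivalent option.

```julia
# Splitting a line yields SubString{String} views into the original string.
line = "alpha,beta,1.5,2.5"      # made-up sample data
fields = split(line, ',')

sub = fields[1]                  # a SubString{String}
s1 = String(sub)                 # constructor: makes an independent String
s2 = convert(String, sub)        # the conversion suggested above
```

Either call returns a plain String that can be passed to a constructor expecting s1::String.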

anon67531922 · December 23, 2017, 6:14am

Thanks. I could swear I tried that. It is that intense stress thing again. Appreciate the help.

davidanthoff · December 23, 2017, 6:27am

This is actually a nice case where Query.jl’s bridging of the relational world with the Julia type system works well. Here is how I’d do it:

using FileIO, CSVFiles, Query

data = load("rep.csv") |> @map(MyType(_.s1, _.s2, values(_)[3:end])) |> collect

That should give you a Vector{MyType}.
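For comparison, here is a rough sketch of the same row-to-MyType mapping in plain Julia, without any packages. It is not the Query.jl approach above: the struct definition is inferred from the signature in the original question, parse_row is a hypothetical helper, and the rows are made-up sample data. It also assumes simple comma-separated lines with no quoted fields.

```julia
# Assumed definition, based on the signature given in the question.
struct MyType
    s1::String
    s2::String
    v::Vector{Float64}
end

# Hypothetical helper: turn one comma-separated line into a MyType.
# The first two fields become Strings, the rest are parsed as Float64.
function parse_row(line::AbstractString)
    fields = split(line, ',')
    return MyType(String(fields[1]), String(fields[2]),
                  parse.(Float64, fields[3:end]))
end

rows = ["a,b,1.0,2.0,3.0", "c,d,4.0,5.0,6.0"]   # made-up sample data
data = [parse_row(r) for r in rows]             # a Vector{MyType}
```

With a real file, the rows could come from eachline on the file instead of a literal vector.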

I have no good solution for the speed problem when you run this for the first time, other than hoping for some wonders around precompilation to happen at some point.

anon67531922 · December 23, 2017, 7:34am

Thanks @davidanthoff! This looks like a beautiful solution. I wasn’t aware of FileIO and IterableTables.

I have a small problem. There is a \ufeff at the beginning of my CSV file and I’m having trouble getting rid of it. It is corrupting s1, so I can’t get your trick to work.

Any ideas?

Edit: Never mind. Got it!

anon67531922 · December 23, 2017, 7:41am

Works beautifully

Liso · December 23, 2017, 12:35pm

Could load recognize a BOM for UTF-16 with a different endianness?

What was your solution?

anon67531922 · December 23, 2017, 12:43pm

My solution was anything but elegant. Opened it in Excel and resaved as .csv
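A programmatic alternative might look something like the sketch below. It assumes the file has already been read as UTF-8 text, so the byte order mark shows up as a leading \ufeff character; it does not handle UTF-16 input. strip_bom is a hypothetical helper name, not a function from any of the packages mentioned in this thread.

```julia
# Hypothetical helper: drop a leading U+FEFF byte order mark if present.
# nextind(s, 1) is the index of the character following the first one.
function strip_bom(s::AbstractString)
    return startswith(s, "\ufeff") ? String(s[nextind(s, 1):end]) : String(s)
end

raw = "\ufeffs1,s2,x1"      # made-up header line with a BOM in front
clean = strip_bom(raw)
```

The cleaned text could then be written back to disk before loading, so the first column name is no longer prefixed with the mark.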

anon67531922 · December 24, 2017, 3:14am

@davidanthoff Just FYI. Applying your trick, I am literally removing hundreds of lines of code and gaining awesome speed boosts. Thanks again

davidanthoff · December 24, 2017, 6:36am

Yay!
