CSV woes and SubString documentation


#1

Hi,

I have a small csv file with 8 columns and 48 rows. The first two columns are strings and the remaining columns are floats.

My goal: for each row, extract the first two strings, parse the remaining floats into a vector, and create an instance of a custom type:

MyType(s1::String, s2::String, v::Vector{Float64})

At first, it seemed like a natural candidate for CSV and Query. I know my coding ability sucks, but I can almost write it by hand faster than my code is reading that small file. It is taking 6 seconds to read and parse using CSV.read and Query. Digging around, I found a comment that DataFrames are slow if you are manipulating rows, which is what I was doing.

I then used DataFrames.stack to turn my rows into columns, but that was even slower (~8 seconds).

My latest attempt is to use Base.readcsv. This reads my CSV into a 2d array, which seems like something I can work with, but when I access an element of the first two columns, the type is SubString{String}. I can’t find any documentation for SubString and I need a String. How can I get a String from a SubString{String}?

Any ideas would be appreciated. Thanks :slight_smile:

PS: I know the first time to run any function is slow, but I will only ever run this function once since I only need to load the data to memory once.


#2

It will be a lot easier to help if you can provide a working example of your data and the code you have so far. Can you post a reproducible example?

How can I get a String from a SubString{String}?

convert(String, x)
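For anyone landing here later, a minimal sketch of that conversion, assuming a recent Julia version. The sample string and variable names are made up for illustration; `split` is used only because it is an easy way to get a SubString{String}.

```julia
# Illustrative input; split returns a Vector{SubString{String}}.
parts = split("alpha,beta", ',')
x = parts[1]               # x is a SubString{String} viewing the original string

y = convert(String, x)     # the conversion suggested above
z = String(x)              # the String constructor does the same job

println(typeof(x))
println(typeof(y))
```

Both calls copy the characters into a new String, so the result no longer refers to the parent string.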


#3

Thanks. I could swear I tried that :sweat_smile: It is that intense stress thing again. Appreciate the help :+1:


#4

This is actually a nice case where Query.jl’s bridging of the relational world with the Julia type system works well. Here is how I’d do it:

using FileIO, CSVFiles, Query

data = load("rep.csv") |> @map(MyType(_.s1, _.s2, values(_)[3:end])) |> collect

That should give you a Vector{MyType}.
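If you want to see the same row-to-struct idea without any packages, here is a rough sketch in plain Julia. It is not what the pipeline above does internally. It assumes a recent Julia version, no header row, and no quoted fields or embedded commas. The field names in MyType, the helper parse_row, and the sample data are all hypothetical.

```julia
# Assumed layout of the custom type from the first post.
struct MyType
    s1::String
    s2::String
    v::Vector{Float64}
end

# Hypothetical helper: turn one comma-separated line into a MyType.
function parse_row(line::AbstractString)
    fields = split(line, ',')                 # pieces are SubString{String}
    s1 = String(fields[1])
    s2 = String(fields[2])
    v = parse.(Float64, fields[3:end])        # remaining columns as floats
    return MyType(s1, s2, v)
end

# Made-up sample data standing in for the real file contents.
csv_text = """
a,b,1.0,2.5,3.0
c,d,4.0,5.5,6.0
"""

rows = [parse_row(l) for l in split(strip(csv_text), '\n')]
```

For a real file you would read the lines with something like eachline and skip the header first. A proper CSV reader is still the safer choice once quoting is involved.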

I have no good solution for the speed problem when you run this for the first time, other than hoping for some wonders around precompilation to happen at some point :slight_smile:


#5

Thanks @davidanthoff! This looks like a beautiful solution :heart_eyes: I wasn’t aware of FileIO and IterableTables :+1:

I have a small problem. There is a \ufeff (a byte order mark) at the beginning of my CSV file, and I’m having trouble getting rid of it. It is corrupting s1, so I can’t get your trick to work :thinking:

Any ideas?

Edit: Never mind. Got it! :clap:


#6

Works beautifully :heart_eyes::raised_hands:


#7

Could load recognize the BOM for UTF-16 with either endianness?

What was your solution?


#8

My solution was anything but elegant. Opened it in Excel and resaved as .csv :slight_smile:
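For completeness, the BOM could probably also be dropped in code. This is only a sketch: it assumes a recent Julia version and UTF-8 text that has already been read into a string. The helper name strip_bom and the sample text are made up. It does not address the UTF-16 question above.

```julia
# Hypothetical helper: drop a single leading U+FEFF if present.
function strip_bom(s::AbstractString)
    if startswith(s, "\ufeff")
        return String(chop(s, head=1, tail=0))   # remove only the first character
    end
    return String(s)
end

# Made-up sample standing in for the file contents.
raw = "\ufeffs1,s2,x1\na,b,1.0"
clean = strip_bom(raw)
```

The cleaned text could then be written back to disk or wrapped in an IOBuffer before parsing.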


#9

@davidanthoff Just FYI. Applying your trick, I am literally removing hundreds of lines of code and gaining awesome speed boosts. Thanks again :+1:


#10

Yay!