Read CSV and change rows later

Hi all,

I’m trying to do a trivial task: read a CSV file and, for each row, process the value in a given column if it is not missing.
Something like this:

batches = CSV.File(
    file;
    header = true,
    delim = ',')

foreach(
    b -> if !ismissing(b.Splits)
        b.Splits = split(b.Splits, ',')
    end,
    batches)

The CSV looks like this:

"5674012","530489692","batch_145322","10/31/2019 15:00:13",
"5674012","530489702","batch_145323","10/31/2019 15:00:32","9b4e08e5"
"5674012","530489728","batch_145327","10/31/2019 15:01:56","b036aa66,b036aa67,b036aa68"
...

How can I achieve such a simple task? I’m getting this error:

ERROR: LoadError: MethodError: no method matching setindex!(::CSV.Row{false}, ::Array{SubString{String},1}, ::Symbol)

How can I make batches mutable? Generally I would expect batches[1].Splits = "whatever" to work, but obviously I’m missing something here.

Thank you for your time!

The problem here is that the rows you get by iterating over a CSV.File are immutable: as the error says, there is no setindex! method for CSV.Row. So each row object in your foreach loop can’t be altered. In particular, you can’t assign a new value to Splits.

Could you give more information on what your goals are? Do you want to put the data in memory and work with it? Or do you want to just alter the csv file and save that?

PS: here is an MWE for you to use. It shows how you can pass a string to CSV.File.

using CSV

# No spaces after the commas in the header, so the column names parse cleanly
file = """
"X1","X2","X3","X4","Splits"
"5674012","530489692","batch_145322","10/31/2019 15:00:13",
"5674012","530489702","batch_145323","10/31/2019 15:00:32","9b4e08e5"
"5674012","530489728","batch_145327","10/31/2019 15:01:56","b036aa66,b036aa67,b036aa68"
"""

io = IOBuffer(file)

batches = CSV.File(
    io;
    header = true,
    delim = ',')

foreach(batches) do b
    if !ismissing(b.Splits)
        # The result is computed but not stored anywhere
        split(b.Splits, ',')
    end
end
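
Since the rows cannot be modified in place, one option is to collect the processed values into a new vector instead of assigning back to the row. This is only a minimal sketch using the same sample data; the names data and splits are illustrative, not anything CSV.jl prescribes.

```julia
using CSV

data = """
"X1","X2","X3","X4","Splits"
"5674012","530489692","batch_145322","10/31/2019 15:00:13",
"5674012","530489702","batch_145323","10/31/2019 15:00:32","9b4e08e5"
"5674012","530489728","batch_145327","10/31/2019 15:01:56","b036aa66,b036aa67,b036aa68"
"""

batches = CSV.File(IOBuffer(data))

# Build a new vector: missing stays missing, everything else is split on commas
splits = [ismissing(b.Splits) ? missing : split(b.Splits, ',') for b in batches]
```

The rows themselves stay untouched; splits is a separate vector with one entry per row.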

@pdeffebach, thanks for the sample showing how to read CSV from a string. Pretty useful.

What I’d like to do is read the CSV into memory and process it later (merging with other CSVs, filtering rows based on column values, etc.), and then write the result to disk as CSV.

I think you should use DataFrames, then. Read all your files into DataFrames, and at the end write the resulting data frame to disk using CSV.write.

using DataFrames

batches = CSV.File(
    io;
    header = true,
    delim = ',') |> DataFrame
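
To make the read, process, filter, write workflow concrete, here is a rough sketch under a few assumptions: the sample data from earlier in the thread, a temporary output file, and the illustrative names data, with_splits and out. Treat it as one possible shape of the code rather than the recommended way.

```julia
using CSV, DataFrames

data = """
"X1","X2","X3","X4","Splits"
"5674012","530489692","batch_145322","10/31/2019 15:00:13",
"5674012","530489702","batch_145323","10/31/2019 15:00:32","9b4e08e5"
"5674012","530489728","batch_145327","10/31/2019 15:01:56","b036aa66,b036aa67,b036aa68"
"""

df = DataFrame(CSV.File(IOBuffer(data)))

# Replace the whole column with a new vector; missing values stay missing
df.Splits = [ismissing(s) ? missing : split(s, ',') for s in df.Splits]

# Keep only the rows that actually have splits
with_splits = filter(row -> !ismissing(row.Splits), df)

# Join the pieces back into one string per row before writing
with_splits.Splits = [join(v, ',') for v in with_splits.Splits]

out = joinpath(mktempdir(), "batches_out.csv")  # hypothetical output location
CSV.write(out, with_splits)
```

Replacing a whole column by assignment sidesteps the question of whether individual values can be mutated in place.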

Ok, I’ll have a look into https://juliadata.github.io/DataFrames.jl/stable/man/getting_started/

This will take some time :)

Please do not hesitate to ask questions!

You’re looking for the copycols kwarg to CSV.File, although I agree that in general you’ll have an easier time doing manipulation on a DataFrame. But even if you create a DataFrame using e.g. CSV.read, you still need the copycols kwarg if you want to mutate the values later.

I’m just going through the getting started guide for DataFrames and I’m a little confused about versions and that document.

I installed Julia 1.4.1 (from https://julialang.org/downloads).
The version of DataFrames is 0.20.2:

(@v1.4) pkg> status
Status `C:\Users\u\.julia\environments\v1.4\Project.toml`
  [a93c6f00] DataFrames v0.20.2 
  ...

But the getting started guide is written for version 0.21.0, so some commands (e.g. select(df, :x1 => :a1, :x2 => :a2) # rename columns) throw an exception.

Trying to install that version throws errors:

Pkg.add(Pkg.PackageSpec(;name="DataFrames", version="0.21"))
  Resolving package versions...
ERROR: Unsatisfiable requirements detected for package DataFrames [a93c6f00]:
 DataFrames [a93c6f00] log:
 ├─possible versions are: [0.11.7, 0.12.0, 0.13.0-0.13.1, 0.14.0-0.14.1, 0.15.0-0.15.2, 0.16.0, 0.17.0-0.17.1, 0.18.0-0.18.4, 0.19.0-0.19.4, 0.20.0-0.20.2] or uninstalled
 └─restricted to versions 0.21 by an explicit requirement — no versions left

When I go to https://juliadata.github.io/DataFrames.jl/stable/ , the version shown switches to 0.21, so that version obviously exists. How do I install it?

I think you need to use update rather than adding the new version. Can you just try Pkg.update() and see if that bumps you to the new version?

I was able to reproduce your environment on macOS, and update worked for me. I would avoid adding a package using Pkg.PackageSpec like that and just rely on the resolver to do things.
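
For reference, the update route looks roughly like this from the Julia REPL. Pkg.update refreshes the local copy of the package registry before resolving, which is presumably why a release newer than the local registry becomes visible afterwards. This is a sketch, not a diagnosis of your exact setup.

```julia
using Pkg

Pkg.update()   # refresh the registry and upgrade packages in the active environment
Pkg.status()   # list installed versions to check which DataFrames release you now have
```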

@pdeffebach you are right, update really worked. Thanks a lot.
Now, if you know why a default Julia installation has stale packages, please let me know :)

Sorry, could you clarify? Are you on JuliaPro? The new version of DataFrames was released just 2 days ago, so maybe you installed it before then?

I downloaded Julia as a standalone app (from https://julialang.org/downloads/), and that was 2 days ago. Maybe at that time the version of DataFrames was 0.20.x.

Note that DataFrames development is very fast. It probably got updated after you installed DataFrames. Updating is very easy to do in Julia. However, note that 0.20.x is robust and perfectly capable of doing data analysis. The documentation (aside from the tutorial) works for it here.

There is an unfortunate number of conflicting tutorials online right now. But that will settle down once DataFrames hits 1.0, which is soon.
