Reading Multiple Files in a Folder Into a tensor in Julia

phantom · May 17, 2021, 9:05pm

New to Julia and programming in general, so this is a two-part question. Suppose I have a folder with 3,000 CSV files. Each file is roughly 7,000 x 7. (The number of rows may vary from file to file, but the number of columns is constant.) I am trying to read each of these files into a 3000 x N x M tensor or other data structure in Julia to compare the outputs by column. (This would mostly involve summing the lags in each column vector.)

Question 1: What is the most efficient data structure to parse through this data? My understanding is that DataFrames does not allow for 3-dimensional arrays. Would it be better to load the data into multiple DataFrames and then loop through them, to flatten the data into one DataFrame, or would a multi-dimensional array work better in this case?

**Question 2: Is there an efficient way to read all these files into Julia?** Most of the methods I have come across online involve looping through the names of the files. In this case the files each have different names, so I can’t loop through “CSV1”, “CSV2”, etc. My understanding is that broadcasting methods would involve listing out all the files by name, which in this case would be infeasible.

Any insights would be greatly appreciated thanks!

pdeffebach · May 17, 2021, 9:47pm

You can use Glob.jl to work with lots of CSV files.

You can also use NamedArrays instead of DataFrames to work with named data in more than 2 dimensions. A DataFrame wouldn't be the best choice for this even if you had just two dimensions, since a DataFrame is conceptually different from a matrix.
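
Because the row count differs from file to file, a dense 3000 x N x M array would need padding. One possible alternative is a plain vector of matrices, one per file. The sketch below uses made-up data and assumes that "summing the lags" means summing the lag-1 differences down each column; the helper name `column_lag_sums` and that interpretation are assumptions, not something stated in the thread.

```julia
# Hypothetical stand-ins for two files with different row counts but 3 columns each.
file_a = [1.0 10.0 100.0;
          2.0 12.0 103.0;
          4.0 15.0 107.0]
file_b = [5.0 1.0 0.0;
          7.0 1.0 2.0]

# A vector of matrices tolerates ragged row counts, unlike a dense 3-D array.
tables = [file_a, file_b]

# Assumed meaning of "summing the lags": sum the lag-1 differences in each column.
column_lag_sums(m::AbstractMatrix) = vec(sum(diff(m; dims=1); dims=1))

# One vector of per-column sums for each file.
lag_sums = map(column_lag_sums, tables)
```

Each element of `lag_sums` has one entry per column, so results from different files can be compared column by column even when the files have different lengths.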

xiaodai · May 19, 2021, 2:43pm

mapreduce(file -> CSV.read(file, DataFrame), vcat, files)

I have done something similar to the above. All you need is to collect the files you want to read into files, a vector of string paths. This will make one DataFrame for you.

phantom · May 21, 2021, 8:21pm

Thanks! Sorry had to look up these functions. So it seems like I should be able to open all files in a data folder with glob with the following code

using Glob
Folder = "/Users/Drive/Data"
Files = glob("*.csv", Folder)

But how would I load them into a named array? CSV.File opens them into one csv file. I could try something like:

df3 = DataFrame.(CSV.File.(Files))

to get a large DataFrame, but is there a way to convert that into a named array?

phantom · May 21, 2021, 8:25pm

Thanks! Would you mind elaborating on your answer a little bit more? How would I read all the files as a vector of string paths?

xiaodai · May 22, 2021, 1:56am

readdir(path_to_dir; join=true) will give you the full paths of all files in the directory. Or you can use Glob.jl to get only CSVs.

or you can filter csv like this

using Chain, DataFrames, CSV, DataFramesMeta
df = @chain path begin
  readdir(_; join=true)
  filter(filepath -> endswith(lowercase(filepath), ".csv"), _)
  CSV.read.(_, DataFrame)
  vcat
end

I think this will give you all the CSVs row-bound into one large DataFrame.
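
The listing-and-filtering step can be tried on its own, without any packages. This is a minimal sketch that builds a throwaway folder with `mktempdir`; the file names are made up for illustration, and the `join` keyword of `readdir` needs Julia 1.4 or later.

```julia
# Build a throwaway folder so the example does not depend on real data.
dir = mktempdir()
for name in ("b.CSV", "a.csv", "notes.txt")
    write(joinpath(dir, name), "x,y\n1,2\n")
end

# join=true returns full paths; without it readdir returns bare file names.
all_paths = readdir(dir; join=true)

# Keep only the CSV files, ignoring the case of the extension.
csv_paths = filter(p -> endswith(lowercase(p), ".csv"), all_paths)
```

`csv_paths` is the "vector of string paths" that the reading step expects, so the files never have to be listed by hand.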

dmolina · May 22, 2021, 8:32am

A small contribution. You can replace the filter line above with the following code, which in my opinion is nicer:

filter(endswith("csv")∘lowercase, _)

(∘ is typed as \circ followed by Tab, and it composes several functions into one.)
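
To see what the composition does, here is a small sketch on plain strings. The list `file_names` is made up for illustration, and the one-argument form of `endswith` assumes Julia 1.5 or later.

```julia
# endswith(".csv") with one argument returns a predicate (Julia 1.5 or later).
# Composing it with lowercase lowercases the name first, then tests the suffix.
is_csv = endswith(".csv") ∘ lowercase

# Hypothetical file names with mixed-case extensions.
file_names = ["a.csv", "B.CSV", "notes.txt"]
csv_names = filter(is_csv, file_names)
```

`is_csv(x)` behaves like `endswith(lowercase(x), ".csv")`, so it can be dropped into `filter` wherever the anonymous function was used.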

Also, in answer to @phantom, you can check FileTrees.jl for processing many files in parallel.

phantom · May 24, 2021, 7:26pm

Thanks! So I’ve managed to get them into a single DataFrame; however, the dimensions are a little weird. Each file is showing up as a “row” in the DataFrame. In other words, each row of the DataFrame appears to be a DataFrame. Were you able to calculate column lag directly from the DataFrame? Or is there a way to convert from a DataFrame to a named array? I’ve tried the following:

convert(NamedArray, DataFrame)
convert(Matrix, DataFrame)
NamedArray(DataFrame)

but when I try to convert to a matrix I get an error. When I try the other two methods I still get a DataFrame in return.

pdeffebach · May 24, 2021, 7:43pm

xiaodai:

using Chain, DataFrames, CSV, DataFramesMeta
df = @chain path begin
  readdir(_; join=true)
  filter(filepath -> endswith(lowercase(filepath), ".csv"), _)
  CSV.read.(_, DataFrame)
  vcat
end

This is not correct. You need reduce(vcat, _) instead of vcat. See this MWE:

julia> df1 = DataFrame(a = [1, 2], b = [3, 4]);

julia> df2 = DataFrame(a = [5, 6], b = [7, 8]);

julia> vcat([df1, df2])
2-element Vector{DataFrame}:
 2×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      3
   2 │     2      4
 2×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     5      7
   2 │     6      8

julia> reduce(vcat, [df1, df2])
4×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      3
   2 │     2      4
   3 │     5      7
   4 │     6      8

You can turn a DataFrame into a matrix with Matrix(df). From there you can convert into a NamedArray.
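
A minimal sketch of that conversion, assuming DataFrames.jl and NamedArrays.jl are installed. The row labels `"row1"` and `"row2"` and the dimension names are hypothetical; only the column names come from the DataFrame.

```julia
using DataFrames, NamedArrays

df = DataFrame(a = [1, 2], b = [3, 4])

# Matrix(df) copies the columns into a plain 2×2 matrix.
m = Matrix(df)

# Attach names per dimension; the row labels here are hypothetical.
na = NamedArray(m, (["row1", "row2"], names(df)), ("Row", "Column"))

# Look up a value by name instead of by index.
value = na["row2", "b"]
```

Note that `Matrix(df)` needs columns that share a common element type to give a numeric matrix, so any non-numeric columns would have to be dropped or converted first.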
