Reading Multiple Files in a Folder Into a Tensor in Julia

I'm new to Julia and to programming in general, so this is a two-part question. Suppose I have a folder with 3,000 CSV files. Each file is roughly 7,000 x 7. (The number of rows may vary from file to file, but the number of columns is constant.) I am trying to read each of these files into a 3000 x N x M tensor or other data structure in Julia to compare the outputs by column. (This would mostly involve summing the lags in each column vector.)

Question 1: What is the most efficient data structure for working with this data? My understanding is that DataFrames does not allow for 3-dimensional arrays. Would it be better to load the data into multiple DataFrames and then loop through them, to flatten the data into one DataFrame, or would a multi-dimensional array work better in this case?

**Question 2: Is there an efficient way to read all these files into Julia?** Most of the methods I have come across online involve looping through the names of the files. In this case the files each have different names, so I can’t loop through "CSV1", "CSV2", etc. My understanding is that broadcasting methods would involve listing out all the files by name, which in this case would be infeasible.

Any insights would be greatly appreciated, thanks!

You can use Glob.jl to work with lots of CSV files.

You can also use NamedArrays instead of DataFrames to work with named arrays of more than 2 dimensions. A DataFrame wouldn't be the best choice for this even if you had just two dimensions, since a DataFrame is conceptually different from a matrix.

mapreduce(file -> CSV.read(file, DataFrame), vcat, files)

I have done something similar to the above. All you need is to get all the files you want to read into files as a vector of string paths. This will make one DataFrame for you.

Thanks! Sorry, I had to look up these functions. So it seems like I should be able to list all the files in a data folder with glob using the following code:

using Glob
Folder = "/Users/Drive/Data"
Files = glob("*.csv", Folder)

But how would I load them into a named array? CSV.File opens only one CSV file at a time. I could try something like:

df3 = DataFrame.(CSV.File.(Files))

to get a large DataFrame, but is there a way to convert that into a named array?

Thanks! Would you mind elaborating on your answer a little bit more? How would I read all the files as a vector of string paths?

readdir(path_to_dir; join=true) will give you the full paths of all files in the directory. Or you can use Glob.jl to get only the CSVs.
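
As a minimal sketch of the readdir approach using only Base Julia (the temporary directory and the file names are made up for illustration, so swap in your own folder), the full paths can be collected and then filtered by extension roughly like this:

```julia
# Hypothetical demo directory with a few dummy files; replace with your data folder.
dir = mktempdir()
for name in ["a.csv", "B.CSV", "notes.txt"]
    write(joinpath(dir, name), "x,y\n1,2\n")
end

# join=true returns full paths instead of bare file names.
all_paths = readdir(dir; join=true)

# Keep only CSV files, ignoring the case of the extension.
csv_paths = filter(p -> endswith(lowercase(p), ".csv"), all_paths)

println(basename.(csv_paths))
```

The resulting csv_paths is the "vector of string paths" that the mapreduce and broadcasting approaches in this thread expect.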

Or you can filter for CSVs like this:

using Chain, DataFrames, CSV, DataFramesMeta
df = @chain path begin
  readdir(_; join=true)
  filter(filepath -> endswith(lowercase(filepath), ".csv"), _)
  CSV.read.(_, DataFrame)
  vcat
end

I think this will give you all the CSVs row-bound into one large DataFrame.


A small contribution. You can replace the filter line in the code above with the following, which is in my opinion nicer:

filter(endswith("csv")∘lowercase, _)

(∘ is typed as \circ followed by Tab, and it composes several functions into one.)
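
To make the composition concrete, here is a small sketch in Base Julia. The paths are hypothetical, and the suffix is written as ".csv" (with the dot) as an assumption, so that a name merely ending in the letters "csv" is not matched:

```julia
# The one-argument form `endswith(".csv")` returns a predicate function;
# `∘` composes it with `lowercase`, so is_csv(p) == endswith(lowercase(p), ".csv").
is_csv = endswith(".csv") ∘ lowercase

paths = ["data/A.CSV", "data/b.csv", "data/readme.md"]  # hypothetical paths
csv_paths = filter(is_csv, paths)

println(csv_paths)
```

The composed function can be passed anywhere a predicate is expected, which is why it slots into filter without an anonymous function.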

Also, in answer to @phantom, you can check FileTree.jl for processing many files in parallel.


Thanks! So I’ve managed to get them into a single DataFrame, however the dimensions are a little weird. Each file is showing up as a "row" in the DataFrame. In other words, each row of the DataFrame appears to be a DataFrame. Were you able to calculate column lags directly from the DataFrame? Or is there a way to convert from a DataFrame to a named array? I’ve tried the following:

convert(NamedArray, DataFrame)
convert(Matrix, DataFrame)
NamedArray(DataFrame)

But when I try to convert to a matrix I get an error. When I try the other two methods I still get a DataFrame in return.

This is not correct. You need reduce(vcat, _) instead of vcat. See this MWE:

julia> df1 = DataFrame(a = [1, 2], b = [3, 4]);

julia> df2 = DataFrame(a = [5, 6], b = [7, 8]);

julia> vcat([df1, df2])
2-element Vector{DataFrame}:
 2×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      3
   2 │     2      4
 2×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     5      7
   2 │     6      8

julia> reduce(vcat, [df1, df2])
4×2 DataFrame
 Row │ a      b     
     │ Int64  Int64 
─────┼──────────────
   1 │     1      3
   2 │     2      4
   3 │     5      7
   4 │     6      8
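
The same distinction shows up with plain matrices, so it can be checked without any packages. This is only an illustrative sketch with made-up values: calling vcat on a single vector of matrices leaves you with a vector of matrices, while reduce(vcat, ...) or splatting stacks the rows.

```julia
A = [1 3; 2 4]
B = [5 7; 6 8]

# One argument (a vector of matrices): nothing is stacked.
not_stacked = vcat([A, B])

# reduce applies vcat pairwise across the elements, stacking the rows.
stacked = reduce(vcat, [A, B])

# Splatting passes each matrix as its own argument, which also stacks them.
also_stacked = vcat([A, B]...)

println(size(stacked))
```

For thousands of files, reduce(vcat, ...) is generally preferable to splatting, since splatting a very long collection into function arguments can be slow to compile.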

You can turn a DataFrame into a matrix with Matrix(df). From there you can convert into a NamedArray.
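
For the original goal of a files x rows x columns tensor, here is a rough sketch in Base Julia. The random matrices are stand-ins for what Matrix(df) would return for each file. Padding with missing is just one assumed way to handle the varying row counts, and the lag computation (sum of lag-1 differences per column) is only one possible reading of "summing the lags".

```julia
# Hypothetical stand-ins for three parsed files: same column count, different row counts.
mats = [rand(5, 7), rand(3, 7), rand(4, 7)]

nfiles  = length(mats)
maxrows = maximum(size(m, 1) for m in mats)
ncols   = size(first(mats), 2)

# files × rows × columns array, padded with `missing` where a file has fewer rows.
tensor = Array{Union{Missing, Float64}}(missing, nfiles, maxrows, ncols)
for (i, m) in enumerate(mats)
    tensor[i, 1:size(m, 1), :] = m
end

# Assumed interpretation of "summing the lags":
# the sum of lag-1 differences in each column of each file (nfiles × ncols result).
lag_sums = [sum(diff(m[:, j])) for m in mats, j in 1:ncols]

println(size(tensor), " ", size(lag_sums))
```

If padding is not needed, keeping the data as a plain vector of matrices (mats above) avoids the missing values entirely and works just as well for per-file, per-column computations.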
