New to Julia and programming in general, so this is a two-part question. Suppose I have a folder with 3,000 CSV files. Each file is roughly 7,000 x 7. (The number of rows may vary from file to file, but the number of columns is constant.) I am trying to read each of these files into a 3000 x N x M tensor or other data structure in Julia so I can compare the outputs by column. (This would mostly involve summing the lags in each column vector.)
Question 1: What is the most efficient data structure for parsing through this data? My understanding is that DataFrames does not allow for 3-dimensional arrays. Would it be better to load the data into multiple DataFrames and then loop through them to flatten the data into one DataFrame, or would a multi-dimensional array work better in this case?
**Question 2: Is there an efficient way to read all these files into Julia?** Most of the methods I have come across online involve looping through the names of the files. In this case the files each have different names, so I can't loop through "CSV1", "CSV2", etc. My understanding is that broadcasting methods would involve listing out all the files by name, which in this case would be infeasible.
You can use Glob.jl to work with lots of CSV files.
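For example, a minimal sketch (assuming the files live in a hypothetical `data/` folder):

```julia
using Glob

# Collect the paths of every CSV file in the folder, regardless of name
files = glob("*.csv", "data")  # returns a Vector{String} of matching paths
```

The pattern match finds the files for you, so you never have to hard-code names like "CSV1", "CSV2".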
You can also use NamedArrays.jl instead of DataFrames to work with named arrays of more than 2 dimensions. A DataFrame wouldn't be the best choice for this even if you had just two dimensions, since a DataFrame is conceptually different from a matrix.
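A toy sketch of a 3-D NamedArray with made-up names (note a dense file x row x column array assumes every file has the same number of rows, which isn't quite your case):

```julia
using NamedArrays

# 2 files x 3 rows x 2 columns of dummy data (real data would be 3000 x N x 7)
A = NamedArray(rand(2, 3, 2),
               (["file1", "file2"], ["r1", "r2", "r3"], ["colA", "colB"]),
               ("file", "row", "column"))

A["file1", :, "colA"]  # slice one column of one file by name
```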
I have done something similar to the above. All you need is to get all the files you want to read into `files` as a vector of string paths. This will make one DataFrame for you:
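Something along these lines (a sketch; the hypothetical `:source` column is added so you can still tell the files apart after stacking):

```julia
using CSV, DataFrames, Glob

files = glob("*.csv", "data")  # vector of string paths

# Read each file, tag its rows with the path it came from, and stack them all
df = reduce(vcat,
            [insertcols!(CSV.read(f, DataFrame), :source => f) for f in files])
```

Since all files share the same 7 columns, `vcat` stacks them into one long DataFrame, and you can `groupby(df, :source)` to work per-file.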
Thanks! So I've managed to get them into a single DataFrame, however the dimensions are a little weird. Each file is showing up as a "row" in the DataFrame; in other words, each row of the DataFrame appears to be a DataFrame. Were you able to calculate column lag directly from the DataFrame? Or is there a way to convert from a DataFrame to a NamedArray? I've tried the following: