mjanun
July 25, 2021, 1:02pm
1
I have some data in different columns of an Excel sheet. I want to read the different columns into separate dataframes. I know the the column name/number where the data for each dataframe starts but not number of rows of data they contain. One example of the type of data is shown:
I want two data frames in this case. The first one containing SECTOR as a header and containing all data in column A. The second one containing INDIVIDUAL as header with all the data in column C.
How can I do this using XLSX or otherwise?
If itβs something you only need to do once, you can use ClipData.jl to copy and paste into a DataFrame easily.
If itβs something you need to do programmatically, the solution is XLSX.jl, but I donβt know much about how to work with that package. Hopefully someone else can chime in.
mjanun
July 25, 2021, 1:45pm
3
Thanks. I need to do it programmatically.
Could you check the following:
using XLSX, DataFrames
xf = XLSX.readxlsx(filename)
m = xf[1][:]
df = DataFrame(m[2:end,:],:auto)
rename!(df, Symbol.(m[1,:]))
df_list = []
for (h,c) in pairs(eachcol(df))
if all(ismissing.(c))
select!(df, Not(h))
else
dh = DataFrame(; h => c);
dh[!,h] = convert.(eltype(dh[!,1]), df[:,h])
dropmissing!(dh, h)
push!(df_list, dh)
end
end
It creates one single dataframe df
and pushes one dataframe per non-empty column into a vector of dataframes:
Output:
julia> df
4Γ2 DataFrame
Row β SECTOR INDIVIDUAL
β Any Any
ββββββΌβββββββββββββββββββββ
1 β IT ONE
2 β FINANCE TWO
3 β missing THREE
4 β missing FOUR
julia> df_list
2-element Vector{Any}:
2Γ1 DataFrame
Row β SECTOR
β String
ββββββΌβββββββββ
1 β IT
2 β FINANCE
4Γ1 DataFrame
Row β INDIVIDUAL
β String
ββββββΌββββββββββββ
1 β ONE
2 β TWO
3 β THREE
4 β FOUR
1 Like
Beter to do
df = DataFrame(m[2:end,:], :auto)
What are you expecting? It works as I expected.
julia> df_list = [DataFrame(rand(2,2), :auto) for i in 1:2]
2-element Vector{DataFrame}:
2Γ2 DataFrame
Row β x1 x2
β Float64 Float64
ββββββΌββββββββββββββββββββββ
1 β 0.504304 0.338425
2 β 0.0633497 0.0394208
2Γ2 DataFrame
Row β x1 x2
β Float64 Float64
ββββββΌββββββββββββββββββββββ
1 β 0.899258 0.69782
2 β 0.899377 0.00734581
julia> push!(df_list, DataFrame(h = [1, 2, 3]))
3-element Vector{DataFrame}:
2Γ2 DataFrame
Row β x1 x2
β Float64 Float64
ββββββΌββββββββββββββββββββββ
1 β 0.504304 0.338425
2 β 0.0633497 0.0394208
2Γ2 DataFrame
Row β x1 x2
β Float64 Float64
ββββββΌββββββββββββββββββββββ
1 β 0.899258 0.69782
2 β 0.899377 0.00734581
3Γ1 DataFrame
Row β h
β Int64
ββββββΌβββββββ
1 β 1
2 β 2
3 β 3
I see. No, you want DataFrame(; h => c)
. The Pair
syntax lets you work with names programmatically.
1 Like
mjanun
July 25, 2021, 4:53pm
11
Thank you for the code. Few questions:
Why do the column type is shown as Any when it is string?
This code results in same number of rows in all the dataframes, so it add data with missing
if number of rows in one dataframe is lower than others. The data I have has different number of rows in each column. Is it possible to create this vector of dataframe with different number of rows, ie. if missing value can be excluded?
FWIW, I donβt think your answer is particularly compelling.
There are better functions in XLSX to work with this. @mjanun I will try and make an MWE with a solution I think is more elegant soon.
mjanun
July 25, 2021, 4:54pm
13
Thank you for looking in to it. I will wait for your example.
@mjanun , see code edited above to meet your requirement.
NB: supposedly code βnot particularly compelling nor elegantβ
1 Like
Here is something I think might be a bit more robust
julia> using XLSX, DataFrames;
julia> function num_trailing_missing(x)
n = length(x)
s = 0
while true
n == 0 && break
!ismissing(x[n]) && break
s += 1
n -= 1
end
s
end;
julia> mat = XLSX.readxlsx("testdata.xlsx")[1][:];
julia> inds = [1:1, 3:3]; # You know the columns but not rows
julia> dfs = map(inds) do is
data = mat[2:end, is]
nms = mat[1, is]
df = DataFrame(data, string.(nms))
min_num_trailing_missings = minimum(num_trailing_missing.(eachcol(df)))
df = df[1:(end - min_num_trailing_missings), :]
# narrow the types
transform(df, names(df) .=> ByRow(identity); renamecols = false)
end
2-element Vector{DataFrame}:
2Γ1 DataFrame
Row β SECTOR
β String
ββββββΌβββββββββ
1 β IT
2 β FINANCE
4Γ1 DataFrame
Row β INDIVIDUAL
β Int64
ββββββΌββββββββββββ
1 β 1
2 β 2
3 β 3
4 β 4
Overall this was harder than I thought. I donβt think itβs too different from @rafael.guerra 's answer, actually. However
I take advantage of the fact that you know the starting and ending indices
I narrow the types of the output so they are no longer Any
I drop trailing missing
rather than all missing
values in the data frame.
1 Like
mjanun
July 25, 2021, 6:07pm
16
Thank you. One last question, I see that mat = XLSX.readxlsx("testdata.xlsx")[1][:]
refers to the first sheet in the spreadsheet. How can I specify the sheet name instead e.g. if I want to refer to the sheet called Data
?
Take a look at the docs with ? readxlsx
. You just replace 1
with the name of the sheet, as a String
.
1 Like
Are there in your response?
No, when I wrote that I thought readtable
could allow for subsets of columns, but I guess it only takes in the full sheet.