find the min & max of a column of a matrix
TL;DR: you should rather load the DataFrames package (using DataFrames) and call describe on your table. From its documentation:
Missing values are filtered in the calculation of all statistics, however the column :nmissing will report the number of missing values of that variable.
I’m relying on whatever you use to import the data producing missing rather than NaN, because only the former is filtered. NaN can also arise in calculations, so it isn’t strictly a reliable marker for missing data, although other tools, e.g. pandas, do use it for that.
https://dataframes.juliadata.org/stable/man/comparisons/
Note that pandas skips NaN values in its analytic functions by default. By contrast, Julia functions do not skip NaN’s. If necessary, you can filter out the NaN’s before processing, for example, mean(Iterators.filter(!isnan, x)).
Pandas uses NaN for representing both missing data and the floating point “not a number” value. Julia defines a special value missing for representing missing data.
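To make the quoted difference concrete, here is a minimal sketch using only the Statistics standard library. The vectors x and y are made-up illustration data, not taken from the question.

```julia
using Statistics

# Made-up data: one vector containing NaN, one containing missing.
x = [1.0, NaN, 3.0]
y = [1.0, missing, 3.0]

# Neither NaN nor missing is skipped by default; both propagate.
mean_with_nan = mean(x)          # NaN
mean_with_missing = mean(y)      # missing

# Filter explicitly before aggregating.
mean_no_nan = mean(Iterators.filter(!isnan, x))
mean_no_missing = mean(skipmissing(y))
```

Both filtered means here come out as 2.0, while the unfiltered calls return NaN and missing respectively.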
Depending on the cause of the NaNs, e.g. if they are an artifact of importing, you can filter or substitute them somehow (something like interpolating may also apply); see the thread “Replacing missing and NaN values in dataframe”, post #2 by nilshg.
Simply replacing NaN (or missing) with 0 isn’t good advice (replacing NaN with missing is likely better), but I noticed this blog post and it might be helpful: https://www.roelpeters.be/replacing-nan-missing-in-julia-dataframes/
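As a rough sketch of the "NaN to missing" substitution on a plain vector (not the method from the linked blog post, and assuming the NaNs really do mark absent data), Base's replace is enough. The vector x is made-up illustration data.

```julia
# Made-up data where NaN is assumed to mark absent measurements.
x = [1.0, NaN, 3.0, NaN]

# replace compares with isequal, so NaN matches NaN; the element type
# is widened to Union{Missing, Float64} to hold the new values.
cleaned = replace(x, NaN => missing)

n_missing = count(ismissing, cleaned)
remaining = collect(skipmissing(cleaned))
```

After this, functions that understand skipmissing (and describe on a DataFrame column) can treat the gaps uniformly.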
Older text: You can do that in one go, at least this way:
julia> A = [NaN 3; 4 missing]
2×2 Matrix{Union{Missing, Float64}}:
NaN 3.0
4.0 missing
julia> extrema(x for x ∈ skipmissing(A) if !isnan(x))
(3.0, 4.0)
About column of a “matrix”, it seems clear you’re referring to a table, and would want to be using DataFrames (or Pandas.jl).
I intentionally showed that you could find the extrema (or just e.g. the minimum) of a full matrix (across columns), not just of one (or more) columns. You would want to slice one column (or row) at a time, as you know how to do. But I also looked a bit into doing that automatically for each column.
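One way to do it per column, as a sketch only: the helper name valid is hypothetical, and the matrix is made-up data extending the example above by one row.

```julia
# Made-up matrix with both NaN and missing entries.
A = [NaN 3; 4 missing; 1 7]

# Hypothetical helper: lazily drop missing first, then NaN.
valid(col) = (x for x in skipmissing(col) if !isnan(x))

# One (min, max) tuple per column.
col_extrema = [extrema(valid(col)) for col in eachcol(A)]
```

Note that extrema throws on an empty collection, so a column consisting only of NaN and missing would need separate handling.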
I see you have the problem of Vector{Any} because of “Sc_Young_Modulus”. If you see Any like that (an abstract type, the top one; you can’t rely on the Abstract prefix to spot abstract types, e.g. Number, Real and Integer are abstract too), it’s likely going to kill performance. That’s one reason to want to use DataFrames or some other way to skip header rows. You want to see concrete types, e.g. a Vector of Float64 for your whole column, and DataFrames also allows different types for each column without performance problems. Julia is unusual with this missing concept, which is similar to NaN, but more general since it works for all datatypes.
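A small sketch of the difference, assuming the header string ended up as the first element of the column (the numeric values are made up):

```julia
# A column that still contains its header, so the element type is Any.
raw = Any["Sc_Young_Modulus", 1.5, 2.5, 0.5]

# Drop the header row and convert the rest to a concrete element type.
col = Float64.(raw[2:end])

any_is_concrete = isconcretetype(eltype(raw))   # false
col_is_concrete = isconcretetype(eltype(col))   # true

lo, hi = extrema(col)
```

With a Vector{Float64} the compiler knows the element type up front, which is what makes reductions like extrema fast.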
See “Handle Missing Data”, e.g. dropmissing! in the cheat sheet below.
I’m no expert on the package, so I’m not sure if it has functions as good as those in Pandas [EDIT: it seems as good]:
https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html
While there is the Pandas.jl wrapper, which would work with all Julia data that’s compatible with Python, I doubt it supports missing (because Python can’t support it; I also think the concept was introduced in Julia after that package). I’m not sure where you got NaN from, possibly an extra line when importing some data? You likely want to use CSV.jl to import. I’m not sure whether it imports with missing, or possibly with both missing and NaN.
Because even though I showed how to avoid both, just checking for missing (if you can rely on that being the only thing to skip, or even avoid expecting it at all) is going to be much faster, allocate less, and be simpler code:
julia> @time extrema(skipmissing(B))
0.013557 seconds (11.08 k allocations: 612.520 KiB, 99.58% compilation time)
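The caveat is that skipmissing alone only gives sensible extrema when no NaN is left in the data. A sketch with made-up matrices (the B here is illustrative, not the B timed above):

```julia
# Only missing present: skipmissing is all that is needed.
B = [1.0 missing; 4.0 2.5]
lo, hi = extrema(skipmissing(B))

# NaN still present: it propagates through min/max into the result.
C = [1.0 NaN; 4.0 missing]
lo_nan, hi_nan = extrema(skipmissing(C))
```

Here lo and hi are 1.0 and 4.0, while lo_nan and hi_nan both come out as NaN, which is why it pays to get the import to produce missing rather than NaN.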
If you used readdlm, and want to make minimal changes, then I would look into the non-default options: header=true, comments=true, comment_char='#'
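A minimal sketch of the header option, reading from an in-memory buffer instead of a file. The column names and numbers are made up, and the comment options are left out to keep the example small.

```julia
using DelimitedFiles

# Made-up file contents: one header row, then numeric data.
raw = """
Sc_Young_Modulus,Other
1.5,10.0
2.5,20.0
0.5,30.0
"""

# With header=true, readdlm returns (data, header) and the header strings
# no longer end up in the data matrix.
data, header = readdlm(IOBuffer(raw), ','; header=true)

first_col_extrema = extrema(data[:, 1])
```

Because the header row is split off, data is a numeric matrix rather than a matrix of Any, and slicing a column gives a concretely typed vector.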