Which package and Julia functions can be used to scale and center matrices?
I’m not sure if there is a package, but both of these would be 1 line functions if you wanted to make them.
A simple analogue of R’s scale(A) function (with the default arguments) would be:
using LinearAlgebra, Statistics
scale(A) = mapslices(normalize!, A .- mean(A,dims=1), dims=1)
using Statistics
scale(A) = (A .- mean(A,dims=1)) ./ std(A,dims=1)
This should also work :). You might need to switch between dims=1 and dims=2 depending on whether your variables are stored as columns or as rows.
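To make the dims remark concrete, here is a minimal sketch using only the Statistics standard library. The helper name standardize_along and the small matrix are made up for illustration; dims=1 standardizes each column, dims=2 each row.

```julia
using Statistics

# Hypothetical helper (not from any package): standardize along `dims`.
# dims=1 treats each column as a variable, dims=2 treats each row as a variable.
standardize_along(A; dims=1) = (A .- mean(A, dims=dims)) ./ std(A, dims=dims)

A = [1.0 2.0; 3.0 6.0; 5.0 10.0]   # 3 observations (rows) x 2 variables (columns)

# Variables in columns: scale along dims=1.
Z = standardize_along(A; dims=1)

# Same data with variables stored as rows: scale along dims=2 instead.
Zt = standardize_along(permutedims(A); dims=2)
```

Both calls give the same numbers, just transposed, so the only thing to decide is which dimension holds the variables.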
ChemometricsTools.jl offers this as “CenterScale()”, the way it works is kinda nice because you can do the following…
using ChemometricsTools
scaler = CenterScale(A)
Ascaled = scaler(A)
#now you can use this same mean/stddev to center & scale new data for inference
Bscaled = scaler(B)
Something like that anyways - I’d have to check the docs.
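For anyone who wants the same fit-once, apply-many pattern without an extra dependency, a rough sketch in plain Julia could look like the following. ColumnScaler is a hypothetical type written for this post, not the ChemometricsTools.jl implementation, and the data are made up.

```julia
using Statistics

# Hypothetical type, NOT the ChemometricsTools.jl API:
# it just remembers the per-column mean and standard deviation.
struct ColumnScaler
    mu::Matrix{Float64}      # 1 x p row of column means
    sigma::Matrix{Float64}   # 1 x p row of column standard deviations (n-1 denominator)
end

# "Fit": learn the statistics from the training matrix.
ColumnScaler(A::AbstractMatrix) = ColumnScaler(mean(A, dims=1), std(A, dims=1))

# "Apply": make the object callable so the stored statistics are reused on any data.
(s::ColumnScaler)(X::AbstractMatrix) = (X .- s.mu) ./ s.sigma

A = [1.0 10.0; 2.0 20.0; 3.0 30.0]   # training data
B = [4.0 40.0]                       # new observation

scaler = ColumnScaler(A)
Ascaled = scaler(A)
Bscaled = scaler(B)   # scaled with A's mean and std, not B's own
```

The point of storing mu and sigma is that new data are transformed with the training statistics, which is what you want for inference.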
What is going on with the mapslices version? It is the only case in which the comparison with R is not satisfactory.
using RCall
julia> rcopy(R"R.version.string")
"R version 3.6.1 (2019-07-05)"
RCall.reval("set.seed(12345)")
A=rcopy(RCall.reval("matrix(rnorm(20), nrow = 4)"))
@rput A
tR=rcopy(R"scale(A)")
using LinearAlgebra, Statistics
julia> scale1(X) = mapslices(normalize!, X .- mean(X,dims=1), dims=1)
scale1 (generic function with 1 method)
julia> tJ=scale1(A)
4×5 Array{Float64,2}:
-0.606239 0.395508 -0.731593 0.824123 0.720878
-0.203018 -0.448957 0.599096 -0.0240218 -0.692767
0.767816 0.592667 -0.154067 -0.39105 -0.0115036
0.0414407 -0.539219 0.286565 -0.409051 -0.0166076
julia> tR
4×5 Array{Float64,2}:
-1.05004 0.685041 -1.26716 1.42742 1.2486
-0.351637 -0.777616 1.03766 -0.0416069 -1.19991
1.3299 1.02653 -0.266853 -0.677319 -0.0199249
0.0717774 -0.933954 0.496345 -0.708498 -0.0287652
julia> floor.(tR,digits=6) != floor.(tJ,digits=6)
true
julia> scale2(X) = (X .- mean(X,dims=1)) ./ std(X,dims=1)
scale2 (generic function with 1 method)
julia> tJ=scale2(A)
4×5 Array{Float64,2}:
-1.05004 0.685041 -1.26716 1.42742 1.2486
-0.351637 -0.777616 1.03766 -0.0416069 -1.19991
1.3299 1.02653 -0.266853 -0.677319 -0.0199249
0.0717774 -0.933954 0.496345 -0.708498 -0.0287652
julia> tR
4×5 Array{Float64,2}:
-1.05004 0.685041 -1.26716 1.42742 1.2486
-0.351637 -0.777616 1.03766 -0.0416069 -1.19991
1.3299 1.02653 -0.266853 -0.677319 -0.0199249
0.0717774 -0.933954 0.496345 -0.708498 -0.0287652
julia> floor.(tR,digits=6) != floor.(tJ,digits=6)
false
julia> using LinearAlgebra, StatsBase
julia> scale3(X)=standardize(ZScoreTransform,X,dims=1)
scale3 (generic function with 1 method)
julia> tJ=scale3(A)
4×5 Array{Float64,2}:
-1.05004 0.685041 -1.26716 1.42742 1.2486
-0.351637 -0.777616 1.03766 -0.0416069 -1.19991
1.3299 1.02653 -0.266853 -0.677319 -0.0199249
0.0717774 -0.933954 0.496345 -0.708498 -0.0287652
julia> tR
4×5 Array{Float64,2}:
-1.05004 0.685041 -1.26716 1.42742 1.2486
-0.351637 -0.777616 1.03766 -0.0416069 -1.19991
1.3299 1.02653 -0.266853 -0.677319 -0.0199249
0.0717774 -0.933954 0.496345 -0.708498 -0.0287652
julia> floor.(tR,digits=6) != floor.(tJ,digits=6)
false
julia> using ChemometricsTools
julia> scaler = CenterScale(A)
CenterScale{Array{Float64,2},Array{Float64,2}}([0.28809252478401026 -0.07494330466750099 … -0.5848475671701096 0.6389883622663647], [1.1498790318008532 0.8687946525882664 … 1.4295913312370347 0.933934864794779], true)
julia> tJ = scaler(A)
4×5 Array{Float64,2}:
-1.05004 0.685041 -1.26716 1.42742 1.2486
-0.351637 -0.777616 1.03766 -0.0416069 -1.19991
1.3299 1.02653 -0.266853 -0.677319 -0.0199249
0.0717774 -0.933954 0.496345 -0.708498 -0.0287652
julia> tR
4×5 Array{Float64,2}:
-1.05004 0.685041 -1.26716 1.42742 1.2486
-0.351637 -0.777616 1.03766 -0.0416069 -1.19991
1.3299 1.02653 -0.266853 -0.677319 -0.0199249
0.0717774 -0.933954 0.496345 -0.708498 -0.0287652
julia> floor.(tR,digits=6) != floor.(tJ,digits=6)
false
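A side note on the floor.(…, digits=6) comparisons above: truncating digits can report a mismatch when two values differ only by floating-point noise but happen to straddle a truncation boundary. A tolerance-based check with isapprox is usually more robust. The sketch below uses made-up stand-in data (tR_like is hypothetical and simply mimics an R result that agrees up to rounding noise), since the RCall session is not reproduced here.

```julia
using LinearAlgebra, Statistics

scale2(X) = (X .- mean(X, dims=1)) ./ std(X, dims=1)

# Stand-in data; in the session above the reference came from R's scale(A) via RCall.
A = [1.0 2.0; 3.0 5.0; 6.0 4.0; 8.0 9.0]
tJ = scale2(A)
tR_like = tJ .+ 1e-12   # hypothetical: same result up to floating-point noise

# isapprox compares the whole arrays with a relative tolerance
# instead of truncating each entry to a fixed number of digits.
same = isapprox(tR_like, tJ; rtol=1e-8)
```

With the default tolerance this can also be written with the infix operator, as in tR_like ≈ tJ.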
My mistake, I misunderstood the normalization: R normalizes the columns to have standard deviation (root-mean-square, with the n-1 Bessel correction) equal to 1, whereas I was normalizing them to have root-sum-square (norm) equal to 1, so my suggestion was off by a factor of sqrt(n-1) = sqrt(3) in this case (n = 4 rows). A corrected version should be:
scale(A) = mapslices(normalize!, A .- mean(A,dims=1), dims=1) * sqrt(size(A,1)-1)
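As a quick sanity check of that factor: for a centered column c with n entries, std(c) = norm(c) / sqrt(n-1), so dividing by the norm and multiplying by sqrt(n-1) should agree with dividing by the standard deviation. A small sketch with arbitrary example data (the names scale_std and scale_norm are only for this illustration):

```julia
using LinearAlgebra, Statistics

# Divide each centered column by its sample standard deviation.
scale_std(A) = (A .- mean(A, dims=1)) ./ std(A, dims=1)

# Normalize each centered column to unit norm, then rescale by sqrt(n-1).
scale_norm(A) = mapslices(normalize!, A .- mean(A, dims=1), dims=1) * sqrt(size(A, 1) - 1)

A = [1.0 2.0; 3.0 5.0; 6.0 4.0; 8.0 9.0]   # arbitrary 4 x 2 example data
Z1 = scale_std(A)
Z2 = scale_norm(A)

agree = Z1 ≈ Z2   # the two versions should match up to rounding
```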
Note that there is no “standard” scaling; it really depends on the purpose and the context. See, e.g.,
http://www.stat.columbia.edu/~gelman/research/published/standardizing7.pdf
On a related note, any reasonable scaling will serve for numerical purposes.