Normalizing Values and Floating Point Error

ianroddis · May 22, 2023, 11:26am

Sorry if this isn’t the right place to post this question, but it was as close as I could think for the intersection of concerns.

Most sources I’ve read on data modelling suggest that, when required, values be normalized/standardized ~ N(0,1). If those values are going to be used in matrices for modelling, though, what are the considerations for floating point error? On my system, the machine epsilon is ~ 1e-7 for 64-bit floats (doubles). For very large data sets, there could be many values affected by roundoff error.

My question is: does it make sense to use a different parameterization of N, say N(0, 100), to avoid roundoff error in this case?

DNF · May 22, 2023, 11:32am

Do you mean for 32-bit floats? Machine epsilon for 64-bit numbers (at 1.0) is about 2.2e-16, and that doesn’t depend on the system; it’s a property of the floating point format itself.

ianroddis · May 22, 2023, 11:41am

You are right, I misinterpreted another value and thought epsilon was 1e-7, so my concerns are likely invalid.

I’d still be curious to know if the conventional wisdom of standardizing to N(0,1) stands for very large data sets (200m+ rows)?

GunnarFarneback · May 22, 2023, 11:48am

The machine epsilon is a relative measure. Scaling all values (within reasonable limits) won’t make any difference to floating point roundoff errors.
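As a rough illustration of that point, here is a minimal sketch. The data set `xs` and the helper `relerr` are hypothetical names made up for this example, and `sin(i)` is only a deterministic stand-in for standardized values. It measures the relative roundoff error of a Float32 sum of squares against the same computation carried out in Float64, at several scales.

```julia
# Hypothetical stand-in for standardized data: deterministic values in [-1, 1].
xs = [sin(i) for i in 1:100_000]

# Relative roundoff error of a Float32 sum of squares, measured against a
# Float64 computation on the same (already rounded) Float32 inputs.
function relerr(scale)
    x32 = Float32.(scale .* xs)
    approx = sum(abs2, x32)            # accumulated in Float32
    exact = sum(abs2, Float64.(x32))   # accumulated in Float64
    return abs(approx - exact) / exact
end

for scale in (1.0, 100.0, 1024.0)
    println("scale = ", scale, "  relative error = ", relerr(scale))
end
```

Scaling by a power of two such as 1024 only shifts the exponents, so the relative error should come out identical to the unscaled case. Scaling by 100 changes the individual roundings, but the relative error should stay at the same order of magnitude.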

ianroddis · May 22, 2023, 11:55am

@GunnarFarneback Could you expand on that a bit, please? From my understanding, machine epsilon is the smallest difference that the type supports between two values … i.e. there are ranges of the real number line where a value will get rounded up or down. The impact of roundoff will increase the smaller the value being rounded.

If I scale all my features / targets to N(0,1), and do many multiplication operations on them, the values between (-1,1) will tend to get smaller and the effect of roundoff error will be magnified, no?

GunnarFarneback · May 22, 2023, 12:04pm

The distance between two “adjacent” floating point values depends on the magnitude of the values.

julia> nextfloat(1.0)
1.0000000000000002

julia> nextfloat(1.0) - 1.0
2.220446049250313e-16

julia> nextfloat(1024.0)
1024.0000000000002

julia> nextfloat(1024.0) - 1024.0
2.2737367544323206e-13

julia> eps(1.0)
2.220446049250313e-16

julia> eps(1024.0)
2.2737367544323206e-13
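To connect this to the question about small values, a short sketch (the helper name `rel_spacing` is made up for this example) that divides the spacing by the value itself. For normal Float64 values the ratio should stay between `eps(1.0) / 2` and `eps(1.0)` regardless of magnitude, so values that shrink under repeated multiplication keep roughly the same number of significant digits until they get close to underflow.

```julia
# Spacing between adjacent Float64 values, relative to the value itself.
rel_spacing(x) = eps(x) / x

for x in (1e-300, 1e-8, 1.0, 1024.0, 1e8, 1e300)
    println("x = ", x, "  eps(x) = ", eps(x), "  eps(x)/x = ", rel_spacing(x))
end
```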

ianroddis · May 22, 2023, 12:04pm

I see, that’s really helpful, thank you!
