Occasionally NaNs when using similar()

Then don’t use similar (or initialize the array properly afterwards, at which point you are likely paying the cost of the alternatives anyway); i.e. use zeros, ones, or fill.

It’s not possible, even in theory, to eliminate NaNs from similar, because the whole point of similar is an optimization: it does not initialize the memory. You get whatever garbage was left there earlier, and some of those bit patterns are NaNs (and, with much lower but non-zero probability, Infs too).
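To make the difference concrete, here is a minimal sketch comparing similar with the initializing constructors. The variable names are just for illustration.

```julia
x = rand(3)

a = similar(x)                  # same shape and eltype as x, contents undefined
b = zeros(eltype(x), size(x))   # same shape and eltype, every element is 0.0
c = fill(1.5, size(x))          # same shape, every element is 1.5

# Only b and c are safe to read before writing to them.
```

Reading `a` here may show zeros, tiny denormals, or NaNs depending on what the memory held before, which is why the bug only shows up occasionally.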

What WOULD be possible, and likely a good idea, is to partially initialize the array, i.e. set the first (and/or last) element to NaN to intentionally “poison the array” in a sense. Would that help you and make the problem obvious? Initialization would still be O(1) (instead of O(0)); the total is most likely a bit more, since memory allocation is not totally free.
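A rough sketch of what such O(1) poisoning could look like in user code. `poisoned_similar` is a hypothetical helper name, not something Base provides, and it is restricted to floating-point element types since only those have a NaN.

```julia
# Hypothetical helper: allocate like similar, then poison only the two ends.
function poisoned_similar(x::AbstractArray{T}) where {T<:AbstractFloat}
    y = similar(x)
    if !isempty(y)
        y[begin] = T(NaN)   # first element
        y[end] = T(NaN)     # last element
    end
    return y
end

y = poisoned_similar(rand(5))
s = sum(y)   # any reduction over the unwritten array now yields NaN
```

The cost is two stores regardless of array length, so the array is still effectively uninitialized, but reading it by mistake fails loudly more often.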

I’m not sure many other languages give you an equivalent of similar; I’m unsure about Python, and Java at least does not. But it’s implicitly the default in C (for malloc) all the time. You can do all your Julia programming without ever using similar, but then expect closer to Java speed. However, it’s probably overused, and you can usually get a lot of speed without it. similar is included in the language (strictly speaking, in the standard library) because Julia is a performance-obsessed language. Julia has no speed hindrances compared to ANY other language, except possibly the GC, and it can even be faster than C. Some things can slow down if not used right, or if similar is not used; e.g. the GC is an overhead, but the GC can always be avoided, at least in non-idiomatic code; e.g. with StaticCompiler.jl you don’t have the GC available at all.

You most likely had, and likely still have, a bug. It’s possible you still have a bug without similar, though e.g. zeros might make your code correct. It’s hard to say without seeing the code.

I don’t think it is feasible, or even always positive, to try and prevent every possible misplaced expectation by a user that does not read the manual. We should strive to do it, but not beyond the point of making the language less pleasurable to work with.

There is nothing wrong with similar(). It allocates an uninitialized object of a given size and element type. It’s used because it’s convenient and performant. But it guarantees nothing about the element values. Doing arithmetic with random, undefined values is never a good idea.

And there are functions that do provide guaranteed values (zeros, ones, fill, …).

The issue is that given an object x, e.g., a Float64 vector, the notion of “similarity” can have many different meanings. If what the user needs is to use the values of similar(x) in arithmetic, where do we stop? Does the user expect values in the same min-max range as x? All positive or all negative values? Following the same statistical distribution as x? Other constraints? It’s a rabbit hole I’m not keen to get into.

If you are addressing me, then similar is about getting the same type but also uninitialized, and maybe it’s actually a mistake to combine the two.

I mean, we are so used to asking for uninitialized memory (at least in C), since it’s a needed low-level step, but nobody ever asked to read those values. Actually ones (and possibly fill) is also redundant, and so is zeros, since zero should be the default if you enlarge an array.

Maybe you should only be allowed to construct arrays by using comprehensions, or alternatively only construct 0-sized arrays.

[Interesting two comments there.]

The only valid use of a (large) undef array is to initialize it next with something (or at least to do so before the relevant parts are read), and the compiler could do that lazily, to zeros. I.e. plausibly you could have a non-zero-sized undef array to write into (visible only to the compiler, as an optimization to avoid enlarging an array one element at a time) and extend the boundaries of valid values as you go; the bounds used for out-of-bounds checks could then differ for reading and for writing.

The comment was mainly for you, but the phone GUI makes it hard to understand whether I commented on the right thread :sweat_smile:

My position is that there is a good use case for the current similar. And therefore I would not change it, nor propose a radical restructuring of Julia to address a misplaced user expectation about it.

Notice that a user who wants to do arithmetic with the created vector would have similarly wrong semantic expectations UNLESS the constructor is explicit about the values.

That is, if we set similar(x) to return zeros, a user could ignore that and do
threes = similar(x) * 3.0
and be baffled by the result.

Or they could expect that mean(x) == mean(similar(x)).
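A small sketch of both misplaced expectations, using `zero(x)` as a stand-in for a hypothetical similar that returned zeros (the real similar makes no such promise):

```julia
using Statistics

x = [1.0, 2.0, 3.0]

z = zero(x)                    # stand-in for a hypothetical zero-returning similar(x)
threes_wrong = z * 3.0         # still all zeros, not threes
threes = fill(3.0, size(x))    # explicit about the value, so no surprise

same_mean = mean(x) == mean(z) # false: a zero default says nothing about x's values
```

An explicit constructor such as fill states the intended values in the call itself, which is the point being made above.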

Every implementation can be misused, but that doesn’t make it unsafe, right? I have the impression that here OP was simply misusing a well defined, legitimate, function.

My point is that this is not a well-defined request. When should this behavior happen? What if x has NaN or Inf? Are there other constraints? Any distribution to follow? Is it OK to always return the same value?

Here the OP was misusing similar, and you could say many people get away with using it, since it’s an optimization they know how to use.

That was my first argument, but note that ALL users CAN do without similar, as in Java, and I thought we should hide it from users so that they could opt into it when and if they understand it.

My latest thinking is that no one needs similar, it’s an optimization the compiler could do transparently.

We do have sizehint! and that’s OK. But we also have the potentially problematic resize! (which can e.g. introduce NaNs even if you never used similar):

[…] If n is larger, the new elements are not guaranteed to be initialized.

with the same problem as similar. It would be OK if you could resize! and then not read from the added area until you have written to it first. So my idea is this: Julia does bounds checking, so it needs to know the valid range [1, size_of_array]; in general the start is not 1, so it needs to know the actual start and end of an array. Julia could then have stricter bounds than those for reading from the array, and a default value, zero, for the added region. Does this make sense? Is it understandable what I’m saying?
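With today’s resize!, the safe pattern is to write to the added region before reading it. A minimal sketch:

```julia
v = [1.0, 2.0]
n_old = length(v)

resize!(v, 5)            # elements n_old+1:5 are not guaranteed to be initialized
v[n_old+1:end] .= 0.0    # write to the new region before anything reads it
```

After the explicit fill, `v` is fully defined; skipping that line is exactly how NaNs can appear without similar ever being called.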

Just as we have @inbounds (which may go away or become a no-op) to mark a block that disregards the bounds for reads and writes to an array, we could also have alternative bounds just for the reads, and if people are worried about the cost of bounds-checking those moving bounds, we could have a macro to disregard them.
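To illustrate the idea of separate read and write bounds, here is a toy user-level sketch. `ReadGuarded` is a hypothetical type invented for this example; the proposal above is about the compiler and runtime, which this does not model. It also assumes writes proceed roughly from the start, tracking a single high-water mark.

```julia
# Hypothetical wrapper: storage is allocated uninitialized, writes may go
# anywhere within it, and reads past the highest written index return zero
# instead of whatever garbage is in memory.
mutable struct ReadGuarded{T}
    data::Vector{T}   # allocated with undef
    hi::Int           # highest index written so far (the "read bound")
end

ReadGuarded{T}(n::Integer) where {T} = ReadGuarded{T}(Vector{T}(undef, n), 0)

function Base.setindex!(g::ReadGuarded{T}, v, i::Integer) where {T}
    checkbounds(g.data, i)        # the write bound is the full allocation
    for j in g.hi+1:i-1           # zero any gap so 1:hi is always defined
        g.data[j] = zero(T)
    end
    g.data[i] = v
    g.hi = max(g.hi, i)
    return g
end

function Base.getindex(g::ReadGuarded{T}, i::Integer) where {T}
    checkbounds(g.data, i)
    return i <= g.hi ? g.data[i] : zero(T)   # never exposes unwritten memory
end

g = ReadGuarded{Float64}(4)
before = g[1]    # nothing written yet, so the default zero comes back
g[3] = 5.0       # extends the read bound to 3 and zeroes index 2
```

The extra comparison on every read is the cost that a macro like the one suggested above would be meant to switch off.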

I think I understand your position, and I agree on the three points you make, but not entirely on the conclusion (I agree on the reading conditions, not on having a default value).

For similar:

  1. Why should the default value be zero(x)? Why not one(x)? Or oneunit(x)? Why not rand(typeof(x))? This is an arbitrary decision, and unless the semantics of the command make it explicit, it seems to invite misunderstandings.
  2. The fact that we can do without it does not mean we have to renounce it completely. And the fact that a small fraction of users can use a function wrongly does not mean we need to hide it (we would have to hide all functions, then). If there are 10% of use cases where that function is convenient, and the documentation makes it very clear what it does (as the docs for similar indeed do), why should we adopt a more cumbersome way of writing?

Maybe it’s simpler to have a lighter constructor that does what you are suggesting, called something like safe_similar(). Or we should just encourage using fill(3.0, n) if the intent is to get ones(n) .* 3.0.

And the same, for me, is true of resize!. Unless we are explicit in the semantics of the function, e.g., resize_zero!(), it seems more dangerous to have default values than undefined ones.

On the other hand, if we could add for free (without performance loss) something that prevents reading or doing arithmetic with undefined values, that would be nice.

The similar function is for low-level performance optimization (similar to @inbounds). It does exactly what it’s supposed to do: allocate an array without initializing it. That is, it avoids the extra cost of having to go through the entire array and setting it to some value. This kind of functionality is very important for high performance numerical code.

But it is dangerous if you don’t know what you’re doing: If you use similar (or any other uninitialized memory) you have to double and triple-check that you never use that memory in any computation without first writing to it.

If performance is not critical, always use zero instead of similar.

It actually would be nice if there was a “compiler option” (aka Julia command line flag) to always replace similar with zero – or, replace it with something that consistently “poisons” it with NaN. I’ve seen Fortran compilers with such options. If you have good tests any change in behavior between running with and without initialization can be an indicator that you really screwed up :slight_smile:
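No such Julia flag exists as far as this thread establishes, but the idea can be approximated in user code. In this sketch both `POISON_UNINITIALIZED` and `debug_similar` are hypothetical names, and only floating-point element types are handled.

```julia
# Hypothetical debug switch; flip it on in tests, leave it off in production.
const POISON_UNINITIALIZED = Ref(false)

# Hypothetical drop-in for similar that fills with NaN when the switch is on.
function debug_similar(x::AbstractArray{T}) where {T<:AbstractFloat}
    y = similar(x)
    POISON_UNINITIALIZED[] && fill!(y, T(NaN))   # O(n), debugging only
    return y
end

POISON_UNINITIALIZED[] = true
y = debug_similar(rand(3))   # every element is NaN while the switch is on
```

Running a test suite once with the switch on and once with it off, and comparing results, is the kind of check described above.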

Do not be cavalier about this. This is the most common and most serious bug in numerical computing. You must fix your code. Rules 1, 2, and 3 of numerical computing are “Make sure every memory I access is initialized”.

If you do not fix this, your code will eventually give you wrong results without you noticing – that’s the kind of bug that will lead to having to retract a paper, or derail an entire PhD.

As I explained, “poisoning” the whole array with NaNs would defeat the speed advantage of similar. The same goes for 0-initializing, but I think it might be valid to write just the first few bytes of an array, which is an O(1) operation.

Doing that by default would be OK, and poisoning the full array could be some debug opt-in.

I would suggest initializing the first bytes of an array to NaN, NaN32 (just in case some code uses that or reinterprets), then NaN16, for a total of 112 bits, i.e. 14 bytes, less than a cache line on all CPUs. That can be written with one store instruction, at least on ARM(64), which allows storing a pair of words.

For these concatenated bitstrings:

julia> bitstring(NaN)
"0111111111111000000000000000000000000000000000000000000000000000"

julia> bitstring(NaN32)
"01111111110000000000000000000000"

julia> bitstring(NaN16)
"0111111000000000"

Since similar knows the type, maybe it should just choose one of those and store only that one? It wouldn’t cover reinterpreting though… Also, some types, such as integers, have no NaN. But reading those bytes as integers will get you zeros (not at the beginning, reading as Int8) or e.g. 127 or an even higher value.
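For reference, the 14-byte prefix described above can be assembled and checked like this. It is only a sketch of the byte layout, not a proposal for how similar would write it, and the byte order within each value depends on the machine’s endianness.

```julia
# Concatenate the raw bytes of NaN (Float64), NaN32 and NaN16.
prefix = vcat(reinterpret(UInt8, [NaN]),
              reinterpret(UInt8, [NaN32]),
              reinterpret(UInt8, [NaN16]))

nbytes = length(prefix)                        # 8 + 4 + 2 bytes
first_f64 = reinterpret(Float64, prefix[1:8])[1]   # the Float64 view of the start
```

Reading the start of the prefix as Float64 gives back a NaN; what other element types see at the same offsets depends on how the bytes line up, which is the reinterpretation caveat raised above.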

Yes, of course! The similar routine should not change its (default) behavior in any way. It works exactly as intended, and it’s a crucially important function.

The same goes for 0-initializing, but I think it might be valid to write just the first few bytes of an array, which is an O(1) operation. Doing that by default would be OK, and poisoning the full array could be some debug opt-in.

I don’t think poisoning the first entry of a similar array (by default) would be okay. The similar function should allocate an array and not touch it otherwise. Even an O(1) operation might have unforeseen consequences on the final machine code and the potential compiler optimizations that might result.

“Compiler flags” to change the behavior of similar to either zero out or poison the array would be very useful, but they are debugging/testing tools only and would carry a performance penalty.

No, I don’t think so. Julia handles the allocation: your code jumps to that routine, and I don’t see it inlined, nor do I think that would ever happen, even partially.

So I think you do not have to worry about the size of your code, nor about the performance of yours or of Julia’s in general. As for compiler optimizations, I at least don’t see any being prevented (I might be wrong), and I explained in the issue I opened that this might actually be faster (since you gain the trivial cost back). It would NOT be faster if you initialized the whole array, and that would be redundant with e.g. zeros, which is why I’m not proposing that.

Well, fair enough… I think you have a much better understanding of the low-level details. Just to clarify what I really meant:

  • The official semantics of similar is that the content of the array is uninitialized. That is, the user should assume that the result contains random bits or NaNs. Of course, writing zeros or NaNs or anything else to any one or to all elements doesn’t conflict with these semantics, as long as it’s not documented as “official behavior”.
  • The reason people use similar is for performance, usually because they then want to use that array in some BLAS function where they know that it will get overwritten.
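The typical legitimate pattern from the second point looks roughly like this: allocate the output with similar, then hand it to an in-place routine that overwrites every element. The matrix sizes are arbitrary.

```julia
using LinearAlgebra

A = rand(3, 4)
B = rand(4, 2)

C = similar(A, size(A, 1), size(B, 2))   # 3×2 output buffer, contents undefined
mul!(C, A, B)                            # overwrites all of C with A * B
```

Here the uninitialized contents of `C` are never read, so skipping the initialization is pure savings.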

So I would argue that any implementation of similar should optimize for performance above all other concerns. If putting a NaN in the first elements doesn’t affect performance (or even improves it for some deeply unintuitive reason), that seems fine. And sure, if the implementation is maximally performant and also tends to blow up when used improperly, that’s a win.

When running in some specified “debug mode”, the “maximally performant” no longer applies, and putting anything into the array that might help with identifying errors is absolutely fair game.

A small, marginal issue with this is that similar() handles also stuff like Char and I don’t think isnan(x::Char) is defined.

Or isnan or equivalent for any other custom type.

Char is always Unicode (except when it isn’t…) i.e. a 21-bit integer stored in Int32 (which implements Char).

So would the top 64-21 = 43 (even 44) bits need to be zeros? Only for a valid Char. The point is kind of to get something nonsensical, so maybe it’s good here if these bits, or even all the bits, were ones? It’s not a legal NaN or “NaN-Char” (there is no such thing, but this could be it), but Julia’s strings can store arbitrary bytes and do not fail if you add such a Char (which is a bit controversial); you would have potential trouble later if you do not validate strings.

julia> reinterpret(Char, Int32(-1))
'\xff\xff\xff\xff': Malformed UTF-8 (category Ma: Malformed, bad data)

julia> "" * reinterpret(Char, Int32(-1))
"\xff\xff\xff\xff"
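Such an all-ones Char can at least be detected after the fact with isvalid, which is the validation step referred to above. A short sketch:

```julia
c = reinterpret(Char, Int32(-1))   # all 32 bits set, a malformed Char
s = "" * c                         # concatenation succeeds without complaint

char_ok = isvalid(c)       # false: not a valid Unicode character
string_ok = isvalid(s)     # false: the string holds invalid UTF-8
```

So the value plays a NaN-like role only if the code checks for it; nothing propagates automatically the way a floating-point NaN does.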

Note, here I’m explaining an array of Char, not to be confused with a String (a Vector of bytes, but actually a pointer to one, so either one such or a Vector of such). Strings, as opposed to Char, are covered below:

For the same reason this would work for any array

This is in effect what you get for any bitstype, and I think Julia already takes care of non-bitstype arrays (i.e. doesn’t allow random pointer behind the scenes).

Note, for the raw memory, if interpreted as floats, you will have NaNs occasionally, and I’m not suggesting anything else than forcing it, but only for a tiny prefix of the array. How the array gets reinterpreted into its intended type is another story. But one series of bits seems to work for all NaN-supporting types, and others.

Note that I edited my post, the one right above, which you’re answering. It probably wasn’t too clear, and it seems I partially wrote the opposite of what I meant. So maybe you do not have any disagreement with what I meant to write. Is it clear now? I do not want safe_similar; the same similar would be used, but yes, it would be safer… a) separate bounds for reads and writes would do that, and/or b) the other option that I suggested in the issue (since closed at JuliaLang) for O(1) poisoning of arrays.

Fully agree with your points @goerz and thank you for your concern.
Since the day of making my reply here, I’ve done away with similar() and now ‘properly’ initialize my arrays with home-made spinoffs of the fill() function. In my case, the speed of initialization was never a concern (at least not yet).

For anyone encountering similar problems (hehe) and looking for workarounds:

I now use myzeros(x) (or myfill(x, value)). The function’s task is to output a data structure of the same size as x filled with zeros (or with value), while making sure that the eltype is the same as x’s.
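The poster’s actual definitions are not shown, so the following is only a guess at what such helpers might look like, built on similar plus fill!:

```julia
# Possible definitions (assumed, not the poster's code): allocate with the
# same shape and eltype as x, then immediately overwrite every element.
myfill(x::AbstractArray, value) = fill!(similar(x), value)   # value is converted to eltype(x)
myzeros(x::AbstractArray) = fill!(similar(x), zero(eltype(x)))

x = [1, 2, 3]            # Vector{Int}
z = myzeros(x)           # [0, 0, 0], still Vector{Int}
f = myfill(x, 7.0)       # [7, 7, 7]; 7.0 is converted to Int
```

Because fill! runs before the array is returned, the uninitialized state never escapes the helper.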
