Type of array index?

I wonder what the type of indices is for standard arrays? Int, Int64, or . . . ?

I have this use in mind at present:

# Collect indices for which v[i] < thresh
function partition_vec(v, thresh)
  idx = Vector{Int??????}()
  for i in eachindex(v)
    if v[i] < thresh
      push!(idx, i)
    end
  end
  idx
end
idx = partition_vec(rand(Float64,30), 0.3)

It’s not that I’m likely going to use huge arrays for which the range of index matters, but we’ve seen the problem of int as size in C Language. You are supposed to use size_t but a lot of programmers used int instead and a lot of code broke when programs started to allocate arrays over 2 GB.

A related question is, how does one get the index type of an array? The above code would be cleaner if it were

function partition_vec(v, thresh)
  idx = Vector{index_type_of(v)}()
  . . . .

This is overkill for most cases. I’m asking this just out of curiosity.

1 Like

Int64

julia> v = rand(30)
julia> thresh = 0.3
julia> idx = filter(i->v[i] < thresh, 1:length(v))
11-element Vector{Int64}:
  2
  4
  7
  8
 12
 17
 18
 24
 27
 28
 29
2 Likes

For a “standard” linearly indexed array it is Int, which on a 64-bit Julia equals Int64 and on a 32-bit Julia equals Int32. Generally you can use any subtype of Integer for doing the indexing.

typeof(first(eachindex(v))) should work for your purpose.

Notice that this may be something other than Int, e.g. for
v = view(rand(10, 10), 1:2, 1:2) or v = BigInt(1):BigInt(2).

10 Likes

There is a keytype function that you can use:

julia> v = [1, 2, 3];

julia> keytype(v)
Int64

julia> v = view(rand(10, 10), 1:2, 1:2);

julia> keytype(v)
CartesianIndex{2}
11 Likes

It doesn’t necessarily match the type given by eachindex though?

julia> I = BigInt(1):BigInt(2)
1:2

julia> keytype(I)
Int64

julia> typeof.(eachindex(I))
2-element Vector{DataType}:
 BigInt
 BigInt

Bug or feature?

1 Like

This code seems to break your typeof.(eachindex(I)) solution:

I = BigInt(1):BigInt(10)^100
typeof.(eachindex(I))   # ERROR: InexactError: Int64(1000000000000000000000000000000000000000 ...
typeof.(size(I))        # (BigInt,)

keytype disagrees with eachindex in more common cases as well:

julia> A = [1 2; 3 4]
julia> keytype(A)
CartesianIndex{2}
julia> eachindex(A) .|> typeof
4-element Vector{DataType}:
 Int64
 Int64
 Int64
 Int64

If the partition_vec example is close to your actual code, then it’s easier to rewrite using findall or filter:

partition_vec(v, thresh) = findall(<(thresh), v)
# or
partition_vec(v, thresh) = filter(i -> v[i] < thresh, eachindex(v))

The resulting vector type is properly inferred here.

1 Like

That was an illustration, not a solution. typeof(first(eachindex(I))) was the proposed solution and does work for this range. That broadcasting cannot create arbitrarily large vectors is a different matter (and you probably don’t have enough memory for a 10^100 element vector).

1 Like

This option does not seem to infer the index type, but your second one does:

v = 1.5:BigInt(1):10.5
typeof(first(eachindex(v)))             # BigInt  (and eltype(v) is BigFloat)
findall(<(4), v)                        # 3-element Vector{Int64}
filter(i -> v[i] < 4, eachindex(v))     # 3-element Vector{BigInt}

Why should it be exactly the same type as first(eachindex(v))? I meant that the resulting type is concrete and suitable for indexing the original collection.

Again, the difference is present in more common cases of matrices as well: findall returns CartestianIndexes, while eachindex returns Ints.

I think you are a bit too hung up on the type of indices. You can index into an array using whatever Integer type you want:

1.7.1> a = rand(5);

1.7.1> a[UInt8(3)]
0.45629312424449464

1.7.1> a[big(3)]
0.45629312424449464

1.7.1> a[3]
0.45629312424449464

So the index type is not something that you should concern yourself with in almost any case. Just use Int, which corresponds to the default machine integer type on your system (64 or 32 bit).

They keytype function exists, but as its docstring says, it

is provided mainly for compatibility with the dictionary interface.

Don’t think about it. And anyway, you should use findall for this.

Edit: But, okay, let’s say you really need this, for some reason. Then you can still use whatever integer type you want:

idx = UInt[]
for i in eachindex(v)
    push!(idx, i)

will work, and so will

idx = Int8[]
for i in eachindex(v)
    push!(idx, i)

since i is converted into eltype(idx) when you push!. You just have to make sure that the integer type is big enough to hold the values you need.

2 Likes

Off-topic: Several people have suggested filter-type solutions. If that were actually what I wanted to accomplish, I would use such a solution. (I’m familiar with functional programming.) But, my real code needs to get both sides and is, currently,

function partition_vals(vals, thresh)
  idx_l = Vector{keytype(vals)}() # I've learned this trick from this thread
  idx_h = Vector{keytype(vals)}()
  for i in eachindex(vals)
    (vals[i] < thresh) ? push!(idx_l, i) : push!(idx_h, i)
  end
  (idx_l, idx_h)
end
# . . .
idx_l, idx_h = partition_val(vals, thresh)

That’s why I named my function “partition”. This is an example from Ruby:

[3, 10, 5, 9].partition {|v| v < 6} # => [[3, 5], [10, 9]]

(Not functionally equivalent, because the above doesn’t return the indices, but you get the idea.)

Okay, so, I thought about it further and have come up with this solution:

function partition_vec(v, thresh)
  idx_l = findall(<(thresh), v)
  idx_h = setdiff(collect(eachindex(v)), idx_l)
  (idx_l, idx_h)
end

I’m happy with it. Thanks!

You can do this, but as I said, keytype doesn’t really mean anything special for arrays, use whichever Integer type that suits your application, in fact, the most idiomatic would be to write

idx_l = Int[]
idx_h = Int[]

As for this:

Rule of thumb: Never use collect. 99% of the time, it just needlessly creates a temporary array.

Instead, write:

idx_h = setdiff(eachindex(v), idx_l)
1 Like

However, fwiw, using deleteat!() with collect() seems to run much faster than setdiff() without it:

function partition_vec(v, a)
    idx_l = findall(<(a), v)
    idx_h = deleteat!(collect(axes(v,1)), idx_l)
    (idx_l, idx_h) 
end

Faster with collect? That must be a bug :face_with_raised_eyebrow:

1 Like

Thank you for the very useful advice!

So, I wonder where one finds a documentation that includes the interfaces?

The main Julia website offers an extremely friendly and comprehensive set of “tutorials”. That’s where I learned about setdiff. But all the examples, as far as I recall, use vectors for its arguments.

It would be nice if there is a set of documentation where one finds something like

setdiff(Vector{T}, . . . ) where T
setdiff(UnitRange{T}, . . . ) where T
. . . .

I use a julia-mode on emacs and it shows something like this

. . . .
setdiff(s::Core.Any)
setdiff(itr::Core.Any, itrs::Vararg{Core.Any})

taken from the source code of the standard library, which isn’t very helpful as a description of the user interface of setdiff, except to people who have a lot of experience in implementing Julia libraries. (In hindsight, the name “itr” must mean “iterator”, which suggests that both UnitRange and Vector are treated like an iterator by setdiff, . . .)

I don’t really think that’s going to happen. Look at the documentation of setdiff Collections and Data Structures · The Julia Language

setdiff(s, itrs...)
Construct the set of elements in s but not in any of the iterables in itrs. Maintain order with arrays.

It describes a completely generic function, which is how it should be.

2 Likes

Is there a reason it’s Int and not UInt by default? It looks like Julia doesn’t support the negative indexing to access from the back that some scripting languages allow, so if it’s not permissible wouldn’t UInt be a better choice?

You are supposed to use size_t but a lot of programmers used int instead and a lot of code broke when programs started to allocate arrays over 2 GB.

And many just use a default signed integer and personally I see no reason for this. Sometimes I do get lazy and just use unsigned instead of size_t, but, indexing into a std::vector with a negative number is never going to work, so I always wonder why it’s so commonplace, as it just seems to be asking for bugs leaving a door open that can so easily be closed!

You want to be able to do e.g. for i in 1:10 # Not for i in UInt(1):UInt(10) or even use: 0x0000000000000001:0x000000000000000a

Note, Int (and integer literals), is 64-bit in Julia (except on 32-bit platforms, then 32-bit). That means the 1 bit lost to the sign isn’t a huge deal, i.e. 63 bits indexing is enough (on 32-bit it’s 31-bit indexing). Historically in C/C++ 16-bit indexing was much better than 15-bit that signed would have given (only 32767 indexes, half your memory likely; Julia doesn’t officially support 8- or 16-bit platforms, but HAS been run on 8-bit Arduino, then Int is still not smaller than 32-bit). Note the implication “bit width of std::ptrdiff_t is not less than 17” which seems awkward, in practice likely means “not less that 32”.

Int, as signed, is much better in general (e.g. for literals), unsigned is a potential trap.

That’s true in C++, but not in general with Julia. While, by default, arrays start at 1, they can start at an arbitrary index, e.g. -1, when using OffsetArrays (you can also do similar in Fortran), and it can be very practical. See also… unpractical/joke (that I doubt you can do in Fortran):

If you declare your i index signed (implicitly, as you do), and would index with it as an unsigned, then you can cast with i % UInt64, implicitly or explicitly, but while that way it’s fast, you use loose out on bounds-checking, or you would check for negative numbers, an extra check slowing down (though it might bet optimized away since redundant with the check i < 1).

Using unsigned, unsigned int, int (or even size_t it, seems in, extreme cases) is a potential bug in C/C++:

The bit width of size_t is not less than 16. (since C99)

https://en.cppreference.com/w/cpp/types/size_t

On many platforms (an exception is systems with segmented addressing) std::size_t can safely store the value of any non-member pointer, in which case it is synonymous with std::uintptr_t.

std::size_t is commonly used for array indexing and loop counting. Programs that use other types, such as unsigned int, for array indexing may fail on, e.g. 64-bit systems when the index exceeds UINT_MAX or if it relies on 32-bit modular arithmetic.

When indexing C++ containers, such as std::string, std::vector, etc, the appropriate type is the member typedef size_type provided by such containers. It is usually defined as a synonym for std::size_t.

The integer literal suffix for std::size_t is any combination of z or Z with u or U (i.e. zu, zU, Zu, ZU, uz, uZ, Uz, or UZ). (since C++23)

https://en.cppreference.com/w/cpp/types/ptrdiff_t

std::ptrdiff_t is the signed integer type of the result of subtracting two pointers.

The bit width of std::ptrdiff_t is not less than 17. (since C++11)
Notes
std::ptrdiff_t is used for pointer arithmetic and array indexing, if negative values are possible. Programs that use other types, such as int, may fail on, e.g. 64-bit systems when the index exceeds INT_MAX or if it relies on 32-bit modular arithmetic.