Is array/vector handling this hard?

voroskoi · September 19, 2023, 5:56pm

Hi,

I just started learning Julia and I try to solve some basic ML tasks. It is quite common to generate a vector of indices for the training data and I would like to generate the complement.

For example I have a Vector of labels with a length of 10. I pick some indices.

selected = [1,3,5,8,9]
not_selected = Int[]
for i in 1:length(labels)
    if i ∉ selected
        push!(not_selected, i)
    end
end

I would like to write something like not_selected = 1:length(labels) .∉ selected, but that complains because the dimensions do not match. So I write not_selected = 1:length(labels) .∉ Ref(selected), (which I do not understand btw) and it is almost what I want, but it returns a BitVector of the comparison results. Now I should create an other vector from the BitVector so I give up.
There must be a more elegant way, is not?

Now I can do this (features is a Matrix)

features_train = features[selected, :]
labels_train = labels[selected]
features_test = features[not_selected, :]
labels_test = labels[not_selected]

If the dataset is huge copying those Matrices/Vectors could be a problem, so I tried to remove rows from the initial data, but that does not look nice either:

function removeRows(A::Array, ints)
    rows, cols = size(A)
    out = similar(A, rows - length(ints), cols)
    i = 1
    for r in 1:rows
        if !(r in ints)
            @inbounds out[i, :] = A[r, :]
            i += 1
        end
    end
    return out
end

There is deleteat!, but only for Vectors Oh, come on!

I probably do not find the proper functions, methods yet, but it is frustrating.

One more thing: I see DataFrames quite often, but I can not figure out what it can offer beyond the Base Array/Vector types besides named columns. Am I missing something there too?

Thanks,

JonasWickman · September 19, 2023, 6:12pm

Here are two possibilities for getting the index complements:

U = 1:10
selected = [1,3,5,8,9]
not_selected = setdiff( U , selected )
not_selected = [ i for i ∈ U if i ∉ selected ]

DNF · September 19, 2023, 6:18pm

Could you clean this up a bit? labels and ids are used without being defined. Did you change names half way? Anyway, I think this is quite simple to do, if you just clarify the intent.

abraemer · September 19, 2023, 6:18pm

Using a vector of indices is completely fine by itself. However in your case, where you also want a complement it is not the best choice. You could just use a BitVector of indexing.

mask = falses(size(features, 1))
mask[selected] .= true
# now you can do
features[mask, :] # selected features
features[.!(mask), :]  # not selected features

If the data is very large you can also consider using views. For that just prepend a slicing operation with @view like in @view features[.!(mask), :]

Also Julia arrays are column-major, so array[indices, :] will generically be slower than array[:, indices]

voroskoi · September 19, 2023, 6:24pm

Oh, I have not found setdiff yet. Nice.

voroskoi · September 19, 2023, 6:25pm

I am sorry, I have fixed the example.

voroskoi · September 19, 2023, 6:56pm

Thanks for the insightful answer. I changed the code to use BitVector and @view, slicing is quite amazing.
Regarding Your final sentence: If julia is column-major the datasets should be also column-based, but it is not the case. Are there folks out there who transpose them before using?

Nathan_Boyer · September 19, 2023, 7:01pm

One method is to use the InvertedIndices.jl package.(You still did not define labels.)

julia> using InvertedIndices

julia> labels = 101:110
101:110

julia> selected = [1,3,5,8,9]
5-element Vector{Int64}:
 1
 3
 5
 8
 9

julia> labels[Not(selected)]
5-element Vector{Int64}:
 102
 104
 106
 107
 110

DataFrames are essentially a collection of related vectors. Often the rows will store data corresponding to each other, and the types of each column can be different. A DataFrame is faster and more convenient than trying to keep track of a Matrix{Any} for this type of data. However, if your data is simpler, vectors and matrices are perfectly fine.

julia> using DataFrames

julia> df = DataFrame(id=[1001, 1002], first_name=["John", "Sally"], last_name=["Smith", "Jones"], score = [57.3, 94.8])
2×4 DataFrame
 Row │ id     first_name  last_name  score
     │ Int64  String      String     Float64
─────┼───────────────────────────────────────
   1 │  1001  John        Smith         57.3
   2 │  1002  Sally       Jones         94.8

julia> df[df.score.>70, :]
1×4 DataFrame
 Row │ id     first_name  last_name  score
     │ Int64  String      String     Float64
─────┼───────────────────────────────────────
   1 │  1002  Sally       Jones         94.8

jmair · September 20, 2023, 5:27am

In my experience, we always use the final dimension to store the number of samples in datasets. This is already done by default in MLDatasets.jl which lets you easily access common datasets used in ML benchmarks. For example MNIST (Vision · MLDatasets.jl) exposes the features for the training set as a "28x28x60000"array, which is the more efficient way for Julia.

Nathan_Boyer · September 20, 2023, 1:40pm

I forgot to explain this part of your post.

If labels=101:110 and selected=[1,3,5,8,9], then the broadcasting in 1:length(labels).∉selected is trying to perform this operation:

1 ∉ 1  # false
2 ∉ 3  # true
3 ∉ 5  # true
4 ∉ 8  # true
5 ∉ 9  # true
6 ∉ ?  # ?
7 ∉ ?  # ?
8 ∉ ?  # ?
9 ∉ ?  # ?
10 ∉ ?  # ?

but it can’t because the dimensions don’t match. Ref is a common/fast method to “shield” a collection from the broadcasting, so the operations become instead:

1 ∉ [1,3,5,8,9]  # false
2 ∉ [1,3,5,8,9]  # true
3 ∉ [1,3,5,8,9]  # false
...

A more intuitive way to achieve the same result is to put selected in a Vector or Tuple. Then the broadcasting will be applied to that one-element outer collection, and one-element collections have the special property of being repeated during broadcasting.

julia> typeof(Ref(selected))
Base.RefValue{Vector{Int64}}

julia> 1:length(labels) .∉ Ref(selected)
10-element BitVector:
 0
 1
 0
 1
 0
 1
 1
 0
 0
 1

julia> typeof([selected])
Vector{Vector{Int64}} (alias for Array{Array{Int64, 1}, 1})

julia> 1:length(labels) .∉ [selected]
10-element BitVector:
 0
 1
 0
 1
 0
 1
 1
 0
 0
 1

julia> typeof((selected,))
Tuple{Vector{Int64}}

julia> 1:length(labels) .∉ (selected,)
10-element BitVector:
 0
 1
 0
 1
 0
 1
 1
 0
 0
 1

Once you have one BitVector, the other can be easily constructed with broadcasted not .!.

not_selected_indices = .!selected_indices

tamasgal · September 20, 2023, 2:23pm

Depending on how many times you use the lookup (in), a Set might be a better data structure since an array lookup is O(n) and set is O(1).

julia> a = collect(1:1_000_000);

julia> @btime 1_000_000 ∈ a
  514.166 μs (0 allocations: 0 bytes)
true

julia> b = Set(a);

julia> @btime 1_000_000 ∈ b
  19.349 ns (0 allocations: 0 bytes)
true

Just keep in mind the overhead of creating the Set in first place. In the example above with lots of (different) elements, it can be quite slow:

julia> @btime b = Set($a);
  20.512 ms (7 allocations: 18.00 MiB)

voroskoi · September 20, 2023, 3:57pm

I have tried Iris, but that is so small it does not really matter. Thanks!

voroskoi · September 20, 2023, 4:23pm

Thank You very much for the detailed answer!

1 ∉ 1  # false
2 ∉ 3  # true
3 ∉ 5  # true
4 ∉ 8  # true
5 ∉ 9  # true
6 ∉ ?  # ?
7 ∉ ?  # ?
8 ∉ ?  # ?
9 ∉ ?  # ?
10 ∉ ?  # ?

This part actually shocked me. Not the size mismatch part, but with labels = [1,2,3,4,5] this actually returns [0,1,1,1,1], which would have been even more confusing. Broadcasting is powerful, but I have to keep in mind that is broadcasts both sides.

Yeah, this was clear after @abraemer’s reply, but it is powerful indeed, I have started using right away.

gustaphe · September 20, 2023, 4:31pm

You are not alone in thinking this is unintuitive. If Julia was in that stage of development my feeling is a majority would have voted to not have scalar numbers be iterable, so that at least 1 ∉ 3 gave a sensible error.

voroskoi · September 20, 2023, 4:57pm

This was not new to me, but this is really important and I will keep in mind with larger datasets. Thank You!

Topic		Replies	Views
Elegant way of exclusive indexing General Usage indexing , arrays	41	2276	February 26, 2022
Filter DataFrame by an Array New to Julia	8	5915	December 10, 2019
Finding indices using "setdiff"? New to Julia	2	710	July 29, 2019
Vector{Vector} indices General Usage indexing , arrays	22	2810	September 19, 2022
Deleteat!(trues(3), [true, false, false]) got "ArgumentError: indices must be unique and sorted" General Usage bug	18	1048	June 20, 2020

Is array/vector handling this hard?

Related topics