Is array/vector handling this hard?

Hi,

I just started learning Julia and I try to solve some basic ML tasks. It is quite common to generate a vector of indices for the training data and I would like to generate the complement.

For example I have a Vector of labels with a length of 10. I pick some indices.

selected = [1,3,5,8,9]
not_selected = Int[]
for i in 1:length(labels)
    if i βˆ‰ selected
        push!(not_selected, i)
    end
end

I would like to write something like not_selected = 1:length(labels) .βˆ‰ selected, but that complains because the dimensions do not match. So I write not_selected = 1:length(labels) .βˆ‰ Ref(selected), (which I do not understand btw) and it is almost what I want, but it returns a BitVector of the comparison results. :slight_smile: Now I should create an other vector from the BitVector so I give up.
There must be a more elegant way, is not?

Now I can do this (features is a Matrix)

features_train = features[selected, :]
labels_train = labels[selected]
features_test = features[not_selected, :]
labels_test = labels[not_selected]

If the dataset is huge copying those Matrices/Vectors could be a problem, so I tried to remove rows from the initial data, but that does not look nice either:

function removeRows(A::Array, ints)
    rows, cols = size(A)
    out = similar(A, rows - length(ints), cols)
    i = 1
    for r in 1:rows
        if !(r in ints)
            @inbounds out[i, :] = A[r, :]
            i += 1
        end
    end
    return out
end

There is deleteat!, but only for Vectors :slight_smile: Oh, come on!

I probably do not find the proper functions, methods yet, but it is frustrating.

One more thing: I see DataFrames quite often, but I can not figure out what it can offer beyond the Base Array/Vector types besides named columns. Am I missing something there too?

Thanks,

1 Like

Here are two possibilities for getting the index complements:

U = 1:10
selected = [1,3,5,8,9]
not_selected = setdiff( U , selected )
not_selected = [ i for i ∈ U if i βˆ‰ selected ]
3 Likes

Could you clean this up a bit? labels and ids are used without being defined. Did you change names half way? Anyway, I think this is quite simple to do, if you just clarify the intent.

Using a vector of indices is completely fine by itself. However in your case, where you also want a complement it is not the best choice. You could just use a BitVector of indexing.

mask = falses(size(features, 1))
mask[selected] .= true
# now you can do
features[mask, :] # selected features
features[.!(mask), :]  # not selected features

If the data is very large you can also consider using views. For that just prepend a slicing operation with @view like in @view features[.!(mask), :]

Also Julia arrays are column-major, so array[indices, :] will generically be slower than array[:, indices]

5 Likes

Oh, I have not found setdiff yet. Nice.

I am sorry, I have fixed the example.

Thanks for the insightful answer. I changed the code to use BitVector and @view, slicing is quite amazing.
Regarding Your final sentence: If julia is column-major the datasets should be also column-based, but it is not the case. Are there folks out there who transpose them before using?

One method is to use the InvertedIndices.jl package.(You still did not define labels.)

julia> using InvertedIndices

julia> labels = 101:110
101:110

julia> selected = [1,3,5,8,9]
5-element Vector{Int64}:
 1
 3
 5
 8
 9

julia> labels[Not(selected)]
5-element Vector{Int64}:
 102
 104
 106
 107
 110

DataFrames are essentially a collection of related vectors. Often the rows will store data corresponding to each other, and the types of each column can be different. A DataFrame is faster and more convenient than trying to keep track of a Matrix{Any} for this type of data. However, if your data is simpler, vectors and matrices are perfectly fine.

julia> using DataFrames

julia> df = DataFrame(id=[1001, 1002], first_name=["John", "Sally"], last_name=["Smith", "Jones"], score = [57.3, 94.8])
2Γ—4 DataFrame
 Row β”‚ id     first_name  last_name  score
     β”‚ Int64  String      String     Float64
─────┼───────────────────────────────────────
   1 β”‚  1001  John        Smith         57.3
   2 β”‚  1002  Sally       Jones         94.8

julia> df[df.score.>70, :]
1Γ—4 DataFrame
 Row β”‚ id     first_name  last_name  score
     β”‚ Int64  String      String     Float64
─────┼───────────────────────────────────────
   1 β”‚  1002  Sally       Jones         94.8
6 Likes

In my experience, we always use the final dimension to store the number of samples in datasets. This is already done by default in MLDatasets.jl which lets you easily access common datasets used in ML benchmarks. For example MNIST (Vision Β· MLDatasets.jl) exposes the features for the training set as a "28x28x60000"array, which is the more efficient way for Julia.

I forgot to explain this part of your post.

If labels=101:110 and selected=[1,3,5,8,9], then the broadcasting in 1:length(labels).βˆ‰selected is trying to perform this operation:

1 βˆ‰ 1  # false
2 βˆ‰ 3  # true
3 βˆ‰ 5  # true
4 βˆ‰ 8  # true
5 βˆ‰ 9  # true
6 βˆ‰ ?  # ?
7 βˆ‰ ?  # ?
8 βˆ‰ ?  # ?
9 βˆ‰ ?  # ?
10 βˆ‰ ?  # ?

but it can’t because the dimensions don’t match. Ref is a common/fast method to β€œshield” a collection from the broadcasting, so the operations become instead:

1 βˆ‰ [1,3,5,8,9]  # false
2 βˆ‰ [1,3,5,8,9]  # true
3 βˆ‰ [1,3,5,8,9]  # false
...

A more intuitive way to achieve the same result is to put selected in a Vector or Tuple. Then the broadcasting will be applied to that one-element outer collection, and one-element collections have the special property of being repeated during broadcasting.

julia> typeof(Ref(selected))
Base.RefValue{Vector{Int64}}

julia> 1:length(labels) .βˆ‰ Ref(selected)
10-element BitVector:
 0
 1
 0
 1
 0
 1
 1
 0
 0
 1

julia> typeof([selected])
Vector{Vector{Int64}} (alias for Array{Array{Int64, 1}, 1})

julia> 1:length(labels) .βˆ‰ [selected]
10-element BitVector:
 0
 1
 0
 1
 0
 1
 1
 0
 0
 1

julia> typeof((selected,))
Tuple{Vector{Int64}}

julia> 1:length(labels) .βˆ‰ (selected,)
10-element BitVector:
 0
 1
 0
 1
 0
 1
 1
 0
 0
 1

Once you have one BitVector, the other can be easily constructed with broadcasted not .!.

not_selected_indices = .!selected_indices
2 Likes

Depending on how many times you use the lookup (in), a Set might be a better data structure since an array lookup is O(n) and set is O(1).

julia> a = collect(1:1_000_000);

julia> @btime 1_000_000 ∈ a
  514.166 ΞΌs (0 allocations: 0 bytes)
true

julia> b = Set(a);

julia> @btime 1_000_000 ∈ b
  19.349 ns (0 allocations: 0 bytes)
true

Just keep in mind the overhead of creating the Set in first place. In the example above with lots of (different) elements, it can be quite slow:

julia> @btime b = Set($a);
  20.512 ms (7 allocations: 18.00 MiB)
1 Like

I have tried Iris, but that is so small it does not really matter. Thanks!

Thank You very much for the detailed answer!

1 βˆ‰ 1  # false
2 βˆ‰ 3  # true
3 βˆ‰ 5  # true
4 βˆ‰ 8  # true
5 βˆ‰ 9  # true
6 βˆ‰ ?  # ?
7 βˆ‰ ?  # ?
8 βˆ‰ ?  # ?
9 βˆ‰ ?  # ?
10 βˆ‰ ?  # ?

This part actually shocked me. Not the size mismatch part, but with labels = [1,2,3,4,5] this actually returns [0,1,1,1,1], which would have been even more confusing. Broadcasting is powerful, but I have to keep in mind that is broadcasts both sides.

Yeah, this was clear after @abraemer’s reply, but it is powerful indeed, I have started using right away.

1 Like

You are not alone in thinking this is unintuitive. If Julia was in that stage of development my feeling is a majority would have voted to not have scalar numbers be iterable, so that at least 1 βˆ‰ 3 gave a sensible error.

3 Likes

This was not new to me, but this is really important and I will keep in mind with larger datasets. Thank You!

1 Like