Count occurrences for matrix rows (where column order does not matter)

halleysfifthinc · December 8, 2022, 2:53pm

One fix is to track deleted items to prevent them from being re-added.

using StaticArrays  # provides SVector

function onlyoneof2(F)
    Ft = sort.(SVector{4}.(eachrow(F)))
    d = Dict{eltype(Ft),Int}()
    del = Vector{eltype(Ft)}()
    for i in eachindex(Ft)
        if haskey(d, Ft[i])
            delete!(d, Ft[i])
            push!(del, Ft[i])
        elseif Ft[i] ∈ del
            continue
        else
            d[Ft[i]] = i
        end
    end
    return collect(values(d))
end

Now suppose we modify the short F you gave so that a row occurs an odd number of times: F′=[1 2 3 4; 5 6 7 8; 4 3 2 1; 7 8 6 5; 1 2 5 6; 8 7 6 5; 5 6 7 8;1 2 3 4;]. The original version then re-adds the row on its third occurrence and reports a wrong index:

julia> onlyoneof(F)
2-element Vector{Int64}:
 5
 8 # wrong

julia> onlyoneof2(F)
1-element Vector{Int64}:
 5

I’m surprised (and somewhat confused) to find that this new version is faster!

julia> @benchmark onlyoneof(F4) evals=5
BenchmarkTools.Trial: 6531 samples with 5 evaluations.
 Range (min … max):  136.330 μs … 470.626 μs  ┊ GC (min … max): 0.00% … 59.64%
 Time  (median):     141.526 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   152.676 μs ±  50.194 μs  ┊ GC (mean ± σ):  6.92% ± 12.69%

  ▆█▂▂                                                       ▁▂ ▁
  █████▄▅▃▄▁▄▃▄▃▁▃▁▃▁▁▃▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██ █
  136 μs        Histogram: log(frequency) by time        384 μs <

 Memory estimate: 405.27 KiB, allocs estimate: 210.

julia> @benchmark onlyoneof2(F4) evals=5
BenchmarkTools.Trial: 9865 samples with 5 evaluations.
 Range (min … max):   89.992 μs … 308.097 μs  ┊ GC (min … max): 0.00% … 64.94%
 Time  (median):      95.344 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   100.918 μs ±  31.735 μs  ┊ GC (mean ± σ):  5.52% ± 11.16%

  ▆█▇▁                                                    ▁▁  ▁ ▂
  ████▆▆█▅▄▄▄▄▃▁▄▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄██▇▆█ █
  90 μs         Histogram: log(frequency) by time        280 μs <

 Memory estimate: 286.78 KiB, allocs estimate: 15.

Dan · December 8, 2022, 3:00pm

This doesn’t scale well to large F. Try benchmarking with at least 10_000 rows.

halleysfifthinc · December 8, 2022, 3:13pm

That does scale poorly compared to your function with increasing duplicates. Without a better picture of real F data, I don’t know whether that would be a problem in practice.

KevinMoerman · December 8, 2022, 3:18pm

Here is the MATLAB code for it, which on my machine can process a (25143270,4) array in about 6 seconds:

[~,ind1,ind2]=unique(sort(F,2),'rows'); %use unique
c=accumarray(ind2,1,[length(ind1) 1]); %Counts for unique set
L=c(ind2)==1;  %Expand counts to match input, and check for occurrence = 1

@Dan’s onlyoneofX code, and also my version onlyoneof3 below (based on @halleysfifthinc’s code above but using a Dict to track the deleted entries) both take about 7.5-8 seconds (I just used @elapsed a couple times) on my machine for a (25143270,4) array.

function onlyoneof3(F)
    Ft = sort.(SVector{4}.(eachrow(F)))
    d = Dict{eltype(Ft),Int}()
    del = Dict{eltype(Ft),Int}()
    for i in eachindex(Ft)
        if haskey(del, Ft[i])
            continue # Seen at least twice already, so never re-add
        elseif haskey(d, Ft[i])
            delete!(d, Ft[i]) # Remove from d
            del[Ft[i]] = i # Add to deleted
        else
            d[Ft[i]] = i
        end
    end
    return collect(values(d))
end
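
The two-container bookkeeping can probably be collapsed into a single dictionary by storing a sentinel for repeated rows. The following is only a sketch under assumptions: `rowkey` and `onlyoneof_onedict` are hypothetical names, and sorted tuples are swapped in for the SVectors so that nothing beyond Base is required. It is not claimed to be the code timed anywhere in this thread.

```julia
# Sketch only: `rowkey` and `onlyoneof_onedict` are hypothetical names.
# Keys are sorted NTuples rather than SVectors, so only Base is needed.
rowkey(r) = Tuple(sort!(collect(r)))

function onlyoneof_onedict(F)
    d = Dict{NTuple{4,Int},Int}()
    for (i, r) in enumerate(eachrow(F))
        k = rowkey(r)
        # First sighting stores the row index; any repeat overwrites it with 0.
        d[k] = haskey(d, k) ? 0 : i
    end
    # Keep only rows whose key was never repeated, in ascending order.
    return sort!([i for i in values(d) if i != 0])
end

# The modified example matrix from earlier in the thread (named Fodd here).
Fodd = [1 2 3 4; 5 6 7 8; 4 3 2 1; 7 8 6 5; 1 2 5 6; 8 7 6 5; 5 6 7 8; 1 2 3 4]
onlyoneof_onedict(Fodd)  # returns [5]
```

Because a repeated key is never removed from the dictionary, a third or later occurrence cannot re-add it, which is the failure mode the odd-count example exposes.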

Here is the code I am using to create that “giant ish” face array:

function grid3D(x,y,z)

    X = [ i for i ∈ x          , j ∈ 1:length(y), k ∈ 1:length(z) ]
    Y = [ j for i ∈ 1:length(x), j ∈ y          , k ∈ 1:length(z) ]
    Z = [ k for i ∈ 1:length(x), j ∈ 1:length(y), k ∈ z           ]

    return X, Y, Z
end


function getElements(ijk)
    
    #Cartesian index offsets
    iStep = CartesianIndex(1,0,0)
    jStep = CartesianIndex(0,1,0)
    kStep = CartesianIndex(0,0,1)

    F=zeros(Int64,6*length(ijk),4)
    for q=1:1:length(ijk)
        #Build 8-noded hex element
        e=[LinearIndices(sizV)[ijk[q]] LinearIndices(sizV)[ijk[q]+iStep] LinearIndices(sizV)[ijk[q]+iStep+jStep] LinearIndices(sizV)[ijk[q]+jStep]  LinearIndices(sizV)[ijk[q]+kStep] LinearIndices(sizV)[ijk[q]+iStep+kStep] LinearIndices(sizV)[ijk[q]+iStep+jStep+kStep] LinearIndices(sizV)[ijk[q]+jStep+kStep]]                

        F[1+(q-1)*6:q*6,:]=[e[[4 3 2 1]]; 
                            e[[5 6 7 8]]; 
                            e[[1 2 6 5]];
                            e[[2 3 7 6]];
                            e[[3 4 8 7]];
                            e[[4 1 5 8]];
                            ]
    end

    return F
end

s=0.01
r=1;
X,Y,Z= grid3D(-r:s:r,-r:s:r,-r:s:r)

M = sqrt.(X.^2 .+ Y.^2 .+ Z.^2)

siz=size(M)
sizV=siz.+1

L=M.<=(r+s/100) #Bool defining segmentation
ijk=findall(L) #Cartesian indices
F = getElements(ijk)

KevinMoerman · December 8, 2022, 3:21pm

Yes, it matters. I think I might have fixed your code, but performance is similar to that of the shorter one by Dan. I have now also provided example code to create big F arrays.

Dan · December 8, 2022, 3:25pm

Is it still too slow for your needs? b/c it can be made faster (>2x) than MATLAB. But I feel, correct me if wrong, this is decent performance.

KevinMoerman · December 8, 2022, 3:49pm

It would be good to find that “convincingly faster than MATLAB” solution, as I am trying to convince myself and my colleagues that Julia is the way to go for the future. I also have ever larger meshes at times so any time saved would be great.

Do you think that using additional outputs from the unique function (like for the MATLAB version), would perhaps lead to the fastest approach?

Dan · December 8, 2022, 4:00pm

The dictionary in Julia is essentially similar to unique+accumarray from MATLAB, and this is why the solutions take similar times. Julia is not magically faster in such small non-custom tasks (just as matrix multiplication is the same speed in MATLAB and Julia).
But Julia is fast overall, open, and easily extensible, so I would recommend it.
Any specific task, if scrutinized carefully, admits more performance squeeze. Do you really believe this problem will clinch it for the team?
(BTW the same performance squeeze is available in other “uncrippled” languages such as C/C++, JavaScript, Java, Rust.)
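
To make the correspondence with the MATLAB snippet concrete, here is a rough sketch of the same three steps using a dictionary: build a key per row, accumulate counts, then expand the counts back to a logical mask like `L`. The name `uniquemask` is hypothetical, and sorted tuples stand in for SVectors so that the block depends only on Base.

```julia
# Sketch only: `uniquemask` is a hypothetical name, not code from this thread.
function uniquemask(F)
    # One order-independent key per row (plays the role of sort(F,2)).
    ks = NTuple{4,Int}[Tuple(sort!(collect(r))) for r in eachrow(F)]
    counts = Dict{NTuple{4,Int},Int}()
    for k in ks
        counts[k] = get(counts, k, 0) + 1   # plays the role of accumarray
    end
    # Expand counts to match the input and test for a single occurrence,
    # like L = c(ind2) == 1 in the MATLAB version.
    return [counts[k] == 1 for k in ks]
end

Fodd = [1 2 3 4; 5 6 7 8; 4 3 2 1; 7 8 6 5; 1 2 5 6; 8 7 6 5; 5 6 7 8; 1 2 3 4]
findall(uniquemask(Fodd))  # returns [5]
```

The mask form costs a second pass over the keys, but it returns the result in row order without a final sort.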

Seif_Shebl · December 10, 2022, 4:32pm

countmap seems like a good idea. Combined with SVectors, the following is a little faster than the fastest onlyoneof above.

using StatsBase, StaticArrays  # countmap and SVector

function onlyoneof4(F)
    S = [sort(SVector{4}(r)) for r in eachrow(F)]
    return countmap(S)  # Dict mapping each sorted row to its occurrence count
end

czylabsonasa · December 12, 2022, 7:35am

In order to get a fair comparison, the sorted array of indices “must” be returned.

czylabsonasa · December 13, 2022, 11:18am

I found that for large F, the sort-based approach is the “most effective”.
Here is the r=1.0, s=0.01 case:

julia-1.8> include("facestest.jl")
  Activating project at `~/Asztal/git/discourse.julialang.org/faces`
size of F: 25143270
┌────────────────────────────────┬──────────┬──────────┬───────────┐
│                            fun │  time(s) │  mem(MB) │ allocs(#) │
├────────────────────────────────┼──────────┼──────────┼───────────┤
│                           sort │ 3.78e+00 │ 2.42e+03 │        10 │
│               countmap by @dan │ 4.37e+00 │ 2.64e+03 │        83 │
│ 2 dicts by @halleysfifthinc+OP │ 5.25e+00 │ 3.27e+03 │       432 │
│                         1 dict │ 3.86e+00 │ 2.62e+03 │        94 │
└────────────────────────────────┴──────────┴──────────┴───────────┘

Perhaps tweaking in this direction would lead to more acceptable results.
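
The benchmarked sort code is not shown in the post, so the following is only a guess at what a sort-based approach might look like, not the implementation behind the table. The idea is to order the row keys so that duplicates become adjacent, then keep every key that differs from both neighbours. `onlyoneof_sort` is a hypothetical name, and tuples are again used instead of SVectors to stay within Base.

```julia
# Sketch only: `onlyoneof_sort` is a hypothetical name, not the benchmarked code.
function onlyoneof_sort(F)
    ks = NTuple{4,Int}[Tuple(sort!(collect(r))) for r in eachrow(F)]
    p = sortperm(ks)               # equal keys become adjacent in this order
    n = length(p)
    out = Int[]
    for j in 1:n
        k = ks[p[j]]
        sameprev = j > 1 && ks[p[j-1]] == k
        samenext = j < n && ks[p[j+1]] == k
        # A key occurs exactly once when it differs from both neighbours.
        sameprev || samenext || push!(out, p[j])
    end
    return sort!(out)              # sorted row indices, for a fair comparison
end

Fodd = [1 2 3 4; 5 6 7 8; 4 3 2 1; 7 8 6 5; 1 2 5 6; 8 7 6 5; 5 6 7 8; 1 2 3 4]
onlyoneof_sort(Fodd)  # returns [5]
```

This version avoids hashing entirely and allocates only the key vector, the permutation, and the output.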
