Finding sub-arrays in an array?

I’m putting this here because this is a chemistry usage but it’s not really a chemistry-specific case.

I have an array that has a bunch of states formatted like this:

1.0 1.0 0.0 1.0 0.0 1.0 6380.35
1.0 1.0 1.0 1.0 0.0 1.0 6316.91
2.0 1.0 1.0 2.0 0.0 2.0 6444.27
2.0 1.0 2.0 1.0 0.0 1.0 11086.5

For any given line, columns 1:3 describe one state, columns 4:6 describe a second state, and column 7 is the frequency of the difference.

I have created a list of all the states (which is all the values 1:3 and 4:6 listed in an nx3 array) and of all the unique states, and I need to identify which states only appear once in the list. However,

findall(isequal(unst[1,:]), states[1:3,:])

is not working – it returns an empty array of Cartesian indices:

CartesianIndex{2}

Preferably, I would be able to find the index of every unique state in the overall nx7 list. Does anyone know how to do this?

Thanks!

The issue is that findall applies isequal(unst[1,:]) to each individual element of the matrix, rather than columnwise like you seem to want. Since no single number is equal to the vector unst[1,:], every comparison is false and you get an empty result.

Does

findall(isequal(unst[1,:]), eachcol(@view(states[1:3, :]))) # @view is optional

do what you want? This should slice the matrix into each 3-tall column and compare each of those to unst[1, :].

For example,

julia> findall(isequal([1;2;3]), eachcol([1;1;1;; 1;2;3;; 3;2;1;; 2;2;2;;]))
1-element Vector{Int64}:
 2
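
To make the difference concrete, here is a small sketch. The matrix `A` and vector `v` are made-up values, not the data from the question. As far as I understand `findall(f, A)`, it tests each element of `A` on its own, so you need `eachcol` or `eachrow` to hand it whole slices:

```julia
# Made-up example data: every column of A is a candidate 3-element state.
A = [1 1 3 2
     1 2 2 2
     1 3 1 2]
v = [1, 2, 3]

# Element-wise: each element of A is a single Int, which never equals the vector v.
elementwise_hits = findall(isequal(v), A)

# Slice-wise: each column is compared to v as a whole.
column_hits = findall(isequal(v), eachcol(A))

println(isempty(elementwise_hits))  # true
println(column_hits)                # [2]
```

The first call has the same shape as the one in the question, which is why it comes back empty.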

I’m actually trying to match rows, not columns. States is formatted like

  1  1   0
  1  1   1
  2  1   1
  2  1   2
  2  2   0
  2  2   1

and unst is formatted like

  1  1   0
  1  1   1
  2  1   1
  2  1   2
  2  2   0

The rest of my earlier comment is likely still relevant except that you might need to use eachrow(states[:, 1:3]) instead of eachcol(states[1:3, :]).
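
Applied to rows, that might look like the sketch below. It assumes the data is an n×7 matrix laid out as in the first post (the four sample lines are copied from there). The names `transitions`, `target`, `first_hits`, and `second_hits` are just placeholders I made up:

```julia
# Sample lines from the question: columns 1:3 and 4:6 are states, column 7 is the frequency.
transitions = [1.0 1.0 0.0 1.0 0.0 1.0 6380.35
               1.0 1.0 1.0 1.0 0.0 1.0 6316.91
               2.0 1.0 1.0 2.0 0.0 2.0 6444.27
               2.0 1.0 2.0 1.0 0.0 1.0 11086.5]

# The state to look for (hypothetical choice for this example).
target = [1.0, 0.0, 1.0]

# Row indices where the target appears as the first state (columns 1:3).
first_hits = findall(isequal(target), eachrow(@view transitions[:, 1:3]))

# Row indices where the target appears as the second state (columns 4:6).
second_hits = findall(isequal(target), eachrow(@view transitions[:, 4:6]))

println(isempty(first_hits))  # true
println(second_hits)          # [1, 2, 4]
```

The indices returned are row numbers in the full n×7 matrix, so you can use them directly to pull out the matching lines, e.g. `transitions[second_hits, :]`.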


In general, the NumPy/MATLAB/etc pattern of smooshing all your data into a mega-array and then slicing-and-dicing it during processing is not necessary in Julia (or other languages where non-arrays are performant). Where relevant, you might consider using types to organize your data a little more carefully. For example:

julia> struct StateDescriptor
           state1::NTuple{3, Int} # maybe even make a special type for these
           state2::NTuple{3, Int}
           freq_diff::Float64
       end

julia> bunch_of_states = [StateDescriptor((1,1,0), (1,0,1), 6380.35), StateDescriptor((1,1,1), (1,0,1), 6316.91), StateDescriptor((2,1,1), (2,0,2), 6444.27), StateDescriptor((2,1,2), (1,0,1), 11086.5)]
4-element Vector{StateDescriptor}:
 StateDescriptor((1, 1, 0), (1, 0, 1), 6380.35)
 StateDescriptor((1, 1, 1), (1, 0, 1), 6316.91)
 StateDescriptor((2, 1, 1), (2, 0, 2), 6444.27)
 StateDescriptor((2, 1, 2), (1, 0, 1), 11086.5)

julia> all_states = vcat([x.state1 for x in bunch_of_states], [x.state2 for x in bunch_of_states])
8-element Vector{Tuple{Int64, Int64, Int64}}:
 (1, 1, 0)
 (1, 1, 1)
 (2, 1, 1)
 (2, 1, 2)
 (1, 0, 1)
 (1, 0, 1)
 (2, 0, 2)
 (1, 0, 1)

julia> unique_states = unique(all_states)
6-element Vector{Tuple{Int64, Int64, Int64}}:
 (1, 1, 0)
 (1, 1, 1)
 (2, 1, 1)
 (2, 1, 2)
 (1, 0, 1)
 (2, 0, 2)

julia> findall(isequal(unique_states[1]), x.state1 for x in bunch_of_states) # (1, 1, 0) in state1
1-element Vector{Int64}:
 1

julia> findall(isequal(unique_states[5]), x.state2 for x in bunch_of_states) # (1, 0, 1) in state2
3-element Vector{Int64}:
 1
 2
 4

Done this way, there isn’t even a question about rows versus columns. I’m sure you’d want to make some adjustments to what I’ve suggested based on your full use case, but in general I find data/code like this much easier to reason about.
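
If the data starts out as an n×7 matrix, one possible way to get it into this form is sketched below. The helper `row_to_descriptor` and the variable `raw` are hypothetical names, and the sketch assumes the state columns always hold whole numbers (`Int(x)` throws an `InexactError` otherwise):

```julia
struct StateDescriptor
    state1::NTuple{3, Int}
    state2::NTuple{3, Int}
    freq_diff::Float64
end

# Sample lines from the question, standing in for whatever was read from the file.
raw = [1.0 1.0 0.0 1.0 0.0 1.0 6380.35
       1.0 1.0 1.0 1.0 0.0 1.0 6316.91
       2.0 1.0 1.0 2.0 0.0 2.0 6444.27
       2.0 1.0 2.0 1.0 0.0 1.0 11086.5]

# Hypothetical helper: turn one 7-element row into a StateDescriptor.
# Columns 1:3 become state1, columns 4:6 become state2, column 7 is the frequency.
function row_to_descriptor(r)
    state1 = ntuple(i -> Int(r[i]), 3)
    state2 = ntuple(i -> Int(r[i + 3]), 3)
    return StateDescriptor(state1, state2, r[7])
end

bunch_of_states = [row_to_descriptor(r) for r in eachrow(raw)]

# Same kind of query as above, now on the converted data.
hits = findall(isequal((1, 0, 1)), [x.state2 for x in bunch_of_states])
println(hits)  # [1, 2, 4]
```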

Interesting! I have it in an array like this because it’s read in from a file, but I could do some data manipulation like that before I got to this point.

Thanks for your help; that worked.

If the intention was to view the list as a list of possible transitions between states, and furthermore to find the states which appear once and only once in the transition list, then maybe the following can help:

# run the code in mikmoore's post...
using StatsBase

cm = countmap(all_states)
# Dict{Tuple{Int64, Int64, Int64}, Int64} with 6 entries:
#   (1, 1, 1) => 1
#   (1, 0, 1) => 3
# ...

only_once_state_indices = # indices inside `all_states`
  [i for (i,s) in enumerate(all_states) if cm[s] == 1]
# 5-element Vector{Int64}:
#  1
#  2
# ...

only_once_state_transitions = # indices in transition list
  [i for (i,s) in enumerate(bunch_of_states)
    if (cm[s.state1] == 1 || cm[s.state2] == 1)]
# 4-element Vector{Int64}:
#  1
#  2
#  3
#  4
# apparently, each transition has one state which appears only once

Also, it would be better to rename bunch_of_states to bunch_of_transitions.
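
If you would rather not add a dependency, the same counting can be sketched with a plain `Dict` from Base. This is a swapped-in alternative to `countmap`, not what StatsBase does internally. `all_states` is typed in by hand here so the snippet runs on its own, and `counts` and `once_states` are made-up names:

```julia
# The eight states from the example above (state1 entries first, then state2 entries).
all_states = [(1, 1, 0), (1, 1, 1), (2, 1, 1), (2, 1, 2),
              (1, 0, 1), (1, 0, 1), (2, 0, 2), (1, 0, 1)]

# Count how often each state occurs.
counts = Dict{NTuple{3, Int}, Int}()
for s in all_states
    counts[s] = get(counts, s, 0) + 1
end

# States that occur exactly once, in order of first appearance.
once_states = [s for s in unique(all_states) if counts[s] == 1]

# Positions of those states inside `all_states`.
once_indices = findall(s -> counts[s] == 1, all_states)

println(counts[(1, 0, 1)])  # 3
println(once_indices)       # [1, 2, 3, 4, 7]
```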

As an alternative to using structs you might find it useful to load data to a DimStack of DimArrays from DimensionalData.jl, depending on what other things you will be doing with your data.

Assuming the file is in any of the common formats, try reading it with the corresponding Julia package – e.g., both delimited and fixed-width textual formats are well-supported in Julia.

Then, you will immediately have a 1d array of namedtuples instead of a 2d array, and these are very convenient and efficient to manipulate in Julia. Defining custom structs or using fancy array wrappers can be useful, but those are generally further potential steps – you already get great usability and performance from just namedtuples all the way 🙂
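
To give a feel for the array-of-namedtuples shape, here is a rough sketch that builds one by hand using only Base, without any file-reading package. A real reader package would produce the rows for you, and its column names would come from the file. The field names `state1`, `state2`, `freq_diff` and the helper `parse_line` are assumptions made for this example, and the text is assumed to be whitespace-delimited with whole-number state columns:

```julia
# Stand-in for the file contents (sample lines from the question).
text = """
1.0 1.0 0.0 1.0 0.0 1.0 6380.35
1.0 1.0 1.0 1.0 0.0 1.0 6316.91
2.0 1.0 1.0 2.0 0.0 2.0 6444.27
2.0 1.0 2.0 1.0 0.0 1.0 11086.5
"""

# Hypothetical helper: parse one whitespace-delimited line into a namedtuple.
function parse_line(line)
    v = parse.(Float64, split(line))
    return (state1 = Tuple(Int.(v[1:3])),
            state2 = Tuple(Int.(v[4:6])),
            freq_diff = v[7])
end

# One namedtuple per non-empty line.
rows = [parse_line(l) for l in eachline(IOBuffer(text)) if !isempty(strip(l))]

# Queries read naturally by field name, with no row/column bookkeeping.
hits = findall(r -> r.state2 == (1, 0, 1), rows)
println(hits)  # [1, 2, 4]
```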