Counting words in StringArray

How can I count words in a StringArray?

julia> y=["gastro" "gastro gastro "; "dom" "pies"]
2×2 Array{String,2}:
 "gastro"  "gastro gastro "
 "dom"     "pies"

julia> occursin.(x,y)
2×2 BitArray{2}:
 1  1
 0  0

julia> y.=="gastro"
2×2 BitArray{2}:
 1  0
 0  0

I am looking for a solution that will give:

2×2 Array{Int64,2}:
 1  2
 0  0

Paul

Can you do it if you just want to count the words in a single string, for example how many times "gastro" occurs in "gastro gastro"? Once you have that working, you can broadcast that function over the array.
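As a rough sketch of that two-step approach: `count_word` below is a hypothetical helper name (not something from Base or from this thread), and it assumes words are separated by whitespace. It first handles one string, then the same function is broadcast over the matrix.

```julia
# Hypothetical helper: count the whitespace-separated words in `s` that equal `word`.
count_word(word, s) = count(==(word), split(s))

# Step 1: a single string.
single = count_word("gastro", "gastro gastro ")

# Step 2: broadcast over the array. The string "gastro" is treated as a scalar,
# so it is reused for every element of y.
y = ["gastro" "gastro gastro "; "dom" "pies"]
counts = count_word.("gastro", y)
```

Note that this compares whole words, so an entry like "gastronomy" would not be counted as "gastro".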

I am reading:

julia> broadcast(+, A, B)

but

julia> broadcast.(occursin.(x,y))
ERROR: MethodError: objects of type Bool are not callable
Stacktrace:

I found a solution like this:

julia> x
"gastro"

julia> y
2×2 Array{String,2}:
 "gastro"  "gastro gastro "
 "dom"     "pies"

julia> using DataStructures

julia> count.(x,y)
2×2 Array{Int64,2}:
 1  2
 0  0

Is it OK? Is it fast?

That works, but you don’t need DataStructures for this, as far as I can tell.
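A minimal sketch without any packages, assuming Julia 1.3 or later, where `count(pattern, string)` is available in Base. Keep in mind that it counts non-overlapping substring matches rather than whole words.

```julia
x = "gastro"
y = ["gastro" "gastro gastro "; "dom" "pies"]

# Count occurrences of the pattern x in one string.
n = count(x, "gastro gastro ")

# Broadcast over the matrix; no `using DataStructures` required.
counts = count.(x, y)
```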

I think this should be as fast as it can get for this case, as it’s now a vectorised operation.

Vectorized isn’t faster, a loop would be just as good. But the chosen solution is probably quite fast, and also elegant.
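To illustrate, here is a sketch of the same computation as an explicit loop. `count_loop` is a hypothetical name, and it assumes Julia 1.3 or later for `count(pattern, string)`. It should give the same result as `count.(x, y)`.

```julia
# Hypothetical loop version of count.(x, y).
function count_loop(x, y)
    out = zeros(Int, size(y))        # result has the same shape as y
    for i in eachindex(out, y)
        out[i] = count(x, y[i])      # occurrences of x in the i-th string
    end
    return out
end

y = ["gastro" "gastro gastro "; "dom" "pies"]
looped = count_loop("gastro", y)
dotted = count.("gastro", y)
```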

Oh really? Is it not like Python in this regard, where vectorisation is faster than loops because of more efficient memory handling?

What do you call “more efficient memory handling”?

I think that in Julia, fusing the dots (that is, calling a broadcasted function on the result of another broadcasted function) can eliminate intermediate data structures. However, a loop can do the same, depending on how it is written.

Nope, it’s not like Python in that regard. The reason vectorization is fast in Python isn’t memory handling, but that the calculation is handed off to a library written in a fast language like C, that just does looping.

In Julia, Julia itself is that fast language (except for some special cases like BLAS, which isn’t yet implemented in Julia because of the amount of work that would take).

The thing you think of as ‘fast vectorization’ is in most cases just looping under the hood in a faster language.

In fact, memory handling for vectorized code in ‘slow’ languages like Python and Matlab is normally quite bad, because you chain together vectorized operations that each produce intermediate temporary arrays. This can be avoided in Julia, either with “fused broadcasting” or with plain loops.
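A small sketch of what that looks like, using made-up numeric arrays `a` and `b` purely for illustration. The dotted expression is fused into one pass that writes straight into `out`, and the explicit loop does the same work.

```julia
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

# Fused broadcasting: one loop, results written in place into `out`,
# with no temporary arrays for `2 .* a` or `b .^ 2`.
out = similar(a)
out .= 2 .* a .+ b .^ 2

# The equivalent plain loop.
out2 = similar(a)
for i in eachindex(a, b, out2)
    out2[i] = 2 * a[i] + b[i]^2
end
```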

Thanks for clarifying that! I’m a newbie to Julia, and this is good information to know.

For anyone interested in reading further, there’s an article explaining fused broadcasting from the time it was introduced in Julia 0.6.
