Combining strings


I am trying to process some string data - each element of the array should be split up into individual words and then combined. At the end, I only want the unique words in the array. In R, I would just run a loop and use c() to add to the array at each iteration. My code is:

mystrs = ["SWT1;SWT1;LPT","ABC;ABC| LPT; NYP","ABCD ;ABC|PT; NYP" ]

function fsep(tstr,splits = (';','|',','))
    tstr2 = split(tstr,splits)
    tstr3 = sort(unique(map(strip,tstr2)))
    return tstr3
end

myset = []
for i = 1:3
    tgenes = fsep(mystrs[i])
    push!(myset, tgenes)
end

uset = unique(myset)

The output is:

julia> uset
3-element Array{Any,1}:
 SubString{String}["LPT", "SWT1"]
 SubString{String}["ABC", "LPT", "NYP"]
 SubString{String}["ABC", "ABCD", "NYP", "PT"]

whereas I want it to be ["ABC","ABCD","LPT","NYP","PT"]

What would be the best structure and code to do this?

many thanks for your help!

PS: Any other observations/suggestions on code would be very welcome. Good to learn!!

How about

julia> unique(mapreduce(x -> split(x, (';', '|', ' ')), vcat, mystrs))
7-element Array{SubString{String},1}:
 "SWT1"
 "LPT"
 "ABC"
 ""
 "NYP"
 "ABCD"
 "PT"

If you have a much longer collection, and expect to see a lot of repetition in these substrings, you might choose to go with:

mapreduce(x -> Set(split(x, (';', '|', ' '))), union!, mystrs)

which performs the unique step each time (when it forms the Set) rather than once at the end.
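For comparison, here is a loop-based sketch that is closer to the R idiom from the question. It assumes the same mystrs and delimiters as the original post; unique_words is a hypothetical helper name, not something from Base. Keep in mind that push!(v, x) adds x as a single element, which is why pushing each result vector gives a vector of vectors, whereas pushing the individual words into one Set keeps everything flat and unique.

```julia
mystrs = ["SWT1;SWT1;LPT", "ABC;ABC| LPT; NYP", "ABCD ;ABC|PT; NYP"]

# Hypothetical helper: gather the unique, whitespace-stripped words
# from every string into a single Set.
function unique_words(strs; splits = (';', '|', ','))
    words = Set{String}()
    for s in strs
        for piece in split(s, splits)
            word = String(strip(piece))
            isempty(word) || push!(words, word)  # skip empty pieces
        end
    end
    return words
end

words = unique_words(mystrs)
```

Because a Set never stores duplicates, no separate unique call is needed here.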

Thanks for the solution! The first version works great!

res1 = unique(mapreduce(x-> split(x, (';', '|',',')), vcat, mystrs))
res2 = sort(unique(map(strip,res1)))

I do have a very large collection with significant overlaps, so I am keen to implement the second solution too. I can do the first part and get a set, but it complains that map cannot be used in sets (to strip whitespace):

res3 = mapreduce(x -> Set(split(x, (';','|'))),union!, mystrs)
res4 = sort(map(strip,res3))

julia> res4 = sort(map(strip,res3))

ERROR: map is not defined on sets
 [1] error(::String) at ./error.jl:33
 [2] map(::Function, ::Set{SubString{String}}) at ./abstractarray.jl:2101
 [3] top-level scope at none:0

I tried to use Parse but get an error too:

julia> res3x = Parse(string,res3)
ERROR: UndefVarError: Parse not defined
 [1] top-level scope at none:0

How do I get the result from ‘set’ back again to ‘String’?

thanks a ton!

It is unclear what Parse is in this context. If you meant Base.parse, that is for parsing numbers and similar values out of strings, so I am not sure how it applies here.
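As a rough illustration of what Base.parse is meant for (the variable names are only for this example):

```julia
n = parse(Int, "42")         # string -> integer
x = parse(Float64, "3.5")    # string -> floating-point number
m = tryparse(Int, "abc")     # returns nothing instead of throwing an error
```

None of this converts between collection types, which is what the question actually needs.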

You mean Set to Vector? Just use collect.

Note that since I was also splitting on the ' ' character (space), there is no need to call strip (which strips whitespace) at the end; you will simply end up with one member of the set that is the empty string "".
That said, you can apply strip without map by broadcasting it with the dot syntax.

julia> substrs = mapreduce(x -> Set(split(x, (';', '|', ' '))), union!, mystrs)
Set(SubString{String}["NYP", "ABCD", "PT", "LPT", "SWT1", "", "ABC"])

julia> strip.(substrs)
7-element Array{SubString{String},1}:

Note that sort also isn’t defined for sets, but you can turn a Set into a Vector with collect and then sort that.
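Putting the pieces together, one possible sketch, assuming the mystrs and the delimiters (';', '|', ',') from the question; the names pieces and words are only illustrative:

```julia
mystrs = ["SWT1;SWT1;LPT", "ABC;ABC| LPT; NYP", "ABCD ;ABC|PT; NYP"]

# Union of the per-string sets of pieces (whitespace is still attached here).
pieces = mapreduce(x -> Set(split(x, (';', '|', ','))), union!, mystrs)

# collect turns the Set into a Vector so that strip can be broadcast and the
# result sorted. Stripping can create new duplicates (and empty strings),
# so filter and deduplicate once more before sorting.
words = sort(unique(filter(!isempty, strip.(collect(pieces)))))
```

With the sample data this should give the six words "ABC", "ABCD", "LPT", "NYP", "PT" and "SWT1".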