# Combining strings

Hi,

I am trying to process some string data: each element of the array should be split into individual words, and the words from all elements should then be combined. At the end, I only want the unique words in the array. In R, I would just run a loop and use `c()` to add to the array at each iteration. My code is:

``````mystrs = ["SWT1;SWT1;LPT", "ABC;ABC| LPT; NYP", "ABCD ;ABC|PT; NYP"]

function fsep(tstr, splits = (';', '|', ','))
    tstr2 = split(tstr, splits)
    tstr3 = sort(unique(map(strip, tstr2)))
    return tstr3
end

myset = []
for i = 1:3
    tgenes = fsep(mystrs[i])
    push!(myset, tgenes)
end

uset = unique(myset)
``````

The output is:

``````julia> uset
3-element Array{Any,1}:
SubString{String}["LPT", "SWT1"]
SubString{String}["ABC", "LPT", "NYP"]
SubString{String}["ABC", "ABCD", "NYP", "PT"]
``````

whereas I want it to be a single flat array of the unique words: `["ABC", "ABCD", "LPT", "NYP", "PT", "SWT1"]`

What would be the best structure and code to do this?

many thanks for your help!

PS: Any other observations/suggestions on code would be very welcome. Good to learn!!
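One observation on the loop in the question: `push!` adds the whole vector returned by `fsep` as a single element, which is why the result is an array of arrays. `append!` adds the elements themselves, which is closer to what `c()` does in R. A minimal sketch under that assumption, reusing the question's `fsep` helper (the typed empty vector is an illustrative choice, not something the original code requires):

```julia
mystrs = ["SWT1;SWT1;LPT", "ABC;ABC| LPT; NYP", "ABCD ;ABC|PT; NYP"]

# Same splitting helper as in the question, written compactly
function fsep(tstr, splits = (';', '|', ','))
    return sort(unique(map(strip, split(tstr, splits))))
end

# A typed empty vector instead of `[]`, which would be an Array{Any,1}
myset = SubString{String}[]
for s in mystrs
    # append! adds each word; push! would add the whole vector as one element
    append!(myset, fsep(s))
end

# Deduplicate across all strings and sort the flat result
uset = sort(unique(myset))
```

With `append!`, `uset` ends up as one flat vector of words rather than three nested vectors.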

``````julia> unique(mapreduce(x-> split(x, (';', '|', ' ')), vcat, mystrs))
7-element Array{SubString{String},1}:
"SWT1"
"LPT"
"ABC"
""
"NYP"
"ABCD"
"PT"
``````

If you have a much longer collection, and expect to see a lot of repetition in these substrings, you might choose to go with:

``````mapreduce(x -> Set(split(x,  (';', '|', ' '))), union!, mystrs)
``````

which performs the `unique` step each time (when it forms the Set) rather than once at the end.
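To make the difference concrete, here is a small sketch that runs both variants on the same input and checks that they yield the same set of substrings. The variable names (`delims`, `uniq_vec`, `uniq_set`) are only illustrative:

```julia
mystrs = ["SWT1;SWT1;LPT", "ABC;ABC| LPT; NYP", "ABCD ;ABC|PT; NYP"]
delims = (';', '|', ' ')

# Variant 1: concatenate every piece, then deduplicate once at the end
uniq_vec = unique(mapreduce(x -> split(x, delims), vcat, mystrs))

# Variant 2: deduplicate each string's pieces in a Set, then merge the sets
uniq_set = mapreduce(x -> Set(split(x, delims)), union!, mystrs)

# Both variants contain the same elements (including the empty string "")
same = Set(uniq_vec) == uniq_set
```

The first variant returns a `Vector` in order of first appearance, while the second returns a `Set`, whose iteration order should not be relied on.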

Thanks for the solution! The first version works great!

``````res1 = unique(mapreduce(x-> split(x, (';', '|',',')), vcat, mystrs))
res2 = sort(unique(map(strip,res1)))
``````

I do have a very large collection with significant overlaps, so I am keen to implement the second solution too. I can do the first part and get a set, but it then complains that `map` is not defined on sets (I use `map` to strip whitespace):

``````res3 = mapreduce(x -> Set(split(x, (';','|'))),union!, mystrs)
res4 = sort(map(strip,res3))

julia> res4 = sort(map(strip,res3))

ERROR: map is not defined on sets
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] map(::Function, ::Set{SubString{String}}) at ./abstractarray.jl:2101
[3] top-level scope at none:0
``````

I tried to use `Parse` but got an error too:

``````julia> res3x = Parse(string,res3)
ERROR: UndefVarError: Parse not defined
Stacktrace:
[1] top-level scope at none:0
``````

How do I convert the result from a `Set` back into an array of strings?

thanks a ton!

It is unclear what `Parse` is in this context. If you meant `Base.parse`, that is for parsing numbers and the like from strings, so I am not sure how it applies here.

You mean `Set` to `Vector`? Just use `collect`.
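As a minimal sketch of that conversion (the sample input is made up for illustration): `collect` turns the `Set` into a `Vector`, and if plain `String` elements are wanted rather than `SubString`s, broadcasting the `String` constructor is one way to get them.

```julia
# A small Set of substrings, as produced by split
s = Set(split("NYP;ABC;LPT", ';'))

# collect gives a Vector with the same element type; order is not guaranteed
v = collect(s)

# Optional: convert each SubString{String} into an independent String
strs = String.(v)
```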

Note that since I was also splitting on the space character `' '`, there is no need to call `strip` (which removes leading and trailing whitespace) at the end; you will simply end up with one member of the set that is an empty string `""`.
That said, you can apply `strip` without `map` by broadcasting it with the dot syntax:

``````julia> substrs = mapreduce(x -> Set(split(x,  (';', '|', ' '))), union!, mystrs)
Set(SubString{String}["NYP", "ABCD", "PT", "LPT", "SWT1", "", "ABC"])

julia> strip.(substrs)
7-element Array{SubString{String},1}:
"NYP"
"ABCD"
"PT"
"LPT"
"SWT1"
""
"ABC"
``````

Note that `sort` isn’t defined for sets either, but you can turn a `Set` into a `Vector` with `collect` and sort that.
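Putting the pieces together, a possible end-to-end sketch looks like this. It assumes you also want to drop the empty string, which it does with the `keepempty=false` keyword of `split`; that keyword is my addition here and was not used earlier in the thread.

```julia
mystrs = ["SWT1;SWT1;LPT", "ABC;ABC| LPT; NYP", "ABCD ;ABC|PT; NYP"]

# Split on ';', '|' and ' ', discard empty pieces, deduplicate per string,
# and merge the per-string sets into one
tokens = mapreduce(x -> Set(split(x, (';', '|', ' '); keepempty=false)),
                   union!, mystrs)

# A Set cannot be sorted directly, so collect it into a Vector first
result = sort(collect(tokens))
```

For the sample data this gives the six unique words in alphabetical order: `ABC`, `ABCD`, `LPT`, `NYP`, `PT`, `SWT1`.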