Help me make a regex, instead of this?

Ahmed_Salih · May 5, 2019, 5:03pm

Hey guys, so imagine you have an array like this when reading your directory:

603-element Array{String,1}:
 "PartFluidOut_0000.vtk"
 "PartFluidOut_0001.vtk"
 "PartFluidOut_0002.vtk"
 "PartFluidOut_0003.vtk"
 "PartFluidOut_0004.vtk"
 "PartFluidOut_0005.vtk"
 "PartFluidOut_0006.vtk"
 "PartFluidOut_0007.vtk"
 "PartFluidOut_0008.vtk"
 "PartFluidOut_0009.vtk"
 ⋮
 "PartFluid_0293.vtk"
 "PartFluid_0294.vtk"
 "PartFluid_0295.vtk"
 "PartFluid_0296.vtk"
 "PartFluid_0297.vtk"
 "PartFluid_0298.vtk"
 "PartFluid_0299.vtk"
 "PartFluid_0300.vtk"
 "_ResumeFluidOut.csv"

Then what I want to do is to filter in a way that I end up with the following result:

"PartFluidOut"
"PartFluid"

The way to this goal as I want it is:

Only want to look at .vtk files, i.e. every other file type should be ignored
Only want letters, so no numbers or underscores etc. in output

I’ve done it using this snippet of code currently:

filenames   = readdir()
filterFiles = split.(filenames,"_")

k = []
for i = 1:length(filterFiles)
    global k = push!(k,filterFiles[i][1])
end

deleteat!(k,length(k))

finalNames =  unique(k)
println(finalNames)

This works, but is a bit hardcoded since I have to look for an underscore, and I would rather just remove all non-vtk files, then remove “.vtk” and then finally all numbers. I know that some might be able to do this efficiently with a regex function - as I am new to regex I would like to see such an example.

My question is not about optimizing my code, but improving the way it works/robustness, hopefully doing this with implementing some kind of regex.

Thanks for your time.

DNF · May 5, 2019, 6:17pm

Disclaimer: This is not well tested.

function getfilestrings(namelist, pattern::Regex=r"(.*)")  # default is to accept all strings (the pattern needs one capture group)
    filenames = Set{String}()  # this makes sure that the list is unique
    for filename in namelist
        m = match(pattern, filename)
        isnothing(m) && continue
        str = first(m.captures)
        !isempty(str) && push!(filenames, str)
    end
    return filenames
end

list = getfilestrings(readdir(), r"([a-zA-Z]+).*\.vtk$")

I am also not quite sure what you want to be able to match on. This should work for your example, but if you don’t accept filenames starting with underscore, for example, then you can modify the regex to add a ^ in the beginning, like this: r"^([a-zA-Z]+).*\.vtk$"

I’m using the regex to capture the first group of letters in the groups a-z and A-Z, so no non-ascii characters, and require the string to end with .vtk, that’s what the dollar sign is for.
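To make the capture behaviour concrete, here is a minimal sketch that applies the pattern to a few names from the question. The helper name `leading_letters` and the file name "_Extra_0001.vtk" are made up for illustration; they are not part of the original code or listing.

```julia
# Hypothetical helper: return the first run of ASCII letters captured by
# `pattern`, or `nothing` when the name does not match at all.
function leading_letters(name::AbstractString,
                         pattern::Regex=r"([a-zA-Z]+).*\.vtk$")
    m = match(pattern, name)
    m === nothing && return nothing
    return String(m.captures[1])   # captures hold SubStrings; convert to String
end

println(leading_letters("PartFluidOut_0000.vtk"))  # PartFluidOut
println(leading_letters("PartFluid_0293.vtk"))     # PartFluid
println(leading_letters("_ResumeFluidOut.csv"))    # nothing (does not end in .vtk)

# "_Extra_0001.vtk" is an invented name showing the effect of a leading ^.
println(leading_letters("_Extra_0001.vtk"))                           # Extra
println(leading_letters("_Extra_0001.vtk", r"^([a-zA-Z]+).*\.vtk$"))  # nothing
```

Without the `^`, the regex is free to start matching at the first letter it finds, which is why the leading underscore is skipped in the fourth call.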

Ahmed_Salih · May 5, 2019, 6:30pm

Thanks! I will try to test this for a bit and ask regarding parts I don’t completely understand.

Kind regards

natemcintosh · May 5, 2019, 6:57pm

Hi Ahmed!
Another way to do it might be along the lines of something like this

function glob_thing(pattern, to_search::AbstractVector)::AbstractVector
    filter(x -> occursin(pattern, x), to_search)
end

function name_beginning(name::AbstractString)
    names_no_extension = splitext(name)[1]
    first_part = split(names_no_extension, "_")[1]
    return first_part
end

files = readdir("/path/to/your/data/")
vtk_files = glob_thing(r"\.vtk$", files)  # the regex only matches names that end in ".vtk"
name_beginnings = name_beginning.(vtk_files)
unique(name_beginnings)

The first function glob_thing, filters down the data from files to only those ending with “.vtk”. I named it glob_thing because I’m using it to do something similar to the glob function from terminals. I got this function from @tkoolen, who helped me out with one of my past questions.

Now that we have only files that end in “.vtk”, let’s remove everything in the names after the first underscore character. We do this with the name_beginning() function. This uses splitext to split each file name into everything before the file extension and the extension itself. Then we split the part before the extension on the “_” character and return the first element, first_part.

My favorite part of this solution is using dot notation to apply name_beginning() on all the files in vtk_files. For more on how this works, see this page.

Finally, let’s single out all the unique names with the unique function.
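The steps described above can be traced on single names. This is only a sketch using two file names from the question; `prefix_of` is a hypothetical stand-in for name_beginning().

```julia
name = "PartFluidOut_0000.vtk"

# splitext separates the extension from the rest of the name.
stem, ext = splitext(name)
println(stem)   # PartFluidOut_0000
println(ext)    # .vtk

# Splitting on "_" and taking the first piece drops the counter.
pieces = split(stem, "_")
println(pieces[1])   # PartFluidOut

# Hypothetical one-line version of the same idea, applied with dot
# notation to every element of a vector and then de-duplicated.
prefix_of(n) = split(splitext(n)[1], "_")[1]
prefixes = unique(prefix_of.(["PartFluidOut_0000.vtk",
                              "PartFluidOut_0001.vtk",
                              "PartFluid_0293.vtk"]))
println(prefixes)
```

Note that the pieces returned by `split` are `SubString{String}` values rather than `String`s, which is why the result has that element type.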

Ahmed_Salih · May 5, 2019, 7:20pm

Thanks for both of your answers, @DNF and @natemcintosh! Your explanations are very neat and it is nice to see how it is possible to approach this problem. In the end I chose DNF’s implementation since it is more robust, i.e. it does not care about “_” or “-”, and it is only about 10% slower than my initial implementation (speed is not important here, just a bonus). The implementation using glob_thing etc. ended up being about twice as slow in my case. Regarding the outputs, DNF’s and natemcintosh’s approaches gave, respectively:

Set(["PartFluidOut", "PartFluid"])

2-element Array{SubString{String},1}

Which seems to be a bit of a hassle to work with, so I changed it back to an Array{String,1}.
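For anyone who hits the same output types, a small sketch of how either result can be turned into a plain vector of strings. The variable names are illustrative only.

```julia
# A Set has no defined order; collect gives a Vector, sort makes it stable.
as_set = Set(["PartFluidOut", "PartFluid"])
from_set = sort(collect(as_set))
println(from_set)

# split returns SubStrings; broadcasting String over them gives Strings.
subs = split("PartFluidOut_PartFluid", "_")
from_subs = String.(subs)
println(typeof(from_subs) == Vector{String})   # true
```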

But thank you to both of you, it was very nice to be forced to think outside the box and to look up some new Julia terms, e.g. “continue” and related syntax, to become better at the language. My final function for anyone interested is:

function getfilestrings(namelist, pattern::Regex=r"(.*)")  # default is to accept all strings (the pattern needs one capture group)
    filenames::Array{String,1} = []  # duplicates are removed by unique() at the end
    for filename in namelist
        m = match(pattern, filename)
        isnothing(m) && continue
        str = first(m.captures)  # first capture group, i.e. the "PartFluid" component
        !isempty(str) && push!(filenames, str)
    end
    return unique(filenames)
end

DNF · May 5, 2019, 7:25pm

I think it would be more idiomatic to write:

filenames = String[]

And, though I haven’t compared performance, it seems more aesthetically appealing to write

!isempty(str) && !in(str, filenames) && push!(filenames, str)

instead of pushing everything and then unique!ing it away afterwards.
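The two variants being compared can be sketched side by side. The function names below are hypothetical, and no timing claim is made here; both return the distinct non-empty strings in order of first appearance.

```julia
# Variant 1: skip strings that are already in the result.
function collect_with_check(items)
    out = String[]
    for s in items
        !isempty(s) && !in(s, out) && push!(out, s)
    end
    return out
end

# Variant 2: push everything, then remove duplicates in place at the end.
function collect_then_unique(items)
    out = String[]
    for s in items
        !isempty(s) && push!(out, s)
    end
    return unique!(out)
end

sample = ["PartFluidOut", "PartFluidOut", "", "PartFluid", "PartFluidOut"]
println(collect_with_check(sample))
println(collect_then_unique(sample))
```

One design point worth keeping in mind: `in` on a vector scans the whole vector, so the first variant does more work per element as the number of distinct names grows. With only a handful of distinct prefixes, as in this thread, that cost is small.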

Ahmed_Salih · May 5, 2019, 7:37pm

Thanks for your suggestions!

Regarding the first one, I see your point using String[], but I have chosen to not adopt it, since Atom does not highlight “String” for me for some reason, so stuck to the former. But I agree with you.

I kept using unique as well, since it was a bit faster and had about 20% allocations.

DNF · May 5, 2019, 7:41pm

I must admit I really dislike the style of variable::TypeAssertion = something. Type assertions are a last resort imho. Maybe you could try

filenames = Vector{String}()

Ahmed_Salih · May 5, 2019, 7:46pm

Thanks, that works!

tkoolen · May 5, 2019, 7:52pm

In this case, it’s not just a matter of style preference. There’s a functional difference between the following two functions:

function f()
    filenames::Array{String,1} = []
    return filenames
end

function g()
    filenames = String[] # or Vector{String}()
    return filenames
end

namely that [] is a shortcut for Vector{Any}(), so the ::Array{String,1} actually calls convert(Vector{String}, []). So the code_warntype is different for these two functions:

julia> @code_warntype f()
Body::Array{String,1}
1 ─ %1 = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{Any,1}, svec(Any, Int64), :(:ccall), 2, Array{Any,1}, 0, 0))::Array{Any,1}
│   %2 = (Base.arraysize)(%1, 1)::Int64
│   %3 = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{String,1}, svec(Any, Int64), :(:ccall), 2, Array{String,1}, :(%2), :(%2)))::Array{String,1}
│   %4 = invoke Base.copyto!($(QuoteNode(IndexLinear()))::IndexLinear, %3::Array{String,1}, $(QuoteNode(IndexLinear()))::IndexLinear, %1::Array{Any,1})::Array{String,1}
└──      return %4

julia> @code_warntype g()
Body::Array{String,1}
1 ─ %1 = $(Expr(:foreigncall, :(:jl_alloc_array_1d), Array{String,1}, svec(Any, Int64), :(:ccall), 2, Array{String,1}, 0, 0))::Array{String,1}
└──      return %1

and likewise for the code_native, which is much shorter for g().
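The same difference can be seen without reading the lowered code. This is a small sketch; the variable names are illustrative.

```julia
untyped = []          # shorthand for an empty Vector{Any}
typed   = String[]    # an empty Vector{String}, created directly

println(eltype(untyped))   # Any
println(eltype(typed))     # String

# This is the conversion that the ::Array{String,1} declaration triggers:
# a second array is allocated and the (zero) elements are copied over.
converted = convert(Vector{String}, untyped)
println(eltype(converted))       # String
println(converted !== untyped)   # true, it is a different array
```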

DNF · May 5, 2019, 7:52pm

Oh, by the way, instead of

unique(filenames)

you should use

unique!(filenames)

That’s much faster.
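As a quick illustration of the behavioural difference (not a benchmark): `unique` returns a new vector and leaves its argument alone, while `unique!` removes the duplicates from the vector it is given. The variable names are illustrative.

```julia
names_found = ["PartFluidOut", "PartFluid", "PartFluidOut"]

copied = unique(names_found)     # new vector; names_found still has 3 elements
println(length(names_found))     # 3

same = unique!(names_found)      # names_found itself is shortened and returned
println(length(names_found))     # 2
println(same === names_found)    # true
```

Inside getfilestrings the vector is local and not needed afterwards, so modifying it in place is safe there.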

Ahmed_Salih · May 5, 2019, 7:58pm

I am not so experienced with code_warntype, but the gist of your comment is that using String[] or Vector{String}() is more efficient? I will have a look through my code and fix these then.

It is actually kind of fascinating how much difference there can be between seemingly similar functions.

DNF · May 5, 2019, 8:02pm

Yes, but it’s really not a question of understanding obscure rules, it all makes logical sense.

String[]

or

Vector{String}()

directly creates a vector of strings, while

filenames::Array{String,1} = []

first creates [] and then converts it. It’s all consistent and out in the open.

Ahmed_Salih · May 5, 2019, 8:12pm

Thanks, that makes sense to me.
