Creating a Table from Regex

It occurred to me that it would be nice to be able to create a table from a string by providing a regex that uses named groups.
Nushell’s parse command (docs, example) does this.

A RegexMatch has everything needed for a Tables.jl row, except that it doesn’t provide getproperty overloads: historically its fields were part of its public API (that has become progressively less true, but changing it now would be too breaking).
But we can wrap a RegexMatch in a suitable type.
(We might as well subtype Tables.AbstractRow, though we could go down the getproperty route instead if we preferred.)


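using Tables

# Wrap a RegexMatch so that it satisfies the Tables.jl row interface.
struct RegexMatchRow <: Tables.AbstractRow
    match::RegexMatch
end
# AbstractRow overloads getproperty, so the wrapped field is reached via getfield.
Tables.getcolumn(m::RegexMatchRow, i::Int) = getfield(m, :match)[i]
Tables.getcolumn(m::RegexMatchRow, nm::Symbol) = getfield(m, :match)[nm]
# The named groups become the column names.
Tables.columnnames(m::RegexMatchRow) = Symbol.(keys(getfield(m, :match)))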

Example:

julia> using DataFrames

julia> pattern = r"(?P<id>\w+) +(?P<desktop>-?\d+) +(?P<x>-?\d+) +(?P<y>-?\d+) +(?P<width>\d+) +(?P<height>\d+) +(?P<pc>\w+) +(?P<title>.*)";

julia> DataFrame(RegexMatchRow.(match.(pattern, eachline(`wmctrl -l -G`))))
5×8 DataFrame
 Row │ id          desktop    x          y          width      height     pc         title                             
     │ SubStrin…   SubStrin…  SubStrin…  SubStrin…  SubStrin…  SubStrin…  SubStrin…  SubStrin…                         
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 0x02800003  -1         0          56         3760       2372       Aji        @!0,28;BDH
   2 │ 0x02000001  0          324        471        2677       1942       Aji        Slack | * data | Julia
   3 │ 0x04200003  0          20         44         2376       2372       Aji        Latest Domains/Data topics - Jul…
   4 │ 0x04000010  0          176        -36        2832       1604       Aji        julia-master /home/oxinabox
   5 │ 0x05400038  0          1936       192        1844       2298       Aji        new 1 - Notepadqq
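
The example above depends on wmctrl being installed. Here is a rough, self-contained variant that uses made-up inline strings instead, and also shows one way to deal with lines that don't match (match returns nothing for those, and RegexMatchRow can't wrap nothing). The sample lines and the shortened pattern are assumptions for illustration only.

```julia
using Tables

# Same wrapper as earlier in the post, repeated so this snippet runs on its own.
struct RegexMatchRow <: Tables.AbstractRow
    match::RegexMatch
end
Tables.getcolumn(m::RegexMatchRow, i::Int) = getfield(m, :match)[i]
Tables.getcolumn(m::RegexMatchRow, nm::Symbol) = getfield(m, :match)[nm]
Tables.columnnames(m::RegexMatchRow) = Symbol.(keys(getfield(m, :match)))

# Made-up sample input standing in for the output of an external command.
lines = ["0x01 0 alpha", "not-a-match", "0x02 -1 beta gamma"]
pattern = r"(?P<id>\w+) +(?P<desktop>-?\d+) +(?P<title>.*)"

# Drop the lines the pattern does not match before wrapping.
matches = filter(!isnothing, match.(pattern, lines))
rows = RegexMatchRow.(matches)

# AbstractRow supplies property and index access on top of getcolumn.
first_id = rows[1].id
second_title = rows[2][3]
names = Tables.columnnames(rows[1])
```

Whether to skip non-matching lines, throw an error, or emit a row of missing values is a design choice; skipping is just the simplest thing to show here.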

Something that would be nice to do on top of this would be to infer the column types based on the regex.
For example, if the capture is (?P<width>\d+), we can take a pretty solid guess that this column is Int.
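
A very rough sketch of that idea, using only Base. The helper names guess_column_types and convert_cell are hypothetical, and the heuristic is deliberately narrow: it only recognises simple, non-nested named groups whose body is literally \d+ or -?\d+, and treats everything else as a String.

```julia
# Hypothetical helper, not part of Base or Tables.jl: guess a column type for
# each named group by looking at the group's source text.
# Assumes simple, non-nested groups written as (?P<name>...) or (?<name>...).
function guess_column_types(re::Regex)
    group = r"\(\?P?<(?<name>\w+)>(?<body>[^()]*)\)"
    # Matches the literal text of an integer-like group body: \d+ or -?\d+
    intlike = r"^(?:-\?)?\\d\+$"
    guesses = Pair{Symbol,DataType}[]
    for g in eachmatch(group, re.pattern)
        T = occursin(intlike, g[:body]) ? Int : String
        push!(guesses, Symbol(g[:name]) => T)
    end
    return guesses
end

# Convert one captured cell according to the guessed type.
convert_cell(::Type{Int}, s) = parse(Int, s)
convert_cell(::Type{String}, s) = String(s)

pattern = r"(?P<id>\w+) +(?P<x>-?\d+) +(?P<width>\d+) +(?P<title>.*)"
types = guess_column_types(pattern)

# Apply the guesses to one matched line and build a NamedTuple row.
m = match(pattern, "0x02000001 324 2677 Slack")
cells = [name => convert_cell(T, m[name]) for (name, T) in types]
row = (; cells...)
```

A real implementation would need to handle nested groups, other numeric forms such as floats, and digit strings too long for Int; this only shows that the group text carries enough information for a first guess.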


Yeah, we should include something like this in CSV.jl. Having “regex” functionality has come up a few times, but it’s always been a little unclear whether you’d want to specify a regex per cell, a regex delimiter, etc. I think specifying a regex per row, with named groups for columns, is a clear solution to this. We could make a CSV.RegexFile or something that does the iterating/matching for you and implements the Tables.jl interface.
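
To make the shape of that concrete, here is a minimal sketch of what such a type could look like. RegexFile is a hypothetical name taken from the suggestion above, not an existing CSV.jl API, and eager matching and skipping non-matching lines are both assumptions of the sketch.

```julia
using Tables

# Row wrapper from the original post, repeated so this snippet runs on its own.
struct RegexMatchRow <: Tables.AbstractRow
    match::RegexMatch
end
Tables.getcolumn(m::RegexMatchRow, i::Int) = getfield(m, :match)[i]
Tables.getcolumn(m::RegexMatchRow, nm::Symbol) = getfield(m, :match)[nm]
Tables.columnnames(m::RegexMatchRow) = Symbol.(keys(getfield(m, :match)))

# Hypothetical table type: holds one row per line that matched the pattern.
struct RegexFile
    rows::Vector{RegexMatchRow}
end

# Read every line from io, keeping only the lines the pattern matches.
function RegexFile(pattern::Regex, io::IO)
    rows = RegexMatchRow[]
    for line in eachline(io)
        m = match(pattern, line)
        m === nothing || push!(rows, RegexMatchRow(m))
    end
    return RegexFile(rows)
end

# Declare RegexFile as a row-oriented Tables.jl source.
Tables.istable(::Type{RegexFile}) = true
Tables.rowaccess(::Type{RegexFile}) = true
Tables.rows(f::RegexFile) = f.rows

# Made-up input; any IO (an open file, a command's output) would work the same way.
io = IOBuffer("alice 30\nbob 41\n")
f = RegexFile(r"(?P<name>\w+) +(?P<age>\d+)", io)
cols = Tables.columntable(f)
```

Because the type implements the Tables.jl interface, any sink that accepts a Tables.jl source should be able to consume it directly.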
