Suggestions for converting raw RTF (rich text format) to plain text

I’m going to be pulling data from an IBM DB2 database where it is stored as CLOB data. The data is Rich Text Format (RTF) text, and I want to convert it to plain text. Here’s an example of what it looks like:

 {\rtf1\ansi{\fonttbl\f0\fswiss Helvetica;}\f0\pard
 This is some {\b bold} text.\par
 }

What I want is to strip out all the RTF junk and end up with:

This is some bold text.

I took a quick look at the JuliaText family of packages and didn’t see anything that could strip out the RTF markup. From what I’ve read, you can’t handle all of RTF with a regex alone.
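To make the regex limitation concrete, here is a small sketch (the pattern is only an illustration, not a recommended approach) that deletes control words and braces from the sample above. Because a single substitution has no notion of nested groups, it cannot know that everything inside the `{\fonttbl ...}` group should be dropped:

```julia
# Sample RTF from the question, flattened onto one line.
rtf = raw"{\rtf1\ansi{\fonttbl\f0\fswiss Helvetica;}\f0\pard This is some {\b bold} text.\par}"

# Naive approach (illustration only): delete control words and braces.
naive = replace(rtf, r"\\[a-z]+-?\d* ?|[{}]"i => "")

# The font name from the font table leaks into the result, since the regex
# does not track which group the text belongs to.
println(naive)
println(occursin("Helvetica", naive))
```

Getting rid of that kind of leftover requires tracking group nesting (a stack), which is more than a single regex substitution can do.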

It looks like my only option may be to call a library from another language, or to utilize a command line tool from Julia. Any thoughts/advice would be much appreciated!

If you’re on Mac, there’s a nifty command-line program called textutil.

:cry: I’m on a Windows system

TIL

Is this an Apple program?

Yes, I think so, like sips the image converter…

Probably worth investigating Pandoc for your text-processing needs…
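If you go the Pandoc route, a minimal sketch of calling it from Julia might look like the following. It assumes a `pandoc` executable is on your PATH and that the installed version includes an RTF reader (older releases do not, as far as I know); `pandoc_rtf_to_plain` is a hypothetical helper name, not something from a package:

```julia
# Hypothetical helper: pipe an RTF string through pandoc and return plain text.
# Assumes `pandoc` is on the PATH and supports `--from=rtf`.
function pandoc_rtf_to_plain(rtf::AbstractString)
    cmd = `pandoc --from=rtf --to=plain`
    # Feed the string to pandoc's stdin and capture its stdout.
    return read(pipeline(cmd; stdin=IOBuffer(String(rtf))), String)
end

rtf = raw"""
    {\rtf1\ansi{\fonttbl\f0\fswiss Helvetica;}\f0\pard
    This is some {\b bold} text.\par
    }"""

# Only attempt the conversion if pandoc is actually installed.
if Sys.which("pandoc") !== nothing
    print(pandoc_rtf_to_plain(rtf))
end
```

This spawns one process per string, so it may be slow if you have many CLOB values to convert.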

This stackoverflow answer offered a Python function striprtf(text) to accomplish this, which was pretty easy to convert to Julia:

striprtf(text) function (translated from Python)
# translated from https://stackoverflow.com/questions/188545/regular-expression-for-extracting-text-from-an-rtf-string
# and https://stackoverflow.com/questions/44580580/how-to-convert-rtf-string-to-plain-text-in-python-using-any-library
let pattern = r"\\([a-z]{1,32})(-?\d{1,10})?[ ]?|\\'([0-9a-f]{2})|\\([^a-z])|([{}])|[\r\n]+|(.)"i,
    destinations = Set{String}([
      "aftncn","aftnsep","aftnsepc","annotation","atnauthor","atndate","atnicn","atnid",
      "atnparent","atnref","atntime","atrfend","atrfstart","author","background",
      "bkmkend","bkmkstart","blipuid","buptim","category","colorschememapping",
      "colortbl","comment","company","creatim","datafield","datastore","defchp","defpap",
      "do","doccomm","docvar","dptxbxtext","ebcend","ebcstart","factoidname","falt",
      "fchars","ffdeftext","ffentrymcr","ffexitmcr","ffformat","ffhelptext","ffl",
      "ffname","ffstattext","field","file","filetbl","fldinst","fldrslt","fldtype",
      "fname","fontemb","fontfile","fonttbl","footer","footerf","footerl","footerr",
      "footnote","formfield","ftncn","ftnsep","ftnsepc","g","generator","gridtbl",
      "header","headerf","headerl","headerr","hl","hlfr","hlinkbase","hlloc","hlsrc",
      "hsv","htmltag","info","keycode","keywords","latentstyles","lchars","levelnumbers",
      "leveltext","lfolevel","linkval","list","listlevel","listname","listoverride",
      "listoverridetable","listpicture","liststylename","listtable","listtext",
      "lsdlockedexcept","macc","maccPr","mailmerge","maln","malnScr","manager","margPr",
      "mbar","mbarPr","mbaseJc","mbegChr","mborderBox","mborderBoxPr","mbox","mboxPr",
      "mchr","mcount","mctrlPr","md","mdeg","mdegHide","mden","mdiff","mdPr","me",
      "mendChr","meqArr","meqArrPr","mf","mfName","mfPr","mfunc","mfuncPr","mgroupChr",
      "mgroupChrPr","mgrow","mhideBot","mhideLeft","mhideRight","mhideTop","mhtmltag",
      "mlim","mlimloc","mlimlow","mlimlowPr","mlimupp","mlimuppPr","mm","mmaddfieldname",
      "mmath","mmathPict","mmathPr","mmaxdist","mmc","mmcJc","mmconnectstr",
      "mmconnectstrdata","mmcPr","mmcs","mmdatasource","mmheadersource","mmmailsubject",
      "mmodso","mmodsofilter","mmodsofldmpdata","mmodsomappedname","mmodsoname",
      "mmodsorecipdata","mmodsosort","mmodsosrc","mmodsotable","mmodsoudl",
      "mmodsoudldata","mmodsouniquetag","mmPr","mmquery","mmr","mnary","mnaryPr",
      "mnoBreak","mnum","mobjDist","moMath","moMathPara","moMathParaPr","mopEmu",
      "mphant","mphantPr","mplcHide","mpos","mr","mrad","mradPr","mrPr","msepChr",
      "mshow","mshp","msPre","msPrePr","msSub","msSubPr","msSubSup","msSubSupPr","msSup",
      "msSupPr","mstrikeBLTR","mstrikeH","mstrikeTLBR","mstrikeV","msub","msubHide",
      "msup","msupHide","mtransp","mtype","mvertJc","mvfmf","mvfml","mvtof","mvtol",
      "mzeroAsc","mzeroDesc","mzeroWid","nesttableprops","nextfile","nonesttables",
      "objalias","objclass","objdata","object","objname","objsect","objtime","oldcprops",
      "oldpprops","oldsprops","oldtprops","oleclsid","operator","panose","password",
      "passwordhash","pgp","pgptbl","picprop","pict","pn","pnseclvl","pntext","pntxta",
      "pntxtb","printim","private","propname","protend","protstart","protusertbl","pxe",
      "result","revtbl","revtim","rsidtbl","rxe","shp","shpgrp","shpinst",
      "shppict","shprslt","shptxt","sn","sp","staticval","stylesheet","subject","sv",
      "svb","tc","template","themedata","title","txe","ud","upr","userprops",
      "wgrffmtfilter","windowcaption","writereservation","writereservhash","xe","xform",
      "xmlattrname","xmlattrvalue","xmlclose","xmlname","xmlnstbl",
      "xmlopen" ]),
    specialchars = Dict{String,String}([
      "par" => "\n",
      "sect" => "\n\n",
      "page" => "\n\n",
      "line" => "\n",
      "tab" => "\t",
      "emdash" => "\u2014",
      "endash" => "\u2013",
      "emspace" => "\u2003",
      "enspace" => "\u2002",
      "qmspace" => "\u2005",
      "bullet" => "\u2022",
      "lquote" => "\u2018",
      "rquote" => "\u2019",
      "ldblquote" => "\u201C",
      "rdblquote" => "\u201D" ])
    global striprtf
    function striprtf(text::AbstractString)
        stack = Tuple{Int,Bool}[]
        ignorable = false       # Whether this group (and all inside it) are "ignorable".
        ucskip = 1              # Number of ASCII characters to skip after a unicode character.
        curskip = 0             # Number of ASCII characters left to skip
        out = IOBuffer()        # Output buffer.
        for match in eachmatch(pattern, text)
            word,arg,hex,char,brace,tchar = match.captures
            if brace !== nothing
                curskip = 0
                if brace == "{"
                    # Push state
                    push!(stack, (ucskip,ignorable))
                elseif brace == "}"
                    # Pop state
                    ucskip,ignorable = pop!(stack)
                end
            elseif char !== nothing # \x (not a letter)
                curskip = 0
                ch = only(char)
                if ch == '~'
                    !ignorable && print(out, '\ua0')
                elseif ch in "{}\\"
                    !ignorable && print(out, char)
                elseif ch == '*'
                    ignorable = true
                end
            elseif word !== nothing # \foo
                curskip = 0
                if word in destinations
                    ignorable = true
                elseif ignorable
                    nothing
                elseif word in keys(specialchars)
                    print(out, specialchars[word])
                elseif word == "uc"
                    ucskip = parse(Int, arg)
                elseif word == "u"
                    c = parse(Int, arg)
                    c < 0 && (c += 0x10000)
                    print(out, Char(c))
                    curskip = ucskip
                end
            elseif hex !== nothing # \'xx
                if curskip > 0
                    curskip -= 1
                elseif !ignorable
                    c = parse(Int, hex, base=16)
                    print(out, Char(c))
                end
            elseif tchar !== nothing
                if curskip > 0
                    curskip -= 1
                elseif !ignorable
                    print(out, tchar)
                end
            end
        end
        return String(take!(out))
    end
end

For example, this gives:

julia> striprtf(raw"""
        {\rtf1\ansi{\fonttbl\f0\fswiss Helvetica;}\f0\pard
        This is some {\b bold} text.\par
        }""")
"This is some bold text.\n"

This should be in a package

Might be worthwhile first to look at other sources, e.g. this Python striprtf package, which looks like it is an upgraded version of the same stackoverflow answer, to see if anything is missing.

I created a draft package: JuliaStrings/StripRTF.jl (strip RTF to plain text)

The implementation is more complicated than the one posted above, mainly because it needs to deal with messy encoding issues. (RTF is only marginally Unicode-aware.)
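To illustrate the kind of encoding issue involved (this is my own sketch, not code from the package): the simple version above decodes a `\'xx` escape with `Char(c)`, which treats the byte as a Unicode code point. RTF files often declare a Windows code page instead, and in Windows-1252 the bytes 0x80 to 0x9f mean something quite different. The table below is a hypothetical, partial mapping covering only a few such bytes:

```julia
# Hypothetical, partial lookup table (illustration only): a few Windows-1252
# bytes whose meaning differs from the same-numbered Unicode code points.
const CP1252_SAMPLE = Dict{UInt8,Char}(
    0x80 => '\u20ac',  # euro sign
    0x85 => '\u2026',  # horizontal ellipsis
    0x91 => '\u2018',  # left single quotation mark
    0x92 => '\u2019',  # right single quotation mark
    0x93 => '\u201c',  # left double quotation mark
    0x94 => '\u201d',  # right double quotation mark
    0x96 => '\u2013',  # en dash
    0x97 => '\u2014',  # em dash
)

# What the simple translation does: byte value == code point.
decode_naive(hex::AbstractString) = Char(parse(UInt8, hex, base=16))

# Code-page-aware decoding, falling back to the byte value when the
# partial table has no entry.
function decode_cp1252(hex::AbstractString)
    b = parse(UInt8, hex, base=16)
    return get(CP1252_SAMPLE, b, Char(b))
end

# The escape \'93 in a Windows-1252 document is a curly opening quote,
# but the naive decoding yields an invisible C1 control character.
println(repr(decode_naive("93")))
println(repr(decode_cp1252("93")))
```

A complete solution would also have to honor whichever code page the document declares, which is part of why a full implementation gets more involved.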

Did you try pandoc?

Thanks. Will you register it? I was just needing a tool like this :slight_smile:

Already done: New package: StripRTF v1.0.0 (Pull Request #83294, JuliaRegistries/General)
