Suggestions for converting raw RTF (rich text format) to plain text

mthelm85 · May 9, 2023, 5:58pm

I am going to be pulling some data from an IBM DB2 database that is stored as CLOB data. It’s Rich Text Format (RTF) text data and what I want to do is convert it to plain text. Here’s an example of what it looks like:

 {\rtf1\ansi{\fonttbl\f0\fswiss Helvetica;}\f0\pard
 This is some {\b bold} text.\par
 }

What I want is to strip out all the RTF junk and end up with:

This is some bold text.

I took a quick look at the JuliaText family of packages and didn’t see anything that looked like it could strip out the RTF markup. From what I’ve read, a regex alone can’t reliably handle all of RTF either.

It looks like my only option may be to call a library from another language, or to utilize a command line tool from Julia. Any thoughts/advice would be much appreciated!
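
To illustrate why a single regex substitution falls short, here is a minimal sketch (the helper name `naive_strip` is made up for this example). It deletes every control word and brace, but it has no notion of which group it is in, so the contents of the font table leak into the output:

```julia
# A naive single-pass cleanup: delete every control word (with its optional
# numeric argument and one trailing space) and every brace.
naive_strip(rtf::AbstractString) = replace(rtf, r"\\[a-z]+-?\d* ?|[{}]" => "")

rtf = raw"{\rtf1\ansi{\fonttbl\f0\fswiss Helvetica;}\f0\pard This is some {\b bold} text.\par}"

println(naive_strip(rtf))
# prints: Helvetica;This is some bold text.
```

Dropping "Helvetica;" correctly requires knowing that the text sits inside a \fonttbl group, which means tracking the nesting of braces rather than just matching tokens.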

cormullion · May 9, 2023, 6:14pm

If you’re on Mac, there’s a nifty command-line program called textutil.

mthelm85 · May 9, 2023, 6:16pm

I’m on a Windows system

e3c6 · May 9, 2023, 6:31pm

TIL

Is this an Apple program?

cormullion · May 9, 2023, 6:37pm

Yes, I think so, like sips, the image converter…

Probably worth investigating Pandoc for your text-processing needs…
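
A rough sketch of what calling Pandoc from Julia might look like, assuming pandoc is installed, on the PATH, and recent enough to include an RTF reader. The helper names `pandoc_cmd` and `rtf_to_plain_pandoc` are hypothetical:

```julia
# Build the pandoc invocation: `-f rtf` selects the RTF reader and
# `-t plain` the plain-text writer.
pandoc_cmd(path::AbstractString) = `pandoc -f rtf -t plain $path`

# Hypothetical helper: write the RTF string to a temporary file, run pandoc
# on it, and return whatever pandoc prints to stdout.
function rtf_to_plain_pandoc(rtf::AbstractString)
    mktemp() do path, io
        write(io, rtf)
        close(io)   # close before pandoc opens the file (matters on Windows)
        read(pandoc_cmd(path), String)
    end
end
```

Spawning one process per CLOB value could be slow for a large table, so this is probably best suited to small batches.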

stevengj · May 9, 2023, 7:35pm

This Stack Overflow answer offered a Python function striprtf(text) to accomplish this, which was pretty easy to convert to Julia:

striprtf(text) function (translated from Python)

# translated from https://stackoverflow.com/questions/188545/regular-expression-for-extracting-text-from-an-rtf-string
# and https://stackoverflow.com/questions/44580580/how-to-convert-rtf-string-to-plain-text-in-python-using-any-library
let pattern = r"\\([a-z]{1,32})(-?\d{1,10})?[ ]?|\\'([0-9a-f]{2})|\\([^a-z])|([{}])|[\r\n]+|(.)"i,
    destinations = Set{String}([
      "aftncn","aftnsep","aftnsepc","annotation","atnauthor","atndate","atnicn","atnid",
      "atnparent","atnref","atntime","atrfend","atrfstart","author","background",
      "bkmkend","bkmkstart","blipuid","buptim","category","colorschememapping",
      "colortbl","comment","company","creatim","datafield","datastore","defchp","defpap",
      "do","doccomm","docvar","dptxbxtext","ebcend","ebcstart","factoidname","falt",
      "fchars","ffdeftext","ffentrymcr","ffexitmcr","ffformat","ffhelptext","ffl",
      "ffname","ffstattext","field","file","filetbl","fldinst","fldrslt","fldtype",
      "fname","fontemb","fontfile","fonttbl","footer","footerf","footerl","footerr",
      "footnote","formfield","ftncn","ftnsep","ftnsepc","g","generator","gridtbl",
      "header","headerf","headerl","headerr","hl","hlfr","hlinkbase","hlloc","hlsrc",
      "hsv","htmltag","info","keycode","keywords","latentstyles","lchars","levelnumbers",
      "leveltext","lfolevel","linkval","list","listlevel","listname","listoverride",
      "listoverridetable","listpicture","liststylename","listtable","listtext",
      "lsdlockedexcept","macc","maccPr","mailmerge","maln","malnScr","manager","margPr",
      "mbar","mbarPr","mbaseJc","mbegChr","mborderBox","mborderBoxPr","mbox","mboxPr",
      "mchr","mcount","mctrlPr","md","mdeg","mdegHide","mden","mdiff","mdPr","me",
      "mendChr","meqArr","meqArrPr","mf","mfName","mfPr","mfunc","mfuncPr","mgroupChr",
      "mgroupChrPr","mgrow","mhideBot","mhideLeft","mhideRight","mhideTop","mhtmltag",
      "mlim","mlimloc","mlimlow","mlimlowPr","mlimupp","mlimuppPr","mm","mmaddfieldname",
      "mmath","mmathPict","mmathPr","mmaxdist","mmc","mmcJc","mmconnectstr",
      "mmconnectstrdata","mmcPr","mmcs","mmdatasource","mmheadersource","mmmailsubject",
      "mmodso","mmodsofilter","mmodsofldmpdata","mmodsomappedname","mmodsoname",
      "mmodsorecipdata","mmodsosort","mmodsosrc","mmodsotable","mmodsoudl",
      "mmodsoudldata","mmodsouniquetag","mmPr","mmquery","mmr","mnary","mnaryPr",
      "mnoBreak","mnum","mobjDist","moMath","moMathPara","moMathParaPr","mopEmu",
      "mphant","mphantPr","mplcHide","mpos","mr","mrad","mradPr","mrPr","msepChr",
      "mshow","mshp","msPre","msPrePr","msSub","msSubPr","msSubSup","msSubSupPr","msSup",
      "msSupPr","mstrikeBLTR","mstrikeH","mstrikeTLBR","mstrikeV","msub","msubHide",
      "msup","msupHide","mtransp","mtype","mvertJc","mvfmf","mvfml","mvtof","mvtol",
      "mzeroAsc","mzeroDesc","mzeroWid","nesttableprops","nextfile","nonesttables",
      "objalias","objclass","objdata","object","objname","objsect","objtime","oldcprops",
      "oldpprops","oldsprops","oldtprops","oleclsid","operator","panose","password",
      "passwordhash","pgp","pgptbl","picprop","pict","pn","pnseclvl","pntext","pntxta",
      "pntxtb","printim","private","propname","protend","protstart","protusertbl","pxe",
      "result","revtbl","revtim","rsidtbl","rxe","shp","shpgrp","shpinst",
      "shppict","shprslt","shptxt","sn","sp","staticval","stylesheet","subject","sv",
      "svb","tc","template","themedata","title","txe","ud","upr","userprops",
      "wgrffmtfilter","windowcaption","writereservation","writereservhash","xe","xform",
      "xmlattrname","xmlattrvalue","xmlclose","xmlname","xmlnstbl",
      "xmlopen" ]),
    specialchars = Dict{String,String}([
      "par" => "\n",
      "sect" => "\n\n",
      "page" => "\n\n",
      "line" => "\n",
      "tab" => "\t",
      "emdash" => "\u2014",
      "endash" => "\u2013",
      "emspace" => "\u2003",
      "enspace" => "\u2002",
      "qmspace" => "\u2005",
      "bullet" => "\u2022",
      "lquote" => "\u2018",
      "rquote" => "\u2019",
      "ldblquote" => "\u201C",
      "rdblquote" => "\u201D" ])
    global striprtf
    function striprtf(text::AbstractString)
        stack = Tuple{Int,Bool}[]
        ignorable = false       # Whether this group (and all inside it) are "ignorable".
        ucskip = 1              # Number of ASCII characters to skip after a unicode character.
        curskip = 0             # Number of ASCII characters left to skip
        out = IOBuffer()        # Output buffer.
        for match in eachmatch(pattern, text)
            word,arg,hex,char,brace,tchar = match.captures
            if brace !== nothing
                curskip = 0
                if brace == "{"
                    # Push state
                    push!(stack, (ucskip,ignorable))
                elseif brace == "}"
                    # Pop state
                    ucskip,ignorable = pop!(stack)
                end
            elseif char !== nothing # \x (not a letter)
                curskip = 0
                ch = only(char)
                if ch == '~'
                    !ignorable && print(out, '\ua0')
                elseif ch in "{}\\"
                    !ignorable && print(out, char)
                elseif ch == '*'
                    ignorable = true
                end
            elseif word !== nothing # \foo
                curskip = 0
                if word in destinations
                    ignorable = true
                elseif ignorable
                    nothing
                elseif word in keys(specialchars)
                    print(out, specialchars[word])
                elseif word == "uc"
                    ucskip = parse(Int, arg)
                elseif word == "u"
                    c = parse(Int, arg)
                    c < 0 && (c += 0x10000)
                    print(out, Char(c))
                    curskip = ucskip
                end
            elseif hex !== nothing # \'xx
                if curskip > 0
                    curskip -= 1
                elseif !ignorable
                    c = parse(Int, hex, base=16)
                    print(out, Char(c))
                end
            elseif tchar !== nothing
                if curskip > 0
                    curskip -= 1
                elseif !ignorable
                    print(out, tchar)
                end
            end
        end
        return String(take!(out))
    end
end

For example, this gives:

julia> striprtf(raw"""
        {\rtf1\ansi{\fonttbl\f0\fswiss Helvetica;}\f0\pard
        This is some {\b bold} text.\par
        }""")
"This is some bold text.\n"

e3c6 · May 9, 2023, 7:39pm

This should be in a package

stevengj · May 9, 2023, 7:47pm

Might be worthwhile first to look at other sources, e.g. this Python striprtf package, which looks like it is an upgraded version of the same Stack Overflow answer, to see if anything is missing.

stevengj · May 10, 2023, 1:49am

I created a draft package: JuliaStrings/StripRTF.jl (strip RTF to plain text)

The implementation is more complicated than the one posted above, mainly because it needs to deal with messy encoding issues. (RTF is only marginally Unicode-aware.)
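
To illustrate the kind of encoding issue involved: a \'xx escape is a byte in the document's code page, not a Unicode code point, so Char(c) as in the function posted above is only right where the code page agrees with Latin-1. Below is a minimal sketch, assuming the document uses Windows-1252 (a common case, but an assumption here). The names `CP1252_HIGH` and `decode_cp1252` are hypothetical, and this is not necessarily how StripRTF.jl does it:

```julia
# Windows-1252 differs from Latin-1 only in the range 0x80-0x9F.
# Bytes that are undefined in Windows-1252 (0x81, 0x8D, 0x8F, 0x90, 0x9D)
# are not in the table and fall through to Char(b).
const CP1252_HIGH = Dict{UInt8,Char}(
    0x80 => '\u20ac', 0x82 => '\u201a', 0x83 => '\u0192', 0x84 => '\u201e',
    0x85 => '\u2026', 0x86 => '\u2020', 0x87 => '\u2021', 0x88 => '\u02c6',
    0x89 => '\u2030', 0x8a => '\u0160', 0x8b => '\u2039', 0x8c => '\u0152',
    0x8e => '\u017d', 0x91 => '\u2018', 0x92 => '\u2019', 0x93 => '\u201c',
    0x94 => '\u201d', 0x95 => '\u2022', 0x96 => '\u2013', 0x97 => '\u2014',
    0x98 => '\u02dc', 0x99 => '\u2122', 0x9a => '\u0161', 0x9b => '\u203a',
    0x9c => '\u0153', 0x9e => '\u017e', 0x9f => '\u0178')

# Decode one Windows-1252 byte to a Unicode character.
decode_cp1252(b::UInt8) = get(CP1252_HIGH, b, Char(b))

# Convenience method for the two hex digits captured from a \'xx escape.
decode_cp1252(hex::AbstractString) = decode_cp1252(parse(UInt8, hex, base=16))

println(decode_cp1252("e9"))  # prints é (same as Latin-1)
println(decode_cp1252("93"))  # prints “ (Char(0x93) would be a control character)
```

A real converter also has to honor whichever code page the document declares, which this sketch ignores.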

j-fu · May 10, 2023, 10:38am

Did you try pandoc?

e3c6 · May 13, 2023, 1:13pm

Thanks. Will you register it? I was just needing a tool like this

stevengj · May 13, 2023, 4:01pm

Already done: New package: StripRTF v1.0.0 by JuliaRegistrator · Pull Request #83294 · JuliaRegistries/General · GitHub
