Parsing WhatsApp text chat

Hello,
I am trying to parse a WhatsApp text chat into a DataFrame. What I need is to parse this format:

06/18/2021 20:48 - +36 41 9989-8989: Hello, how are you? <= all in one column
or
06/18/2021 20:48 - Paul Faben: Hello, how are you? <= all in one column

to get this table:

Date | Time | Number | Text <= 4 columns

Any help would be great, as I am new to Julia…
Thanks

I am sure it’s possible to implement a more efficient version of this by writing a custom parser, but for a simple version you can just use a regex together with the Dates stdlib:

using DataFrames, Dates

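function read_chats(str)
    # Start with an empty, typed DataFrame and append one row per chat line
    df = DataFrame(date=Date[], time=Time[], number=String[], text=String[])
    # The `m` flag makes ^ and $ match at the start and end of every line
    for row in eachmatch(r"^(?<date>.+?) (?<time>.+?) \- (?<number>.*?)\: (?<text>.*)$"m, str)
        date = parse(Date, row["date"], dateformat"mm/dd/yyyy")
        time = parse(Time, row["time"])
        # `number` holds either a phone number or a contact name
        push!(df, (; date, time, number=row["number"], text=row["text"]))
    end
    return df
end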

r"^(?<date>.+?) (?<time>.+?) \- (?<number>.*?)\: (?<text>.*)$"m is a regex describing the format your lines are in, where each part you are interested in is a named group. The m flag makes ^ and $ match at the start and end of each line rather than of the whole string. You can then use parse to convert the date and time strings to Date and Time objects.
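To see what the named groups capture, here is a minimal sketch that runs the pattern on a single line. The variable names (line, pattern, parsed_date, parsed_time) are only for illustration, and the pattern drops the backslashes before - and :, which I believe are not needed there:

```julia
using Dates

# One sample line in the format from the question
line = "06/18/2021 20:48 - Paul Faben: Hello, how are you?"

# Same pattern as in read_chats, without the redundant escapes
pattern = r"^(?<date>.+?) (?<time>.+?) - (?<number>.*?): (?<text>.*)$"m

# `match` returns a RegexMatch, or `nothing` if the line does not fit the format
m = match(pattern, line)

# Named groups are looked up by name and come back as SubStrings
parsed_date = parse(Date, m["date"], dateformat"mm/dd/yyyy")
parsed_time = parse(Time, m["time"])
number = String(m["number"])
text = String(m["text"])
```

Because match returns nothing for a line that does not fit the format, it is worth checking for that before indexing into m when working with real exports.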

Reading in two rows:

julia> read_chats("""
       06/18/2021 20:48 - +36 41 9989-8989: Hello, how are you?
       06/18/2021 20:48 - Paul Faben: Hello, how are you?""")
2×4 DataFrame
 Row │ date        time      number            text                
     │ Date        Time      String            String              
─────┼─────────────────────────────────────────────────────────────
   1 │ 2021-06-18  20:48:00  +36 41 9989-8989  Hello, how are you?
   2 │ 2021-06-18  20:48:00  Paul Faben        Hello, how are you?

To read from a file, you can just read it into a string using read(filename, String).
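As a rough end-to-end sketch of the file case: the snippet below writes two sample lines to a temporary file and reads them back. The file name chat.txt is made up for the demo, and read_chats is repeated only so that the snippet runs on its own:

```julia
using DataFrames, Dates

# Same function as above, repeated so this snippet is self-contained
function read_chats(str)
    df = DataFrame(date=Date[], time=Time[], number=String[], text=String[])
    for row in eachmatch(r"^(?<date>.+?) (?<time>.+?) \- (?<number>.*?)\: (?<text>.*)$"m, str)
        date = parse(Date, row["date"], dateformat"mm/dd/yyyy")
        time = parse(Time, row["time"])
        push!(df, (; date, time, number=row["number"], text=row["text"]))
    end
    return df
end

# Hypothetical export file, created in a temporary directory for the demo
path = joinpath(mktempdir(), "chat.txt")
write(path, """
    06/18/2021 20:48 - +36 41 9989-8989: Hello, how are you?
    06/18/2021 20:48 - Paul Faben: Hello, how are you?
    """)

# Read the whole file into one String and hand it to the parser
df = read_chats(read(path, String))
```

With a real export, replace path with the location of your own file. Lines that do not match the pattern are silently skipped by eachmatch.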

Awesome! Thank you very much @simeonschaub