Hello,
I am trying to parse a WhatsApp text chat into a DataFrame. What I need is to parse this format:
06/18/2021 20:48 - +36 41 9989-8989: Hello, how are you? <= all in one column
or
06/18/2021 20:48 - Paul Faben: Hello, how are you? <= all in one column
to get this table:
Date | Time | Number | Text <= 4 columns
Any help would be great, as I am new to Julia…
Thanks
I am sure it's possible to implement a more efficient version of this by writing a custom parser, but to implement a simple version, you can just use a regex together with the Dates stdlib:
using DataFrames, Dates

function read_chats(str)
    # Start with an empty, typed table and add one row per matching line
    df = DataFrame(date=Date[], time=Time[], number=String[], text=String[])
    for row in eachmatch(r"^(?<date>.+?) (?<time>.+?) \- (?<number>.*?)\: (?<text>.*)$"m, str)
        # Convert the captured strings to Date and Time values
        date = parse(Date, row["date"], dateformat"mm/dd/yyyy")
        time = parse(Time, row["time"])
        push!(df, (; date, time, number=row["number"], text=row["text"]))
    end
    return df
end
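r"^(?<date>.+?) (?<time>.+?) \- (?<number>.*?)\: (?<text>.*)$"m is a regex describing the format your lines are in, where each part you are interested in is a named group. The m flag makes ^ and $ match at the start and end of every line rather than of the whole string, so eachmatch yields one match per chat line. You can then use parse to convert the captured date and time strings to Date and Time objects.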
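To see what the named groups capture, here is a minimal sketch that applies the same pattern to a single line with match instead of eachmatch. The variable names line, m, d and t are only for illustration and are not part of the function above.

```julia
using Dates

# One sample line in the format from the question
line = "06/18/2021 20:48 - Paul Faben: Hello, how are you?"

# Same pattern as in read_chats, applied once with `match`
m = match(r"^(?<date>.+?) (?<time>.+?) \- (?<number>.*?)\: (?<text>.*)$"m, line)

# Each named group is available by name as a substring of the line
m["date"]     # "06/18/2021"
m["time"]     # "20:48"
m["number"]   # "Paul Faben"
m["text"]     # "Hello, how are you?"

# The date and time strings can then be converted to typed values
d = parse(Date, m["date"], dateformat"mm/dd/yyyy")
t = parse(Time, m["time"])
```

Because the number group is lazy, it stops at the first ": " on the line, so any colons later in the message stay in the text group.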
Reading in two rows:
julia> read_chats("""
06/18/2021 20:48 - +36 41 9989-8989: Hello, how are you?
06/18/2021 20:48 - Paul Faben: Hello, how are you?""")
2×4 DataFrame
 Row │ date        time      number            text
     │ Date        Time      String            String
─────┼──────────────────────────────────────────────────────────────
   1 │ 2021-06-18  20:48:00  +36 41 9989-8989  Hello, how are you?
   2 │ 2021-06-18  20:48:00  Paul Faben        Hello, how are you?
To read from a file, you can just read it into a string using read(filename, String).
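If the export is large, a variation is to go through the file line by line with eachline instead of reading it into one string first. The sketch below is only one way to do that, not the approach from the answer: read_chat_file is a hypothetical name, and the temporary file created with mktemp stands in for a real chat export.

```julia
using DataFrames, Dates

# Hypothetical variant: parse the chat file one line at a time
function read_chat_file(filename)
    df = DataFrame(date=Date[], time=Time[], number=String[], text=String[])
    pattern = r"^(?<date>.+?) (?<time>.+?) \- (?<number>.*?)\: (?<text>.*)$"
    for line in eachline(filename)
        m = match(pattern, line)
        m === nothing && continue   # skip lines that do not fit the format
        date = parse(Date, m["date"], dateformat"mm/dd/yyyy")
        time = parse(Time, m["time"])
        push!(df, (; date, time, number=m["number"], text=m["text"]))
    end
    return df
end

# Usage: write two sample lines to a temporary file and parse it
path, io = mktemp()
write(io, "06/18/2021 20:48 - +36 41 9989-8989: Hello, how are you?\n")
write(io, "06/18/2021 20:48 - Paul Faben: Hello, how are you?\n")
close(io)

df = read_chat_file(path)
```

Lines that do not match the pattern, such as the continuation of a multi-line message, are skipped here rather than appended to the previous message.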
Awesome! Thank you very much @simeonschaub