Hello,
I am trying to parse a WhatsApp text chat into a DataFrame. What I need is to parse this format:
06/18/2021 20:48 - +36 41 9989-8989: Hello, how are you?
<= all in one column
or
06/18/2021 20:48 - Paul Faben: Hello, how are you?
<= all in one column
to get this table:
Date | Time | Number | Text
<= 4 columns
Any help would be great, as I am new to Julia…
Thanks
I am sure it's possible to implement a more efficient version of this by writing a custom parser, but for a simple version you can just use a regex together with the Dates stdlib:
using DataFrames, Dates

function read_chats(str)
    df = DataFrame(date=Date[], time=Time[], number=String[], text=String[])
    for row in eachmatch(r"^(?<date>.+?) (?<time>.+?) \- (?<number>.*?)\: (?<text>.*)$"m, str)
        date = parse(Date, row["date"], dateformat"mm/dd/yyyy")
        time = parse(Time, row["time"])
        push!(df, (; date, time, number=row["number"], text=row["text"]))
    end
    return df
end
r"^(?<date>.+?) (?<time>.+?) \- (?<number>.*?)\: (?<text>.*)$"m
is a regex describing the format your lines are in, where each part you are interested in is a named group. You can then use parse to convert the dates and times to Date and Time objects.
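To see what each named group captures, here is a minimal sketch that applies the same regex to a single line from the question. The variable names (line, pattern, m, day, time_of_day) are only illustrative, and the explicit dateformat"HH:MM" for the time is an assumption about the input rather than something the function above requires:

```julia
using Dates

# One sample line, taken from the question
line = "06/18/2021 20:48 - Paul Faben: Hello, how are you?"

# Same pattern as in read_chats; each (?<name>...) is a named group
pattern = r"^(?<date>.+?) (?<time>.+?) \- (?<number>.*?)\: (?<text>.*)$"m

m = match(pattern, line)

m["date"]    # "06/18/2021"
m["time"]    # "20:48"
m["number"]  # "Paul Faben"
m["text"]    # "Hello, how are you?"

# Convert the captured strings into Date and Time values
day = parse(Date, m["date"], dateformat"mm/dd/yyyy")
time_of_day = parse(Time, m["time"], dateformat"HH:MM")
```

Because the groups before the message text are lazy, the number group stops at the first ": ", so a colon inside the message itself should still end up in text.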
Reading in two rows:
julia> read_chats("""
06/18/2021 20:48 - +36 41 9989-8989: Hello, how are you?
06/18/2021 20:48 - Paul Faben: Hello, how are you?""")
2×4 DataFrame
 Row │ date        time      number            text
     │ Date        Time      String            String
─────┼────────────────────────────────────────────────────────────
   1 │ 2021-06-18  20:48:00  +36 41 9989-8989  Hello, how are you?
   2 │ 2021-06-18  20:48:00  Paul Faben        Hello, how are you?
To read from a file, you can just read it into a string using read(filename, String).
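As a rough sketch of the file route, the snippet below writes two sample lines to a temporary file (standing in for an exported chat, which is an assumption here), reads the file back into one string, and runs the same regex over it. The names path, io, str, pattern and senders are illustrative only. It uses just the regex so it runs without DataFrames; in practice you would pass str straight to read_chats.

```julia
# Create a temporary file that stands in for the exported chat
path, io = mktemp()
write(io, """
06/18/2021 20:48 - +36 41 9989-8989: Hello, how are you?
06/18/2021 20:48 - Paul Faben: Hello, how are you?
""")
close(io)

# Read the whole file into a single String
str = read(path, String)

# The `m` flag lets ^ and $ match at every line of the multi-line string
pattern = r"^(?<date>.+?) (?<time>.+?) \- (?<number>.*?)\: (?<text>.*)$"m

# Collect the sender of every matching line
senders = [m["number"] for m in eachmatch(pattern, str)]

# Clean up the temporary file
rm(path)
```

With a real export, replace the temporary file with your own path, for example read_chats(read("chat.txt", String)), where chat.txt is a hypothetical file name.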
Awesome! Thank you very much @simeonschaub