Building a WhatsApp Analyzer in Pharo: Parsing (Part 1)

As part of getting familiar with the DataFrame library and with Pharo, I decided to build a WhatsApp chat analyzer in it. This post walks through the steps to parse the exported *.txt file into a DataFrame object.

If you are unfamiliar with Pharo, check out Pharo by Example as well as this blog post.

Prerequisites

1. Basics of Pharo/Smalltalk

You should know the basics of Pharo, such as installing packages, defining classes and subclasses, and message passing. These are well covered in the blog post linked above and in Pharo by Example (the first 3 chapters are enough for this post).

2. WhatsApp chat

The first thing we need is a WhatsApp chat log. Open any WhatsApp group or personal chat, tap on options (3 dots on the top right), then on More, and select Export chat (without media). You may save it to your Google Drive, email the chat to yourself, or send it to one of your contacts and download it from there.

Whatsapp Export

I have uploaded a sample WhatsApp group chat at this link, which you may use to follow this tutorial. The sample has 247 lines from 50 users.

3. Pharo with the DataFrame library

You need Pharo with the DataFrame library installed. For this post, I used Pharo 7.0. You will find the instructions to install and use it in the book Pharo by Example.

Open a Playground, paste the following, and run it. It will install the DataFrame library. Save the image so that you won’t need to install it again.

Metacello new
  baseline: 'DataFrame';
  repository: 'github://PolyMathOrg/DataFrame/src';
  load.

An excellent resource for learning about this library is Dataframe by Example, which goes over the messages that you can use, and scenarios in which you might use them.

Reading the chat

Parsing lines

Here is what a line in the exported chat looks like:

4/10/19, 6:18 PM - Member14: Don't have a surname, it's empty in passport as well, what to fill in access via form in surname field, name?

We get four fields from a line - Date, Time, Author, and Message, in the following format:

<m/d/yy>, <h:mm (AM/PM)> - <author>: <message>

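The date runs up to the first comma, the time from that comma to the hyphen, and the author from the hyphen to the first colon after it (the time contains a colon of its own, so this is the second colon in the line). The message is the rest of the line.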
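The index arithmetic behind this split can be tried in a Playground before wrapping it in a class. The following is a minimal sketch: the sample line is a shortened, made-up variant of the one above, and the variable names are only illustrative.

```smalltalk
| line commaIndex hyphenIndex colonIndex date time author message |
"Shortened sample line, used only for illustration."
line := '4/10/19, 6:18 PM - Member14: Hello there'.

"Locate the three separators."
commaIndex := line indexOf: $,.
hyphenIndex := line indexOf: $-.
"Start searching at the hyphen so the colon inside the time is skipped."
colonIndex := line indexOf: $: startingAt: hyphenIndex.

"Cut out the four fields, dropping the separators and the spaces around them."
date := line copyFrom: 1 to: commaIndex - 1.
time := line copyFrom: commaIndex + 2 to: hyphenIndex - 2.
author := line copyFrom: hyphenIndex + 2 to: colonIndex - 1.
message := line copyFrom: colonIndex + 2 to: line size.

{ date. time. author. message }
```

Inspecting the last expression shows the four fields as '4/10/19', '6:18 PM', 'Member14' and 'Hello there'.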

However, a chat also contains non-message lines, generated when the group name is changed, the group photo is updated, or members are added or removed, such as these:

4/10/19, 8:08 AM - Messages to this group are now secured with end-to-end encryption. Tap for more info.
3/28/19, 8:39 AM - Member0 created group "GROUPNAME"
4/10/19, 8:08 AM - Member0 added you
4/10/19, 8:29 AM - Member0 added Member4
4/10/19, 8:36 AM - Member0 added Member6

These can be detected because they (mostly) contain no colon other than the one in the time. We ignore such lines since they are system-generated and not users’ messages. Some messages may span multiple lines, such as:

4/10/19, 9:44 AM - Member5: Hi
Even I got the admit last night only.
Did go thru the links here, pretty informative.
So the next step is just to accept and send transcripts right? Anything else for the time being ?

An easy way to detect them is to check whether a line starts with a date, using the regex \d(\d)?/\d(\d)?/\d\d. You can learn what regexes are online and test them at regex101. Lines that match the regex should be added as a new row, and those that don’t should be appended to the previous row.

We’ll use the above four fields as columns in our dataframe, along with the checks discussed, to parse the lines. Parsing is easier if we define a custom class with appropriate messages. We create a class WhatsappReader inside the WhatsappAnalyzer package.
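Both checks, the date prefix and the colon after the hyphen, can be tried together in a Playground. This is only a sketch: classify is a hypothetical helper block written for this illustration, not part of the DataFrame library or of the reader class built in this post.

```smalltalk
| dateRegex classify |
dateRegex := '\d(\d)?/\d(\d)?/\d\d'.

"classify is a hypothetical helper that labels a line as
 #message, #system or #continuation."
classify := [ :line |
    ((line copyUpTo: $,) matchesRegex: dateRegex)
        ifFalse: [ #continuation ]
        ifTrue: [
            "indexOf:startingAt: answers 0 when there is no colon after the hyphen."
            (line indexOf: $: startingAt: (line indexOf: $-)) = 0
                ifTrue: [ #system ]
                ifFalse: [ #message ] ] ].

{ classify value: '4/10/19, 9:44 AM - Member5: Hi'.
  classify value: 'Even I got the admit last night only.'.
  classify value: '4/10/19, 8:08 AM - Member0 added you' }
```

The three sample lines come out as #message, #continuation and #system respectively. Note that matchesRegex: has to match the whole receiver, which is why only the part before the first comma is tested.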

DataFrameReader subclass: #WhatsappReader
  instanceVariableNames: ''
  classVariableNames: ''
  package: 'WhatsappAnalyzer'

The parent class is DataFrameReader, since that lets us use the DataFrame readFrom:using: method, discussed in the next section. Add the following method on the class side:

insert: line into: df
  "Parses the given line and inserts it as a row in the dataframe. A utility method."

  | commaIndex hyphenIndex colonIndex oldRow |
  (line isNotEmpty) ifTrue: [
    ((line copyUpTo: $,) matchesRegex: '\d(\d)?/\d(\d)?/\d\d') 
      ifTrue: [ 
        commaIndex := line indexOf: $,.
        hyphenIndex := line indexOf: $-.
        colonIndex := line indexOf: $: startingAt: hyphenIndex.
        (colonIndex ~= 0 ) ifTrue: [
          df addRow: { 
            line copyUpTo: $, .
            line copyFrom: commaIndex+2 to: hyphenIndex-2 .
            line copyFrom: hyphenIndex+2 to: colonIndex-1 .
            (line copyFrom: colonIndex+1 to: line size) allButFirst
          } named: df numberOfRows + 1.
        ]
      ]
      ifFalse: [
        oldRow := df row: df numberOfRows.
        oldRow at: 'Message' put: ((oldRow at: 'Message') , String cr , line).
        df removeRow: df numberOfRows.
        df addRow: oldRow.
      ]
  ]

Lines 8 to 17 of the method (from the commaIndex assignment to the addRow:named: send) parse a valid message line. We copy parts of the line using the copyUpTo: and copyFrom:to: messages, and add them to the dataframe using the addRow:named: message. Note that we use { }, which denotes a dynamic array.

Line 6 (the matchesRegex: test) checks whether the line starts with a valid date. If it doesn’t, lines 21 to 24 (the ifFalse: branch) append the line to the previous row.

Line 11 (the colonIndex ~= 0 test) serves as a check for a valid message line: if the line doesn’t have a second :, it is a line that we are not interested in, such as when the group name is changed.

You can try it out using the following code in Playground. Try all the possible lines (valid/invalid/multiline).

line := '4/10/19, 6:18 PM - Member14: Don''t have a surname, it''s empty in passport as well, what to fill in access via form in surname field, name?'.
df := DataFrame withColumnNames: #('Date' 'Time' 'Author' 'Message').
WhatsappReader insert: line into: df.
"Inspect df to see the change."
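The multiline and system-line paths can be exercised the same way. The following sketch assumes the insert:into: method above is already installed on the class side of WhatsappReader; the sample lines are taken from the excerpts earlier in this post.

```smalltalk
| df |
df := DataFrame withColumnNames: #('Date' 'Time' 'Author' 'Message').

"A valid message line: should become row 1."
WhatsappReader insert: '4/10/19, 9:44 AM - Member5: Hi' into: df.

"No date prefix: should be appended to the Message cell of row 1."
WhatsappReader insert: 'Even I got the admit last night only.' into: df.

"Date but no colon after the hyphen: a system line, should be ignored."
WhatsappReader insert: '4/10/19, 8:29 AM - Member0 added Member4' into: df.

"Inspect these two expressions."
df numberOfRows.
(df row: 1) at: 'Message'.
```

If the method behaves as described, df should still hold a single row, and its Message cell should contain both the first line and its continuation.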

Reading the chat file

The DataFrame library has a method readFrom:using: which lets us write custom readers that load data into a dataframe. One such reader is DataFrameCsvReader, whose implementation can be found here.

To achieve a similar result, we have subclassed DataFrameReader, which declares readFrom: as a subclass responsibility. Now we need to override that method, and we do so as follows:

readFrom: chatFileReference
  "Reads a WhatsApp chat and adds it to the dataframe. Takes a FileReference as its argument."
  
  | df line |
  df := self createDataFrame.
  chatFileReference readStreamDo: [ :inputStream |
    [ inputStream atEnd ] whileFalse: [ 
      line := inputStream nextLine.
      self class insert: line into: df.
      ]
     ].
  ^ df

We read the FileReference line by line using readStreamDo: and nextLine. The loop runs until inputStream atEnd answers true, so every line gets read. readStreamDo: closes the stream once the block has finished executing. createDataFrame is defined as follows:
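The same atEnd / nextLine loop can be tried without a file. In this sketch an in-memory read stream over a string stands in for the stream that readStreamDo: hands to its block; the two-line text is made up for the illustration.

```smalltalk
| stream lines line |
"An in-memory stream over a two-line string, standing in for the file stream."
stream := ('first line' , String cr , 'second line') readStream.
lines := OrderedCollection new.

"Same loop shape as in readFrom: : keep reading until the stream is exhausted."
[ stream atEnd ] whileFalse: [
    line := stream nextLine.
    lines add: line ].

lines
```

Inspecting lines shows the two strings with the line ending stripped, which is what insert:into: receives for each line of the chat file.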

createDataFrame 
  "Creates an empty dataframe with the four columns parsed from a WhatsApp chat"

  ^ DataFrame withColumnNames: #('Date' 'Time' 'Author' 'Message')

Now we can finally run the program to get the parsed chat.

fileRef := '/path/to/group0.txt' asFileReference.
reader := WhatsappReader new.
df := DataFrame readFrom: fileRef using: reader.

Inspecting df, we see the following:

Parser Output

We were able to parse 220 lines of data (including multiline messages), and ignored 27 lines.

The resulting code is available at my GitHub: Whatsapp-Analyzer.

In the following parts, we’ll clean the data, find interesting metrics on it, and plot them!