In the previous posts, we parsed a WhatsApp *.txt chat file and created helper methods for preprocessing. In this one, we will analyze the messages stored in the dataframe obtained from the previous parts.

Prerequisites

1. Parsed df object

Follow Part 1 to obtain a df object with messages stored in it. A sample chat has been added for convenience.

2. ChatFeatures and ChatCleaner classes

These classes, defined in Part 2, will be used in this part. You can also clone the repo mentioned at the end of Part 2 to obtain them.

Preparation

Let’s create the following class, which will be used to analyze the dataframe.

Object subclass: #ChatAnalyzer
   instanceVariableNames: ''
   classVariableNames: ''
   package: 'WhatsappAnalyzer'

We’ll also create a class-side method:

getAuthorsFrom: df
   "Returns an OrderedCollection of authors present in the dataframe"

   ^ (df column: #Author) asSet.
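
You can call it as:

authors := ChatAnalyzer getAuthorsFrom: df.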

Analysis

Basic Analysis

The total number of messages exchanged can be found with df size, since we already removed system messages (e.g. ABC changed the group name to XYZ) and handled multi-line messages in Part 2. We can count messages like This message was deleted and media placeholders (<Media omitted>) by filtering rows with select:, the inverse of the reject: used in ChatCleaner removeMessages:from:. (Note: the error blocks below are due to a bug in the library, which will be fixed soon.)

We will store all results in a Dictionary, which can later be converted to JSON using NeoJSON.
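
For example, a minimal sketch of that conversion (assuming the NeoJSON package is loaded in your image; results stands for the Dictionary assembled below):

"Assumes NeoJSON is installed; results is the analysis Dictionary"
json := NeoJSONWriter toStringPretty: results.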

Here is a method that counts the basic message types:

getMessageCounts: df
   "Counts total messages, text messages, media and deleted messages"

   | messageCounts tempDf |
   messageCounts := Dictionary new.
   messageCounts add: 'Total'->(df size).
   [
      tempDf := df select: [ :row | (row at: #Message) = '<Media omitted>' ].
      messageCounts add: 'Media'->(tempDf size).
   ] ifError: [
      messageCounts add: 'Media'->0.
   ].
   [
      tempDf := df select: [ :row | (row at: #Message) = 'This message was deleted' ].
      messageCounts add: 'Deleted'->(tempDf size).
   ] ifError: [
      messageCounts add: 'Deleted'->0.
   ].
   messageCounts add: 'Text'->((df size) - (messageCounts at: 'Media') - (messageCounts at: 'Deleted')).
   ^ messageCounts

You can use it as:

messageCounts := ChatAnalyzer getMessageCounts: df.

By filtering on the Author field, we can do the same to get each author's message counts.

getAuthorMessageCounts: df
   "Generates message counts for each author"

   | messageCounts authors authorDf |
   messageCounts := Dictionary new.
   authors := (df column: #Author) asSet.
   authors do: [ :author |
      authorDf := df select: [ :row | (row at: #Author) = author ].
      messageCounts add: author->(self getMessageCounts: authorDf).
   ].
   ^ messageCounts

You can add it to the previous messageCounts as:

messageCounts add: 'Authors'->(ChatAnalyzer getAuthorMessageCounts: df).

Now we can extract metrics such as the most active user and the average messages per user. To group them, we create yet another method:

getBasicTextAnalysis: df
   "Returns a dictionary with the following metrics:
      Most active user w/count
      Most active media user w/count
      Messages/user
   "

   | messageCounts textAnalysis |
   textAnalysis := Dictionary new.
   messageCounts := self getMessageCounts: df.
   messageCounts add: 'Authors'->(self getAuthorMessageCounts: df).

   "Calculate most active user"
   textAnalysis add: 'Most active user'->''.
   textAnalysis add: 'Most active user messages'->0.

   "Average messages per user"
   textAnalysis add: 'Messages/user'->0.
   (messageCounts at: 'Authors') keysAndValuesDo: [ :key :value |
      ((textAnalysis at: 'Most active user messages') < (value at: 'Text')) ifTrue: [
         textAnalysis at: 'Most active user messages' put: (value at: 'Text').
         textAnalysis at: 'Most active user' put: key.
      ].
      textAnalysis at: 'Messages/user' put: ((textAnalysis at: 'Messages/user') + (value at: 'Text')).
   ].

   textAnalysis at: 'Messages/user' put: ((textAnalysis at: 'Messages/user') asFloat / (messageCounts at: 'Authors') size).
   messageCounts add: 'Basic Text Analysis'->textAnalysis.

   ^ messageCounts

We iterate through each author's summary using (messageCounts at: 'Authors') keysAndValuesDo: and count occurrences as needed. For example, the most active user is found by taking the maximum of the 'Text' key across authors.
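
You can run it as:

messageCounts := ChatAnalyzer getBasicTextAnalysis: df.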

Additional fields for this analysis are present in my repo. Here is how the resulting messageCounts looks:

Basic text analysis

Emojis

1. Most frequently used emojis

In the last part, we added ChatCleaner getEmojisFrom:, which returns a DataSeries containing only the emojis in each message. We'll iterate through the series, counting the emojis encountered and storing them in emojiCount, which is a DataSeries.

getEmojiCountFrom: emojiDs
   "Returns an DataSeries having an emoji as key and counts as values"

   | emojiCount |
   emojiCount := DataSeries new name: #EmojiCount.
   emojiDs do: [ :message |
      message do: [ :emoji |
         emojiCount at: emoji
            transform: [ :count | count + 1 ]
            ifAbsent: [ emojiCount add: emoji->1 ].
         ]
      ].
   ^ emojiCount sortDescending.

We can now get this result:

emojiDs := ChatCleaner getEmojisFrom: df.
ChatAnalyzer getEmojiCountFrom: emojiDs.
a DataSeries($🏻->7 $👍->5 $♂->3 $‍->3 $🙋->3 $🙂->2 $😅->1 $🤷->1 $🤔->1 $🎉->1 $😂->1 $🤨->1 $😍->1 $😓->1)
WhatsApp seems to encode emojis in the format <emoji><skin tone><gender>, which is why the skin-tone modifier ranks highest (the default tone is yellow), and ♂ also ranks high (since the default is female for some emojis). The 4th element in the series is Unicode 8205, the zero-width joiner, which occurs between the emoji and ♂. We can ignore these by adding an ignore set:

ignoreEmoji := Set withAll: #(127995 9794 8205 9792).
"and guard the counting statement in the inner loop of getEmojiCountFrom: with"
(ignoreEmoji includes: emoji asUnicode) ifFalse: [
   emojiCount at: emoji
      transform: [ :count | count + 1 ]
      ifAbsent: [ emojiCount add: emoji->1 ] ].
a DataSeries($👍->5 $🙋->3 $🙂->2 $😅->1 $🤷->1 $🤔->1 $🎉->1 $😂->1 $🤨->1 $😍->1 $😓->1)

2. Emojis used per person

We can extend our previous method by filtering on each author and counting the emojis used by that author.

getMemberEmojiCountFrom: df
   "Returns a Dictionary of DataSeries holding each member's emoji counts"

   | emojiPerPerson authors authorDf authorEmojiCount |
   emojiPerPerson := Dictionary new.
   authors := (df column: #Author) uniqueValues.
   authors do: [ :author |
      authorDf := df select: [ :row | (row at: #Author) = author ].
      authorEmojiCount := self getEmojiCountFrom: (ChatCleaner getEmojisFrom: authorDf).
      authorEmojiCount isEmpty ifFalse: [
         emojiPerPerson add: author->authorEmojiCount.
      ].
   ].
   ^ emojiPerPerson
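
You can call it as:

memberEmojiCounts := ChatAnalyzer getMemberEmojiCountFrom: df.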

We filter an author's posts with select:, get the emoji counts of the resulting dataframe, and add them to emojiPerPerson. Here is what the output looks like for the sample chat provided in Part 1: Emojis Per Author. The Inspector is unable to render some Unicode characters; however, you can export the result to CSV to see the output.

Frequently used phrases

To see phrases commonly used in a chat, we can count ngrams across all messages. This is done by iterating through the messages and adding ngrams to a dictionary, with the ngram as key and its count as value, similar to the counting of emojis.

getNgramCount: ngrams columnName: columnName
   "Returns a DataSeries with ngram counts."

   | nGramCount |
   nGramCount := DataSeries new name: columnName.

   ngrams do: [ :ngramCollection |
      ngramCollection do: [ :ngram |
         nGramCount at: ngram
            transform: [ :count | count + 1 ]
            ifAbsent: [ nGramCount add: ngram->1 ].
      ].
   ].

   ^ nGramCount sortDescending.

You can run it and see the output using:

blacklist := Set withAll: #('This message was deleted' '<Media omitted>').

nGramDf := df deepCopy.
nGramDf := ChatCleaner removeMessages: blacklist from: nGramDf.
nGramDf column: #Message put: ((ChatCleaner getWordsFrom: nGramDf) asArray).
nGramDf column: #Message put: ((ChatCleaner messagesAsLowercase: nGramDf) asArray).
nGramDf column: #Message put: ((ChatCleaner removeStopwordsFrom: nGramDf) asArray).
nGramDf addColumn: (ChatFeatures getNgramsFromDataFrame: nGramDf withN: 3) named: #Ngrams.

nGramCount := ChatAnalyzer getNgramCount: (nGramDf column:#Ngrams) columnName: #NgramCount.

To summarize, ChatCleaner removeMessages:from: drops the blacklisted messages, getWordsFrom: removes all punctuation and emojis, messagesAsLowercase: converts the remaining characters to lowercase, and removeStopwordsFrom: strips the stopwords from the messages. Finally, ChatFeatures getNgramsFromDataFrame:withN: adds a new column with ngrams, which is used to create nGramCount.

Here is what the output looks like: Ngram counts. The left is with stopwords removed and the right is without. Removing stopwords is only effective when the messages are longer. Also, the ngram methods in ChatFeatures accept ngrams shorter than n to prevent empty strings; this can be changed to yield only n-length ngrams by making sure the buffer is full before appending to the array.

The above method can be applied per author by applying filters, as done in the basic analysis; see the sketch below.
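
A minimal sketch of per-author ngram counts, assuming nGramDf has been prepared as above and getNgramCount:columnName: is the class-side method defined earlier:

"Count ngrams separately for each author"
authorNgramCounts := Dictionary new.
(nGramDf column: #Author) uniqueValues do: [ :author |
   | authorDf |
   authorDf := nGramDf select: [ :row | (row at: #Author) = author ].
   authorNgramCounts add: author->(ChatAnalyzer
      getNgramCount: (authorDf column: #Ngrams)
      columnName: #NgramCount) ].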

DateTime analysis

Grouping messages by date or time allows us to find the most frequent hour or date in the chat history. In Pharo's DataFrame, you can group one column by another using group:by:aggregateUsing:. For example, to find the most frequent date:

df
   group: #Message
   by: #Date
   aggregateUsing: #size.

It first groups messages by date, then calculates the size of each group, returning a DataSeries.
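
To pull the most frequent date out of that series, one option (a sketch, assuming sortDescending sorts the series by value as it did for emoji counts) is:

messagesPerDate := df
   group: #Message
   by: #Date
   aggregateUsing: #size.
(messagesPerDate sortDescending) keys first. "the busiest date"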

As for the most frequent hour, we first need to convert the time strings into hours, and then group:

df addColumn: ((df column: #Time) collect: [
      :element |
      element asTime hour
   ]) named: #Hour.

df
   group: #Message
   by: #Hour
   aggregateUsing: #size.

We can do the same for each user by filtering on Author. Here is how it looks for Member26:

(df select: [ :row | (row at: #Author) = 'Member26' ])
   group: #Message
   by: #Hour
   aggregateUsing: #size.
>>> a DataSeries(9->20 12->15 13->1)   

This indicates that the user is mostly active around 9 am and 12 pm.

Visualization

The following is an analysis of one of my chats. Visualization is done using Roassal. A good resource for learning it is http://agilevisualization.com/, along with its inbuilt browser containing multiple examples. I did not get enough time to learn Roassal fully, so my scripts are a bit of a hack, which is why I have only linked the output. If you are curious about the source code, or wish to see a detailed part on this, mail me.

Totals:
   Letters exchanged: 413,040
   Words exchanged: 80,474
   Media: 285
   Deleted: 132
   Total messages: 20,584
   Duration: 2017-02-10 to 2019-04-13
Basic text analysis:
   Most active user: Member1
   Most active user messages: 12104
   Most active user percentage: 60.02%
   Most active media user: Member1
   Most active user media: 226
   Average messages/user: 10,083.5

Message distribution by user:

Most common emoji: 😂 appeared 2344 times (80% of all emojis).

Top 10 emoji-wheel for Member1:

Note that I could not find Android emoji fonts for macOS, on which this was recorded. For reference, here are the top 10 emojis:

😂, 🤣, 🤦, 😛, 😅, 😒, 🤭, 😌, 😍, 🔥

Messages per hour by members: Messages per hour

Here is messages exchanged vs month: Messages per month

NameCloud for n=1: Word Count

NameCloud for n=3: Trigram

Next steps

With the last three tutorials, you have the tools to analyze your own chat data. You can extend this to other chats, such as Discord or even a mailing list, or add additional insights such as:

  • the member who initiates conversations (see the sketch below)
  • members' behavioral patterns
  • trending topics across chats/months

and so on.
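
For instance, a rough sketch of the first idea, assuming a conversation starts after a six-hour silence (an arbitrary threshold) and that df is in chronological order with #Date, #Time and #Author columns as in the earlier parts:

"Count conversation starts per author; the 6-hour gap is an assumption"
| initiations previous |
initiations := Dictionary new.
previous := nil.
df do: [ :row |
   | current |
   current := DateAndTime
      date: (row at: #Date) asDate
      time: (row at: #Time) asTime.
   "a new conversation begins when the gap exceeds the threshold"
   (previous isNil or: [ current - previous > 6 hours ]) ifTrue: [
      initiations
         at: (row at: #Author)
         update: [ :count | count + 1 ]
         initial: [ 1 ] ].
   previous := current ].
initiations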

If you spot any error in this series or can think of additional analysis that can be done, feel free to email me!