In the previous posts, we parsed a WhatsApp *.txt chat file and created helper methods for preprocessing. In this one, we will analyze the messages stored in the dataframe obtained from those parts.
Prerequisites
1. Parsed df object
Follow Part 1 to obtain a df object with messages stored in it. A sample chat has been added for convenience.
2. ChatFeatures and ChatAnalyzer classes
These classes, defined in Part 2, will be used in this part. You can also clone the repo mentioned at the end of Part 2 to obtain them.
Preparation
Let’s start with the following helper method, which will be used when analyzing the dataframe.
getAuthorsFrom: df
    "Returns a Set of the unique authors present in the dataframe"
    ^ (df column: #Author) asSet
Analysis
Basic Analysis
The total number of messages exchanged is simply df size, since we already removed system messages (e.g. ABC changed the group name to XYZ) and handled multi-line messages in Part 2. We can count messages like 'This message was deleted' as well as media placeholders ('<Media omitted>') by filtering the dataframe, similar to ChatCleaner removeMessages:from:. (Note: the error block is due to a bug in the library, which will be fixed soon.)
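For example, here is a rough sketch of these counts using select:. The literal marker strings are assumptions that depend on the language and version of your WhatsApp export, so adjust them to match your file:

"Hypothetical counting sketch; marker strings depend on your export"
| deletedMessages mediaMessages |
deletedMessages := df select: [ :row | (row at: #Message) = 'This message was deleted' ].
mediaMessages := df select: [ :row | (row at: #Message) includesSubstring: '<Media omitted>' ].
Transcript show: 'Total: ', df size asString;
    show: ' Deleted: ', deletedMessages size asString;
    show: ' Media: ', mediaMessages size asString; cr.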
We will store all results in a Dictionary, which can later be converted to JSON using NeoJSON.
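As a small sketch, a dictionary containing only strings and numbers can be serialized directly with NeoJSONWriter (nested objects such as DataSeries would need custom mappings):

| results |
results := Dictionary new.
results at: 'Total messages' put: df size.
"prints a pretty-printed JSON string"
NeoJSONWriter toStringPretty: results.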
Here is a method which analyzes basic message exchanges:
getBasicTextAnalysis: df
    "Returns a dictionary with the following metrics:
    Most active user w/count
    Most active media user w/count
    Messages/user"
    | messageCounts textAnalysis |
    textAnalysis := Dictionary new.
    messageCounts := self getMessageCounts: df.
    messageCounts add: 'Authors' -> (self getAuthorMessageCounts: df).
    "Most active user"
    textAnalysis add: 'Most active user' -> ''.
    textAnalysis add: 'Most active user messages' -> 0.
    "Average messages per user"
    textAnalysis add: 'Messages/user' -> 0.
    (messageCounts at: 'Authors') keysAndValuesDo: [ :key :value |
        ((textAnalysis at: 'Most active user messages') < (value at: 'Text')) ifTrue: [
            textAnalysis at: 'Most active user messages' put: (value at: 'Text').
            textAnalysis at: 'Most active user' put: key ].
        textAnalysis at: 'Messages/user' put: (textAnalysis at: 'Messages/user') + (value at: 'Text') ].
    textAnalysis at: 'Messages/user' put: (textAnalysis at: 'Messages/user') asFloat / (messageCounts at: 'Authors') size.
    messageCounts add: 'Basic Text Analysis' -> textAnalysis.
    ^ messageCounts
We iterate through the summary of each author's messages using (messageCounts at: 'Authors') keysAndValuesDo: and count occurrences as needed. For example, the most active user is found by taking the maximum of the 'Text' count across authors.
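Assuming these methods sit on the class side of the ChatAnalyzer class (adjust to wherever you defined them), the analysis can be run like this:

| analysis |
analysis := ChatAnalyzer getBasicTextAnalysis: df.
(analysis at: 'Basic Text Analysis') at: 'Most active user'.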
Additional fields for this analysis are present in my repo. Here is how the resulting messageCounts will look:
Emojis
1. Most frequently used emojis
In the last part, we added ChatCleaner getEmojisFrom:, which returns a DataSeries containing only the emojis present. We'll iterate through the series, counting the emojis encountered and storing the counts in emojiCount, which is another DataSeries.
getEmojiCountFrom: emojiDs
    "Returns a DataSeries having an emoji as key and its count as value"
    | emojiCount |
    emojiCount := DataSeries new name: #EmojiCount.
    emojiDs do: [ :message |
        message do: [ :emoji |
            emojiCount at: emoji transform: [ :count | count + 1 ]
                ifAbsent: [ emojiCount add: emoji -> 1 ] ] ].
    ^ emojiCount sortDescending
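Assuming this method also lives on ChatAnalyzer's class side, the emoji counts for the whole chat can be obtained with:

| emojiCount |
emojiCount := ChatAnalyzer getEmojiCountFrom: (ChatCleaner getEmojisFrom: df).
emojiCount inspect.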
WhatsApp seems to encode emojis in the format <emoji><skin tone><gender>, which is why the skin-tone modifier ranks highest (the default tone is yellow) and the male sign is also frequent (since the default is female for some emojis). The 4th element in the series is Unicode 8205, a zero-width joiner that occurs between an emoji and its gender modifier. We can skip all of these by adding an ignore set.
ignoreEmoji := Set withAll: #(127995 9794 8205 2640).
"and guard the counting statement in the inner block of getEmojiCountFrom: with"
(ignoreEmoji includes: emoji asUnicode) ifFalse:
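Putting it together, the inner block of getEmojiCountFrom: would then look roughly like this (with ignoreEmoji declared as an additional temporary):

message do: [ :emoji |
    (ignoreEmoji includes: emoji asUnicode) ifFalse: [
        emojiCount at: emoji transform: [ :count | count + 1 ]
            ifAbsent: [ emojiCount add: emoji -> 1 ] ] ]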
We can extend our previous method by filtering each author and counting emojis used by that author.
getMemberEmojiCountFrom: df
    "Returns a Dictionary of DataSeries holding the emoji count of each member"
    | emojiPerPerson authors authorDf authorEmojiCount |
    emojiPerPerson := Dictionary new.
    authors := (df column: #Author) uniqueValues.
    authors do: [ :author |
        authorDf := df select: [ :row | (row at: #Author) = author ].
        authorEmojiCount := self getEmojiCountFrom: (ChatCleaner getEmojisFrom: authorDf).
        authorEmojiCount isEmpty ifFalse: [
            emojiPerPerson add: author -> authorEmojiCount ] ].
    ^ emojiPerPerson
We filter each author's messages with select:, get the emoji count of the resulting dataframe, and add it to emojiPerPerson. Here is how the output looks for the sample chat provided in Part 1:
The Inspector is unable to render some Unicode characters; however, you can export the result to CSV and see the output there.
Frequently used phrases
To find phrases commonly used in a chat, we can count ngrams across all messages. This is done by iterating through the messages and adding each ngram to a dictionary with the ngram as key and its count as value, similar to the counting of emojis.
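Here is a minimal sketch of that idea. The cleaning and ngram selectors (removePunctuationAndEmojis:, removeStopwords:, getNgramsFrom:n:) are placeholders for the ChatCleaner helpers from Part 2; the actual selectors in the repo may differ:

getNgramCountFrom: df n: n
    "Counts ngrams across all messages and returns them as a sorted DataSeries"
    | nGramCount cleaned |
    nGramCount := DataSeries new name: #NgramCount.
    (df column: #Message) do: [ :message |
        cleaned := (ChatCleaner removePunctuationAndEmojis: message) asLowercase.
        cleaned := ChatCleaner removeStopwords: cleaned.
        (ChatCleaner getNgramsFrom: cleaned n: n) do: [ :ngram |
            nGramCount at: ngram transform: [ :count | count + 1 ]
                ifAbsent: [ nGramCount add: ngram -> 1 ] ] ].
    ^ nGramCount sortDescending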
To summarize: all punctuation and emojis are removed first, the remaining characters are converted to lowercase, and the stopwords are stripped from the messages. The original method then adds a new column with the ngrams, which is used to build nGramCount.
Here is how the output looks:
The left is with stopwords removed and the right is without. Removing stopwords is only effective if the messages are longer. Also, the ngram methods in ChatCleaner accept ngrams shorter than n to prevent empty strings. They can be modified to emit only n-length ngrams by making sure the buffer is full before appending to the array.
The above method can also be applied per author by applying filters, as done in the basic analysis.
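For instance, a possible per-author variant mirroring getMemberEmojiCountFrom: (getNgramCountFrom:n: is the hypothetical selector from the sketch above):

getMemberNgramCountFrom: df n: n
    "Returns a Dictionary mapping each author to their ngram counts"
    | ngramsPerPerson authorDf |
    ngramsPerPerson := Dictionary new.
    (df column: #Author) uniqueValues do: [ :author |
        authorDf := df select: [ :row | (row at: #Author) = author ].
        ngramsPerPerson add: author -> (self getNgramCountFrom: authorDf n: n) ].
    ^ ngramsPerPerson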
DateTime analysis
Grouping messages by date or time lets us find the most frequent hour or date in the chat history. In Pharo's DataFrame, you can group one column by another using group:by:aggregateUsing:. For example, to find the most frequent date, we can:
df group: #Message by: #Date aggregateUsing: #size.
It first groups the messages by date, then calculates the size of each group, returning a DataSeries.
As for the most frequent hour, we first need to convert the time strings into hours, and then group.
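A simple way to do this is to tally the hours by hand, assuming the time column is named #Time and holds strings like '21:45' (the column name and format depend on how you parsed the chat in Part 1). Alternatively, you could add the hour as a new column and reuse group:by:aggregateUsing: as above.

| hourCounts |
hourCounts := Dictionary new.
(df column: #Time) do: [ :time |
    | hour |
    hour := (time copyUpTo: $:) asNumber.
    hourCounts at: hour put: (hourCounts at: hour ifAbsent: [ 0 ]) + 1 ].
"sort the hours by message count, most active first"
hourCounts associations asSortedCollection: [ :a :b | a value > b value ].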
This indicates that the user is active mostly around 9 am and 12 pm.
Visualization
Following is the analysis of one of my chats. Visualization is done using Roassal. A good resource for learning it is http://agilevisualization.com/, along with its built-in browser containing multiple examples. I did not get enough time to learn Roassal thoroughly, so my scripts are a bit of a hack; that's why I have just linked the output. If you are curious about the source code, or wish to see a detailed part on this, mail me.
Totals:
Letters exchanged: 413,040
Words exchanged: 80,474
Media: 285
Deleted: 132
Total messages: 20,584
Duration: 2017-02-10 to 2019-04-13
Basic text analysis:
Most active user: Member1
Most active user messages: 12104
Most active user percentage: 60.02%
Most active media user: Member1
Most active user media: 226
Average messages/user: 10,083.5
Message distribution by user:
Most common emoji: ๐ appeared 2344 times (80% of all emojis).
Top 10 emoji-wheel for member 1:
Note that I could not find Android emoji fonts for macOS, on which this was recorded. For reference, here are the top 10 emojis:
With the last 3 tutorials, you'll have the tools to analyze your own chat data. You can extend this to other chats such as Discord or even mailing lists, or add additional insights such as:
the member who initiates conversations
each member's behavioral patterns
trending topics across chats/months
and so on.
If you spot any error in this series or can think of additional analysis that can be done, feel free to email me!