Building Whatsapp Analyzer in Pharo: Preprocessing (Part 2)

In the last post, we parsed an exported WhatsApp chat (*.txt file) into a DataFrame object. In this one, we will preprocess the DataFrame to make it ready for analysis.

Prerequisites

1. WhatsappReader class

The WhatsappReader class provides a way to parse an exported chat into a DataFrame. Follow the steps in the last part to create this class.

2. A parsed dataframe

You must also have a df object containing the parsed messages. You can create one using the last code snippet in the previous tutorial.

Preparation

We will create two classes: ChatCleaner and ChatFeatures.

ChatCleaner will contain messages to clean the chat, such as converting it to lowercase, removing stopwords, etc.

Object subclass: #ChatCleaner
  instanceVariableNames: ''
  classVariableNames: ''
  package: 'WhatsappAnalyzer'

ChatFeatures will provide messages to extract features from the chats, such as n-grams.

Object subclass: #ChatFeatures
  instanceVariableNames: ''
  classVariableNames: ''
  package: 'WhatsappAnalyzer'

Generating n-grams

n-grams are sequences of n consecutive words or characters extracted from a string. They are generally used by aggregating the n-grams of many strings and keeping the most frequent ones. For example, the n-grams (n=2) of the string I am feeling lucky! are I am, am feeling and feeling lucky!. Counting n-grams across corpora is a simple yet effective strategy to extract information from given texts.
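
To make the "aggregate and keep the most frequent" step concrete, here is a minimal Playground sketch. It assumes the n-grams have already been collected into one flat list (the sample strings are made up for illustration) and counts them with a Bag.

```smalltalk
| ngrams counts top |
"Made-up bigrams standing in for the n-grams of a whole chat."
ngrams := #('i am' 'am feeling' 'i am' 'feeling lucky' 'i am' 'am feeling').

"A Bag keeps one tally per distinct element."
counts := ngrams asBag.

"sortedCounts answers count -> element associations, highest count first."
top := counts sortedCounts first.
top value. "'i am'"
top key. "3"
```

A Bag is used here because it needs no manual dictionary bookkeeping; a Dictionary of counters would work just as well.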

We can generate n-grams by using an OrderedCollection, lastWords, as a buffer for the last n words. When the buffer is full, we concatenate its contents with join: and add the result to ngrams. Define the following as a class-side method of ChatFeatures.

getNgramsFromLine: line withN: n
  "Returns an OrderedCollection of word ngrams with n = parameter n."

  | ngrams lastWords listOfWords |
  ngrams := OrderedCollection new.
  lastWords := OrderedCollection new.
  listOfWords := line splitOn: Character space.
  
  listOfWords do: [ :word |
    (lastWords size < n) ifFalse: [
      ngrams add: (' ' join: lastWords).
      lastWords removeFirst.
    ].
    lastWords addLast: word.
  ].
  ngrams add: (' ' join: lastWords).
  
  ^ ngrams

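In line 12 (lastWords removeFirst), we remove the first word from lastWords. Since n-grams act as a sliding window, we emulate a queue in lastWords by adding at the end and removing from the front. The extra ngrams add: after the loop flushes the buffer, which still holds the last n-gram. As a side effect, a line with fewer than n words yields a single, shorter entry.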

Here is how to use it:

line := 'A quick brown fox'.
ChatFeatures getNgramsFromLine: line withN: 2.

an OrderedCollection('A quick' 'quick brown' 'brown fox')
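
For comparison, the same sliding window can be written with indices instead of a buffer. This is only an alternative sketch for the Playground, not the method used in the rest of this series.

```smalltalk
| words n ngrams |
words := 'A quick brown fox' substrings. "splits on whitespace"
n := 2.

"Window i covers words i .. i + n - 1; there are (size - n + 1) windows."
ngrams := (1 to: words size - n + 1) collect: [ :i |
  ' ' join: (words copyFrom: i to: i + n - 1) ].
ngrams. "#('A quick' 'quick brown' 'brown fox')"
```

Unlike the buffer version, this one answers an empty collection when the line has fewer than n words, because the interval is then empty.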

For character n-grams, we iterate over the String itself instead of listOfWords, and join lastWords (or here, lastChars) using an empty string ''.

getCharNgramsFromLine: line withN: n
  "Returns an OrderedCollection of character ngrams with n = parameter n."

  | ngrams lastChars |
  ngrams := OrderedCollection new.
  lastChars := OrderedCollection new.
  
  line do: [ :char |
    (lastChars size < n) ifFalse: [
      ngrams add: ('' join: lastChars).
      lastChars removeFirst.
    ].
    lastChars addLast: char.
  ].
  ngrams add: ('' join: lastChars).
  
  ^ ngrams

Here is how the output will look for n=7 on I am lucky:

an OrderedCollection('I am lu' ' am luc' 'am luck' 'm lucky')

We now extend this to DataFrame, using the following code:

getNgramsFromDataFrame: df withN: n
  "Returns an array of word ngrams with n = parameter n."

  ^ ((df column: #Message) collect: [ :line |
    self getNgramsFromLine: line withN: n ]) asArray

You can create a similar one for char-ngrams.
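
Following the same pattern, a character-level version might look like the sketch below. The selector getCharNgramsFromDataFrame:withN: is only a suggested name; it goes on the class side of ChatFeatures next to getCharNgramsFromLine:withN:.

```smalltalk
getCharNgramsFromDataFrame: df withN: n
  "Returns an array of character ngrams, one collection per message.
   Suggested name; mirrors getNgramsFromDataFrame:withN:."

  ^ ((df column: #Message) collect: [ :line |
    self getCharNgramsFromLine: line withN: n ]) asArray
```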

Try running the word version as follows:

df addColumn: (ChatFeatures getNgramsFromDataFrame: df withN: 3) named: #Ngrams.

The output looks like this:

[Figure: N-gram dataframe]

Punctuation makes inference difficult. For example, index 19 contains -, which is counted as a word. For n-grams to be effective, we need to clean our DataFrame. Let’s add a few messages to make cleaning easier.

Setting appropriate column type

If you inspect the df, you’ll notice that all columns are of type ByteString, except a few cells with WideString. Columns like Date and Time need to be of an appropriate type so that analyzing them becomes easier.

We iterate through the rows, transforming each cell to an appropriate data type:

setTypesFor: df
   "Transforms the df columns into appropriate types"

   df do: [ :row |
      row at: #Date transform: [ :date | date asDate ].
      row at: #Time transform: [ :time | time asTime ].
      row at: #Message transform: [ :message | message asWideString ].
   ].

This should be placed as a class-side method in ChatCleaner. Usage:

ChatCleaner setTypesFor: df.

An alternative way to implement this would be to use the toColumn: columnName applyElementwise: block method, or to set the types in WhatsappReader itself.

Making messages lowercase

To count words, we need them all in the same case, since 'Word' ~= 'word'. We apply a similar transformation to lowercase the messages:

messagesAsLowercase: df
   "Returns a DataSeries of the message column with lowercase strings"

   ^ ((df column: #Message) collect: [ :message | message asLowercase ]).

You can run it as:

df column: #Message put: (ChatCleaner messagesAsLowercase: df).

Cleaning output

Keep words only

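We need to remove punctuation so that splitting a string by spaces directly gives us an array of words. The regex used to match the unwanted characters is [^\w\s]. A ^ at the start of a character set negates it, \w matches a single word character (a letter, a digit or an underscore), and \s matches a single whitespace character (space, \t, \r, \n). Together, the set matches any character that is neither a word character nor whitespace.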

getWordsFrom: df
   "Returns a DataSeries of #Message with punctuation removed.
    Letters, digits, underscores and whitespace are kept."

   ^ ((df column: #Message) collect: [ :message |
       message copyWithRegex: '[^\w\s]' matchesReplacedWith: ''
       ]
     )
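
A quick way to check what this regex keeps is to try it on a single string in the Playground. The sample message is made up; note that the digit survives, because \w covers digits as well as letters.

```smalltalk
| message cleaned |
message := 'Wait, what?! 2 pizzas...'.

"Drop every character that is neither a word character nor whitespace."
cleaned := message copyWithRegex: '[^\w\s]' matchesReplacedWith: ''.
cleaned. "'Wait what 2 pizzas'"
```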

Keep emojis only

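To count emojis, we need a regex that keeps only them. However, their Unicode ranges are too varied, and I found it easier to match the non-emoji characters and remove them using the regex [\w\d\s\\:.,''"-/?!()[]<>@’^“”=+_] (it’s just a regex with all the symbols I have seen in chats mixed together). The doubled '' is how a single quote is written inside a Smalltalk string literal. Since an unescaped ] in the middle of a set can close the set early in many regex dialects, test the pattern on your own chats before relying on it.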

getEmojisFrom: df
   "Returns the #Message column while removing non-emoji characters. The regex might not work perfectly"

   ^ ((df column: #Message) collect: [ :message |
       message copyWithRegex: '[\w\d\s\\:.,''"-/?!()[]<>@’^“”=+_]' matchesReplacedWith: ''
       ]
     )

Removing stopwords

Stopwords are words that do not give you any insight, such as is, are, his, when etc. The list of stopwords here is taken from sklearn, which sourced it from the “Glasgow Information Retrieval Group”. Define the following two class-side methods in ChatCleaner:

getStopwords
  "Returns a Set of stopwords"

  ^ Set withAll: #(
  'list' 'is' 'removed'
  'please' 'visit' 'github' 'repo' 'for' 'complete' 'list'
  )

removeStopwordsFrom: df
  "Removes stop words and returns Message dataseries"

  | stopWords words |
  stopWords := self getStopwords.
  ^ (df column: #Message) collect: [ :message |
    words := message trimBoth splitOn: '\s+' asRegex.
    words := words reject: [ :word |
      stopWords includes: word asLowercase ].
    ' ' join: words ]

In the above snippet, we reject the words of each message that are present in stopWords and join the remaining words back into a single string. To get the words from a message, we first trim whitespace from both ends and then split using the \s+ regex, which matches one or more whitespace characters.
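
The same steps can be tried on a single string in the Playground. This is a small sketch with a made-up three-word stopword set rather than the full list.

```smalltalk
| stopWords message kept result |
"Tiny stand-in for the real stopword list."
stopWords := Set withAll: #('is' 'the' 'a').
message := '  The pizza is here  '.

"Trim, split on runs of whitespace, then drop stopwords case-insensitively."
kept := (message trimBoth splitOn: '\s+' asRegex)
  reject: [ :word | stopWords includes: word asLowercase ].
result := ' ' join: kept.
result. "'pizza here'"
```

A Set is used for the lookup so that includes: stays fast even when the stopword list has a few hundred entries.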

Removing blacklisted messages

There are messages like This message was deleted and <Media omitted> which would not contribute to text analysis. To remove such messages (and any additional ones) we add them to a set and pass it to the following method:

removeMessages: blacklistSet from: df
   "Removes messages present in blacklistSet from df"

   | outputDf |
   outputDf := df reject: [ :row |
      blacklistSet includes: (row at: #Message)
      ].
   ^ outputDf

This can be used as:

blacklist := Set withAll: #('This message was deleted' '<Media omitted>').
cleanedDf := ChatCleaner removeMessages: blacklist from: df.
" If you run this on group0.txt, you will see 17 messages removed. "
(df size) - (cleanedDf size)

I think that’s all the messages needed! I’ll add a few more if I find any missing while writing the next parts. You can find the messages implemented here: Whatsapp-Analyzer.

In the next part, we’ll analyze our parsed chat using the messages we have created.