intro
A reddit user published a dataset of ~1.7 billion Reddit comments. There is also a one-month subset of the data that is small enough to work with on a local machine.
Check out the code on GitHub.
Download the one-month subset of Reddit comments. We start by grouping the comments by subreddit and getting word counts for each group. Then we will look for words that “define” each subreddit.
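As a rough sketch, assuming the dump is a bz2-compressed file of newline-delimited JSON comments with "subreddit" and "body" fields (the filename `RC_2015-01.bz2` and the `subreddit_docs` name are illustrative):

```python
import bz2
import json
from collections import defaultdict

# Group raw comment text by subreddit. The filename and the assumption that
# each line is a JSON object with "subreddit" and "body" fields are illustrative.
grouped = defaultdict(list)
with bz2.open('RC_2015-01.bz2', 'rt') as f:
    for line in f:
        comment = json.loads(line)
        grouped[comment['subreddit']].append(comment['body'])

# Join each subreddit's comments into one "document" for the vectorizer below.
subreddit_docs = {name: ' '.join(bodies) for name, bodies in grouped.items()}
```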
tokenization
The process of extracting words is called tokenization: the string is broken up into “tokens” that characterize it. In our case the tokens will, approximately, be the individual words of the comments. How you tokenize is a design choice. You could look at pairs of words (bigrams), triplets of words (trigrams), or even every other word. In our case we use each word individually.
What happens if we have different forms of the same word? We probably want to count “foot” and “feet” as the same token. We have two options: a stemmer, which chops off the ends of words, or a lemmatizer, which does a morphological analysis of each word to determine its lemma (a canonical form of the word).
In this project we can piggyback off of nltk’s implementation of a WordNet lemmatizer. We will also use nltk’s part-of-speech tagger to pass the part of speech into the lemmatizer. Finally, the word_tokenize function does exactly what you think: it splits the string up into individual words.
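A minimal sketch of such a tokenizer class, assuming nltk and its WordNet data are installed; the class name and the Treebank-to-WordNet tag mapping helper are illustrative rather than taken from the original code:

```python
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer


class LemmaTokenizer(object):
    """Turn a string into a list of lemmatized, lowercased tokens."""

    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()

    def _wordnet_pos(self, treebank_tag):
        # Map Penn Treebank tags from pos_tag to the coarse tags
        # that WordNetLemmatizer expects.
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        if treebank_tag.startswith('V'):
            return wordnet.VERB
        if treebank_tag.startswith('R'):
            return wordnet.ADV
        return wordnet.NOUN

    def __call__(self, text):
        tokens = word_tokenize(text)
        return [self.lemmatizer.lemmatize(token.lower(), self._wordnet_pos(tag))
                for token, tag in pos_tag(tokens)]
```

For example, `LemmaTokenizer()('The cats are running')` comes out roughly as `['the', 'cat', 'be', 'run']`.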
By passing text through an instance of this class we will get a list of words that approximate the original string.
vectorization
We can use this lemmatizer in our pipeline to vectorize our text. This means we transform each subreddit’s comments from a string into a vector of word counts (token counts). Scikit-learn provides a vectorizer implementation that accepts a custom tokenizer.
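For instance, scikit-learn’s CountVectorizer takes a callable `tokenizer`, so we can plug in the lemma tokenizer sketched above (the `subreddit_docs` dict and the stop-word handling are assumptions carried over from the earlier sketches):

```python
from sklearn.feature_extraction.text import CountVectorizer

subreddits = sorted(subreddit_docs)  # fix an ordering of the subreddit "documents"

# stop_words='english' is an assumption; the original pipeline may have
# handled common words differently.
vectorizer = CountVectorizer(tokenizer=LemmaTokenizer(), stop_words='english')
counts = vectorizer.fit_transform(subreddit_docs[name] for name in subreddits)
# counts is a sparse (n_subreddits, n_tokens) matrix of token counts.
```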
tfidf
We mentioned in the intro section that we were going to look for words that “define” each subreddit. So now that we have a count of each word for each subreddit, we need to rank them. One way of doing this is using Term Frequency Inverse Document Frequency (tfidf). The basic idea is to look at how frequently a word appears in a subreddit and offset it by the number of subreddits that word appears in. So if the word “kumquat” appears many times in /r/fruit but doesn’t appear in very many other subreddits, it will be weighted strongly. If the word “banana” appears often in /r/fruit but also appears in lots of other subreddits, it will be weighted weakly. There are a number of variations on this, but we will be using the default one from sklearn.
\[ idf(d, t) = \log{\dfrac{n_d + 1}{df(d, t) + 1}} + 1 \]
where \(n_d\) is the number of documents and \(df(d, t)\) is the number of documents that contain term \(t\); in our case each subreddit is one “document.” The resulting matrix is then normalized using the l2 norm.
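A minimal sketch of the weighting step with scikit-learn’s TfidfTransformer, whose defaults (smooth_idf=True, norm='l2') match the formula above; it continues from the `counts` matrix and `subreddits` list assumed earlier:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer().fit_transform(counts)  # defaults: smoothed idf, l2 norm

terms = vectorizer.get_feature_names_out()  # on older scikit-learn: get_feature_names()
for i, name in enumerate(subreddits):
    row = tfidf[i].toarray().ravel()
    top = np.argsort(row)[::-1][:20]          # indices of the 20 highest tfidf scores
    print(name, [(terms[j], round(row[j], 3)) for j in top])
```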
results
Let’s look at some of the top words for a few subreddits (each word with its corresponding tfidf score):
funny | tfidf | aww | tfidf | worldnews | tfidf | todayilearned | tfidf |
---|---|---|---|---|---|---|---|
mo | 0.04 | pup | 0.088 | hamas | 0.053 | muslim | 0.036 |
͡° | 0.037 | breed | 0.086 | palestinian | 0.05 | jew | 0.032 |
/r/funny | 0.036 | puppy | 0.084 | /r/worldnews | 0.045 | islam | 0.031 |
b | 0.029 | kitty | 0.08 | israel | 0.042 | religion | 0.03 |
r/funny | 0.028 | kitten | 0.079 | kurd | 0.041 | tax | 0.03 |
reposts | 0.028 | pet | 0.073 | gaza | 0.039 | rape | 0.029 |
gifs | 0.028 | dog | 0.072 | israeli | 0.038 | nazi | 0.029 |
repost | 0.028 | cat | 0.071 | ukraine | 0.037 | christian | 0.028 |
penis | 0.027 | adorable | 0.069 | iran | 0.037 | government | 0.028 |
ha | 0.027 | paw | 0.069 | palestine | 0.037 | slave | 0.028 |
toilet | 0.026 | breeder | 0.069 | muslim | 0.036 | gay | 0.028 |
racist | 0.026 | rescue | 0.069 | hezbollah | 0.036 | jewish | 0.027 |
karma | 0.026 | animal | 0.067 | ukrainian | 0.036 | hitler | 0.027 |
meme | 0.026 | shelter | 0.067 | arab | 0.035 | drug | 0.027 |
religion | 0.026 | husky | 0.065 | paywall | 0.035 | military | 0.027 |
gif | 0.026 | vet | 0.063 | islam | 0.035 | slavery | 0.027 |
sex | 0.026 | amp | 0.061 | assad | 0.035 | population | 0.027 |
muslim | 0.025 | adopt | 0.061 | sunni | 0.035 | wage | 0.027 |
gay | 0.025 | cute | 0.058 | nato | 0.034 | culture | 0.027 |
rape | 0.025 | cuddle | 0.057 | sharia | 0.034 | billion | 0.026 |