
Topic Modeling (LDA) (I) - General Understanding and Simple Applications


This article is mainly for readers who are not fond of math; it uses essentially no mathematical formulas.
Contents

  1. Intuitive Understanding of Topic Models
  2. Popular Definition of LDA
  3. LDA Classification Principle
  4. The Essence of LDA
  5. Simple Application of Topic Modeling - Hillary Emailgate

1. Intuitive Understanding of Topic Models

As the name suggests, a topic model determines what kind of article a text is from the words it contains. If an article contains many sports-related words, for example basketball and soccer, the topic model will classify it as a sports article.

Because topic models involve quite a bit of mathematical derivation, let's start with a small example to understand what they are trying to do. Suppose we have the following scenario:

  • A senior HR receives a resume for an algorithm engineer position and wants to judge, from the resume alone, whether the candidate is a real expert or just average. How does he do that?

His general approach is to take the resume and look at what the candidate has written in it.
Before that, he must have interviewed many algorithm engineers, and from those interviews he has learned that an expert probably:

  • wears a striped shirt
  • has worked at BAT
  • has worked on large projects

This HR then checks whether the candidate wears a striped shirt, whether he has worked at BAT, and what impressive projects he has done. If all the conditions are met, the HR will judge that this person is probably an expert; if he only wears a striped shirt and has not done any notable project, the HR should hesitate, because the candidate is more likely to be just average.

The relationship between this example and the topic model can be represented by this diagram:
[Figure: the resume-screening analogy mapped onto the topic model]
In LDA's eyes, these criteria are the equivalent of bags of words: each bag holds a set of words, and when it is used you simply check whether those words appear.

This can be expressed as:

P(expert | resume) = (number of times these features appear among experts / all the features an expert has) × (number of features in this resume that belong to an expert)
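To make the counting concrete, here is a minimal sketch in Python; the feature sets below are toy values invented purely for illustration, not part of any real screening rule.

# Toy illustration of the feature-matching idea above (all values invented).
expert_features = {"striped shirt", "worked at BAT", "large projects"}
resume_features = {"striped shirt", "large projects"}   # features found in this resume

# share of the expert feature bag that this resume matches
match_score = len(resume_features & expert_features) / len(expert_features)
print(match_score)   # 0.666... -> the more expert features a resume hits, the stronger the evidence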


2. Popular Definition of LDA

What is LDA?

  • It is an unsupervised Bayesian model.
  • It is a topic model: it gives the topic of each document in a document set in the form of a probability distribution.
  • It is a type of unsupervised learning that does not require a manually labeled training set; all it needs is the document set and a specified number of topics.
  • It is a typical bag-of-words model: it treats a document as a collection of words, with no order or sequential relationship between the words.

Its main advantage is that for each topic it can find some words to describe that topic.

3. LDA Classification Principle

I have written in detail about the principles of Bayesian models and the ideas behind them before; see The Magical Bayesian Mind for details. Here is just a brief description of how it works in this setting:

P(expert | resume) = P(expert) · P(resume | expert) / P(resume)
After a series of derivations, one can obtain such a chain relationship:
P(word | document) = P(word | topic) · P(topic | document)
That is:
word → topic → document
Such a chain relationship.

The probability of a word appearing under a given topic, multiplied by the probability of that topic appearing in a given document, gives the probability of the word appearing in that document. During training, we adjust these two distributions.
[Figure: topic distribution per document (left) and word distribution per topic (right)]

From this, the LDA generation process can be defined:

  • For each document, a topic is drawn from its topic distribution; (the left panel)
  • A word is randomly drawn from the word distribution corresponding to the drawn topic; (the right panel)
  • Repeat the process until every word of the entire document has been traversed.

After these three steps, we can check whether the product of the two distributions matches the distribution of the given article, and adjust the two distributions accordingly.
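The two sampling steps can be written out directly. Below is a minimal generative sketch; the vocabulary, the topic-word table, and the document-topic vector are all toy numbers chosen for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Toy distributions, invented for illustration: 2 topics, 4 words in the vocabulary.
vocab = ["basketball", "soccer", "election", "vote"]
topic_word = np.array([[0.45, 0.45, 0.05, 0.05],   # topic 0: sports
                       [0.05, 0.05, 0.45, 0.45]])  # topic 1: politics
doc_topic = np.array([0.7, 0.3])                   # this document leans towards sports

# Generate a short document by repeating the two sampling steps above.
words = []
for _ in range(8):
    t = rng.choice(2, p=doc_topic)        # step 1: draw a topic from the document's topic distribution
    w = rng.choice(4, p=topic_word[t])    # step 2: draw a word from that topic's word distribution
    words.append(vocab[w])
print(words)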

To be a little more specific (w stands for word, d for document, t for topic; uppercase denotes a whole set, lowercase an individual element):
Each document d in D is viewed as a sequence of words:

<w1, w2, ..., wn>, where wi denotes the i-th word.

All the distinct words appearing in D form a large vocabulary V. LDA takes the document collection D as input and aims to train two result vectors (assuming k topics are formed and V contains m words).

  • Result vector 1: for each document d in D, the probability of each topic, θd:
    <pt1, ..., ptk>
    where pti denotes the probability that d corresponds to the i-th of the k topics. It is computed in a simple way:
    pti = (number of words in d that also belong to the i-th topic) / (total number of words in d)
  • Result vector 2: for each topic t in T, a probability vector over the different words, φt:
    <pw1, ..., pwm>
    where pwi denotes the probability that topic t generates the i-th word of V. It is computed as:
    pwi = (number of times the i-th word of V is assigned to topic t) / (total number of words assigned to topic t)
    (A small numerical sketch follows this list.)
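Here is that numerical sketch: with toy values for θd and φ (k = 2 topics, m = 4 words, all numbers invented), the product of the two distributions gives the word probabilities for the document.

import numpy as np

# Toy values for illustration: k = 2 topics, m = 4 words in V.
theta_d = np.array([0.8, 0.2])          # result vector 1: P(t|d) for one document d
phi = np.array([[0.4, 0.4, 0.1, 0.1],   # result vector 2: P(w|t) for topic 1
                [0.1, 0.1, 0.4, 0.4]])  # ... and topic 2

# P(w|d) = sum over t of P(w|t) * P(t|d): the product of the two distributions.
p_w_given_d = theta_d @ phi
print(p_w_given_d)        # probability of each of the m words appearing in d
print(p_w_given_d.sum())  # sums to 1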

4. The Essence of LDA

Having said all that, the core of LDA is, in fact, still this formula:

P(word | document) = P(word | topic) · P(topic | document)
In symbols:
P(w|d) = P(w|t)P(t|d)
The topic acts as a middle layer: the two vectors introduced above (θd and φt) give P(t|d) and P(w|t), respectively. The learning process can be described as follows:

  1. The LDA algorithm starts by randomly assigning values to θd and φt (for every d and t).
  2. For a specific document ds, take its i-th word wi. If this word is assigned to topic tj, the formula above can be rewritten as:
    Pj(wi|ds) = P(wi|tj)P(tj|ds)
  3. Enumerate the topics in T to obtain all the Pj(wi|ds). Based on these probability values, choose a topic for the i-th word wi of ds; the simplest choice is to take the topic tj with the largest Pj(wi|ds).
  4. If the topic chosen for wi in ds differs from its original assignment, this changes θd and φt, and these changes in turn affect the computation of P(w|d) described above.

Computing P(w|d) and re-selecting a topic once for every word w of every document d in the document set D counts as one iteration. After n iterations, LDA converges to the desired classification result.
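The four steps can be sketched in a few lines of Python. This is a deliberately simplified hard-assignment version of the loop, with a toy corpus of word ids; real LDA implementations use Gibbs sampling or variational inference rather than this plain argmax.

import numpy as np

rng = np.random.default_rng(1)

# Toy corpus: each document is a list of word ids from a vocabulary of size m = 4.
docs = [[0, 0, 1, 2], [2, 3, 3, 1], [0, 1, 1, 0]]
k, m = 2, 4

# Step 1: randomly assign a topic to every word (this implicitly initializes theta and phi).
z = [[int(rng.integers(k)) for _ in doc] for doc in docs]

for _ in range(20):  # n iterations
    # Recompute theta (P(t|d)) and phi (P(w|t)) by counting the current assignments.
    theta = np.ones((len(docs), k))  # start counts at 1 so no probability is ever exactly zero
    phi = np.ones((k, m))
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            theta[d, z[d][i]] += 1
            phi[z[d][i], w] += 1
    theta /= theta.sum(axis=1, keepdims=True)
    phi /= phi.sum(axis=1, keepdims=True)

    # Steps 2-4: for each word, pick the topic maximizing Pj(wi|ds) = P(wi|tj) * P(tj|ds).
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            z[d][i] = int(np.argmax(phi[:, w] * theta[d]))

print(z)  # the topic assignment of every word in every document after the iterations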

5. Simple Application of Topic Modeling - Hillary Emailgate

If we do not want to go into the derivation of the specific mathematical formulas, the understanding above is pretty much enough; the point now is to learn how to use LDA.

Let's use the Hillary Emailgate data to see how gensim can be used to classify the emails by topic.

from gensim import corpora, models, similarities
import gensim
import numpy as np
import pandas as pd
import re

df = pd.read_csv("../input/")
# There were a lot of Nan values in the original email data that were just thrown away.
df = df[['Id','ExtractedBodyText']].dropna()

df.head()

Data Style:
[Screenshot: the Id and ExtractedBodyText columns of the email DataFrame]

Do a simple preprocessing:

def clean_email_text(text):
    text = text.replace('\n', " ")  # newlines
    text = re.sub(r"-", " ", text)  # hyphens
    text = re.sub(r"\d+/\d+/\d+", "", text)  # dates don't mean much to the topic model
    text = re.sub(r"[0-2]?[0-9]:[0-6][0-9]", "", text)  # times are pointless here
    text = re.sub(r"[\w]+@[\.\w]+", "", text)  # email addresses are pointless here
    text = re.sub(r"/[a-zA-Z]*[:\//\]*[A-Za-z0-9\-_]+\.+[A-Za-z0-9\.\/%&=\?\-_]+/i", "", text)  # URLs are pointless here
    pure_text = ''
    for letter in text:
        # keep only letters and spaces
        if letter.isalpha() or letter == ' ':
            pure_text += letter
    # drop stray single letters left over from the removals above
    text = ' '.join(word for word in pure_text.split() if len(word) > 1)
    return text
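A quick check on a made-up line of text (the sentence and address below are invented just to exercise the regexes):

# Made-up example text, just to see what the cleaning does.
sample = "Meeting on 3/12/2012 at 10:30 -- email someone@example.com for details"
print(clean_email_text(sample))   # -> "Meeting on at email for details"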

docs = df['ExtractedBodyText']
docs = docs.apply(lambda x: clean_email_text(x))

Look at what it's been processed into:

docs.head(2).values

Each email has been processed into a single cleaned string.

[Screenshot: the first two cleaned email strings]

That is:

[[string of one email], [string of another email], ...]

Hand-written stop words; you can also use assorted stop-word lists written by others: stopwords

stoplist = ['very', 'ourselves', 'am', 'doesn', 'through', 'me', 'against', 'up', 'just', 'her', 'ours', 
            'couldn', 'because', 'is', 'isn', 'it', 'only', 'in', 'such', 'too', 'mustn', 'under', 'their', 
            'if', 'to', 'my', 'himself', 'after', 'why', 'while', 'can', 'each', 'itself', 'his', 'all', 'once', 
            'herself', 'more', 'our', 'they', 'hasn', 'on', 'ma', 'them', 'its', 'where', 'did', 'll', 'you', 
            'didn', 'nor', 'as', 'now', 'before', 'those', 'yours', 'from', 'who', 'was', 'm', 'been', 'will', 
            'into', 'same', 'how', 'some', 'of', 'out', 'with', 's', 'being', 't', 'mightn', 'she', 'again', 'be', 
            'by', 'shan', 'have', 'yourselves', 'needn', 'and', 'are', 'o', 'these', 'further', 'most', 'yourself', 
            'having', 'aren', 'here', 'he', 'were', 'but', 'this', 'myself', 'own', 'we', 'so', 'i', 'does', 'both', 
            'when', 'between', 'd', 'had', 'the', 'y', 'has', 'down', 'off', 'than', 'haven', 'whom', 'wouldn', 
            'should', 've', 'over', 'themselves', 'few', 'then', 'hadn', 'what', 'until', 'won', 'no', 'about', 
            'any', 'that', 'for', 'shouldn', 'don', 'do', 'there', 'doing', 'an', 'or', 'ain', 'hers', 'wasn', 
            'weren', 'above', 'a', 'at', 'your', 'theirs', 'below', 'other', 'not', 're', 'him', 'during', 'which']

Tokenize and remove stop words:

doclist = docs.values
texts = [[word for word in doc.lower().split() if word not in stoplist] for doc in doclist]
texts[0]

Of course, you can also use packages like jieba or NLTK.
What you get is a bag of words for each document.

Build the corpus: each word is replaced by a numerical index, giving an array.

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

Get:
[Screenshot: the bag-of-words corpus, a list of (word id, count) pairs per email]
This list tells us that the 14th email (counting from 0) contains 6 meaningful words (after our text preprocessing and stop-word removal),

where word 36 occurs once, word 505 occurs once, and so on.
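To see what those ids stand for, gensim's Dictionary can map an id back to its token; index 13 and id 36 below simply refer to the email and the word mentioned above.

print(corpus[13])        # the (word id, count) pairs of the 14th email
print(dictionary[36])    # the token behind word id 36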

Then, we can finally build the model:

lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)
lda.print_topic(10, topn=5)

The most common words in topic #10 are:

  • ‘0.007*kurdistan + 0.006*email + 0.006*see + 0.005*us + 0.005*right’

Print five topics and take a look:

lda.print_topics(num_topics=5, num_words=6)

[Screenshot: the top six words of each of the five topics]
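Once trained, the model can also be queried in the other direction: hand it a bag of words and ask for its topic distribution. get_document_topics is a standard LdaModel method; the email index 0 and the sample sentence below are arbitrary.

# Topic distribution of one email from the corpus (index 0 chosen arbitrarily).
print(lda.get_document_topics(corpus[0]))

# The same works for new text, after the same preprocessing.
new_bow = dictionary.doc2bow(clean_email_text("Please schedule a call with the ambassador").lower().split())
print(lda.get_document_topics(new_bow))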

You can practice gensim sometime :)
gensim User's Guide

A detailed derivation with the mathematical formulas will be introduced in the next post.

Reference: July Online