This article is mainly for readers who are not too fond of math; it uses almost no mathematical formulas.
Contents
- Intuitive understanding of topic models
- Popular definition of LDA
- LDA classification principle
- The essence of LDA
- Simple application of topic modeling - Hillary Emailgate
1. Intuitive understanding of topic models
The name already says most of it. Given the text of an article, a topic model uses the words in it to judge what kind of article it is. If many sports words appear in the article, such as basketball and soccer, the topic model will classify it as a sports article.
Because topic models involve a fair amount of mathematical derivation, let's start with a small example to understand what they are trying to do. Suppose there is this scenario:
- A senior HR person receives a resume for an Algorithm Engineer position and wants to tell, from the resume alone, whether the applicant is an expert or a rookie. How does he decide?
His general approach is to take the resume and look at what the applicant has written in it.
Before that, he must also have met many algorithm engineers in interviews, and from those hires he has concluded that an expert probably:
- Wears a striped shirt
- Has worked at BAT (Baidu, Alibaba, Tencent)
- Has worked on large projects
The HR person checks whether the applicant wears a striped shirt, has worked at BAT, and has done any impressive projects. If all the conditions are met, he will judge that this person is probably an expert. If the applicant merely wears a striped shirt and has no projects to show, he should hesitate, because the chance that this person is a rookie is relatively large.
The relationship between this example and the topic model can be represented by this diagram:
In LDA's eyes, these features are the equivalent of bags of words: each bag holds a bunch of words, and when you use a bag you only check whether its words appear or not.
It can be expressed as a formula:
2. Popular definition of LDA
What is LDA?
- It is an unsupervised Bayesian model.
- It is a topic model that gives the topic of each document in a document set as a probability distribution.
- It is a type of unsupervised learning that does not require a manually labeled training set; all it needs is the document set and the specified number of topics.
- It is a typical bag-of-words model, which treats a document as a collection of words with no order or sequential relationship between them.
Its main advantage is that, for each topic, you can find some words that describe it.
3. LDA classification principle
I have previously written in detail about the principles of Bayesian models and the ideas behind them; for details, see: The Magical Bayesian Mind. Here is just a brief description of how they are used in this setting:
Multiplying the probability of a word appearing under a topic by the probability of that topic appearing in a document gives the probability of the word appearing in that document through that topic (strictly, the total is the sum of this product over all topics). These two distributions are what we adjust during training.
From this, the LDA generation process can be defined:
- For each word position in a document, draw a topic from the document's topic distribution (the left panel);
- Randomly draw a word from the word distribution of the drawn topic (the right panel);
- Repeat the process until every word in the document has been generated.
After these three steps, we check whether the product of the two distributions matches the word distribution of the given article, and adjust the two distributions accordingly.
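The three generation steps can be sketched in a few lines of Python. This is only an illustrative sketch: the two toy distributions, the topic names and the function name `generate_document` are made up for the example and do not come from any real data or library.

```python
import random

# Hypothetical toy distributions, invented for illustration only.
doc_topic = {"sports": 0.7, "politics": 0.3}   # P(t|d) for one document
topic_word = {                                  # P(w|t) for each topic
    "sports": {"basketball": 0.5, "soccer": 0.4, "vote": 0.1},
    "politics": {"vote": 0.6, "senate": 0.3, "soccer": 0.1},
}

def generate_document(doc_topic, topic_word, length, rng):
    """Generate `length` words by repeating the draw-topic / draw-word steps."""
    words = []
    for _ in range(length):
        # Step 1: draw a topic from the document's topic distribution.
        topic = rng.choices(list(doc_topic), weights=list(doc_topic.values()))[0]
        # Step 2: draw a word from the word distribution of that topic.
        dist = topic_word[topic]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        words.append(word)
    # Step 3 is the loop itself: repeat until the document is full.
    return words

document = generate_document(doc_topic, topic_word, 10, random.Random(0))
print(document)
```

Training runs in the opposite direction: the words are given, and the two distributions are what has to be found.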
To be a little more specific (w stands for word, d for document, t for topic; uppercase denotes the whole set, lowercase an individual item):
Each document d in D is viewed as a sequence of words <w1, w2, ..., wn>, where wi denotes the i-th word and d has n words.
All the different words that appear in D form a large vocabulary V. LDA takes the document collection D as input and aims to train two result vectors (assuming there are k topics and m words in V).
- Result vector 1: for each document d in D, the probabilities of d corresponding to the different topics, θd = <pt1, ..., ptk>, where pti denotes the probability that d corresponds to the i-th of the k topics. It is computed in a simple way: pti = (number of words in d assigned to the i-th topic) / (total number of words in d).
- Result vector 2: for each topic t in T, the probabilities of t generating the different words, ϕt = <pw1, ..., pwm>, where pwi denotes the probability that topic t generates the i-th word in V. It is computed as: pwi = (number of occurrences of the i-th word of V assigned to topic t) / (total number of words assigned to topic t).
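Both result vectors are plain count ratios, so they can be sketched directly. The snippet below is a minimal illustration that assumes every word already carries a topic label; the function names `estimate_theta` and `estimate_phi` and the toy assignments are hypothetical.

```python
from collections import Counter

def estimate_theta(doc_assignments, topics):
    """theta_d: share of the words in one document assigned to each topic."""
    counts = Counter(topic for _, topic in doc_assignments)
    total = len(doc_assignments)
    return [counts[t] / total for t in topics]

def estimate_phi(all_assignments, topic, vocabulary):
    """phi_t: among the words assigned to `topic`, the share of each vocabulary word."""
    counts = Counter(word for word, t in all_assignments if t == topic)
    total = sum(counts.values())
    return [counts[w] / total for w in vocabulary]

# Hypothetical (word, topic) assignments for a single four-word document.
doc = [("soccer", 0), ("goal", 0), ("vote", 1), ("soccer", 0)]
topics = [0, 1]
vocabulary = ["goal", "soccer", "vote"]

theta = estimate_theta(doc, topics)
phi_0 = estimate_phi(doc, 0, vocabulary)
print(theta)  # [0.75, 0.25]
print(phi_0)
```

Three of the four words are assigned to topic 0, which is where the 0.75 comes from.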
4. The essence of LDA
Having said all that, the core of LDA is, in fact, still this formula:
P(w|d) = P(w|t) ∗ P(t|d)
- The LDA algorithm begins by randomly assigning values to θd and ϕt (for all d and t).
- For the i-th word wi in a specific document ds, if we let the topic corresponding to that word be tj, the above equation can be rewritten as:
Pj(wi|ds) = P(wi|tj) ∗ P(tj|ds)
- Enumerate the topics in T to get all the Pj(wi|ds). Based on these probability values, choose a topic for the i-th word wi in ds. The easiest way is to take the topic tj with the highest Pj(wi|ds).
- If the i-th word wi in ds is given a topic different from its original one, θd and ϕt are affected, and their change in turn affects the calculation of P(w|d) above.
Computing P(w|d) and re-selecting the topic once for every w in every document d of the document set D counts as one iteration. After n iterations, the process can converge to the classification result that LDA is after.
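The iteration described above can be sketched as follows. This is a rough sketch of the simplified "take the most probable topic" procedure from this section only, not the inference method of any particular library; the function name `run_iterations` and the toy documents are hypothetical.

```python
import random
from collections import Counter

def run_iterations(docs, num_topics, num_iters, seed=0):
    """Re-select the topic of every word `num_iters` times and return the assignment."""
    rng = random.Random(seed)
    # Random initial topic for every word position (for all d and all words).
    assign = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    for _ in range(num_iters):
        # Recompute the counts behind theta_d and phi_t from the current assignment.
        doc_topic = [Counter(row) for row in assign]
        topic_word = [Counter() for _ in range(num_topics)]
        for doc, row in zip(docs, assign):
            for word, t in zip(doc, row):
                topic_word[t][word] += 1
        topic_total = [sum(c.values()) for c in topic_word]
        # Re-select each word's topic as the argmax of P(w|t) * P(t|d).
        new_assign = []
        for d, doc in enumerate(docs):
            row = []
            for word in doc:
                scores = []
                for t in range(num_topics):
                    p_w_t = topic_word[t][word] / topic_total[t] if topic_total[t] else 0.0
                    p_t_d = doc_topic[d][t] / len(doc)
                    scores.append(p_w_t * p_t_d)
                row.append(max(range(num_topics), key=scores.__getitem__))
            new_assign.append(row)
        assign = new_assign
    return assign

docs = [["soccer", "goal", "soccer"], ["vote", "senate", "vote"]]
assignments = run_iterations(docs, num_topics=2, num_iters=5)
print(assignments)
```

Each pass recomputes the two distributions from the current assignment and then re-selects every word's topic, which is the feedback loop between θd, ϕt and P(w|d) described in the last bullet.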
5. Simple application of topic modeling - Hillary Emailgate
If we don't want to go into the derivation of the math formulas, we are pretty much done with the understanding part at this point. What matters next is learning how to use it.
Let's use the Hillary Emailgate data to see how gensim can be used to categorize emails.
from gensim import corpora, models, similarities
import gensim
import numpy as np
import pandas as pd
import re
df = pd.read_csv("../input/")
# The original email data contains many NaN values; simply drop them.
df = df[['Id','ExtractedBodyText']].dropna()
df.head()
What the data looks like:
Do some simple preprocessing:
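def clean_email_text(text):
    text = text.replace('\n', " ")  # newlines are not needed
    text = re.sub(r"-", " ", text)  # replace hyphens with spaces
    text = re.sub(r"\d+/\d+/\d+", "", text)  # dates mean little to a topic model
    text = re.sub(r"[0-2]?[0-9]:[0-6][0-9]", "", text)  # times, not meaningful
    text = re.sub(r"[\w]+@[\.\w]+", "", text)  # email addresses, not meaningful
    # URLs, not meaningful. The original pattern was wrapped in JavaScript-style
    # /.../i delimiters, which Python's re treats as literal characters, so a
    # simpler pattern is substituted here.
    text = re.sub(r"(?:https?://|www\.)\S+", "", text)
    pure_text = ''
    # keep only letters and spaces, in case other special characters remain
    for letter in text:
        if letter.isalpha() or letter == ' ':
            pure_text += letter
    # drop single-character leftovers
    text = ' '.join(word for word in pure_text.split() if len(word) > 1)
    return text

docs = df['ExtractedBodyText']
docs = docs.apply(lambda x: clean_email_text(x))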
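To see what the date, time and email-address patterns actually strip out, here is a small stand-alone check on a made-up sentence. The sample string and the address in it are invented for the example; the three regular expressions and the final letter filter are the ones used in the preprocessing step.

```python
import re

# Hypothetical sample text, invented for illustration.
sample = "Call at 10:30 on 12/25/2015, reply to someone@example.com"

no_date = re.sub(r"\d+/\d+/\d+", "", sample)               # drop dates
no_time = re.sub(r"[0-2]?[0-9]:[0-6][0-9]", "", no_date)   # drop times
no_email = re.sub(r"[\w]+@[\.\w]+", "", no_time)           # drop email addresses

# Keep only letters and spaces, then drop single-character words.
letters_only = ''.join(c for c in no_email if c.isalpha() or c == ' ')
cleaned = ' '.join(word for word in letters_only.split() if len(word) > 1)
print(cleaned)  # Call at on reply to
```

What is left is mostly stop words, which is why a stop word list is still needed afterwards.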
Look at what it has been processed into:
docs.head(2).values
Each email has been reduced to plain words, one after another.
Next, the stop words. This is a hand-written list; this one and assorted stop word lists written by others can be found at: stopwords
stoplist = ['very', 'ourselves', 'am', 'doesn', 'through', 'me', 'against', 'up', 'just', 'her', 'ours',
'couldn', 'because', 'is', 'isn', 'it', 'only', 'in', 'such', 'too', 'mustn', 'under', 'their',
'if', 'to', 'my', 'himself', 'after', 'why', 'while', 'can', 'each', 'itself', 'his', 'all', 'once',
'herself', 'more', 'our', 'they', 'hasn', 'on', 'ma', 'them', 'its', 'where', 'did', 'll', 'you',
'didn', 'nor', 'as', 'now', 'before', 'those', 'yours', 'from', 'who', 'was', 'm', 'been', 'will',
'into', 'same', 'how', 'some', 'of', 'out', 'with', 's', 'being', 't', 'mightn', 'she', 'again', 'be',
'by', 'shan', 'have', 'yourselves', 'needn', 'and', 'are', 'o', 'these', 'further', 'most', 'yourself',
'having', 'aren', 'here', 'he', 'were', 'but', 'this', 'myself', 'own', 'we', 'so', 'i', 'does', 'both',
'when', 'between', 'd', 'had', 'the', 'y', 'has', 'down', 'off', 'than', 'haven', 'whom', 'wouldn',
'should', 've', 'over', 'themselves', 'few', 'then', 'hadn', 'what', 'until', 'won', 'no', 'about',
'any', 'that', 'for', 'shouldn', 'don', 'do', 'there', 'doing', 'an', 'or', 'ain', 'hers', 'wasn',
'weren', 'above', 'a', 'at', 'your', 'theirs', 'below', 'other', 'not', 're', 'him', 'during', 'which']
Tokenize:
# doclist is not defined anywhere in the text; taking the cleaned emails as an array is an assumption
doclist = docs.values
texts = [[word for word in doc.lower().split() if word not in stoplist] for doc in doclist]
texts[0]
Of course, you can also use packages such as jieba or nltk.
What you get is a bag of words for each document.
Build the corpus: each word is replaced with a numerical index, giving an array.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
The result:
This list tells us that the 14th email (counting from 0) has a total of 6 meaningful words (after our text preprocessing and stop word removal).
Here, word 36 occurs once, word 505 occurs once, and so on.
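The idea behind the (word id, count) pairs can be illustrated without gensim. The snippet below is a pure-Python stand-in, not gensim's implementation; the helper names `build_vocab` and `to_bow` and the toy texts are hypothetical, and gensim may number the words differently.

```python
from collections import Counter

def build_vocab(texts):
    """Assign each distinct word an integer id in order of first appearance."""
    vocab = {}
    for text in texts:
        for word in text:
            vocab.setdefault(word, len(vocab))
    return vocab

def to_bow(text, vocab):
    """Return sorted (word_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[word] for word in text)
    return sorted(counts.items())

# Hypothetical tokenized documents.
texts = [["state", "department", "call"], ["call", "state", "call"]]
vocab = build_vocab(texts)
bow = [to_bow(text, vocab) for text in texts]
print(bow[1])  # [(0, 1), (2, 2)]
```

In the second toy document, word 0 ("state") occurs once and word 2 ("call") occurs twice, which is how the pairs in the real corpus should be read.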
Then, we can finally build the model:
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)
lda.print_topic(10, topn=5)
The most common words in topic #10 are:
- '0.007*kurdistan + 0.006*email + 0.006*see + 0.005*us + 0.005*right'
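The topic string is just a list of weight*word terms joined by " + ", so it is easy to split back into pairs. A small sketch, assuming the string has the shape shown above; the helper name `parse_topic` is hypothetical, and the quote stripping is there only in case the words come wrapped in double quotes.

```python
def parse_topic(topic_string):
    """Split a 'weight*word + weight*word' string into (weight, word) pairs."""
    pairs = []
    for term in topic_string.split(" + "):
        weight, word = term.split("*", 1)
        pairs.append((float(weight), word.strip('"')))
    return pairs

topic_10 = "0.007*kurdistan + 0.006*email + 0.006*see + 0.005*us + 0.005*right"
pairs = parse_topic(topic_10)
print(pairs[0])  # (0.007, 'kurdistan')
```

Each weight is the probability of that word under the topic, i.e. one entry of ϕt.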
Print out five topics and take a look:
lda.print_topics(num_topics=5, num_words=6)
You can practice with gensim when you have time :)
gensim User's Guide
Detailed derivation (the version with the math formulas): see the next post, Introduction
Reference: July Online