Topic inference visualization
On this page we will visualize the inference of topics on an image dataset and a text dataset. As in most examples, we will use the console applications that are readily available once you install LDA++. The image dataset is the well-known Olivetti faces and the text dataset is 20 newsgroups. Besides LDA++, we will use scikit-learn to fetch the datasets, and matplotlib and wordcloud to plot the inference process. All of these libraries are easily installed with pip, or you can download Anaconda for a full Python distribution.
Fetching the datasets
We will use scikit-learn to fetch and preprocess the datasets in a few lines of Python. The purpose of this example is to visualize the inference process, not to produce the best possible topics, so shortcuts will be taken to save computation and experimentation time.
For simplicity you can download the following code as a script.
```
In : import numpy as np
In : from sklearn.datasets import fetch_olivetti_faces, fetch_20newsgroups
In : faces = fetch_olivetti_faces()
In : with open("/path/to/faces.npy", "wb") as f:
...:     np.save(f, (faces.data*255).astype(np.int32).T)  # the pixels are normalized to [0, 1]
...:     np.save(f, faces.target.astype(np.int32))
...:
In : newsgroups = fetch_20newsgroups(
...:     subset="train",
...:     remove=("headers", "footers", "quotes")
...: )
In : from sklearn.feature_extraction.text import CountVectorizer
In : tf_vectorizer = CountVectorizer(max_df=0.9, min_df=2, max_features=1000,
...:                                 stop_words="english")
In : newsdata = tf_vectorizer.fit_transform(newsgroups.data)
In : with open("/path/to/news.npy", "wb") as f:
...:     np.save(f, newsdata.toarray().astype(np.int32).T)  # densify before saving
...:     np.save(f, newsgroups.target.astype(np.int32))
...:
In : # Save the feature names in order to visualize the textual topics
In : feature_names = tf_vectorizer.get_feature_names()
In : import pickle
In : pickle.dump(feature_names, open("/path/to/fnames.pickle", "wb"))
```
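Each of the files above stores two arrays saved back to back, which can later be read with two consecutive `np.load` calls on the same file object. A minimal sanity-check sketch of this round trip, using an in-memory buffer and made-up data instead of the real files:

```python
import io

import numpy as np

# A tiny stand-in corpus: 5 features x 3 documents of word counts,
# plus one target label per document (same layout as above).
counts = np.arange(15, dtype=np.int32).reshape(5, 3)
targets = np.array([0, 1, 0], dtype=np.int32)

# Save the two arrays back to back, as the snippet above does.
buf = io.BytesIO()
np.save(buf, counts)
np.save(buf, targets)

# Reading them back is just two consecutive np.load calls.
buf.seek(0)
counts_back = np.load(buf)
targets_back = np.load(buf)

assert np.array_equal(counts, counts_back)
assert np.array_equal(targets, targets_back)
```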
After downloading the data and transforming them into the format readable by the console applications, we can very easily infer topics from these datasets. We will use the `--snapshot_every` option to save a model at each epoch so that we can later visualize the inference process.
The following commands train two LDA models, one for the faces dataset and one for the 20 newsgroups. We infer 10 topics for the faces dataset and 20 for the text dataset. You should adjust the `--workers` option to the number of parallel threads your processor can execute.
```
$ lda train --topics 10 --iterations 100 \
>     --e_step_iterations 100 --e_step_tolerance 0.1 \
>     --snapshot_every 1 --workers 4 \
>     faces.npy faces_model.npy
E-M Iteration 1
100
200
...
$ lda train --topics 20 --iterations 100 \
>     --e_step_iterations 100 --e_step_tolerance 0.1 \
>     --snapshot_every 1 --workers 4 \
>     news.npy news_model.npy
E-M Iteration 1
100
200
...
```
After executing the above code (and the code from the previous section) the directory should contain the following files:
- faces_model.npy_001 through faces_model.npy_100
- news_model.npy_001 through news_model.npy_100
The files (faces | news)_model.npy_(001 - 100) are the models saved at the corresponding epochs, and we will use them to plot the evolution of the topics during the inference.
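For the figures below we need these snapshots in epoch order. A short sketch (ours, not part of LDA++) that collects them with a glob; a plain `sorted()` suffices because the epoch suffix is zero-padded:

```python
import glob

# Collect the snapshots in epoch order; sorted() works because the
# epoch suffix is zero-padded (001, 002, ..., 100).
snapshots = sorted(glob.glob("faces_model.npy_*"))

# The epoch number can be recovered from the suffix if needed.
epochs = [int(path.rsplit("_", 1)[1]) for path in snapshots]
```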
In order to visualize the evolution of the topics, we first need to visualize a single topic. The faces dataset has been formatted so that each topic can be visualized as a $64 \times 64$ image, while the text topics will be represented by a wordcloud that emphasizes the most probable words.
The above images can be generated with the following code.
```
In : import matplotlib.pyplot as plt
In : import wordcloud
In : import pickle
In : import numpy as np
In : def load_topics(path):
...:     with open(path, "rb") as f:
...:         _ = np.load(f)  # skip the first array stored in the model file
...:         return np.load(f)  # the topic-word matrix
...:
In : # Visualize a faces topic
In : f = plt.figure(figsize=(3, 3))
In : plt.imshow(load_topics("faces_model.npy")[:, 0].reshape(64, 64), cmap="gray")
Out: <matplotlib.image.AxesImage at 0x7f41f8b7bf90>
In : plt.xticks([])
Out: (array([], dtype=float64), <a list of 0 Text xticklabel objects>)
In : plt.yticks([])
Out: (array([], dtype=float64), <a list of 0 Text yticklabel objects>)
In : plt.tight_layout()
In : f.savefig("path/to/image.png")
In :
In : # Visualize a 20 newsgroups topic
In : f = plt.figure(figsize=(4, 3))
In : plt.imshow(
...:     wordcloud.WordCloud().fit_words(
...:         dict(zip(
...:             pickle.load(open("fnames.pickle", "rb")),
...:             load_topics("news_model.npy")[:, 0]
...:         ))
...:     ).to_image()
...: )
Out: <matplotlib.image.AxesImage at 0x7f41fcea6750>
In : plt.xticks([])
Out: (array([], dtype=float64), <a list of 0 Text xticklabel objects>)
In : plt.yticks([])
Out: (array([], dtype=float64), <a list of 0 Text yticklabel objects>)
In : plt.tight_layout()
In : f.savefig("path/to/image.png")
```
In the following figure we have applied the above visualization for all the topics of the faces dataset for different epochs.
We can see that after one epoch all topics start from approximately the same position, and it is hard to predict what the final outcome will be for each topic. In the later epochs, however, there are clearly topics that focus on some facial characteristics and not others. For instance, the second topic generates no mouths (hence the large black blob where the mouth would be) and the 6th topic generates beards.
We can perform the same visualization for the 20 newsgroups dataset, but since the wordcloud images are larger we will visualize the inference of a single topic. We observe that the topics now converge much faster, within the first few tens of epochs.
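Rather than rendering a full wordcloud for every epoch, a lighter way to follow a single topic across snapshots is to print its most probable words. A sketch with a made-up vocabulary and weights; the helper `top_words` is ours, not part of LDA++, and assumes the topic's word weights come as a 1-d array aligned with the saved feature names:

```python
import numpy as np

def top_words(topic_weights, feature_names, k=5):
    """Return the k most probable words of a topic, most probable first."""
    order = np.argsort(topic_weights)[::-1][:k]
    return [feature_names[i] for i in order]

# Toy example: a vocabulary of 6 words with made-up weights.
vocab = ["space", "nasa", "launch", "hockey", "team", "game"]
weights = np.array([0.30, 0.25, 0.20, 0.10, 0.08, 0.07])
print(top_words(weights, vocab, k=3))  # -> ['space', 'nasa', 'launch']
```

In practice one would call this once per snapshot on the same topic column and watch the word list stabilize as the epochs go by.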
Another attribute of a topic that we can visualize is its distribution over the words and the evolution of that distribution. When the distribution over the words stops changing, the topic model has converged. It is more common, however, to check convergence using the likelihood instead. In the following figure we see the change in the distribution of the same topic as in the figure above; indeed, the topic changes very little from the 30th epoch onwards.
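One simple way to quantify "stops changing" is the total variation distance between a topic's word distribution in consecutive snapshots; when it drops to near zero, the topic has effectively converged. A sketch with made-up distributions (in practice the two arrays would be the same topic column loaded from consecutive snapshot files):

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * np.abs(p / p.sum() - q / q.sum()).sum()

# Made-up word distributions of one topic at three consecutive epochs.
epoch_10 = np.array([0.50, 0.30, 0.15, 0.05])
epoch_11 = np.array([0.45, 0.33, 0.16, 0.06])
epoch_12 = np.array([0.45, 0.33, 0.16, 0.06])

print(tv_distance(epoch_10, epoch_11))  # small but nonzero: still moving
print(tv_distance(epoch_11, epoch_12))  # 0.0: the topic has stopped changing
```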