Topic inference visualization
On this page we will visualize the inference of topics on an image dataset and a text dataset. As in most examples, we will use the console applications that are readily available once you install LDA++. The image dataset is the well-known Olivetti faces and the text dataset is 20 newsgroups. Besides LDA++, we will use scikit-learn to fetch the datasets, and matplotlib and wordcloud to plot the inference process. All of these libraries are easily installed with pip, or you can download Anaconda for a full Python distribution.
Fetching the datasets
We will use scikit-learn to fetch and preprocess the datasets in a few lines of Python. The purpose of this example is to visualize the inference process, not to produce the best possible topics, so shortcuts will be taken to save computation and experimentation time.
For simplicity you can download the following code as a script.
```
In : import numpy as np
In : from sklearn.datasets import fetch_olivetti_faces, fetch_20newsgroups
In : faces = fetch_olivetti_faces()
In : with open("/path/to/faces.npy", "wb") as f:
...:     np.save(f, (faces.data*255).astype(np.int32).T)  # the pixels are normalized to [0, 1]
...:     np.save(f, faces.target.astype(np.int32))
...:
In : newsgroups = fetch_20newsgroups(
...:     subset="train",
...:     remove=("headers", "footers", "quotes")
...: )
In : from sklearn.feature_extraction.text import CountVectorizer
In : tf_vectorizer = CountVectorizer(max_df=0.9, min_df=2, max_features=1000,
...:                                 stop_words="english")
In : newsdata = tf_vectorizer.fit_transform(newsgroups.data)
In : with open("/path/to/news.npy", "wb") as f:
...:     np.save(f, newsdata.toarray().astype(np.int32).T)  # densify before saving
...:     np.save(f, newsgroups.target.astype(np.int32))
...:
In : # Save the feature names in order to visualize the textual topics
In : feature_names = tf_vectorizer.get_feature_names()
In : import pickle
In : pickle.dump(feature_names, open("/path/to/fnames.pickle", "wb"))
```
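Each of the files above stores two arrays saved back to back, which can later be read with two consecutive `np.load` calls on the same file object. A minimal sanity-check sketch of this round trip, using an in-memory buffer and made-up data instead of the real files:

```python
import io

import numpy as np

# A tiny stand-in corpus: 5 features x 3 documents of word counts,
# plus one target label per document (same layout as above).
counts = np.arange(15, dtype=np.int32).reshape(5, 3)
targets = np.array([0, 1, 0], dtype=np.int32)

# Save the two arrays back to back, as the snippet above does.
buf = io.BytesIO()
np.save(buf, counts)
np.save(buf, targets)

# Reading them back is just two consecutive np.load calls.
buf.seek(0)
counts_back = np.load(buf)
targets_back = np.load(buf)

assert np.array_equal(counts, counts_back)
assert np.array_equal(targets, targets_back)
```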
After downloading the data and transforming them into the format readable by the console applications, we can very easily infer topics from these datasets. We will use the `--snapshot_every` option to save a model at each epoch so that we can later visualize the inference process.
The following commands train two LDA models, one for the faces dataset and one for the 20 newsgroups. We infer 10 topics for the faces dataset and 20 for the text dataset. You should adjust the `--workers` option to the number of parallel threads your processor can execute.
```
$ lda train --topics 10 --iterations 100 \
>     --e_step_iterations 100 --e_step_tolerance 0.1 \
>     --snapshot_every 1 --workers 4 \
>     faces.npy faces_model.npy
E-M Iteration 1
100
200
...
$ lda train --topics 20 --iterations 100 \
>     --e_step_iterations 100 --e_step_tolerance 0.1 \
>     --snapshot_every 1 --workers 4 \
>     news.npy news_model.npy
E-M Iteration 1
100
200
...
```
After executing the above code (and the code from the previous section) the directory should contain the following files:
- faces_model.npy_001 through faces_model.npy_100
- news_model.npy_001 through news_model.npy_100
The files (faces | news)_model.npy_(001 - 100) are the models saved at the corresponding epochs, and we will use them to plot the evolution of the topics during the inference.
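For the figures below we need these snapshots in epoch order. A short sketch (ours, not part of LDA++) that collects them with a glob; a plain `sorted()` suffices because the epoch suffix is zero-padded:

```python
import glob

# Collect the snapshots in epoch order; sorted() works because the
# epoch suffix is zero-padded (001, 002, ..., 100).
snapshots = sorted(glob.glob("faces_model.npy_*"))

# The epoch number can be recovered from the suffix if needed.
epochs = [int(path.rsplit("_", 1)[1]) for path in snapshots]
```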
In order to visualize the evolution of the topics, we first need to visualize a single topic. The faces dataset has been formatted so that each topic can be visualized as a $64 \times 64$ image, while the text topics will be represented by a wordcloud that emphasizes the most probable words.
The above images can be generated with the following code.
```
In : import matplotlib.pyplot as plt
In : import wordcloud
In : import pickle
In : import numpy as np
In : def load_topics(path):
...:     with open(path, "rb") as f:
...:         _ = np.load(f)  # skip the first array stored in the model file
...:         return np.load(f)  # the topic-word matrix
...:
In : # Visualize a faces topic
In : f = plt.figure(figsize=(3, 3))
In : plt.imshow(load_topics("faces_model.npy")[:, 0].reshape(64, 64), cmap="gray")
Out: <matplotlib.image.AxesImage at 0x7f41f8b7bf90>
In : plt.xticks([])
Out: (array([], dtype=float64), <a list of 0 Text xticklabel objects>)
In : plt.yticks([])
Out: (array([], dtype=float64), <a list of 0 Text yticklabel objects>)
In : plt.tight_layout()
In : f.savefig("path/to/image.png")
In :
In : # Visualize a 20 newsgroups topic
In : f = plt.figure(figsize=(4, 3))
In : plt.imshow(
...:     wordcloud.WordCloud().fit_words(
...:         dict(zip(
...:             pickle.load(open("fnames.pickle", "rb")),
...:             load_topics("news_model.npy")[:, 0]
...:         ))
...:     ).to_image()
...: )
Out: <matplotlib.image.AxesImage at 0x7f41fcea6750>
In : plt.xticks([])
Out: (array([], dtype=float64), <a list of 0 Text xticklabel objects>)
In : plt.yticks([])
Out: (array([], dtype=float64), <a list of 0 Text yticklabel objects>)
In : plt.tight_layout()
In : f.savefig("path/to/image.png")
```
In the following figure we have applied the above visualization for all the topics of the faces dataset for different epochs.
We can see that after one epoch all topics start from approximately the same position, and it is hard to predict what the final outcome will be for each topic. In the later epochs, however, there are clearly topics that focus on some facial characteristics and not others. For instance, the second topic generates no mouths (hence the large black blob where the mouth would be) and the 6th topic generates beards.
We can perform the same visualization for the 20 newsgroups dataset, but since the wordcloud images are larger we will visualize the inference of a single topic. We observe that the topics now converge much faster, within the first few tens of epochs.
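Rather than rendering a full wordcloud for every epoch, a lighter way to follow a single topic across snapshots is to print its most probable words. A sketch with a made-up vocabulary and weights; the helper `top_words` is ours, not part of LDA++, and assumes the topic's word weights come as a 1-d array aligned with the saved feature names:

```python
import numpy as np

def top_words(topic_weights, feature_names, k=5):
    """Return the k most probable words of a topic, most probable first."""
    order = np.argsort(topic_weights)[::-1][:k]
    return [feature_names[i] for i in order]

# Toy example: a vocabulary of 6 words with made-up weights.
vocab = ["space", "nasa", "launch", "hockey", "team", "game"]
weights = np.array([0.30, 0.25, 0.20, 0.10, 0.08, 0.07])
print(top_words(weights, vocab, k=3))  # -> ['space', 'nasa', 'launch']
```

In practice one would call this once per snapshot on the same topic column and watch the word list stabilize as the epochs go by.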
Another attribute of a topic that we can visualize is its distribution over the words and the evolution of that distribution. When the distribution over the words stops changing, the topic model has converged. It is more common, however, to check convergence using the likelihood instead. In the following figure we see the change in the distribution of the same topic as in the figure above; indeed, the topic changes very little from the 30th epoch onwards.
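One simple way to quantify "stops changing" is the total variation distance between a topic's word distribution in consecutive snapshots; when it drops to near zero, the topic has effectively converged. A sketch with made-up distributions (in practice the two arrays would be the same topic column loaded from consecutive snapshot files):

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * np.abs(p / p.sum() - q / q.sum()).sum()

# Made-up word distributions of one topic at three consecutive epochs.
epoch_10 = np.array([0.50, 0.30, 0.15, 0.05])
epoch_11 = np.array([0.45, 0.33, 0.16, 0.06])
epoch_12 = np.array([0.45, 0.33, 0.16, 0.06])

print(tv_distance(epoch_10, epoch_11))  # small but nonzero: still moving
print(tv_distance(epoch_11, epoch_12))  # 0.0: the topic has stopped changing
```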