Getting Started
The goal of this page is to introduce you to the C++ library LDA++. If you just want to use LDA++ to infer topics from a corpus, see the console applications instead.
LDA++ has only one external dependency: Eigen, which it uses for efficient matrix and vector operations. As the following sections show, Eigen matrices and vectors are also the easiest way to interface with the library. The model parameters, for instance, are Eigen matrices, so they can be printed with std::cout or otherwise manipulated.
All of the classes in the library are templates parameterized on the floating point scalar type. This lets us save memory (and possibly gain speed) by using single precision floats, simply by changing double to float.
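To make the memory argument concrete, here is a minimal sketch of a class templated on its scalar type. TopicTable is a hypothetical stand-in, not an LDA++ class, and a plain std::vector replaces the Eigen matrix so that the snippet depends only on the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a library class templated on its scalar type.
// It stores a dense words x topics table of coefficients.
template <typename Scalar>
class TopicTable {
public:
    TopicTable(std::size_t words, std::size_t topics)
        : values_(words * topics, Scalar(0)) {
        assert(words > 0 && topics > 0);
    }

    // Number of bytes occupied by the stored coefficients.
    std::size_t bytes() const { return values_.size() * sizeof(Scalar); }

private:
    std::vector<Scalar> values_;
};
```

Switching from TopicTable<double> to TopicTable<float> changes nothing in the calling code except the template argument, and on platforms where float is 4 bytes and double is 8 bytes it halves the storage.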
LDA facade
All types of LDA training take place through the ldaplusplus::LDA facade. This class combines an expectation step implementation, a maximization step implementation and some model parameters to perform variational inference and compute the optimal LDA model parameters. The interface of LDA is heavily inspired by (in fact it is essentially the same as) the Estimator, Transformer and Classifier interfaces of scikit-learn.
namespace ldaplusplus {
template <typename Scalar>
class LDA
{
typedef Eigen::Matrix<Scalar, Eigen::Dynamic, Eigen::Dynamic> MatrixX;
typedef Eigen::Matrix<Scalar, Eigen::Dynamic, 1> VectorX;
public:
void fit(const Eigen::MatrixXi &X, const Eigen::VectorXi &y);
void fit(const Eigen::MatrixXi &X);
void partial_fit(const Eigen::MatrixXi &X, const Eigen::VectorXi &y);
MatrixX transform(const Eigen::MatrixXi &X);
MatrixX decision_function(const Eigen::MatrixXi &X);
MatrixX decision_function(const MatrixX &Z);
Eigen::VectorXi predict(const MatrixX &scores);
Eigen::VectorXi predict(const Eigen::MatrixXi &X);
const std::shared_ptr<parameters::Parameters> model_parameters();
...
};
} // namespace ldaplusplus
The easiest way to interact with ldaplusplus::LDA is through Eigen matrices. As expected, LDA::fit() fits the model to the provided training data, mutating the model parameters, which are accessible through LDA::model_parameters(). The functions LDA::decision_function() and LDA::predict() both assume supervised LDA with a linear classifier.
Assuming we have created an LDA instance, the following example showcases the use of the facade.
using namespace Eigen;
LDA<double> lda = ... get an LDA instance ...;
// Create 1000 random documents
MatrixXi X = (ArrayXXd::Random(100, 1000).abs() * 20).matrix().cast<int>();
VectorXi y = (ArrayXd::Random(1000).abs() * 5).matrix().cast<int>();
// Assuming lda represents a supervised model
lda.fit(X, y);
auto model = std::static_pointer_cast<parameters::SupervisedModelParameters<double> >(
lda.model_parameters()
);
// Contains the dirichlet prior
VectorXd alpha = model->alpha;
// Contains the topics
MatrixXd beta = model->beta;
// Contains the supervised parameters eta
MatrixXd eta = model->eta;
// Create 100 random test documents
MatrixXi X_test = (ArrayXXd::Random(100, 100).abs() * 20).matrix().cast<int>();
// Z now contains the topic mixtures for each document
MatrixXd Z = lda.transform(X_test);
// y_test contains the predictions and in pseudocode is
// y_test = (lda.transform(X_test).transpose() * lda.model_parameters()->eta).argmax(axis=1)
VectorXi y_test = lda.predict(X_test);
// You can further train one more iteration using partial_fit
lda.partial_fit(X, y);
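The pseudocode comment in the example above describes prediction as a row-wise argmax of Z^T * eta. The following standard-library sketch spells that computation out. It only illustrates the pseudocode and is not the library's implementation: class_scores, argmax_rows and the Matrix alias are hypothetical names, and nested std::vector replaces the Eigen types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // row-major: matrix[row][col]

// scores[d][c] = sum over k of Z[k][d] * eta[k][c], i.e. Z^T * eta.
// Z is topics x documents, eta is topics x classes.
Matrix class_scores(const Matrix &Z, const Matrix &eta) {
    assert(!Z.empty() && Z.size() == eta.size());
    const std::size_t topics = Z.size();
    const std::size_t documents = Z[0].size();
    const std::size_t classes = eta[0].size();
    Matrix scores(documents, std::vector<double>(classes, 0.0));
    for (std::size_t k = 0; k < topics; ++k)
        for (std::size_t d = 0; d < documents; ++d)
            for (std::size_t c = 0; c < classes; ++c)
                scores[d][c] += Z[k][d] * eta[k][c];
    return scores;
}

// Row-wise argmax: the index of the highest scoring class for each document.
std::vector<int> argmax_rows(const Matrix &scores) {
    std::vector<int> labels;
    labels.reserve(scores.size());
    for (const auto &row : scores) {
        assert(!row.empty());
        auto best = std::max_element(row.begin(), row.end());
        labels.push_back(static_cast<int>(std::distance(row.begin(), best)));
    }
    return labels;
}
```

Calling argmax_rows(class_scores(Z, eta)) yields one class index per document, which is the role y_test plays in the example.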
LDABuilder
Although we could build an LDA instance directly using its constructor, it is easier and more readable to use the provided builder, ldaplusplus::LDABuilder. A builder instance can be implicitly converted to an LDA instance, so creating a new LDA instance is as easy as the following code.
// Create an unsupervised lda with 10 topics expecting 1000 words vocabulary
LDA<double> lda = LDABuilder<double>().initialize_topics_random(1000, 10);
Initialize model parameters
In order to create an LDA from an LDABuilder, at least the model parameters must be initialized. The LDABuilder checks whether the model parameters have been initialized correctly and throws a std::runtime_error if they have not.
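The two builder behaviours described so far, implicit conversion to the built type and a std::runtime_error on missing initialization, can be sketched with a toy builder. ToyModel and ToyBuilder are hypothetical names used purely for illustration; how LDABuilder implements this internally is not shown on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical product type; stands in for LDA<Scalar>.
struct ToyModel {
    std::size_t words;
    std::size_t topics;
};

// Hypothetical builder; stands in for LDABuilder<Scalar>.
class ToyBuilder {
public:
    // Setters return *this so that calls can be chained.
    ToyBuilder &initialize_topics(std::size_t words, std::size_t topics) {
        assert(words > 0 && topics > 0);
        words_ = words;
        topics_ = topics;
        initialized_ = true;
        return *this;
    }

    // Conversion operator: `ToyModel m = ToyBuilder()...;` ends up here,
    // which is also the natural place to validate the configuration.
    operator ToyModel() const {
        if (!initialized_) {
            throw std::runtime_error("topics must be initialized before building");
        }
        return ToyModel{words_, topics_};
    }

private:
    std::size_t words_ = 0;
    std::size_t topics_ = 0;
    bool initialized_ = false;
};
```

With this sketch, `ToyModel m = ToyBuilder().initialize_topics(1000, 10);` succeeds, while `ToyModel m = ToyBuilder();` throws. Code that builds models from user-supplied configuration may want to wrap the conversion in a try/catch for std::runtime_error.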
In the case of unsupervised LDA (which is the default for LDABuilder), only the topics must be initialized, using one of the LDABuilder::initialize_topics_*() functions.
namespace ldaplusplus {
template <typename Scalar>
class LDABuilder
{
public:
...
LDABuilder & initialize_topics_seeded(const Eigen::MatrixXi &X, size_t topics, ...);
LDABuilder & initialize_topics_random(size_t words, size_t topics);
LDABuilder & initialize_topics_from_model(
std::shared_ptr<parameters::ModelParameters<Scalar> > model);
...
};
} // namespace ldaplusplus
In the case of supervised topic models (sLDA and fsLDA) one must also initialize the supervised model parameters (after initializing the topics) using one of the LDABuilder::initialize_eta_*() functions.
namespace ldaplusplus {
template <typename Scalar>
class LDABuilder
{
public:
...
LDABuilder & initialize_eta_zeros(size_t num_classes);
LDABuilder & initialize_eta_uniform(size_t num_classes);
LDABuilder & initialize_eta_from_model(
std::shared_ptr<parameters::SupervisedModelParameters<Scalar> > model);
...
};
} // namespace ldaplusplus
Choose LDA method
Choosing the LDA method means choosing the variational inference method for solving an LDA problem, namely the Expectation Step and the Maximization Step. Unlike the console applications, which focus on three models, the library contains many more models and allows users to define their own. The LDABuilder supports creating LDA models that use any of the variational implementations that ship with the library.
Choosing an implementation for the Expectation and Maximization steps is done by calling methods named set_[method_name]_[e or m]_step. Almost all methods have sensible default parameters, and we encourage you to read the API documentation of ldaplusplus::LDABuilder for the documentation of the parameters. For every set_* method there exists a corresponding get_* method that returns a pointer to the corresponding implementation instance. The available method names are:
- classic (LDA)
- supervised (sLDA)
- fast_supervised (fsLDA)
- fast_supervised_online (fsLDA online maximization step only)
- semi_supervised (experimental)
- multinomial_supervised (experimental)
- correspondence_supervised (experimental)
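The set_*/get_* naming scheme is easiest to picture as a builder that stores the chosen step behind a common interface. The sketch below is an illustration under assumptions: EStep, ClassicEStep, SupervisedEStep, StepChooser and e_step() are hypothetical names, and whether the real get_* methods create a new instance on each call is not stated on this page.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical interface for an expectation step implementation.
struct EStep {
    virtual ~EStep() = default;
    virtual std::string name() const = 0;
};

struct ClassicEStep : EStep {
    std::string name() const override { return "classic"; }
};

struct SupervisedEStep : EStep {
    std::string name() const override { return "supervised"; }
};

// Hypothetical builder fragment that keeps the selected step as a pointer.
class StepChooser {
public:
    // Unsupervised (classic) LDA is the default choice.
    StepChooser() : e_step_(get_classic_e_step()) {}

    // get_* hands back a pointer to an implementation instance.
    std::shared_ptr<EStep> get_classic_e_step() const {
        return std::make_shared<ClassicEStep>();
    }
    std::shared_ptr<EStep> get_supervised_e_step() const {
        return std::make_shared<SupervisedEStep>();
    }

    // set_* stores the matching implementation and returns *this for chaining.
    StepChooser &set_classic_e_step() {
        e_step_ = get_classic_e_step();
        return *this;
    }
    StepChooser &set_supervised_e_step() {
        e_step_ = get_supervised_e_step();
        return *this;
    }

    // Access the currently selected step.
    const EStep &e_step() const {
        assert(e_step_ != nullptr);
        return *e_step_;
    }

private:
    std::shared_ptr<EStep> e_step_;
};
```

Keeping the step behind a pointer to an interface is what lets one builder mix and match expectation and maximization implementations, including user-defined ones.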
Examples
In this section we will provide some examples using the LDABuilder to instantiate various kinds of LDA models and use them on a small randomly generated dataset.
The code below can be compiled, provided you have installed LDA++, with the following simple command: g++ -std=c++11 test.cpp -o test -lldaplusplus
#include <iostream>
#include <Eigen/Core>
#include <ldaplusplus/LDABuilder.hpp>
using namespace Eigen;
using namespace ldaplusplus;
int main() {
// Define some variables that we will be using in LDA creation
size_t num_classes = 5;
size_t num_topics = 10;
// Create a random dataset of 50 documents over a 100 word vocabulary,
// together with the corresponding class labels
MatrixXi X = (ArrayXXd::Random(100, 50).abs() * 20).matrix().cast<int>();
VectorXi y = (ArrayXd::Random(50).abs() * num_classes).matrix().cast<int>();
// Create the simplest LDA possible: an unsupervised LDA with random topic
// initialization
LDA<double> lda = LDABuilder<double>().initialize_topics_random(
X.rows(), // X.rows() is the number of words in the vocab
num_topics // how many topics we want to infer
);
// Create a supervised LDA as defined by Wang et al in Simultaneous image
// classification and annotation
LDA<double> slda = LDABuilder<double>()
.set_supervised_e_step()
.set_supervised_m_step()
.initialize_topics_seeded(X, num_topics)
.initialize_eta_zeros(num_classes); // eta must be initialized as well now
// Create a fast supervised LDA as defined in Fast Supervised LDA for
// Discovering Micro-Events in Large-Scale Video Datasets
LDA<double> fslda = LDABuilder<double>()
.set_fast_supervised_e_step()
.set_fast_supervised_m_step()
.initialize_topics_seeded(X, num_topics)
.initialize_eta_zeros(num_classes);
// Train all our models
lda.fit(X);
slda.fit(X, y);
fslda.fit(X, y);
// Extract the top word (index of the largest value in each row of beta)
// for every topic of the unsupervised model
auto model = lda.model_parameters<parameters::ModelParameters<> >();
VectorXi top_words(model->beta.rows());
for (int i=0; i<model->beta.rows(); i++) {
model->beta.row(i).maxCoeff(&top_words[i]);
}
std::cout << "Top Words:" << std::endl << top_words
<< std::endl << std::endl;
// Now to transform the data using the slda model we need an unsupervised
// lda model (because we do not know the class labels for the untransformed
// data)
LDA<double> transformer = LDABuilder<double>().initialize_topics_from_model(
slda.model_parameters<parameters::ModelParameters<> >()
);
MatrixXd Z = transformer.transform(X);
std::cout << "The topic mixtures for the first document" << std::endl
<< Z.col(0) << std::endl << std::endl;
// Predict the class labels using the fslda model (again we will be using
// an unsupervised model because we do not know the class labels
// beforehand)
auto sup_model = fslda.model_parameters<parameters::SupervisedModelParameters<> >();
LDA<double> predictor = LDABuilder<double>()
.initialize_topics_from_model(sup_model)
.initialize_eta_from_model(sup_model);
VectorXi y_pred = predictor.predict(X);
std::cout << "Accuracy: " << (y.array() == y_pred.array()).cast<float>().mean()
<< std::endl;
return 0;
}
A possible output follows:
Top Words:
44
8
65
71
18
33
6
58
56
36
The topic mixtures for the first document
23.433
39.2927
58.34
153.297
96.5914
32.2527
121.298
77.9765
317.722
27.7965
Accuracy: 0.92
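Note that the topic mixture values printed above do not sum to one, which suggests they are unnormalized weights rather than probabilities. If proportions are what you need, one option is to rescale each document's values by their sum. The sketch below does this with the standard library only; to_proportions is a hypothetical helper and not part of LDA++.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Scale non-negative topic weights so that they sum to one.
std::vector<double> to_proportions(const std::vector<double> &weights) {
    const double total = std::accumulate(weights.begin(), weights.end(), 0.0);
    assert(total > 0.0);
    std::vector<double> proportions;
    proportions.reserve(weights.size());
    for (double w : weights) {
        proportions.push_back(w / total);
    }
    return proportions;
}
```

Applied to the ten values printed for the first document, the largest proportion belongs to the ninth entry (index 8), the same topic that dominates the raw output.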