1 Introduction (motivation)
The recent trajectory of research in the fields of information retrieval and
data mining demonstrates a growing interest in the problem of topic evolution
from both academics and data science practitioners. The accelerated advancement
of internet technology and the production of corpora in online environments,
from academic archives to social media, ensure the growing importance of topic
evolution research as a key tool for navigating, processing, and finding
information in textual corpora.
Latent Dirichlet Allocation (LDA), the model behind most of the topic modeling
research studied within the scope of this paper, is a powerful tool, favored de
facto by most researchers today over its predecessor, Probabilistic Latent
Semantic Analysis (PLSA). However, the LDA method has a serious shortcoming:
it does not explicitly allow for modeling of temporal relationships. In order
to obtain results pertaining to the evolution of ideas, LDA-based models need
to be extended.
An important subject of inquiry in topic evolution modeling, and one that sheds
light on another weakness of the classic LDA-based models, is the application
to online environments. Characterized by emergent (as opposed to static) topics
as well as by a corpus of dynamic size, where themes are born and vanish along
the timeline, online environments require specific algorithms to track the
emergence of new topics and the disappearance of old ones.
A modern classic in the family of LDA-based models, Blei and Lafferty's Dynamic
Topic Model (2006), has another serious drawback: the model imposes constraints
on the time period, penalizing large changes from year to year. Hall et al.
(2008) propose an alternative in which post-hoc calculations are performed
based on the observed probability of each topic given the current time period.
In this paper we analyze the DTM method and examine the improvements introduced
by Hall et al. We also address the fact that most probabilistic topic evolution
algorithms are designed to work with static corpora of a fixed, pre-determined
size, and we examine the ways this technical challenge is circumvented in
research on topic evolution in online corpora.
2 Literature review (history of the problem)
The earliest work on topic evolution modeling dates back to DARPA's 1998
Topic Detection and Tracking (TDT) initiative (Allan 2012), an effort to
optimize navigation in and processing of the stream of news stories.
The current state of the art in topic evolution modeling is best regarded as
split into three main types of models:
1) The discrete time topic evolution model,
2) The continuous time topic evolution model (Wang and McCallum, 2006; Wang,
2008; Kawamae, 2011; Dubey, 2013), and
3) The online topic evolution model (Canini, 2009; Hoffmann, 2010).
The clear dominance of probabilistic topic models in the research on topic
evolution is largely due to the two seminal works: Hofmann’s Probabilistic
Latent Semantic Analysis (PLSA) (Hofmann 2001), and Blei’s Latent Dirichlet
Allocation (Blei 2003).
Blei and Lafferty (2006) represent documents distributed in time as generated
from a normal distribution centroid over topics, where the centroid of
each following time frame is generated from that of the preceding period. In a
slightly different approach, Wang and McCallum (2006) assume that each document
chooses its own time stamp based on a topic-specific beta distribution.
Shan and Li (2010) provide a thorough overview of topic evolution models
based on LDA with time (i.e., post-discretizing or pre-discretizing methods).
To track topic number and structure changes over continuous time, Elshamy
(2013) proposed a model based on the online hierarchical Dirichlet process.
Another class, summarized by Daud et al. (2010), comprises the directed
probabilistic topic models, an unsupervised learning paradigm with soft
clustering abilities.
Though these studies provided valuable information, they all show relatively
narrow scope: in these works, both corpus and vocabulary size are predetermined,
which renders the methods inapplicable to the online environment.
Another drawback to the DTM model is that it imposes constraints on the
time period, penalizing large changes from year to year. Hall et al. (2008)
propose an alternative, where post-hoc calculations are performed based on the
observed probability of each topic given the current time period.
3 The Dynamic Topic Model method (Blei et al.)
3.1 Theoretical foundations
In Dynamic Topic Models, Blei et al. (2006) attempt to extend the basic topic
model described in Latent Dirichlet Allocation (Blei et al., 2003) to a state
space model. A basic topic model using Latent Dirichlet Allocation (LDA)
is a generative probabilistic model of a corpus of documents, and the typical
inferential problem that one tries to solve is the estimation of the posterior
probability of the hidden variables given a document:
p(\theta, z \mid w, \alpha, \beta) = \frac{p(\theta, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)},

where θ is the topic proportions of the document, z is the topic assignments
of the words in the document, w is the sequence of words observed in the document,
α is a vector of K concentration parameters of the Dirichlet distribution from
which θ is drawn, and β are the multinomial distributions of the K topics over
words in the vocabulary.
words in the vocabulary. Due to intractability of this distribution, the method
of variational inference is used to approximate the above posterior.
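As a concrete illustration of this basic (non-dynamic) setup, the following is
a minimal sketch of fitting plain LDA with variational inference using
scikit-learn; the toy documents, the number of topics, and all parameter
choices are illustrative assumptions, not taken from the papers discussed here.

```python
# Minimal LDA sketch: variational inference over a toy corpus (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "topic models describe documents as mixtures of topics",
    "variational inference approximates the intractable posterior",
    "dirichlet priors are placed over topic proportions",
    "words are drawn from topic specific multinomial distributions",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)            # document-term count matrix (the observed w)

lda = LatentDirichletAllocation(n_components=2,        # K topics
                                learning_method="batch",
                                random_state=0)
doc_topic = lda.fit_transform(X)              # approximate posterior topic proportions (theta)
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # normalized beta
```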
The model described in Blei et al. expands on the generative model described
above by extending it to a state space model. This is done by linking the
parameter vectors of consecutive time slices with a normal distribution.
Adopting the notation used in Blei et al., we denote by β_{k,t} the vector of
natural parameters for topic k at time t, where the length of the vector,
denoted V, equals the number of words in the vocabulary. Although the Dirichlet
distribution is typically used to model uncertainty over distributions of
words, it has a drawback in not being amenable to sequential modeling. For this
reason, Blei et al. elect to link the natural parameters β_{k,t} with a state
space model that evolves with Gaussian noise:
\beta_{k,t} \mid \beta_{k,t-1} \sim \mathcal{N}(\beta_{k,t-1}, \sigma^2 I)
A similar connection is made between α_t and α_{t−1}, where α_t is the mean of
the logistic normal distribution from which the topic proportions θ are drawn:
\alpha_t \mid \alpha_{t-1} \sim \mathcal{N}(\alpha_{t-1}, \delta^2 I)
After defining the links between consecutive time state spaces, the generative
process for the dynamic topic model described in Blei et al. can be specified at
a time slice t as:
1. Draw topics \beta_{k,t} \mid \beta_{k,t-1} \sim \mathcal{N}(\beta_{k,t-1}, \sigma^2 I)

2. Draw \alpha_t \mid \alpha_{t-1} \sim \mathcal{N}(\alpha_{t-1}, \delta^2 I)

3. For each document:

   (a) Draw \eta \sim \mathcal{N}(\alpha_t, a^2 I)

   (b) For each word:

       i. Draw Z \sim \mathrm{Mult}(\pi(\eta))

       ii. Draw W_{t,d,n} \sim \mathrm{Mult}(\pi(\beta_{t,z})),

where π maps the natural parameters to the mean parameters:

\pi(x_t)_w = \frac{\exp(x_{t,w})}{\sum_{w} \exp(x_{t,w})}
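To make the state-space structure concrete, the numpy sketch below simulates
the generative process above for a toy corpus. The dimensions, variance
hyperparameters (σ², δ², a²), and document length are arbitrary illustrative
values; real use would of course involve fitting the model to data rather than
sampling from it.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K, T = 1000, 5, 10                   # vocabulary size, topics, time slices (illustrative)
sigma2, delta2, a2 = 0.01, 0.01, 0.1    # illustrative variance hyperparameters
N_d = 50                                # words per document (fixed here for simplicity)

def pi(x):
    """Map natural parameters to mean (softmax) parameters."""
    e = np.exp(x - x.max())
    return e / e.sum()

# 1. Topics evolve as a Gaussian random walk in natural-parameter space.
beta = np.zeros((T, K, V))
beta[0] = rng.normal(0.0, 1.0, size=(K, V))
for t in range(1, T):
    beta[t] = rng.normal(beta[t - 1], np.sqrt(sigma2))

# 2. The mean of the logistic-normal topic-proportion prior also drifts.
alpha = np.zeros((T, K))
for t in range(1, T):
    alpha[t] = rng.normal(alpha[t - 1], np.sqrt(delta2))

# 3. Generate one toy document per time slice.
for t in range(T):
    eta = rng.normal(alpha[t], np.sqrt(a2))       # document-level natural parameters
    theta = pi(eta)                               # topic proportions
    for _ in range(N_d):
        z = rng.choice(K, p=theta)                # topic assignment
        w = rng.choice(V, p=pi(beta[t, z]))       # word drawn from that topic at time t
```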
While the use of Gaussian models allows for a straightforward interpretation,
it does have a setback: the non-conjugacy of the Gaussian and multinomial
models causes posterior inference to be intractable. To get around this intractability
Blei et al. make use of two methods: Variational Kalman Filtering and Variational
Wavelet Regression. When applied to a unigram model, the variational
approximations of both methods were able to smooth out the local fluctuations
of the unigram word counts while still preserving the sharp peaks that might
be indicative of a significant change in the corpus at a certain time period. One
advantage that the wavelet regression method seems to have over the Kalman
filter method is that it is able to distinguish peaks that are close together.
3.2 Alternative Methods
In this section we will discuss similar methods that exist in the literature, focusing
on the advantages and drawbacks of each one.
Most non-trivial attempts at solving the problem of dynamic topic modeling
are based on one of two main methods: LDA and Probabilistic Latent Semantic
Analysis (PLSA). PLSA is a latent variable model that is typically used to make
inferences about the mixture of topics of a given document d containing words
w from some corpus. PLSA can also be viewed as a precursor to LDA. Indeed,
LDA was developed to address the shortcomings of PLSA, in particular the fact
that the latter does not properly describe the generative process by which new
documents are created.
The Dynamic Topic Model by Blei et al. adapts the typical LDA topic model
to a discrete time state space model that is capable of determining the trends
of different topics in a corpus over time. This type of dynamic topic model, a
discrete dynamic topic model (dDTM), is not without its drawbacks. One of
these drawbacks is that time is discretized. This drawback is addressed in
papers that propose alternative models, such as Topics over Time by Wang and
McCallum (2006) and Continuous Time Dynamic Topic Models by Wang (2008). In
Continuous Time Dynamic Topic Models the authors explain the reasoning behind
discretizing time: the resolution or granularity at which the time scale is
chosen could either undermine the assumption of exchangeability between
documents in the same time step (if the scale is too large) or cause the number
of variational parameters to increase at an unmanageable rate (if the scale is
too small). The authors then propose and develop a replacement model, the
continuous time dynamic topic model (cDTM), which can be interpreted as the
limit of a dDTM at the smallest time scale. Although natural parameters are
still used in the cDTM to represent the topics, in modeling how a topic changes
over time the paper departs from the dDTM, which uses a discrete Gaussian
process, and instead proposes to use Brownian motion:
\beta_{k,0} \sim \mathcal{N}(m, v_0 I)

\beta_{k,j} \mid \beta_{k,i}, s \sim \mathcal{N}(\beta_{k,i}, v\,\Delta_{j,i} I).

Here i, j are arbitrary time indices such that j > i > 0, s are the time
stamps, Δ_{j,i} is the time elapsed between i and j, and the variance is thus a
function of the lag between observations or documents.
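The practical difference from the discrete model is that the variance of each
increment scales with the time elapsed between observations. The short sketch
below samples the natural parameters of a single topic under such a
Brownian-motion prior; the time stamps, mean, and variance values are invented
for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

V = 1000                        # vocabulary size (illustrative)
m, v0, v = 0.0, 1.0, 0.05       # illustrative mean and variance parameters

# Irregularly spaced document time stamps: each increment's variance is
# proportional to the lag between consecutive observations.
timestamps = np.array([0.0, 0.3, 1.1, 1.15, 4.0])

beta = [rng.normal(m, np.sqrt(v0), size=V)]              # beta_{k,0} ~ N(m, v0 I)
for j in range(1, len(timestamps)):
    lag = timestamps[j] - timestamps[j - 1]              # Delta_{j,i}
    beta.append(rng.normal(beta[-1], np.sqrt(v * lag)))  # beta_{k,j} ~ N(beta_{k,i}, v*Delta*I)
```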
After establishing the process for modeling the natural parameters β_{k,t}
and how they evolve through time, Wang goes on to outline a generative process
in a fashion similar to that of the dDTM, with two major differences: β_{k,t}
is drawn from Brownian motion (a Gaussian process), and the parameters used to
determine the distribution of topics in a document, θ, are now drawn from
Dir(α), where α appears to be constant with respect to time.
The paper explains in further depth that when the dDTM looks at a set of
observations over an interval of time, this interval has to be divided into
discrete ticks. In the dDTM, a parameter controls the variance at each of these
ticks, and the variance over the whole interval is this parameter multiplied by
the number of ticks made. This leads to the full distribution over terms being
explicitly represented at each tick, and as the granularity of the model
becomes finer with a greater number of ticks, computing the posterior inference
becomes more computationally taxing, even if the documents are sparsely
distributed across the time interval. Since in the cDTM the variance is a
function of the lag between observations, the probabilities at discrete steps
between observations can be ignored. This also allows inference to be handled
sparsely, which transforms the problem of selecting granularity from a
computational concern into a modeling one. Like the dDTM, the cDTM uses
variational methods for inference, more specifically the variational Kalman
filter. Although in theory the Kalman filter algorithm can be applied to the
cDTM, as mentioned previously, the finer the desired granularity, the greater
the demand for computational resources to approximate the posterior. To get
around this computational concern, a sparse variational Kalman filter is
developed. The main idea behind this sparse version is to not explicitly
represent the variational parameter β̂_{t,w} of a word w if that word has not
been observed at time stamp t.
Temporal Text Mining (TTM) is another method for modeling topics in a corpus
of documents and how they evolve and change over time. In a paper by Mei and
Zhai (2005) the authors lay out the process by which a TTM model can be built
to discover the evolutionary theme patterns (ETP) in a corpus of documents.
They first define a topic, or theme, as a probability distribution over words
that share a topic in common across a corpus. A theme span is then defined as
a theme that spans a given time interval (in the paper the terms theme and
theme span are used interchangeably). Next, an evolutionary transition is
defined as the connection between two themes whose similarity score is above a
predefined threshold. After introducing these definitions, Mei and Zhai build
a Theme Evolution Graph: a weighted directed graph where each vertex is a theme
(span) and the edges between them are the evolutionary transitions. In order
to find the probability distribution associated with each theme span, the
Expectation Maximization (EM) algorithm is used. To find the evolutionary
transitions, they compute the KL-divergence between two theme distributions;
when the divergence indicates sufficient similarity, an edge is drawn between
the vertices representing the two theme distributions.
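A rough sketch of how such a theme evolution graph could be assembled from
fitted theme word distributions is shown below. The threshold value, the
direction of the KL-divergence, and the restriction to consecutive time spans
are simplifying assumptions for illustration, not the exact procedure of Mei
and Zhai.

```python
import numpy as np
from scipy.special import rel_entr

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two word distributions over the same vocabulary."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(rel_entr(p, q).sum())

def theme_evolution_edges(themes_by_span, threshold=1.0):
    """themes_by_span: list (over time spans) of 2-D arrays, each row a theme's
    word distribution. Returns directed edges (span, theme_i) -> (span+1, theme_j)
    whenever the divergence between the two themes falls below the threshold."""
    edges = []
    for s in range(len(themes_by_span) - 1):
        for i, earlier in enumerate(themes_by_span[s]):
            for j, later in enumerate(themes_by_span[s + 1]):
                if kl_divergence(later, earlier) < threshold:
                    edges.append(((s, i), (s + 1, j)))
    return edges
```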
4 The improved method (Hall and Jurafsky)
4.1 Motivation (the drawbacks of Blei et al.)
In the paper ‘Studying the History of Ideas Using Topic Models' (2008), Hall,
Jurafsky, and Manning applied the idea of modeling topic evolution to the
domain of scientific literature, in what remained the state-of-the-art model
for the task of topic evolution until 2011 (Thomas et al., 2011).
Focusing on the domain of Computational Linguistics, the authors induce topic
clusters from the documents in the ACL Anthology corpus to study historical
trends in the directions of scientific research. Using estimates of a topic
strength metric, they identify the increase or decline in the popularity of
each topic.
The crux of their method is a straightforward application of unsupervised topic
modeling based on the LDA model by Blei et al. (2003). To model temporal
relationships, the authors begin by adopting Blei's Dynamic Topic Modeling
approach. As discussed before, the DTM represents each year's documents as
generated from a normal distribution centroid over topics. Every t-th year's
centroid is generated from that of year t − 1. The key improvement of Hall et
al. over Blei et al. addresses the following drawback of the DTM: the original
model imposes constraints on time periods, penalizing large year-to-year
changes. (This is also true for the Topics over Time model of McCallum et al.,
where each document is assumed to choose its own time stamp based on a
topic-specific beta distribution that stays relatively inflexible.)
4.2 Theoretical foundations
To address the shortcomings of the model by Blei et al., Hall et al. propose a
model with one novel element: the use of post hoc calculations based on the
observed probability of each topic given the current year. The empirical
probability that an arbitrary paper d written in year y was generated from
topic z is defined as follows:
\hat{p}(z \mid y) = \sum_{d : t_d = y} \hat{p}(z \mid d)\,\hat{p}(d \mid y)
                  = \frac{1}{C} \sum_{d : t_d = y} \hat{p}(z \mid d)
                  = \frac{1}{C} \sum_{d : t_d = y} \sum_{z'_i \in d} I(z'_i = z),
where t_d denotes the date a document was written, and the probability p̂(d|y)
is taken to be a constant 1/C. Hall et al. perform parameter estimation using
collapsed Gibbs sampling (Griffiths and Steyvers, 2004).
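In code, this post-hoc step reduces to averaging the per-document topic
distributions within each year. A minimal sketch follows, under the assumption
that the per-document distributions p̂(z|d) have already been estimated (e.g.,
by collapsed Gibbs sampling or variational inference); names and shapes are
illustrative.

```python
import numpy as np
from collections import defaultdict

def empirical_topic_year(doc_topic, years):
    """doc_topic: (D, K) array of per-document topic proportions p_hat(z|d)
    from a single fit over the whole corpus; years: length-D array of document
    dates t_d. Returns {year: p_hat(z|year)}, taking p(d|y) as the constant 1/C."""
    by_year = defaultdict(list)
    for theta, y in zip(doc_topic, years):
        by_year[y].append(theta)
    # Averaging is exactly (1/C) * sum over documents written in that year.
    return {y: np.mean(thetas, axis=0) for y, thetas in by_year.items()}
```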
To examine the change in strength of each topic over time, the authors begin
by fitting a model of 100 topics from which the supposed generative process
draws. The authors, however, keep only 36 topics that they judge to be relevant
to the corpus, proceeding to manually select seed words for an additional 10
topics. The resulting 46 topics were then used as priors for a new run of
fitting 100 topics to the corpus. (The authors maintain transparency by
explicitly marking the topics derived from manual seeds in their paper.)
The remainder of the investigation by Hall et al. is purely applied and is of
no further interest with respect to the subject of this overview: the authors
examine trends in the probability mass of a given topic over time, visualized
as a smoothed plot of p̂(z|y).
4.3 Corpus entropy and cross-corpora convergence
A separate purely practical application explored by Hall and Jurafsky in the
paper is a model that examines topic diversity in the proceedings of several scientific
conferences. Interested in cross-conference comparison of trend dynamics
that exist within a given conference proceedings corpus, the authors define the
empirical distribution of a topic z at a conference c as follows:
\hat{p}(z \mid c) = \sum_{d : c_d = c} \hat{p}(z \mid d)\,\hat{p}(d \mid c)
                  = \frac{1}{C} \sum_{d : c_d = c} \hat{p}(z \mid d)
                  = \frac{1}{C} \sum_{d : c_d = c} \sum_{z'_i \in d} I(z'_i = z),
which they then condition on the year for each conference, transitioning to
p̂(z|y, c).
Following that, the authors introduce the second novelty in their paper: they
propose to study the breadth of a conference's output by introducing the topic
entropy:
H(z \mid c, y) = -\sum_{i=1}^{K} \hat{p}(z_i \mid c, y) \log \hat{p}(z_i \mid c, y).
The conditional entropy of the topic distribution of a conference is a measure
of the average amount of information expressed by each assignment to a random
variable. A higher topic entropy value signifies that the probability mass is
more evenly distributed across topics.
To investigate whether the topic distributions of the conferences were converging,
the authors examined the Jensen-Shannon divergence between each
pair of conferences.
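Both quantities can be computed directly from the estimated distributions
p̂(z|c, y). The sketch below uses invented four-topic distributions purely for
illustration; the logarithm base is an arbitrary choice, and scipy's
`jensenshannon` returns the JS distance (the square root of the divergence).

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def topic_entropy(p):
    """H(z|c,y) for a topic distribution p_hat(z|c,y), in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Hypothetical per-conference, per-year topic distributions (K = 4 topics each).
p_conf_a = np.array([0.40, 0.30, 0.20, 0.10])
p_conf_b = np.array([0.25, 0.25, 0.25, 0.25])

print(topic_entropy(p_conf_a))   # lower: mass concentrated on a few topics
print(topic_entropy(p_conf_b))   # higher: mass spread evenly (broader conference)

# Convergence between two conferences: Jensen-Shannon divergence.
jsd = jensenshannon(p_conf_a, p_conf_b) ** 2
```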
4.4 Drawbacks of the model
An illustration of some of the drawbacks of the model described above can be
seen when studying literature that applies the Hall model to software engineering
repositories (Linstead et al.; Thomas et al.). From Thomas et al. (2011):
“The Hall model was initially designed for corpora that have different properties
than source code histories. The Hall model typically operates on corpora
in which each version is completely different (e.g., conference proceedings) from
the previous. Files in source code histories, however, are typically incrementally
updated between versions (e.g., main.c version 34 versus main.c version 35), resulting
in partial or full duplications of files between successive versions. After
applying the Hall model to source code histories, we have found that this duplication
applying the Hall model to source code histories, we have found that this duplication
effect causes unintended results that reduces its effectiveness.” Indeed, the
Hall model applies LDA to all versions at the same time. “To combat these
issues, we propose the Diff model, an extension to the Hall model that operates
only on the changes between versions of the repository.”
5 Further research developments
In Thomas et al. (2011), the Hall model is adopted for the study of topics
in versions of the same software development documents. The intuitive and
simple application of LDA to all versions of a document makes the Hall model
attractive for use on such datasets: during the post-processing phase, topic
metrics are computed by first isolating all documents in a particular version.
The assignment of a topic z_a at version V_j is computed as the sum of the
topic's memberships across all documents in V_j:
A(z_a, V_j) = \sum_{i=1}^{|V_j|} \theta_{d_{i,j}, a}.
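Computationally, this assignment metric is just a column-wise sum of the
document-topic membership matrix of each version. A minimal sketch, assuming
those memberships are already available from a fitted model, is given below.

```python
import numpy as np

def topic_assignment(theta_by_version):
    """theta_by_version: list over versions V_j, each an (|V_j|, K) array of
    per-document topic memberships theta_{d_{i,j}}. Returns a (num_versions, K)
    array whose entry [j, a] is A(z_a, V_j), the summed membership of topic a
    over all documents in version j."""
    return np.array([theta.sum(axis=0) for theta in theta_by_version])
```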
However, since most changes are very small, the word co-occurrences that LDA
operates on will largely remain the same, skewing the results in the direction
of the duplicated files.
The authors thus propose a model that adds a “diff” step to the Hall model to
isolate the changes between successive versions of each document, as
illustrated in the authors' paper.
In comparison with the Diff model, the Hall model should find less distinct
topics (each topic having multiple concepts confounded), and new documents
will tend to be matched to fewer topics, resulting in fewer spikes in the
evolution.
Significant developments over the initial LDA model were made in the direction
of online inference. In the Online LDA (OLDA) algorithm for mining text streams
(AlSumait et al. 2008), an empirical Bayes method creates a new model whenever
a text item is added to the corpus.
Iwata et al.’s Multiscale Dynamic Topic Model (MDTM) (2010) employs
multiple timescales to analyze online topic evolution document datasets. In this
model, the multiscale word distributions of the former epoch are assumed to
generate current topic-specific distributions. In a unique approach, the MDTM
detects topic evolution in multiple time scales as well as respective topic popularity,
using the Dirichlet-multinomial distribution to describe the topic evolution.
In a rare detour from the reliance on the Dirichlet distribution, Nallapati et
al. (2010) model topic evolution using a non-homogeneous Poisson process in
their Multiscale Topic Tomography Model (MTTM).
Both MTTM and MDTM have a drawback in treating the number of topics
as well as time interval and scale as fixed. A method that clusters news articles
into storylines to extract popular topics and key ‘entities’ within each story,
as proposed by Ahmed et al. (2008), combines LDA with clustering. The
clustering is organized hierarchically: relevant news are gathered into a story,
and topic models are applied to each story separately. The strength of each
storyline is governed by a non-parametric model based on a generalization of
the Chinese Restaurant Process algorithm (which can be naturally interpreted
as a hierarchical Dirichlet process).
Both OLDA and MDTM are LDA-based models characterized by a relatively
low algorithm time complexity, allowed in part by treating the number of topics
as fixed (in contrast to the non-parametric Bayesian model by Ahmed et al.).
In all three online inference models discussed, an EM algorithm or a Gibbs
sampling method is employed.
6 Open questions and future research directions
One of the critical issues in topic modeling is the posterior inference for individual
data instances. This problem is particularly important in streaming online
environments, and some existing work (Than et al., 2015) suggests the use of
the Frank-Wolfe algorithm for recovering sparse solutions to posterior inference.
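To give a sense of why this yields sparse solutions, below is a minimal sketch
of Frank-Wolfe applied to inferring the topic proportions of a single document
over the probability simplex; each iteration adds at most one new vertex
(topic) to the support of the solution. The objective, initialization,
step-size schedule, and iteration count are simplified assumptions and not the
exact algorithm of Than et al.

```python
import numpy as np

def infer_theta_frank_wolfe(doc_counts, beta, n_iters=30):
    """Sketch: maximize sum_w d_w * log(sum_k theta_k * beta[k, w]) over the
    simplex with Frank-Wolfe. doc_counts: length-V word-count vector for one
    document; beta: (K, V) topic-word distributions. The result is a convex
    combination of at most n_iters + 1 vertices, hence sparse."""
    K = beta.shape[0]
    # Start at the single best vertex (a maximally sparse initial point).
    k0 = int(np.argmax(np.log(np.maximum(beta, 1e-12)) @ doc_counts))
    theta = np.zeros(K)
    theta[k0] = 1.0
    for t in range(n_iters):
        x = theta @ beta                                     # mixture over words
        grad = beta @ (doc_counts / np.maximum(x, 1e-12))    # gradient w.r.t. theta
        k_star = int(np.argmax(grad))                        # best simplex vertex
        gamma = 2.0 / (t + 2)                                # standard FW step size
        theta = (1.0 - gamma) * theta                        # move toward that vertex
        theta[k_star] += gamma
    return theta
```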
Another significant drawback of most of the literature expanding on the PLSA
and LDA models is that it tends to neglect auxiliary information (citation
networks, authors, etc.). Yet augmenting the LDA approach with auxiliary
metadata during topic extraction could potentially aid the interpretation of
the topic model.
A major challenge pertains to the evaluation of experimental results. In
practice, there is little correspondence between model perplexity, as evaluated
on a held-out test set, and the semantic quality of topic evolution results.
According to Chang et al. (2009), topic evolution models that score better on
held-out data may actually produce less semantically meaningful topic evolution
results. There exist no unified criteria for the evaluation of topic evolution
models, making quantitative comparison of experimental results among different
models either infeasible or, when proxied via qualitative methods, subjective.
Establishing a standard metric for the evaluation of topic evolution is an
important task for future development.
Another problematic statistical assumption of the probabilistic topic models
discussed is the “bag of words” principle. In natural language, the semantic
relationships among words, as well as their order, do appear to matter for the
evolution of topics, and omitting these factors is a relaxation that may come
at the cost of semantic meaningfulness. There exist models (Wallach, 2016) that
propose to replace words with phrases as the basic semantic units. We also
believe that natural language processing (NLP) technology is applicable to
topic evolution research and may yield more practical and meaningful findings.
In conclusion, fifteen years since the introduction of Latent Dirichlet
Allocation, topic modeling is a mature area of research, but the more complex
niche within it, topic evolution modeling, has been developing without concrete
methods for cross-model comparison, albeit in a coherent and definite
direction. Overcoming the challenges outlined above would help unify the
scientific efforts toward more semantically meaningful results.