The hardware and bandwidth for this mirror is donated by METANET, the Webhosting and Full Service-Cloud Provider.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]metanet.ch.
Coherence measures seek to emulate human judgment. They tend to rely on the degree to which words co-occur together. Yet, words may co-occur in ways that are unhelpful. For example, consider the words “the” and “this”. They co-occur very frequently in English-language documents. Yet these words are statistically independent. Knowing the relative frequency of the word “the” in a document does not give you any additional information on the relative frequency of the word “this” in a document.
Probabilistic coherence uses the concepts of co-occurrence and statistical independence to rank topics. Note that probabilistic coherence has not yet been rigorously evaluated to assess its correlation to human judgment. Anecdotally, those that have used it tend to find it helpful. Probabilistic coherence is included in both the textmineR and tidylda packages in the R language.
Probabilistic coherence averages differences in conditional probabilities across the top \(M\) most-probable words in a topic. Let \(x_{i, k}\) correspond the \(i\)-th most probable word in topic \(k\), such that \(x_{1,k}\) is the most probable word, \(x_{2,k}\) is the second most probable and so on. Further, let
\[\begin{align} x_{i,k} = \begin{cases} 1 & \text{if the } i\text{-th most probable word appears in a randomly selected document} \\ 0 & \text{otherwise} \end{cases} \end{align}\]
Then probabalistic coherence for the top \(M\) terms in topic \(k\) is calculated as follows:
\[\begin{align} C(k,M) &= \frac{1}{\sum_{j = 1}^{M - 1} M - j} \sum_{i = 1}^{M - 1} \sum_{j = i+1}^{M} P(x_{j,k} = 1 | x_{i,k} = 1) - P(x_{j,k} = 1) \end{align}\]
Where \(P(x_{j,k} = 1 | x_{i,k} = 1)\) is the fraction of contexts containing word \(i\) that contain word \(j\) and \(P(x_{j,k} = 1)\) is the fraction of all contexts containing word \(j\).
This brings us to interpretation:
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.