Intelligent topic modeling: practical explanation and interpretation

Written by Sanjay Paul, CDIO, HM Revenue & Customs

One of the pivotal application areas of Artificial Intelligence (AI) is Natural Language Processing (NLP). NLP comprises a set of scientific methods for training systems to understand human language, both written text and spoken words. This specialist field of Machine Learning (ML) focuses largely on text processing, with the aim of simplifying manual tasks.

With the ability to understand natural language, NLP goes beyond text processing and speech recognition. Applications of NLP and Natural Language Understanding (NLU) abound in Machine Translation and Sentiment Analysis. Chatbots and Virtual Assistants are common question-answering applications. NLP techniques are also used for natural language generation, creating human-like text or speech. Likewise, Information Retrieval (IR) is an NLP technique often used in large-scale mining of unstructured text to classify documents according to their content.

IR from documents is usually followed by processes such as classification and categorisation through tagging, indexing and the creation of searchable metadata. The accuracy of IR is therefore crucial for these downstream functions. This article considers the various methods and algorithms of IR, specifically in the light of explainability and interpretability.

The IR process essentially decomposes into Topic Modeling (TM) and Text Summarisation (TS). Although the two terms sound similar and are often used interchangeably, there are subtle differences. TM attempts to represent a document by discovering its abstract theme using statistical or probabilistic models. Its output is usually a set of keywords in order of significance. In contrast, TS is the process of distilling the most important information from a source (or sources) to produce an abridged version. It is important to note that TM is, more often than not, sensitive to context (i.e. situation or circumstances), while TS is bounded by its original text and driven by its purpose.

A “topic” is a collection of discrete words. The industry-standard TM algorithms found in the credible literature can be categorised as Algebraic, Probabilistic, Neural and Fuzzy topic models. Algebraic models (e.g. Latent Semantic Analysis, Non-negative Matrix Factorisation) are simple and relatively efficient to compute. Bayesian Probabilistic models (e.g. Latent Dirichlet Allocation and its variants) are intuitive and extendable. Neural topic models (e.g. BERTopic) are scalable for large documents and offer high predictive accuracy.
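To make the algebraic family concrete, here is a minimal sketch of Non-negative Matrix Factorisation using the standard multiplicative update rules, written in plain Python. The vocabulary, the document-term counts and every function name are hypothetical and chosen only for illustration; a production system would more likely rely on an established library.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Squared reconstruction error between V and the product W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorise V (docs x terms) into W (docs x topics) and H (topics x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    # Strictly positive random start so the multiplicative updates stay non-negative.
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h

def top_terms(h, vocab, n=3):
    """Rank the vocabulary by weight within each topic row of H."""
    return [[vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:n]]
            for row in h]

# Hypothetical toy corpus: 4 documents x 6 terms (raw counts).
vocab = ["tax", "refund", "return", "fraud", "audit", "risk"]
V = [
    [3, 2, 2, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 0, 3, 2, 2],
    [0, 0, 1, 2, 3, 1],
]

W, H = nmf(V, n_topics=2)
topics = top_terms(H, vocab)
for idx, terms in enumerate(topics):
    print(f"topic {idx}: {terms}")
```

Each row of H plays the role of a topic: sorting its weights yields the keywords in order of significance, while the matching column of W indicates how strongly each document expresses that topic.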

Sparsity in short text documents (e.g. tweets, chat messages, Q&A on forums, office memos, emails, risk descriptions, clinical notes) is a challenge for TM. While the TM techniques discussed above struggle in this situation, the Fuzzy Topic Modeling (FTM) approach generally works well. The strength of the FTM technique is its wide applicability rather than the precision of its outcome.

The Fuzzy c-means (FCM) clustering algorithm is a variant of the k-means partitional algorithm. It does not assign each word exclusively to one cluster; instead, it gives each word a weight for every cluster, reflecting its degree of belonging to that cluster. Normally, the weights of a word (i.e. its membership coefficients or probabilities) sum to 1 across the clusters. Every word therefore bears a fuzzy membership score.
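The membership idea can be sketched in a few lines of plain Python. The example below assumes each word has already been mapped to a numeric vector; the two-dimensional coordinates, the word list and the fuzzifier value m = 2 are hypothetical choices made only for illustration.

```python
import math
import random

def memberships(points, centers, m=2.0):
    """Fuzzy membership of each point in each cluster; every row sums to 1."""
    power = 2.0 / (m - 1.0)
    table = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if any(d == 0.0 for d in dists):
            # The point coincides with a centre: that centre takes full membership.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            table.append([h / sum(hits) for h in hits])
        else:
            table.append([1.0 / sum((d_i / d_k) ** power for d_k in dists)
                          for d_i in dists])
    return table

def update_centers(points, u, m=2.0):
    """Each centre is the mean of all points, weighted by membership ** m."""
    dim = len(points[0])
    centers = []
    for k in range(len(u[0])):
        weights = [row[k] ** m for row in u]
        total = sum(weights)
        centers.append([sum(w * p[d] for w, p in zip(weights, points)) / total
                        for d in range(dim)])
    return centers

def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Alternate membership and centre updates until the centres stop moving."""
    rng = random.Random(seed)
    centers = rng.sample(points, n_clusters)  # start from randomly chosen points
    for _ in range(n_iter):
        u = memberships(points, centers, m)
        new_centers = update_centers(points, u, m)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships(points, centers, m)

# Hypothetical 2-D word vectors; "claim" is placed between the two groups.
words = ["tax", "refund", "audit", "fraud", "claim"]
points = [[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [1.2, 0.9], [0.6, 0.5]]

centers, u = fuzzy_c_means(points, n_clusters=2)
for word, row in zip(words, u):
    print(word, [round(x, 2) for x in row])
```

A word lying between the two groups receives memberships that are split across both clusters, whereas k-means would force it into exactly one.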

Admittedly, the FCM algorithm falls short on accuracy. It is also slower to compute because of its exhaustive, iterative nature. Yet the FCM algorithm can offer a wealth of clues for explainability and interpretability.

The homogeneity score of a formed cluster describes its compactness: a compact cluster has low intra-cluster variation. Consumers can interpret it as topic cohesiveness. Separation measures tell how well separated a cluster is from the other clusters. This can be interpreted as the distinctiveness of different topics.
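One possible way to quantify these two ideas is sketched below. It uses membership-weighted squared distance as the compactness term and the smallest squared distance between cluster centres as the separation term; their ratio is the Xie-Beni validity index, named here as one concrete choice rather than the specific score the article has in mind. The points, centres and membership values are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def compactness(points, centers, u, m=2.0):
    """Average membership-weighted squared distance to the centres (lower = tighter)."""
    total = sum(u[i][k] ** m * sq_dist(p, c)
                for i, p in enumerate(points)
                for k, c in enumerate(centers))
    return total / len(points)

def separation(centers):
    """Smallest squared distance between any two centres (higher = more distinct)."""
    return min(sq_dist(a, b)
               for i, a in enumerate(centers)
               for b in centers[i + 1:])

def xie_beni(points, centers, u, m=2.0):
    """Ratio of compactness to separation; lower values suggest better topics."""
    return compactness(points, centers, u, m) / separation(centers)

# Hypothetical word vectors, cluster centres and fuzzy memberships.
points = [[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [1.2, 0.9], [0.6, 0.5]]
centers = [[0.1, 0.05], [1.1, 1.0]]
u = [[0.98, 0.02], [0.98, 0.02], [0.02, 0.98], [0.03, 0.97], [0.52, 0.48]]

cohesion = compactness(points, centers, u)
distinct = separation(centers)
xb = xie_beni(points, centers, u)
print(f"compactness={cohesion:.4f} separation={distinct:.4f} xie_beni={xb:.4f}")
```

Read this way, a low compactness value supports a claim of topic cohesiveness, and a high separation value supports a claim of topic distinctiveness.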

FCM is an unsupervised machine learning algorithm. It allows overlapping clusters, where the intersection of cluster boundaries can be interpreted as an area of ambiguity. In turn, this enables a nuanced understanding of the topics and a detailed visualisation of the relationships between topic elements and clusters.
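As a small illustration of reading overlap as ambiguity, the sketch below flags words whose two strongest memberships are close to each other. The membership table, the function name and the 0.2 threshold are assumptions made for the example, not values taken from any particular system.

```python
def ambiguous_words(membership, max_gap=0.2):
    """Return words whose top two cluster memberships differ by at most max_gap.

    Assumes every word has a membership score for at least two clusters.
    """
    flagged = {}
    for word, scores in membership.items():
        top, second = sorted(scores, reverse=True)[:2]
        if top - second <= max_gap:
            flagged[word] = round(top - second, 3)
    return flagged

# Hypothetical fuzzy memberships of three words in two topics.
membership = {
    "tax": [0.95, 0.05],
    "audit": [0.08, 0.92],
    "claim": [0.55, 0.45],
}

flagged = ambiguous_words(membership)
print(flagged)
```

Words flagged in this way sit in the overlap between topics and are natural candidates for human review or for a dedicated region in a visualisation.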

We acknowledge that any language, despite having a grammar, is fluid in how it conveys meaning. Simply put, a similar set of keywords can suggest sharply contrasting meanings. Therefore, any industry-standard unsupervised TM algorithm is always subject to bias and can induce impurity.

In conclusion, to achieve higher degrees of confidence and trust in NLP / NLU, it is suggested that intelligent systems adopt supervised and reinforcement learning, supplemented by their own context-sensitive grammar. To reiterate, explainability and interpretability are the ways to warrant the outputs of ML applications.
