5 - Text Clustering and Topic Modeling

Text clustering groups documents by semantic similarity, which supports a range of applications such as:

  • outlier detection
  • speeding up labeling
  • finding incorrectly labeled data
  • topic modeling (discover topics in a large collection of textual data)

A Common Pipeline for Text Clustering

A general pipeline for text clustering:

  1. Convert the input documents to embeddings with an embedding model
  2. Reduce the dimensionality of the embeddings with a dimensionality reduction model
  3. Find groups of semantically similar documents with a clustering model
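The three steps above can be sketched end to end in plain Python. This is only a toy illustration, not the pipeline a real project would use: the bag-of-words `embed`, the document-frequency filter in `reduce_dims`, and the greedy cosine-threshold `cluster` are all stand-ins chosen here so the example runs without external libraries. In practice each step would be replaced by a trained embedding model, a proper dimensionality reduction model, and a proper clustering model.

```python
import math
from collections import Counter


def embed(docs):
    """Step 1 (stand-in): bag-of-words count vectors over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vectors.append([float(counts[w]) for w in vocab])
    return vectors


def reduce_dims(vectors, min_docs=2):
    """Step 2 (stand-in): keep only dimensions that are non-zero in >= min_docs documents."""
    keep = [
        j for j in range(len(vectors[0]))
        if sum(1 for v in vectors if v[j] > 0) >= min_docs
    ]
    return [[v[j] for j in keep] for v in vectors]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster(vectors, threshold=0.5):
    """Step 3 (stand-in): join the first cluster whose seed vector is similar enough."""
    seeds, labels = [], []
    for v in vectors:
        for label, seed in enumerate(seeds):
            if cosine(v, seed) >= threshold:
                labels.append(label)
                break
        else:
            # No existing cluster is close enough: start a new one.
            seeds.append(v)
            labels.append(len(seeds) - 1)
    return labels


docs = [
    "cats purr while cats sleep",
    "cats sleep all day",
    "stocks fell as markets dropped",
    "markets dropped so stocks sank",
]
embeddings = embed(docs)          # 1. documents -> vectors
reduced = reduce_dims(embeddings)  # 2. fewer dimensions
labels = cluster(reduced)          # 3. cluster ids per document
print(labels)
```

Keeping the three steps as separate functions mirrors the modularity of the pipeline: any one of them can be swapped without touching the other two.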

From Text Clustering to Topic Modeling


Topic modeling tries to find themes or latent topics in a collection of textual data.

Approach | Description | Pros | Cons
Classical approach (e.g., LDA) | Assumes each topic is characterized by a probability distribution over words, and uses a bag-of-words representation to extract keywords. | Easy to implement. | Ignores the meaning of words and phrases.
BERTopic (a modular topic modeling framework) | Creates clusters of semantically similar documents, then generates a distribution over words for each cluster (e.g., bag-of-words, class-based TF-IDF or c-TF-IDF). | Modular pipeline; can be adapted to different use cases using the same base model. | Still represents a topic through bag-of-words, without considering semantic structure.

Table: Comparison of the classical and BERTopic approaches for topic modeling.
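To make the class-based TF-IDF step more concrete, here is a minimal sketch. It assumes the commonly described c-TF-IDF weighting, in which all documents of a cluster are merged into one "class", and a word's in-class frequency is multiplied by log(1 + A / f), where A is the average number of words per class and f is the word's frequency across all classes. The function names `c_tf_idf` and `top_words` are hypothetical, and a library implementation may differ in details such as normalization.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """clusters: dict mapping a cluster id to the list of documents in that cluster."""
    # Term frequency per class: each cluster is treated as one merged document.
    tf = {
        c: Counter(w for d in docs for w in d.lower().split())
        for c, docs in clusters.items()
    }
    # Frequency of each word across all classes.
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # Average number of words per class.
    avg_words = sum(total.values()) / len(tf)
    return {
        c: {w: n * math.log(1 + avg_words / total[w]) for w, n in counts.items()}
        for c, counts in tf.items()
    }


def top_words(scores, n=3):
    """Return the n highest-scoring words, breaking ties alphabetically."""
    return [w for w, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]


clusters = {
    0: ["cats purr while cats sleep", "cats sleep all day"],
    1: ["stocks fell as markets dropped", "markets dropped so stocks sank"],
}
scores = c_tf_idf(clusters)
for cluster_id in clusters:
    print(cluster_id, top_words(scores[cluster_id]))
```

The top-scoring words of each cluster then serve as that topic's keyword representation.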

Re-ranker Block

A representation model can re-rank the initial set of keywords for each topic, in the same way a re-ranker is used in neural search, to improve the resulting representation. We can further reduce redundancy by applying maximal marginal relevance (MMR), which diversifies the topic representation (e.g., it drops near-duplicate keywords such as "summaries" and "summary").
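As a rough sketch of how MMR diversifies a keyword list: it starts with the keyword most similar to the topic, then repeatedly adds the keyword that best balances similarity to the topic against similarity to the keywords already chosen. The 2-D vectors below are made-up toy values, not real embeddings, and the `mmr` function and its `diversity` parameter are illustrative names rather than a particular library's API.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr(topic_vec, word_vecs, top_n=2, diversity=0.5):
    """Select top_n keywords, trading relevance to the topic against redundancy."""
    if not word_vecs or top_n < 1:
        return []
    relevance = {w: cosine(v, topic_vec) for w, v in word_vecs.items()}
    # Start with the single most relevant keyword.
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(top_n, len(word_vecs)):
        best_word, best_score = None, -math.inf
        for w, v in word_vecs.items():
            if w in selected:
                continue
            # Redundancy: similarity to the closest already-selected keyword.
            redundancy = max(cosine(v, word_vecs[s]) for s in selected)
            score = (1 - diversity) * relevance[w] - diversity * redundancy
            if score > best_score:
                best_word, best_score = w, score
        selected.append(best_word)
    return selected


# Hypothetical toy vectors: "summary" and "summaries" point in nearly the same direction.
topic = (1.0, 0.3)
words = {
    "summary": (1.0, 0.12),
    "summaries": (1.0, 0.1),
    "document": (0.5, 1.0),
}
print(mmr(topic, words, top_n=2, diversity=0.0))  # relevance only
print(mmr(topic, words, top_n=2, diversity=0.5))  # relevance vs. redundancy
```

With `diversity=0.0` the selection is driven by relevance alone, so both near-duplicates are kept. Raising `diversity` penalizes the second near-duplicate and lets a less similar keyword take its place.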

Fig. Example of a re-ranker reordering the original word distribution of a topic.

Here is the full process of clustering, labeling topics, and re-ranking:

Fig. The process of clustering (topic creation), representation (labeling the topics), and re-ranking (fine-tuning the representation), with the re-ranker applied on top of the c-TF-IDF representation.

The Text Generation Block

We can further improve the interpretability of topics by having a generative model write a short, human-readable label for each topic, based on a few representative documents and the topic's keywords.
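One way to picture this step is as prompt construction: the representative documents and keywords of a topic are inserted into a template, and the resulting prompt is sent to a generative model. The sketch below covers only the prompt-building part. The template wording and the `build_label_prompt` helper are assumptions made for illustration, and the call to an actual model is deliberately left out.

```python
def build_label_prompt(keywords, representative_docs, max_docs=3):
    """Assemble a labeling prompt from a topic's keywords and a few of its documents."""
    doc_lines = "\n".join(f"- {d}" for d in representative_docs[:max_docs])
    return (
        "I have a topic that contains the following documents:\n"
        f"{doc_lines}\n\n"
        f"The topic is described by these keywords: {', '.join(keywords)}\n\n"
        "Based on the documents and keywords, give a short label for this topic."
    )


prompt = build_label_prompt(
    keywords=["summary", "document"],
    representative_docs=[
        "We propose a model for abstractive document summarization.",
        "This paper evaluates summaries of long documents.",
    ],
)
print(prompt)
```

Limiting the number of documents with `max_docs` keeps the prompt short, which matters when a label has to be generated for every topic.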
