Learn how we used Topic Modeling to help a client find hostile topics in social media data


    The objective of the exercise was to identify hostile topics emerging from social media data. Four data sources were analysed: Facebook, Instagram, YouTube and Twitter. Topic Modeling and other Natural Language Processing algorithms were used to identify topics from social media posts and comments, across different geographies, for a specific time period.

    Topic models are a family of statistical algorithms used to summarize, explore and index large collections of text documents [1]. The Latent Dirichlet Allocation (LDA) approach was used to identify the topics emerging from social media conversations.

    This case study describes the methodology followed to identify hostile topics. The data was scraped from 4 social media sources by the client using specific keywords. At the client's request, the data, the keywords used to extract it and the final topic clusters that emerged are kept confidential. Sample charts are used to illustrate the process followed.

    Data Sources Analysed

    • Facebook: Posts and comments
    • Instagram: Posts and comments
    • YouTube: Video description, comments and sub-comments
    • Twitter: Tweets and retweets

    Exploratory Data Analysis 
      • A data cleansing process was followed to obtain unique posts and comments; duplicate records were removed from the database.
      • Standard data checks for missing values were performed.
      • The data provided already included sentiment scores for posts, comments and tweets, divided into 5 categories: Neutral, Happy, Sad, Hate and Anger.
      • Each post/comment/tweet had 5 category-wise scores, each giving the proportion of the text that falls under that category. For example, a sentence could be 0.6 happy, 0.2 neutral, 0.1 sad, 0.05 hate and 0.05 anger. The scores of each post/comment/tweet add up to 1.
          • Since the objective of the exercise was to identify hostile topics, the anger and hate scores were added up to form a “Hostile Score” for each post and comment.
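          • A threshold of 0.5 was set to identify hostile posts, comments, descriptions and tweets.
          • The threshold was set at 0.5 because it is the midpoint of the 0 to 1 score range: since the five scores add up to 1, a hostile score above 0.5 means that anger and hate together outweigh the other three categories combined. Anything above this value was considered a substantial hostile score, which focuses the analysis on the posts most likely to contain hostile topics.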
          • Negative posts, comments and tweets were analysed over a specific time period to understand how frequently the emerging hostile topics were discussed (refer to Chart 1 and Chart 2).
          Chart 1

          Chart 1 shows that most of the negative comments were made in June, followed by April.

          Chart 2

          Chart 2 shows that most of the negative comments were made in August, followed by July.

          Both charts show the frequency of negative posts over the given time period. Each day can have more than one negative post/comment. The y-axis shows the negative sentiment score of each post/comment.
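
          The cleansing, scoring and thresholding steps described above can be sketched roughly as follows. This is a minimal illustration, not the client pipeline: the record layout, field names, sample values and the `deduplicate` helper are all hypothetical, and treating the 0.5 threshold as a strict "greater than" is an assumption based on the phrase "anything above".

```python
from collections import Counter

# Hypothetical records: field names and values are illustrative, not client data.
records = [
    {"id": 1, "date": "2020-06-03", "text": "first sample post",
     "scores": {"neutral": 0.20, "happy": 0.60, "sad": 0.10, "hate": 0.05, "anger": 0.05}},
    {"id": 2, "date": "2020-06-14", "text": "second sample post",
     "scores": {"neutral": 0.10, "happy": 0.05, "sad": 0.15, "hate": 0.30, "anger": 0.40}},
    {"id": 3, "date": "2020-06-14", "text": "Second sample post ",  # duplicate of id 2
     "scores": {"neutral": 0.10, "happy": 0.05, "sad": 0.15, "hate": 0.30, "anger": 0.40}},
    {"id": 4, "date": "2020-04-21", "text": "third sample post",
     "scores": {"neutral": 0.25, "happy": 0.00, "sad": 0.15, "hate": 0.35, "anger": 0.25}},
]

HOSTILE_THRESHOLD = 0.5  # assumption: a record is hostile when its score is strictly above 0.5


def deduplicate(rows):
    """Keep the first record for each distinct (case-insensitive, trimmed) text."""
    seen = set()
    unique = []
    for row in rows:
        key = row["text"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def hostile_score(scores):
    """Hostile score = hate score + anger score."""
    return scores["hate"] + scores["anger"]


unique_records = deduplicate(records)
hostile_records = [
    r for r in unique_records if hostile_score(r["scores"]) > HOSTILE_THRESHOLD
]

# Count hostile records per month (YYYY-MM), the kind of tally a frequency chart is built on.
hostile_per_month = Counter(r["date"][:7] for r in hostile_records)
```

          Because the five scores sum to 1, any record kept by this filter has more than half of its sentiment mass in the hate and anger categories.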

          • Natural Language Processing algorithms were applied to the top 5 negative posts, comments and tweets from each social media source to create a word cloud. The word cloud shows the most frequently occurring negative words in those top 5 posts/comments/tweets.
          • The TF-IDF algorithm was also applied to the top 5 negative posts/comments/tweets to gather the most important negative words.
          • While the word cloud emphasises the most frequently occurring words in the corpus, TF-IDF gives more weight to rarer words that could be of high importance.
          • Topic Modeling using the LDA approach was performed on the negative posts/comments/tweets.
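
          The contrast between raw word frequency (what a word cloud displays) and TF-IDF can be illustrated with a small sketch. The posts below are invented stand-ins, the `tokenize` and `tf_idf` helpers are hypothetical, and the formula is the plain textbook variant (term frequency times the log of inverse document frequency); libraries such as scikit-learn use a smoothed variant, and the source does not say which implementation was used.

```python
import math
import re
from collections import Counter

# Hypothetical stand-ins for top negative posts (not client data).
posts = [
    "the service was awful and the staff was awful",
    "awful delays and a rude driver",
    "the refund policy is a scam",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


tokenized = [tokenize(p) for p in posts]

# Word-cloud view: raw counts over the whole corpus.
corpus_counts = Counter(tok for doc in tokenized for tok in doc)


def tf_idf(docs):
    """Return one {token: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return weights


weights = tf_idf(tokenized)
```

          In this toy corpus "awful" is among the most frequent words overall, yet within the first post the rarer word "service" receives a higher TF-IDF weight because it appears in only one document.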

          Topic Modeling using LDA

          • The LDA algorithm was used on the 4 data sources to identify the different topics emerging from the negative posts/comments/tweets.
          • It was decided to extract 5 topics (a number arrived at through parameter tuning) from each social media source: Facebook, Twitter, Instagram and YouTube. The idea was to ascertain whether similar topics emerge from the negative posts/comments across all the social media sources.
          • After several iterations of the LDA algorithm on the datasets, the topic modeling results were finalised.
          • Topic Modeling results were visualised using Python visualisation libraries.
          • The topics emerging from the client data are hidden for confidentiality reasons.
          • A sample of what topic model results look like is shown in Chart 3 and Chart 4.
          • The charts represent 5 topics emerging from the data. Four charts were created, one for each data source.
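
          To make the LDA step more concrete, here is a toy collapsed Gibbs sampler written with only the Python standard library. It is a sketch under stated assumptions, not the implementation used in the study: the source does not name the LDA library, the documents are invented, the function names and hyperparameter values (`alpha`, `beta`, `n_iter`) are hypothetical, and 2 topics are used instead of 5 only to keep the example small. In practice a library implementation would normally be preferred.

```python
import random

# Hypothetical tokenized negative posts (not client data).
docs = [
    "refund scam fraud refund money".split(),
    "scam fraud fake money refund".split(),
    "delay late driver traffic late".split(),
    "driver delay traffic late rude".split(),
]


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_vocab = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]              # topic counts per document
    topic_word = [[0] * n_vocab for _ in range(n_topics)]   # word counts per topic
    topic_total = [0] * n_topics                            # total words per topic
    assignments = []                                        # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            k = rng.randrange(n_topics)
            doc_assign.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(doc_assign)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic for this token.
                probs = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=probs)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    return vocab, topic_word, doc_topic


def top_words(vocab, topic_word, n=3):
    """Return the n highest-count words for each topic."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in topic_word
    ]


vocab, topic_word, doc_topic = lda_gibbs(docs, n_topics=2, n_iter=50, seed=1)
keywords_per_topic = top_words(vocab, topic_word, n=3)
```

          Rerunning with different values of `n_topics` and inspecting `keywords_per_topic` is a rough analogue of the iterative tuning described above; sampled topics vary with the seed, so results should be checked across several runs.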

          Chart 3

          Chart 4

          On the left side of each chart, the 5 circles represent the key topics that emerged from the posts/comments. The proximity between circles indicates how closely the topics are related. In the charts above, none of the circles overlap; however, circles 2 and 5 are close to each other, indicating similarity between those two topics.

          The right side lists the keywords that make up a topic. A full red bar across a keyword denotes that the keyword contributes heavily to the formation of that topic.
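
          One way to put a number on how close two topics are is the Jensen-Shannon divergence between their word distributions, which is the default distance in tools such as pyLDAvis. Whether the charts here were produced that way is an assumption, since the source does not name the visualisation library. The topic distributions below are invented for illustration.

```python
import math


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions over the same vocabulary."""
    def kl(a, b):
        # Kullback-Leibler divergence; terms with zero probability contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]  # midpoint distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical word distributions for three topics over a 4-word vocabulary.
topic_a = [0.5, 0.3, 0.2, 0.0]
topic_b = [0.4, 0.4, 0.2, 0.0]
topic_c = [0.0, 0.0, 0.1, 0.9]

distance_ab = jensen_shannon(topic_a, topic_b)
distance_ac = jensen_shannon(topic_a, topic_c)
```

          With base-2 logarithms the value ranges from 0 for identical distributions to 1 for distributions with no words in common, so `distance_ab` comes out smaller than `distance_ac`, much as two nearby circles would suggest more similar topics than two distant ones.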


          5 topics were identified from each data source. A few topics were common across the four data sources. A final list of the hostile topics and keywords was prepared and shared with the client.
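
          The source does not describe how common topics were matched across sources. As one possible approach, the overlap between topic keyword lists can be measured with the Jaccard index. Everything in this sketch is hypothetical: the keyword lists, the topic numbering and the `jaccard` helper are invented for illustration.

```python
def jaccard(a, b):
    """Jaccard index: shared keywords divided by all distinct keywords."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical top keywords per topic for two sources (not client data).
facebook_topics = {1: ["refund", "scam", "fraud"], 2: ["delay", "late", "driver"]}
twitter_topics = {1: ["late", "delay", "traffic"], 2: ["scam", "fraud", "fake"]}

# For each Facebook topic, find the Twitter topic with the highest keyword overlap.
matches = {
    fb_id: max(twitter_topics, key=lambda tw_id: jaccard(fb_words, twitter_topics[tw_id]))
    for fb_id, fb_words in facebook_topics.items()
}
```

          Here Facebook topic 1 pairs with Twitter topic 2 and Facebook topic 2 with Twitter topic 1. A minimum overlap cut-off would still be needed in practice to avoid pairing topics that share almost nothing.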