Team: Emma Drobina, Alexandra DeLucia, Ashlynn Daughton, Lissa Moore
During the summer of 2021, I was part of a team at Los Alamos National Laboratory (LANL) working on the automated detection and characterization of communities on social media. We sought to develop unsupervised methods for categorizing social media posts into communities and to compare them with existing supervised methods. If unsupervised methods prove effective in this application area, they could save significant time and money by avoiding the manual effort needed to label training data sets. The task also required explainable knowledge discovery techniques to ensure that our algorithms classified communities in ways that are meaningful to humans (that is, by shared topics of discussion and by user behavior and interaction), rather than reaching high accuracy through non-informative correlations.
We performed our analysis on 14 history-related subreddits, which were grouped into four overarching communities: conspiracy, debunking, general, and what-if. Example posts for each category can be seen in the table below.
I generated text embeddings for the posts and comments in all 14 subreddits using BERT and PyTorch. I then aggregated the comment embeddings and comment metadata for each post (using mean, median, min, max, and standard deviation) and appended the aggregates to the corresponding post's embedding and metadata. As a baseline, I also performed analyses using a dataset without any aggregation, where comments and posts were unconnected. For unsupervised learning and explainability, I used exKMC, the state of the art for explainable clustering. It works in two steps: first, it trains a k-means algorithm to identify clusters, then it uses a binary threshold tree with 2k leaves to determine linear boundaries between clusters. For supervised learning, I used random forests built with sklearn and a fully connected neural network built with PyTorch. I generated explanations for the supervised models with LIME.
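The aggregation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the function names (`aggregate_comments`, `build_post_features`) are hypothetical, the toy embeddings have 3 dimensions rather than BERT's, the standard deviation is the population form that NumPy uses by default, and zero-filling posts with no comments is an assumption, since the write-up does not say how such posts were handled.

```python
import numpy as np


def aggregate_comments(comment_embeddings: np.ndarray) -> np.ndarray:
    """Collapse an (n_comments, dim) array into a single (5 * dim,) vector.

    The five statistics are concatenated in the order:
    mean, median, min, max, standard deviation.
    """
    n_comments, dim = comment_embeddings.shape
    if n_comments == 0:
        # Assumption: a post with no comments gets all-zero aggregates.
        return np.zeros(5 * dim)
    stats = [
        comment_embeddings.mean(axis=0),
        np.median(comment_embeddings, axis=0),
        comment_embeddings.min(axis=0),
        comment_embeddings.max(axis=0),
        comment_embeddings.std(axis=0),  # population std (ddof=0)
    ]
    return np.concatenate(stats)


def build_post_features(post_embedding: np.ndarray,
                        comment_embeddings: np.ndarray) -> np.ndarray:
    """Append the aggregated comment statistics to the post's own embedding."""
    return np.concatenate([post_embedding, aggregate_comments(comment_embeddings)])


# Toy data: one post with two comments, 3-dimensional embeddings.
post = np.array([0.5, -1.0, 2.0])
comments = np.array([
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 2.0],
])

features = build_post_features(post, comments)
print(features.shape)  # (18,): 3 post dims + 5 statistics * 3 dims
```

Numeric comment metadata (for example, scores) could be aggregated the same way by treating each metadata field as an extra column of the comment array.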
My investigations revealed that models trained on the aggregated dataset performed significantly better than those trained on the baseline (unaggregated) dataset. However, explanations did not highlight the aggregated features as important for either the unsupervised or the supervised models. This came as a surprise, since the improved performance that came with the aggregated data indicates that the additional features were in some way key to the increase in accuracy. The methods I used to generate explanations relied on single features, so it is possible that the classification and cluster assignments actually relied on a complex interaction of multiple features, which existing tools would not reveal. This points to a need for explainability tools that can capture these feature interactions, particularly in the complex and growing world of social media analysis.
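A toy example shows how a single-feature view can miss an interaction. It uses synthetic XOR data, not the Reddit data, and the helper name `best_single_feature_accuracy` is hypothetical. The label depends entirely on the pair of features, yet no threshold on either feature alone does better than chance, so a per-feature explanation would rate both features as uninformative.

```python
import numpy as np

# Synthetic XOR data: the label is 1 exactly when the two features differ.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])


def best_single_feature_accuracy(X: np.ndarray, y: np.ndarray) -> float:
    """Best accuracy of any rule that thresholds one feature at a time."""
    best = 0.0
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            pred = (X[:, j] > t).astype(int)
            # Try the rule and its flipped version (predict 1 below the threshold).
            for candidate in (pred, 1 - pred):
                best = max(best, float((candidate == y).mean()))
    return best


single_acc = best_single_feature_accuracy(X, y)

# A rule that looks at both features together recovers the label exactly.
pair_pred = (X[:, 0] != X[:, 1]).astype(int)
pair_acc = float((pair_pred == y).mean())

print(single_acc)  # 0.5
print(pair_acc)    # 1.0
```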