In this notebook, we demonstrate that tensor decomposition can be used to learn latent variable models, or statistical models relating observed variables to unseen "hidden" variables. Specifically, we will use a tensor decomposition to learn a topic model for statements in a corpus of news stories. We begin with the entire corpus in order to demonstrate how tensor decompositions can uncover coherent topics. Then, because the statements in the corpus are labeled as "real" or "fake," we partition the corpus according to these labels and train one model on each subset. Using ENSIGN's post-processing tools, we study components that are unique to each subset. Finally, we show how we can use these models to infer whether a withheld news headline is real or fake.
# directory manipulation
import os
import shutil
# general data manipulation
import numpy as np
import pandas as pd
# text pre-processing
import string
from itertools import permutations
import nltk
from nltk import word_tokenize, bigrams, trigrams
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
# ENSIGN tools
import ensign.sptensor as spt
import ensign.cp_decomp as cpd
from ensign.comp_top_k import get_top_k
from ensign.synchronize_labels import synchronize_labels
from ensign.decomp_diff import decomp_diff
# topic visualization
import matplotlib.pyplot as plt
from wordcloud import WordCloud
# cosine distance
from scipy.spatial.distance import cosine
In these experiments, we use the Fake News Inference Dataset as our corpus. The data contain 7,591 statements labeled real and 7,621 labeled fake.
# loading the corpus and partitioning into real and fake statements. holding out some data for prediction
data = pd.read_csv('data/news_data.csv')
holdout = data.loc[:5]
corpus = data.loc[6:]  # .loc slicing is inclusive on both ends, so start at 6 to avoid overlapping the holdout
real = corpus[corpus['label_fnn'] == 'real']
fake = corpus[corpus['label_fnn'] == 'fake']
corpus.head()
We pre-process the statements by tokenizing them, removing stop words, and stemming them.
We then create a 3-dimensional symmetric tensor that counts all (unordered) word triples appearing in the statements across the corpus. Anandkumar et al. show that a rank-k decomposition of this tensor yields k topics (distributions over the vocabulary) of an exchangeable single topic model. They also show that tensor decompositions can be used to learn other latent variable models, including mixtures of Gaussians and latent Dirichlet allocation.
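Concretely, the key identity (paraphrasing Anandkumar et al.; the symbols below are our notation, not ENSIGN's) is that if topic $k$ occurs with mixing weight $w_k$ and has word distribution $\mu_k$, the third-order cross moment of word triples factors as a symmetric rank-$k$ tensor, so a rank-$k$ CP decomposition of the (empirical) triple-count tensor recovers the topics:

```latex
M_3 = \mathbb{E}[x_1 \otimes x_2 \otimes x_3] = \sum_{k} w_k \, \mu_k \otimes \mu_k \otimes \mu_k
```

Here $x_1, x_2, x_3$ are (one-hot encodings of) three exchangeable words drawn from the same document.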
Note that here we do not use the ENSIGN ETL module csv2tensor
but instead create and store the tensor directly along with the appropriate map files.
STOP = stopwords.words('english')
STEMMER = PorterStemmer()
# statement tokenizing, filtering, and stemming
def preprocess(text):
text = text.lower()
text = ''.join(c for c in text if c not in string.punctuation + '‘’“”')
text = word_tokenize(text)
text = [w for w in text if w not in STOP]
text = [STEMMER.stem(w) for w in text]
return text
# constructing the moment tensor for the single exchangeable topic model
def prepare_moment_tensor(df, tensor_dir):
stmts = df['statement'].values
triples = {}
indices = {}
n = 0
for stmt in stmts:
for trigram in trigrams(preprocess(stmt)):
n += 1
triple = tuple(sorted(trigram))
            triples[triple] = triples.get(triple, 0) + 1
for w in trigram:
if w not in indices:
indices[w] = len(indices)
tensor = []
for triple in triples:
val = triples[triple]
idx = [indices[w] for w in triple]
        # use a set so trigrams with repeated words do not emit duplicate coordinates
        tensor += [' '.join(str(x) for x in list(p) + [val]) for p in set(permutations(idx))]
header = 'sptensor\n3\n{} {} {}\n{}\n'.format(
len(indices), len(indices), len(indices), len(tensor))
if os.path.isdir(tensor_dir):
shutil.rmtree(tensor_dir)
os.mkdir(tensor_dir)
with open(tensor_dir + '/tensor_data.txt', 'w') as w:
w.write(header + '\n'.join(tensor))
labels = '\n'.join(indices.keys())
for i in range(3):
with open(tensor_dir + '/map_mode_{}.txt'.format(i), 'w') as w:
w.write('#w{}\n'.format(i) + labels)
# creating a word cloud visualization for a topic
def word_cloud(d, comp_id):
freq = dict(zip(d.labels[0], d.factors[0][:, comp_id]))
wordcloud = WordCloud()
wordcloud.generate_from_frequencies(frequencies=freq)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
# preparing the moment tensor for the whole corpus
prepare_moment_tensor(corpus, 'news_decomposition')
tensor = spt.read_sptensor('news_decomposition')
The rank-20 decomposition of the moment tensor corresponds to learning an exchangeable single topic model with 20 topics.
# learning the topic model for the whole corpus
decomp = cpd.cp_apr(tensor, 20)
In order to view the top-scoring words in each topic, we use get_top_k
from the ENSIGN module comp_top_k
. The function returns the top k (default 10) scoring labels in each mode of the specified components, along with their indices and scores. Because the tensor is symmetric, all modes are identical, so for each component we only need to examine the zeroth mode.
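The core operation behind this kind of top-k query can be sketched in plain Python. The helper below is a hypothetical illustration, not ENSIGN's `get_top_k` implementation: for a single column of a factor matrix, it pairs each label with its score and returns the k highest-scoring pairs.

```python
# Hypothetical sketch of a top-k query on one factor-matrix column
# (illustrative only; not the ENSIGN get_top_k implementation).
def top_k_labels(labels, scores, k=10):
    # pair each label with its score and sort by score, highest first
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

top_k_labels(["tax", "cut", "senat", "insur"], [0.40, 0.25, 0.10, 0.05], k=2)
# → [("tax", 0.4), ("cut", 0.25)]
```

Real factor matrices hold one such score column per component, so the same query is repeated for each component of interest.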
The topics below show coherent clustering of the vocabulary used in news statements. Different topics deal with dollar amounts, quantitative signifiers, time, political units, American government, politicians, political processes, healthcare, taxes, and crime. We show a selection below.
In addition to the raw listing of top weighted words in the topic, we also show a word cloud.
# getting top weighted words in each component (topic)
top_k = get_top_k(decomp.factors, decomp.labels, range(decomp.rank))
The top-scoring words in this topic are numbers, and the other highly ranked words are ones that numbers typically modify, for example "cost", "worth", "deficit", and "taxpayer".
top_k[0][0]
word_cloud(decomp, 0)
The words with high weights in this topic refer to various aspects of American politics, such as "bill", "vote", "senate", "house", "candidate", "democrat", "republican".
top_k[1][0]
word_cloud(decomp, 1)
The words with high weights in this topic refer to various aspects of immigration, such as "illegal", "immigrant", "legal", "people", "countries".
top_k[3][0]
word_cloud(decomp, 3)
This topic gives high weight to words associated with various aspects of taxes such as "cut", "raise", "increase", "pay", "income", and "property".
top_k[5][0]
word_cloud(decomp, 5)
This topic deals with key words used to describe the American health care system including "insurance", "Obamacare", and "medicare".
top_k[6][0]
word_cloud(decomp, 6)
We now train tensor decompositions to learn a single exchangeable topic model for both real and fake news articles. Then, in order to see if certain topics only appear in either real or fake statements, we use the ENSIGN post-processing tool decomp_diff
. This tool computes a mapping between two or more decompositions in order to find similar components across them. An entry [n, -1, 0] means that component n of decomposition 1 has no similar component in decomposition 2; likewise, [-1, n, 0] means that component n of decomposition 2 has no similar component in decomposition 1. The default similarity measure is cosine similarity.
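The matching idea can be illustrated with a small self-contained sketch. The matcher below is hypothetical (it is not `decomp_diff`'s algorithm): it compares each component of the first decomposition against every component of the second by cosine similarity and records a `-1` entry when no counterpart clears the threshold.

```python
# Hypothetical greedy matcher illustrating the mapping format described above
# (not ENSIGN's decomp_diff implementation).
import math

def cosine_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def match_components(comps1, comps2, threshold=0.4):
    mapping = []
    for i, c1 in enumerate(comps1):
        sims = [cosine_sim(c1, c2) for c2 in comps2]
        best = max(range(len(sims)), key=lambda j: sims[j])
        if sims[best] >= threshold:
            mapping.append([i, best, sims[best]])
        else:
            mapping.append([i, -1, 0])  # no similar component on the other side
    return mapping

match_components([[1, 0], [0.7, 0.7]], [[0, 1]])
# the first component is orthogonal to [0, 1], so it maps to -1;
# the second is similar to it, so it maps to component 0
```

In the real setting each component vector is a factor-matrix column over the shared vocabulary, but the mapping semantics are the same.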
Before the decompositions are compared, it is necessary to synchronize the labels of the decompositions such that the indices in each of the tensor modes correspond to the same words. This synchronization is performed by the tool synchronize_labels
.
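What synchronization accomplishes can be sketched with a hypothetical helper (this is not ENSIGN's `synchronize_labels`): build one shared vocabulary across all label lists, so that a given index refers to the same word in every decomposition being compared.

```python
# Hypothetical sketch of label synchronization: assign every word a single
# shared index across decompositions (not the ENSIGN implementation).
def shared_vocabulary(label_lists):
    vocab = []
    seen = set()
    for labels in label_lists:
        for w in labels:
            if w not in seen:  # first occurrence fixes the word's index
                seen.add(w)
                vocab.append(w)
    return {w: i for i, w in enumerate(vocab)}

shared_vocabulary([["tax", "cut"], ["cut", "senat"]])
# → {"tax": 0, "cut": 1, "senat": 2}
```

The real tool must also reorder the rows of each factor matrix to match the shared indices; the sketch shows only the vocabulary-merging step.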
# prepare moment tensors for the real and fake corpora
prepare_moment_tensor(real, 'real_decomposition')
prepare_moment_tensor(fake, 'fake_decomposition')
# synchronize labels for comparison
synchronize_labels(['real_decomposition', 'fake_decomposition'], in_place=True)
# load and decompose tensors
tensor_real = spt.read_sptensor('real_decomposition')
tensor_fake = spt.read_sptensor('fake_decomposition')
decomp_real = cpd.cp_apr(tensor_real, 20)
decomp_fake = cpd.cp_apr(tensor_fake, 20)
# compare decomposition components to find unique topics
mapping = decomp_diff([decomp_real, decomp_fake], ['mapping'], threshold=0.4)['mapping']
mapping
The mapping above shows that there are real and fake topics that do not have fake and real counterparts. Examples are plotted below.
top_k_real = get_top_k(decomp_real.factors, decomp_real.labels, range(decomp_real.rank))
top_k_fake = get_top_k(decomp_fake.factors, decomp_fake.labels, range(decomp_fake.rank))
real_idx = list(filter(lambda x : x[1][1] == -1, enumerate(mapping)))[0][1][0]
top_k_real[real_idx][0]
word_cloud(decomp_real, real_idx)
fake_idx = list(filter(lambda x : x[1][0] == -1, enumerate(mapping)))[0][1][1]
top_k_fake[fake_idx][0]
word_cloud(decomp_fake, fake_idx)
After constructing the word distribution of a held-out headline, we check whether it is closer to a topic (component) of the real model or of the fake model, and predict the corresponding label. We apply this simple prediction function to two test examples that were held out from the training data.
def predict(statement):
tokens = preprocess(statement)
counts = np.zeros(len(decomp_real.labels[0]))
for w in tokens:
if w in decomp_real.labels[0]:
counts[decomp_real.labels[0].index(w)] += 1
counts /= counts.sum()
r = [cosine(counts, decomp_real.factors[0][:, comp_id]) for comp_id in range(decomp_real.rank)]
f = [cosine(counts, decomp_fake.factors[0][:, comp_id]) for comp_id in range(decomp_fake.rank)]
    if min(r) < min(f):
return 'real'
else:
return 'fake'
print('truth:', holdout.loc[1]['label_fnn'])
print('prediction:', predict(holdout.loc[1]['statement']))
print('truth:',holdout.loc[5]['label_fnn'])
print('prediction:', predict(holdout.loc[5]['statement']))
In this notebook, following the work of Anandkumar et al., we demonstrated how tensor decompositions can be used to learn latent variable models. This approach formalized the intuition that tensor decompositions extract the latent patterns in data. We applied this technique in the domain of natural language processing in order to learn models describing fake and real news stories, then we used these models to analyze previously unseen headlines. The decomposition was a descriptive tool because components found by the tensor decomposition synthesized coherent topics found in the headlines. We used this topic model as the basis of a predictive tool by comparing the word distribution of unseen headlines to the known topics. This particular approach might be adapted as a fake news detector on social media.