Understanding Fake News

In this notebook, we demonstrate that tensor decomposition can be used to learn latent variable models, or statistical models relating observed variables to unseen "hidden" variables. Specifically, we will use a tensor decomposition to learn a topic model for statements in a corpus of news stories. We begin with the entire corpus in order to demonstrate how tensor decompositions can uncover coherent topics. Then, because the statements in the corpus are labeled as "real" or "fake," we partition the corpus according to these labels and train one model on each subset. Using ENSIGN's post-processing tools, we study components that are unique to each subset. Finally, we show how we can use these models to infer whether a withheld news headline is real or fake.

In [1]:
# directory manipulation
import os
import shutil

# general data manipulation
import numpy as np
import pandas as pd

# text pre-processing
import string
from itertools import permutations
import nltk
from nltk import word_tokenize, bigrams, trigrams
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

# ENSIGN tools
import ensign.sptensor as spt
import ensign.cp_decomp as cpd
from ensign.comp_top_k import get_top_k
from ensign.synchronize_labels import synchronize_labels
from ensign.decomp_diff import decomp_diff

# topic visualization
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# cosine distance
from scipy.spatial.distance import cosine

Data

In these experiments, we use the Fake News Inference Dataset as our corpus. The data contain statements that are labeled real (7591) and fake (7621).

In [2]:
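# load the corpus, hold out a few rows for prediction, and partition the rest by label
# note: .loc slices by label and includes both endpoints, so row 5 lands in both
# holdout and corpus; use data.loc[6:] for a strictly disjoint split
data = pd.read_csv('data/news_data.csv')
holdout = data.loc[:5]
corpus = data.loc[5:]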
real = corpus[corpus['label_fnn'] == 'real']
fake = corpus[corpus['label_fnn'] == 'fake']
corpus.head()
Out[2]:
id date speaker statement sources paragraph_based_content fullText_based_content label_fnn
5 8934 2014-02-17T00:00:00-05:00 Terry McAuliffe "Seventy percent of all uninsured live in hous... ['https://governor.virginia.gov/policy/executi... ['Gov. Terry McAuliffe is urging the General A... Gov. Terry McAuliffe is urging the General Ass... real
6 1415 2010-01-21T17:37:57-05:00 Chain email The House health care bill provides for "free ... ['http://michaelconnelly.viviti.com/entries/ge... ["A chain e-mail written by former attorney Mi... A chain e-mail written by former attorney Mich... fake
7 18318 2020-04-09T16:08:57-04:00 Facebook posts Says a pandemic occurs exactly every 100 years. ['https://www.facebook.com/photo.php?fbid=3091... ['According to this post on Facebook, pandemic... According to this post on Facebook, pandemics ... fake
8 5716 2012-04-11T07:30:00-04:00 Robert Menendez Says the United States "actually exports more ... ['http://www.youtube.com/watch?v=Cvo_fSKHkLY',... ['Drill it in the United States, keep it in th... Drill it in the United States, keep it in the ... real
9 13207 2016-10-07T10:00:24-04:00 Josh Hawley Says "he fought Obama at the Supreme Court — a... ['https://www.youtube.com/watch?v=D_zzqEA9N2A&... ["Throughout the primary and general election ... Throughout the primary and general election ca... real

Pre-processing and Moment Tensor Construction

We pre-process the statements by tokenizing them, removing stop words, and stemming them.

We then create a 3-mode symmetric tensor that counts the unordered triples of consecutive words (trigrams) that appear in the statements across the corpus. Anandkumar et al. show that the k components obtained by decomposing this tensor correspond to k topics (distributions over the vocabulary) in an exchangeable single topic model. They also show that tensor decompositions can be used for learning other latent variable models, including mixtures of Gaussians and latent Dirichlet allocation.

Note that here we do not use the ENSIGN ETL module csv2tensor but instead create and store the tensor directly along with the appropriate map files.
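To make the construction concrete, here is a minimal sketch on a toy corpus. It is not the notebook's prepare_moment_tensor: it builds a dense NumPy array rather than ENSIGN's sparse tensor files, and the function name symmetric_triple_tensor and the toy documents are illustrative assumptions. Sorting each triple makes every ordering of the same three words share one count, and writing that count to every permutation of the indices makes the tensor symmetric.

```python
from collections import Counter
from itertools import permutations

import numpy as np


def symmetric_triple_tensor(docs):
    """Dense symmetric count tensor over consecutive word triples (toy scale only)."""
    vocab = {}
    counts = Counter()
    for tokens in docs:
        for triple in zip(tokens, tokens[1:], tokens[2:]):
            for w in triple:
                vocab.setdefault(w, len(vocab))
            # sorted key: all orderings of the same words share one count
            counts[tuple(sorted(triple))] += 1

    n = len(vocab)
    T = np.zeros((n, n, n))
    for triple, c in counts.items():
        idx = [vocab[w] for w in triple]
        # set() avoids writing the same coordinate twice when a word repeats
        for p in set(permutations(idx)):
            T[p] = c
    return T, vocab


# hypothetical, already pre-processed statements
docs = [['tax', 'cut', 'plan'], ['cut', 'tax', 'plan', 'vote']]
T, vocab = symmetric_triple_tensor(docs)
print(T[vocab['tax'], vocab['cut'], vocab['plan']])  # 2.0
```

A dense array grows with the cube of the vocabulary size, which is why the notebook stores only the nonzero entries.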

In [3]:
STOP = stopwords.words('english')
STEMMER = PorterStemmer()

# statement tokenizing, filtering, and stemming
def preprocess(text):
    text = text.lower()
    text = ''.join(c for c in text if c not in string.punctuation + '‘’“”')
    text = word_tokenize(text)
    text = [w for w in text if w not in STOP]
    text = [STEMMER.stem(w) for w in text]
    return text

# constructing the moment tensor for the single exchangeable topic model
def prepare_moment_tensor(df, tensor_dir):
    stmts = df['statement'].values
    triples = {}
    indices = {}
    for stmt in stmts:
        for trigram in trigrams(preprocess(stmt)):
            # sort so that every ordering of the same three words shares one key
            triple = tuple(sorted(trigram))
            triples[triple] = triples.get(triple, 0) + 1
            for w in trigram:
                if w not in indices:
                    indices[w] = len(indices)
    tensor = []
    for triple in triples:
        val = triples[triple]
        idx = [indices[w] for w in triple]
        tensor += [' '.join([str(x) for x in list(p) + [val]]) for p in permutations(idx)]
    header = 'sptensor\n3\n{} {} {}\n{}\n'.format(
                        len(indices), len(indices), len(indices), len(tensor))
    
    if os.path.isdir(tensor_dir):
        shutil.rmtree(tensor_dir)
    os.mkdir(tensor_dir)
    with open(tensor_dir + '/tensor_data.txt', 'w') as w:
        w.write(header + '\n'.join(tensor))
    labels = '\n'.join(indices.keys())
    for i in range(3):
        with open(tensor_dir + '/map_mode_{}.txt'.format(i), 'w') as w:
            w.write('#w{}\n'.format(i) + labels)

# creating word cloud visualizations for topics
def word_cloud(d, comp_id):
    freq = dict(zip(d.labels[0], d.factors[0][:, comp_id]))
    wordcloud = WordCloud()
    wordcloud.generate_from_frequencies(frequencies=freq)
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
In [4]:
# preparing the moment tensor for the whole corpus
prepare_moment_tensor(corpus, 'news_decomposition')
tensor = spt.read_sptensor('news_decomposition')

Tensor Decomposition

The rank-20 decomposition of the moment tensor corresponds to learning an exchangeable single topic model with 20 topics.
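As a rough illustration of what a rank-k decomposition represents, the sketch below rebuilds a tensor from its components as a sum of rank-one terms. It does not call ENSIGN; the helper cp_reconstruct, the two toy topics, and their weights are assumptions made for this example. When every mode uses the same matrix of word distributions, the result has the symmetric form that the single topic model assigns to its third moment.

```python
import numpy as np


def cp_reconstruct(weights, factors):
    """T[i, j, k] = sum_r weights[r] * A[i, r] * B[j, r] * C[k, r]."""
    A, B, C = factors
    return np.einsum('r,ir,jr,kr->ijk', weights, A, B, C)


# two hypothetical topics over a three-word vocabulary (each column sums to 1)
topics = np.array([[0.7, 0.1],
                   [0.2, 0.1],
                   [0.1, 0.8]])
weights = np.array([0.6, 0.4])  # hypothetical topic proportions

# the same factor matrix in all three modes gives a symmetric tensor
T = cp_reconstruct(weights, [topics, topics, topics])
```

A decomposition runs this in reverse: given the tensor, it searches for weights and factor columns whose reconstruction is close to it.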

In [5]:
# learning the topic model for the whole corpus
decomp = cpd.cp_apr(tensor, 20)

Viewing the Topics

In order to view the top-scoring words in each topic, we use get_top_k from the ENSIGN module comp_top_k. The function returns the top k (default of 10) scoring labels in each mode of the specified components along with their indices and scores. As the tensor was symmetric, all the modes are identical. Therefore, for each component we only need to examine the zeroth mode.

The topics below show coherent clustering of the vocabulary used in news statements. Different topics deal with dollar amounts, quantitative signifiers, time, political units, American government, politicians, political processes, healthcare, taxes, and crime. We show a selection below.

In addition to the raw listing of top weighted words in the topic, we also show a word cloud.
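The idea behind ranking words in a component can be sketched with plain NumPy. This is not ENSIGN's get_top_k; the helper top_k_words and the toy labels and scores are assumptions for illustration. It returns (label, index, score) tuples in the same layout as the listings in this notebook.

```python
import numpy as np


def top_k_words(labels, column, k=10):
    """Return (label, index, score) for the k largest entries of one factor column."""
    order = np.argsort(column)[::-1][:k]  # indices sorted by descending score
    return [(labels[i], int(i), float(column[i])) for i in order]


# hypothetical vocabulary and factor column
labels = ['tax', 'cut', 'vote', 'bill']
column = np.array([0.1, 0.5, 0.05, 0.35])
top2 = top_k_words(labels, column, k=2)
```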

In [6]:
# getting top weighted words in each component (topic)
top_k = get_top_k(decomp.factors, decomp.labels, range(decomp.rank))

Numbers

The top-scoring words in the topic are numbers, and the other reasonably ranked words are those that one would expect numbers to modify. For example: "cost", "worth", "deficit", "taxpayer", etc.

In [7]:
top_k[0][0]
Out[7]:
[('million', 227, 0.0588726096988804),
 ('billion', 492, 0.055495147592943904),
 ('dollar', 259, 0.02834484465761747),
 ('trillion', 1134, 0.016705792260580117),
 ('money', 272, 0.014267555159786375),
 ('year', 29, 0.014070349794228517),
 ('cost', 336, 0.01396929613619127),
 ('taxpay', 1284, 0.013347878123735689),
 ('spend', 669, 0.012710439812249718),
 ('1', 373, 0.011855801657650052)]
In [8]:
word_cloud(decomp, 0)

American Politics

The words with high weights in this topic refer to various aspects of American politics, such as "bill", "vote", "senate", "house", "candidate", "democrat", "republican".

In [9]:
top_k[1][0]
Out[9]:
[('vote', 55, 0.03423056277348304),
 ('republican', 419, 0.03113673867126209),
 ('democrat', 588, 0.02886242433331514),
 ('bill', 12, 0.028212888229609778),
 ('hous', 9, 0.02468244583909429),
 ('senat', 424, 0.0222260036715591),
 ('say', 23, 0.0164916874853805),
 ('pass', 225, 0.015444874272371921),
 ('parti', 1071, 0.013322057565099138),
 ('congress', 761, 0.013083260037182918)]
In [10]:
word_cloud(decomp, 1)

Immigration

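The words with high weights in this topic refer to various aspects of immigration, such as "illegal", "immigrant", "legal", "people", "countries". The component also gives weight to gun-related terms such as "gun" and "check", so it is not purely an immigration topic.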

In [11]:
top_k[3][0]
Out[11]:
[('illeg', 86, 0.026332998108428544),
 ('immigr', 87, 0.024550697447464197),
 ('gun', 172, 0.023672207750432835),
 ('peopl', 124, 0.018514408701841793),
 ('law', 440, 0.015581113826823196),
 ('allow', 56, 0.00866383397238251),
 ('legal', 897, 0.00853546506146313),
 ('check', 976, 0.007633014390822311),
 ('countri', 145, 0.00727758011581773),
 ('get', 49, 0.0069536890314183565)]
In [12]:
word_cloud(decomp, 3)

Taxes

This topic gives high weight to words associated with various aspects of taxes such as "cut", "raise", "increase", "pay", "income", and "property".

In [13]:
top_k[5][0]
Out[13]:
[('tax', 108, 0.11896927908484359),
 ('cut', 131, 0.03622485656488048),
 ('rais', 106, 0.027025124223030668),
 ('increas', 489, 0.020189695267788823),
 ('pay', 238, 0.02005322325640613),
 ('budget', 386, 0.01605152803870952),
 ('romney', 954, 0.014286274926623023),
 ('incom', 583, 0.013952467811310145),
 ('plan', 366, 0.013314300669138317),
 ('would', 185, 0.012955973447631695)]
In [14]:
word_cloud(decomp, 5)

Healthcare

This topic deals with key words used to describe the American health care system including "insurance", "Obamacare", and "medicare".

In [15]:
top_k[6][0]
Out[15]:
[('health', 10, 0.08370410958754429),
 ('care', 11, 0.06956774619718567),
 ('insur', 60, 0.027310629344606232),
 ('plan', 366, 0.02606890513857346),
 ('law', 440, 0.01669270801157477),
 ('would', 185, 0.016463901906769288),
 ('obamacar', 1631, 0.013316430421321745),
 ('bill', 12, 0.01023700099789195),
 ('medicar', 1083, 0.01008734763925023),
 ('act', 508, 0.009940435634070878)]
In [16]:
word_cloud(decomp, 6)

Fake News Study

We now train two tensor decompositions, one on the real statements and one on the fake statements, so that each learns its own exchangeable single topic model. Then, in order to see whether certain topics appear only in real or only in fake statements, we use the ENSIGN post-processing tool decomp_diff. This tool computes a mapping between two or more decompositions in order to find similar components across decompositions. An entry [n, -1, 0] means that component n in decomposition 1 has no similar component in decomposition 2. Likewise, [-1, n, 0] means that component n in decomposition 2 has no similar component in decomposition 1. The default metric is the cosine distance.
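The kind of mapping described here can be sketched as a greedy matching on cosine distance. This is only an assumption about how such a comparison might work, not ENSIGN's decomp_diff algorithm; the function match_components and the toy factor matrices are hypothetical. Columns are components, and a pair is matched only if its distance is below the threshold.

```python
import numpy as np


def match_components(A, B, threshold=0.4):
    """Greedily pair columns of A and B whose cosine distance is below threshold."""
    An = A / np.linalg.norm(A, axis=0)
    Bn = B / np.linalg.norm(B, axis=0)
    D = 1.0 - An.T @ Bn  # D[i, j]: cosine distance between A[:, i] and B[:, j]

    match_a, used_b = {}, set()
    for flat in np.argsort(D, axis=None):  # visit the closest pairs first
        i, j = (int(x) for x in np.unravel_index(flat, D.shape))
        if D[i, j] >= threshold:
            break  # every remaining pair is at least this far apart
        if i not in match_a and j not in used_b:
            match_a[i] = (j, float(D[i, j]))
            used_b.add(j)

    # rows for A's components first, then B's unmatched components
    mapping = [[i, *match_a[i]] if i in match_a else [i, -1, 0]
               for i in range(A.shape[1])]
    mapping += [[-1, j, 0] for j in range(B.shape[1]) if j not in used_b]
    return mapping


# hypothetical factor matrices: A's component 1 equals B's component 0
A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
mapping_toy = match_components(A, B)
```

In this toy case, component 0 of A and component 1 of B have no partner, so they appear with -1 in the other slot.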

Before the decompositions are compared, it is necessary to synchronize the labels of the decompositions such that the indices in each of the tensor modes correspond to the same words. This synchronization is performed by the tool synchronize_labels.
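To see why synchronization matters, consider two vocabularies that index the same word differently. The sketch below merges them into one shared index. It is not ENSIGN's synchronize_labels; the helper shared_vocabulary, the ordering of the merged vocabulary, and the toy labels are assumptions for illustration.

```python
def shared_vocabulary(label_lists):
    """Merge several label lists into one index and remap each list onto it."""
    merged = {}
    for labels in label_lists:
        for w in labels:
            merged.setdefault(w, len(merged))  # first appearance fixes the index
    remaps = [[merged[w] for w in labels] for labels in label_lists]
    return list(merged), remaps


# hypothetical mode labels from two separately built tensors
real_labels = ['job', 'rate', 'state']
fake_labels = ['clinton', 'state', 'job']
vocab_all, (real_map, fake_map) = shared_vocabulary([real_labels, fake_labels])
```

After remapping, "job" and "state" refer to the same row in both tensors, so their factor columns can be compared entry by entry.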

In [17]:
# prepare moment tensors for real and fake corpora
prepare_moment_tensor(real, 'real_decomposition')
prepare_moment_tensor(fake, 'fake_decomposition')

# synchronize labels for comparison
synchronize_labels(['real_decomposition', 'fake_decomposition'], in_place=True)

# load and decompose tensors
tensor_real = spt.read_sptensor('real_decomposition')
tensor_fake = spt.read_sptensor('fake_decomposition')

decomp_real = cpd.cp_apr(tensor_real, 20)
decomp_fake = cpd.cp_apr(tensor_fake, 20)

# compare decomposition components to find unique topics
mapping = decomp_diff([decomp_real, decomp_fake], ['mapping'], threshold=0.4)['mapping']
mapping
Out[17]:
[[0, 1, 0.0917104297550908],
 [1, -1, 0],
 [2, 2, 0.18265666109054224],
 [3, 3, 0.03459313566788247],
 [4, 4, 0.1013767033907671],
 [5, 8, 0.23482662249723885],
 [6, 7, 0.3300842151502954],
 [7, -1, 0],
 [8, 13, 0.14646974662264678],
 [9, 5, 0.3299242292100033],
 [10, 0, 0.060224990738285245],
 [11, -1, 0],
 [12, 9, 0.14931433892709223],
 [13, -1, 0],
 [14, -1, 0],
 [15, 17, 0.3074898823451363],
 [16, 14, 0.1407293789830193],
 [17, -1, 0],
 [18, -1, 0],
 [19, -1, 0],
 [-1, 6, 0],
 [-1, 10, 0],
 [-1, 11, 0],
 [-1, 12, 0],
 [-1, 15, 0],
 [-1, 16, 0],
 [-1, 18, 0],
 [-1, 19, 0]]

Unique Topics

The mapping above shows that some topics in the real model have no counterpart in the fake model, and vice versa. One example of each is shown below.

In [18]:
top_k_real = get_top_k(decomp_real.factors, decomp_real.labels, range(decomp_real.rank))
top_k_fake = get_top_k(decomp_fake.factors, decomp_fake.labels, range(decomp_fake.rank))

Real Topic

In [19]:
# first real component with no fake counterpart
real_idx = next(row[0] for row in mapping if row[1] == -1)
top_k_real[real_idx][0]
Out[19]:
[('job', 9411, 0.058210668448324385),
 ('new', 7544, 0.040110984740684884),
 ('rate', 4133, 0.03987632732435987),
 ('state', 8425, 0.02077828667147116),
 ('creat', 3745, 0.017299773436472213),
 ('unemploy', 8986, 0.015692870862409933),
 ('sinc', 7795, 0.014644727268732689),
 ('nation', 9217, 0.012557657168126),
 ('jersey', 4240, 0.01195537823909087),
 ('highest', 6387, 0.011651502646595471)]
In [20]:
word_cloud(decomp_real, real_idx)

Fake Topic

In [21]:
# first fake component with no real counterpart
fake_idx = next(row[1] for row in mapping if row[0] == -1)
top_k_fake[fake_idx][0]
Out[21]:
[('say', 7698, 0.05443661995706916),
 ('clinton', 7422, 0.051041271007741876),
 ('scott', 7018, 0.03471054223040685),
 ('hillari', 8283, 0.03316138608824446),
 ('gov', 3213, 0.02571303145829814),
 ('show', 6161, 0.02512876873843072),
 ('walker', 3554, 0.02406409368079532),
 ('john', 6117, 0.01576687007661849),
 ('rick', 4456, 0.014286383746411732),
 ('joe', 8218, 0.013995997040735113)]
In [22]:
word_cloud(decomp_fake, fake_idx)

Inference on Unseen Headlines

After constructing the word distribution of each held-out headline, we check whether it is closer to a topic (component) in the fake model or in the real model, and predict the label of the model with the closer topic. We apply this simple prediction function to two test examples drawn from the holdout set.

In [23]:
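def predict(statement):
    tokens = preprocess(statement)
    # labels were synchronized, so one vocabulary index serves both models
    vocab = {w: i for i, w in enumerate(decomp_real.labels[0])}
    counts = np.zeros(len(vocab))
    for w in tokens:
        if w in vocab:
            counts[vocab[w]] += 1
    # avoid dividing by zero when no token is in the vocabulary
    if counts.sum() == 0:
        raise ValueError('statement shares no words with the training vocabulary')
    counts /= counts.sum()

    # cosine distance from the statement's word distribution to every topic
    r = [cosine(counts, decomp_real.factors[0][:, comp_id]) for comp_id in range(decomp_real.rank)]
    f = [cosine(counts, decomp_fake.factors[0][:, comp_id]) for comp_id in range(decomp_fake.rank)]

    # predict the label of the model that contains the single closest topic
    return 'real' if min(r) < min(f) else 'fake'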
In [24]:
print('truth:', holdout.loc[1]['label_fnn'])
print('prediction:', predict(holdout.loc[1]['statement']))
truth: fake
prediction: fake
In [25]:
print('truth:', holdout.loc[5]['label_fnn'])
print('prediction:', predict(holdout.loc[5]['statement']))
truth: real
prediction: real

Key Takeaways

In this notebook, following the work of Anandkumar et al., we demonstrated how tensor decompositions can be used to learn latent variable models. This approach formalizes the intuition that tensor decompositions extract the latent patterns in data. We applied this technique in the domain of natural language processing to learn models describing fake and real news statements, and then used these models to analyze previously unseen headlines. The decomposition served as a descriptive tool because its components formed coherent topics found in the statements. We used this topic model as the basis of a predictive tool by comparing the word distribution of unseen headlines to the known topics. This particular approach might be adapted as a fake news detector on social media.