hypergraph

Tensor Decomposition with ENSIGN

In this notebook, we introduce the ENSIGN Python API and demonstrate how to use tensor decomposition to analyze data. Most, but not all Python API features are used, and we highly recommend referring to the Python API HTML documentation for additional capabilities. Further example calls to the Python API are also available in the ENSIGN User Guide and ENSIGN Tensor Guidebook.

The overview uses a dataset of Domain Name System (DNS) requests from systems on the Reservoir Labs network. The dataset has been anonymized by masking actual internal Internet Protocol (IP addresses). Typically, we analyze DNS data to find trends and anomalies in the domain names that clients are looking up. These trends and anomalies can be useful for discovering malware, discovering misconfigured devices, and defining network baseline traffic. This is but one possible application of tensor decomposition, and we recommend viewing the ENSIGN Use Case Notebooks, which apply tensor decomposition to geospatial analysis, cyber security, fake news detection, COVID-19 tracking, and bioinformatics.

Table of Contents

In [1]:
import os

import pandas as pd

#--------
#-------- BEGIN ENSIGN imports
#--------
from ensign.comp_top_k import get_top_k
import ensign.cp_decomp as cpd
import ensign.csv2tensor as c2t
from ensign.query_decomp import query_decomp
import ensign.sptensor as spt
import ensign.visualize as viz # Must be imported before matplotlib
#--------
#-------- END ENSIGN imports
#--------

import matplotlib.pyplot as plt

# To use matplotlib in a Jupyter notebook
%matplotlib inline 

# Limit the number of threads the decomposition algorithm uses
os.environ['OMP_NUM_THREADS'] = '16' 

Data

ENSIGN is built to "play nicely" with the widely-used pandas data analysis package. Here, we read a CSV file of information on DNS requests into a pandas dataframe. The data file contains one line per DNS connection. The dataframe will be converted to a tensor for decomposition and analysis.

Further information on the meaning of the fields in this file can be found in the documentation for Zeek, an open source network monitoring tool.

In [2]:
# Path to input data
CSV_FILE = 'data/dns_example.csv'
data = pd.read_csv(CSV_FILE)
data.tail()
Out[2]:
ts uid id.orig_h id.orig_p id.resp_h id.resp_p proto trans_id query qclass ... rcode rcode_name AA TC RD RA Z answers TTLs rejected
10249 1.406488e+09 CJrBhK1yja1qPOvJKe fe80::aa11:43e2:f801:e345 5353 ff02::fb 5353 udp 0 npibbabc0.local 1 ... - - F F F F 0 - - F
10250 1.406488e+09 CJrBhK1yja1qPOvJKe fe80::aa11:43e2:f801:e345 5353 ff02::fb 5353 udp 0 kips.local 1 ... - - F F F F 0 - - F
10251 1.406488e+09 CNPNtW2Sfo66XbXDn4 172.16.1.65 37867 x.x.x.21 53 udp 48606 _ldap._tcp.reservoir.com 1 ... - - F F T F 0 - - F
10252 1.406488e+09 CyGdzr2QUhkyKM4l43 x.x.x.23 5353 224.0.0.251 5353 udp 0 npibbabc0.local 1 ... - - F F F F 0 fe80::42a8:f0ff:febb:abc0 10.1.1.28 120.000000 120.000000 F
10253 1.406488e+09 CyGdzr2QUhkyKM4l43 x.x.x.23 5353 224.0.0.251 5353 udp 0 - - ... - - F F F F 0 - - F

5 rows × 23 columns

Tensor Construction

Tensor decomposition algorithms do not operate directly on CSV files but on tensors formed from the CSV. A tensor is a multidimensional array in which each dimension, or mode, corresponds to a different feature in the data and the indices in a given mode correspond to the possible discrete values of the feature. We convert the data into a tensor with the ENSIGN ETL tool csv2tensor(), which creates a sparse tensor from a selection of features in the CSV. By default, the tensor is a count tensor, whose entries are the number of CSV entries that have the combination of features specified by the indices, but ENSIGN also supports Boolean and value tensors. A Boolean tensor indicates the existence of a CSV entry, and a value tensor uses another feature of the data for the tensor entries. The sparse tensor structure is explored more below by examining the fields of the sparse tensor object. The basic arguments used by the tensor construction function are:

  • filepaths: The path to one or more input files.
  • columns: The columns from data that are included in the tensor. Each column maps to one mode of the tensor.
  • types: The type of each column. The type of the first column is timestamp and it is formatted as seconds since the epoch. Any date format compatible with Python's strftime() and strptime() can also be used along with the datetime type. The rest of the columns are treated as strings (str). Other available types 64-bit integer (int64), 64-bit float (float64), and IPv4/IPv6 addresses (ip).
  • binning: How we discretize the data in each column. This describes how we map the given data from its representation in the CSV file/dataframe to the indices 0, 1, ... used in the tensor representation. We use minute to specify that all timestamps in a minute are binned together. In other words, timestamps representing 00:00:23, 00:00:26, and 00:45 would all map to the minute 00:00 and the index 0. Different binning strategies are available for datetime, timestamp, int64, float64, and ip types. See the csv2tensor() documentation for more details.

In our running DNS example, we choose columns corresponding to time, source and destination IP addresses, and destination port. These features are chosen because they encapsulate who is making requests and when. As the IP addresses and ports are discrete entities, we only bin the timestamp (to the nearest minute) so that close timestamps are considered semantically the same.

In [3]:
tensor = c2t.csv2tensor(
    CSV_FILE,
    columns=['ts', 'id.orig_h', 'id.resp_h', 'id.resp_p'],
    types=['timestamp', 'str', 'str', 'str'],
    binning=['minute', 'none', 'none',  'none'],
    gen_backtrack=True
)

Understanding the Sparse Tensor

We now explore the fields in the sparse tensor object produced by csv2tensor().

Order: A tensor is a multidimensional array. The number of dimensions is referred to as the "order" of a tensor. For the tensor we just created, the order is 4.

In [4]:
tensor.order
Out[4]:
4

Mode Names: Each of those dimensions corresponds to one column of the original CSV file. In the language of tensors, a dimension is referred to as a "mode."

In [5]:
tensor.mode_names
Out[5]:
['ts', 'id.orig_h', 'id.resp_h', 'id.resp_p']

Mode Sizes: Each of the modes has a size determined by the number of unique data values in the original CSV file for that column after they have been binned.

In [6]:
tensor.mode_sizes
Out[6]:
[61, 70, 21, 4]

Labels: Each mode has a number of labels equal to its size. These labels correspond to the values from the original CSV after binning. We print the first three labels of each mode below. Note that the labels for the 'ts' mode have been binned by minute.

In [7]:
for mode_id in range(4):
    print(tensor.mode_names[mode_id] + " labels: " + str(tensor.labels[mode_id][:3]))
ts labels: ['2014-07-27 17:59:00', '2014-07-27 18:00:00', '2014-07-27 18:01:00']
id.orig_h labels: ['172.16.2.17', 'x.x.x.113', 'x.x.x.129']
id.resp_h labels: ['x.x.x.21', 'x.x.x.2', '64.72.64.10']
id.resp_p labels: ['137', '53', '5353']

Sparse Tensor and Entries: As shown through the other fields, we have an order 4 tensor (i.e., a 4-dimensional array) of size 61 x 70 x 21 x 4. Each dimension of this array can be addressed by an integer or, equivalently, a label. Each value in the tensor (i.e., each value in the multidimensional array) can be addressed by a tuple of integers or, equivalently, a tuple of labels. Storing every value of the tensor would waste memory, so we choose to only store elements with nonzero values. This collection of values, and the tuples that index them, are collectively called a "sparse tensor." A tuple and its corresponding value are referred to as an "entry" of the sparse tensor.

We show the first five entries of the sparse tensor we created below. Note the following:

  • Each row represents an entry of the tensor. The first four columns form a tuple e.g., (0,0,0,0) for the zeroeth row below. The last entry is the value of the tensor at that tuple e.g., 2 for the zeroeth row below.
  • The tuple can be converted to the binned data values from the original CSV using labels. For example, (0,0,0,0) can be converted to ('2014-07-27 17:59:00', 'x.x.x.17', 'x.x.x.21', '53')
  • The value of an entry corresponds to the number of times that tuple was seen in the binned data from the original CSV.
In [8]:
tensor.entries.head()
Out[8]:
ts id.orig_h id.resp_h id.resp_p val_idx
0 0 0 0 0 2
1 0 0 0 1 1
2 0 1 0 1 2
3 0 2 0 1 2
4 0 3 0 1 2

Number of Nonzeroes (NNZ): In a sparse tensor, we only represent those elements that have a nonzero value. To get a count of how many entries there are in a tensor, we look at the nnz attribute of the tensor.

In [9]:
tensor.nnz
Out[9]:
1702

Tensor Decomposition

Having constructed a tensor that encapsulates features of our original data, we can now decompose our sparse tensor. The ENSIGN API supports several flavors of CANDECOMP/PARAFAC (CP) decomposition, which decomposes a tensor into sums of outer products of vectors. Each outer product is called a "component" and in practice represents subsets of correlated indices in each mode. Components typically represent a semantically coherent pattern-of-activity in the original tensor. The number of components is called the "rank" of the decomposition.

The varieties of CP decomposition in ENSIGN are CP-ALS, CP-ALS-NN, CP-APR, CP-APR-PDNR, and CP-APR-PQNR, which all make different assumptions of tensor data type and how to perform the optimization. In our running DNS example, we use the CP-APR decomposition for count tensors.

In [10]:
rank = 50
decomposition = cpd.cp_apr(tensor, rank, gen_backtrack=True)

Understanding the Decomposition

The decomposition can be written as a set of factor matrices and a set of weights.

Factors: There is one factor matrix per mode, and each has dimensions of mode size by rank. Each column of the factor matrix is that mode's vector in the outer product in one component. Each row represents a label in that mode. Factor matrices are accessed via the list decomposition.factors and each factor matrix is a NumPy ndarray.

In [11]:
decomposition.factors[0] # Factor matrix for ts mode
Out[11]:
array([[3.62868898e-210, 8.89876815e-003, 1.25029048e-003, ...,
        6.79297549e-006, 0.00000000e+000, 0.00000000e+000],
       [4.57802000e-003, 2.26514098e-002, 8.66403898e-001, ...,
        3.69254775e-008, 9.58682526e-001, 0.00000000e+000],
       [3.08535169e-002, 1.09991059e-002, 0.00000000e+000, ...,
        3.52288965e-004, 0.00000000e+000, 0.00000000e+000],
       ...,
       [2.08313367e-002, 3.15501780e-002, 0.00000000e+000, ...,
        9.90260781e-007, 0.00000000e+000, 0.00000000e+000],
       [1.82069710e-002, 6.22700075e-003, 0.00000000e+000, ...,
        8.81848487e-003, 0.00000000e+000, 0.00000000e+000],
       [3.38567634e-016, 2.51353364e-011, 0.00000000e+000, ...,
        1.12628653e-004, 0.00000000e+000, 0.00000000e+000]])

Weights: Each component has a weight, which represents the approximate number of original CSV lines represented by that component. Weights are are accessed via the list decomposition.weights and each weight is a 64-bit float.

In [12]:
decomposition.weights[0] # Weight of Component 0
Out[12]:
1976.6991080343648

Metrics: Statistics on the quality of the decomposition can be found in the decomposition.metrics dictionary. The available data are:

  • time: How long the decomposition algorithm ran (in seconds)
  • fit: The final fit, a measure of how well the decomposition reconstructs the tensor
  • cosine_sim: The cosine similarity between the tensor and the original decomposition
  • norm_scaling: An alternative to fit
  • coverage: The percentage of tensor entries that are nontrivially reconstructed by the decomposition
  • cp_total_iter: The number of CP iterations performed by the algorithm

The high fit and complete coverage indicate that this decomposition represents the original data well. A near-perfect fit is not necessary to obtain good results.

In [13]:
decomposition.metrics
Out[13]:
{'time': 0.993549108505249,
 'fit': 0.9412308192196687,
 'cosine_sim': 0.9982735035247059,
 'norm_scaling': 1.0002240032254404,
 'coverage': 1.0,
 'cp_total_iter': 100}

Saving the Decomposition to Disk

Because decompositions can be long running operations, we frequently wish to save the results to disk. The write_cp_decomp_dir() function does this by saving the factor matrices and weights of a decomposition as text files. The tensor that was decomposed is also written if requested. For later analysis, tensors and their decomposition can be read from disk.

In [14]:
cpd.write_cp_decomp_dir('decomposition', decomposition, write_tensor=True)

Visualization and Post-Processing

Interpreting a Component

An effective method for analyzing the results of a decomposition is plotting components. We use the plot_component() function below to produce a visual representation of one component in a decomposition. Recall that a component is a scaled outer product of columns of the factor matrices, so we plot each mode of a component.

The subplots below each represent one column of the corresponding factor matrix. The x-axis represent the labels of the mode, while the y-axis represents the score for each label. As a component is a pattern of related tensor entries, the score represents how dominant a given label is in the pattern. A score of 0.0 means that a label is not represented in a pattern. A single score with a value of 1.0 means that, for the given mode, the label with score 1.0 is the only label represented by the pattern. The most common case is when there are a number of scores between 0.0 and 1.0 where the magnitude of the score indicates the relative importance of the score in the pattern. The scores in each mode are normalized such that their sum is equal to 1.0. The highest scoring labels in each mode are printed to the right of the plots.

Intuitively, any tuple of indices formed by choosing one non-zero score from each mode should corresopnd to one or more entries in the original CSV data that are involved in this pattern. In practice, the decomposition represents the best fit of the data to the model, so compression may occur resulting in some tuples not being represented in the original data. This is normal and, in fact, often desirable in condensing the data into coherent patterns.

Components are sorted by weight. We plot the zeroeth and highest-weighted component below. Its weight is approximately 1976, which is a rough estimate of the sum of tensor entry values associated with this pattern. It captures baseline DNS requests from computers on the network. We determine this by analyzing the modes. The non-zero scores in the source IP mode are machines on the network and the only non-zero score in the destination IP mode is the network's primary DNS server (recognized by its IP address). The single non-zero scoring index in the destination port is not surprisingly for port 53. Finally, we conclude this component is capturing baseline requests due to the consistent scores across the time mode. It is very typical to capture such expected behavior in the highest weight component.

In [15]:
comp_id = 0
viz.plot_component(decomposition, comp_id)
Out[15]:

Getting the Highest Scoring Labels in a Component

In order to determine the significance of a component, it is useful to examine the top scoring labels in each mode. ENSIGN provides a utility function, get_top_k(), to retrieve the top scoring labels in one or more components. Below, we see an example invocation of this function that gets the top 10 highest scoring labels for all modes of all components of a decomposition. The arguments are described below.

  • decomposition.factors: The factor matrices of a decomposition to examine
  • decomposition.labels: The labels corresponding to the above factor matrices
  • list(range(rank)): A list of integers from 0 to the rank of the decomposition minus 1. This list tells get_top_k() what components to examine.
  • k=10: The maximum number of labels to return for each mode of each component

Here we use get_top_k() to examine the source IP addresses involved in Component 0.

In [16]:
top_k = get_top_k(decomposition.factors, decomposition.labels, list(range(rank)), k=10)

# Print results for mode 1 of Component 0
comp_id = 0
mode_id = 1
print('LABEL\t\t\t|INDEX\t\t|SCORE')
for tup in top_k[comp_id][mode_id]:
    print('{:<23}\t|{:<14}\t|{}'.format(*tup))
LABEL			|INDEX		|SCORE
x.x.x.182              	|15            	|0.2446167795615285
x.x.x.44               	|22            	|0.2446167795615285
x.x.x.90               	|11            	|0.24290732572126894
x.x.x.14               	|3             	|0.09109536937124511
x.x.x.174              	|6             	|0.08171037635060523
x.x.x.192              	|17            	|0.03654770058331348
x.x.x.202              	|19            	|0.03574243452203707
x.x.x.71               	|23            	|0.018694124629606963
x.x.x.79               	|39            	|0.0030572217618494877
x.x.x.66               	|30            	|0.0010118879370166168

Searching a Decomposition for a Label

If a pattern contains a label of interest, it may be important to know which other patterns include the label. ENSIGN provides this functionality with the query_decomp() function. Below we see an example invocation of this function that searches the decomposition for components that contain the label "x.x.x.21" in mode 2. The arguments are described below.

  • decomposition.factors: The factor matrices of a decomposition to examine
  • decomposition.labels: The labels corresponding to the above factor matrices
  • [mode_id]: A list of the mode_ids to search. In this case we are only searching in mode 1.
  • x.x.x.21: The label we are searching for

Not surprisingly, the tool will indicate that "x.x.x.21" appears in mode 2 in component 0, but it also indicates which other components involve that particular destination IP address.

In [17]:
mode_id = 2
query_decomp(decomposition.factors, decomposition.labels, [mode_id], 'x.x.x.21')
Out[17]:
[       Score  Mode  Component
 0   1.000000     2          0
 1   1.000000     2          2
 2   1.000000     2          3
 3   0.954941     2          5
 4   1.000000     2          7
 5   1.000000     2          9
 6   0.978103     2         13
 7   0.032895     2         14
 8   1.000000     2         15
 9   0.015427     2         16
 10  0.989032     2         19
 11  0.961964     2         20]

Backtracking to Original Data

By providing the gen_backtrack=True argument to both csv2tensor() and cp_apr(), we obtain mappings from tensor entries to original lines and from decomposition components to tensor entries, respectively. We can now compose these maps in order to see the original log entries corresponding to a component. This allows an analyst to examine all other fields not used in tensor construction when studying a pattern. We show how to recover the original log entries associated with component 0. As expected, many of the queries are for websites, which include Google, Gmail, and GitHub.

In [18]:
comp_id = 0
idxs = []
for entry in decomposition.cpd_backtrack[comp_id]:
    idxs += [line for [log, line] in tensor.spt_backtrack[entry]]
data.loc[idxs].reset_index(drop=True).head(10)
Out[18]:
ts uid id.orig_h id.orig_p id.resp_h id.resp_p proto trans_id query qclass ... rcode rcode_name AA TC RD RA Z answers TTLs rejected
0 1.406484e+09 CVEoIdwRBDvbRIPwa x.x.x.14 51133 x.x.x.21 53 udp 40505 daisy.ubuntu.com 1 ... - - F F T F 0 - - F
1 1.406484e+09 C9PZof4KQjYU3yXnE6 x.x.x.14 34505 x.x.x.21 53 udp 14198 www.google.com 1 ... - - F F T F 0 - - F
2 1.406484e+09 CEkGqY3MAjvptniSm x.x.x.14 50112 x.x.x.21 53 udp 49711 daisy.ubuntu.com 1 ... - - F F T F 0 - - F
3 1.406484e+09 C0PX9g4SgnWmEazrzj x.x.x.14 60054 x.x.x.21 53 udp 47706 daisy.ubuntu.com 1 ... - - F F T F 0 - - F
4 1.406484e+09 C1bQIBObkAs8aUhkf x.x.x.14 25374 x.x.x.21 53 udp 58165 talkgadget.google.com 1 ... - - F F T F 0 - - F
5 1.406484e+09 CZwtIG1QhjdH2jzYf3 x.x.x.14 10403 x.x.x.21 53 udp 21662 mail.google.com 1 ... - - F F T F 0 - - F
6 1.406484e+09 CtEBFt24eEKfg1Gpm1 x.x.x.14 45554 x.x.x.21 53 udp 48182 daisy.ubuntu.com 1 ... - - F F T F 0 - - F
7 1.406484e+09 CtAUQ810P6fHGTY2Nk x.x.x.14 57021 x.x.x.21 53 udp 16406 daisy.ubuntu.com 1 ... - - F F T F 0 - - F
8 1.406484e+09 CpoJ4S36PgLNv5J6Hc x.x.x.174 37610 x.x.x.21 53 udp 14496 github.com 1 ... - - F F T F 0 - - F
9 1.406484e+09 CpoJ4S36PgLNv5J6Hc x.x.x.174 37610 x.x.x.21 53 udp 2316 github.com 1 ... - - F F T F 0 - - F

10 rows × 23 columns

Key Takeaways

In this notebook, we introduced the ENSIGN Python API for tensor construction and decomposition. We explained the basic principles of tensor decomposition and alluded to its utility in extracting patterns of interest in a variety of data. This notebook provides the requisite knowledge of ENSIGN and tensor decomposition to explore our other use case notebooks, which demonstrate how ENSIGN can be applied to practical problems in geospatial analysis, cyber security, fake news detection, COVID-19 tracking, and bioinformatics.