In this notebook, we use ENSIGN to build and decompose a Bro/Zeek connections ("conn") tensor in order to gain insight into the activity occurring on a medium-sized business network. The tensor is constructed from a Bro/Zeek log, which is formed by sensors collecting data on connections made on the network. The decomposition separates behaviors into different components that reveal both benign background traffic and potentially malicious activity. Using a detector and ENSIGN's Python tools, we investigate the components to quickly find possible threats. Using ENSIGN's backtracking capabilities, we recover original log entries associated with the malicious activity.
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import numpy as np
import pandas as pd
from ensign.csv2tensor import csv2tensor
from ensign.cp_decomp import cp_apr, read_cp_decomp_dir, write_cp_decomp_dir
from ensign.comp_top_k import get_top_k
from ensign.query_decomp import query_decomp
from ensign.visualize import plot_component
from rad.rad import run_rad
import matplotlib
%matplotlib inline
The data we will investigate in this notebook is a standard Bro/Zeek conn.log containing information about network connections made during one business day.
columns = ['timestamp', 'uid', 'src.ip', 'src.port', 'dst.ip', 'dst.port', 'proto', 'service', 'duration', 'orig_bytes', 'resp_bytes', 'conn_state', 'local_orig', 'local_resp', 'missed_bytes', 'history', 'orig_pkts', 'orig_ip_bytes', 'resp_pkts', 'resp_ip_bytes', 'tunnel_parents', 'ts_mcore', 'last_ts_mcore', 'mcore_orig_pkts', 'mcore_resp_pkts', 'mcore_orig_bytes', 'mcore_resp_bytes', 'mcore_orig_ip_bytes', 'mcore_resp_ip_bytes', 'mcore_hist_pktsize', 'mcore_hist_pkttime', 'orig_cc', 'resp_cc', 'pcr']
conn = pd.read_csv('data/cyber_data.log', sep='\t', skiprows=8, header=None)
conn.columns = columns
conn
We use our ETL tool csv2tensor to read a Bro/Zeek log and construct a tensor from the relevant features. A Bro/Zeek log is a tab-separated file with additional header information, and it is the other format we support besides CSV files. With very few arguments, you can build a tensor for further analysis. The arguments provided in this call are:
filepaths: The location of the CSVs or Bro/Zeek logs with relevant data. This may be a single path, a list of paths, or a path with wildcards, which allows for easy specification and minimal data pre-processing.
columns: The names of the columns that correspond to modes in the tensor. In this example, we construct the "conn" tensor, which has modes corresponding to the timestamp of the connection, the origin and destination hosts, and the destination port.
types: The datatypes of the modes, so that the tool knows how to validate and discretize the data.
binning: The desired discretization scheme for each mode. In this case, we only round the timestamps to the nearest minute.
bro_log: Indicates that the tool should read the files as Bro/Zeek logs.
gen_backtrack: Generates a mapping from tensor entries to log entries.
tensor = csv2tensor(filepaths='data/cyber_data.log',
                    columns=['timestamp', 'src.ip', 'dst.ip', 'dst.port'],
                    types=['timestamp', 'ip', 'ip', 'int64'],
                    binning=['minute', 'none', 'none', 'none'],
                    bro_log=True,
                    gen_backtrack=True)
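To make the construction concrete, here is a minimal sketch of what a count tensor over these four modes contains, written with only the Python standard library. It is not ENSIGN's implementation: the `build_count_tensor` helper and the sample records are hypothetical, and truncating each timestamp to the start of its minute is a simplifying assumption about how the 'minute' binning behaves.

```python
from collections import Counter
from datetime import datetime, timezone

def build_count_tensor(records):
    """Map (minute, src.ip, dst.ip, dst.port) -> number of matching log entries."""
    counts = Counter()
    for ts, src, dst, port in records:
        # Assumption: bin by truncating the epoch timestamp to the start of its minute.
        minute_start = int(ts // 60) * 60
        minute = datetime.fromtimestamp(minute_start, tz=timezone.utc).strftime('%Y-%m-%d %H:%M')
        counts[(minute, src, dst, port)] += 1
    return counts

# Hypothetical records: (epoch seconds, origin host, destination host, destination port)
records = [
    (1500000005.2, '10.0.0.1', '10.0.0.53', 53),
    (1500000042.9, '10.0.0.1', '10.0.0.53', 53),
    (1500000075.0, '10.0.0.2', '10.0.0.53', 53),
]
tensor_counts = build_count_tensor(records)
```

The first two records fall in the same minute and share every other label, so they collapse into a single tensor entry with value 2. The third record lands in the next minute and forms its own entry.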
Because csv2tensor by default produces a count tensor, a tensor whose entries count the number of times the corresponding labels co-occur in the data, we decompose it with CP-APR, which assumes the tensor entries follow Poisson distributions. We decompose at rank 100, as this is often a useful starting point for analysis of a dataset of this size. We also generate a mapping from tensor components to tensor entries.
decomp = cp_apr(tensor, 100, mem_limit_gb=12, gen_backtrack=True)
tensor.write('cyber_decomposition')
write_cp_decomp_dir('cyber_decomposition', decomp)
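For readers new to CP decompositions, the sketch below shows how a rank-R model approximates a single tensor entry: each component contributes its weight times the product of one score per mode, and the contributions are summed. Under CP-APR, that sum is treated as the mean of a Poisson distribution for the observed count. The weights, factor values, and the `model_value` helper are invented for illustration and are not meant to mirror the layout of ENSIGN's `decomp` object.

```python
import math

def model_value(weights, factors, index):
    """Sum over components of weight * product of the per-mode scores at `index`."""
    total = 0.0
    for r, weight in enumerate(weights):
        total += weight * math.prod(factors[mode][i][r] for mode, i in enumerate(index))
    return total

# Hypothetical rank-2 model of a 3-mode tensor, indexed as factors[mode][label][component]
weights = [10.0, 4.0]
factors = [
    [[0.5, 0.0], [0.5, 1.0]],    # mode 0: two time bins
    [[1.0, 0.0], [0.0, 1.0]],    # mode 1: two origin hosts
    [[1.0, 0.25], [0.0, 0.75]],  # mode 2: two destination ports
]
expected_count = model_value(weights, factors, (1, 1, 1))
```

In this toy model, only the second component has nonzero scores for every label at index (1, 1, 1), so it alone explains the expected count there. This is the sense in which a decomposition separates behaviors into components.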
Using plot_component, we can visualize the high-scoring labels in each mode of a given component. Because the components are sorted by weight and the first component is plotted below, the activity it describes corresponds to a large portion of the log entries.
The component describes a behavior that occurs throughout the day (the timestamp mode is nonzero across almost its entire domain). The behavior consists of machines on the network connecting to 10.106.187.106 on port 53, as indicated by the nonzero-scoring labels in each of the other modes. This component is coherent, as 10.106.187.106 is the network's DNS server. It is also reasonable that this component has a large weight.
plot_component(decomp, 0)
There is a notable 50-minute gap in the time mode of the above component, suggesting that something different was happening between the involved hosts at that time. We can investigate this by using the ENSIGN tool query_decomp to search for other components involving the same hosts.
query_decomp([decomp.factors], [decomp.labels], [[1]], '10.55.89.190')[0].sort_values(by='Score', ascending=False).head()
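Conceptually, a query like this looks up one label's row in a mode's factor matrix and ranks the components by that label's score. The sketch below illustrates the idea on made-up data. The `components_for_label` helper and the list-of-rows factor layout are assumptions for illustration, not the internals of query_decomp.

```python
def components_for_label(labels, factor, label, top=5):
    """Return (component, score) pairs for `label`, highest score first."""
    row = factor[labels.index(label)]  # one score per component
    ranked = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)
    # Keep only components in which the label actually participates.
    return [(comp, score) for comp, score in ranked[:top] if score > 0]

# Hypothetical origin-host mode with three labels and four components
host_labels = ['10.0.0.1', '10.0.0.2', '10.0.0.3']
host_factor = [
    [0.7, 0.0, 0.1, 0.0],
    [0.3, 0.0, 0.9, 0.2],
    [0.0, 1.0, 0.0, 0.8],
]
hits = components_for_label(host_labels, host_factor, '10.0.0.2')
```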
The host we queried also appears predominantly in component 2, plotted below. Its time mode clearly accounts for the time missing in the above DNS component. Based on the last two modes, this component also represents DNS traffic. The subtle difference between the DNS traffic during these two time periods is the higher representation of the top two scoring origin hosts. Indeed, these hosts are involved in inspecting the previous day's network traffic, and this job is kicked off at the time of the peak.
plot_component(decomp, 2)
While the component plotted above is coherent, it does not tell the user anything they don't already know about their network. Using ENSIGN's suite of tools, we can post-process the tensor decomposition components and flag those that have interesting features that may indicate some malicious activity.
In the following example, we run a Recurrent Activity Detection (RAD) module to automate the determination of which components describe periodic behavior. The RAD module returns a list of components, whether or not each has been flagged for periodic behavior, and the coefficient of variation (mean-normalized standard deviation) of the distances between peaks. First, we filter out the components that were not flagged as recurrent. Then, we rank the remaining components by coefficient of variation in ascending order, so that the components with the most evenly spaced peaks come first.
rad = run_rad('cyber_decomposition', 'rad/rad.txt', {"method" : "cluster", "eps" : 5, "nn" : 5})
rad = list(filter(lambda l : l[1], rad))
rad = sorted(rad, key=lambda l:l[2])
rad[:5]
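The ranking statistic itself is simple to compute. The sketch below calculates the coefficient of variation of the gaps between consecutive peaks for two made-up time modes. It covers only that statistic, not RAD's peak detection or clustering. The `peak_gap_cv` helper, the peak positions, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev

def peak_gap_cv(peak_minutes):
    """Coefficient of variation of the gaps between consecutive peaks."""
    gaps = [later - earlier for earlier, later in zip(peak_minutes, peak_minutes[1:])]
    # Assumption: population standard deviation, normalized by the mean gap.
    return pstdev(gaps) / mean(gaps)

# Hypothetical peak positions, in minutes since the start of the day
hourly = [5, 65, 125, 185, 245]     # evenly spaced peaks
irregular = [5, 20, 125, 140, 245]  # alternating short and long gaps

cv_hourly = peak_gap_cv(hourly)
cv_irregular = peak_gap_cv(irregular)
```

Evenly spaced peaks give a coefficient of variation of zero, which is why sorting in ascending order surfaces the most strictly periodic components first.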
In component 13, we see that approximately once an hour, 10.86.87.14 connects to 10.248.255.254 on many ports up to 1500. This indicates that 10.86.87.14 is periodically port scanning 10.248.255.254. This component, which the RAD module flagged due to its periodic time mode, may warrant further investigation.
plot_component(decomp, 13)
Component 70 describes 10.163.28.175 connecting to 10.32.28.53 and 10.248.255.254 on port 22 once an hour. An analyst who knows their network would understand that this component corresponds to the network admin's design of dumping network logs once an hour from the machine that produces them to the two locations where they are stored. Therefore, it probably does not warrant further investigation.
plot_component(decomp, 70)
The investigation with the RAD module turned up an interesting activity: 10.86.87.14 port scanning 10.248.255.254. We now ask whether 10.86.87.14 is involved in any other suspicious or malicious activity. We can use the ENSIGN tool query_decomp to see which other components have a high score for 10.86.87.14 in the origin host mode. Besides component 13, the port scanning component, 10.86.87.14 also turns up in component 58.
query_decomp([decomp.factors], [decomp.labels], [[1]], '10.86.87.14')[0].sort_values(by='Score', ascending=False).head()
Plotting component 58 shows 10.86.87.14 connecting to several machines on the network throughout the day, mainly on port 135. The actor appears to be scanning systems on the network that provide remote services, which explains the spike on destination port 135, the remote procedure call (RPC) port.
plot_component(decomp, 58)
To confirm that most of the destination machines are on the company network, we can use the get_top_k module to list the top labels in the destination host mode for this component. The output below shows the top 50 destination hosts reached by 10.86.87.14, as described by component 58.
get_top_k(decomp.factors, decomp.labels, [58], 50)[58][2]
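As a rough illustration of this check, the sketch below picks the highest-scoring labels for one component from a made-up destination-host factor and measures what share of them fall inside an internal address range. The `top_k_labels` helper, the factor values, and the choice of 10.0.0.0/8 as the company network are all assumptions for illustration. None of this is ENSIGN's get_top_k implementation.

```python
import heapq
import ipaddress

def top_k_labels(labels, factor, component, k):
    """Return the k highest-scoring (label, score) pairs for one component."""
    scored = ((label, row[component]) for label, row in zip(labels, factor))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Hypothetical destination-host mode with four labels and two components
dst_labels = ['10.0.0.10', '10.0.0.11', '10.0.0.12', '8.8.8.8']
dst_factor = [
    [0.40, 0.0],
    [0.35, 0.0],
    [0.20, 0.1],
    [0.05, 0.9],
]
top_hosts = top_k_labels(dst_labels, dst_factor, component=0, k=4)

# Assumption: the company network is 10.0.0.0/8.
internal = ipaddress.ip_network('10.0.0.0/8')
internal_share = sum(ipaddress.ip_address(host) in internal for host, _ in top_hosts) / len(top_hosts)
```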
Because ENSIGN can construct mappings from tensor entries to log entries and from decomposition components to tensor entries, we can compose these mappings to produce the original log entries associated with any given component. To further our investigation into the activity of host 10.86.87.14, we can use ENSIGN's backtracking capabilities to return the original network log entries associated with the tensor components we have analyzed.
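idxs = []
for comp_id in [13, 58]:
    # Component -> tensor entries that the component describes
    for entry in decomp.cpd_backtrack[comp_id]:
        # Tensor entry -> (log, line) pairs; keep the line numbers
        idxs += [line for [log, line] in tensor.spt_backtrack[entry]]
conn.loc[idxs]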
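The composition of the two mappings can be sketched on toy data as follows. The dictionaries stand in for the component-to-entry and entry-to-line mappings, and their contents and the `backtrack_lines` helper are hypothetical.

```python
def backtrack_lines(component_ids, comp_to_entries, entry_to_lines):
    """Compose component -> tensor entry -> log line into sorted, unique line numbers."""
    lines = set()
    for comp_id in component_ids:
        for entry in comp_to_entries[comp_id]:
            lines.update(entry_to_lines[entry])
    return sorted(lines)

# Hypothetical mappings: tensor entry 2 is described by both components
comp_to_entries = {13: [0, 2], 58: [2, 3]}
entry_to_lines = {0: [4, 9], 1: [7], 2: [12], 3: [15, 16]}
lines = backtrack_lines([13, 58], comp_to_entries, entry_to_lines)
```

Collecting the line numbers in a set is a design choice of this sketch. With plain list concatenation, a log line whose tensor entry belongs to both components would appear twice in the result.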
In this notebook, we used tensor decompositions to find threats in cyber network logs. We showed how to use ENSIGN to mine unlabeled, multidimensional data for patterns that cue investigations into possible malicious activity. Notably, our investigation did not require labeled data or the use of signature-based methods. In general, tensor decompositions isolate discrete activities in the data and help to discover the "unknown unknowns." While in this notebook we processed the components with a recurrent activity detector to locate a threat, what we described is a flexible workflow that can incorporate numerous detectors in order to find a number of potential threat indicators. Using similar methods, we have previously uncovered and visualized patterns indicative of:
This workflow can be extended so that the patterns are examined on a daily basis to discover “what has changed” and to help skilled hunt teams make directed, efficient use of big-graph platforms and search tools. ENSIGN’s advanced unsupervised machine learning capability connects key dots that make clear who the relevant actors are.