Network Analysis and Anomaly Detection

In this notebook, we use ENSIGN to build and decompose a Bro/Zeek connections ("conn") tensor in order to gain insight into the activity occurring on a medium-sized business network. The tensor is constructed from a Bro/Zeek log, which is formed by sensors collecting data on connections made on the network. The decomposition separates behaviors into different components that reveal both benign background traffic and potentially malicious activity. Using a detector and ENSIGN's Python tools, we investigate the components to quickly find possible threats. Using ENSIGN's backtracking capabilities, we recover original log entries associated with the malicious activity.

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import numpy as np
import pandas as pd

from ensign.csv2tensor import csv2tensor
from ensign.cp_decomp import cp_apr, read_cp_decomp_dir, write_cp_decomp_dir
from ensign.comp_top_k import get_top_k
from ensign.query_decomp import query_decomp
from ensign.visualize import plot_component
from rad.rad import run_rad

import matplotlib
%matplotlib inline

Data

The data we will investigate in this notebook is a standard Bro/Zeek conn.log containing information about network connections made during one business day.

In [2]:
columns = ['timestamp', 'uid', 'src.ip', 'src.port', 'dst.ip', 'dst.port', 'proto', 'service', 'duration', 'orig_bytes', 'resp_bytes', 'conn_state', 'local_orig', 'local_resp', 'missed_bytes', 'history', 'orig_pkts', 'orig_ip_bytes', 'resp_pkts', 'resp_ip_bytes', 'tunnel_parents', 'ts_mcore', 'last_ts_mcore', 'mcore_orig_pkts', 'mcore_resp_pkts', 'mcore_orig_bytes', 'mcore_resp_bytes', 'mcore_orig_ip_bytes', 'mcore_resp_ip_bytes', 'mcore_hist_pktsize', 'mcore_hist_pkttime', 'orig_cc', 'resp_cc', 'pcr']
conn = pd.read_csv('data/cyber_data.log', sep='\t', skiprows=8, header=None)
conn.columns = columns
conn
Out[2]:
timestamp uid src.ip src.port dst.ip dst.port proto service duration orig_bytes ... mcore_resp_pkts mcore_orig_bytes mcore_resp_bytes mcore_orig_ip_bytes mcore_resp_ip_bytes mcore_hist_pktsize mcore_hist_pkttime orig_cc resp_cc pcr
0 1.520312e+09 CVB0Ci2bigU7WwkdIa 10.55.89.190 60311 10.106.187.106 53 udp dns 0.000570 94 ... 212 4982 17278 7950 23214 - - - - -0.552381
1 1.520312e+09 CCH8XH3aiB7yaVqsfb 10.29.128.152 48656 10.106.187.106 53 udp dns 0.000604 100 ... 4 100 338 156 450 - - - - -0.543379
2 1.520312e+09 Cqlj26rE1jBTRm5Ui 10.29.128.152 56478 10.106.187.106 53 udp dns 0.000610 62 ... 4 110 458 166 570 - - - - -0.610063
3 1.520312e+09 CYZsqO1OfUc6EjQ8Yc 10.29.128.152 33011 10.18.10.43 49403 tcp - 0.153709 608 ... 18 1887 608 2571 1544 - - - - -0.512625
4 1.520312e+09 CVCh7Mur0eT0BiLe3 10.29.128.152 57289 10.18.10.43 28518 tcp - 0.001273 147 ... 5 50 147 266 407 - - - - 0.492386
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3186424 1.520399e+09 CYjK3CoWCsAHkM05j 10.204.231.249 45121 10.106.187.106 53 udp dns 0.000563 72 ... 60 1080 6120 1920 7800 - - - - -0.7
3186425 1.520399e+09 CWxwLc1Ck7W3RrCYRd 10.70.96.59 60117 10.106.187.106 53 udp dns 0.000457 40 ... 12 233 872 401 1208 - - - - -0.646018
3186426 1.520399e+09 CHidjjwDQeDcNxJn 10.246.131.128 58646 10.106.187.106 53 udp dns 0.000334 72 ... 34 1234 2560 2186 3512 - - - - -0.323944
3186427 1.520399e+09 CEVYtfmCpArL8C0l6 10.55.89.190 37423 10.106.187.106 53 udp dns 0.000515 94 ... 420 9870 34230 15750 45990 - - - - -0.552381
3186428 1.520399e+09 CimIdk3MjEjt0zFNqg 10.55.89.190 41830 10.106.187.106 53 udp dns 0.000601 94 ... 392 9306 31948 14850 42924 - - - - -0.552381

3186429 rows × 34 columns
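
The cell above hard-codes the column names and skips a fixed number of header rows. Zeek ASCII logs normally carry their column names in a `#fields` header line, so the names can be read from the file instead. The following is a minimal sketch under that assumption; the helper `read_zeek_log` and the inline sample (which uses the standard Zeek field names rather than the renamed columns above) are hypothetical and not part of ENSIGN.

```python
import io
import pandas as pd

def read_zeek_log(handle):
    """Read a Zeek ASCII log, taking column names from its '#fields' line."""
    names = None
    rows = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            # '#fields' is followed by the tab-separated column names.
            names = line.split("\t")[1:]
        elif line and not line.startswith("#"):
            # Every other non-comment line is one tab-separated record.
            rows.append(line.split("\t"))
    return pd.DataFrame(rows, columns=names)

# Made-up two-line sample standing in for a real conn.log.
sample = io.StringIO(
    "#separator \\x09\n"
    "#fields\tts\tid.orig_h\tid.resp_h\tid.resp_p\n"
    "#types\ttime\taddr\taddr\tport\n"
    "1520312000.0\t10.55.89.190\t10.106.187.106\t53\n"
    "#close\t2018-03-06-00-00-00\n"
)
df = read_zeek_log(sample)
print(df)
```

All values are read as strings in this sketch; numeric columns would still need to be converted before analysis.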

Tensor Construction

We use our ETL tool csv2tensor to read a Bro/Zeek log and construct a tensor from the relevant features. A Bro/Zeek log is a tab-separated file with additional header information; it is the one format we support besides CSV. With very few arguments, you can build a tensor for further analysis. The arguments provided in this call are:

  • filepaths: The location of the CSVs or Bro/Zeek logs with relevant data. This may be a single path, a list of paths, or a path with wildcards. This allows for easy specification and minimal data pre-processing.
  • columns: The names of the columns to correspond to modes in the tensor. In this example, we construct the "conn" tensor, which has modes corresponding to the timestamp of the connection, the origin and destination hosts, and the destination port.
  • types: The datatypes of the modes, so that the tool knows how to validate and discretize the data.
  • binning: The desired discretization scheme for each mode. In this case, we only round the timestamps to the nearest minute.
  • bro_log: Indicates that the tool should read the files as Bro/Zeek logs.
  • gen_backtrack: Generates a mapping from tensor entries to log entries.
In [3]:
tensor = csv2tensor(filepaths='data/cyber_data.log',
                    columns=['timestamp', 'src.ip', 'dst.ip', 'dst.port'],
                    types=['timestamp', 'ip', 'ip', 'int64'],
                    binning=['minute', 'none', 'none', 'none'],
                    bro_log=True,
                    gen_backtrack=True)
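
To make the result of this call more concrete, the sketch below builds the same kind of count structure with pandas alone: timestamps are binned to the minute, and each distinct (minute, origin host, destination host, destination port) combination is counted. This is only an illustration of the idea, not how csv2tensor is implemented. The toy rows are made up, and flooring to the minute is an assumption, since the exact binning rule ENSIGN applies is not shown here.

```python
import pandas as pd

# Toy stand-in for the conn log; all values are made up for illustration.
toy = pd.DataFrame({
    "timestamp": [1520312005.2, 1520312031.9, 1520312075.0, 1520312010.0],
    "src.ip": ["10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2"],
    "dst.ip": ["10.0.0.9", "10.0.0.9", "10.0.0.9", "10.0.0.9"],
    "dst.port": [53, 53, 53, 22],
})

# Bin epoch seconds to the minute (assumption: floor, not round).
toy["minute"] = pd.to_datetime(toy["timestamp"], unit="s").dt.floor("min")

# Each distinct index combination becomes one nonzero tensor entry,
# whose value is the number of log lines that share that combination.
counts = (toy.groupby(["minute", "src.ip", "dst.ip", "dst.port"])
             .size()
             .rename("count")
             .reset_index())
print(counts)
```

The first two toy rows fall in the same minute and share the other three labels, so they collapse into a single entry with count 2.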

Tensor Decomposition

By default, csv2tensor produces a count tensor: each entry counts the number of times the corresponding labels co-occur in the data. We therefore decompose it with CP-APR, which assumes the tensor entries follow Poisson distributions. We decompose at rank 100, which is often a useful starting point for a dataset of this size. We also generate a mapping from tensor components to tensor entries.

In [4]:
decomp = cp_apr(tensor, 100, mem_limit_gb=12, gen_backtrack=True)
In [5]:
tensor.write('cyber_decomposition')
write_cp_decomp_dir('cyber_decomposition', decomp)
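
For readers new to CP decompositions, the following NumPy sketch shows the model behind the call above on a toy scale: a tensor is approximated by a sum of weighted rank-one components, one factor vector per mode, and the observed counts are treated as Poisson draws around that model. The sizes, weights, and column normalization are illustrative choices, not ENSIGN internals.

```python
import numpy as np

rng = np.random.default_rng(0)
shape, rank = (4, 3, 3, 2), 2  # toy sizes: time, origin, destination, port

# One weight per component and one nonnegative factor matrix per mode.
# Columns are normalized to sum to 1 (an illustrative choice), so the
# weights carry the overall magnitude of each component.
weights = np.array([30.0, 10.0])
factors = []
for n in shape:
    f = rng.random((n, rank))
    factors.append(f / f.sum(axis=0))

# Model tensor: sum over components r of
# weights[r] * outer(time[:, r], origin[:, r], destination[:, r], port[:, r]).
model = np.einsum("r,ir,jr,kr,lr->ijkl", weights, *factors)

# Under the Poisson assumption, each observed count is drawn around the model value.
observed = rng.poisson(model)
print(model.shape, observed.sum())
```

With this normalization, the entries of the model tensor sum to the total weight, which is why a heavily weighted component corresponds to a large share of the counts.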

Visualizing Components

Using plot_component, we can visualize the high-scoring labels in each mode of a given component. Components are sorted by weight, so the first component, plotted below, describes activity that accounts for a large portion of the log entries.

The component describes a behavior that occurs throughout the day (the timestamp mode is nonzero throughout its domain). The behavior consists of machines on the network connecting to 10.106.187.106 on port 53, as indicated by the nonzero scoring labels in each of the other modes. This component is coherent, as 10.106.187.106 is the network's DNS server. It is also reasonable that this component has a large weight.

In [6]:
plot_component(decomp, 0)
Out[6]:

There is a notable 50-minute gap in the time mode of the above component, suggesting that something different was happening between the involved hosts at that time. We can investigate this by using the ENSIGN tool query_decomp to search for other components involving the same hosts.

In [7]:
query_decomp([decomp.factors], [decomp.labels], [[1]], '10.55.89.190')[0].sort_values(by='Score', ascending=False).head()
Out[7]:
Score Mode Component
2 0.715004 1 2
0 0.638488 1 0
17 0.074074 1 99
7 0.070562 1 8
14 0.059359 1 47
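
Conceptually, a query like this looks up the row of a mode's factor matrix that belongs to the queried label and reports the components in which that row is nonzero. The sketch below illustrates the idea with a made-up factor matrix; the function `query_mode` is hypothetical and is not the implementation of query_decomp.

```python
import numpy as np
import pandas as pd

# Hypothetical labels and factor matrix for one mode
# (rows: labels, columns: components).
labels = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
factor = np.array([
    [0.70, 0.00, 0.05],
    [0.30, 0.10, 0.00],
    [0.00, 0.90, 0.95],
])

def query_mode(factor, labels, mode, label):
    """List the components in which `label` has a nonzero score, best first."""
    row = factor[labels.index(label)]
    hits = np.nonzero(row)[0]
    return (pd.DataFrame({"Score": row[hits], "Mode": mode, "Component": hits})
              .sort_values(by="Score", ascending=False)
              .reset_index(drop=True))

result = query_mode(factor, labels, 1, "10.0.0.1")
print(result)
```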

The host we queried also appears predominantly in component 2, plotted below. Its time mode clearly accounts for the time missing from the DNS component above. Based on the last two modes, this component also represents DNS traffic. The subtle difference between the DNS traffic in these two time periods is the higher representation of the top two scoring origin hosts. Indeed, these hosts are involved in inspecting the previous day's network traffic, and this job is kicked off at the time of the peak.

In [8]:
plot_component(decomp, 2)
Out[8]:

Searching for Anomalies

While the component plotted above is coherent, it does not tell the user anything they don't already know about their network. Using ENSIGN's suite of tools, we can post-process the tensor decomposition components and flag those that have interesting features that may indicate some malicious activity.

In the following example, we run a Recurrent Activity Detection (RAD) module to automate the determination of which components describe periodic behavior. The RAD module returns a list of components, whether or not they have been flagged for periodic behavior, and the coefficient of variation (mean-normalized standard deviation) of the distances between peaks. We first filter out the components that were not flagged as recurrent, then sort the remaining components by coefficient of variation in ascending order, so that the most regular periodic behavior comes first.

In [9]:
rad = run_rad('cyber_decomposition', 'rad/rad.txt', {"method" : "cluster", "eps" : 5, "nn" : 5})
rad = list(filter(lambda l : l[1], rad))
rad = sorted(rad, key=lambda l:l[2])
rad[:5]
Out[9]:
[[13, True, 0.0],
 [16, True, 0.19001038821453575],
 [70, True, 0.2049174915029987],
 [45, True, 0.24362606232294617],
 [51, True, 0.25678650249761253]]
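
The ranking statistic can be illustrated in a few lines of NumPy. The sketch below computes the coefficient of variation of the gaps between peak times; how the RAD module locates peaks, and whether it uses the population or the sample standard deviation, is not shown in this notebook, so the function `peak_gap_cv` and the peak times are assumptions for illustration only.

```python
import numpy as np

def peak_gap_cv(peak_minutes):
    """Coefficient of variation (std / mean) of the gaps between peaks."""
    gaps = np.diff(np.sort(np.asarray(peak_minutes, dtype=float)))
    return gaps.std() / gaps.mean()

regular = [5, 65, 125, 185, 245]    # a peak exactly every 60 minutes
irregular = [5, 20, 125, 140, 245]  # gaps alternate between 15 and 105 minutes

print(peak_gap_cv(regular))    # 0.0
print(peak_gap_cv(irregular))  # 0.75
```

A value of 0.0, as reported for component 13, corresponds to peaks that are perfectly evenly spaced.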

Port Scanning

In component 13, we see that approximately once an hour, 10.86.87.14 connects to 10.248.255.254 on many ports up to 1500. This indicates that 10.86.87.14 is periodically port scanning 10.248.255.254. This component, which the RAD module flagged due to its periodic time mode, may warrant further investigation.
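
One way to corroborate a finding like this directly in the raw log is to count how many distinct destination ports each origin and destination pair uses. The sketch below does this on made-up rows; the column names follow the conn DataFrame above, and the threshold is an arbitrary illustrative value rather than a recommended setting.

```python
import pandas as pd

# Made-up rows: one host touching many ports, another using only DNS.
toy = pd.DataFrame({
    "src.ip": ["10.0.0.5"] * 6 + ["10.0.0.7"] * 3,
    "dst.ip": ["10.0.0.9"] * 9,
    "dst.port": [1, 2, 3, 4, 5, 6, 53, 53, 53],
})

# Number of distinct destination ports per (origin, destination) pair.
ports_per_pair = (toy.groupby(["src.ip", "dst.ip"])["dst.port"]
                     .nunique()
                     .sort_values(ascending=False))

threshold = 5  # arbitrary cutoff for this toy example
suspects = ports_per_pair[ports_per_pair > threshold]
print(suspects)
```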

In [10]:
plot_component(decomp, 13)
Out[10]:

Saving Network Logs

Component 70 describes 10.163.28.175 connecting to 10.32.28.53 and 10.248.255.254 on port 22 once an hour. An analyst who knows their network would understand that this component corresponds to the network admin's design of dumping network logs once an hour from the machine that produces them to the two locations where they are stored. Therefore, it probably does not warrant further investigation.

In [11]:
plot_component(decomp, 70)
Out[11]:

Network Mapping

The investigation with the RAD module turned up an interesting activity: 10.86.87.14 port scanning 10.248.255.254. We now ask whether 10.86.87.14 is involved in any other suspicious or malicious activity. We can use the ENSIGN tool query_decomp to see which other components have a high score for 10.86.87.14 in the origin host mode. Besides component 13, the port scanning component, 10.86.87.14 scores highly in several other components, including component 58, which we examine next.

In [12]:
query_decomp([decomp.factors], [decomp.labels], [[1]], '10.86.87.14')[0].sort_values(by='Score', ascending=False).head()
Out[12]:
Score Mode Component
11 1.000000 1 13
27 0.993031 1 83
12 0.987844 1 14
24 0.952745 1 58
17 0.562519 1 26

Plotting component 58 shows 10.86.87.14 connecting to several machines on the network throughout the day, mainly on port 135. The actor appears to be scanning the network for systems that provide remote services, which explains the spike at destination port 135, used for remote procedure call (RPC).

In [13]:
plot_component(decomp, 58)
Out[13]:

To confirm that most of the destination machines are on the company network, we can use the get_top_k module to list the top labels in the destination host mode for this component. The output below lists the top 50 destination hosts reached by 10.86.87.14, as described by component 58.

In [14]:
get_top_k(decomp.factors, decomp.labels, [58], 50)[58][2]
Out[14]:
[('10.174.10.63', 453, 0.045896565645214306),
 ('10.242.45.114', 461, 0.04560306847410038),
 ('10.253.190.117', 43, 0.03298211843619398),
 ('10.80.20.25', 213, 0.032979568128297734),
 ('10.242.17.57', 37, 0.032979568128297734),
 ('10.137.77.219', 358, 0.03297956812829772),
 ('10.141.246.158', 403, 0.03297956812829769),
 ('10.157.227.28', 42, 0.03270486376123928),
 ('10.180.241.171', 561, 0.03270473839389522),
 ('10.70.218.94', 3, 0.03270473839389522),
 ('10.152.253.48', 323, 0.03270473839389521),
 ('10.126.29.111', 283, 0.03242990865949273),
 ('10.163.28.175', 129, 0.031880249190687764),
 ('10.100.182.180', 828, 0.031055759987480364),
 ('10.3.7.256', 63, 0.030506100518675385),
 ('10.78.83.240', 133, 0.007695232563269464),
 ('10.162.244.70', 208, 0.007695232563269464),
 ('10.76.70.218', 132, 0.007420402828866987),
 ('10.81.130.99', 36, 0.007145573094464501),
 ('10.19.169.151', 211, 0.0071455730944645),
 ('10.73.202.226', 29, 0.006870743360062024),
 ('10.136.149.70', 535, 0.006870743360062024),
 ('10.24.210.95', 517, 0.006870743360062022),
 ('10.91.30.80', 31, 0.006870743360062022),
 ('10.170.66.230', 0, 0.006870743360062022),
 ('10.248.255.254', 13, 0.006618827356957439),
 ('10.119.197.249', 16, 0.006595913625659546),
 ('10.179.121.72', 24, 0.006595913625659545),
 ('10.198.229.162', 39, 0.006595913625659545),
 ('10.90.83.70', 94, 0.006595913625659545),
 ('10.253.53.93', 531, 0.006595913625659545),
 ('10.190.22.140', 528, 0.006595913625659545),
 ('10.169.247.26', 526, 0.006595913625659545),
 ('10.13.160.182', 524, 0.006595913625659545),
 ('10.128.156.35', 523, 0.006595913625659545),
 ('10.117.10.98', 522, 0.006595913625659545),
 ('10.114.101.206', 521, 0.006595913625659545),
 ('10.166.38.179', 525, 0.006595913625659545),
 ('10.75.181.12', 284, 0.006595913625659541),
 ('10.229.232.57', 376, 0.006595913625659541),
 ('10.119.63.148', 17, 0.006595913625659541),
 ('10.93.29.238', 472, 0.006595913625659541),
 ('10.156.124.97', 467, 0.006595913625659541),
 ('10.117.56.84', 470, 0.006595913625659541),
 ('10.209.57.198', 130, 0.006595913625659541),
 ('10.182.242.216', 210, 0.00659591362565954),
 ('10.57.147.197', 28, 0.00659591362565954),
 ('10.164.124.172', 34, 0.00659591362565954),
 ('10.92.238.125', 518, 0.00659591362565954),
 ('10.167.89.236', 23, 0.00659591362565954)]
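
The membership check itself can be automated with Python's standard ipaddress module. The sketch below assumes, purely for illustration, that the company network is the private 10.0.0.0/8 block; the notebook does not state the actual address range. The host list contains three labels copied from the output above and one made-up outside address.

```python
import ipaddress

# Assumption: the company network is the private 10.0.0.0/8 block.
company_net = ipaddress.ip_network("10.0.0.0/8")

def classify(hosts, network):
    """Split host labels into inside, outside, and unparseable addresses."""
    inside, outside, invalid = [], [], []
    for host in hosts:
        try:
            addr = ipaddress.ip_address(host)
        except ValueError:
            # Labels that are not valid IP addresses are reported separately.
            invalid.append(host)
            continue
        (inside if addr in network else outside).append(host)
    return inside, outside, invalid

# Three labels from the output above plus one made-up outside address.
hosts = ["10.174.10.63", "10.242.45.114", "10.3.7.256", "192.0.2.10"]
inside, outside, invalid = classify(hosts, company_net)
print(inside, outside, invalid)
```

Each tuple returned by get_top_k above starts with the label, so the labels could be extracted with a list comprehension before being passed to such a check. Note that a label such as 10.3.7.256 is not a valid IPv4 address and ends up in the invalid list.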

Backtracking to Log Entries

Because ENSIGN can construct mappings from tensor entries to log entries and from decomposition components to tensor entries, we can compose these mappings to recover the original log entries associated with any given component. To further our investigation into host 10.86.87.14, we use ENSIGN's backtracking capability to return the original log entries behind the components we have analyzed.

In [15]:
idxs = []
for comp_id in [13, 58]:
    for entry in decomp.cpd_backtrack[comp_id]:
        idxs += [line for [log, line] in tensor.spt_backtrack[entry]]
conn.loc[idxs]
Out[15]:
timestamp uid src.ip src.port dst.ip dst.port proto service duration orig_bytes ... mcore_resp_pkts mcore_orig_bytes mcore_resp_bytes mcore_orig_ip_bytes mcore_resp_ip_bytes mcore_hist_pktsize mcore_hist_pkttime orig_cc resp_cc pcr
63748 1.520314e+09 C6k0fK1F83yG1Bie6k 10.86.87.14 35410 10.248.255.254 1 tcp - 0.000214 0 ... 1 0 0 60 40 - - - - -
63808 1.520314e+09 C2fXRtdmwasOrH0O3 10.86.87.14 40248 10.248.255.254 2 tcp - 0.000175 0 ... 1 0 0 60 40 - - - - -
63867 1.520314e+09 CE4ctLx8t9Hc9T7Z5 10.86.87.14 53996 10.248.255.254 3 tcp - 0.000193 0 ... 1 0 0 60 40 - - - - -
63893 1.520314e+09 CYSyj713qv4WN6iI67 10.86.87.14 45135 10.248.255.254 4 tcp - 0.000238 0 ... 1 0 0 60 40 - - - - -
63936 1.520314e+09 CkIluE49Kj0fsda826 10.86.87.14 60628 10.248.255.254 5 tcp - 0.000209 0 ... 1 0 0 60 40 - - - - -
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3182703 1.520399e+09 Cb13yJ2zlPh4i9FzSd 10.86.87.14 47824 10.180.241.171 135 tcp - - - ... 0 0 0 60 0 - - - - -
3182926 1.520399e+09 CXyGc93JY2itoNyV3k 10.86.87.14 47824 10.180.241.171 135 tcp - - - ... 0 0 0 60 0 - - - - -
3181501 1.520399e+09 CbDL2p2B7uT1p14v8b 10.86.87.14 60840 10.32.28.53 135 tcp - 0.000123 0 ... 1 0 0 60 40 - - - - -
3183450 1.520399e+09 CJ1dDN3YYqVtAJBMX4 10.86.87.14 47824 10.180.241.171 135 tcp - - - ... 0 0 0 60 0 - - - - -
3184384 1.520399e+09 Cw7uio2D6YPc24ukTl 10.86.87.14 47824 10.180.241.171 135 tcp - - - ... 0 0 0 60 0 - - - - -

29078 rows × 34 columns
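
The composition of the two mappings can be sketched with plain dictionaries. The shapes below are inferred from the cell above (component to tensor entries, tensor entry to pairs of log and line number) and the contents are made up. Unlike the cell above, this sketch also removes duplicates, on the assumption that a tensor entry may be associated with more than one component.

```python
# Hypothetical stand-ins for decomp.cpd_backtrack and tensor.spt_backtrack.
cpd_backtrack = {13: [0, 1], 58: [1, 2]}  # component -> tensor entry ids
spt_backtrack = {                          # tensor entry id -> (log, line) pairs
    0: [("conn.log", 10)],
    1: [("conn.log", 11), ("conn.log", 12)],
    2: [("conn.log", 40)],
}

def backtrack(components, cpd_map, spt_map):
    """Compose both mappings and return unique (log, line) pairs in first-seen order."""
    seen = set()
    pairs = []
    for comp in components:
        for entry in cpd_map[comp]:
            for pair in spt_map[entry]:
                if pair not in seen:
                    seen.add(pair)
                    pairs.append(pair)
    return pairs

pairs = backtrack([13, 58], cpd_backtrack, spt_backtrack)
lines = [line for _log, line in pairs]
print(lines)  # [10, 11, 12, 40]
```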

Key Takeaways

In this notebook, we used tensor decompositions to find threats in cyber network logs. We showed how to use ENSIGN to mine unlabeled, multidimensional data for patterns that cue investigations into possible malicious activity. Notably, our investigation did not require labeled data or the use of signature-based methods. In general, tensor decompositions isolate discrete activities in the data and help to discover the "unknown unknowns." While in this notebook we processed the components with a recurrent activity detector to locate a threat, what we described is a flexible workflow that can incorporate numerous detectors in order to find a number of potential threat indicators. Using similar methods, we have previously uncovered and visualized patterns indicative of:

  • Distributed port scans evolving to machine takeover
  • Distributed denial of service attacks
  • DNS-based data exfiltration/insider threat
  • SSH password guessing (apart from scanning)
  • Network policy violations
  • Exploitation of application-specific port vulnerabilities
  • Patterns of traffic indicative of scans for printers or IoT devices
  • Broken or misconfigured network services
  • Selective, persistent use of cryptographic methods in point-to-point communication

This workflow can be extended so that the patterns are examined daily to discover “what has changed” and to help skilled hunt teams make directed, efficient use of big-graph platforms and search tools. ENSIGN’s advanced unsupervised machine learning capability connects key dots that make clear who the relevant actors are.