Network Analysis and Anomaly Detection

In this notebook, we use ENSIGN to build and decompose a Bro/Zeek connections ("conn") tensor in order to gain insight into the activity occurring on a medium-sized business network. The tensor is constructed from a Bro/Zeek log, which is formed by sensors collecting data on connections made on the network. The decomposition separates behaviors into different components that reveal both benign background traffic and potentially malicious activity. Using a detector and ENSIGN's Python tools, we investigate the components to quickly find possible threats. Using ENSIGN's backtracking capabilities, we recover original log entries associated with the malicious activity.

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import numpy as np
import pandas as pd

from ensign.csv2tensor import csv2tensor
from ensign.cp_decomp import cp_apr, read_cp_decomp_dir, write_cp_decomp_dir
from ensign.comp_top_k import get_top_k
from ensign.query_decomp import query_decomp
from ensign.visualize import plot_component
from rad.rad import run_rad

import matplotlib
%matplotlib inline

Data

The data we will investigate in this notebook is a standard Bro/Zeek conn.log containing information about network connections made during one business day.

columns = ['timestamp', 'uid', 'src.ip', 'src.port', 'dst.ip', 'dst.port', 'proto', 'service', 'duration', 'orig_bytes', 'resp_bytes', 'conn_state', 'local_orig', 'local_resp', 'missed_bytes', 'history', 'orig_pkts', 'orig_ip_bytes', 'resp_pkts', 'resp_ip_bytes', 'tunnel_parents', 'ts_mcore', 'last_ts_mcore', 'mcore_orig_pkts', 'mcore_resp_pkts', 'mcore_orig_bytes', 'mcore_resp_bytes', 'mcore_orig_ip_bytes', 'mcore_resp_ip_bytes', 'mcore_hist_pktsize', 'mcore_hist_pkttime', 'orig_cc', 'resp_cc', 'pcr']
conn = pd.read_csv('data/cyber_data.log', sep='\t', skiprows=8, header=None)
conn.columns = columns
conn
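
The cell above hardcodes the column names and skips a fixed number of header rows. Logs in the usual Bro/Zeek layout carry their own column names on a header line that starts with #fields, so the names can also be read from the file. The sketch below shows that idea on a small in-memory sample. The helper name zeek_field_names and the sample content are hypothetical, and the notebook's log uses names such as src.ip that differ from stock Zeek field names, so this may not apply to it unchanged.

```python
import io

def zeek_field_names(lines):
    """Return the column names listed on the '#fields' header line, or None."""
    for line in lines:
        if not line.startswith('#'):
            break  # the header is over once the first data row appears
        if line.startswith('#fields'):
            # The header line has the form '#fields<TAB>name1<TAB>name2...'
            return line.rstrip('\n').split('\t')[1:]
    return None

# Toy log text in the usual Zeek layout (hypothetical content, not the notebook's data)
sample = io.StringIO(
    "#separator \\x09\n"
    "#path\tconn\n"
    "#fields\tts\tuid\tid.orig_h\tid.resp_p\n"
    "#types\ttime\tstring\taddr\tport\n"
    "1520312000.0\tCabc\t10.0.0.1\t53\n"
)
fields = zeek_field_names(sample)
print(fields)
```

The returned list could then be assigned to conn.columns in place of a handwritten list, provided the file's header matches the data rows.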

Tensor Construction

We use our ETL tool csv2tensor to read a Bro/Zeek log and construct a tensor from the relevant features. A Bro/Zeek log is a tab-separated file with additional header information, and it is the other format we support besides CSV files. With very few arguments, you can build a tensor for further analysis. The arguments provided in this call are:

filepaths: The location of the CSVs or Bro/Zeek logs with relevant data. This may be a single path, a list of paths, or a path with wildcards. This allows for easy specification and minimal data pre-processing.
columns: The names of the columns that become modes of the tensor. In this example, we construct the "conn" tensor, which has modes corresponding to the timestamp of the connection, the origin and destination hosts, and the destination port.
types: The datatypes of the modes, so that the tool knows how to validate and discretize the data.
binning: The desired discretization scheme for each mode. In this case, we round only the timestamps, to the nearest minute.
bro_log: Indicates that the tool should read the files as Bro/Zeek logs.
gen_backtrack: Generates a mapping from tensor entries to log entries.

tensor = csv2tensor(filepaths='data/cyber_data.log',
                    columns=['timestamp', 'src.ip', 'dst.ip', 'dst.port'],
                    types=['timestamp', 'ip', 'ip', 'int64'],
                    binning=['minute', 'none', 'none', 'none'],
                    bro_log=True,
                    gen_backtrack=True)
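
As a rough picture of what a count tensor holds, the sketch below bins a few hypothetical log rows to the minute and counts how often each (minute, origin host, destination host, destination port) combination occurs. It does not call csv2tensor. The rows are invented, and the binning rule (truncating each timestamp to the start of its minute) is an assumption; ENSIGN's exact rule may differ.

```python
from collections import Counter

def bin_to_minute(ts):
    """Truncate a Unix timestamp in seconds to the start of its minute (assumed binning rule)."""
    return int(ts // 60) * 60

# Hypothetical rows: (timestamp, src.ip, dst.ip, dst.port)
rows = [
    (1520312005.2, '10.0.0.1', '10.0.0.9', 53),
    (1520312030.7, '10.0.0.1', '10.0.0.9', 53),
    (1520312065.0, '10.0.0.1', '10.0.0.9', 53),
    (1520312010.1, '10.0.0.2', '10.0.0.9', 22),
]

# Each distinct (minute, src, dst, port) tuple is one tensor cell; its value is the count
counts = Counter((bin_to_minute(ts), src, dst, port) for ts, src, dst, port in rows)

for key, value in sorted(counts.items()):
    print(key, value)
```

Cells that never occur are simply absent from the counter, which mirrors how a sparse tensor stores only its nonzero entries.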

Tensor Decomposition

By default, csv2tensor produces a count tensor: each entry counts the number of times the corresponding labels co-occur in the data. We therefore decompose it with CP-APR, which assumes the tensor entries are distributed according to Poisson distributions. We decompose at rank 100, as this is often a useful starting point for analysis of a dataset of this size. We also generate a mapping from tensor components to tensor entries.

decomp = cp_apr(tensor, 100, mem_limit_gb=12, gen_backtrack=True)

tensor.write('cyber_decomposition')
write_cp_decomp_dir('cyber_decomposition', decomp)
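
To make the Poisson assumption concrete: a CP model approximates each tensor entry as a sum, over components, of the component weight times the product of that component's per-mode scores, and CP-APR fits the model by maximizing a Poisson likelihood of the observed counts. The sketch below evaluates both quantities on a toy rank-2 model of a 2 x 2 x 2 tensor. Every number and name in it is hypothetical; it illustrates the model only and is not ENSIGN's implementation.

```python
import math

# Toy rank-2 model of a 2 x 2 x 2 tensor (all numbers hypothetical)
weights = [4.0, 2.0]                      # one weight per component
A = [[0.5, 1.0], [0.5, 0.0]]              # mode-0 factor: A[i][r]
B = [[1.0, 0.5], [0.0, 0.5]]              # mode-1 factor: B[j][r]
C = [[0.5, 0.0], [0.5, 1.0]]              # mode-2 factor: C[k][r]

def model_value(i, j, k):
    """Expected count at cell (i, j, k): sum over components of weight * product of scores."""
    return sum(w * A[i][r] * B[j][r] * C[k][r] for r, w in enumerate(weights))

def poisson_log_likelihood(observed):
    """Log-likelihood of observed counts {(i, j, k): count} under the model, over all cells."""
    total = 0.0
    for i in range(2):
        for j in range(2):
            for k in range(2):
                m = model_value(i, j, k)
                x = observed.get((i, j, k), 0)
                if m > 0:
                    total += x * math.log(m) - m - math.lgamma(x + 1)
                elif x > 0:
                    return float('-inf')  # the model predicts zero but a count was observed
    return total

print(model_value(0, 0, 0))
print(model_value(0, 0, 1))
print(poisson_log_likelihood({(0, 0, 0): 1, (0, 0, 1): 2}))
```

Because each factor column here sums to one, the component weights add up to the total count the model expects, which is one reason a high-weight component can be read as covering a large share of the log.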

Visualizing Components

Using plot_component, we can visualize the high-scoring labels in each mode of a given component. Because the components are sorted by weight, the first component, plotted below, describes activity that corresponds to a large portion of the log entries.

The component describes a behavior that occurs throughout the day (the timestamp mode is nonzero throughout its domain). The behavior consists of machines on the network connecting to 10.106.187.106 on port 53. This is indicated by the nonzero-scoring labels in each of the other modes. This component is coherent, as 10.106.187.106 is the network's DNS server. It is also reasonable that this component has a large weight.

plot_component(decomp, 0)

There is a notable 50-minute gap in the time mode of the above component, suggesting that something different was happening between the involved hosts at that time. We can investigate this by using the ENSIGN tool query_decomp to search for other components involving the same hosts.

query_decomp([decomp.factors], [decomp.labels], [[1]], '10.55.89.190')[0].sort_values(by='Score', ascending=False).head()
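
Conceptually, a query like the one above looks up one label's row in a mode's factor matrix and ranks the components by that label's score. The sketch below does this on a small hypothetical factor matrix. The function name components_for_label and the data are invented for illustration, and this is not how query_decomp is implemented.

```python
def components_for_label(factor, labels, label):
    """Return (component, score) pairs for one label in one mode, highest score first.

    factor[i][r] is the score of the i-th label in component r.
    """
    row = factor[labels.index(label)]
    ranked = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)
    # Components where the label scores zero do not involve it at all
    return [(r, score) for r, score in ranked if score > 0]

# Hypothetical origin-host mode with three labels and four components
labels = ['10.0.0.1', '10.0.0.2', '10.0.0.3']
factor = [
    [0.6, 0.0, 0.7, 0.0],
    [0.4, 0.0, 0.1, 1.0],
    [0.0, 1.0, 0.2, 0.0],
]
print(components_for_label(factor, labels, '10.0.0.1'))
```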

The host we queried also appears predominantly in component 2, plotted below. Its time mode clearly accounts for the time missing in the above DNS component. Based on the last two modes, this component also represents DNS traffic. The subtle difference between the DNS traffic during these two time periods is the higher representation by the top two scoring origin hosts. Indeed, these hosts are involved in inspecting the previous day's network traffic, and this job is kicked off at the time of the peak.

plot_component(decomp, 2)

Searching for Anomalies

While the component plotted above is coherent, it does not tell the user anything they don't already know about their network. Using ENSIGN's suite of tools, we can post-process the tensor decomposition components and flag those that have interesting features that may indicate some malicious activity.

In the following example, we run a Recurrent Activity Detection (RAD) module to automate the determination of which components describe periodic behavior. The RAD module returns a list of components, whether or not each has been flagged for periodic behavior, and the coefficient of variation (mean-normalized standard deviation) of the distances between peaks. We first filter out the components that were not flagged as recurrent, then rank the remaining components by coefficient of variation, lowest first.

rad = run_rad('cyber_decomposition', 'rad/rad.txt', {"method" : "cluster", "eps" : 5, "nn" : 5})
rad = list(filter(lambda l : l[1], rad))
rad = sorted(rad, key=lambda l:l[2])
rad[:5]

[[13, True, 0.0],
 [16, True, 0.19001038821453575],
 [70, True, 0.2049174915029987],
 [45, True, 0.24362606232294617],
 [51, True, 0.25678650249761253]]
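
The third value in each row above is the coefficient of variation of the distances between peaks. The sketch below computes that statistic for two hypothetical lists of peak times. It assumes the peaks have already been located and uses the population standard deviation; how RAD finds peaks and which variant of the standard deviation it uses are not stated here, so treat both as assumptions.

```python
import statistics

def peak_interval_cv(peak_times):
    """Coefficient of variation (std / mean) of the gaps between consecutive peaks."""
    gaps = [b - a for a, b in zip(peak_times, peak_times[1:])]
    if len(gaps) < 2:
        return None  # too few peaks to judge regularity
    # Population standard deviation is an assumption; RAD may use the sample version
    return statistics.pstdev(gaps) / statistics.mean(gaps)

hourly = [0, 60, 120, 180, 240]        # minutes; perfectly regular peaks
jittery = [0, 50, 120, 170, 240]       # gaps alternate between 50 and 70 minutes
print(peak_interval_cv(hourly))
print(peak_interval_cv(jittery))
```

Perfectly regular peaks give a value of zero, and the value grows as the spacing becomes less regular, which is why sorting in ascending order puts the most strictly periodic components first.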

Port Scanning

In component 13, we see that approximately once an hour, 10.86.87.14 connects to 10.248.255.254 on many ports up to 1500. This indicates that 10.86.87.14 is periodically port scanning 10.248.255.254. This component, which the RAD module flagged due to its periodic time mode, may warrant further investigation.

plot_component(decomp, 13)
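
One simple way to cross-check a suspected port scan against raw connection records is to count the distinct destination ports each origin host touches on each destination host. The sketch below does this on invented traffic. The function names, the addresses, and the threshold of 100 ports are all hypothetical choices for illustration and are not part of ENSIGN or RAD.

```python
from collections import defaultdict

def distinct_ports_per_pair(connections):
    """Map each (src, dst) pair to the set of destination ports it touched."""
    ports = defaultdict(set)
    for src, dst, port in connections:
        ports[(src, dst)].add(port)
    return ports

def likely_scanners(connections, min_ports=100):
    """Pairs whose distinct-port count reaches min_ports (the threshold is a hypothetical choice)."""
    ports = distinct_ports_per_pair(connections)
    return {pair: len(p) for pair, p in ports.items() if len(p) >= min_ports}

# Hypothetical traffic: one host sweeps ports 1..1500, another only uses DNS
connections = [('10.0.0.5', '10.0.0.9', port) for port in range(1, 1501)]
connections += [('10.0.0.1', '10.0.0.9', 53)] * 200
print(likely_scanners(connections))
```

Note that the DNS client makes many connections but to a single port, so it is not reported; it is the breadth of ports, not the volume of traffic, that the check looks at.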

Saving Network Logs

Component 70 describes 10.163.28.175 connecting to 10.32.28.53 and 10.248.255.254 on port 22 once an hour. An analyst who knows their network would understand that this component corresponds to the network admin's design of dumping network logs once an hour from the machine that produces them to the two locations where they are stored. Therefore, it probably does not warrant further investigation.

plot_component(decomp, 70)

Network Mapping

The investigation with the RAD module turned up an interesting activity: 10.86.87.14 port scanning 10.248.255.254. We now ask whether 10.86.87.14 is involved in any other suspicious or malicious activity. We can use the ENSIGN tool query_decomp in order to see which other components have a high score for 10.86.87.14 in the origin host mode. Besides component 13, the port scanning component, 10.86.87.14 also turns up in component 58.

query_decomp([decomp.factors], [decomp.labels], [[1]], '10.86.87.14')[0].sort_values(by='Score', ascending=False).head()

Plotting component 58 shows 10.86.87.14 connecting to several machines on the network throughout the day, mainly on port 135. The actor is scanning systems on the network that provide remote services, which explains the spike on destination port 135, the port used for remote procedure call (RPC).

plot_component(decomp, 58)

To confirm that most of the destination machines are on the company network, we can use the get_top_k module to list the top labels in the destination host mode for this component. The output below shows the top 50 destination hosts reached by 10.86.87.14, as described by component 58.

get_top_k(decomp.factors, decomp.labels, [58], 50)[58][2]

[('10.174.10.63', 453, 0.045896565645214306),
 ('10.242.45.114', 461, 0.04560306847410038),
 ('10.253.190.117', 43, 0.03298211843619398),
 ('10.80.20.25', 213, 0.032979568128297734),
 ('10.242.17.57', 37, 0.032979568128297734),
 ('10.137.77.219', 358, 0.03297956812829772),
 ('10.141.246.158', 403, 0.03297956812829769),
 ('10.157.227.28', 42, 0.03270486376123928),
 ('10.180.241.171', 561, 0.03270473839389522),
 ('10.70.218.94', 3, 0.03270473839389522),
 ('10.152.253.48', 323, 0.03270473839389521),
 ('10.126.29.111', 283, 0.03242990865949273),
 ('10.163.28.175', 129, 0.031880249190687764),
 ('10.100.182.180', 828, 0.031055759987480364),
 ('10.3.7.256', 63, 0.030506100518675385),
 ('10.78.83.240', 133, 0.007695232563269464),
 ('10.162.244.70', 208, 0.007695232563269464),
 ('10.76.70.218', 132, 0.007420402828866987),
 ('10.81.130.99', 36, 0.007145573094464501),
 ('10.19.169.151', 211, 0.0071455730944645),
 ('10.73.202.226', 29, 0.006870743360062024),
 ('10.136.149.70', 535, 0.006870743360062024),
 ('10.24.210.95', 517, 0.006870743360062022),
 ('10.91.30.80', 31, 0.006870743360062022),
 ('10.170.66.230', 0, 0.006870743360062022),
 ('10.248.255.254', 13, 0.006618827356957439),
 ('10.119.197.249', 16, 0.006595913625659546),
 ('10.179.121.72', 24, 0.006595913625659545),
 ('10.198.229.162', 39, 0.006595913625659545),
 ('10.90.83.70', 94, 0.006595913625659545),
 ('10.253.53.93', 531, 0.006595913625659545),
 ('10.190.22.140', 528, 0.006595913625659545),
 ('10.169.247.26', 526, 0.006595913625659545),
 ('10.13.160.182', 524, 0.006595913625659545),
 ('10.128.156.35', 523, 0.006595913625659545),
 ('10.117.10.98', 522, 0.006595913625659545),
 ('10.114.101.206', 521, 0.006595913625659545),
 ('10.166.38.179', 525, 0.006595913625659545),
 ('10.75.181.12', 284, 0.006595913625659541),
 ('10.229.232.57', 376, 0.006595913625659541),
 ('10.119.63.148', 17, 0.006595913625659541),
 ('10.93.29.238', 472, 0.006595913625659541),
 ('10.156.124.97', 467, 0.006595913625659541),
 ('10.117.56.84', 470, 0.006595913625659541),
 ('10.209.57.198', 130, 0.006595913625659541),
 ('10.182.242.216', 210, 0.00659591362565954),
 ('10.57.147.197', 28, 0.00659591362565954),
 ('10.164.124.172', 34, 0.00659591362565954),
 ('10.92.238.125', 518, 0.00659591362565954),
 ('10.167.89.236', 23, 0.00659591362565954)]
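
The idea behind a top-k listing can be sketched as picking the k largest scores from one component's column of a mode's factor matrix. The example below returns tuples shaped like the output above, on the assumption that the middle number there is the label's index within the mode. The function name top_k_labels and the data are hypothetical, and this is not the get_top_k implementation.

```python
import heapq

def top_k_labels(factor, labels, component, k):
    """Return the k highest-scoring (label, index, score) tuples for one component of one mode."""
    scored = ((labels[i], i, row[component]) for i, row in enumerate(factor))
    return heapq.nlargest(k, scored, key=lambda item: item[2])

# Hypothetical destination-host mode with four labels and two components
labels = ['10.0.0.1', '10.0.0.2', '10.0.0.3', '10.0.0.4']
factor = [
    [0.10, 0.00],
    [0.50, 0.25],
    [0.05, 0.75],
    [0.35, 0.00],
]
print(top_k_labels(factor, labels, component=0, k=2))
```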

Backtracking to Log Entries

Because ENSIGN can construct mappings from tensor entries to log entries and from decomposition components to tensor entries, we can compose these mappings to produce the original log entries associated with any given components. To further our investigation into activity from host 10.86.87.14, we can use ENSIGN's backtracking capabilities to return the original network log entries behind the events associated with the tensor components we've analyzed.

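# Collect the log line numbers behind components 13 (port scan) and 58 (network mapping)
idxs = []
for comp_id in [13, 58]:
    # Component -> tensor entries -> [log file, line number] pairs
    for entry in decomp.cpd_backtrack[comp_id]:
        idxs += [line for [log, line] in tensor.spt_backtrack[entry]]
conn.loc[idxs]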
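
The composition of the two mappings can be pictured with plain dictionaries. In the sketch below, component_to_entries and entry_to_lines are hypothetical stand-ins for decomp.cpd_backtrack and tensor.spt_backtrack, filled with invented values. The sketch also sorts and de-duplicates the line numbers, which is a variation on the loop above; whether duplicates can occur in practice depends on how ENSIGN assigns tensor entries to components, which is not stated here.

```python
# Component -> tensor entries (hypothetical stand-in for decomp.cpd_backtrack)
component_to_entries = {13: [0, 1], 58: [1, 2]}
# Tensor entry -> [log file index, line number] pairs (hypothetical stand-in for tensor.spt_backtrack)
entry_to_lines = {0: [[0, 10], [0, 11]], 1: [[0, 42]], 2: [[0, 7], [0, 42]]}

def lines_for_components(comp_ids):
    """Compose the two mappings and return sorted, de-duplicated log line numbers."""
    lines = set()
    for comp_id in comp_ids:
        for entry in component_to_entries[comp_id]:
            # The log file index is ignored because this sketch assumes a single log file
            lines.update(line for _log, line in entry_to_lines[entry])
    return sorted(lines)

print(lines_for_components([13, 58]))
```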

Key Takeaways

In this notebook, we used tensor decompositions to find threats in cyber network logs. We showed how to use ENSIGN to mine unlabeled, multidimensional data for patterns that cue investigations into possible malicious activity. Notably, our investigation did not require labeled data or the use of signature-based methods. In general, tensor decompositions isolate discrete activities in the data and help to discover the "unknown unknowns." While in this notebook we processed the components with a recurrent activity detector to locate a threat, what we described is a flexible workflow that can incorporate numerous detectors in order to find a number of potential threat indicators. Using similar methods, we have previously uncovered and visualized patterns indicative of:

Distributed port scans evolving to machine takeover
Distributed denial of service attacks
DNS-based data exfiltration/insider threat
SSH password guessing (apart from scanning)
Network policy violations
Exploitation of application-specific port vulnerabilities
Patterns of traffic indicative of scans for printers or IoT devices
Broken or misconfigured network services
Selective, persistent use of cryptographic methods in point-to-point communication

This workflow can be extended so that the patterns are examined daily to discover “what has changed” and to help skilled hunt teams make directed, efficient use of big-graph platforms and search tools. ENSIGN’s advanced unsupervised machine learning capability connects key dots that make clear who the relevant actors are.

Output of conn (Data section):

	timestamp	uid	src.ip	src.port	dst.ip	dst.port	proto	service	duration	orig_bytes	...	mcore_resp_pkts	mcore_orig_bytes	mcore_resp_bytes	mcore_orig_ip_bytes	mcore_resp_ip_bytes	mcore_hist_pktsize	mcore_hist_pkttime	orig_cc	resp_cc	pcr
0	1.520312e+09	CVB0Ci2bigU7WwkdIa	10.55.89.190	60311	10.106.187.106	53	udp	dns	0.000570	94	...	212	4982	17278	7950	23214	-	-	-	-	-0.552381
1	1.520312e+09	CCH8XH3aiB7yaVqsfb	10.29.128.152	48656	10.106.187.106	53	udp	dns	0.000604	100	...	4	100	338	156	450	-	-	-	-	-0.543379
2	1.520312e+09	Cqlj26rE1jBTRm5Ui	10.29.128.152	56478	10.106.187.106	53	udp	dns	0.000610	62	...	4	110	458	166	570	-	-	-	-	-0.610063
3	1.520312e+09	CYZsqO1OfUc6EjQ8Yc	10.29.128.152	33011	10.18.10.43	49403	tcp	-	0.153709	608	...	18	1887	608	2571	1544	-	-	-	-	-0.512625
4	1.520312e+09	CVCh7Mur0eT0BiLe3	10.29.128.152	57289	10.18.10.43	28518	tcp	-	0.001273	147	...	5	50	147	266	407	-	-	-	-	0.492386
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3186424	1.520399e+09	CYjK3CoWCsAHkM05j	10.204.231.249	45121	10.106.187.106	53	udp	dns	0.000563	72	...	60	1080	6120	1920	7800	-	-	-	-	-0.7
3186425	1.520399e+09	CWxwLc1Ck7W3RrCYRd	10.70.96.59	60117	10.106.187.106	53	udp	dns	0.000457	40	...	12	233	872	401	1208	-	-	-	-	-0.646018
3186426	1.520399e+09	CHidjjwDQeDcNxJn	10.246.131.128	58646	10.106.187.106	53	udp	dns	0.000334	72	...	34	1234	2560	2186	3512	-	-	-	-	-0.323944
3186427	1.520399e+09	CEVYtfmCpArL8C0l6	10.55.89.190	37423	10.106.187.106	53	udp	dns	0.000515	94	...	420	9870	34230	15750	45990	-	-	-	-	-0.552381
3186428	1.520399e+09	CimIdk3MjEjt0zFNqg	10.55.89.190	41830	10.106.187.106	53	udp	dns	0.000601	94	...	392	9306	31948	14850	42924	-	-	-	-	-0.552381

Output of conn.loc[idxs] (Backtracking to Log Entries section):

	timestamp	uid	src.ip	src.port	dst.ip	dst.port	proto	service	duration	orig_bytes	...	mcore_resp_pkts	mcore_orig_bytes	mcore_resp_bytes	mcore_orig_ip_bytes	mcore_resp_ip_bytes	mcore_hist_pktsize	mcore_hist_pkttime	orig_cc	resp_cc	pcr
63748	1.520314e+09	C6k0fK1F83yG1Bie6k	10.86.87.14	35410	10.248.255.254	1	tcp	-	0.000214	0	...	1	0	0	60	40	-	-	-	-	-
63808	1.520314e+09	C2fXRtdmwasOrH0O3	10.86.87.14	40248	10.248.255.254	2	tcp	-	0.000175	0	...	1	0	0	60	40	-	-	-	-	-
63867	1.520314e+09	CE4ctLx8t9Hc9T7Z5	10.86.87.14	53996	10.248.255.254	3	tcp	-	0.000193	0	...	1	0	0	60	40	-	-	-	-	-
63893	1.520314e+09	CYSyj713qv4WN6iI67	10.86.87.14	45135	10.248.255.254	4	tcp	-	0.000238	0	...	1	0	0	60	40	-	-	-	-	-
63936	1.520314e+09	CkIluE49Kj0fsda826	10.86.87.14	60628	10.248.255.254	5	tcp	-	0.000209	0	...	1	0	0	60	40	-	-	-	-	-
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3182703	1.520399e+09	Cb13yJ2zlPh4i9FzSd	10.86.87.14	47824	10.180.241.171	135	tcp	-	-	-	...	0	0	0	60	0	-	-	-	-	-
3182926	1.520399e+09	CXyGc93JY2itoNyV3k	10.86.87.14	47824	10.180.241.171	135	tcp	-	-	-	...	0	0	0	60	0	-	-	-	-	-
3181501	1.520399e+09	CbDL2p2B7uT1p14v8b	10.86.87.14	60840	10.32.28.53	135	tcp	-	0.000123	0	...	1	0	0	60	40	-	-	-	-	-
3183450	1.520399e+09	CJ1dDN3YYqVtAJBMX4	10.86.87.14	47824	10.180.241.171	135	tcp	-	-	-	...	0	0	0	60	0	-	-	-	-	-
3184384	1.520399e+09	Cw7uio2D6YPc24ukTl	10.86.87.14	47824	10.180.241.171	135	tcp	-	-	-	...	0	0	0	60	0	-	-	-	-	-

Output of query_decomp for 10.86.87.14 (Network Mapping section):

	Score	Mode	Component
11	1.000000	1	13
27	0.993031	1	83
12	0.987844	1	14
24	0.952745	1	58
17	0.562519	1	26

Output of query_decomp for 10.55.89.190 (Visualizing Components section):

	Score	Mode	Component
2	0.715004	1	2
0	0.638488	1	0
17	0.074074	1	99
7	0.070562	1	8
14	0.059359	1	47