ensign.csv2tensor.csv2tensor

csv2tensor(filepaths, distributed=False, columns=None, binning=None, types=None, sort=None, entries='count', fuse_columns=None, joiner='__', delimiter=', ', bro_log=False, header=None, validate_bro_log=True, gen_backtrack=False, gen_queries=False, drop_missing_values=False, missing_vals_limit=None, verbose=False)

Creates a sparse tensor from one or more CSV files or Bro/Zeek logs.

The columns of the DataFrame(s) become the modes (dimensions) of the tensor. Choose them carefully with the columns argument; using only 3-6 columns is recommended.

The set of indices of each mode corresponds to the set of unique values in the corresponding column, so the values of each column need to be discretized. This is done with the binning argument.

Each binning scheme requires the associated column to be of a particular type. Specify the types with the types argument.
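The mapping from columns to modes described above can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: the rows and column names are made up, labels are sorted purely to make the result deterministic, and the sparse tensor is represented as a dictionary from index tuples to counts.

```python
from collections import Counter

# Toy rows standing in for a parsed CSV; the column names are hypothetical.
rows = [
    {"src": "10.0.0.1", "dst": "10.0.0.9", "port": "80"},
    {"src": "10.0.0.1", "dst": "10.0.0.9", "port": "80"},
    {"src": "10.0.0.2", "dst": "10.0.0.9", "port": "443"},
]
columns = ["src", "dst", "port"]  # one tensor mode per selected column

# Map the unique values (labels) of each mode to integer indices.
# Sorting is done here only so the indices are reproducible.
mode_maps = []
for col in columns:
    labels = sorted({row[col] for row in rows})
    mode_maps.append({label: i for i, label in enumerate(labels)})

# Sparse tensor as {index tuple: count}; only nonzero entries are stored.
coords = Counter(
    tuple(mode_maps[m][row[col]] for m, col in enumerate(columns))
    for row in rows
)

for index, count in sorted(coords.items()):
    print(index, count)
```

Each distinct combination of labels becomes one nonzero entry, which is why columns with many unique values usually need binning first.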

Parameters
filepaths: list of str

Path(s) to input file(s). If multiple input files are specified, other options such as ‘types’ and ‘binning’ are applied identically to all files.

distributed: bool or str

Whether or not to use the Dask Distributed scheduler. If True, the Dask scheduler address is assumed to be ‘127.0.0.1:8786’. If False, the local threaded scheduler is used. If str, it should contain a Dask scheduler address. Default: use threaded scheduler. The distributed scheduler is strongly recommended.
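The three accepted forms of this argument can be summarized with a small helper. The function name resolve_scheduler is hypothetical and is not part of the library; it only restates the rules in the paragraph above.

```python
def resolve_scheduler(distributed=False):
    """Return the Dask scheduler address implied by `distributed`.

    None means "use the local threaded scheduler". This helper is an
    illustration of the documented rules, not library code.
    """
    if distributed is True:
        return "127.0.0.1:8786"  # documented default address
    if distributed is False:
        return None  # local threaded scheduler
    if isinstance(distributed, str):
        return distributed  # caller-supplied scheduler address
    raise TypeError("distributed must be a bool or a str")


print(resolve_scheduler(True))
print(resolve_scheduler("10.1.2.3:8786"))
```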

columns: list of str

A list containing names of columns to be chosen for tensor construction. Default: None

types: list of str

The expected type of the columns list entry at the corresponding position. Options are: ‘str’, ‘float64’, ‘int64’, ‘datetime’, ‘date’, ‘time’, ‘timestamp’, and ‘ip’. Columns typed as ‘datetime’, ‘date’, ‘time’ or ‘timestamp’ will be sorted automatically if sort is None. Default: None

binning: list of str

The binning technique to use for the columns list entry at the corresponding position. Options are: ‘none’, ‘binsize=<float>’, ‘cyclic=<int:float>’, ‘log10’, ‘round=<int>’, ‘ipv6_hextets=<num_hextets>:[‘MSB’|‘LSB’]’, ‘[<ipv4_mask>+<ipv6mask> | <ipv4_mask> | <ipv6mask>]’, ‘second’, ‘minute’, ‘hour’, ‘day’, ‘month’, ‘year’, ‘minute-of-hour’, ‘hour-of-day’, ‘day-of-week’, ‘day-of-month’, and ‘month-of-year’. Default: ‘none’
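To give a feel for what discretization does, the sketch below implements one plausible reading of three of the schemes. The exact semantics are not spelled out here, so treat these as assumptions: ‘binsize=<float>’ is taken to map a value to the lower edge of its fixed-width bin, ‘hour’ to truncate a timestamp to the start of its hour, and ‘hour-of-day’ to keep only the hour (0-23). The function names are hypothetical.

```python
import math
from datetime import datetime


def bin_binsize(x, width):
    """Assumed 'binsize=<float>': lower edge of the fixed-width bin holding x."""
    return math.floor(x / width) * width


def bin_hour(ts):
    """Assumed 'hour': truncate the timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def bin_hour_of_day(ts):
    """Assumed 'hour-of-day': cyclic label in 0..23, ignoring the date."""
    return ts.hour


ts = datetime(2021, 3, 14, 15, 9, 26)
print(bin_binsize(7.3, 2.5))  # value falls in the bin starting at 5.0
print(bin_hour(ts))           # one label per calendar hour
print(bin_hour_of_day(ts))    # one label per hour of the day
```

Under these assumptions, ‘hour’ yields a new label for every hour in the data, while ‘hour-of-day’ never yields more than 24 labels, which is the kind of trade-off to weigh when keeping modes small.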

sort: list of str

List of column names to sort. Mode labels of these columns will be sorted when mapped to indices. Sorting columns can increase run time. Default: only ‘timestamp’, ‘datetime’, ‘date’ and ‘time’ columns are sorted.

fuse_columns: list of list of str

Lists of columns to fuse into single columns. For example, [[‘col1’, ‘col2’], [‘col3’, ‘col4’]] would fuse col1 and col2 into a single column named col1__col2, and col3 and col4 into a single column named col3__col4. Default: no columns are fused.

joiner: str

Delimiter separating the values in a fused column. Default: ‘__’
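Column fusing with a joiner can be pictured as follows. This is a minimal sketch on a single row held as a dictionary; the helper name fuse is hypothetical, and it assumes the source columns are replaced by the fused column, which is not stated explicitly above.

```python
def fuse(row, fuse_columns, joiner="__"):
    """Return a copy of `row` with each group of columns fused into one.

    Illustrative only: assumes the original columns are dropped once fused.
    """
    fused = dict(row)
    for group in fuse_columns:
        name = joiner.join(group)  # e.g. 'col1__col2'
        value = joiner.join(str(row[col]) for col in group)
        for col in group:
            fused.pop(col, None)
        fused[name] = value
    return fused


row = {"col1": "a", "col2": "b", "col3": 1, "col4": 2}
print(fuse(row, [["col1", "col2"], ["col3", "col4"]]))
```

Fusing two columns trades two modes for one mode whose labels are the observed value pairs.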

delimiter: str

Delimiter separating the columns in the CSV file(s). Interpreted as a regular expression if longer than a single character. Default: ‘,’

entries: str

Tensor entry calculation method. Legal values are ‘count’, ‘boolean’, and ‘value=<column_name>:<aggregation_method>’. Valid aggregation methods are ‘sum’, ‘max’, ‘min’, ‘max_abs’, ‘min_abs’, ‘first’, ‘last’, ‘mean’, ‘prod’, ‘idxmin’, and ‘idxmax’. Modes that are used as value columns will be typed as ‘float64’. Default: ‘count’
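The difference between the entry methods can be shown on a handful of rows that have already been mapped to tensor indices. This is a sketch under assumptions: the records are invented, ‘boolean’ is assumed to store 1 for any index that occurs, and only the ‘sum’ and ‘max’ aggregations are shown.

```python
from collections import defaultdict

# (index tuple, reading from the value column) pairs; data is made up.
records = [((0, 1), 5.0), ((0, 1), -7.0), ((2, 0), 3.0)]

# Group the value-column readings by tensor index.
groups = defaultdict(list)
for index, value in records:
    groups[index].append(value)

count_entries = {idx: len(vals) for idx, vals in groups.items()}  # 'count'
boolean_entries = {idx: 1 for idx in groups}                      # 'boolean' (assumed 1)
sum_entries = {idx: sum(vals) for idx, vals in groups.items()}    # 'value=<col>:sum'
max_entries = {idx: max(vals) for idx, vals in groups.items()}    # 'value=<col>:max'

print(count_entries)
print(sum_entries)
```

With ‘count’ the tensor records how often each combination occurs; with a ‘value=’ method it summarizes a numeric column over the rows that share an index.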

bro_log: bool

If True, treat input as a Bro/Zeek log. Default: False (treat input as a CSV)

header: list of str

For use with files that don’t have headers. Specifies the names of the columns of the input file. Must have the same number of column names as there are columns in the input file. Default: None.

validate_bro_log: bool

If bro_log is True, validate that the input is in fact a Bro/Zeek log by checking the header of the file. Has no effect when bro_log is False. Default: True.

gen_backtrack: bool

If True, generate backtracking information from tensor to input files. This is a map from tensor entries to source lines in the original CSV file(s), which is helpful for pulling data associated with specific sets of entries. This method has known scalability limitations, and this option will be ignored for data over 1 GB. For large files, use gen_queries as an alternative. Default: False.

gen_queries: bool

If True, generate a map from each bin in each mode to a set of selection criteria that can be parsed to construct a query for finding the original data lines. Scalable alternative to gen_backtrack. Default: False.

drop_missing_values: bool

If True, drop rows where any entry fails to be typed as a float64, int64, date, time, datetime, or timestamp. Otherwise, bucket the missing/corrupted values as NaN or NaT. Default: bucket missing values.

missing_vals_limit: int

Cap on the number of values that fail to be typed. Once this number is reached, csv2tensor exits with an error. None means the number of failures is not capped. Default: None (no limit).
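The interaction between drop_missing_values and missing_vals_limit can be sketched for a single float64 column. The helper type_column is hypothetical, and the exact point at which the limit triggers is not stated above; this sketch assumes the error is raised as soon as the failure count exceeds the limit.

```python
import math


def type_column(raw_values, drop_missing_values=False, missing_vals_limit=None):
    """Type raw strings as float64-like values (illustrative only)."""
    typed, failures = [], 0
    for raw in raw_values:
        try:
            value = float(raw)
        except (TypeError, ValueError):
            failures += 1
            # Assumption: fail once the count goes past the limit.
            if missing_vals_limit is not None and failures > missing_vals_limit:
                raise ValueError(f"{failures} values failed to be typed")
            if drop_missing_values:
                continue  # drop the row
            value = math.nan  # bucket the bad value as NaN
        typed.append(value)
    return typed


raw = ["1.5", "oops", "2"]
print(type_column(raw, drop_missing_values=True))  # bad row removed
print(type_column(raw))                            # bad value kept as nan
```

Bucketing keeps every row but adds a NaN label to the mode, while dropping keeps the mode clean at the cost of losing rows.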

verbose: bool

Verbose output. Default: False.

Returns
tensor: ensign.sptensor.SPTensor

The sparse tensor produced from the CSV input file(s) and input parameters.
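Putting the parameters together, a call might be assembled as in the sketch below. The column names are hypothetical, and pairing the ‘minute’ binning with a ‘timestamp’ column is an assumption about which type that scheme expects. The import is placed inside the function so the snippet can be read and checked without the package installed.

```python
# Hypothetical columns for a network-flow CSV; adjust to your own data.
columns = ["ts", "src_ip", "dst_ip", "dst_port"]
types = ["timestamp", "ip", "ip", "int64"]
binning = ["minute", "none", "none", "none"]

# types and binning are positional: entry i describes columns[i].
assert len(columns) == len(types) == len(binning)

csv2tensor_kwargs = {
    "columns": columns,
    "types": types,
    "binning": binning,
    "entries": "count",
}


def build_tensor(filepaths):
    """Build a sparse tensor from the given CSV paths (requires ensign)."""
    from ensign.csv2tensor import csv2tensor  # imported lazily on purpose

    return csv2tensor(filepaths, **csv2tensor_kwargs)
```

Calling build_tensor(["flows.csv"]) (with a placeholder file name) would then return the SPTensor described above.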