ensign.csv2tensor.csv2tensor

csv2tensor(filepaths, distributed=False, columns=None, binning=None, types=None, sort=None, entries='count', fuse_columns=None, joiner='__', delimiter=', ', bro_log=False, header=None, validate_bro_log=True, gen_backtrack=False, gen_queries=False, drop_missing_values=False, missing_vals_limit=None, verbose=False)

Creates a sparse tensor from one or more CSV files or Bro/Zeek logs.

The columns of the input DataFrame(s) become the modes (dimensions) of the tensor. Choose these columns carefully; using only 3-6 columns is recommended. Specify them with the columns argument.

The set of indices of each mode corresponds to the unique set of values in the corresponding column. Therefore, the values of each column need to be discretized, which is done with the binning argument.
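The mapping from discretized column values to mode indices can be sketched in plain Python. This is an illustration of the idea only, not the library's implementation: the toy rows, the helper names log10_bin and build_index, and the choice of floor(log10(x)) as the bin label are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical rows standing in for CSV lines: (src, dst, bytes).
rows = [
    ("10.0.0.1", "10.0.0.2", 120),
    ("10.0.0.1", "10.0.0.2", 950),
    ("10.0.0.3", "10.0.0.2", 15000),
    ("10.0.0.1", "10.0.0.2", 480),
]

def log10_bin(value):
    """Assumed stand-in for a 'log10'-style binning: the order of magnitude."""
    return int(math.floor(math.log10(value)))

# Discretize the numeric column so that it has few unique values.
binned = [(src, dst, log10_bin(nbytes)) for src, dst, nbytes in rows]

def build_index(labels):
    """Map each unique label of one column to an integer index."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}

# One label-to-index map per mode (column).
mode_maps = [build_index(column) for column in zip(*binned)]

# Sparse tensor entries in the spirit of entries='count':
# each coordinate tuple maps to the number of rows that fall on it.
counts = Counter(
    tuple(mode_map[value] for mode_map, value in zip(mode_maps, row))
    for row in binned
)
```

Without the binning step, every distinct byte count would become its own index in the third mode, which is why continuous columns are discretized first.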

Each binning scheme requires the associated column to be of a particular type. Specify the types with the types argument.
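As a rough sketch of how the columns, types, and binning arguments line up position by position, the options for a call might be assembled as follows. The file name, column names, and the helper check_parallel are hypothetical, and whether each type and binning pairing shown here is accepted should be confirmed against the library; the call itself is left commented out because it requires ENSIGN and a real input file.

```python
# Hypothetical input file and column names, used only for illustration.
options = {
    "filepaths": ["conn.csv"],
    "columns": ["ts", "src", "dst", "bytes"],
    # Entry i of 'types' and 'binning' applies to entry i of 'columns'.
    "types": ["timestamp", "str", "str", "float64"],
    "binning": ["minute", "none", "none", "log10"],
    "entries": "count",
}

def check_parallel(opts):
    """Hypothetical helper: verify the per-column lists have equal length."""
    n = len(opts["columns"])
    for key in ("types", "binning"):
        if len(opts[key]) != n:
            raise ValueError(f"'{key}' must have {n} entries, one per column")
    return n

num_modes = check_parallel(options)

# With ENSIGN installed and the file present, the call would look like:
# from ensign.csv2tensor import csv2tensor
# tensor = csv2tensor(**options)
```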

Parameters

filepaths (list of str): Path(s) to input file(s). If multiple input files are specified, other options such as ‘types’ and ‘binning’ are applied identically to all files.
distributed (bool or str): Whether to use the Dask Distributed scheduler. If True, the Dask scheduler address is assumed to be ‘127.0.0.1:8786’. If False, the local threaded scheduler is used. If a str, it should contain a Dask scheduler address. Default: False (use the threaded scheduler). The distributed scheduler is strongly recommended.
columns (list of str): A list containing the names of the columns to use for tensor construction. Default: None
types (list of str): The expected type of the columns list entry at the corresponding position. Options are: ‘str’, ‘float64’, ‘int64’, ‘datetime’, ‘date’, ‘time’, ‘timestamp’, and ‘ip’. Columns typed as ‘datetime’, ‘date’, ‘time’ or ‘timestamp’ will be sorted automatically if sort is None. Default: None
binning (list of str): The binning technique to use for the columns list entry at the corresponding position. Options are: ‘none’, ‘binsize=<float>’, ‘cyclic=<int:float>’, ‘log10’, ‘round=<int>’, ‘ipv6_hextets=<num_hextets>:[‘MSB’|‘LSB’]’, ‘[<ipv4_mask>+<ipv6mask> | <ipv4_mask> | <ipv6mask>]’, ‘second’, ‘minute’, ‘hour’, ‘day’, ‘month’, ‘year’, ‘minute-of-hour’, ‘hour-of-day’, ‘day-of-week’, ‘day-of-month’, and ‘month-of-year’. Default: ‘none’
sort (list of str): List of column names to sort. Mode labels of these columns will be sorted when mapped to indices. Sorting columns can increase run time. Default: only ‘timestamp’, ‘datetime’, ‘date’ and ‘time’ columns are sorted.
fuse_columns (list of list of str): Lists of columns to fuse into single columns. For example, [[‘col1’, ‘col2’], [‘col3’, ‘col4’]] fuses col1 and col2 into a single column named col1__col2, and col3 and col4 into a single column named col3__col4. Default: no columns are fused.
joiner (str): Delimiter separating the values in a fused column. Default: ‘__’
delimiter (str): Delimiter separating the columns in the CSV file(s). Interpreted as a regular expression if longer than a single character. Default: ‘,’
entries (str): Tensor entry calculation method. Legal values are ‘count’, ‘boolean’, and ‘value=<column_name>:<aggregation_method>’. Valid aggregation methods are ‘sum’, ‘max’, ‘min’, ‘max_abs’, ‘min_abs’, ‘first’, ‘last’, ‘mean’, ‘prod’, ‘idxmin’, and ‘idxmax’. Modes that are used as value columns will be typed as ‘float64’. Default: ‘count’
bro_log (bool): If True, treat input as a Bro/Zeek log. Default: False (treat input as a CSV)
header (list of str): For use with files that don’t have headers. Specifies the names of the columns of the input file. Must have the same number of column names as there are columns in the input file. Default: None.
validate_bro_log (bool): If bro_log is True, validate that the input is indeed a Bro/Zeek log. This simply checks the header of the file. Default: True (only applies when bro_log is True).
gen_backtrack (bool): If True, generate backtracking information from the tensor to the input files. This is a map from tensor entries to source lines in the original CSV file(s). The information is helpful for retrieving the data associated with specific sets of entries. This method has known scalability limitations, and this option will be ignored for data over 1 GB. For large files, use gen_queries as an alternative. Default: False.
gen_queries (bool): If True, generate a map from each bin in each mode to a set of selection criteria that can be parsed to construct a query for finding the original data lines. Scalable alternative to gen_backtrack. Default: False.
drop_missing_values (bool): If True, drop rows where any entry fails to be typed as a float64, int64, date, time, datetime, or timestamp. Otherwise, bucket the missing/corrupted values as NaN or NaT. Default: False (bucket missing values).
missing_vals_limit (int): Cap on the number of values that fail to be typed. Once this limit is reached, csv2tensor will exit with an error. None means the number of failures is not capped. Default: None (do not limit).
verbose (bool): Verbose output. Default: False.
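The interplay of fuse_columns, joiner, and entries described above can be illustrated with a small stand-alone sketch. It does not call the library and is not its implementation: the rows, the column names, and the helper fuse are hypothetical, and building the fused column name with the joiner is an assumption inferred from the col1__col2 example.

```python
from collections import defaultdict

# Hypothetical rows standing in for parsed CSV lines.
rows = [
    {"src": "a", "dst": "x", "port": "80", "bytes": 100.0},
    {"src": "a", "dst": "x", "port": "80", "bytes": 250.0},
    {"src": "b", "dst": "y", "port": "443", "bytes": 40.0},
]

def fuse(row, group, joiner="__"):
    """Join the named columns of one row into a single fused label."""
    name = joiner.join(group)
    value = joiner.join(str(row[col]) for col in group)
    return name, value

fused_name, _ = fuse(rows[0], ["dst", "port"])

count_entries = defaultdict(int)    # in the spirit of entries='count'
sum_entries = defaultdict(float)    # in the spirit of entries='value=bytes:sum'
for row in rows:
    _, fused_value = fuse(row, ["dst", "port"])
    key = (row["src"], fused_value)  # coordinates over two modes
    count_entries[key] += 1
    sum_entries[key] += row["bytes"]

# In the spirit of entries='boolean': 1 wherever at least one row exists.
boolean_entries = {key: 1 for key in count_entries}
```

Fusing reduces the number of modes (three columns become two here) at the cost of a larger label set in the fused mode.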

Returns

tensor (ensign.sptensor.SPTensor): The sparse tensor produced from the CSV input file(s) and input parameters.
