ensign.csv2tensor.df2tensor

df2tensor(dfs, dask_client=None, columns=None, types=None, binning=None, entries='count', sort=None, fuse_columns=None, joiner='__', gen_backtrack=False, gen_queries=False, failure_counter=None, verbose=False, in_memory=True)

Variant of csv2tensor where in-memory DataFrames are passed instead of paths to files on disk. csv2tensor is preferred when possible.

Parameters
dfs : list of dask.dataframe.DataFrame

Dataframes to convert into a tensor. Each column must be typed as specified with the types argument. Any missing values must be bucketed as np.inf or np.nan in float64 columns, sys.maxsize in int64 columns, and pd.NaT in datetime/timestamp columns.
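A minimal sketch of the sentinel convention described above, written with plain Python rows so that it runs without pandas or dask. The helper name `fill_missing` and the list-of-dicts layout are illustrative assumptions, not part of ENSIGN. With a real DataFrame the same per-type mapping would typically be applied with `fillna`, and `pd.NaT` would be used for ‘datetime’ and ‘timestamp’ columns.

```python
import math
import sys

# Sentinel per column type, following the convention in the text.
# 'datetime'/'timestamp' columns use pd.NaT, omitted here to stay stdlib-only.
MISSING_SENTINELS = {
    "float64": math.nan,   # same IEEE NaN value as np.nan
    "int64": sys.maxsize,
}


def fill_missing(rows, column_types):
    """Return copies of rows with None replaced by the type's sentinel.

    rows: list of dicts mapping column name -> value (hypothetical layout).
    column_types: dict mapping column name -> 'float64' or 'int64'.
    """
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the caller's data is left untouched
        for name, type_name in column_types.items():
            if new_row.get(name) is None:
                new_row[name] = MISSING_SENTINELS[type_name]
        filled.append(new_row)
    return filled


# Hypothetical column names, for illustration only.
rows = [
    {"bytes": 120, "duration": 0.5},
    {"bytes": None, "duration": None},
]
filled = fill_missing(rows, {"bytes": "int64", "duration": "float64"})
```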

dask_client : dask.distributed.Client

Client object connected to a distributed Dask scheduler. If None, the default local threaded scheduler is used. Default: None.

columns : list of str

A list containing names of columns to be chosen for tensor construction. Default: None

types : list of str

The expected type of the columns list entry at the corresponding position. Options are: ‘str’, ‘float64’, ‘int64’, ‘datetime’, ‘timestamp’, and ‘ip’. Columns typed as ‘datetime’ or ‘timestamp’ will be sorted automatically if sort is None. Default: None

binning : list of str

The binning technique to use for the columns list entry at the corresponding position. Options are: ‘none’, ‘binsize=<float>’, ‘cyclic=<int:float>’, ‘log10’, ‘round=<int>’, ‘ipv6_hextets=<num_hextets>:[‘MSB’|‘LSB’]’, ‘[<ipv4_mask>+<ipv6mask> | <ipv4_mask> | <ipv6mask>]’, ‘second’, ‘minute’, ‘hour’, ‘day’, ‘month’, ‘year’, ‘minute-of-hour’, ‘hour-of-day’, ‘day-of-week’, ‘day-of-month’, and ‘month-of-year’. Default: ‘none’
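Because columns, types, and binning are matched by position, it can help to keep each column's settings together and split them into parallel lists just before the call. The sketch below is one possible way to organise this; `build_column_args` is a hypothetical helper and the column names are made up. Only the type names listed above are checked, and binning strings are passed through without validation.

```python
VALID_TYPES = {"str", "float64", "int64", "datetime", "timestamp", "ip"}


def build_column_args(specs):
    """Split (column, type, binning) triples into three parallel lists."""
    columns, types, binning = [], [], []
    for name, type_name, bin_spec in specs:
        if type_name not in VALID_TYPES:
            raise ValueError(f"unknown type {type_name!r} for column {name!r}")
        columns.append(name)
        types.append(type_name)
        binning.append(bin_spec)
    return {"columns": columns, "types": types, "binning": binning}


# Hypothetical columns of a network log.
kwargs = build_column_args([
    ("ts", "timestamp", "hour"),
    ("src_ip", "ip", "none"),
    ("bytes", "int64", "log10"),
])

# The result could then be passed along, for example:
#   tensor = df2tensor(dfs, **kwargs)
```

Keeping the triples together makes it harder to shift one list relative to the others when a column is added or removed.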

sort : list of str

List of column names to sort. Mode labels of these columns will be sorted when mapped to indices. Sorting columns can increase run time. Default: only ‘timestamp’ and ‘datetime’ columns are sorted.

fuse_columns : list of list of str

Lists of columns to fuse into single columns. For example, [[‘col1’, ‘col2’], [‘col3’, ‘col4’]] fuses col1 and col2 into a single column named col1__col2, and col3 and col4 into a single column named col3__col4. Default: no columns are fused.

joiner : str

Delimiter separating the values in a fused column. Default: ‘__’
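To make the naming rule concrete, here is a small sketch of how a fused column's name and values could be formed with joiner. It is an illustration in plain Python, not ENSIGN's implementation; the helper names are hypothetical, and joining the string form of each value is an assumption based on the description of joiner above.

```python
def fused_name(group, joiner="__"):
    """Name of the column obtained by fusing the columns in group."""
    return joiner.join(group)


def fuse_row(row, group, joiner="__"):
    """Fused value for one row: member values joined by joiner (assumed)."""
    return joiner.join(str(row[col]) for col in group)


fuse_columns = [["col1", "col2"], ["col3", "col4"]]
names = [fused_name(group) for group in fuse_columns]

# One hypothetical input row, keyed by column name.
row = {"col1": "a", "col2": "b", "col3": 1, "col4": 2}
values = [fuse_row(row, group) for group in fuse_columns]
```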

entries : str

Tensor entry calculation method. Legal values are ‘count’, ‘boolean’, and ‘value=<column_name>:<aggregation_method>’. Valid aggregation methods are ‘sum’, ‘max’, ‘min’, ‘max_abs’, ‘min_abs’, ‘first’, ‘last’, ‘mean’, ‘prod’, ‘idxmin’, and ‘idxmax’. Modes that are used as value columns will be typed as ‘float64’. Default: ‘count’
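The entries string can be checked before it is passed in. The following sketch parses the three documented forms; `parse_entries` is a hypothetical helper, not an ENSIGN function, and splitting on the last colon is an assumption made so that the aggregation method is always taken from the final field.

```python
AGGREGATIONS = {
    "sum", "max", "min", "max_abs", "min_abs",
    "first", "last", "mean", "prod", "idxmin", "idxmax",
}


def parse_entries(spec):
    """Split an entries string into (kind, column, aggregation)."""
    if spec in ("count", "boolean"):
        return spec, None, None
    prefix = "value="
    if not spec.startswith(prefix):
        raise ValueError(f"unrecognised entries spec: {spec!r}")
    # Split on the last ':' so the aggregation is always the final field.
    column, sep, method = spec[len(prefix):].rpartition(":")
    if not sep or not column or method not in AGGREGATIONS:
        raise ValueError(f"unrecognised entries spec: {spec!r}")
    return "value", column, method


# 'bytes' is a made-up column name.
parsed = parse_entries("value=bytes:sum")
```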

gen_backtrack : bool

If True, generate backtracking information from the tensor to the input files. This is a map from tensor entries to source lines in the original CSV file(s), which is helpful for pulling the data associated with specific sets of entries. This method has known scalability limitations, and the option is ignored for data over 1GB. For large files, use gen_queries (the -q option) as an alternative. Default: False.

gen_queries : bool

If True, generate a map from each bin in each mode to a set of selection criteria that can be parsed to construct a query for finding the original data lines. Scalable alternative to gen_backtrack. Default: False.

verbose : bool

Verbose output. Default: False.

in_memory : bool

Whether df2tensor() is called standalone or as part of the csv2tensor() function. Default: True.

Returns
tensor : ensign.sptensor.SPTensor

The sparse tensor produced from the input DataFrames and input parameters.