ensign.csv2tensor.df2tensor

df2tensor(dfs, dask_client=None, columns=None, types=None, binning=None, entries='count', sort=None, fuse_columns=None, joiner='__', gen_backtrack=False, gen_queries=False, failure_counter=None, verbose=False, in_memory=True)

Variant of csv2tensor where in-memory DataFrames are passed instead of paths to files on disk. csv2tensor is preferred when possible.

Parameters

dfs (list of dask.dataframe.DataFrame): Dataframes to convert into a tensor. Each column must be typed as specified with the types argument. Any missing values must be bucketed as np.inf or np.nan in float64 columns, sys.maxsize in int64 columns, and pd.NaT in datetime/timestamp columns.
dask_client (dask.distributed.Client): Client object connected to a distributed Dask scheduler. If None, the default local threaded scheduler is used. Default: None.
columns (list of str): Names of the columns to be chosen for tensor construction. Default: None.
types (list of str): The expected type of the columns list entry at the corresponding position. Options are: ‘str’, ‘float64’, ‘int64’, ‘datetime’, ‘timestamp’, and ‘ip’. Columns typed as ‘datetime’ or ‘timestamp’ will be sorted automatically if sort is None. Default: None.
binning (list of str): The binning technique to use for the columns list entry at the corresponding position. Options are: ‘none’, ‘binsize=<float>’, ‘cyclic=<int:float>’, ‘log10’, ‘round=<int>’, ‘ipv6_hextets=<num_hextets>:[‘MSB’|‘LSB’]’, ‘[<ipv4_mask>+<ipv6mask> | <ipv4_mask> | <ipv6mask>]’, ‘second’, ‘minute’, ‘hour’, ‘day’, ‘month’, ‘year’, ‘minute-of-hour’, ‘hour-of-day’, ‘day-of-week’, ‘day-of-month’, and ‘month-of-year’. Default: ‘none’.
sort (list of str): List of column names to sort. Mode labels of these columns will be sorted when mapped to indices. Sorting columns can increase run time. Default: only ‘timestamp’ and ‘datetime’ columns are sorted.
fuse_columns (list of list of str): Lists of columns to fuse into single columns. For example, [[‘col1’, ‘col2’], [‘col3’, ‘col4’]] fuses col1 and col2 into a single column named col1__col2, and col3 and col4 into a single column named col3__col4. Default: no columns are fused.
joiner (str): Delimiter separating the values in a fused column. Default: ‘__’.
entries (str): Tensor entry calculation method. Legal values are ‘count’, ‘boolean’, and ‘value=<column_name>:<aggregation_method>’. Valid aggregation methods are ‘sum’, ‘max’, ‘min’, ‘max_abs’, ‘min_abs’, ‘first’, ‘last’, ‘mean’, ‘prod’, ‘idxmin’, and ‘idxmax’. Modes that are used as value columns will be typed as ‘float64’. Default: ‘count’.
gen_backtrack (bool): If True, generate backtracking information from tensor to input files. This is a map from tensor entries to source lines in the original CSV file(s), which is helpful for pulling the data associated with specific sets of entries. This method has known scalability limitations, and the option is ignored for data over 1GB. For large files, use the -q option (see gen_queries) as an alternative. Default: False.
gen_queries (bool): If True, generate a map from each bin in each mode to a set of selection criteria that can be parsed to construct a query for finding the original data lines. Scalable alternative to gen_backtrack. Default: False.
verbose (bool): Verbose output. Default: False.
in_memory (bool): Indicates whether df2tensor() is being called standalone or as part of the csv2tensor() function. Default: True.
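
The fuse_columns, joiner, and entries descriptions above can be illustrated with a small plain-Python sketch. This is not ENSIGN's implementation: the helpers fuse and tensor_entries and the sample rows are hypothetical, only three of the entries options are covered, and the sketch assumes the fused column name is built with the same joiner as the fused values.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the contents of the input DataFrames.
rows = [
    {"src": "a", "dst": "x", "port": 80, "bytes": 10.0},
    {"src": "a", "dst": "x", "port": 80, "bytes": 5.0},
    {"src": "b", "dst": "y", "port": 22, "bytes": 7.0},
]


def fuse(row, columns, joiner="__"):
    """Return a copy of `row` with `columns` replaced by one fused column.

    Assumption: the fused column is named by joining the column names with
    `joiner`, matching the col1__col2 example for the default joiner.
    """
    fused_name = joiner.join(columns)
    fused_value = joiner.join(str(row[c]) for c in columns)
    out = {k: v for k, v in row.items() if k not in columns}
    out[fused_name] = fused_value
    return out


def tensor_entries(rows, modes, entries="count"):
    """Map each index tuple over `modes` to a tensor entry value.

    Only 'count', 'boolean', and 'value=<column>:sum' are sketched here.
    """
    acc = defaultdict(float)
    for row in rows:
        key = tuple(row[m] for m in modes)
        if entries == "count":
            acc[key] += 1          # number of rows that fall in this cell
        elif entries == "boolean":
            acc[key] = 1           # 1 if any row falls in this cell
        elif entries.startswith("value="):
            column, method = entries[len("value="):].split(":")
            if method != "sum":
                raise ValueError("this sketch only implements 'sum'")
            acc[key] += row[column]
        else:
            raise ValueError(f"unsupported entries spec: {entries}")
    return dict(acc)


fused_rows = [fuse(r, ["src", "dst"]) for r in rows]
modes = ["src__dst", "port"]
counts = tensor_entries(fused_rows, modes, entries="count")
sums = tensor_entries(fused_rows, modes, entries="value=bytes:sum")
```

In this sketch the column used for values (bytes) is left out of the modes and holds floats, in line with the note that value columns are typed as ‘float64’.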

Returns

tensor (ensign.sptensor.SPTensor): The sparse tensor produced from the input DataFrames and input parameters.
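
A call might be assembled along the lines of the following sketch. The column names (ts, src_ip, bytes), the binning choices, and the helpers check_alignment and build_tensor are hypothetical; only the parameter names, the option strings, and the import path come from this page. The ensign import is deferred into the function so that the argument check can be exercised without the package installed.

```python
# Hypothetical column layout. `types` and `binning` are matched to `columns`
# by position, as described in the parameter list.
columns = ["ts", "src_ip", "bytes"]
types = ["timestamp", "ip", "int64"]
binning = ["hour", "none", "none"]


def check_alignment(columns, types, binning):
    """Raise ValueError unless the three lists have one entry per column."""
    if not (len(columns) == len(types) == len(binning)):
        raise ValueError(
            "columns, types, and binning must have the same length: "
            f"{len(columns)}, {len(types)}, {len(binning)}"
        )


def build_tensor(dfs, dask_client=None):
    """Convert a list of dask DataFrames into a sparse tensor.

    Assumes each DataFrame in `dfs` already has the columns above, typed as
    listed in `types`, with missing values replaced by the documented
    sentinels (for example sys.maxsize in int64 columns).
    """
    from ensign.csv2tensor import df2tensor  # requires ENSIGN to be installed

    check_alignment(columns, types, binning)
    return df2tensor(
        dfs,
        dask_client=dask_client,  # None selects the local threaded scheduler
        columns=columns,
        types=types,
        binning=binning,
        entries="count",
        gen_queries=True,         # scalable alternative to gen_backtrack
    )
```

Because ts is typed as ‘timestamp’ and sort is left at its default, that mode would be sorted automatically according to the types description above.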
