splitutils ¶
Utilities for splitting data into sets.
Classes¶
DataSetInfo (dataclass) ¶
Dataclass describing one data set: its set name (e.g. test) and the probability used for that set when the dataset is split.
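A minimal sketch of what such a dataclass could look like. The field names `set_name` and `probability` are assumptions made for illustration; the attribute names in niceml itself may differ.

```python
from dataclasses import dataclass


@dataclass
class DataSetInfo:
    """Illustrative stand-in, not the niceml class itself."""

    set_name: str  # assumed field name, e.g. "test"
    probability: float  # assumed field name, probability of this set


test_info = DataSetInfo(set_name="test", probability=0.1)
```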
Functions¶
clear_folder ¶
Deletes every file or folder inside a given location
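The signature of clear_folder is not shown on this page. As a rough illustration of the behaviour only, the sketch below clears a folder on the local file system with the standard library. The name `clear_local_folder` and its `Path` argument are hypothetical and not part of niceml.

```python
import shutil
from pathlib import Path


def clear_local_folder(folder: Path) -> None:
    """Delete every file and subfolder inside `folder`, keeping `folder` itself."""
    for entry in folder.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a subfolder with all its contents
        else:
            entry.unlink()  # remove a regular file or a symlink
```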
Source code in niceml/utilities/splitutils.py
create_copy_files_container ¶
create_copy_files_container(
folder_list,
input_location,
recursive,
dataset_info_list,
delimiter_maxsplit,
name_delimiter,
output_location,
)
Creates a list of CopyFileInfo objects. Each CopyFileInfo contains the input and output location of a file as well as that file's checksum. The function also takes a list of DataSetInfo objects, which give the name and probability of each set (e.g. train/validation/test), and uses these probabilities to randomly assign the files to the sets.
Parameters:
- folder_list (List[str]) – Folders to copy from
- input_location (Union[dict, LocationConfig]) – Location of the input data (path) with corresponding LocationConfig
- recursive (bool) – Indicate whether to recursively search for files in the input_location
- dataset_info_list (List[DataSetInfo]) – Determine the name and probability of each dataset
- delimiter_maxsplit (int) – Maxsplit to split the file name into a set and an identifier
- name_delimiter (str) – Delimiter to split the file name into a set and an identifier
- output_location (Union[dict, LocationConfig]) – Location of the output files (path) with corresponding LocationConfig
Returns: A list of CopyFileInfo objects, each with the input location, the output location and the file's checksum
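The probability-based assignment described above can be sketched for a single file as follows. This is not the niceml implementation: the helper `pick_set_for_file`, the use of `rsplit` to obtain the identifier, and the idea of seeding the random generator from that identifier are all assumptions made for illustration.

```python
import hashlib
import random
from typing import List


def pick_set_for_file(
    file_name: str,
    set_names: List[str],
    probabilities: List[float],
    name_delimiter: str = "_",
    delimiter_maxsplit: int = 1,
) -> str:
    """Hypothetical helper: choose a set name for one file."""
    # Split from the right and keep the leading part as the identifier.
    identifier = file_name.rsplit(name_delimiter, delimiter_maxsplit)[0]
    # Seed a private generator from the identifier (an assumption of this
    # sketch), so files sharing an identifier always get the same set.
    seed = int(hashlib.sha256(identifier.encode("utf-8")).hexdigest(), 16)
    rng = random.Random(seed)
    return rng.choices(set_names, weights=probabilities, k=1)[0]


names = ["train", "validation", "test"]
probs = [0.8, 0.1, 0.1]
left = pick_set_for_file("scan42_left.png", names, probs)
right = pick_set_for_file("scan42_right.png", names, probs)
```

In this sketch both example files share the identifier `scan42`, so they are assigned to the same set. Whether niceml groups files this way is not stated in the text above.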
generate_set_and_prob_list ¶
Takes a list of DataSetInfo objects and returns two lists:
- The names of the sets, in the order they appear in the input list.
- The corresponding probability of each set, in the same order.
Parameters:
- set_info_list (List[DataSetInfo]) – List of DataSetInfo objects
Returns: A list of names (str) and a list of probabilities (float)
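A minimal sketch of this behaviour, assuming DataSetInfo exposes fields named `set_name` and `probability` (the field names are an assumption, and the stand-in dataclass below is not the niceml class):

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DataSetInfo:
    set_name: str  # assumed field name
    probability: float  # assumed field name


def generate_set_and_prob_list(
    set_info_list: List[DataSetInfo],
) -> Tuple[List[str], List[float]]:
    """Return the set names and their probabilities as two parallel lists."""
    set_names = [info.set_name for info in set_info_list]
    probabilities = [info.probability for info in set_info_list]
    return set_names, probabilities


names, probs = generate_set_and_prob_list(
    [DataSetInfo("train", 0.8), DataSetInfo("test", 0.2)]
)
```

Two parallel lists are convenient because they can be passed directly as the population and the weights of a weighted random choice.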
init_dataset_info ¶
Returns a DataSetInfo for the given information
Parameters:
- info (str) – string with the format [set_name,probability], e.g. 'test,0.1'
Returns:
- DataSetInfo – DataSetInfo object with set and probability information
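The parsing step could look roughly like the sketch below. The stand-in dataclass, its field names, and the whitespace stripping are assumptions. Only the input format 'test,0.1' is taken from the description above.

```python
from dataclasses import dataclass


@dataclass
class DataSetInfo:
    set_name: str  # assumed field name
    probability: float  # assumed field name


def init_dataset_info(info: str) -> DataSetInfo:
    """Parse a string such as 'test,0.1' into a DataSetInfo."""
    # Exactly one comma is expected; anything else raises a ValueError.
    set_name, probability = info.split(",")
    return DataSetInfo(set_name=set_name.strip(), probability=float(probability))


parsed = init_dataset_info("test,0.1")
```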