Overview of Dagster Jobs¶
Job: job_train¶
Job for training an experiment
```mermaid
graph LR
    experiment --> train;
    acquire_locks --> train;
    train --> prediction;
    prediction --> analysis;
    analysis --> release_locks;
    analysis --> exptests;
```
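The dependency chain above can be sketched with plain Python functions (all names and return values are hypothetical stand-ins for the real Dagster ops, shown only to illustrate the data flow):

```python
# Minimal sketch of the job_train data flow; every function is a made-up
# stand-in for the corresponding Dagster op.

def experiment():
    return {"exp_id": "a1b2"}            # experiment parameters

def acquire_locks():
    return {"lock": True}                # lock handle

def train(exp, locks):
    return {"model": "trained", **exp}   # trained model + experiment context

def prediction(model):
    return {"preds": [0.1, 0.9], **model}

def analysis(preds):
    return {"report": "ok", **preds}

def release_locks(report):
    return report

def exptests(report):
    return report

# Wire the ops together in dependency order, as in the graph above.
exp = experiment()
locks = acquire_locks()
model = train(exp, locks)
preds = prediction(model)
report = analysis(preds)
release_locks(report)
result = exptests(report)
```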
Op: acquire_locks¶
Op for acquiring locks
| ConfigKey | Description |
|---|---|
| filelock_dict | Abstract base class for file locks. |
Op: experiment¶
This op creates the experiment parameters
| ConfigKey | Description |
|---|---|
| exp_folder_pattern | Folder pattern of the experiment. Can use `$RUN_ID` and `$SHORT_ID` to make the name unique. |
| exp_out_location | Folder to store the experiments. |
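The `$RUN_ID` and `$SHORT_ID` placeholders in `exp_folder_pattern` follow the `$NAME` convention of Python's `string.Template`; a small sketch of how such a pattern could be expanded (the pattern and ids below are made up):

```python
from string import Template

# Hypothetical folder pattern using the documented placeholders.
pattern = Template("exp_${SHORT_ID}_${RUN_ID}")

# Substitute illustrative ids to get a unique folder name.
folder = pattern.substitute(RUN_ID="20240101-120000", SHORT_ID="a1b2")
```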
Op: train¶
Dagster op that trains the model
| ConfigKey | Description |
|---|---|
| data_description | Describes the data, e.g. the input image size or the target classes used. |
| data_train | Dataset to load, transform, and shuffle the data before training. |
| data_validation | Dataset to load, transform, and shuffle the data before training. |
| exp_initializer | Creates the first folder and files for an experiment. |
| learner | Wrapper that performs the training. |
| model | ABC for model factories. Used to create the model before training. |
| remove_key_list | These keys are removed recursively from any config before it is saved. |
| train_params | TrainParams select the number of steps and epochs for training. |
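A run config for the train op might be shaped like the following; every value is illustrative, since the actual schema depends on the concrete implementations behind each key:

```python
# Illustrative run-config fragment for the train op; all values are made up.
train_config = {
    "ops": {
        "train": {
            "config": {
                "data_description": {"image_size": [256, 256], "classes": ["0", "1"]},
                "data_train": {"location": "data/train"},
                "data_validation": {"location": "data/val"},
                "exp_initializer": {},
                "learner": {"optimizer": "adam"},
                "model": {"name": "small_cnn"},
                "remove_key_list": ["password", "token"],
                "train_params": {"epochs": 10, "steps_per_epoch": 100},
            }
        }
    }
}
```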
Op: prediction¶
Dagster op that runs predictions with the stored model on the given datasets
| ConfigKey | Description |
|---|---|
| datasets | Dataset to load, transform, and shuffle the data before training. |
| model_loader | Callable that loads models. |
| prediction_function | Abstract class for prediction functions. |
| prediction_handler | Abstract PredictionHandler class to implement your own prediction handler. |
| prediction_steps | If None, the whole datasets are processed; otherwise only prediction_steps are evaluated. |
| remove_key_list | These keys are removed recursively from any config before it is saved. |
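The documented `prediction_steps` behaviour (`None` processes everything, otherwise only the first N steps are evaluated) can be sketched with `itertools.islice`; the helper name is made up:

```python
from itertools import islice

def take_steps(dataset, prediction_steps=None):
    """Return the whole dataset when prediction_steps is None,
    else only the first prediction_steps items (illustrative helper)."""
    if prediction_steps is None:
        return list(dataset)
    return list(islice(dataset, prediction_steps))

all_items = take_steps(range(10))        # whole dataset
few_items = take_steps(range(10), 3)     # only the first 3 steps
```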
Op: analysis¶
This Dagster op analyzes the predictions produced by the model
| ConfigKey | Description |
|---|---|
| remove_key_list | These keys are removed recursively from any config before it is saved. |
| result_analyzer | After prediction, all data can be analyzed with a specific implementation of the ResultAnalyzer. |
Op: release_locks¶
Op for releasing locks
| ConfigKey | Description |
|---|---|
Op: exptests¶
Op to run the experiment tests
| ConfigKey | Description |
|---|---|
| remove_key_list | These keys are removed recursively from any config before it is saved. |
| tests | Class to execute a list of ExperimentTests. |
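Several of the ops above share the `remove_key_list` setting; its described behaviour (recursively dropping the listed keys from the config before it is saved) can be sketched as:

```python
def remove_keys(config, remove_key_list):
    """Recursively drop the listed keys from nested dicts/lists
    (a sketch of the documented remove_key_list behaviour)."""
    if isinstance(config, dict):
        return {k: remove_keys(v, remove_key_list)
                for k, v in config.items() if k not in remove_key_list}
    if isinstance(config, list):
        return [remove_keys(v, remove_key_list) for v in config]
    return config

# "token" is removed at every nesting level before saving.
cleaned = remove_keys({"model": {"token": "x", "name": "cnn"}, "token": "y"},
                      ["token"])
```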
Job: job_eval¶
Job for evaluating an experiment
```mermaid
graph LR
    localize_experiment --> eval_copy_exp;
    eval_copy_exp --> prediction;
    acquire_locks --> prediction;
    prediction --> analysis;
    analysis --> release_locks;
    analysis --> exptests;
```
Op: acquire_locks¶
Op for acquiring locks
| ConfigKey | Description |
|---|---|
| filelock_dict | Abstract base class for file locks. |
Op: localize_experiment¶
This op localizes the experiment and returns the experiment context
| ConfigKey | Description |
|---|---|
| existing_experiment | Defines the experiment id: an alphanumeric string of length 4. |
| exp_folder_pattern | Unused; only required for easier configuration. |
| exp_out_location | Folder to store the experiments. |
Op: eval_copy_exp¶
Copies an experiment from one location to another.
| ConfigKey | Description |
|---|---|
| description | Description of the experiment. Replaces the training description. |
| exp_folder_pattern | Folder pattern of the experiment. Can use `$RUN_ID` and `$SHORT_ID` to make the name unique. |
| exp_out_location | Folder to store the experiments. |
Op: prediction¶
Dagster op that runs predictions with the stored model on the given datasets
| ConfigKey | Description |
|---|---|
| datasets | Dataset to load, transform, and shuffle the data before training. |
| model_loader | Callable that loads models. |
| prediction_function | Abstract class for prediction functions. |
| prediction_handler | Abstract PredictionHandler class to implement your own prediction handler. |
| prediction_steps | If None, the whole datasets are processed; otherwise only prediction_steps are evaluated. |
| remove_key_list | These keys are removed recursively from any config before it is saved. |
Op: analysis¶
This Dagster op analyzes the predictions produced by the model
| ConfigKey | Description |
|---|---|
| remove_key_list | These keys are removed recursively from any config before it is saved. |
| result_analyzer | After prediction, all data can be analyzed with a specific implementation of the ResultAnalyzer. |
Op: release_locks¶
Op for releasing locks
| ConfigKey | Description |
|---|---|
Op: exptests¶
Op to run the experiment tests
| ConfigKey | Description |
|---|---|
| remove_key_list | These keys are removed recursively from any config before it is saved. |
| tests | Class to execute a list of ExperimentTests. |
Job: job_copy_exp¶
Copy an experiment from one location to another
Op: copy_exp¶
Copies an experiment from one location to another.
| ConfigKey | Description |
|---|---|
| experiment_id | Alphanumeric 4-char string to identify the experiment. |
| input_loc_name | Name of the input location resource. |
| output_loc_name | Name of the output location resource. |
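A run config for this op might look like the following; the resource names and experiment id are made up, since the available location resources depend on the deployment:

```python
# Illustrative run config for job_copy_exp; resource names and ids are made up.
copy_exp_config = {
    "ops": {
        "copy_exp": {
            "config": {
                "experiment_id": "a1b2",                 # alphanumeric 4-char id
                "input_loc_name": "local_experiments",   # hypothetical input resource
                "output_loc_name": "archive_experiments",  # hypothetical output resource
            }
        }
    }
}
```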
Job: job_data_generation¶
Job for data generation
```mermaid
graph LR
    data_generation --> split_data;
    split_data --> crop_numbers;
    crop_numbers --> image_to_tabular_data;
    image_to_tabular_data --> df_normalization;
```
Op: data_generation¶
Generates random test image dataset based on a given data_generator
| ConfigKey | Description |
|---|---|
| data_generator | Generator of images with numbers for an object detection test dataset. |
Op: split_data¶
Splits the data in input_location into subsets (set_infos)
| ConfigKey | Description |
|---|---|
| clear_folder | Flag if the output folder should be cleared before the split. |
| max_split | Maximum number of name splits (e.g. 1). |
| name_delimiter | Character used to separate name parts. |
| output_location | Folder to save the split images. |
| recursive | Flag if the input folder should be searched recursively. |
| set_infos | Split information describing how to split the data. |
| sub_dir | Subdirectory to save the split images. |
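The `name_delimiter`/`max_split` pair mirrors the arguments of Python's `str.split`; a small sketch with a made-up filename:

```python
# str.split's maxsplit argument corresponds to the documented max_split setting.
name_delimiter = "_"
max_split = 1

# With max_split=1 only the first delimiter is split on, so the rest of the
# name stays intact.
parts = "cat_0001_train.png".split(name_delimiter, max_split)
```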
Op: crop_numbers¶
Crops the numbers from the input images and stores them separately
| ConfigKey | Description |
|---|---|
| clear_folder | Flag if the output folder should be cleared before the split. |
| name_delimiter | Delimiter used within the filenames. |
| output_location | Folder name where the images are stored. |
| recursive | Flag if the input folder should be searched recursively. |
| sub_dir | Subdirectory to save the split images. |
Op: image_to_tabular_data¶
The image_to_tabular_data function takes a location of images and converts them to tabular data.

Args:
- context (OpExecutionContext): passes in the configuration of the operation
- input_location (dict): specifies the location of the input data

Returns: the output_location where the parquet files with the table values are stored. The files remain divided into test, train, and validation.
| ConfigKey | Description |
|---|---|
| clear_folder | Flag if the output folder should be cleared before the split. |
| name_delimiter | Delimiter used within the filenames. |
| output_location | Folder name where the images are stored. |
| recursive | Flag if the input folder should be searched recursively. |
| sub_dir | Subdirectory to save the split images. |
| target_image_size | Image size to which the images should be scaled. |
| use_dirs_as_subsets | Flag if the subdirectories should be used as subset names. |
Op: df_normalization¶
The df_normalization function takes a dataframe and normalizes the features specified in scalar_feature_keys, categorical_feature_keys, and binary_feature_keys. Each of these parameters can be either a list of feature keys or a function that returns such a list. The function writes a parquet file with all specified columns normalized, as well as a yaml file describing how each feature was normalized. input_parq_location is where the input parquet files are located, while output_parq_location is where the new dataframes and the norm-info yaml are saved.

Args:
- context (OpExecutionContext): provides the op_config
- input_location (dict): specifies the location of the input data

Returns: the output_parq_location, i.e. the location of the normalized parquet files and the norm info.
| ConfigKey | Description |
|---|---|
| binary_feature_keys | Column names to be normalized as binary values (list or function). |
| categorical_feature_keys | Column names to be normalized as categorical values (list or function). |
| output_norm_feature_info_file_name | File name for the file containing the normalization information of the features. |
| output_parq_location | Target location for the normalized parquet files. |
| recursive | Flag if the input folder should be searched recursively. |
| scalar_feature_keys | Column names to be normalized as scalar values (list or function). |
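A minimal sketch of scalar feature normalization plus the norm-info record the op is said to write; the function name and norm-info fields are made up, and a real implementation would operate on parquet files/DataFrames rather than plain lists:

```python
def normalize_scalar(values):
    """Normalize a scalar column to zero mean divided by its value range,
    and record how it was normalized (illustrative, not the op's actual scheme)."""
    mean = sum(values) / len(values)
    spread = (max(values) - min(values)) or 1.0  # avoid division by zero
    normalized = [(v - mean) / spread for v in values]
    # This record corresponds to the norm-info yaml the op is described as writing.
    norm_info = {"type": "scalar", "mean": mean, "spread": spread}
    return normalized, norm_info

normalized, norm_info = normalize_scalar([0.0, 5.0, 10.0])
```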