Generating a Test Dataset with niceML¶
In addition to providing pipelines for various machine learning tasks, niceML also offers a convenient way to generate synthetic test datasets. This can be particularly useful when you need to quickly create a sample dataset for testing or prototyping your models. In this tutorial, we will
- set the amount of test data,
- set the output directory of the test data,
- configure the targets (= numbers) generated on the test data and
- generate a test dataset using the niceml gendata command.
Generate a Test Dataset¶
Follow the steps below to generate a test dataset:
Step 1: Run the niceml gendata Command¶
Open your terminal or command prompt and navigate to your project directory. Then, execute the following command:
niceml gendata
This will generate the test data according to the default configuration (configs/jobs/job_data_generation/job_data_generation.yaml).
The sections below explain how to adjust this configuration to your needs.
Data will be overwritten
Each time you run niceml gendata, the folders and images generated by this command are replaced. If you want to keep multiple versions of generated data, rename the old folders or set a new output directory before creating a new test dataset.
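One way to keep an older version is to rename the data folder before regenerating. The following is a minimal sketch using only the Python standard library. The function name `backup_data_dir` and the example path are hypothetical and not part of niceML; adjust the path to whatever your output directory actually is.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def backup_data_dir(data_dir: str, suffix: str = "") -> Optional[Path]:
    """Rename an existing data directory so a new run cannot overwrite it.

    Returns the new path, or None if there was nothing to back up.
    """
    source = Path(data_dir)
    if not source.exists():
        return None
    # Use a timestamp unless the caller supplies an explicit suffix.
    suffix = suffix or time.strftime("%Y%m%d-%H%M%S")
    target = source.with_name(f"{source.name}_{suffix}")
    if target.exists():
        raise FileExistsError(f"Backup target already exists: {target}")
    shutil.move(str(source), str(target))
    return target


if __name__ == "__main__":
    # "./data" is an assumed example path, not a niceML default.
    print(backup_data_dir("./data"))
```

After the rename, running niceml gendata writes into a fresh folder while the previous version stays available under the new name.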
Step 2: Explore the Generated Dataset¶
After the command execution is complete, you will find the generated images in the DATA_URI directory specified in your .env file. Each image will have numbers randomly placed on it, and the label information will be available as well.
What does the test data generated by niceML look like?
More information about what kind of data is generated and the corresponding folder structure is provided here.
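For a quick first look at what was generated, you can count the files in the output directory by file extension. This is a small sketch using only the Python standard library. The helper `summarize_dataset` and the example path are hypothetical, and no particular folder layout or image format is assumed.

```python
from collections import Counter
from pathlib import Path


def summarize_dataset(data_dir: str) -> Counter:
    """Count the files below data_dir, grouped by lowercase file extension."""
    counts = Counter()
    for path in Path(data_dir).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "<no extension>"] += 1
    return counts


if __name__ == "__main__":
    # "./data" is an assumed example; use the directory from your DATA_URI setting.
    for extension, count in sorted(summarize_dataset("./data").items()):
        print(f"{extension}: {count}")
```

The counts make it easy to check whether the number of image files and label files roughly matches what you expected from your configuration.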
Step 3: Customizing Data Generation (Optional)¶
niceML generates sample images with numbers placed randomly. However, you can further customize the data generation process according to your requirements. This includes defining the number of images you need, specifying the maximum number to display on the images, and configuring other generation options.
Make sure to rerun the niceml gendata command to create your new test dataset.
Set the output directory of the test data¶
The test data will be written to the data directory set in the .env file. To change the directory, modify the .env file in your project directory and adjust the DATA_URI parameter as needed.
Set the number of test images to be generated¶
Modify the .env file in your project directory. Uncomment and adjust the following parameter as needed:
# Optional for number data generation
# SAMPLE_COUNT=<number of sample images to generate; e.g., 500>
Set the highest number to be placed on the images (= maximum number of target classes)¶
The MAX_NUMBER parameter defines the highest number that is drawn on the images. It therefore also determines the maximum number of target classes present in the test dataset.
Modify the .env file in your project directory. Uncomment and adjust the following parameter as needed:
# Optional for number data generation
...
# MAX_NUMBER=<highest number to display on test images; e.g., 5>
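If you prefer to script these changes rather than edit the file by hand, the sketch below shows one possible way to uncomment and set a parameter such as SAMPLE_COUNT or MAX_NUMBER in the text of a .env file. The helper `set_env_param` is hypothetical and not part of niceML. It works on plain text and assumes simple KEY=value lines.

```python
import re


def set_env_param(env_text: str, key: str, value: str) -> str:
    """Return env_text with `key` set to `value`.

    An existing line for `key` (commented out or not) is replaced;
    if there is none, a new line is appended.
    """
    pattern = re.compile(rf"^\s*#?\s*{re.escape(key)}\s*=.*$")
    lines = env_text.splitlines()
    replaced = False
    for index, line in enumerate(lines):
        if pattern.match(line):
            lines[index] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = "# Optional for number data generation\n# SAMPLE_COUNT=100\n"
    # The values 500 and 5 mirror the examples given in the comments above.
    updated = set_env_param(example, "SAMPLE_COUNT", "500")
    updated = set_env_param(updated, "MAX_NUMBER", "5")
    print(updated)
```

To apply it to a real file, read the .env file into a string, pass it through the function, and write the result back. Keeping a copy of the original file first is advisable.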
Other configurations¶
Besides the three main configurations described above, you can adjust the data generation by modifying the corresponding YAML-file.
Configurable parameters are:
configs/ops/data_generation/op_data_generation_number.yaml:
- seed = seed of the random image generation
- img_size:width = output test image width
- img_size:height = output test image height
- font_size_min = minimum font size of the numbers generated
- font_size_max = maximum font size of the numbers generated
- max_amount = maximum amount of numbers on one image
- detection_labels = whether the label JSON-file should contain bounding box information
- rotate = whether the numbers should be rotated on the image
configs/ops/split_data/op_split_data_number.yaml:
- set_infos:probability = probability of a split (train, test, validation)
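To build intuition for what a per-split probability means, the sketch below assigns each sample to one set by drawing from the configured probabilities with a fixed seed. It is an illustration only and is not niceML's implementation. The function `assign_splits` and the probability values 0.7, 0.2 and 0.1 are assumptions chosen for the example.

```python
import random
from collections import Counter


def assign_splits(sample_ids, probabilities, seed=42):
    """Assign each sample to one set, drawn with the given probabilities."""
    rng = random.Random(seed)  # fixed seed makes the assignment repeatable
    set_names = list(probabilities)
    weights = [probabilities[name] for name in set_names]
    return {
        sample_id: rng.choices(set_names, weights=weights, k=1)[0]
        for sample_id in sample_ids
    }


if __name__ == "__main__":
    # Example probabilities; these are not niceML defaults.
    splits = assign_splits(
        range(1000), {"train": 0.7, "validation": 0.2, "test": 0.1}
    )
    print(Counter(splits.values()))
```

Because each sample is drawn independently, the resulting set sizes only approximate the probabilities, and the approximation improves as the number of samples grows.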
Recap¶
Congratulations! You have learned how to
- generate a test dataset using the niceml gendata command,
- configure the test dataset generation to your needs.
This feature allows you to quickly create sample images with numbers placed randomly and obtain label information for each image.
You now know how to bootstrap your machine learning project using niceML's convenient data generation.