Generating a Test Dataset with niceML¶

In addition to pipelines for various machine learning tasks, niceML offers a convenient way to generate synthetic test datasets. This is particularly useful when you quickly need a sample dataset for testing or prototyping your models. In this tutorial, we will:

set the amount of test data,
set the output directory of the test data,
configure the targets (the numbers) drawn on the test images, and
generate a test dataset using the niceml gendata command.

Generate a Test Dataset¶

Follow the steps below to generate a test dataset:

Step 1: Run the `niceml gendata` Command¶

Open your terminal or command prompt and navigate to your project directory. Then, execute the following command:

niceml gendata

This generates the test data according to the default configuration (configs/jobs/job_data_generation/job_data_generation.yaml). The sections below explain how to adjust this configuration to your needs.

Data will be overwritten

Each time you run niceml gendata, the folders and images generated by this command are replaced. If you want to keep multiple versions of the generated data, rename the old folders or set a new output directory before creating a new test dataset.
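One way to keep an earlier version is to rename the data directory before regenerating. The following is a minimal sketch, not part of niceML: the helper name `backup_data_dir` is hypothetical, and it assumes that DATA_URI points to a local folder such as `./data`.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def backup_data_dir(data_dir: str = "./data") -> Optional[Path]:
    """Rename an existing data directory so a new run cannot replace it.

    Returns the backup path, or None if there was nothing to back up.
    Assumption: data_dir is a local folder (the value of DATA_URI).
    """
    source = Path(data_dir)
    if not source.is_dir():
        return None
    # A timestamp suffix keeps successive backups apart.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    target = source.with_name(f"{source.name}_{stamp}")
    shutil.move(str(source), str(target))
    return target
```

Calling `backup_data_dir()` before `niceml gendata` leaves the previous images in a folder such as `data_<timestamp>` next to the new output.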

Step 2: Explore the Generated Dataset¶

After the command execution is complete, you will find the generated images in the DATA_URI directory specified in your .env file. Each image will have numbers randomly placed on it, and the label information will be available as well.
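To get a quick overview of what was written, you can count the generated files by extension. This is a small sketch that makes no assumption about niceML's folder layout; it only assumes that DATA_URI is a local folder (here `./data`), and the function name `summarize_data_dir` is made up for this example.

```python
from collections import Counter
from pathlib import Path


def summarize_data_dir(data_dir: str = "./data") -> Counter:
    """Count all files below data_dir, grouped by lowercase file extension."""
    root = Path(data_dir)
    return Counter(
        path.suffix.lower() or "<no extension>"
        for path in root.rglob("*")
        if path.is_file()
    )
```

For example, `print(summarize_data_dir("./data"))` shows how many image and label files the run produced.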

What does the test data generated by niceML look like?

More information about what kind of data is generated and the corresponding folder structure is provided here.

Step 3: Customizing Data Generation (Optional)¶

niceML generates sample images with numbers placed randomly. However, you can further customize the data generation process according to your requirements. This includes defining the number of images you need, specifying the maximum number to display on the images, and configuring other generation options.

After changing any of these settings, rerun the niceml gendata command to create your new test dataset.

Set the output directory of the test data¶

The test data is written to the data directory set in the .env file. To change the directory, modify the .env file in your project directory. Uncomment (if necessary) and adjust the following parameter as needed:

# Access data
DATA_URI=./data  # path to your data

Set the number of test images to be generated¶

Modify the .env file in your project directory. Uncomment and adjust the following parameter as needed:

# Optional for number data generation
# SAMPLE_COUNT=<number of sample images to generate; e.g., 500>

Set the highest number to be placed on the images (= maximum number of target classes)¶

The MAX_NUMBER parameter defines the highest number that can be drawn on the images. It therefore also determines the maximum number of target classes present in the test dataset.

Modify the .env file in your project directory. Uncomment and adjust the following parameter as needed:

# Optional for number data generation
...
# MAX_NUMBER=<highest number to display on test images; e.g., 5>
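A setting only takes effect once its leading `#` is removed. The sketch below illustrates this with a deliberately simplified .env parser; it is not the loader niceML uses, and real dotenv loaders may treat quoting and inline comments differently. The values 500 and 5 are the examples from the templates above.

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines; lines starting with '#' are ignored."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "./data  # path to your data".
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


example = """\
# Access data
DATA_URI=./data  # path to your data

# Optional for number data generation
SAMPLE_COUNT=500
# MAX_NUMBER=5
"""

settings = parse_env(example)
print(settings)  # {'DATA_URI': './data', 'SAMPLE_COUNT': '500'}
```

MAX_NUMBER is missing from the result because its line is still commented out, so the default would apply.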

Other configurations¶

Besides the three main settings described above, you can adjust the data generation by modifying the corresponding YAML files.

Configurable parameters are:

configs/ops/data_generation/op_data_generation_number.yaml:

seed = seed of the random image generation
img_size:width = output test image width
img_size:height = output test image height
font_size_min = minimum font size of the numbers generated
font_size_max = maximum font size of the numbers generated
max_amount = maximum amount of numbers on one image
detection_labels = whether the label JSON-file should contain bounding box information
rotate = whether the numbers should be rotated on the image

configs/ops/split_data/op_split_data_number.yaml:

set_infos: probability = probability of a split (train, test, validation)
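To build intuition for per-split probabilities, here is a generic sketch of probability-based split assignment. It is an illustration only and not necessarily how niceML assigns samples; the function name, the seed, and the probabilities 0.7 / 0.2 / 0.1 are assumptions chosen for this example.

```python
import random
from collections import Counter
from typing import Dict, List


def assign_splits(
    sample_ids: List[str], probabilities: Dict[str, float], seed: int = 42
) -> Dict[str, str]:
    """Assign each sample to one split, drawn with the given probabilities."""
    total = sum(probabilities.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"Split probabilities must sum to 1, got {total}")
    rng = random.Random(seed)  # fixed seed makes the assignment reproducible
    names = list(probabilities)
    weights = [probabilities[name] for name in names]
    return {
        sample_id: rng.choices(names, weights=weights, k=1)[0]
        for sample_id in sample_ids
    }


# Hypothetical probabilities; the actual values live in the YAML file.
splits = assign_splits(
    [f"img_{i}" for i in range(1000)],
    {"train": 0.7, "validation": 0.2, "test": 0.1},
)
print(Counter(splits.values()))
```

With enough samples, the share of each split approaches its configured probability, while small datasets can deviate noticeably.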

Recap¶

Congratulations! You have learned how to

generate a test dataset using the niceml gendata command
configure the test dataset generation to your needs.

This feature allows you to quickly create sample images with numbers placed randomly and obtain label information for each image.

You now know how to bootstrap your machine learning project using niceML's convenient data generation.