
dfdataset

Module for dfdataset

Classes

DfDataset

DfDataset(
    id_key,
    subset_name,
    data_location,
    df_filename=ExperimentFilenames.SUBSET_NAME,
    shuffle=False,
    data_shuffler=None,
    dataframe_filters=None,
    feature_combiners=None,
    extra_key_list=None,
)

Bases: Dataset, ABC

Dataset for dataframes

Initializes an instance of DfDataset with the given arguments

Parameters:

  • id_key (str) –

    Column name of the id column in your dataframe

  • subset_name (str) –

    Name of the dataset

  • data_location (Union[dict, LocationConfig]) –

    Location of the data used in the data set

  • df_filename (str, default: SUBSET_NAME ) –

    Specify the file name of the dataframe

  • shuffle (bool, default: False ) –

    Flag to shuffle the data or not

  • dataframe_filters (Optional[List[DataframeFilter]], default: None ) –

    Optional list of dataframe filters to filter the data

  • data_shuffler (Optional[DataShuffler], default: None ) –

    Shuffler used to reorder the index list at the end of an epoch (falls back to DefaultDataShuffler)

  • feature_combiners (Optional[List[FeatureCombiner]], default: None ) –

    Optional list of feature combiners applied to the dataframe during initialization

  • extra_key_list (Optional[List[str]], default: None ) –

    Additional column names to include in the data info of each item

Source code in niceml/data/datasets/dfdataset.py
def __init__(  # ruff: noqa: PLR0913
    self,
    id_key: str,
    subset_name: str,
    data_location: Union[dict, LocationConfig],
    df_filename: str = ExperimentFilenames.SUBSET_NAME,
    shuffle: bool = False,
    data_shuffler: Optional[DataShuffler] = None,
    dataframe_filters: Optional[List[DataframeFilter]] = None,
    feature_combiners: Optional[List[FeatureCombiner]] = None,
    extra_key_list: Optional[List[str]] = None,
):
    """
    Initializes an instance of DfDataset with the given arguments

    Args:
        id_key: Column name of the id column in your dataframe
        subset_name: Name of the dataset
        data_location: Location of the data used in the data set
        df_filename: Specify the file name of the dataframe
        shuffle: Flag to shuffle the data or not
        dataframe_filters: Optional list of dataframe filters
                            to filter the data
    """
    super().__init__()
    self.data_shuffler = data_shuffler or DefaultDataShuffler()
    self.dataframe_filters = dataframe_filters or []
    self.df_path = df_filename
    self.data_location = data_location
    self.subset_name = subset_name
    self.id_key = id_key
    self.index_list = []
    self.shuffle = shuffle
    self.data: Optional[pd.DataFrame] = None
    self.inputs: List[dict] = []
    self.targets: List[dict] = []
    self.extra_key_list: List[str] = extra_key_list or []
    self.feature_combiners: List[FeatureCombiner] = feature_combiners or []
Functions
__getitem__
__getitem__(index)

The __getitem__ function returns the indexed data item.

Parameters:

  • index –

    Index of the item

Returns:

  • An item of input data and target data

Source code in niceml/data/datasets/dfdataset.py
def __getitem__(self, index):
    """
    The __getitem__ function returns the indexed data item.

     Args:
         index: Specify `index` of the item

     Returns:
         An item of input data and target data
    """
    input_data, target_data = self.get_data(index, index + 1)

    return input_data, target_data
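The delegation above can be mimicked in a few lines. This is a minimal sketch (class and field names are hypothetical, not the niceml API) showing how indexing with a one-item slice yields a single input/target pair:

```python
# Hypothetical stand-in for DfDataset's __getitem__/get_data delegation.
class SliceDataset:
    def __init__(self, inputs, targets):
        self.inputs = inputs
        self.targets = targets

    def get_data(self, start_idx, end_idx):
        # Slice both sequences with the same half-open range.
        return self.inputs[start_idx:end_idx], self.targets[start_idx:end_idx]

    def __getitem__(self, index):
        # A one-item slice [index, index + 1) gives a single pair.
        input_data, target_data = self.get_data(index, index + 1)
        return input_data, target_data

ds = SliceDataset([10, 20, 30], [1, 0, 1])
print(ds[1])  # ([20], [0])
```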
__len__
__len__()

The __len__ function is used to determine the number of steps in a dataset.

Returns:

  • The number of items

Source code in niceml/data/datasets/dfdataset.py
def __len__(self):
    """
    The __len__ function is used to determine the number of steps in a dataset.

    Returns:
        The number of items
    """
    return self.get_items_per_epoch()
extract_data
extract_data(cur_indexes, cur_input)

The extract_data function takes in a list of indexes and an input dictionary. The function then extracts the data from self.data using the key provided by the input dictionary, and returns it as a numpy array. If the type is categorical, it will convert it to one-hot encoding.

Parameters:

  • self

    Bind the method to an object

  • cur_indexes (List[int]) –

    Select the rows of data that are needed for the current batch

  • cur_input (dict) –

    Get the key and type of the input

Returns:

  • The data of the current key

Source code in niceml/data/datasets/dfdataset.py
def extract_data(self, cur_indexes: List[int], cur_input: dict):
    """
    The extract_data function takes in a list of indexes and an input dictionary.
    The function then extracts the data from `self.data` using the key provided by
    the input dictionary, and returns it as a numpy array. If the type is categorical,
    it will convert it to one-hot encoding.

    Args:
        self: Bind the method to an object
        cur_indexes: Select the rows of data that are needed for the current batch
        cur_input: Get the key and type of the input

    Returns:
        The data of the current key
    """
    cur_key = cur_input["key"]
    cur_data = self.data.iloc[cur_indexes][cur_key]
    if cur_input["type"] == FeatureType.CATEGORICAL:
        cur_data = to_categorical(cur_data, cur_input["value_count"])
    return cur_data
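The categorical branch relies on Keras' to_categorical to turn integer labels into one-hot rows. A minimal sketch of the same transformation, assuming integer class labels, using a NumPy identity-matrix lookup in place of the Keras helper:

```python
import numpy as np

# Assumed equivalent of to_categorical(labels, value_count) for integer labels:
# row i of the identity matrix is the one-hot vector for class i.
def one_hot(labels, value_count):
    return np.eye(value_count)[np.asarray(labels, dtype=int)]

print(one_hot([0, 2, 1], 3))
```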
get_all_data
get_all_data()

loads all data

Source code in niceml/data/datasets/dfdataset.py
def get_all_data(self):
    """loads all data"""
    return self.get_data_from_idx_list(self.index_list)
get_all_data_info
get_all_data_info()

The get_all_data_info function returns a list of RegDataInfo objects for all data in self.data.

Returns:

  • A list of RegDataInfo objects
Source code in niceml/data/datasets/dfdataset.py
def get_all_data_info(self) -> List[RegDataInfo]:
    """
    The get_all_data_info function returns a list of `RegDataInfo` objects for
    all data in `self.data`.

    Returns:
        A list of `RegDataInfo` objects

    """
    input_keys = [input_dict["key"] for input_dict in self.inputs]
    target_keys = [target_dict["key"] for target_dict in self.targets]
    data_subset = self.data[
        [self.id_key] + input_keys + target_keys + self.extra_key_list
    ]
    data_info_dicts: List[dict] = data_subset.to_dict("records")
    data_info_list: List[RegDataInfo] = []

    for data_info_dict in data_info_dicts:
        key = data_info_dict[self.id_key]
        data_info_dict.pop(self.id_key)
        data_info_list.append(RegDataInfo(key, data_info_dict))

    return data_info_list
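The record conversion above can be illustrated with a small dataframe. This is a sketch of the same to_dict("records") plus id-pop pattern, with hypothetical column names:

```python
import pandas as pd

# Hypothetical columns; "sample_id" plays the role of id_key.
df = pd.DataFrame({"sample_id": ["a", "b"], "target": [1.0, 2.0]})

# to_dict("records") yields one dict per row; popping the id column
# separates the key from the per-row payload.
records = df.to_dict("records")
key = records[0].pop("sample_id")
print(key, records[0])  # a {'target': 1.0}
```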
get_data
get_data(start_idx, end_idx)

loads data between indexes

Source code in niceml/data/datasets/dfdataset.py
def get_data(self, start_idx: int, end_idx: int):
    """loads data between indexes"""
    cur_indexes = self.index_list[start_idx:end_idx]
    return self.get_data_from_idx_list(cur_indexes)
get_data_by_key
get_data_by_key(data_key)

Returns all rows of the data whose id_key column matches the data_key.

Parameters:

  • data_key

    Identify the data that is being requested

Returns:

  • A dataframe of the rows where the self.id_key column matches the data_key parameter

Source code in niceml/data/datasets/dfdataset.py
def get_data_by_key(self, data_key):
    """
    Returns all rows of the data, whose 'id_key' matches the 'data_key'.

    Args:
        data_key: Identify the data that is being requested

    Returns:
        A dataframe of the rows where the `self.id_key` column matches the `data_key` parameter
    """
    return self.data.loc[self.data[self.id_key] == data_key]
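A boolean mask on the id column is all this lookup does, so duplicate ids return multiple rows. A small sketch with hypothetical column names:

```python
import pandas as pd

# "sample_id" plays the role of id_key; note the duplicate "a".
df = pd.DataFrame({"sample_id": ["a", "b", "a"], "value": [1, 2, 3]})

# Boolean mask: every row whose id equals the requested key.
matches = df.loc[df["sample_id"] == "a"]
print(matches["value"].tolist())  # [1, 3]
```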
get_data_from_idx_list
get_data_from_idx_list(index_list)

returns data with a given index_list

Source code in niceml/data/datasets/dfdataset.py
def get_data_from_idx_list(self, index_list: List[int]):
    """returns data with a given `index_list`"""
    input_data = []
    for cur_input in self.inputs:
        cur_data = self.extract_data(index_list, cur_input)
        input_data.append(np.array(cur_data))
    target_data = []
    for cur_target in self.targets:
        cur_data = self.extract_data(index_list, cur_target)
        target_data.append(np.array(cur_data))

    input_array = np.column_stack(input_data)
    target_array = np.column_stack(target_data)
    return input_array, target_array
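np.column_stack then joins the per-feature 1-D arrays into a (rows, features) matrix. A minimal illustration with two made-up feature columns:

```python
import numpy as np

# Two hypothetical feature columns, one value per row.
feature_a = np.array([1, 2, 3])
feature_b = np.array([10, 20, 30])

# column_stack treats each 1-D input as a column of the result.
input_array = np.column_stack([feature_a, feature_b])
print(input_array.shape)  # (3, 2)
```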
get_item_count
get_item_count()

Get the number of items in the dataset

Source code in niceml/data/datasets/dfdataset.py
def get_item_count(self) -> int:
    """Get the number of items in the dataset"""
    return len(self.data)
get_items_per_epoch
get_items_per_epoch()

Get the number of items per epoch

Source code in niceml/data/datasets/dfdataset.py
def get_items_per_epoch(self) -> int:
    """Get the number of items per epoch"""
    return len(self.index_list)
get_set_name
get_set_name()

The get_set_name function returns the name of the set.

Returns:

  • str

    The name of the data set

Source code in niceml/data/datasets/dfdataset.py
def get_set_name(self) -> str:
    """
    The get_set_name function returns the name of the set.

    Returns:
       The name of the data set
    """
    return self.subset_name
initialize
initialize(data_description, exp_context)

The initialize function should read in all the necessary files from disk and store them as attributes on this class instance. This function is called when the data set is created. It takes in a RegDataDescription object, which contains information about the inputs and targets of your dataset. The initialize function should also take in an ExperimentContext object, which contains information about where to find your data on disk. The ExperimentContext is not used in this class.

Parameters:

  • data_description (RegDataDescription) –

    RegDataDescription: Pass the data description of the dataset to this class

  • exp_context (ExperimentContext) –

    ExperimentContext: Pass the experiment context.

Source code in niceml/data/datasets/dfdataset.py
def initialize(
    self, data_description: RegDataDescription, exp_context: ExperimentContext
):
    """
    The initialize function should read in all the necessary files from disk and store them as
    attributes on this class instance.
    This function is called when the data set is created.
    It takes in a `RegDataDescription` object, which contains information about the
    inputs and targets of your dataset. The initialize function should also take in an
    `ExperimentContext` object, which contains information about where to find your
    data on disk. The `ExperimentContext` is not used in this class.


    Args:
        data_description: RegDataDescription: Pass the data description of the dataset
                            to this class
        exp_context: ExperimentContext: Pass the experiment context.
    """
    self.inputs = data_description.inputs
    self.targets = data_description.targets

    with open_location(self.data_location) as (data_fs, data_root):
        data_path = join_fs_path(
            data_fs, data_root, self.df_path.format(subset_name=self.subset_name)
        )
        self.data = read_parquet(filepath=data_path, file_system=data_fs)

    for feature_combiner in self.feature_combiners:
        self.data = feature_combiner.combine_features(self.data)

    for df_filter in self.dataframe_filters:
        df_filter.initialize(data_description=data_description)
        self.data = df_filter.filter(data=self.data)

    self.data = self.data.reset_index(drop=True)
    self.index_list = list(range(len(self.data)))

    self.on_epoch_end()
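One detail worth noting in initialize: df_filename may carry a {subset_name} placeholder that is filled in before the parquet file is loaded. A sketch of that templating step (the template string here is hypothetical):

```python
# Hypothetical template; ExperimentFilenames.SUBSET_NAME would supply the
# real default in niceml.
df_filename = "{subset_name}.parq"
subset_name = "train"

# str.format substitutes the subset name into the placeholder.
print(df_filename.format(subset_name=subset_name))  # train.parq
```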
iter_with_info
iter_with_info()

The iter_with_info function is a generator that yields the next batch of data, along with some additional information about the batch. The additional information is useful for various diagnostic purposes. The function returns an object of type DataIterator, which has two fields:

  • 'batch' contains the next batch of data.

  • 'info' contains additional information about that batch.

Returns:

  • A DataIterator object

Source code in niceml/data/datasets/dfdataset.py
def iter_with_info(self):
    """
    The iter_with_info function is a generator that yields the next batch of data,
    along with some additional information about the batch.
    The additional information is useful for various diagnostic purposes.
    The function returns an object of type DataIterator, which has two fields:
        * 'batch' contains the next batch of data.
        * 'info' contains additional information about that batch.

    Returns:
        A dataiterator object
    """
    return DataIterator(self)
on_epoch_end
on_epoch_end()

Execute logic to be performed at the end of an epoch (e.g. shuffling the data)

Source code in niceml/data/datasets/dfdataset.py
def on_epoch_end(self):
    """
    Execute logic to be performed at the end of an epoch (e.g. shuffling the data)
    """
    if self.shuffle:
        self.index_list = self.data_shuffler.shuffle(
            data_infos=self.get_all_data_info()
        )
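When shuffle is enabled, only the index list is re-ordered; the dataframe itself stays untouched, so each epoch visits the same rows in a new order. A sketch of the idea with random.sample standing in for the configured DataShuffler:

```python
import random

# Index list over five rows of a (hypothetical) dataframe.
index_list = list(range(5))

# random.sample without replacement over the full list is a permutation,
# so the same indices survive, just re-ordered.
shuffled = random.sample(index_list, k=len(index_list))
print(sorted(shuffled) == index_list)  # True
```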

RegDataInfo dataclass

Bases: DataInfo

Datainfo for Regression data

Functions
__getattr__
__getattr__(item)

The __getattr__ function is called when an attribute lookup has not found the attribute in the usual places (i.e. it is not an instance attribute nor is it found in the class tree of self). In that case, the value of the key (item) in the self.data dictionary is returned.

Parameters:

  • item

    Access the value of a key in self.data

Returns:

  • Any

    The value of the key (item) in the self.data dictionary

Source code in niceml/data/datasets/dfdataset.py
def __getattr__(self, item) -> Any:
    """
    The __getattr__ function is called when an attribute lookup has not found the attribute
    in the usual places (i.e. it is not an instance attribute nor is it
    found in the class tree of self). In this case it is the value of the key (`item`)
    in the `self.data` dictionary

    Args:
        item: Access the value of a key in `self.data`

    Returns:
        The value of the key (`item`) in the `self.data` dictionary
    """
    return self.data[item]
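Because __getattr__ only fires when normal attribute lookup fails, stored keys become readable as attributes while real attributes like data are unaffected. A self-contained sketch (class and key names hypothetical):

```python
# Hypothetical stand-in for RegDataInfo's attribute fallback.
class InfoRecord:
    def __init__(self, data):
        self.data = data  # normal attribute; __getattr__ never sees it

    def __getattr__(self, item):
        # Only reached when ordinary lookup fails.
        return self.data[item]

record = InfoRecord({"score": 0.9})
print(record.score)  # 0.9
```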
get_identifier
get_identifier()

The get_identifier function returns the dataid of this object.

Returns:

  • str

    The dataid

Source code in niceml/data/datasets/dfdataset.py
def get_identifier(self) -> str:
    """
    The get_identifier function returns the dataid of this object.

    Returns:
        The dataid
    """
    return self.dataid
get_info_dict
get_info_dict()

The get_info_dict function returns a dictionary containing the dataid and all the data in self.data.

Returns:

  • dict

    A dictionary containing the dataid and all the other key-value pairs in self

Source code in niceml/data/datasets/dfdataset.py
def get_info_dict(self) -> dict:
    """
    The get_info_dict function returns a dictionary containing the dataid
    and all the data in self.data.

    Returns:
        A dictionary containing the dataid and all the other key-value pairs in self
    """
    info_dict = dict(dataid=self.dataid)
    info_dict.update(self.data)
    return info_dict
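The method is a plain dict merge: the identifier goes in first, then the remaining key-value pairs. A sketch with hypothetical values:

```python
# Hypothetical identifier and payload.
dataid = "sample-01"
data = {"x": 1.5, "y": 0}

# Start from the id, then merge the rest of the record in.
info_dict = dict(dataid=dataid)
info_dict.update(data)
print(info_dict)  # {'dataid': 'sample-01', 'x': 1.5, 'y': 0}
```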

Functions