mlops.dataset package

Submodules

mlops.dataset.data_processor module

Contains the DataProcessor class.

class mlops.dataset.data_processor.DataProcessor

Bases: object

Transforms a raw dataset into features and labels for downstream model training, prediction, etc.

get_preprocessed_features(dataset_path: str) → Dict[str, numpy.ndarray]

Transforms the raw data at the given file or directory into features that can be used by downstream models. The data at the path may be training/validation/test data, a batch of user data intended for prediction, or data in some other format. Downstream models can expect the features returned by this function to be preprocessed in any way required for model consumption.

Parameters

dataset_path – The path to the file or directory on the local or remote filesystem containing the dataset.

Returns

A dictionary whose values are feature tensors and whose corresponding keys are the names by which those tensors should be referenced. For example, the training features (value) may be called ‘X_train’ (key).

get_preprocessed_features_and_labels(dataset_path: str) → Tuple[Dict[str, numpy.ndarray], Dict[str, numpy.ndarray]]

Returns the preprocessed feature and label tensors from the dataset path. This method is specifically used for the train/val/test sets and not input data for prediction, because in some cases the features and labels need to be read simultaneously to ensure proper ordering.

Parameters

dataset_path – The path to the file or directory on the local or remote filesystem containing the dataset, specifically train/val/test and not prediction data.

Returns

A 2-tuple of the features dictionary and labels dictionary, with matching keys and ordered tensors.

abstract get_raw_features(dataset_path: str) → Dict[str, numpy.ndarray]

Returns the raw feature tensors from the dataset path. The raw features are how training/validation/test as well as prediction data enter the data pipeline. For example, when handling image data, the raw features would likely be tensors of shape m x h x w x c, where m is the number of images, h is the image height, w is the image width, and c is the number of channels (3 for RGB), with all values in the interval [0, 255].

Parameters

dataset_path – The path to the file or directory on the local or remote filesystem containing the dataset.

Returns

A dictionary whose values are feature tensors and whose corresponding keys are the names by which those tensors should be referenced. For example, the training features (value) may be called ‘X_train’ (key).

abstract get_raw_features_and_labels(dataset_path: str) → Tuple[Dict[str, numpy.ndarray], Dict[str, numpy.ndarray]]

Returns the raw feature and label tensors from the dataset path. This method is specifically used for the train/val/test sets and not input data for prediction, because in some cases the features and labels need to be read simultaneously to ensure proper ordering.

For example, when handling image data, the raw features would likely be tensors of shape m x h x w x c, where m is the number of images, h is the image height, w is the image width, and c is the number of channels (3 for RGB), with all values in the interval [0, 255]. The raw labels may be tensors of shape m, where m is the number of examples, with all values in the set {0, …, k - 1} indicating the class.

Parameters

dataset_path – The path to the file or directory on the local or remote filesystem containing the dataset, specifically train/val/test and not prediction data.

Returns

A 2-tuple of the features dictionary and labels dictionary, with matching keys and ordered tensors.

abstract preprocess_features(raw_feature_tensor: numpy.ndarray) → numpy.ndarray

Returns the preprocessed feature tensor from the raw tensor. The preprocessed features are how training/validation/test as well as prediction data are fed into downstream models. For example, when handling image data, the preprocessed features would likely be tensors of shape m x h x w x c, where m is the number of images, h is the image height, w is the image width, and c is the number of channels (3 for RGB), with all values in the interval [0, 1].
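Continuing the image-data example above, an override of this abstract method might simply rescale pixel values from [0, 255] to [0, 1]. The following is a minimal standalone sketch of such an override (a hypothetical transformation, not the library's implementation; the actual preprocessing is dataset-specific):

```python
import numpy as np

def preprocess_features(raw_feature_tensor: np.ndarray) -> np.ndarray:
    """Rescale raw pixel values from [0, 255] to [0, 1].

    A hypothetical image-data override; real preprocessing is
    dataset-specific.
    """
    return raw_feature_tensor.astype(np.float32) / 255.0

# A batch of 2 RGB images of size 4 x 4 (m x h x w x c).
raw = np.random.randint(0, 256, size=(2, 4, 4, 3), dtype=np.uint8)
processed = preprocess_features(raw)
assert processed.shape == raw.shape
assert processed.min() >= 0.0 and processed.max() <= 1.0
```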

Parameters

raw_feature_tensor – The raw features to be preprocessed.

Returns

The preprocessed feature tensor. This tensor is ready for downstream model consumption.

abstract preprocess_labels(raw_label_tensor: numpy.ndarray) → numpy.ndarray

Returns the preprocessed label tensor from the raw tensor. The preprocessed labels are how training/validation/test as well as prediction data are fed into downstream models. For example, in a classification task, the preprocessed labels may be tensors of shape m x k, where m is the number of examples, and k is the number of classes, where each of the k-length vectors are one-hot encoded.
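The one-hot encoding described above can be sketched in a few lines of numpy. This is a hypothetical override for a classification task, not the library's implementation:

```python
import numpy as np

def preprocess_labels(raw_label_tensor: np.ndarray) -> np.ndarray:
    """One-hot encode class indices in {0, ..., k - 1} into an m x k tensor.

    A hypothetical classification-task override.
    """
    num_classes = int(raw_label_tensor.max()) + 1
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=np.float32)[raw_label_tensor]

raw_labels = np.array([0, 2, 1, 2])      # m = 4 examples, k = 3 classes
one_hot = preprocess_labels(raw_labels)  # shape (4, 3)
```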

Parameters

raw_label_tensor – The raw labels to be preprocessed.

Returns

The preprocessed label tensor. This tensor is ready for downstream model consumption.

mlops.dataset.invertible_data_processor module

Contains the InvertibleDataProcessor class.

class mlops.dataset.invertible_data_processor.InvertibleDataProcessor

Bases: mlops.dataset.data_processor.DataProcessor

A DataProcessor that can invert any preprocessing to transform preprocessed data back into raw, real-world values for analysis and interpretability.

abstract unpreprocess_features(feature_tensor: numpy.ndarray) → numpy.ndarray

Returns the raw feature tensor from the preprocessed tensor; inverts preprocessing. Improves model interpretability by enabling users to transform model inputs into real-world values.

Parameters

feature_tensor – The preprocessed features to be inverted.

Returns

The raw feature tensor.

abstract unpreprocess_labels(label_tensor: numpy.ndarray) → numpy.ndarray

Returns the raw label tensor from the preprocessed tensor; inverts preprocessing. Improves model interpretability by enabling users to transform model outputs into real-world values.

Parameters

label_tensor – The preprocessed labels to be inverted.

Returns

The raw label tensor.
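As a standalone sketch of the preprocess/unpreprocess contract (a hypothetical image-scaling pair, not the library's implementation), each unpreprocess method should round-trip its corresponding preprocess method back to raw, real-world values:

```python
import numpy as np

def preprocess_features(raw: np.ndarray) -> np.ndarray:
    """Scale raw pixel values from [0, 255] down to [0, 1]."""
    return raw.astype(np.float32) / 255.0

def unpreprocess_features(preprocessed: np.ndarray) -> np.ndarray:
    """Invert the scaling, recovering raw pixel values in [0, 255]."""
    return np.rint(preprocessed * 255.0).astype(np.uint8)

raw = np.array([[0, 128, 255]], dtype=np.uint8)
recovered = unpreprocess_features(preprocess_features(raw))
```

The round trip recovers the original tensor exactly, which is the property that makes model inputs and outputs interpretable in real-world units.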

mlops.dataset.pathless_data_processor module

Contains the PathlessDataProcessor class.

class mlops.dataset.pathless_data_processor.PathlessDataProcessor(features: Dict[str, numpy.ndarray], labels: Dict[str, numpy.ndarray])

Bases: mlops.dataset.data_processor.DataProcessor

Loads preset features and labels.

get_raw_features(dataset_path: str) → Dict[str, numpy.ndarray]

Returns the training features.

Parameters

dataset_path – Unused.

Returns

A dictionary whose values are feature tensors and whose corresponding keys are the names by which those tensors should be referenced. For example, the training features (value) may be called ‘X_train’ (key).

get_raw_features_and_labels(dataset_path: str) → Tuple[Dict[str, numpy.ndarray], Dict[str, numpy.ndarray]]

Returns the training features and labels.

Parameters

dataset_path – Unused.

Returns

A 2-tuple of the features dictionary and labels dictionary, with matching keys and ordered tensors.

preprocess_features(raw_feature_tensor: numpy.ndarray) → numpy.ndarray

Returns the identity function on the input features.

Parameters

raw_feature_tensor – The raw features to be preprocessed.

Returns

The preprocessed feature tensor. This tensor is ready for downstream model consumption.

preprocess_labels(raw_label_tensor: numpy.ndarray) → numpy.ndarray

Returns the identity function on the input labels.

Parameters

raw_label_tensor – The raw labels to be preprocessed.

Returns

The preprocessed label tensor. This tensor is ready for downstream model consumption.
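Since PathlessDataProcessor holds preset tensors and treats preprocessing as the identity function, its described behavior can be sketched in standalone numpy. PresetProcessor below is a simplified stand-in for illustration, not the library class:

```python
import numpy as np

class PresetProcessor:
    """Simplified stand-in for PathlessDataProcessor: stores preset
    feature/label dictionaries and applies identity preprocessing."""

    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def get_raw_features(self, dataset_path: str):
        # dataset_path is accepted for interface compatibility but unused.
        return self.features

    def get_raw_features_and_labels(self, dataset_path: str):
        return self.features, self.labels

    def preprocess_features(self, raw_feature_tensor):
        return raw_feature_tensor  # identity

    def preprocess_labels(self, raw_label_tensor):
        return raw_label_tensor    # identity

processor = PresetProcessor({'X_train': np.zeros((4, 2))},
                            {'y_train': np.zeros(4)})
features, labels = processor.get_raw_features_and_labels('unused/path')
```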

mlops.dataset.pathless_versioned_dataset_builder module

Contains the PathlessVersionedDatasetBuilder class.

class mlops.dataset.pathless_versioned_dataset_builder.PathlessVersionedDatasetBuilder(features: Dict[str, numpy.ndarray], labels: Dict[str, numpy.ndarray])

Bases: mlops.dataset.versioned_dataset_builder.VersionedDatasetBuilder

Builds a versioned dataset directly from feature and label tensors.

publish(path: str, *args: Any, name: str = 'dataset', version: Optional[str] = None, dataset_copy_strategy: str = 'link', tags: Optional[List[str]] = None, **kwargs: Any) → str

Saves the versioned dataset files to the given path. If the path and appended version already exist, this operation will raise a PublicationPathAlreadyExistsError.

The following files will be created under path/version/ (the publication path and version):

  • X_train.npy (and other feature tensors by their given names)

  • y_train.npy (and other label tensors by their given names)

  • data_processor.pkl (DataProcessor object)

  • meta.json (metadata)

  • raw.tar.bz2 (bz2-zipped directory with the raw dataset files)

The contents of meta.json will be:

{
    name: (dataset name)
    version: (dataset version)
    hash: (MD5 hash of all objects apart from data_processor.pkl and meta.json)
    created_at: (timestamp)
    tags: (optional list of tags)
}

Parameters
  • path – The path, either on the local filesystem or in a cloud store such as S3, to which the dataset should be saved. The version will be appended to this path as a subdirectory. An S3 path should be a URL of the form “s3://bucket-name/path/to/dir”. It is recommended to use this same path to publish all datasets, since it will prevent the user from creating two different datasets with the same version.

  • name – The name of the dataset, e.g., “mnist”.

  • version – A string indicating the dataset version. The version should be unique to this dataset. If None, the publication timestamp will be used as the version.

  • dataset_copy_strategy – The strategy by which to copy the original, raw dataset to the published path. STRATEGY_COPY recursively copies all files and directories from the dataset path supplied at instantiation to the published path so that the dataset can be properly versioned. STRATEGY_COPY_ZIP is identical in behavior, but zips the directory upon completion. STRATEGY_LINK will instead create a file ‘link.txt’ containing the supplied dataset path; this is desirable if the raw dataset is already stored in a versioned repository, and copying would create an unnecessary duplicate.

  • tags – An optional list of string tags to add to the dataset metadata.

Returns

The versioned dataset’s publication path.

mlops.dataset.versioned_dataset module

Contains the VersionedDataset class.

class mlops.dataset.versioned_dataset.VersionedDataset(path: str)

Bases: mlops.artifact.versioned_artifact.VersionedArtifact

Represents a versioned dataset.

property md5: str

Returns the artifact’s MD5 hash.

Returns

The artifact’s MD5 hash.

property metadata_path: str

Returns the local or remote path to the artifact’s metadata.

Returns

The local or remote path to the artifact’s metadata.

property name: str

Returns the artifact’s name.

Returns

The artifact’s name.

property path: str

Returns the local or remote path to the artifact.

Returns

The local or remote path to the artifact.

property version: str

Returns the artifact’s version.

Returns

The artifact’s version.

mlops.dataset.versioned_dataset_builder module

Contains the VersionedDatasetBuilder class.

class mlops.dataset.versioned_dataset_builder.VersionedDatasetBuilder(dataset_path: str, data_processor: mlops.dataset.data_processor.DataProcessor)

Bases: mlops.artifact.versioned_artifact_builder.VersionedArtifactBuilder

An object containing all of the components that form a versioned dataset. This object is only used to ensure a standard format for datasets stored in a dataset archive (such as the local filesystem or S3), and is not meant for consumption by downstream models.

publish(path: str, *args: Any, name: str = 'dataset', version: Optional[str] = None, dataset_copy_strategy: str = 'copy_zip', tags: Optional[List[str]] = None, **kwargs: Any) → str

Saves the versioned dataset files to the given path. If the path and appended version already exist, this operation will raise a PublicationPathAlreadyExistsError.

The following files will be created under path/version/ (the publication path and version):

  • X_train.npy (and other feature tensors by their given names)

  • y_train.npy (and other label tensors by their given names)

  • data_processor.pkl (DataProcessor object)

  • meta.json (metadata)

  • raw.tar.bz2 (bz2-zipped directory with the raw dataset files)

The contents of meta.json will be:

{
    name: (dataset name)
    version: (dataset version)
    hash: (MD5 hash of all objects apart from data_processor.pkl and meta.json)
    created_at: (timestamp)
    tags: (optional list of tags)
}

Parameters
  • path – The path, either on the local filesystem or in a cloud store such as S3, to which the dataset should be saved. The version will be appended to this path as a subdirectory. An S3 path should be a URL of the form “s3://bucket-name/path/to/dir”. It is recommended to use this same path to publish all datasets, since it will prevent the user from creating two different datasets with the same version.

  • name – The name of the dataset, e.g., “mnist”.

  • version – A string indicating the dataset version. The version should be unique to this dataset. If None, the publication timestamp will be used as the version.

  • dataset_copy_strategy – The strategy by which to copy the original, raw dataset to the published path. STRATEGY_COPY recursively copies all files and directories from the dataset path supplied at instantiation to the published path so that the dataset can be properly versioned. STRATEGY_COPY_ZIP is identical in behavior, but zips the directory upon completion. STRATEGY_LINK will instead create a file ‘link.txt’ containing the supplied dataset path; this is desirable if the raw dataset is already stored in a versioned repository, and copying would create an unnecessary duplicate.

  • tags – An optional list of string tags to add to the dataset metadata.

Returns

The versioned dataset’s publication path.
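The hash field in meta.json described above can be illustrated with a small standalone sketch. This is an assumption about the general approach (hash file contents in a deterministic order, skipping data_processor.pkl and meta.json), not the library's exact implementation:

```python
import hashlib
import os
import tempfile

EXCLUDED = {'data_processor.pkl', 'meta.json'}

def dataset_hash(publication_dir: str) -> str:
    """MD5 over the contents of all published files except the excluded
    ones, visited in sorted order for determinism."""
    md5 = hashlib.md5()
    for name in sorted(os.listdir(publication_dir)):
        if name in EXCLUDED:
            continue
        with open(os.path.join(publication_dir, name), 'rb') as f:
            md5.update(f.read())
    return md5.hexdigest()

# Build a toy publication directory and compute its metadata fields.
with tempfile.TemporaryDirectory() as pub:
    for name, data in [('X_train.npy', b'features'),
                       ('y_train.npy', b'labels'),
                       ('data_processor.pkl', b'processor')]:
        with open(os.path.join(pub, name), 'wb') as f:
            f.write(data)
    meta = {'name': 'dataset', 'version': 'v1', 'hash': dataset_hash(pub)}
```

Excluding data_processor.pkl and meta.json means the hash identifies the dataset's tensors and raw files alone, so re-pickling the processor does not change the dataset's identity.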

Module contents

Contains dataset modules.