mlops.dataset package

Submodules

mlops.dataset.data_processor module

Contains the DataProcessor class.

class mlops.dataset.data_processor.DataProcessor

Bases: object

Transforms a raw dataset into features and labels for downstream model training, prediction, etc.

get_preprocessed_features(dataset_path: str) → Dict[str, numpy.ndarray]

Transforms the raw data at the given file or directory into features that can be used by downstream models. The data at the path may be training/validation/test data, a batch of user data intended for prediction, or data in some other format. Downstream models can expect the features returned by this function to be preprocessed in any way required for model consumption.

Parameters

dataset_path – The path to the file or directory on the local or remote filesystem containing the dataset.

Returns

A dictionary whose values are feature tensors and whose corresponding keys are the names by which those tensors should be referenced. For example, the training features (value) may be called ‘X_train’ (key).

get_preprocessed_features_and_labels(dataset_path: str) → Tuple[Dict[str, numpy.ndarray], Dict[str, numpy.ndarray]]

Returns the preprocessed feature and label tensors from the dataset path. This method is specifically used for the train/val/test sets and not input data for prediction, because in some cases the features and labels need to be read simultaneously to ensure proper ordering.

Parameters

dataset_path – The path to the file or directory on the local or remote filesystem containing the dataset, specifically train/val/test and not prediction data.

Returns

A 2-tuple of the features dictionary and labels dictionary, with matching keys and ordered tensors.

abstract get_raw_features(dataset_path: str) → Dict[str, numpy.ndarray]

Returns the raw feature tensors from the dataset path. The raw features are how training/validation/test as well as prediction data enter the data pipeline. For example, when handling image data, the raw features would likely be tensors of shape m x h x w x c, where m is the number of images, h is the image height, w is the image width, and c is the number of channels (3 for RGB), with all values in the interval [0, 255].

Parameters

dataset_path – The path to the file or directory on the local or remote filesystem containing the dataset.

Returns

A dictionary whose values are feature tensors and whose corresponding keys are the names by which those tensors should be referenced. For example, the training features (value) may be called ‘X_train’ (key).

abstract get_raw_features_and_labels(dataset_path: str) → Tuple[Dict[str, numpy.ndarray], Dict[str, numpy.ndarray]]

Returns the raw feature and label tensors from the dataset path. This method is specifically used for the train/val/test sets and not input data for prediction, because in some cases the features and labels need to be read simultaneously to ensure proper ordering.

For example, when handling image data, the raw features would likely be tensors of shape m x h x w x c, where m is the number of images, h is the image height, w is the image width, and c is the number of channels (3 for RGB), with all values in the interval [0, 255]. The raw labels may be tensors of shape m, where m is the number of examples, with all values in the set {0, …, k - 1} indicating the class.

Parameters

dataset_path – The path to the file or directory on the local or remote filesystem containing the dataset, specifically train/val/test and not prediction data.

Returns

A 2-tuple of the features dictionary and labels dictionary, with matching keys and ordered tensors.

abstract preprocess_features(raw_feature_tensor: numpy.ndarray) → numpy.ndarray

Returns the preprocessed feature tensor from the raw tensor. The preprocessed features are how training/validation/test as well as prediction data are fed into downstream models. For example, when handling image data, the preprocessed features would likely be tensors of shape m x h x w x c, where m is the number of images, h is the image height, w is the image width, and c is the number of channels (3 for RGB), with all values in the interval [0, 1].
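Continuing the image-data example above, an override of this abstract method might simply rescale pixel values from [0, 255] to [0, 1]. The following is a minimal standalone sketch of such an override (a hypothetical transformation, not the library's implementation; the actual preprocessing is dataset-specific):

```python
import numpy as np

def preprocess_features(raw_feature_tensor: np.ndarray) -> np.ndarray:
    """Rescale raw pixel values from [0, 255] to [0, 1].

    A hypothetical image-data override; real preprocessing is
    dataset-specific.
    """
    return raw_feature_tensor.astype(np.float32) / 255.0

# A batch of 2 RGB images of size 4 x 4 (m x h x w x c).
raw = np.random.randint(0, 256, size=(2, 4, 4, 3), dtype=np.uint8)
processed = preprocess_features(raw)
assert processed.shape == raw.shape
assert processed.min() >= 0.0 and processed.max() <= 1.0
```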

Parameters

raw_feature_tensor – The raw features to be preprocessed.

Returns

The preprocessed feature tensor. This tensor is ready for downstream model consumption.

abstract preprocess_labels(raw_label_tensor: numpy.ndarray) → numpy.ndarray

Returns the preprocessed label tensor from the raw tensor. The preprocessed labels are how training/validation/test as well as prediction data are fed into downstream models. For example, in a classification task, the preprocessed labels may be tensors of shape m x k, where m is the number of examples, and k is the number of classes, where each of the k-length vectors are one-hot encoded.
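The one-hot encoding described above can be sketched in a few lines of numpy. This is a hypothetical override for a classification task, not the library's implementation:

```python
import numpy as np

def preprocess_labels(raw_label_tensor: np.ndarray) -> np.ndarray:
    """One-hot encode class indices in {0, ..., k - 1} into an m x k tensor.

    A hypothetical classification-task override.
    """
    num_classes = int(raw_label_tensor.max()) + 1
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=np.float32)[raw_label_tensor]

raw_labels = np.array([0, 2, 1, 2])      # m = 4 examples, k = 3 classes
one_hot = preprocess_labels(raw_labels)  # shape (4, 3)
```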

Parameters

raw_label_tensor – The raw labels to be preprocessed.

Returns

The preprocessed label tensor. This tensor is ready for downstream model consumption.

mlops.dataset.invertible_data_processor module

Contains the InvertibleDataProcessor class.

class mlops.dataset.invertible_data_processor.InvertibleDataProcessor

Bases: mlops.dataset.data_processor.DataProcessor

A DataProcessor that can invert any preprocessing to transform preprocessed data back into raw, real-world values for analysis and interpretability.

abstract unpreprocess_features(feature_tensor: numpy.ndarray) → numpy.ndarray

Returns the raw feature tensor from the preprocessed tensor; inverts preprocessing. Improves model interpretability by enabling users to transform model inputs into real-world values.

Parameters

feature_tensor – The preprocessed features to be inverted.

Returns

The raw feature tensor.

abstract unpreprocess_labels(label_tensor: numpy.ndarray) → numpy.ndarray

Returns the raw label tensor from the preprocessed tensor; inverts preprocessing. Improves model interpretability by enabling users to transform model outputs into real-world values.

Parameters

label_tensor – The preprocessed labels to be inverted.

Returns

The raw label tensor.
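As a standalone sketch of the preprocess/unpreprocess contract (a hypothetical image-scaling pair, not the library's implementation), each unpreprocess method should round-trip its corresponding preprocess method back to raw, real-world values:

```python
import numpy as np

def preprocess_features(raw: np.ndarray) -> np.ndarray:
    """Scale raw pixel values from [0, 255] down to [0, 1]."""
    return raw.astype(np.float32) / 255.0

def unpreprocess_features(preprocessed: np.ndarray) -> np.ndarray:
    """Invert the scaling, recovering raw pixel values in [0, 255]."""
    return np.rint(preprocessed * 255.0).astype(np.uint8)

raw = np.array([[0, 128, 255]], dtype=np.uint8)
recovered = unpreprocess_features(preprocess_features(raw))
```

The round trip recovers the original tensor exactly, which is the property that makes model inputs and outputs interpretable in real-world units.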

mlops.dataset.pathless_data_processor module

Contains the PathlessDataProcessor class.

class mlops.dataset.pathless_data_processor.PathlessDataProcessor(features: Dict[str, numpy.ndarray], labels: Dict[str, numpy.ndarray])

Bases: mlops.dataset.data_processor.DataProcessor

Loads preset features and labels.

get_raw_features(dataset_path: str) → Dict[str, numpy.ndarray]

Returns the training features.

Parameters

dataset_path – Unused.

Returns

A dictionary whose values are feature tensors and whose corresponding keys are the names by which those tensors should be referenced. For example, the training features (value) may be called ‘X_train’ (key).

get_raw_features_and_labels(dataset_path: str) → Tuple[Dict[str, numpy.ndarray], Dict[str, numpy.ndarray]]

Returns the training features and labels.

Parameters

dataset_path – Unused.

Returns

A 2-tuple of the features dictionary and labels dictionary, with matching keys and ordered tensors.

preprocess_features(raw_feature_tensor: numpy.ndarray) → numpy.ndarray

Returns the identity function on the input features.

Parameters

raw_feature_tensor – The raw features to be preprocessed.

Returns

The preprocessed feature tensor. This tensor is ready for downstream model consumption.

preprocess_labels(raw_label_tensor: numpy.ndarray) → numpy.ndarray

Returns the identity function on the input labels.

Parameters

raw_label_tensor – The raw labels to be preprocessed.

Returns

The preprocessed label tensor. This tensor is ready for downstream model consumption.
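Since PathlessDataProcessor holds preset tensors and treats preprocessing as the identity function, its described behavior can be sketched in standalone numpy. PresetProcessor below is a simplified stand-in for illustration, not the library class:

```python
import numpy as np

class PresetProcessor:
    """Simplified stand-in for PathlessDataProcessor: stores preset
    feature/label dictionaries and applies identity preprocessing."""

    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def get_raw_features(self, dataset_path: str):
        # dataset_path is accepted for interface compatibility but unused.
        return self.features

    def get_raw_features_and_labels(self, dataset_path: str):
        return self.features, self.labels

    def preprocess_features(self, raw_feature_tensor):
        return raw_feature_tensor  # identity

    def preprocess_labels(self, raw_label_tensor):
        return raw_label_tensor    # identity

processor = PresetProcessor({'X_train': np.zeros((4, 2))},
                            {'y_train': np.zeros(4)})
features, labels = processor.get_raw_features_and_labels('unused/path')
```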

mlops.dataset.pathless_versioned_dataset_builder module

Contains the PathlessVersionedDatasetBuilder class.

class mlops.dataset.pathless_versioned_dataset_builder.PathlessVersionedDatasetBuilder(features: Dict[str, numpy.ndarray], labels: Dict[str, numpy.ndarray])

Bases: mlops.dataset.versioned_dataset_builder.VersionedDatasetBuilder

Builds a versioned dataset directly from feature and label tensors.

publish(path: str, *args: Any, name: str = 'dataset', version: Optional[str] = None, dataset_copy_strategy: str = 'link', tags: Optional[List[str]] = None, **kwargs: Any) → str

Saves the versioned dataset files to the given path. If the path and appended version already exist, this operation will raise a PublicationPathAlreadyExistsError.

The following files will be created under path/version/ (the publication path and version):

  • X_train.npy (and other feature tensors by their given names)

  • y_train.npy (and other label tensors by their given names)

  • data_processor.pkl (DataProcessor object)

  • meta.json (metadata)

  • raw.tar.bz2 (bz2-zipped directory with the raw dataset files)

The contents of meta.json will be:

{
    name: (dataset name)
    version: (dataset version)
    hash: (MD5 hash of all objects apart from data_processor.pkl and meta.json)
    created_at: (timestamp)
    tags: (optional list of tags)
}

Parameters
  • path – The path, either on the local filesystem or in a cloud store such as S3, to which the dataset should be saved. The version will be appended to this path as a subdirectory. An S3 path should be a URL of the form “s3://bucket-name/path/to/dir”. It is recommended to use this same path to publish all datasets, since it will prevent the user from creating two different datasets with the same version.

  • name – The name of the dataset, e.g., “mnist”.

  • version – A string indicating the dataset version. The version should be unique to this dataset. If None, the publication timestamp will be used as the version.

  • dataset_copy_strategy – The strategy by which to copy the original, raw dataset to the published path. STRATEGY_COPY recursively copies all files and directories from the dataset path supplied at instantiation to the published path so that the dataset can be properly versioned. STRATEGY_COPY_ZIP is identical in behavior, but zips the directory upon completion. STRATEGY_LINK will instead create a file ‘link.txt’ containing the supplied dataset path; this is desirable if the raw dataset is already stored in a versioned repository, and copying would create an unnecessary duplicate.

  • tags – An optional list of string tags to add to the dataset metadata.

Returns

The versioned dataset’s publication path.

mlops.dataset.versioned_dataset module

Contains the VersionedDataset class.

class mlops.dataset.versioned_dataset.VersionedDataset(path: str)

Bases: mlops.artifact.versioned_artifact.VersionedArtifact

Represents a versioned dataset.

property md5: str

Returns the artifact’s MD5 hash.

Returns

The artifact’s MD5 hash.

property metadata_path: str

Returns the local or remote path to the artifact’s metadata.

Returns

The local or remote path to the artifact’s metadata.

property name: str

Returns the artifact’s name.

Returns

The artifact’s name.

property path: str

Returns the local or remote path to the artifact.

Returns

The local or remote path to the artifact.

property version: str

Returns the artifact’s version.

Returns

The artifact’s version.

mlops.dataset.versioned_dataset_builder module

Contains the VersionedDatasetBuilder class.

class mlops.dataset.versioned_dataset_builder.VersionedDatasetBuilder(dataset_path: str, data_processor: mlops.dataset.data_processor.DataProcessor)

Bases: mlops.artifact.versioned_artifact_builder.VersionedArtifactBuilder

An object containing all of the components that form a versioned dataset. This object is only used to ensure a standard format for datasets stored in a dataset archive (such as the local filesystem or S3), and is not meant for consumption by downstream models.

publish(path: str, *args: Any, name: str = 'dataset', version: Optional[str] = None, dataset_copy_strategy: str = 'copy_zip', tags: Optional[List[str]] = None, **kwargs: Any) → str

Saves the versioned dataset files to the given path. If the path and appended version already exist, this operation will raise a PublicationPathAlreadyExistsError.

The following files will be created under path/version/ (the publication path and version):

  • X_train.npy (and other feature tensors by their given names)

  • y_train.npy (and other label tensors by their given names)

  • data_processor.pkl (DataProcessor object)

  • meta.json (metadata)

  • raw.tar.bz2 (bz2-zipped directory with the raw dataset files)

The contents of meta.json will be:

{
    name: (dataset name)
    version: (dataset version)
    hash: (MD5 hash of all objects apart from data_processor.pkl and meta.json)
    created_at: (timestamp)
    tags: (optional list of tags)
}

Parameters
  • path – The path, either on the local filesystem or in a cloud store such as S3, to which the dataset should be saved. The version will be appended to this path as a subdirectory. An S3 path should be a URL of the form “s3://bucket-name/path/to/dir”. It is recommended to use this same path to publish all datasets, since it will prevent the user from creating two different datasets with the same version.

  • name – The name of the dataset, e.g., “mnist”.

  • version – A string indicating the dataset version. The version should be unique to this dataset. If None, the publication timestamp will be used as the version.

  • dataset_copy_strategy – The strategy by which to copy the original, raw dataset to the published path. STRATEGY_COPY recursively copies all files and directories from the dataset path supplied at instantiation to the published path so that the dataset can be properly versioned. STRATEGY_COPY_ZIP is identical in behavior, but zips the directory upon completion. STRATEGY_LINK will instead create a file ‘link.txt’ containing the supplied dataset path; this is desirable if the raw dataset is already stored in a versioned repository, and copying would create an unnecessary duplicate.

  • tags – An optional list of string tags to add to the dataset metadata.

Returns

The versioned dataset’s publication path.
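The hash field in meta.json described above can be illustrated with a small standalone sketch. This is an assumption about the general approach (hash file contents in a deterministic order, skipping data_processor.pkl and meta.json), not the library's exact implementation:

```python
import hashlib
import os
import tempfile

EXCLUDED = {'data_processor.pkl', 'meta.json'}

def dataset_hash(publication_dir: str) -> str:
    """MD5 over the contents of all published files except the excluded
    ones, visited in sorted order for determinism."""
    md5 = hashlib.md5()
    for name in sorted(os.listdir(publication_dir)):
        if name in EXCLUDED:
            continue
        with open(os.path.join(publication_dir, name), 'rb') as f:
            md5.update(f.read())
    return md5.hexdigest()

# Build a toy publication directory and compute its metadata fields.
with tempfile.TemporaryDirectory() as pub:
    for name, data in [('X_train.npy', b'features'),
                       ('y_train.npy', b'labels'),
                       ('data_processor.pkl', b'processor')]:
        with open(os.path.join(pub, name), 'wb') as f:
            f.write(data)
    meta = {'name': 'dataset', 'version': 'v1', 'hash': dataset_hash(pub)}
```

Excluding data_processor.pkl and meta.json means the hash identifies the dataset's tensors and raw files alone, so re-pickling the processor does not change the dataset's identity.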

Module contents

Contains dataset modules.