mlops.dataset package
Submodules
mlops.dataset.data_processor module
Contains the DataProcessor class.
- class mlops.dataset.data_processor.DataProcessor
Bases:
object
Transforms a raw dataset into features and labels for downstream model training, prediction, etc.
- get_preprocessed_features(dataset_path: str) Dict[str, numpy.ndarray]
Transforms the raw data at the given file or directory into features that can be used by downstream models. The data in the directory may be training/validation/test data, a batch of user data intended for prediction, or data in some other format. Downstream models can expect the features returned by this function to be preprocessed in any way required for model consumption.
- Parameters
dataset_path – The path to the file or directory on the local or remote filesystem containing the dataset.
- Returns
A dictionary whose values are feature tensors and whose corresponding keys are the names by which those tensors should be referenced. For example, the training features (value) may be called ‘X_train’ (key).
- get_preprocessed_features_and_labels(dataset_path: str) Tuple[Dict[str, numpy.ndarray], Dict[str, numpy.ndarray]]
Returns the preprocessed feature and label tensors from the dataset path. This method is specifically used for the train/val/test sets and not input data for prediction, because in some cases the features and labels need to be read simultaneously to ensure proper ordering of features and labels.
- Parameters
dataset_path – The path to the file or directory on the local or remote filesystem containing the dataset, specifically train/val/test and not prediction data.
- Returns
A 2-tuple of the features dictionary and labels dictionary, with matching keys and ordered tensors.
- abstract get_raw_features(dataset_path: str) Dict[str, numpy.ndarray]
Returns the raw feature tensors from the dataset path. The raw features are how training/validation/test as well as prediction data enter the data pipeline. For example, when handling image data, the raw features would likely be tensors of shape m x h x w x c, where m is the number of images, h is the image height, w is the image width, and c is the number of channels (3 for RGB), with all values in the interval [0, 255].
- Parameters
dataset_path – The path to the file or directory on the local or remote filesystem containing the dataset.
- Returns
A dictionary whose values are feature tensors and whose corresponding keys are the names by which those tensors should be referenced. For example, the training features (value) may be called ‘X_train’ (key).
- abstract get_raw_features_and_labels(dataset_path: str) Tuple[Dict[str, numpy.ndarray], Dict[str, numpy.ndarray]]
Returns the raw feature and label tensors from the dataset path. This method is specifically used for the train/val/test sets and not input data for prediction, because in some cases the features and labels need to be read simultaneously to ensure proper ordering of features and labels.
For example, when handling image data, the raw features would likely be tensors of shape m x h x w x c, where m is the number of images, h is the image height, w is the image width, and c is the number of channels (3 for RGB), with all values in the interval [0, 255]. The raw labels may be tensors of shape m, where m is the number of examples, with all values in the set {0, …, k - 1} indicating the class.
- Parameters
dataset_path – The path to the file or directory on the local or remote filesystem containing the dataset, specifically train/val/test and not prediction data.
- Returns
A 2-tuple of the features dictionary and labels dictionary, with matching keys and ordered tensors.
- abstract preprocess_features(raw_feature_tensor: numpy.ndarray) numpy.ndarray
Returns the preprocessed feature tensor from the raw tensor. The preprocessed features are how training/validation/test as well as prediction data are fed into downstream models. For example, when handling image data, the preprocessed features would likely be tensors of shape m x h x w x c, where m is the number of images, h is the image height, w is the image width, and c is the number of channels (3 for RGB), with all values in the interval [0, 1].
- Parameters
raw_feature_tensor – The raw features to be preprocessed.
- Returns
The preprocessed feature tensor. This tensor is ready for downstream model consumption.
- abstract preprocess_labels(raw_label_tensor: numpy.ndarray) numpy.ndarray
Returns the preprocessed label tensor from the raw tensor. The preprocessed labels are how training/validation/test as well as prediction data are fed into downstream models. For example, in a classification task, the preprocessed labels may be tensors of shape m x k, where m is the number of examples, and k is the number of classes, where each of the k-length vectors are one-hot encoded.
- Parameters
raw_label_tensor – The raw labels to be preprocessed.
- Returns
The preprocessed label tensor. This tensor is ready for downstream model consumption.
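A concrete subclass must implement the abstract methods above. The sketch below assumes a hypothetical image-classification dataset: raw pixel values in [0, 255] are scaled to [0, 1] and integer class labels are one-hot encoded. The class name, the number of classes, and the randomly generated stand-in data are illustrative assumptions; a real implementation would subclass mlops.dataset.data_processor.DataProcessor and load files from dataset_path.

```python
from typing import Dict, Tuple

import numpy as np


# In practice, subclass mlops.dataset.data_processor.DataProcessor;
# the base class is omitted here so the sketch runs standalone.
class ImageDataProcessor:
    """Hypothetical processor for an image-classification dataset."""

    NUM_CLASSES = 10  # assumed number of classes

    def get_raw_features_and_labels(
            self,
            dataset_path: str) -> Tuple[Dict[str, np.ndarray],
                                        Dict[str, np.ndarray]]:
        # Read features and labels together so that their ordering matches.
        # A real implementation would load files under dataset_path;
        # deterministic random data stands in here.
        rng = np.random.default_rng(0)
        features = {'X_train': rng.integers(0, 256, (8, 28, 28, 3))}
        labels = {'y_train': rng.integers(0, self.NUM_CLASSES, (8,))}
        return features, labels

    def get_raw_features(self, dataset_path: str) -> Dict[str, np.ndarray]:
        features, _ = self.get_raw_features_and_labels(dataset_path)
        return features

    def preprocess_features(
            self, raw_feature_tensor: np.ndarray) -> np.ndarray:
        # Scale pixel values from [0, 255] into [0, 1].
        return raw_feature_tensor.astype(np.float32) / 255.0

    def preprocess_labels(self, raw_label_tensor: np.ndarray) -> np.ndarray:
        # One-hot encode integer class labels into shape m x k.
        return np.eye(self.NUM_CLASSES, dtype=np.float32)[raw_label_tensor]
```

With these four methods defined, the inherited get_preprocessed_features and get_preprocessed_features_and_labels compose the raw loaders with the preprocessing functions.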
mlops.dataset.invertible_data_processor module
Contains the InvertibleDataProcessor class.
- class mlops.dataset.invertible_data_processor.InvertibleDataProcessor
Bases:
mlops.dataset.data_processor.DataProcessor
A DataProcessor that can invert any preprocessing to transform preprocessed data back into raw, real-world values for analysis and interpretability.
- abstract unpreprocess_features(feature_tensor: numpy.ndarray) numpy.ndarray
Returns the raw feature tensor from the preprocessed tensor; inverts preprocessing. Improves model interpretability by enabling users to transform model inputs into real-world values.
- Parameters
feature_tensor – The preprocessed features to be inverted.
- Returns
The raw feature tensor.
- abstract unpreprocess_labels(label_tensor: numpy.ndarray) numpy.ndarray
Returns the raw label tensor from the preprocessed tensor; inverts preprocessing. Improves model interpretability by enabling users to transform model outputs into real-world values.
- Parameters
label_tensor – The preprocessed labels to be inverted.
- Returns
The raw label tensor.
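Continuing the hypothetical image example, inversion undoes each preprocessing step: features are rescaled back into [0, 255], and one-hot label rows are collapsed back to integer class indices. The function names and scaling are illustrative assumptions; in practice these bodies would implement unpreprocess_features and unpreprocess_labels on a subclass of mlops.dataset.invertible_data_processor.InvertibleDataProcessor.

```python
import numpy as np


def unpreprocess_features(feature_tensor: np.ndarray) -> np.ndarray:
    # Invert the [0, 1] scaling back to raw pixel values in [0, 255].
    return (feature_tensor * 255.0).round().astype(np.uint8)


def unpreprocess_labels(label_tensor: np.ndarray) -> np.ndarray:
    # Collapse one-hot rows (shape m x k) back to class indices (shape m).
    return label_tensor.argmax(axis=-1)
```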
mlops.dataset.pathless_data_processor module
Contains the PathlessDataProcessor class.
- class mlops.dataset.pathless_data_processor.PathlessDataProcessor(features: Dict[str, numpy.ndarray], labels: Dict[str, numpy.ndarray])
Bases:
mlops.dataset.data_processor.DataProcessor
Loads preset features and labels.
- get_raw_features(dataset_path: str) Dict[str, numpy.ndarray]
Returns the training features.
- Parameters
dataset_path – Unused.
- Returns
A dictionary whose values are feature tensors and whose corresponding keys are the names by which those tensors should be referenced. For example, the training features (value) may be called ‘X_train’ (key).
- get_raw_features_and_labels(dataset_path: str) Tuple[Dict[str, numpy.ndarray], Dict[str, numpy.ndarray]]
Returns the training features and labels.
- Parameters
dataset_path – Unused.
- Returns
A 2-tuple of the features dictionary and labels dictionary, with matching keys and ordered tensors.
- preprocess_features(raw_feature_tensor: numpy.ndarray) numpy.ndarray
Returns the input features unchanged (the identity function).
- Parameters
raw_feature_tensor – The raw features to be preprocessed.
- Returns
The preprocessed feature tensor. This tensor is ready for downstream model consumption.
- preprocess_labels(raw_label_tensor: numpy.ndarray) numpy.ndarray
Returns the input labels unchanged (the identity function).
- Parameters
raw_label_tensor – The raw labels to be preprocessed.
- Returns
The preprocessed label tensor. This tensor is ready for downstream model consumption.
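The behavior documented above can be sketched as a standalone stand-in: the constructor stores the preset tensor dictionaries, the dataset path is ignored, and preprocessing is the identity. The class name is a stand-in for illustration; in practice, use mlops.dataset.pathless_data_processor.PathlessDataProcessor directly.

```python
from typing import Dict, Tuple

import numpy as np


# Standalone stand-in mirroring the documented PathlessDataProcessor.
class PathlessProcessorSketch:

    def __init__(self,
                 features: Dict[str, np.ndarray],
                 labels: Dict[str, np.ndarray]) -> None:
        # Tensors are supplied up front rather than loaded from a path.
        self.features = features
        self.labels = labels

    def get_raw_features(self, dataset_path: str) -> Dict[str, np.ndarray]:
        # dataset_path is unused: the tensors were preset at construction.
        return self.features

    def get_raw_features_and_labels(
            self,
            dataset_path: str) -> Tuple[Dict[str, np.ndarray],
                                        Dict[str, np.ndarray]]:
        return self.features, self.labels

    def preprocess_features(
            self, raw_feature_tensor: np.ndarray) -> np.ndarray:
        # Identity: preset tensors are assumed to be model-ready already.
        return raw_feature_tensor

    def preprocess_labels(self, raw_label_tensor: np.ndarray) -> np.ndarray:
        return raw_label_tensor
```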
mlops.dataset.pathless_versioned_dataset_builder module
Contains the PathlessVersionedDatasetBuilder class.
- class mlops.dataset.pathless_versioned_dataset_builder.PathlessVersionedDatasetBuilder(features: Dict[str, numpy.ndarray], labels: Dict[str, numpy.ndarray])
Bases:
mlops.dataset.versioned_dataset_builder.VersionedDatasetBuilder
Builds a versioned dataset directly from feature and label tensors.
- publish(path: str, *args: Any, name: str = 'dataset', version: Optional[str] = None, dataset_copy_strategy: str = 'link', tags: Optional[List[str]] = None, **kwargs: Any) str
Saves the versioned dataset files to the given path. If the path with the appended version already exists, this operation raises a PublicationPathAlreadyExistsError.
- The following files will be created:
- path/version/ (the publication path and version)
  - X_train.npy (and other feature tensors by their given names)
  - y_train.npy (and other label tensors by their given names)
  - data_processor.pkl (DataProcessor object)
  - meta.json (metadata)
  - raw.tar.bz2 (bz2-zipped directory with the raw dataset files)
- The contents of meta.json will be:
- {
      name: (dataset name)
      version: (dataset version)
      hash: (MD5 hash of all objects apart from data_processor.pkl and meta.json)
      created_at: (timestamp)
      tags: (optional list of tags)
  }
- Parameters
path – The path, either on the local filesystem or in a cloud store such as S3, to which the dataset should be saved. The version will be appended to this path as a subdirectory. An S3 path should be a URL of the form “s3://bucket-name/path/to/dir”. It is recommended to use this same path to publish all datasets, since it will prevent the user from creating two different datasets with the same version.
name – The name of the dataset, e.g., “mnist”.
version – A string indicating the dataset version. The version should be unique to this dataset. If None, the publication timestamp will be used as the version.
dataset_copy_strategy – The strategy by which to copy the original, raw dataset to the published path. STRATEGY_COPY recursively copies all files and directories from the dataset path supplied at instantiation to the published path so that the dataset can be properly versioned. STRATEGY_COPY_ZIP is identical in behavior, but zips the directory upon completion. STRATEGY_LINK will instead create a file ‘link.txt’ containing the supplied dataset path; this is desirable if the raw dataset is already stored in a versioned repository, and copying would create an unnecessary duplicate.
tags – An optional list of string tags to add to the dataset metadata.
- Returns
The versioned dataset’s publication path.
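The path-and-version behavior described above can be sketched in a few lines: the version (or a timestamp, if None) is appended to the base path as a subdirectory, and an existing publication path raises an error. The function name, the timestamp format, and the exception's base class are assumptions for illustration, not the library's actual internals.

```python
import datetime
import os
from typing import Optional


# Hypothetical stand-in for the library's exception of the same name.
class PublicationPathAlreadyExistsError(FileExistsError):
    pass


def resolve_publication_path(path: str,
                             version: Optional[str] = None) -> str:
    """Append the version to the base path, refusing to overwrite."""
    if version is None:
        # Default to the publication timestamp as the version.
        version = datetime.datetime.now().strftime('%Y%m%d%H%M%S')
    publication_path = os.path.join(path, version)
    if os.path.exists(publication_path):
        raise PublicationPathAlreadyExistsError(publication_path)
    return publication_path
```

Publishing every dataset under the same base path makes this collision check effective: two datasets can never share a version.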
mlops.dataset.versioned_dataset module
Contains the VersionedDataset class.
- class mlops.dataset.versioned_dataset.VersionedDataset(path: str)
Bases:
mlops.artifact.versioned_artifact.VersionedArtifact
Represents a versioned dataset.
- property md5: str
Returns the artifact’s MD5 hash.
- Returns
The artifact’s MD5 hash.
- property metadata_path: str
Returns the local or remote path to the artifact’s metadata.
- Returns
The local or remote path to the artifact’s metadata.
- property name: str
Returns the artifact’s name.
- Returns
The artifact’s name.
- property path: str
Returns the local or remote path to the artifact.
- Returns
The local or remote path to the artifact.
- property version: str
Returns the artifact’s version.
- Returns
The artifact’s version.
mlops.dataset.versioned_dataset_builder module
Contains the VersionedDatasetBuilder class.
- class mlops.dataset.versioned_dataset_builder.VersionedDatasetBuilder(dataset_path: str, data_processor: mlops.dataset.data_processor.DataProcessor)
Bases:
mlops.artifact.versioned_artifact_builder.VersionedArtifactBuilder
An object containing all of the components that form a versioned dataset. This object is only used to ensure a standard format for datasets stored in a dataset archive (such as the local filesystem or S3), and is not meant for consumption by downstream models.
- publish(path: str, *args: Any, name: str = 'dataset', version: Optional[str] = None, dataset_copy_strategy: str = 'copy_zip', tags: Optional[List[str]] = None, **kwargs: Any) str
Saves the versioned dataset files to the given path. If the path with the appended version already exists, this operation raises a PublicationPathAlreadyExistsError.
- The following files will be created:
- path/version/ (the publication path and version)
  - X_train.npy (and other feature tensors by their given names)
  - y_train.npy (and other label tensors by their given names)
  - data_processor.pkl (DataProcessor object)
  - meta.json (metadata)
  - raw.tar.bz2 (bz2-zipped directory with the raw dataset files)
- The contents of meta.json will be:
- {
      name: (dataset name)
      version: (dataset version)
      hash: (MD5 hash of all objects apart from data_processor.pkl and meta.json)
      created_at: (timestamp)
      tags: (optional list of tags)
  }
- Parameters
path – The path, either on the local filesystem or in a cloud store such as S3, to which the dataset should be saved. The version will be appended to this path as a subdirectory. An S3 path should be a URL of the form “s3://bucket-name/path/to/dir”. It is recommended to use this same path to publish all datasets, since it will prevent the user from creating two different datasets with the same version.
name – The name of the dataset, e.g., “mnist”.
version – A string indicating the dataset version. The version should be unique to this dataset. If None, the publication timestamp will be used as the version.
dataset_copy_strategy – The strategy by which to copy the original, raw dataset to the published path. STRATEGY_COPY recursively copies all files and directories from the dataset path supplied at instantiation to the published path so that the dataset can be properly versioned. STRATEGY_COPY_ZIP is identical in behavior, but zips the directory upon completion. STRATEGY_LINK will instead create a file ‘link.txt’ containing the supplied dataset path; this is desirable if the raw dataset is already stored in a versioned repository, and copying would create an unnecessary duplicate.
tags – An optional list of string tags to add to the dataset metadata.
- Returns
The versioned dataset’s publication path.
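The hash recorded in meta.json is described above as an MD5 digest over the published objects, excluding data_processor.pkl and meta.json itself. A minimal sketch of such a digest is shown below; the function name, the sorted file ordering, and the exact digest composition are assumptions for illustration, not necessarily how the library computes it.

```python
import hashlib
import os


def dataset_md5(publication_path: str) -> str:
    """MD5 over published files, excluding the processor and metadata."""
    digest = hashlib.md5()
    for root, _, files in os.walk(publication_path):
        # Sort filenames so the digest is deterministic across platforms.
        for filename in sorted(files):
            if filename in ('data_processor.pkl', 'meta.json'):
                continue
            with open(os.path.join(root, filename), 'rb') as infile:
                digest.update(infile.read())
    return digest.hexdigest()
```

Excluding meta.json is what allows the hash to be stored inside meta.json without changing its own value.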
Module contents
Contains dataset modules.