Data Science on AWS

Chris Fregly

71 highlights

Highlights & Annotations

Data quality can halt a data processing pipeline in its tracks. If these issues are not caught early, they can lead to misleading reports (e.g., double-counted revenue), biased AI/ML models (skewed toward or against a single gender or race), and other unintended data products.

Ref. 569B-A

To catch these data issues early, we use two open source libraries from AWS, Deequ and PyDeequ. These libraries use Apache Spark to analyze data quality, detect anomalies, and even “notify the Data Scientist at 3 a.m.” about a data issue. Deequ continuously analyzes data throughout the complete, end-to-end lifetime of the model from feature engineering to model training to model serving in production. Figure 5-16 shows a high-level overview of the Deequ architecture and components.

Ref. F7E6-B
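Deequ and PyDeequ run their checks on Apache Spark DataFrames at scale; as a rough illustration of the constraint-checking idea only, here is a plain-Python sketch. The function names and data below are ours, not the PyDeequ API.

```python
# Illustrative only: a minimal, Deequ-style constraint check in plain Python.
# Real Deequ/PyDeequ evaluates such constraints on Spark; the vocabulary
# (completeness, uniqueness) mirrors Deequ's, but this is not its API.
from collections import Counter

def check_completeness(rows, column):
    """Fraction of rows where the column is present and non-null."""
    non_null = sum(1 for r in rows if r.get(column) is not None)
    return non_null / len(rows)

def check_uniqueness(rows, column):
    """Fraction of values that appear exactly once."""
    counts = Counter(r.get(column) for r in rows)
    return sum(1 for c in counts.values() if c == 1) / len(rows)

def verify(rows, constraints):
    """Run each (name, fn, threshold) constraint and report pass/fail."""
    return {name: fn(rows) >= threshold for name, fn, threshold in constraints}

reviews = [
    {"review_id": "R1", "star_rating": 5},
    {"review_id": "R2", "star_rating": 1},
    {"review_id": "R3", "star_rating": None},  # missing rating -> quality issue
]

results = verify(reviews, [
    ("review_id_complete", lambda r: check_completeness(r, "review_id"), 1.0),
    ("review_id_unique",   lambda r: check_uniqueness(r, "review_id"), 1.0),
    ("rating_complete",    lambda r: check_completeness(r, "star_rating"), 1.0),
])
print(results)  # rating_complete fails: one review has a null star_rating
```

A failing constraint like `rating_complete` is exactly the kind of signal that would halt the pipeline (or page someone at 3 a.m.) before a bad model gets trained.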

Highlighted Image: Overview of Deequ components: constraints, metrics, and suggestions

Ref. 1DB4-C

Learning from run to run, Deequ will suggest new rules to apply during the next pass through the dataset. For example, Deequ learns the baseline statistics of our dataset at model training time, then detects anomalies as new data arrives for model prediction. This problem is classically called “training-serving skew.” Essentially, a model is trained with one set of learned constraints, then the model sees new data that does not fit those existing constraints. This is a sign that the data has shifted—or skewed—from the original, expected distribution used during training.

Ref. B303-E

We can also run Clarify as a SageMaker Processing Job to continually analyze our dataset at scale and calculate bias metrics as new data arrives.

Ref. 2796-F

One drawback to undersampling is that the training dataset size is sampled down to the size of the smallest category. This can reduce the predictive power and robustness of the trained models by reducing the signal from undersampled classes. In this example, we reduced the number of reviews by 65% from approximately 100,000 to 35,000.

Ref. D637-G
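The mechanics of undersampling can be sketched in a few lines. This is an illustrative helper, not the book's actual code; it assumes examples are simple `(text, star_rating)` tuples and samples every class down to the size of the smallest one.

```python
# Minimal undersampling sketch: balance classes by sampling every class
# down to the size of the smallest class (at the cost of discarded signal).
import random
from collections import defaultdict

def undersample(examples, label_fn, seed=42):
    by_class = defaultdict(list)
    for ex in examples:
        by_class[label_fn(ex)].append(ex)
    n = min(len(group) for group in by_class.values())  # smallest class size
    rng = random.Random(seed)
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n))  # keep n random examples per class
    return balanced

# 70 / 20 / 10 split across three star ratings -> balanced to 10 each
reviews = [("great", 5)] * 70 + [("ok", 3)] * 20 + [("bad", 1)] * 10
balanced = undersample(reviews, label_fn=lambda r: r[1])
print(len(balanced))  # 30: each of the 3 classes reduced to 10 examples
```

Note how the 100-example dataset shrinks by 70% here, mirroring the 65% reduction described above.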

Oversampling will artificially create new data for the underrepresented class. In our case, star_rating 2 and 3 are underrepresented. One common technique is called the Synthetic Minority Oversampling Technique (SMOTE), which uses statistical methods to synthetically generate new data from existing data. Oversampling techniques like SMOTE tend to work better when we have a larger dataset, so be careful when using oversampling on small datasets with a low number of minority class examples.

Ref. EA9B-H
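To make the SMOTE idea concrete, here is a toy SMOTE-style oversampler, not the `imbalanced-learn` library most practitioners would actually use. It assumes numeric feature vectors: each synthetic point is a linear interpolation between a minority example and one of its nearest minority neighbors.

```python
# Illustrative SMOTE-style oversampling: synthesize minority examples by
# interpolating between a minority point and one of its k nearest
# minority-class neighbors (naive O(n) neighbor search, numeric features only).
import random

def smote_like(minority, n_new, k=2, seed=0):
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        # k nearest minority neighbors of a, by squared Euclidean distance
        neighbors = sorted(
            (p for p in minority if p is not a),
            key=lambda p: sum((x - y) ** 2 for x, y in zip(a, p)),
        )[:k]
        b = rng.choice(neighbors)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
new_points = smote_like(minority, n_new=5)
print(len(new_points))  # 5 synthetic minority examples
```

Because each synthetic point lies between two real minority points, a tiny minority class (as warned above) gives SMOTE very little room to generate meaningful variety.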

To align with these three phases, we split the balanced data into separate train, validation, and test datasets. The train dataset is used for model training. The validation dataset is used to validate the model training configuration called the “hyper-parameters.” And the test dataset is used to test the chosen hyper-parameters. For our model, we chose 90% train, 5% validation, and 5% test as this breakdown, shown in Figure 6-8, works well for our dataset and model.

Ref. 60DB-I
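The 90/5/5 breakdown can be sketched as a shuffle-then-slice split. This is an illustrative helper (not the book's code), and it assumes the rows are independent; for time-series data, a split by time is needed instead, as discussed below.

```python
# Sketch of a 90% / 5% / 5% train / validation / test split:
# shuffle once with a fixed seed, then slice into three disjoint pieces.
import random

def train_val_test_split(rows, train_frac=0.90, val_frac=0.05, seed=42):
    rows = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    train = rows[:n_train]
    validation = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]
    return train, validation, test

rows = list(range(1000))
train, validation, test = train_val_test_split(rows)
print(len(train), len(validation), len(test))  # 900 50 50
```

Slicing one shuffled copy guarantees the three datasets are disjoint, which matters for the leakage discussion that follows.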

In this case, we are not using k-folds cross-validation—a classic machine learning technique that reuses each row of data across different splits including train, validation, and test. K-folds cross-validation is traditionally applied to smaller datasets and, in our case, we have a large amount of data so we can avoid the downside of k-folds: data “leakage” between the train, validation, and test phases. Data leakage can lead to artificially inflated model accuracy for our trained models. These models may not perform well on real-world data outside of the lab. In summary, each of the three phases, train, validation, and test, should use separate and independent datasets, otherwise leakage may occur.

Ref. 70E5-J

On a related note, time-series data is often prone to leakage across splits. Companies often want to validate a new model using “back-in-time” historical information before pushing the model to production. When working with time-series data, make sure our model does not peek into the future accidentally. Otherwise, these models may appear more accurate than they really are.

Ref. DC16-K

Feature stores are data lakes for machine learning features. Since features sometimes require heavy compute processing as we demonstrated earlier with our BERT features using SageMaker Processing Jobs, we would like to store and reuse these features, if possible, throughout the organization.

Ref. 4ED4-L

It’s important to note that “parameters” (aka “weights”) are what the model learns during training. And that “hyper-parameters” are how the model learns the parameters. Every algorithm supports a set of hyper-parameters that alter the algorithm’s behavior while learning the dataset. Hyper-parameters can be anything from the depth of a decision tree to the number of layers in our neural network.

Ref. 3956-M

SageMaker Clarify helps us to detect bias and evaluate model fairness in each step of our machine learning pipeline. We saw in Chapter 5 how to use Clarify to detect bias and class imbalances in our dataset. We now use Clarify to analyze our trained model.

Ref. 21A0-N

Clarify performs the post-training bias analysis by comparing the model predictions against the labels in the training data with respect to the chosen facet.

Ref. 39AD-O

SageMaker Clarify also supports SHAP, a concept from game theory applied in the machine learning context, to determine the contribution that each feature makes to a model’s prediction

Ref. F9C9-P

We can use this information to select features or create new feature combinations. Following is the code to perform feature attribution

Ref. 266E-Q

Profile Training Jobs with SageMaker Debugger

Ref. 7FFE-R

Debugger also suggests using a smaller instance or increasing the batch size since our GPU utilization is low:

Ref. 975A-S

Start with a Pre-Trained Model

Ref. 3518-T

Spot Instances and Checkpoints

Ref. 68B1-U

Spot Instances may be terminated while the training job is running. Using the max_wait parameter, SageMaker will wait max_wait seconds for new Spot Instances to replace the previously terminated Spot Instances. After max_wait seconds, the job will end. The latest checkpoint is used to begin training from the point in time when the Spot Instances were terminated.

Ref. 82B0-V
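The checkpoint/resume mechanics can be sketched without any AWS dependencies. In a real SageMaker Spot training job, a local checkpoint directory is synced to S3 and restored on restart; in this illustrative toy, a plain dict stands in for that storage and a decaying loss stands in for real training.

```python
# Sketch of checkpoint/resume in the spirit of Spot training: state is saved
# every epoch, and a restarted job resumes from the latest checkpoint
# instead of starting over at epoch 0.
import json

def train(epochs, checkpoint_store, save_every=1):
    """Resume from the latest checkpoint if one exists, then keep training."""
    if "latest" in checkpoint_store:
        state = json.loads(checkpoint_store["latest"])
    else:
        state = {"epoch": 0, "loss": 1.0}
    while state["epoch"] < epochs:
        state["epoch"] += 1
        state["loss"] *= 0.9  # stand-in for one epoch of real training
        if state["epoch"] % save_every == 0:
            checkpoint_store["latest"] = json.dumps(state)
    return state

store = {}
train(epochs=3, checkpoint_store=store)             # "interrupted" after epoch 3
resumed = train(epochs=10, checkpoint_store=store)  # resumes at epoch 3, not 0
print(resumed["epoch"])  # 10
```

The second call picks up at epoch 3, which is exactly what saves money when a terminated Spot Instance is replaced mid-job.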

Early Stopping Rule in SageMaker Debugger

Ref. C573-W

We can automatically find the best hyper-parameters for our dataset and algorithm using a scalable process called hyper-parameter tuning (HPT) or hyper-parameter optimization (HPO). SageMaker natively supports hyper-parameter tuning jobs.

Ref. EF88-X

We have already learned that hyper-parameters control how our machine learning algorithm learns the model parameters during model training. When tuning our hyper-parameters, we need to define an objective to optimize such as model accuracy. In other words, we need to find a set of hyper-parameters that meets or exceeds our given objective.

Ref. 3C81-Y

SageMaker supports the random-search and Bayesian hyper-parameter optimization strategies. With random search, we randomly keep picking combinations of hyper-parameters until we find a well-performing combination. This approach is very fast and very easy to parallelize, but we might miss the best set of hyper-parameters as we are picking randomly from the hyper-parameter space. With Bayesian optimization, we treat the task as a regression problem.

Ref. 46D4-Z
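The random-search strategy can be sketched in a few lines. This is illustrative only: the toy objective below is a stand-in for “launch a training job and report validation accuracy,” which is what each trial costs in a real SageMaker tuning job.

```python
# Minimal random-search sketch: sample hyper-parameter combinations uniformly
# from the given ranges and keep the best-scoring one.
import random

def random_search(space, objective, n_trials=50, seed=7):
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

space = {"learning_rate": (1e-5, 1e-1), "dropout": (0.0, 0.5)}

def toy_objective(p):
    # Hypothetical "accuracy" peaking near lr=0.01, dropout=0.1 (illustration).
    return 1.0 - abs(p["learning_rate"] - 0.01) - abs(p["dropout"] - 0.1)

best, score = random_search(space, toy_objective)
print(best, score)
```

Each trial here is independent, which is why random search parallelizes trivially; Bayesian optimization instead conditions each new trial on the scores of previous ones.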

Instead, we recommend using the random-search and Bayesian optimization strategies.

Ref. 1838-A

To keep this example simple and avoid a combinatorial explosion of trial runs, we will freeze most hyper-parameters and explore only a limited set for this particular optimization run. In a perfect world with unlimited resources and budget, we would explore every combination of hyper-parameters. For now, we will manually choose some of the following hyper-parameters and explore the rest in

Ref. 33EF-B

Next, let’s set up the hyper-parameter ranges that we wish to explore. We are choosing these hyper-parameters based on intuition, domain knowledge, and algorithm documentation. We may also find research papers useful—or other prior work from the community. At this point in the life cycle of machine learning and predictive analytics, we can almost always find relevant information on the problem we are trying to solve.

Ref. 228F-C

In this example, we are using the Bayesian optimization strategy with 10 jobs in parallel and 100 total. By only doing 10 at a time, we give the Bayesian strategy a chance to learn from previous runs. In other words, if we did all 100 in parallel, the Bayesian strategy could not use prior information to choose better values within the ranges provided

Ref. 3BCF-D

By setting early_stopping_type to Auto, SageMaker will stop the tuning job if the tuning job is not going to improve upon the objective metric. This helps save time, reduces the potential for overfitting to our training dataset, and reduces the overall cost of the tuning job.

Ref. 1573-E

SageMaker stopped two jobs early because their combinations of hyper-parameters were not improving the training-accuracy objective metric. This is an example of SageMaker saving us money by intelligently stopping jobs early when they are not adding value to our business objective.

Ref. 1E3D-F

SageMaker hyper-parameter tuning also supports automatic hyper-parameter tuning across multiple algorithms by adding a list of algorithms to the tuning job definition. We can specify different hyper-parameters and ranges for each algorithm. Similarly, SageMaker Autopilot uses multialgorithm tuning to find the best model across different algorithms based on our problem type, dataset, and objective function

Ref. 2E07-G

Warm start is particularly useful when we want to change the hyper-parameter tuning ranges from the previous job or add new hyper-parameters. Both scenarios benefit from the knowledge of the previous tuning job to find the best model faster. The two scenarios are implemented with two warm start types, IDENTICAL

Ref. A220-H

Any distributed computation requires that the cluster instances communicate and share information with each other. This cluster communication benefits from higher-bandwidth connections between the instances. Therefore, the instances should be physically close to each other in the cloud data center, if possible. Fortunately, SageMaker handles all of this heavy lifting for us so we can focus on creating our review classifier and address our business problem of classifying product reviews in the wild. SageMaker supports distributed computations with many distributed-native frameworks including Apache Spark, TensorFlow, PyTorch, and MXNet.

Ref. 52CE-I

“Parameter server” is a primitive distributed training strategy supported by most distributed machine learning frameworks. Remember that parameters are what the algorithm is learning. Parameter servers store the learned parameters and share them with every instance during the training process. Since parameter servers store the state of the parameters, SageMaker runs a parameter server on every instance for higher availability as shown in Figure 8-3.

Ref. 55AD-J

By researching the work of others, we can likely find a range of hyper-parameters that will narrow the search space and speed up our SageMaker Hyper-Parameter Tuning Jobs. If we don’t have a good starting point, we can use the Logarithmic scaling strategy to determine the scale within which we should explore. Just knowing the power of 10 can make a big difference in reducing the time to find the best hyper-parameters for our algorithm and dataset.

Ref. A5A6-K
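The “power of 10” point can be made concrete with a quick sketch. Sampling uniformly in log space visits each decade of a range like 1e-5 to 1e-1 about equally often, whereas uniform linear sampling almost never lands below 1e-3; the code below is illustrative, not a SageMaker API.

```python
# Why logarithmic scaling matters for ranges spanning several powers of 10:
# sample so that log10(value) is uniform, and each decade gets equal coverage.
import math
import random

def sample_log_uniform(lo, hi, rng):
    """Sample a value whose log10 is uniform between log10(lo) and log10(hi)."""
    return 10 ** rng.uniform(math.log10(lo), math.log10(hi))

rng = random.Random(0)
samples = [sample_log_uniform(1e-5, 1e-1, rng) for _ in range(10_000)]

# The range covers 4 decades; values below 1e-3 span 2 of them.
tiny = sum(1 for s in samples if s < 1e-3) / len(samples)
print(round(tiny, 2))  # ~0.5: half the samples fall in the two lowest decades
```

With plain linear sampling over the same range, fewer than 1% of samples would fall below 1e-3, so a tuner could easily miss a small-but-optimal learning rate.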

In addition to sharding, we can also use a SageMaker feature called Pipe mode to load the data on the fly and as needed. Up until now, we’ve been using the default File mode, which copies all of the data to all the instances when the training job starts. This creates a long pause at the start of the training job as the data is copied. Pipe mode provides the most significant performance boost when using large datasets in the 10, 100, or 1,000 GB range. If our dataset is smaller, we should use File mode

Ref. 5F3E-L

By streaming only the data that is needed when it’s needed, our training and tuning jobs start quicker, complete faster, and use less disk space overall. This directly leads to lower cost for our training and tuning jobs

Ref. 0F61-M

Enable Enhanced Networking

Ref. C161-N

ENA works well with the AWS deep learning instance types, including the C, M, P, and X series. These instance types offer a large number of CPUs, so they benefit greatly from efficient sharing of the network adapter. By performing various network-level optimizations such as hardware-based checksum generation and software-based routing, ENA reduces overhead, improves scalability, and maximizes consistency. All of these optimizations are designed to reduce bottlenecks, offload work from the CPUs, and create an efficient path for the network packets

Ref. E70E-O

We deploy our model to serve online, real-time predictions, and show how to run offline, batch predictions. For real-time predictions, we deploy our model via SageMaker Endpoints. We discuss best practices and deployment strategies such as canary rollouts and blue/green deployments. We show how to test and compare new models using A/B tests and how to implement reinforcement learning with multiarmed bandit (MAB) tests. We demonstrate how to automatically scale our model hosting infrastructure with changes in model-prediction traffic. We show how to continuously monitor the deployed model to detect concept drift, drift in model quality or bias, and drift in feature importance. We also touch on serving model predictions via serverless APIs using AWS Lambda and how to optimize and manage models at the edge. We conclude the chapter with tips on how to reduce our model size, reduce inference cost, and increase our prediction performance using various hardware, services, and tools such as the AWS Inferentia hardware, SageMaker Neo service, and TensorFlow Lite library.

Ref. D5A8-P

Are we trying to optimize for latency or throughput? Does the application require our models to scale automatically throughout the day to handle cyclic traffic requirements? Do we plan to compare models in production through A/B tests?

Ref. 1CF5-Q

For less-latency-sensitive applications that require high throughput, we should deploy our model as a batch job to perform batch predictions on large amounts of data in S3, for example. We will use SageMaker Batch Transformations to perform the batch predictions along with a data store like RDS or DynamoDB to productionize the predictions

Ref. 7113-R

Testing and Comparing New Models

Ref. E896-S

When testing our models in production, we need to define and track the business metrics that we wish to optimize. The business metric is usually tied to revenue or user engagement such as orders purchased, movies watched, or ads clicked. We can store the metrics in any database such as DynamoDB as shown in Figure 9-10. Analysts and scientists will use this data to determine the winning model from our tests

Ref. 2604-T

Similar to canary rollouts, we can use traffic splitting to direct subsets of users to different model variants for the purpose of comparing and testing different models in live production. The goal is to see which variants perform better. Often, these tests need to run for a long period of time (weeks) to be statistically significant. Figure 9-11 shows two different recommendation models

Ref. FBD1-U
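The traffic-splitting idea can be sketched as weighted routing. SageMaker performs this server-side via production-variant weights on an endpoint; the client-side toy below, with made-up variant names, only illustrates the proportional-routing behavior.

```python
# Illustrative weighted traffic splitter in the spirit of SageMaker
# production variants: requests are routed in proportion to variant weight.
import random

def route(variants, rng):
    """variants: dict of variant_name -> weight. Returns a chosen variant."""
    names = list(variants)
    weights = [variants[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(1)
variants = {"model-a": 0.9, "model-b": 0.1}  # e.g., 10% of traffic to the B variant
counts = {"model-a": 0, "model-b": 0}
for _ in range(10_000):
    counts[route(variants, rng)] += 1
print(counts)  # roughly 9000 / 1000
```

For an A/B test the weights would typically be closer to 50/50 and held for weeks; for a canary rollout, a small weight like the 10% above is gradually increased as confidence grows.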

While A/B testing seems similar to canary rollouts, they are focused on gathering data about different variants of a model. A/B tests are targeted to larger user groups, take more traffic, and run for longer periods of time. Canary rollouts are focused more on risk mitigation and smooth upgrades

Ref. 774F-V

Monitor’s data-quality monitoring feature. And we can also detect concept shifts using Model Monitor’s model-quality monitoring feature that compares live predictions against ground truth labels for the same model inputs captured by Model Monitor on live predictions. These ground truth labels are provided by humans in an offline human-in-the-loop workflow using something like Amazon

Ref. DB91-W

Our model learns and adapts the statistical characteristics of our training data. If the statistical characteristics of the data that our online model receives drifts from that baseline, the model quality will degrade. We can create a data quality baseline using Deequ, as discussed in Chapter 5. Deequ analyzes the input data and creates schema constraints and statistics for each input feature. We can identify missing values and detect covariate shifts relative to that baseline. Model Monitor uses Deequ to create baselines for data-quality monitoring.

Ref. 2402-X
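A drastically simplified version of this baseline-and-drift check can be sketched in plain Python. Real Model Monitor baselines (built with Deequ) capture much richer statistics and schema constraints; this toy only tracks per-feature mean and standard deviation, and the price data is hypothetical.

```python
# Minimal covariate-shift check: record mean/std per feature at training time,
# then flag live batches whose mean drifts several baseline stds away.
import statistics

def baseline(values):
    return {"mean": statistics.fmean(values), "std": statistics.stdev(values)}

def drifted(live_values, base, threshold=3.0):
    """True if the live mean is more than `threshold` baseline stds away."""
    live_mean = statistics.fmean(live_values)
    return abs(live_mean - base["mean"]) > threshold * base["std"]

training_prices = [9.5, 10.0, 10.5, 9.8, 10.2, 10.1, 9.9]
base = baseline(training_prices)

print(drifted([10.0, 9.9, 10.1], base))   # False: same distribution as training
print(drifted([99.0, 101.0, 100.0], base))  # True: e.g., a units bug upstream
```

The second batch is the kind of upstream application change (a new currency, a new product category) that should trigger a notification and possibly a retrain.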

To find the root cause of this data quality drift, we want to examine the model inputs and examine any upstream application bugs (or features) that may have been recently introduced. For example, if the application team adds a new set of product categories that our model was not trained on, the model may predict poorly for those particular product categories. In this case, Model Monitor would detect the covariate shift in model inputs, notify us, and potentially retrain and redeploy the model.

Ref. E5CD-Y

Input data is captured by Model Monitor using the real-time data capture feature. This data is saved into S3 and labeled by humans offline. A Model Quality Job then compares the captured predictions against the offline, human-provided labels on a schedule that we define. If the model quality decays, Model Monitor will notify us and potentially retrain and redeploy the model, incorporating the ground-truth data labeled by humans. Note that the availability of the ground truth labels might be delayed because of the required human interaction. Figure 9-26 shows the high-level overview of model-quality drift detection using offline, ground-truth labels provided by a human workforce

Ref. EBAF-Z

While AWS offers a wide range of instance types with different GPU, CPU, network bandwidth, and memory combinations, our model may use a custom combination. With EIAs, we can start by choosing a base CPU instance and add GPUs until we find the right balance for our model inference needs. Otherwise, we may be forced to optimize one set of resources like CPU and RAM, but underutilize other resources like GPU and network bandwidth.

Ref. 44D6-A

SageMaker Neo takes a trained model and performs a series of hardware-specific optimizations such as 16-bit quantization, graph pruning, layer fusing, and constant folding for up to 2x model-prediction speedups with minimal accuracy loss. Neo works across popular AI and machine learning frameworks including TensorFlow, PyTorch, MXNet, and XGBoost.

Ref. 35C7-B

Typically, the data scientist delivers the trained model, the DevOps engineer manages the infrastructure that hosts the model as a REST API, and the application developer integrates the REST API into their applications. Each team must understand the needs and requirements of the other teams in order to implement an efficient workflow and smooth hand-off process.

Ref. 2472-C

While the model may appear to train successfully with poor-quality data, the model could negatively affect our business if pushed to production. By automating the data-quality checks before model training, we could raise a pipeline exception, notify the application team of the bad data, and save the cost of training a bad model

Ref. A181-D

Experiment tracking records the hyper-parameters used during training as well as the training results such as model accuracy. The SageMaker Experiments and Lineage APIs are integrated throughout SageMaker to handle these scenarios.

Ref. 0AEF-E

Verifiable ML pipelines can help solve the problem of model degradation. Model degradation is a relatively common and underengineered scenario due to the complexity of monitoring models in production. Degrading model predictions results in poorly classified reviews and missed business opportunities.

Ref. 65EA-F

By continually monitoring our model predictions with SageMaker Model Monitor and Clarify, we can detect shifts in data distributions, model bias, and model explainability—triggering a pipeline to retrain and deploy a new review-classifier model.

Ref. 4EA4-G

Machine learning is continuous. The more we automate the process, the more we are free to solve additional business problems. Otherwise, we find ourselves manually rerunning one-off scripts every time new data arrives. While running a script is fairly simple, monitoring or restarting the script requires cognitive load that we could likely apply to higher-value tasks.

Ref. D901-H

Effective ML pipelines should include the following:

Ref. 18E6-I

In our experience, data-quality issues are the number-one cause of bad ML pipelines. In Chapter 5, we demonstrated how to use the AWS Deequ open source library to perform data-quality checks on our data as “step 0” of the ML pipeline. Without consistent and expected quality, our ML pipeline will, at best, fail quickly and minimize cost. At worst, poor-quality data will produce poor-quality models that may include bias and negatively impact our business.

Ref. 40F3-J

Steps of an Effective Machine Learning Pipeline

Ref. 86F9-K

Pipeline Orchestration with SageMaker Pipelines

Ref. B578-L

Anytime a new file is uploaded to this S3 bucket, EventBridge will trigger the rule and start our pipeline execution. We can use the lambda_handler function’s event variable to find out the exact file that was uploaded and, perhaps, incrementally train our model on just that new file. Depending on our use case, we may not want to start a new pipeline for every file uploaded to S3. However, this is a good starting point to build our own rules and triggers from many AWS services.

Ref. 281C-M

Statistical Drift Trigger

Ref. 1578-N

The more accurate our model becomes, the fewer reviews are sent to our workers. This concept is also called “active learning” and is implemented in SageMaker Ground Truth.

Ref. 0575-O

Online, or incremental, machine learning is a small subset of machine learning, and it is somewhat difficult to adapt classical offline algorithms to train effectively online. With online learning, new data is incorporated into the model without requiring a complete retrain with the full dataset

Ref. 5619-Q

In general, linear algorithms such as linear regression, logistic regression, and k-means clustering are a bit easier to train with real-time data because of the relatively simple mathematics behind them. Scikit-learn supports incremental learning using the partial_fit() functions on certain linear algorithms. Apache Spark supports streaming versions of linear regression and k-means clustering

Ref. 104B-R
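The `partial_fit()` idea can be illustrated with a toy one-feature online linear regressor updated by stochastic gradient descent. This is a sketch in the spirit of scikit-learn's incremental API, not its implementation; the streamed data below is synthetic, generated from y = 2x + 1.

```python
# Toy online (incremental) learner: one-feature linear regression where each
# new observation nudges the weights via one SGD step, with no full retrain.
class OnlineLinearRegression:
    def __init__(self, lr=0.01):
        self.w = 0.0
        self.b = 0.0
        self.lr = lr

    def partial_fit(self, x, y):
        """Single SGD step on one (x, y) observation."""
        error = (self.w * x + self.b) - y
        self.w -= self.lr * error * x
        self.b -= self.lr * error

    def predict(self, x):
        return self.w * x + self.b

model = OnlineLinearRegression(lr=0.05)
# Stream observations drawn from y = 2x + 1; the model converges as data arrives.
for _ in range(200):
    for x in (0.0, 1.0, 2.0, 3.0):
        model.partial_fit(x, 2 * x + 1)
print(round(model.w, 2), round(model.b, 2))  # close to 2.0 and 1.0
```

The simple, per-observation update rule is exactly why linear models suit streaming data; deep networks have no comparably cheap and stable single-step update.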

Since data quality is not always a priority for the upstream application teams, the downstream data engineering and data science teams need to handle bad or missing data. We want to make sure that our data is high quality for our downstream consumers including the business intelligence, ML engineering, and data science teams.

Ref. 8773-S