AWS SageMaker

09 June, 2021

Managed Spot Training

Training and development happen on different machines, so you need to 'Upload to S3' to make the data available to the training job.
You can provision Spot Instances for training.
The training job and the notebook live on different instances, so the two are decoupled. In practice, you can start the training and close the notebook: the training will continue, and you can monitor it from the console or via direct API requests.
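As a rough illustration of monitoring via API requests, the sketch below polls a job until it reaches a terminal status. It assumes the status is read from the `TrainingJobStatus` field of a DescribeTrainingJob response; the `describe` parameter and the job name are hypothetical and only there so the loop can be tried without AWS credentials.

```python
import time
from typing import Callable, Dict

# Terminal values of TrainingJobStatus as reported by DescribeTrainingJob.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(
    job_name: str,
    describe: Callable[[str], Dict],
    poll_seconds: float = 30.0,
    max_polls: int = 1000,
) -> str:
    """Poll until the job reaches a terminal status and return that status.

    `describe` is a hypothetical injection point: any callable that returns
    a DescribeTrainingJob-style dict for the given job name.
    """
    for _ in range(max_polls):
        status = describe(job_name)["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} still running after {max_polls} polls")


# With boto3 installed and credentials configured, the wiring could look like:
#   import boto3
#   sm = boto3.client("sagemaker")
#   wait_for_training_job(
#       "my-job", lambda name: sm.describe_training_job(TrainingJobName=name))
```

Because the loop only talks to the API, it can run from any machine, not just the notebook that started the job.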
A training job will be interrupted if its Spot Instance is deprovisioned. SageMaker manages the reprovisioning of an instance when one becomes available, and you can configure the max wait time, among other settings.
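A minimal sketch of the Spot-related settings, assuming the job is created through the CreateTrainingJob API (for example via boto3's `create_training_job`). Only the Spot fields are shown; the helper name `spot_training_settings` and the bucket in the usage line are hypothetical, and the remaining required request fields (algorithm, role, resources, output location) are left out.

```python
from typing import Any, Dict


def spot_training_settings(
    max_run_seconds: int,
    max_wait_seconds: int,
    checkpoint_s3_uri: str,
) -> Dict[str, Any]:
    """Build the Spot-related part of a CreateTrainingJob request.

    The max wait time covers waiting for Spot capacity plus the run itself,
    so it must be at least as large as the max runtime.
    """
    if max_wait_seconds < max_run_seconds:
        raise ValueError("max_wait_seconds must be >= max_run_seconds")
    return {
        "EnableManagedSpotTraining": True,
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_run_seconds,
            "MaxWaitTimeInSeconds": max_wait_seconds,
        },
        # Checkpoints written to LocalPath are synced to S3Uri, so a job that
        # is restarted after an interruption can pick them up again.
        "CheckpointConfig": {
            "S3Uri": checkpoint_s3_uri,
            "LocalPath": "/opt/ml/checkpoints",
        },
    }


# Hypothetical bucket name, for illustration only.
settings = spot_training_settings(3600, 7200, "s3://example-bucket/checkpoints/")
```

The resulting dict would be merged into the full request, e.g. `create_training_job(**base_request, **settings)`, where `base_request` is whatever the rest of the job definition looks like.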
Checkpointing depends on the framework. It makes sense for long training jobs; short training jobs, such as typical sklearn models, will most likely not last long enough to be interrupted.
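Since the details vary by framework, the following is only a framework-agnostic sketch of the save-and-resume pattern, using JSON files as stand-in checkpoints. All function names and the `epoch-NNNN.json` naming are assumptions made for illustration; in a real job the framework's own save/load calls would replace the JSON handling, and the directory would be the job's local checkpoint path.

```python
import json
import os
from typing import Optional


def save_checkpoint(checkpoint_dir: str, epoch: int, state: dict) -> str:
    """Write one checkpoint file per epoch, atomically."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    path = os.path.join(checkpoint_dir, f"epoch-{epoch:04d}.json")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)
    os.replace(tmp_path, path)  # avoid leaving a half-written checkpoint
    return path


def load_latest_checkpoint(checkpoint_dir: str) -> Optional[dict]:
    """Return the newest checkpoint, or None when starting from scratch."""
    if not os.path.isdir(checkpoint_dir):
        return None
    names = sorted(
        n for n in os.listdir(checkpoint_dir)
        if n.startswith("epoch-") and n.endswith(".json")
    )
    if not names:
        return None
    with open(os.path.join(checkpoint_dir, names[-1])) as f:
        return json.load(f)


def train(checkpoint_dir: str, total_epochs: int) -> dict:
    """Resume from the latest checkpoint if one exists, then keep training."""
    latest = load_latest_checkpoint(checkpoint_dir)
    start_epoch = latest["epoch"] + 1 if latest else 0
    state = latest["state"] if latest else {"steps": 0}
    for epoch in range(start_epoch, total_epochs):
        state["steps"] += 1  # stand-in for one epoch of real training
        save_checkpoint(checkpoint_dir, epoch, state)
    return state
```

If the job is interrupted and restarted, calling `train` again with the same directory skips the epochs that already have a checkpoint instead of starting over.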
Helper code is code that helps to build the machine learning solution on SageMaker, for example data extraction, transformation, image model etc. Sometimes, during ML deployment, it's not only the machine learning code that you want to include, but other code as well.
SageMaker works on the principle that you run training/processing/batch jobs on separate infrastructure from the notebook itself. This helps you save costs by keeping your notebook infrastructure modest and only paying for the seconds of time (e.g. GPU, big RAM) that you need. It also means the individual jobs need to access the data from somewhere, and for configurability and cost the standard option for this is S3. So when we download data extracts to our notebook, it's really just for interactive exploration, debugging, and similar work that we're not ready to run as a codified processing job yet.
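To make the "jobs read from S3" point concrete, here is a small sketch of one input channel as it would appear in the `InputDataConfig` list of a CreateTrainingJob request. The helper name `s3_input_channel` and the bucket and prefix are hypothetical; the field names are those of the API.

```python
from typing import Any, Dict


def s3_input_channel(channel_name: str, s3_uri: str) -> Dict[str, Any]:
    """Describe one S3 prefix as a named input channel for a training job."""
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",  # use every object under the prefix
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


# Hypothetical bucket and prefix, for illustration only.
train_channel = s3_input_channel("train", "s3://example-bucket/data/train/")
```

The notebook never has to hold this data; the job resolves the channel against S3 when it starts.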

Quick Tutorial

https://aws.amazon.com/getting-started/hands-on/build-train-deploy-machine-learning-model-sagemaker/
