
How to Misuse AWS Sagemaker Notebook Instances

MrDataPsycho
Jun 5, 2022


I have been using Sagemaker as my primary dev environment for building and deploying data science products for at least two years. Working with people of different experience levels, I have noticed a lot of antipatterns that should be avoided while using Sagemaker. In the following post, I will first state each antipattern or misconception, then explain why you should avoid it and what the alternative is. Here we go!

I keep my notebook instance running 24/7 because if I shut it down, I will lose all the installed packages and environments.

This is exactly the situation where you should use a lifecycle configuration script. If you attach a configuration script to your instance once, every time you start the instance the script runs first and makes all the pre-setup available for you. Any shell-script setup can be done through a configuration script. For example:

  • Persist custom Conda environments according to your needs
  • Proxy setup for instances secured with a VPC, subnet and security group (a slightly more advanced topic)
  • Organization's git repository setup (global username and email)

The list goes on and on. I have even gone further and set up pipenv and poetry in my Sagemaker instances for MLOps tasks as an experiment.

Here is an example lifecycle configuration script, which persists any newly created conda environment inside the Sagemaker directory and also adds poetry to the PATH. (The script is borrowed from a blog post on the internet; I do not quite remember whose it was, and I later simplified it a bit.)

#!/bin/bash
set -e

# Anything under /home/ec2-user/SageMaker survives instance stops and starts
PERSISTED_CONDA_DIR="${PERSISTED_CONDA_DIR:-/home/ec2-user/SageMaker/.persisted_conda}"

echo "Setting up persisted conda environments..."
mkdir -p ${PERSISTED_CONDA_DIR} && chown ec2-user:ec2-user ${PERSISTED_CONDA_DIR}

# Abort if envs_dirs is already configured, so we do not clobber an existing setup
envdirs_clean=$(grep "envs_dirs:" /home/ec2-user/.condarc || echo "clean")
if [[ "${envdirs_clean}" != "clean" ]]; then
    echo 'envs_dirs config already exists in /home/ec2-user/.condarc. No idea what to do. Exiting!'
    exit 1
fi

echo "Adding ${PERSISTED_CONDA_DIR} to list of conda env locations"
cat << EOF >> /home/ec2-user/.condarc
envs_dirs:
  - ${PERSISTED_CONDA_DIR}
  - /home/ec2-user/anaconda3/envs
EOF

# Make poetry (installed under the persisted SageMaker directory) available on PATH
echo 'export PATH="$HOME/SageMaker/.poetry/bin:$PATH"' >> /home/ec2-user/.bashrc
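If you prefer to manage this from code, here is a minimal boto3 sketch of how such a script could be registered as an OnStart lifecycle configuration and attached to an existing notebook instance; the config name, instance name, and script file name are placeholders you would replace with your own:

from base64 import b64encode

import boto3

sm = boto3.client("sagemaker")

# Read the shell script shown above; lifecycle configs are passed base64-encoded.
with open("on_start.sh", "rb") as f:
    script_b64 = b64encode(f.read()).decode("utf-8")

# Register the script so it runs on every instance start.
sm.create_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName="persist-conda-envs",
    OnStart=[{"Content": script_b64}],
)

# Attach it to an existing notebook instance (the instance must be stopped).
sm.update_notebook_instance(
    NotebookInstanceName="my-notebook-instance",
    LifecycleConfigName="persist-conda-envs",
)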

I need a GPU instance because I want to train a Deep Learning Model.

No, we do not do that here! That is exactly why Sagemaker processing jobs and training jobs exist. You should write your code so that it can run on both CPU and GPU; it is just an if-else condition (see the short sketch after the list below). Then you develop your training pipeline on a CPU instance and launch a GPU container for full-scale training, with whatever big GPU instance you want. The benefits are:

  • You pay only for the time your training job is actually using the GPU
  • The GPU cluster is shut down automatically after the training job finishes
  • It already pushes you to write code the MLOps way, as you have to orchestrate your code so that the final model from training is saved to S3, because nothing in the training job container persists unless you deliberately save it
  • You are not bloating your Sagemaker instance with gigabytes of data, as you can pull the data from S3 while training your model inside the training job container
  • It pushes you to start with a smaller amount of data while prototyping and later move to full-scale training with a processing or training job, a good MVP practice indeed
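To make the "just an if-else condition" point concrete, here is a minimal, self-contained PyTorch sketch of device-agnostic code; the toy model and tensor are purely illustrative:

import torch
import torch.nn as nn

# The same script runs on a CPU dev instance and on a GPU training container.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 2).to(device)        # toy model, just for illustration
batch = torch.randn(8, 10, device=device)  # data created on the same device
output = model(batch)
print(f"Forward pass ran on: {device}")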

Here is a very simple price comparison, based on my own calculation, of running an instance 24/7 for a month:

# GPU Instances run 24/7
Cost of running p2.xlarge (ec2) instance is 648.0 USD/Month
Cost of running p2.8xlarge (ec2) instance is 5184.0 USD/Month
Cost of running p3.2xlarge instance is 2754.0 USD/Month
# CPU Instances run 24/7
Cost of running ml.t2.large instance is 79.92 USD/Month
Cost of running ml.t2.xlarge instance is 160.56 USD/Month

Prices are for us-east-1 from the official documentation (hourly costs collected on 05/06/2022). As you can see, running GPU instances 24/7 is a lot more expensive than running CPU instances.
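For transparency, the monthly figures above are simply the hourly on-demand rate multiplied by 720 hours (24 × 30). A tiny sketch of that back-of-the-envelope calculation, with illustrative hourly rates:

HOURS_PER_MONTH = 24 * 30  # the comparison above assumes a 30-day month

def monthly_cost(hourly_rate_usd: float) -> float:
    """Cost in USD of keeping an instance running 24/7 for a month."""
    return round(hourly_rate_usd * HOURS_PER_MONTH, 2)

print(monthly_cost(0.90))   # 648.0  -> matches the p2 row above
print(monthly_cost(0.111))  # 79.92  -> matches the ml.t2.large row above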

Here is an example of a PyTorch estimator (container) definition:

from sagemaker.pytorch import PyTorch

pytorch_estimator = PyTorch(
    entry_point="pytorch-train.py",
    role="<your_sagemaker_role>",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    framework_version="1.8.0",
    py_version="py3",
    hyperparameters={"epochs": 20, "batch-size": 64, "learning-rate": 0.1},
)

More details can be found in the Sagemaker Python SDK documentation. In the example, the instance type ml.p3.2xlarge (p3 family) is a GPU instance.
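Launching the actual training job is then a single call; the S3 path below is a placeholder, and Sagemaker downloads the channel data into the container before training starts:

# Kick off the training job on the GPU container defined above.
# The "train" channel is exposed inside the container as SM_CHANNEL_TRAIN.
pytorch_estimator.fit({"train": "s3://<your-bucket>/path/to/train-data/"})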

My attached volume is not enough, I need more disk space.

The default attached volume for a Sagemaker notebook instance is 5 GB, if I remember correctly, which might not be enough; you might need 15 GB. But if you say you need 80 GB of disk space, then you are intending to run full-scale training on your instance. I would suggest you use a training job instead and stop bloating your instance with large volumes of data.

I need a better instance with more CPU and RAM.

The answer is similar: you should be using one of the cheap instances while developing your code for any ML product. When it is time for full-scale training, you launch your processing or training job container on a high-end CPU and RAM instance. After the job, the container is destroyed automatically, and you do not need to think about shutting down an expensive instance.

Here is an example configuration of a training job container with Scikit-Learn, but nothing stops you from defining your own container with PyTorch or TensorFlow. You can read more details about processing jobs in the official documentation.

from sagemaker.sklearn.estimator import SKLearn

sklearn_estimator = SKLearn(
    entry_point="train.py",
    framework_version="0.20.0",
    instance_type="ml.m5.xlarge",
    role="<your_sagemaker_role>",
)

As you can see, you can pick any instance for your processing or training job; have a look at the Amazon SageMaker pricing page for instance prices. This example uses an ml.m5.xlarge instance.

I need the latest and greatest Python version for my project. So I need a Python 3.10 conda environment.

If you are developing code for deploying an ML model via Lambda or Sagemaker, then you might get into trouble during deployment if the Python version of the Lambda runtime or the Sagemaker API does not match the Python version you used during development. Some syntax from Python 3.10 is not available in Python 3.7 or 3.8. So choose the Python version wisely and pick one that is compatible with the following environments:

  • Lambda Python Versions
  • Sagemaker Processing Job, Training Job Versions
  • Sagemaker API Python Versions

At the time of writing, Python 3.8 is supported by all of these services and looks like a good common choice.
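If you want to catch a mismatch early, a small guard at the top of your training or inference script can fail fast when the dev environment drifts from the deployment runtime; Python 3.8 is used here only because it is the common choice mentioned above:

import sys

REQUIRED = (3, 8)  # the runtime version you deploy with (assumed here to be 3.8)

if sys.version_info[:2] != REQUIRED:
    raise RuntimeError(
        f"Expected Python {REQUIRED[0]}.{REQUIRED[1]}, "
        f"got {sys.version_info.major}.{sys.version_info.minor}"
    )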

Beginner Anti-Patterns:

I saved some files in the home directory, and after I shut down the instance, those files are gone.

The plain and simple explanation is: any file under the `/home/ec2-user/SageMaker` directory persists across restarts; everything else is deleted after you shut down the instance. So your git repositories and other config data must be saved in the SageMaker directory.

I will commit my code later; I hope it will be safe in my Sagemaker instance.

The persistence of the volume (the SageMaker folder) is not 100% guaranteed; at least, I have not seen any confirmation from AWS saying that. Correct me if I am wrong, but my understanding is that the 5 GB volume attached to the Sagemaker instance is an SSD-backed disk volume given to you. The disk could become defective and AWS could replace it with a new one, and in such cases I am not quite sure whether your data would be moved to the new disk. So your code should always be in git, and your data should always be persisted in S3.
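As a reminder of how little code that takes, here is a minimal boto3 sketch for persisting a local file to S3; the bucket and key are placeholders:

import boto3

s3 = boto3.client("s3")

# Copy a local artifact from the persisted SageMaker directory to S3.
s3.upload_file(
    Filename="/home/ec2-user/SageMaker/data/train.csv",
    Bucket="<your-bucket>",
    Key="projects/my-project/data/train.csv",
)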

There are a lot of other minor antipatterns I will skip here. Before I conclude, I would like to mention another misconception I faced at one of my workplaces: that Sagemaker is not good enough for large-scale training. So what the data scientists did was provision an EC2 instance with a GPU attached, and for every project we would SSH into that instance and develop the code there. The problems are:

  • The EC2 instance runs 24/7 even when no one is using the GPU during code development, which costs more than a CPU instance
  • When someone is running a test run or full-scale training, everyone else has to wait for the GPU to be free

What is the alternative solution?

You can still use a cheap EC2 CPU instance for code development; then you can have a Sagemaker execution role with the required policies attached so that it can launch a processing or training job, and your EC2 instance role can be given a policy that lets it pass or assume that Sagemaker role as a proxy. Thus, from your cheap EC2 instance, you can launch a Sagemaker training job with a GPU as a separate entity (see the sketch after the list below). The benefits are:

  • The cost will be a lot less, because the GPU instance is launched only during full-scale training
  • No one needs to wait for the GPU to be free, as each job provisions its own GPU instance
  • No need to monitor or panic that a GPU EC2 instance is running 24/7
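Here is a minimal sketch of what launching such a job from the EC2 box could look like, assuming the EC2 instance profile is allowed to pass the Sagemaker execution role; the role ARN, bucket, and script name are placeholders:

import boto3
import sagemaker
from sagemaker.pytorch import PyTorch

# Execution role that Sagemaker assumes to run the job (placeholder ARN).
sagemaker_role = "arn:aws:iam::<account-id>:role/<your-sagemaker-execution-role>"

session = sagemaker.Session(boto3.Session(region_name="us-east-1"))

estimator = PyTorch(
    entry_point="pytorch-train.py",
    role=sagemaker_role,
    instance_type="ml.p3.2xlarge",   # the GPU exists only for the duration of the job
    instance_count=1,
    framework_version="1.8.0",
    py_version="py3",
    sagemaker_session=session,
)

# The GPU container is provisioned, trains, saves the model to S3, and shuts down.
estimator.fit({"train": "s3://<your-bucket>/train/"})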

I hope this post helps new users of Sagemaker notebook instances and saves their organizations some cost. Let me know if you find a mistake or think something in the post is not quite right.
