Data Science and Cloud Technology (Amazon Web Services)
Aakash Tandel, Former Data Scientist
Cloud technology is a rapidly evolving and exciting component of modern technology applications.
As a data scientist, you can use the cloud to access highly capable virtual machines, integrate with cloud-based digital products, and hand your data engineering needs off to managed services. For these reasons and more, data scientists stand to gain from a more-than-basic understanding of cloud technology.
In this article, I’ll cover:
- The benefits of cloud computing
- A data scientist’s favorite AWS services (my opinion)
- Additional (maybe unfamiliar) cloud-related topics
For the remainder of this article, I’ll only talk about Amazon Web Services (AWS) because I have hands-on experience working with Amazon’s platform. That said, Google Cloud Platform (GCP) and Azure are great alternatives to AWS, so use whichever cloud provider you and your organization plan to use in the future.
The benefits of cloud computing
Point 1: Access to virtual machines with large compute capacities can speed up workflow.
With Amazon Elastic Compute Cloud (EC2), you can spin up highly capable virtual machines in seconds, speeding up your workflow and resulting in decreased processing time. You don’t need to train your neural network over the weekend or take a coffee break every time you tune your model. More capable machines mean you can iterate more often, giving you the flexibility to experiment in development environments and tune your models to near-perfection. You can stop running algorithms locally on your MacBook Pro and instead utilize the enormous compute capacity within the cloud.
Additionally, Amazon SageMaker is a fully managed service that makes launching a Jupyter Notebook in the cloud extremely easy. All you need to do is choose the right notebook instance for your use case, open your notebook, and begin writing code. AWS handles the compute; you just supply the code in the Jupyter Notebook environment you are already familiar with. With cloud-based resources, you will never be limited by your local computer's compute capacity.
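If you'd rather script that setup than click through the console, boto3 (AWS's Python SDK) can create the notebook instance for you. Below is a minimal sketch; the instance name, instance type, and IAM role ARN are placeholders you would swap for your own.

```python
import boto3

# Minimal sketch: launch a SageMaker notebook instance with boto3.
# The name, instance type, and role ARN below are placeholders.
sagemaker = boto3.client("sagemaker", region_name="us-east-1")

sagemaker.create_notebook_instance(
    NotebookInstanceName="my-analysis-notebook",  # hypothetical name
    InstanceType="ml.t3.medium",                  # size it for your workload
    RoleArn="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder role
)

# Poll until the instance is InService, then open it from the SageMaker
# console or via create_presigned_notebook_instance_url().
status = sagemaker.describe_notebook_instance(
    NotebookInstanceName="my-analysis-notebook"
)["NotebookInstanceStatus"]
print(status)
```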
Point 2: Integrating machine learning components into an AWS-based application is straightforward.
At Viget, we have teams of front-end developers, back-end developers, DevOps professionals, UXers, and designers building apps from the ground up. Our engineering teams leverage resources and technologies within AWS to move, parse, transform, warehouse, aggregate, query, and analyze data. Our data team needs to know how these applications work if they are going to deploy machine learning models into production environments. If your development team is already using the cloud to design, develop, and launch digital products, you can easily access the databases, data warehouses, or files you need to conduct analysis and run models. Adding machine learning components to an already AWS-based application is straightforward.
For example, say your development team is working on an ecommerce site. They are already sending server logs to a Kinesis Data Stream and then to a data lake in an S3 bucket. You, the data scientist, can add a product recommendation model to the ecommerce website by using Amazon EMR. This can both improve the customer’s experience on the site and increase sales revenue for the ecommerce company.
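To make that pipeline a little more concrete, here is a minimal sketch of how an application might push a clickstream event into a Kinesis Data Stream with boto3. The stream name and event fields are hypothetical.

```python
import json
import boto3

# Minimal sketch: send one clickstream event to a Kinesis Data Stream.
# "ecommerce-clickstream" and the event fields are hypothetical.
kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {
    "user_id": "u-1234",
    "event_type": "product_view",
    "product_id": "sku-5678",
}

kinesis.put_record(
    StreamName="ecommerce-clickstream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],  # distributes records across shards
)
```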
Point 3: Managed services make data engineering more accessible to technologists.
For organizations that don’t have data engineers on staff, the data scientist may be required to develop pipelines, run ETL jobs, or land data in a data lake. Data scientists less familiar with data engineering or DevOps practices can use managed services – like AWS Glue – to handle much of that work. In fact, there is a whole suite of services that bridges the gap for data scientists without a background in computer science or engineering, which leads us to the next section.
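As one hedged illustration of how lightweight that can be: once a Glue ETL job has been defined (in the console or with infrastructure-as-code), kicking it off and checking its status from Python takes only a few lines. The job name below is a placeholder.

```python
import boto3

# Minimal sketch: start an existing AWS Glue ETL job and check on it.
# "nightly-clickstream-etl" is a placeholder for a job you have defined.
glue = boto3.client("glue", region_name="us-east-1")

run = glue.start_job_run(JobName="nightly-clickstream-etl")

status = glue.get_job_run(
    JobName="nightly-clickstream-etl",
    RunId=run["JobRunId"],
)["JobRun"]["JobRunState"]
print(status)  # e.g. RUNNING, SUCCEEDED, FAILED
```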
A data scientist’s favorite AWS services (my opinion)
I am going to skip over EC2. The virtual machine is the bread and butter of AWS compute power, and for the vast majority of folks, EC2 is a familiar tool whose benefits are self-evident. I will also skip over Amazon SageMaker’s notebook functionality; we already talked about how you can use a Jupyter Notebook in the cloud and leverage massive compute capacity (only if necessary, of course).
The bullet points below may be an aggressive introduction for the uninitiated cloud user. Go ahead and skim the bullets if you find them overly detailed or if you are unfamiliar with some of the technical jargon. Alternatively, if you want a deeper dive into any of the services, listen to the re:Invent Deep Dives on YouTube. (re:Invent is AWS’s yearly conference.) Listening to Deep Dives gave me a refresher on the latest and greatest from AWS and helped me pass AWS certification exams.
Before we jump into my favorite AWS tools, I want to add that AWS changes rapidly. If anything below is outdated, feel free to leave me a comment and I’ll make the necessary changes.
- If you need highly available and insanely durable (99.999999999% durability for S3 Standard) object-based storage, use Simple Storage Service (S3). S3 lets you store an unlimited amount of data as objects in region-specific buckets that share a universal namespace. S3 is object-based storage for flat files; if you need block-based storage, look into Amazon Elastic Block Store. You can configure versioning, server access logging, static website hosting, encryption, and other features at the bucket level and control access with bucket policies; more granularly, access control lists let you manage permissions on individual objects. S3 comes in a variety of storage classes that let you pay less for data that is accessed infrequently, and if you need object-based archival storage, look into Glacier. S3 also has a distributed data-store architecture, so your objects are redundantly stored across multiple AWS Availability Zones. You can put Amazon CloudFront – AWS’s content delivery network (CDN) – in front of S3 to reduce access latency, or enable Transfer Acceleration to reduce upload time for objects in your bucket. Lastly, S3 can provide a centralized data lake, leveraging all of the benefits of S3 such as scalability, availability, and durability. (A minimal usage sketch appears after this list.)
- AWS offers services for columnar, document, graph, and in-memory key-value non-relational databases. DynamoDB is AWS’s fully managed NoSQL document and key-value database. DynamoDB is fast and great for low-latency, high-throughput applications. It has push-button scaling, allowing you to increase or decrease your read/write throughput easily. If you are working with semi-structured data in JSON or XML, DynamoDB can be a great service to leverage. (See the DynamoDB sketch after this list.)
- Amazon Kinesis comes in a variety of flavors – Kinesis Data Streams, Kinesis Data Firehose, and Kinesis Data Analytics – each with its own use cases. In general, Kinesis is a fully managed service used to collect streaming or real-time data, and it is great at ingesting high volumes of data. Kinesis Data Streams uses shards to process the data stream – within a specified retention period – and can send data to services like DynamoDB, S3, EMR, or Redshift. Kinesis Data Firehose is primarily used for data ingestion and delivery to destinations like S3, Redshift, or Splunk. There are many big data use cases for Kinesis: Netflix uses it to monitor and analyze the massive amounts of data coming from its AWS Virtual Private Cloud (VPC) flow logs, and you can use it to ingest data from your IoT devices or run analytics queries on streaming event data from your video game.
- Amazon Redshift is AWS’s data warehouse. Querying a Redshift data warehouse is fast thanks to its columnar storage and massively parallel processing. Columnar storage drastically reduces overall disk I/O and the amount of data you need to load from disk, and Redshift uses advanced compression, automatically choosing the best compression scheme to keep queries fast. Redshift is a great option for your OLAP database. A Redshift cluster can be configured with a single node or multiple nodes (a minimum of two nodes is recommended), uses replication and continuous backups to enhance availability, and keeps three copies of your data. Redshift is commonly used to warehouse business analytics data.
- Elastic MapReduce (EMR) makes it easy to use distributed frameworks like Hadoop, Apache Spark, and HBase. Whether you are processing log data, predicting the stock market using real-time data from Kinesis (ha), or running ETL on a large amount of data, EMR can simplify these big data processing problems. If you build the analytical portion of a streaming data application around the combination of Kinesis and Amazon EMR, you’ll benefit from the fully managed nature of both services.
- AWS Lambda is AWS’s serverless compute service, which lets you run function-based jobs. It can be a great tool in your toolbelt, especially for the random, rarely occurring tasks that crop up in your work. For example, we run a cron job on Lambda to initiate an EMR job. We could run the cron job on an EC2 instance, but then we’d be paying for compute capacity when the code isn’t running, which is the vast majority of the time. You can author Lambda functions in Python, so you don’t have to learn a new language to use the service, and there are a variety of events you can use to trigger Lambda functions. (See the Lambda sketch after this list.)
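As promised in the S3 bullet, here is a minimal sketch of everyday S3 usage from Python; the bucket and key names are placeholders.

```python
import boto3

# Minimal sketch: upload a flat file to S3 and read it back.
# The bucket and key names are placeholders.
s3 = boto3.client("s3")

s3.upload_file(
    Filename="daily_sales.csv",
    Bucket="my-data-lake-bucket",
    Key="raw/sales/2019/daily_sales.csv",
)

obj = s3.get_object(
    Bucket="my-data-lake-bucket",
    Key="raw/sales/2019/daily_sales.csv",
)
print(obj["Body"].read()[:100])  # peek at the first bytes of the object
```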
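For the DynamoDB bullet, here is a similarly minimal sketch of writing and reading a semi-structured item with the boto3 resource API. The table name and attributes are hypothetical, and the table is assumed to already exist with user_id as its partition key.

```python
import boto3

# Minimal sketch: write and read a semi-structured item in DynamoDB.
# "user-sessions" and its attributes are hypothetical; the table must
# already exist with "user_id" as its partition key.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user-sessions")

table.put_item(Item={
    "user_id": "u-1234",
    "last_seen": "2019-06-01T12:00:00Z",
    "cart": ["sku-5678", "sku-9012"],  # nested, semi-structured data is fine
})

item = table.get_item(Key={"user_id": "u-1234"})["Item"]
print(item["cart"])
```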
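And for the Lambda bullet: a Lambda function is just a handler. The sketch below shows the kind of cron-triggered function we use to kick off an EMR step; the cluster ID and step definition are hypothetical, and in practice the trigger would be a CloudWatch Events (EventBridge) schedule rather than a manual invocation.

```python
import boto3

# Minimal sketch of a Lambda handler that adds a step to an existing
# EMR cluster. The cluster ID and step definition are hypothetical.
emr = boto3.client("emr")

def handler(event, context):
    response = emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
        Steps=[{
            "Name": "nightly-aggregation",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/jobs/aggregate.py"],
            },
        }],
    )
    return {"step_ids": response["StepIds"]}
```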
AWS also offers a variety of Artificial Intelligence services like Amazon Comprehend, Amazon Polly, and Amazon Forecast. These services make machine learning more accessible to novice data scientists and developers. Honestly, I haven’t experimented with these fully managed services enough to determine whether or not I would use them in my day-to-day. But I wanted to mention them at least once in this article.
Additional (maybe unfamiliar) cloud-related topics
I was pretty unfamiliar with networking, security, backup, reliability, and disaster recovery when I first started learning about cloud technology. These topics are rarely brought up in data science circles, but if you’re going to use the cloud, I highly recommend getting a crash course in each subject.
- Networking within the cloud: In AWS, basic “networking” means having a solid grasp of Virtual Private Clouds and the many components under the VPC umbrella, such as security groups, network access control lists, route tables, internet gateways, NAT gateways, egress-only internet gateways, virtual private networks, and customer gateways. My background in statistics left me perplexed as to what a LAN was and why network admins kept talking about CIDR. But gaining a solid grasp of networking in the cloud made integrating with cloud-based digital products significantly easier. The services you spin up need to be able to communicate with one another, and that’s where networking comes in. At this point, I’m not qualified to step into the role of network administrator, but I have a basic grasp of networking: I know enough to configure VPCs that allow me to do my work, and I know when to ask my DevOps coworkers for help.
- Security in the cloud: Even if you aren’t dealing with personally identifiable information, health records, or financial data, you should be concerned with the security of your cloud environment. AWS’s shared responsibility model essentially says that AWS takes care of security of the cloud and you take care of security in the cloud. Data scientists should understand the basics of AWS’s Identity and Access Management (IAM), follow security best practices like the principle of least privilege, and teach their data teams to do the same. Lastly, one crucial component of security is encryption. Many of the services listed above use AWS Key Management Service (KMS) to encrypt data. Depending on your data stores and the sensitivity of your data, think about encrypting data in transit and at rest. (A small encryption sketch follows this list.)
- Backup, reliability, and disaster recovery: Data is valuable and if it’s lost it can be crippling to your organization. Data scientists should develop disaster recovery plans for important data stores and applications under their purview. Cloud providers like AWS have great reliability and data durability. Utilize their resources to make sure your team can continue functioning in the event of a database outage or disaster.
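As a small example of encryption at rest, asking S3 to apply server-side encryption with a KMS key can be a single call. The sketch below is hedged: the file, bucket name, and KMS key ID are placeholders.

```python
import boto3

# Minimal sketch: write an object to S3 with server-side encryption
# using a customer-managed KMS key. The file, bucket name, and KMS
# key ID are placeholders.
s3 = boto3.client("s3")

with open("patient_outcomes.csv", "rb") as f:
    s3.put_object(
        Bucket="my-sensitive-data-bucket",
        Key="encrypted/patient_outcomes.csv",
        Body=f,
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="1234abcd-12ab-34cd-56ef-1234567890ab",
    )
```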
What is a good way to learn more about cloud computing?
AWS, GCP, and Azure have free tiers and promotional credits that allow you to get familiar with their tools. Get your hands dirty with services that interest you and your company. (But be careful not to rack up an enormous bill from your explorations. Not all services are available in the free tier.)
I’m a big fan of IT certifications. Consider getting certified with AWS! The Certified Cloud Practitioner certification is an entry-level certification for the AWS platform. There are a plethora of resources online that teach AWS principles and help you pass certification exams. I have used A Cloud Guru, Linux Academy, and courses on Udemy to study for AWS certifications.
Leave a comment below about how you leverage cloud technology in your day-to-day.