Security and Cloud 24/7

Archive for the ‘Cloud computing’ Category

Trade-Offs When Designing Workloads in AWS

February 19th, 2024

Author: Eyal Estrin

When designing architectures for modern applications in the cloud, we have multiple ways to achieve similar goals.

No matter which alternative we choose (selecting a service, an architecture, or even a pricing option), it forces us to understand the trade-offs we are making in our decisions.

To decide on the best option, we need to evaluate things such as the cost to implement and operate, the solution’s resiliency, the learning curve, and perhaps even vendor lock-in (i.e., the ability to migrate a service between different cloud providers).

In this blog post, we will review some of the most common trade-offs organizations make when using the public cloud, specifically AWS services.

Buy vs. Build

One of the most common debates in many organizations is whether to buy a service or build it in-house.

For organizations with mature development teams, building their own solutions (from a single-purpose service to a full-blown application) might be the better alternative, provided the build process does not consume too much time and effort.

For many organizations, buying a service (such as a SaaS application or a managed service) might be the better alternative: the cloud provider is responsible for the scale and maintenance of the service itself, so the organization can simply consume the service and invest its time and effort in its business goals.

On-Premise vs. the Public Cloud

The cloud allows us to build modern highly scalable and elastic workloads, using the most up-to-date hardware, with a pay-as-you-go pricing option.

As much as the cloud offers efficient ways to run applications and lets us switch from running servers to consuming services, adopting it involves a learning curve for many organizations.

If an organization is still running legacy applications, sometimes with dedicated hardware or license requirements, or if regulatory or data residency requirements mandate keeping data in a specific country, on-premise might be the better alternative.

Multi-AZ vs. Multi-Region

Depending on the application’s business requirements, a workload can be deployed in a resilient architecture, spread across multiple AZs or multiple regions.

By design, most AWS services are bound to a specific region and can be deployed in a multi-AZ architecture (for example, Amazon S3, Amazon RDS, and Amazon EKS).

When designing an architecture, we may want to consider multi-region architecture, but we need to understand:

Most services are limited to a specific region, and cannot be replicated outside the region.
Some services can be replicated to other regions (such as Amazon S3 cross-region replication, Amazon RDS cross-region read replica, etc.), but the other replicas will be read-only, and in case of failure, you will need to design a manual switch between primary and secondary replicas.
Multi-region architecture increases the overall workload’s cost, and naturally, the complexity of designing and maintaining such architecture.
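
To put rough numbers on the resiliency side of this trade-off, the sketch below estimates the availability of a workload deployed as several redundant copies (AZs or regions). It is only an illustration: it assumes the copies fail independently of each other, and the 99.9% per-copy figure is a made-up example, not a number published by AWS.

```python
def combined_availability(per_copy: float, copies: int) -> float:
    """Availability of `copies` redundant deployments, assuming independent failures.

    The workload is considered down only when every copy is down at the same time.
    """
    if not 0.0 <= per_copy <= 1.0:
        raise ValueError("per_copy must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - per_copy) ** copies


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per 365-day year for a given availability."""
    return (1.0 - availability) * 365 * 24 * 60


if __name__ == "__main__":
    for copies in (1, 2, 3):
        a = combined_availability(0.999, copies)  # 0.999 is a hypothetical figure
        print(f"{copies} copies: {a:.9f} ({downtime_minutes_per_year(a):.2f} min/year)")
```

Each extra copy reduces the expected downtime, but it also multiplies the infrastructure to pay for and maintain, which is exactly the trade-off described above.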

Amazon EC2 vs. Containers

For most legacy or “lift & shift” applications, choosing an EC2 instance is the easiest way to run applications: customers have full control over the content of the EC2 instance, with the same experience they were used to on-prem.

Although developing and packaging applications inside containers involves a learning curve, containers offer better horizontal scaling, better use of the underlying hardware, and easier upgrades. With immutable infrastructure (where no session information is stored inside the container), an upgrade is simply a matter of replacing one container image with a newer version.

Amazon ECS vs. Amazon EKS

Both are managed orchestrators for running containers, and both are fully supported by AWS.

Amazon ECS can be a better alternative for organizations looking to run workloads with predictable scaling patterns, and it is easier to learn and maintain, compared to Amazon EKS.

Amazon EKS offers a full-blown managed Kubernetes service for organizations that wish to deploy their applications on top of Kubernetes. As with any Kubernetes deployment, it takes time for teams to learn how to deploy and maintain Kubernetes clusters, due to the large number of configuration options.

Containers vs. AWS Lambda

Both alternatives offer organizations the ability to run production applications in a microservice architecture.

Containers allow development teams to develop their applications anywhere (from the developer’s IDE to running an entire development in the cloud), push container images to a container registry, and run them on any container environment (agnostic to the cloud provider’s ecosystem).

Containers also allow developers interactive shell access (for example, over SSH) to running containers, mostly for small-scale troubleshooting.

AWS Lambda runs in a fully managed environment, where AWS takes care of the scale and maintenance of the underlying infrastructure, while developers focus on developing Lambda functions.

Although AWS allows customers to package their code as container images and run them in the Lambda serverless environment, Lambda is considered a form of vendor lock-in, since it cannot run outside the AWS ecosystem (i.e., on other cloud providers).

AWS Lambda does not allow customers access to the underlying infrastructure, and each invocation is limited to a maximum of 15 minutes, which makes Lambda unsuitable for long-running jobs.
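
One common way to live with the invocation time limit is to have the function watch its own time budget and stop cleanly before it is cut off. Below is a minimal sketch of that idea. The `get_remaining_time_in_millis()` method is provided by the Lambda Python runtime on the context object; everything else (the `items` key in the event, the `process` function, and the 10-second safety margin) is a hypothetical placeholder for illustration.

```python
SAFETY_MARGIN_MS = 10_000  # hypothetical buffer kept for cleanup before the timeout


def process(item):
    """Placeholder for the real unit of work (hypothetical)."""
    return item * 2


def handler(event, context):
    """Process items until the invocation gets close to its time limit.

    Returns the results plus any items left over, so the caller (or a
    follow-up invocation) can continue from where this one stopped.
    """
    pending = list(event.get("items", []))
    results = []
    while pending:
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            break  # stop early instead of being cut off mid-item
        results.append(process(pending.pop(0)))
    return {"results": results, "remaining": pending}
```

If the remaining list keeps coming back non-empty, that is usually a sign the job belongs on containers or another long-running compute option instead.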

On-demand vs. Spot vs. Savings Plans

AWS offers various alternatives to pay for running compute services (from EC2 instances, ECS or EKS, Lambda, Amazon RDS, and more).

Each alternative is slightly better for different use cases:

On-demand – Useful for unpredictable workloads such as development environments (may be running for an entire month, or a couple of hours)
Spot – Useful for workloads that can survive sudden interruption, such as loosely coupled HPC workloads, or stateless applications
Savings Plans – Useful for workloads that are expected to run for a long period (a 1- or 3-year commitment), with the flexibility to change instance types as needs evolve
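
The choice between paying on demand and committing in advance comes down to how many hours the workload actually runs. The sketch below computes that break-even point under a deliberately simplified model: a commitment-based option (such as a Savings Plan) is treated as a discounted hourly rate that is billed for every hour of the month, used or not. The hourly rates in the usage example are made-up numbers, not AWS prices, and Spot pricing is left out because it varies over time.

```python
HOURS_PER_MONTH = 730  # common approximation of the hours in a month


def on_demand_cost(hours_used: float, hourly_rate: float) -> float:
    """Pay only for the hours the instance actually runs."""
    return hours_used * hourly_rate


def committed_cost(discounted_hourly_rate: float) -> float:
    """A commitment is billed for every hour of the month, used or not."""
    return HOURS_PER_MONTH * discounted_hourly_rate


def break_even_hours(on_demand_rate: float, discounted_rate: float) -> float:
    """Monthly usage above which the commitment becomes the cheaper option."""
    if on_demand_rate <= 0:
        raise ValueError("on_demand_rate must be positive")
    return committed_cost(discounted_rate) / on_demand_rate


if __name__ == "__main__":
    # Hypothetical rates: 0.10 per hour on demand, 0.07 per hour when committed.
    print(f"Break-even at {break_even_hours(0.10, 0.07):.0f} hours per month")
```

A development environment that runs a few hours a day stays well below such a break-even point, which is why on-demand fits it better.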

Amazon S3 lifecycle policies vs. Amazon S3 Intelligent-Tiering

When designing persistent storage solutions for workloads, AWS offers various storage tiers for storing objects inside Amazon S3 – from the standard tier to an archive tier.

Amazon S3 allows customers efficient ways to store objects:

Amazon S3 lifecycle policies – Allow customers to set up rules for moving objects from a real-time tier to an archive tier, according to the object’s age (the number of days since it was created). It is a useful one-way solution, but it requires customers to define the rules themselves. Useful for expected and predictable data access patterns.
Amazon S3 Intelligent-Tiering – Monitors each object’s access patterns and automatically moves objects between tiers (from real-time to archive and vice versa). Useful for unpredictable data access patterns.
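
To make the lifecycle option more concrete, here is a sketch of what such a rule set can look like. S3 Lifecycle transitions are expressed as a number of days since the object was created, and the dictionary below follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts for its `LifecycleConfiguration` argument. The rule ID, the `logs/` prefix, and the day thresholds are hypothetical, and the helper function is only a simplified local model for reasoning about a rule, not something S3 provides.

```python
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-old-logs",       # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},  # hypothetical prefix
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}


def storage_class_for_age(rule: dict, age_days: int) -> str:
    """Return the storage class an object of the given age would be in.

    Simplified model of a single rule: it assumes the object starts in
    STANDARD and ignores versioning, minimum object sizes, and the time
    of day at which S3 applies the rules.
    """
    expiration = rule.get("Expiration", {}).get("Days")
    if expiration is not None and age_days >= expiration:
        return "EXPIRED"
    current = "STANDARD"
    for transition in sorted(rule.get("Transitions", []), key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current
```

The rules are static, which is why this approach suits predictable access patterns; with Intelligent-Tiering there are no such thresholds to define.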

NAT Gateway vs. NAT Instances

When a service in a private subnet requires access to resources outside its subnet (for example the public Internet), we need to configure one of the NAT alternatives:

NAT Gateway – A fully managed NAT service with automatic scaling, high availability, and high performance, but at a higher cost than a NAT instance.
NAT Instance – An EC2 instance, based on a generic AMI, configured to provide NAT capabilities. It costs only as much as the EC2 instance (cheaper than a NAT Gateway), but it requires customer maintenance (such as patching, manually building resiliency, and manually selecting the instance family and size), and its network bandwidth is limited by the instance type.

If an organization knows how to automate the deployment and maintenance of NAT instances, it can use this alternative and save costs; otherwise, NAT Gateway is a much more resilient alternative.

Summary

Every architectural design decision has its trade-offs.

In many cases, you will have more than a single solution for the same challenge, and you need to weigh the costs and benefits of each alternative, as we showed in this blog post.

We need to understand the implications and consequences of our decisions to be able to prioritize our options.

Reference

AWS re:Invent 2023 – Advanced integration patterns & trade-offs for loosely coupled systems

About the Author

Eyal Estrin is a cloud and information security architect, and the author of the book Cloud Security Handbook, with more than 20 years in the IT industry. You can connect with him on Twitter.

Opinions are his own and not the views of his employer.

Posted in AWS, Cloud computing

Building Resilient Applications in the Cloud

January 15th, 2024

Author: Eyal Estrin

When building an application for serving customers, one of the first questions raised is: “How do I know whether my application is resilient and will survive a failure?”

In this blog post, we will review what it means to build resilient applications in the cloud, and we will review some of the common best practices for achieving resilient applications.

What does it mean to build resilient applications?

AWS provides us with the following definition for the term resiliency:

“The ability of a workload to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions, such as misconfigurations or transient network issues.”

(Source: https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/resiliency-and-the-components-of-reliability.html)

Resiliency is part of the Reliability pillar for cloud providers such as AWS, Azure, GCP, and Oracle Cloud.

AWS takes it one step further, and shows how resiliency is part of the shared responsibility model:

The cloud provider is responsible for the resilience of the cloud (i.e., hardware, software, computing, storage, networking, and anything related to their data centers)
The customer is responsible for the resilience in the cloud (i.e., selecting the services to use, building resilient architectures, backup strategies, data replication, and more).

Source: https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/shared-responsibility-model-for-resiliency.html

How do we build resilient applications?

This blog post assumes that you are building modern applications in the public cloud.

We have all heard of RTO (recovery time objective), the maximum acceptable time to restore a workload after a failure.

A resilient workload (a combination of the application, its data, and the infrastructure that supports it) should not only recover automatically; it must also recover within a pre-defined RTO, agreed upon with the business owner.

Below are common best practices for building resilient applications:

Design for high-availability

The public cloud allows you to easily deploy infrastructure over multiple availability zones.

Examples of implementing high availability in the cloud:

Deploying multiple VMs behind an auto-scaling group and a front-end load-balancer
Spreading container load over multiple Kubernetes worker nodes, deploying in multiple AZs
Deploying a cluster of database instances in multiple AZs
Deploying global (or multi-regional) database services (such as Amazon Aurora Global Database, Azure Cosmos DB, Google Cloud Spanner, and Oracle Global Data Services (GDS))
Configuring DNS routing rules to send customers’ traffic to more than a single region
Deploying a global load-balancer (such as AWS Global Accelerator, Azure Cross-region Load Balancer, or Google Global external Application Load Balancer) to spread customers’ traffic across regions

Implement autoscaling

Autoscaling is one of the biggest advantages of the public cloud.

Assuming we built a stateless application, we can add or remove additional compute nodes using autoscaling capability, and adjust it to the actual load on our application.

In a cloud-native infrastructure, a managed load-balancer service receives traffic from customers and distributes it across the compute nodes, while an autoscaling group adds or removes compute nodes according to the load (typically based on metrics such as CPU utilization or request count).
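
How many nodes to add or remove is usually derived from a metric and a target value. The sketch below shows a simplified proportional rule of the same shape as the formula documented for the Kubernetes Horizontal Pod Autoscaler; managed autoscaling services add their own cooldowns and policies on top, so treat this as an illustration of the idea rather than any provider's actual algorithm. The function name, the metric values, and the node limits are all hypothetical.

```python
import math


def desired_capacity(current_nodes: int, current_metric: float, target_metric: float,
                     min_nodes: int = 1, max_nodes: int = 10) -> int:
    """Scale the node count in proportion to how far the metric is from its target.

    Example: 4 nodes at 90% average CPU with a 60% target need 6 nodes.
    The result is clamped to the configured minimum and maximum.
    """
    if current_nodes < 1:
        raise ValueError("current_nodes must be at least 1")
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    wanted = math.ceil(current_nodes * current_metric / target_metric)
    return max(min_nodes, min(max_nodes, wanted))
```

The clamp matters in practice: the minimum keeps the application available during quiet periods, and the maximum protects against runaway cost.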

Implement microservice architecture

Microservice architecture is meant to break a complex application into smaller parts, each responsible for certain functionality of the application.

By implementing microservice architecture, we are decreasing the impact of failed components on the rest of the application.

In case of high load on a specific component, it is possible to add more compute resources to the specific component, and in case we discover a bug in one of the microservices, we can roll back to a previous functioning version of the specific microservice, with minimal impact on the rest of the application.

Implement event-driven architecture

Event-driven architecture allows us to decouple our application components.

Event-driven architecture contributes to resiliency because, even if one component fails, the rest of the application continues to function.

Components are loosely coupled by using events that trigger actions.

Event-driven architectures are usually (but not always) based on services managed by cloud providers, who are responsible for the scale and maintenance of the managed infrastructure.

Event-driven architectures are based on messaging models such as message queues and pub/sub (services such as Amazon SQS, Azure Web PubSub, Google Cloud Pub/Sub, and OCI Queue service) or on event delivery (services such as Amazon EventBridge, Azure Event Grid, Google Eventarc, and OCI Events service).
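
The decoupling effect is easy to see in a few lines of code. The sketch below is a tiny in-memory stand-in for a managed pub/sub or event service, written only to illustrate the principle: the publisher does not know its consumers, and one failing consumer does not prevent the others from receiving the event. The `EventBus` class and its method names are hypothetical and do not correspond to any cloud provider's SDK.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List, Tuple

Handler = Callable[[Any], None]


class EventBus:
    """Tiny in-memory stand-in for a managed pub/sub or event service."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.failures: List[Tuple[str, str, str]] = []

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler to be called for every event on the topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Any) -> int:
        """Deliver the event to every subscriber; return how many succeeded."""
        delivered = 0
        for handler in self._subscribers[topic]:
            try:
                handler(event)
                delivered += 1
            except Exception as exc:  # one failing consumer must not block the rest
                self.failures.append((topic, handler.__name__, repr(exc)))
        return delivered


if __name__ == "__main__":
    bus = EventBus()

    def billing(order):
        raise RuntimeError("billing is down")

    def shipping(order):
        print(f"shipping order {order['id']}")

    bus.subscribe("order.created", billing)
    bus.subscribe("order.created", shipping)
    bus.publish("order.created", {"id": 42})  # shipping still runs
```

A managed service adds what this sketch lacks, such as durable storage of events, retries, and dead-letter queues for events that repeatedly fail.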

Implement API Gateways

If your application exposes APIs, use API Gateways (services such as Amazon API Gateway, Azure API Management, Google Apigee, or OCI API Gateway) to allow incoming traffic to your backend APIs, perform throttling to protect the APIs from spikes in traffic, and perform authorization on incoming requests from customers.
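
Throttling of this kind is commonly implemented with a token bucket: requests spend tokens, tokens refill at a steady rate, and a burst is allowed only up to the bucket's capacity. The sketch below shows the algorithm itself, as an aid to understanding what a gateway does on your behalf. It is not the implementation of any specific API Gateway service, and the class name, rate, and capacity values are illustrative assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock          # injectable so the behavior can be tested
        self._tokens = capacity      # start with a full bucket
        self._last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be throttled."""
        now = self._clock()
        # Refill according to elapsed time, never above the bucket capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

A throttled request would typically be answered with an HTTP 429 response, so a spike in traffic is rejected at the gateway instead of overwhelming the backend.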

Implement immutable infrastructure

Immutable infrastructure (such as VMs or containers) is meant to run application components without storing session information inside the compute nodes.

When a component fails, it is easy to replace it with a new one, with minimal disruption to the rest of the application, which allows fast recovery.

Data Management

Find the most suitable data store for your workload.

A microservice architecture allows you to select different data stores (from object storage to backend databases) for each microservice, decreasing the risk of complete failure due to availability issues in one of the backend data stores.

Once you select a data store, replicate it across multiple AZs, and if the business requires it, replicate it across multiple regions, to allow better availability, closer to the customers.

Implement observability

By monitoring all workload components, and sending logs from both infrastructure and application components to a central logging system, it is possible to identify anomalies, anticipate failures before they impact customers, and act.

Examples of such actions include sending a command to restart a VM, deploying a new container instead of a failed one, and more.

It is important to keep track of baseline measurements (for example, what is considered a normal response time for a customer request) to be able to detect anomalies.
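
As a small illustration of comparing new measurements against a baseline, the sketch below flags response times that deviate strongly from what has been observed as normal. It uses a basic standard-deviation threshold, which is only one of many possible detection methods; the function name, the threshold of 3, and the sample values in the usage note are assumptions for the example.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_anomalies(baseline_ms: Sequence[float], samples_ms: Sequence[float],
                   threshold: float = 3.0) -> List[float]:
    """Return the samples more than `threshold` standard deviations from the baseline mean."""
    if len(baseline_ms) < 2:
        raise ValueError("need at least two baseline measurements")
    mu = mean(baseline_ms)
    sigma = stdev(baseline_ms)
    if sigma == 0:
        # A perfectly flat baseline: anything different is an anomaly.
        return [s for s in samples_ms if s != mu]
    return [s for s in samples_ms if abs(s - mu) / sigma > threshold]
```

For example, with a baseline of `[100, 102, 98, 101, 99]` milliseconds, a new sample of 180 ms is flagged while 101 ms and 97 ms are not. In a real workload, the baseline would come from the monitoring system and a flagged value would trigger an alert or an automated action.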

Implement chaos engineering

The base assumption is that everything will eventually fail.

Implementing chaos engineering allows us to conduct controlled experiments that inject faults into our workloads and test what will happen in case of failure.

This allows us to better understand if our workload will survive a failure.

Examples include adding load on disk volumes, injecting timeouts when an application tier connects to a backend database, and more.

Examples of services for implementing chaos engineering are AWS Fault Injection Simulator, Azure Chaos Studio, and Gremlin.
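
To show what injecting a timeout can look like at the code level, here is a minimal sketch. A wrapper makes a fraction of calls to a backend fail with a timeout, and a simple retry function plays the role of the behavior being tested. This is a toy illustration of the principle, not how the services above work internally; all function names, the failure rate, and the retry count are hypothetical.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


def with_injected_timeouts(func: Callable[..., T], failure_rate: float,
                           rng: random.Random) -> Callable[..., T]:
    """Wrap `func` so that a fraction of calls raise TimeoutError instead of running."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise TimeoutError("injected fault: simulated backend timeout")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func: Callable[[], T], attempts: int = 3) -> T:
    """The behavior under test: retry a flaky call a limited number of times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except TimeoutError as exc:
            last_error = exc
    raise last_error


if __name__ == "__main__":
    rng = random.Random(7)  # seeded so that an experiment can be repeated
    flaky_query = with_injected_timeouts(lambda: "row data", 0.3, rng)
    try:
        print(call_with_retries(flaky_query))
    except TimeoutError:
        print("all retries failed; the application needs a fallback")
```

Passing a seeded random generator keeps the experiment controlled and repeatable, which is what separates chaos engineering from simply breaking things.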

Create a failover plan

In an ideal world, your workload will be designed for self-healing, meaning, it will automatically detect a failure and recover from it, for example, replace failed components, restart services, or switch to another AZ or even another region.

In practice, you need to prepare a failover plan, keep it up to date, and make sure your team is trained to act in case of major failure.

A disaster recovery plan without proper and regular testing is worth nothing — your team must practice repeatedly, and adjust the plan, and hopefully, they will be able to execute the plan during an emergency with minimal impact on customers.

Resilient applications tradeoffs

Failure can happen in various ways, and when we design our workload, we need to limit the blast radius of each failure.

Below are some common failure scenarios, and possible solutions:

Failure in a specific component of the application – By designing a microservice architecture, we can limit the impact of a failed component to a specific area of our application (depending on the criticality of the component, as part of the entire application)
Failure of a single AZ – By deploying infrastructure over multiple AZs, we can decrease the chance of application failure and impact on our customers
Failure of an entire region – Although this scenario is rare, cloud regions also fail, and by designing a multi-region architecture, we can decrease the impact on our customers
DDoS attack – By implementing DDoS protection mechanisms, we can decrease the risk of a DDoS attack impacting our application

Whatever solution we design for our workloads, we need to understand that there is a cost and there might be tradeoffs for the solution we design.

Multi-region architecture aspects

A multi-region architecture provides the most highly available and resilient solution for your workloads; however, it adds high cost for cross-region egress traffic, most services are limited to a single region, and your staff needs to know how to support such a complex architecture.

Another limitation of multi-region architecture is data residency: if your business or regulator demands that customers’ data be stored in a specific region, a multi-region architecture is not an option.

Service quota/service limits

When designing a highly resilient architecture, we must take into consideration service quotas or service limits.

Sometimes we are bound to a service quota on a specific AZ or region, an issue that we may need to resolve with the cloud provider’s support team.

Sometimes we need to understand there is a service limit in a specific region, such as a specific service that is not available in a specific region, or there is a shortage of hardware in a specific region.

Autoscaling considerations

Horizontal autoscaling (the ability to add or remove compute nodes) is one of the fundamental capabilities of the cloud; however, it has its limitations.

A new compute node (whether a VM, a container instance, or even a database instance) may take a couple of minutes to spin up (which may impact customer experience) or to spin down (which may impact service cost).

Also, to support horizontal scaling, you need to make sure the compute nodes are stateless, and that the application supports such capability.

Failover considerations

One of the limitations of databases is their ability to fail over, i.e., to switch between the primary node and one of the secondary nodes, either in case of failure or during scheduled maintenance.

We need to take data replication into consideration, making sure transactions were committed on the primary node and replicated to the replica node before it takes over.
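
One way to reason about this is to look at replication lag before choosing a failover target. The sketch below picks the replica that is least behind the primary and refuses to pick one if every replica lags more than an acceptable threshold. The replica names, the lag values, and the threshold are hypothetical; in practice, the lag would come from the database's monitoring metrics, and managed database services make this decision with their own logic.

```python
from typing import Dict, Optional


def pick_failover_target(replica_lag_seconds: Dict[str, float],
                         max_acceptable_lag: float) -> Optional[str]:
    """Return the replica with the smallest replication lag.

    Returns None when there are no replicas or when even the best replica
    is further behind the primary than `max_acceptable_lag` seconds.
    """
    if not replica_lag_seconds:
        return None
    name, lag = min(replica_lag_seconds.items(), key=lambda item: item[1])
    return name if lag <= max_acceptable_lag else None
```

For example, given `{"replica-a": 4.0, "replica-b": 0.5}` and a threshold of 2 seconds, the function selects `replica-b`. A `None` result means that failing over right now would lose more recent transactions than the business is willing to accept.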

Summary

In this blog post, we have covered many aspects of building resilient applications in the cloud.

When designing new applications, we need to understand the business expectations (in terms of application availability and customer impact).

We also need to understand the various architectural design considerations, and their tradeoffs, to be able to match the technology to the business requirements.

As I always recommend: do not stay on the theoretical side of the equation. Begin designing and building modern and highly resilient applications to serve your customers. There is no replacement for actual hands-on experience.

Posted in AWS, Azure, Cloud computing, GCP, Google, Oracle

Why choosing “Lift & Shift” is a bad migration strategy

December 12th, 2023

Author: Eyal Estrin

One of the first decisions organizations make before migrating applications to the public cloud is deciding on a migration strategy.

For many years, the most common and easy way to migrate applications to the cloud was choosing a rehosting strategy, also known as “Lift and shift”.

In this blog post, I will review some of the reasons why, strategically, this is a bad decision.

Introduction

When reviewing the landscape of possibilities for migrating legacy or traditional applications to the public cloud, rehosting is the best option as a short-term solution.

Taking an existing monolith application, and migrating it as-is to the cloud, is supposed to be an easy task:

Map all the workload components (hardware requirements, operating system, software and licenses, backend database, etc.)
Choose similar hardware (memory/CPU/disk space) to deploy a new VM instance(s)
Configure network settings (including firewall rules, load-balance configuration, DNS, etc.)
Install all the required software components (assuming no license dependencies exist)
Restore the backend database from the latest full backup
Test the newly deployed application in the cloud
Expose the application to customers

From a time and required-knowledge perspective, this is considered a quick-win solution, but how efficient is it?

Cost-benefit

Using physical or even virtual machines does not guarantee anything close to 100% hardware utilization.

In the past, organizations used to purchase hardware and had to commit to 3–5 years (for vendor support purposes).

Although organizations could use the hardware 24×7, there were many cases where purchased hardware was consuming electricity and floor-space, without running at full capacity (i.e., underutilized).

Virtualization did allow organizations to run multiple VMs on the same physical hardware, but even then, it did not guarantee 100% hardware utilization — think about Dev/Test environments or applications that were not getting traffic from customers during off-peak hours.

The cloud offers organizations new purchase/usage methods (such as on-demand or Spot), allowing customers to pay just for the time they used compute resources.

Keeping a traditional data-center mindset, using virtual machines, is not efficient enough.

Switching to modern ways of running applications, such as the use of containers, Function-as-a-Service (FaaS), or event-driven architectures, allows organizations to make better use of their resources, at much better prices.

Right-sizing

On day 1, it is hard to predict the right VM instance size for the application.

When migrating applications as-is, organizations tend to select similar hardware (mostly CPU/Memory), to what they used to have in the traditional data center, regardless of the application’s actual usage.

After a legacy application is running for several weeks in the cloud, we can measure its actual performance, and switch to a more suitable VM instance size, gaining better utilization and price.

Tools such as AWS Compute Optimizer, Azure Advisor, or Google Recommender will help you select the most suitable VM instance size, but a VM still does not utilize 100% of its compute resources, in contrast to containers or Function-as-a-Service.
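
The idea behind right-sizing can be sketched in a few lines: measure actual usage for a while, then pick the smallest size that covers the typical peak plus some headroom. The example below is an illustration of that reasoning only, not how the tools above compute their recommendations. The size catalog, the use of the 95th percentile, and the 20% headroom are all assumptions made for the example.

```python
import math
from typing import Dict, Optional, Sequence

# Hypothetical size catalog: name -> vCPUs. Real instance families differ.
SIZES: Dict[str, int] = {"small": 2, "medium": 4, "large": 8, "xlarge": 16}


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile (pct between 0 and 100) of a non-empty sequence."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def recommend_size(cpu_used_vcpus: Sequence[float], headroom: float = 0.2) -> Optional[str]:
    """Smallest size that covers the 95th-percentile usage plus headroom.

    Returns None when no size in the catalog is large enough.
    """
    needed = percentile(cpu_used_vcpus, 95) * (1 + headroom)
    for name, vcpus in sorted(SIZES.items(), key=lambda item: item[1]):
        if vcpus >= needed:
            return name
    return None
```

A server that was migrated as-is with 16 vCPUs, but whose measured usage peaks at around 3 vCPUs, would fit the 4-vCPU size in this model, which is the kind of saving that only becomes visible after measuring.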

Scaling

Horizontal scaling is one of the main benefits of the public cloud.

Although it is possible to configure multiple VMs behind a load-balancer with autoscaling, adding or removing VMs according to the load on the application, legacy applications may not always support horizontal scaling. Even if they do support scale out (adding more compute nodes), there is a very good chance they do not support scale in (removing unneeded compute nodes).

VMs do not support the ability to scale to zero, i.e., completely removing all compute nodes when there is no customer demand.

Cloud-native applications deployed on top of containers, using an orchestrator such as Kubernetes (through services such as Amazon EKS, Azure AKS, or Google GKE), can scale horizontally according to need (scaling out as much as needed, or as far as the cloud provider’s quota allows).

Functions as part of FaaS (such as AWS Lambda, Azure Functions, or Google Cloud Functions) are invoked as a result of triggers and terminated when the function’s job completes, which results in maximum compute utilization.

Load time

Spinning up a new VM as part of an auto-scaling activity (such as AWS EC2 Auto Scaling, Azure Virtual Machine Scale Sets, or Google Managed instance groups), an upgrade, or a reboot takes a long time, specifically for large workloads such as Windows VMs, databases (deployed on top of VMs), or application servers.

Provisioning a new container (based on Linux OS), including all the applications and layers, takes a couple of seconds (depending on the number of software layers).

Invoking a new function takes a few seconds, even if you take into consideration cold start issues when downloading the function’s code.

Software maintenance

Every workload requires ongoing maintenance, from code upgrades and third-party software upgrades to security updates.

Every software upgrade requires a lot of overhead from the IT, development, and security teams.

Performing upgrades of a monolith, where various components and services are tightly coupled together, increases the complexity and the chances that something will break.

Switching to a microservice architecture allows organizations to change specific components (for example, scaling out, deploying a new version of the code, or adopting a new third-party software component), with little to no impact on other components of the application.

Infrastructure maintenance

In the traditional data center, organizations used to deploy and maintain every component of the underlying infrastructure supporting the application.

Maintaining services such as databases or even storage arrays requires dedicated, trained staff and a lot of ongoing effort (patching, backup, resiliency, high availability, and more).

In cloud-native environments, organizations can take advantage of managed services, from managed databases, storage services, caching, monitoring, and AI/ML services, without having to maintain the underlying infrastructure.

Unless an application relies on a legacy database engine, in most cases you will be able to replace a self-maintained database server with a managed database service.

For storage services, most cloud providers already offer all the commodity storage services (from managed NFS, SMB/CIFS, and NetApp up to parallel file systems for HPC workloads).

Most modern cloud-native services use object storage services (such as Amazon S3, Azure Blob Storage, or Google Cloud Storage), which provide scalable storage for large amounts of data (from backups and log files to data lakes).

Most cloud providers offer managed networking services for load-balancing, firewalls, web application firewalls, and DDoS protection mechanisms, supporting workloads with unpredictable traffic.

SaaS services

Up until now, we have discussed lift & shift from on-premise to VMs (mostly IaaS) and to managed services (PaaS), but let us not forget there is another migration strategy: repurchasing. It means replacing an existing application with a managed platform such as Software-as-a-Service, allowing organizations to consume fully managed services without having to take care of the ongoing maintenance and resiliency.

Summary

Keeping a static data center mindset, and migrating using “lift & shift” to the public cloud, is the least cost-effective strategy and in most cases will end up with medium to low performance for your applications.

It may have been the common strategy a couple of years ago when organizations just began taking their first step in the public cloud, but as more knowledge is gained from both public cloud providers and all sizes of organizations, it is time to think about more mature cloud migration strategies.

It is time for organizations to embrace a dynamic mindset of cloud-native services and cloud-native applications, which provide many benefits: (almost) infinite scale, automated provisioning (using Infrastructure-as-Code), a rich cloud ecosystem (with many managed services), and (if managed correctly) the ability to match workload costs to actual consumption.

I encourage all organizations to expand their knowledge about the public cloud, assess their existing applications and infrastructure, and begin modernizing their existing applications.

Re-architecture may demand a lot of resources (both cost and manpower) in the short term but will provide an organization with a lot of benefits in the long run.

Posted in AWS, Azure, Cloud Adoption, Cloud computing, GCP, Google

Introduction to Serverless Container Services

November 20th, 2023

Author: Eyal Estrin

When developing modern applications, we almost immediately think about wrapping our application components inside Containers — it may not be the only architectural alternative, but a very common one.

Assuming our developers and DevOps teams have the required expertise to work with Containers, we still need to think about maintaining the underlying infrastructure — i.e., the Container hosts.

If our application has a steady and predictable load, and assuming we do not have experience maintaining Kubernetes clusters, and we do not need the capabilities of Kubernetes, it is time to think about an easy and stable alternative for deploying our applications on top of Containers infrastructure.

In the following blog post, I will review the alternatives of running Container workloads on top of Serverless infrastructure.

Why do we need Serverless infrastructure for running Container workloads?

Container architecture is made of a Container engine (such as Docker, CRI-O, etc.) deployed on top of a physical or virtual server, and on top of the Container engine, we deploy multiple Container images for our applications.

The diagram below shows a common Container architecture:

If we focus on the Container engine and the underlying operating system, we understand that we still need to maintain the operating system itself.

Common maintenance tasks for the operating system:

Make sure it has enough resources (CPU, memory, storage, and network connectivity) for running Containers
Make sure the operating system is fully patched and hardened from external attacks
Make sure our underlying infrastructure (i.e., Container host nodes), provides us with high availability in case one of the host nodes fails and needs to be replaced
Make sure our underlying infrastructure provides us the necessary scale our application requires (i.e., scale out or in according to application load)

Instead of having to maintain the underlying host nodes, we should look for a Serverless solution that allows us to focus on application deployment and maintenance, and minimizes the work of maintaining the infrastructure.

Comparison of Serverless Container Services

Each of the hyperscale cloud providers offers us the ability to consume a fully managed service for deploying our Container-based workloads.

Below is a comparison of AWS, Azure, and Google Cloud alternatives:

Side notes for Azure users

While researching for this blog post, I had a debate about whether to include Azure Container Apps or Azure Container Instances.

Although both services allow customers to run Containers in a managed environment, Azure Container Instances is more suitable for running a single Container application, while Azure Container Apps allows customers to build a full microservice-based application.

Summary

In this blog post, I have compared alternatives for deploying microservice architecture on top of Serverless Container services offered by AWS, Azure, and GCP.

While designing your next application based on microservice architecture, and assuming you don’t need a full-blown Kubernetes cluster (with all of its features and complexities), consider using a Serverless Container service.

About the Author

Eyal Estrin is a cloud and information security architect, and the author of the book Cloud Security Handbook, with more than 20 years in the IT industry. You can connect with him on Twitter.

Opinions are his own and not the views of his employer.

Posted in AWS, Azure, Cloud computing, Containers, GCP |

Introduction to Break-Glass in Cloud Environments

November 13th, 2023 |

Author: Eyal Estrin

Using modern cloud environments, specifically production environments, decreases the need for human access.

It makes sense for developers to have access to Dev or Test environments, but in a properly designed production environment, everything should be automated – from deployment and observability to self-healing. In most cases, no human access is required.

Production environments serve customers, require zero downtime, and in most cases contain customers’ data.

There are cases such as emergency scenarios where human access is required.

In mature organizations, this type of access is done by the Site reliability engineering (SRE) team.

The term break-glass is an analogy to breaking a glass to pull a fire alarm, which is supposed to happen only in case of emergency.

In the following blog post, I will review the different alternatives each of the hyperscale cloud providers gives their customers to handle break-glass scenarios.

Ground rules for using break-glass accounts

Before talking about how each of the hyperscale cloud providers handles break-glass, it is important to be clear – break-glass accounts should be used in emergency cases only.

Authentication – All access through the break-glass mechanism must be authenticated, preferably against a central identity provider, and not using local accounts
Authorization – All access must be authorized using role-based access control (RBAC), following the principle of least privilege
MFA – Since most break-glass scenarios require highly privileged access, it is recommended to enforce multi-factor authentication (MFA) for any interactive access
Just-in-time access – All access through break-glass mechanisms must be granted temporarily and must be revoked after a predefined amount of time or when the emergency is declared over
Approval process – Access through a break-glass mechanism should be manually approved
Auditing – All access through break-glass mechanisms must be audited and kept as evidence for further investigation
Documented process – Organizations must have a documented and tested process for requesting, approving, using, and revoking break-glass accounts
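
To make the ground rules above more concrete, here is a minimal sketch of how approval, just-in-time expiry, and auditing could fit together. The class and function names (BreakGlassGrant, grant_access) are hypothetical and not part of any cloud provider's API; a real implementation would rely on the provider's IAM and logging services.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class BreakGlassGrant:
    """Hypothetical record of a single emergency access grant."""
    user: str
    role: str
    approved_by: str
    granted_at: datetime
    ttl: timedelta
    audit_log: list = field(default_factory=list)

    def is_active(self, now: datetime) -> bool:
        # Just-in-time access: the grant expires once the TTL has passed.
        return now < self.granted_at + self.ttl

    def use(self, action: str, now: datetime) -> bool:
        allowed = self.is_active(now)
        # Auditing: record every attempt, including denied ones.
        self.audit_log.append((now.isoformat(), self.user, action, allowed))
        return allowed


def grant_access(user, role, approved_by, now, ttl=timedelta(hours=1)):
    # Approval process: someone other than the requester must approve.
    if not approved_by or approved_by == user:
        raise PermissionError("break-glass access requires a separate approver")
    return BreakGlassGrant(user, role, approved_by, now, ttl)


if __name__ == "__main__":
    start = datetime(2023, 11, 13, 9, 0, tzinfo=timezone.utc)
    grant = grant_access("alice", "sre-emergency", "bob", start)
    print(grant.use("restart-database", start + timedelta(minutes=10)))
    print(grant.use("restart-database", start + timedelta(hours=2)))
```

The sketch deliberately logs denied attempts as well, since an attempt to use an expired grant is often the more interesting event during an investigation.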

Handling break-glass scenarios in AWS

Below is a list of best practices provided by AWS for handling break-glass scenarios:

Identity Management

Identities in AWS are managed using AWS Identity and Access Management (IAM).

When working with AWS Organizations, customers have the option for central identity management for the entire AWS Organization using AWS IAM Identity Center – a single-sign-on (SSO) and federated identity management service (working with Microsoft Entra ID, Google Workspace, and more).

Since there might be a failure with a remote identity provider (IdP) or with AWS IAM Identity Center, AWS recommends creating two IAM users on the root of the AWS Organizations tree, and an IAM break-glass role on each of the accounts in the organization, to allow access in case of emergency.

The break-glass IAM accounts need to have console access, as explained in the documentation.

Authentication Management

When creating IAM accounts, enforce the use of a strong password policy, as explained in the documentation.

Passwords for the break-glass IAM accounts must be stored in a secured vault, and once the work on the break-glass accounts is over, the passwords must be replaced immediately to avoid reuse.

AWS recommends enforcing the use of MFA for any privileged access, as explained in the documentation.

Access Management

Access to resources inside AWS is managed using AWS IAM Roles.

AWS recommends creating a break-glass IAM role, as explained in the documentation.

Access using break-glass IAM accounts must be temporary, as explained in the documentation.
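
As an illustration, the trust policy of a break-glass IAM role could look roughly like the sketch below, which only allows the two break-glass IAM users to assume the role, and only when they signed in with MFA. The account ID and user names are placeholders, and the helper function name is my own; sessions obtained through sts:AssumeRole are time-limited, which fits the temporary-access requirement.

```python
import json


def break_glass_trust_policy(user_arns):
    """Build a trust policy for a break-glass role (illustrative sketch)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Only the dedicated break-glass IAM users may assume the role.
                "Principal": {"AWS": sorted(user_arns)},
                "Action": "sts:AssumeRole",
                # Require that the caller authenticated with MFA.
                "Condition": {"Bool": {"aws:MultiFactorAuthPresent": "true"}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder ARNs; replace with the real break-glass users.
    users = [
        "arn:aws:iam::111122223333:user/break-glass-1",
        "arn:aws:iam::111122223333:user/break-glass-2",
    ]
    print(json.dumps(break_glass_trust_policy(users), indent=2))
```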

Auditing

Management events (control-plane API calls) within the AWS environment are recorded in the AWS CloudTrail event history by default, and retained for 90 days.

As a best practice, it is recommended to send all CloudTrail logs to a central S3 bucket, from the entire AWS Organization, as explained in the documentation.

Since audit trail logs contain sensitive information, it is recommended to encrypt all data at rest using customer-managed encryption keys (as explained in the documentation) and limit access to the log files to the SOC team for investigation purposes.

Audit logs stored inside AWS CloudTrail can be investigated using Amazon GuardDuty, as explained in the documentation.

Resource Access

To allow secured access to EC2 instances, AWS recommends using EC2 Instance Connect or AWS Systems Manager Session Manager.

To allow secured access to Amazon EKS nodes, AWS recommends using AWS Systems Manager Agent (SSM Agent).

To allow secured access to Amazon ECS container instances, AWS recommends using AWS Systems Manager, and for debugging purposes, AWS recommends using Amazon ECS Exec.

To allow secured access to Amazon RDS, AWS recommends using AWS Systems Manager Session Manager.

Handling break-glass scenarios in Azure

Below is a list of best practices provided by Microsoft for handling break-glass scenarios:

Identity Management

Although identities in Azure are managed using Microsoft Entra ID (formerly Azure AD), Microsoft recommends creating two cloud-only accounts that use the *.onmicrosoft.com domain, to allow access in case of emergency or in case of problems logging in with federated identities from the on-premise Active Directory, as explained in the documentation.

Authentication Management

Microsoft recommends enabling password-less login for the break-glass accounts using a FIDO2 security key, as explained in the documentation.

To prevent tenant-wide account lockout, Microsoft does not recommend enforcing the use of MFA for emergency or break-glass accounts, and recommends excluding the break-glass accounts from Conditional Access policies, as explained in the documentation.

Access Management

Microsoft allows customers to manage privileged access to resources using Microsoft Entra Privileged Identity Management (PIM) and recommends assigning the break-glass accounts permanent access to the Global Administrator role, as explained in the documentation.

Microsoft Entra PIM allows you to control requests for privileged access, as explained in the documentation.

Auditing

Activity logs within the Azure environment are logged into Azure Monitor by default, and stored for 90 days.

As a best practice, it is recommended to enable diagnostic settings for all audits and “allLogs” and send the logs to a central Log Analytics workspace, from the entire Azure tenant, as explained in the documentation.

Since audit trail logs contain sensitive information, it is recommended to encrypt all data at rest using customer-managed encryption keys (as explained in the documentation) and limit access to the log files to the SOC team for investigation purposes.

Audit logs stored inside a Log Analytics workspace can be queried for further investigation using Microsoft Sentinel, as explained in the documentation.

Microsoft recommends creating an alert when break-glass accounts perform sign-in attempts, as explained in the documentation.
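
A possible starting point for such an alert is a log query over the sign-in logs collected in the Log Analytics workspace. The sketch below builds a KQL query string in Python; it assumes the Entra ID sign-in logs are exported to the SigninLogs table, and the account names and the helper function are hypothetical. Wiring the query into an actual alert rule is left out.

```python
def break_glass_signin_query(upns, lookback="1h"):
    """Return a KQL query listing sign-in attempts by break-glass accounts."""
    if not upns:
        raise ValueError("at least one break-glass account is required")
    for upn in upns:
        # Keep the generated query well-formed; account names never need quotes.
        if '"' in upn:
            raise ValueError(f"unexpected character in account name: {upn!r}")
    quoted = ", ".join(f'"{upn}"' for upn in upns)
    return (
        "SigninLogs\n"
        f"| where TimeGenerated > ago({lookback})\n"
        f"| where UserPrincipalName in~ ({quoted})\n"
        "| project TimeGenerated, UserPrincipalName, IPAddress, ResultType"
    )


if __name__ == "__main__":
    # Hypothetical cloud-only break-glass accounts.
    print(break_glass_signin_query([
        "bg-admin1@contoso.onmicrosoft.com",
        "bg-admin2@contoso.onmicrosoft.com",
    ]))
```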

Resource Access

To allow secured access to virtual machines (using SSH or RDP), Microsoft recommends using Azure Bastion.

To allow secured access to the Azure Kubernetes Service (AKS) API server, Microsoft recommends using Azure Bastion, as explained in the documentation.

To allow secured access to Azure SQL, Microsoft recommends creating an Azure Private Endpoint and connecting to the Azure SQL using Azure Bastion, as explained in the documentation.

Another alternative to allow secured access to resources in private networks is to use Microsoft Entra Private Access, as explained in the documentation.

Handling break-glass scenarios in Google Cloud

Below is a list of best practices provided by Google for handling break-glass scenarios:

Identity and Access Management

Identities in GCP are managed using Google Workspace or using Google Cloud Identity.

Access to resources inside GCP is managed using IAM Roles.

Google recommends creating a dedicated Google group for the break-glass IAM role, and configuring temporary access to this Google group as explained in the documentation.

The temporary access is done using IAM conditions, and it allows customers to implement Just-in-Time access, as explained in the documentation.

For break-glass access, add dedicated Google identities to the mentioned Google group, to gain temporary access to resources.
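
As a rough sketch of what such a time-bound binding could look like, the function below builds an IAM policy binding whose condition expires after a fixed duration. The group address, the role, and the function name are assumptions for illustration; the resulting structure would still need to be merged into the project's IAM policy using your tool of choice.

```python
from datetime import datetime, timedelta, timezone


def temporary_binding(group_email, role, start, duration):
    """Build an IAM binding that stops granting access after start + duration.

    `start` is expected to be a timezone-aware datetime.
    """
    expiry = (start + duration).astimezone(timezone.utc)
    expiry_text = expiry.strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "role": role,
        "members": [f"group:{group_email}"],
        "condition": {
            "title": "break-glass-temporary-access",
            "description": "Emergency access, expires automatically",
            # IAM Conditions use CEL; access is allowed only before the expiry.
            "expression": f'request.time < timestamp("{expiry_text}")',
        },
    }


if __name__ == "__main__":
    now = datetime(2023, 11, 13, 9, 0, tzinfo=timezone.utc)
    binding = temporary_binding(
        "break-glass@example.com",  # hypothetical Google group
        "roles/compute.admin",
        now,
        timedelta(hours=2),
    )
    print(binding["condition"]["expression"])
```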

Authentication Management

Google recommends enforcing the use of MFA for any privileged access, as explained in the documentation.

Auditing

Admin Activity logs (configuration changes) within the GCP environment are written to Google Cloud Audit Logs by default, and retained for 400 days.

It is recommended to manually enable data access audit logs to get more insights about break-glass account activity, as explained in the documentation.

As a best practice, it is recommended to send all Cloud Audit logs to a central Google Cloud Storage bucket, from the entire GCP Organization, as explained in the documentation.

Since audit trail logs contain sensitive information, it is recommended to encrypt all data at rest using customer-managed encryption keys (as explained in the documentation) and limit access to the log files to the SOC team for investigation purposes.

Audit logs stored inside Google Cloud Audit Logs can be sent to the Google Security Command Center for further investigation, as explained in the documentation.

Resource Access

To allow secured access to Google Compute Engine instances, Google recommends using an Identity-Aware Proxy, as explained in the documentation.

To allow secured access to Google App Engine instances, Google recommends using an Identity-Aware Proxy, as explained in the documentation.

To allow secured access to Google Cloud Run service, Google recommends using an Identity-Aware Proxy, as explained in the documentation.

To allow secured access to Google Kubernetes Engine (GKE) instances, Google recommends using an Identity-Aware Proxy, as explained in the documentation.

Summary

In this blog post, we have reviewed what break-glass accounts are, and how AWS, Azure, and GCP recommend securing them (covering authentication, authorization, auditing, and secure access to cloud resources).

I recommend any organization that manages cloud production environments follow the vendors’ security best practices and keep the production environment secured.

Posted in AWS, Azure, Cloud computing, GCP |

Embracing Cloud-Native Mindset

November 6th, 2023 |

Author: Eyal Estrin

This post was originally published by the Cloud Security Alliance.

The use of the public cloud has become the new norm for any size organization.

Organizations are adopting cloud services, migrating systems to the cloud, consuming SaaS applications, and beginning to see the true benefits of the public cloud.

In this blog post, I will explain what it means to embrace a cloud-native mindset.

What is Cloud-Native?

When talking about cloud-native, there are two complementary terms:

Cloud-Native Infrastructure — Services that were specifically built to run on public cloud environments, such as containers, API gateways, managed databases, and more.
Cloud-Native applications — Applications that take full advantage of the public cloud, such as auto-scaling (up or down), microservice architectures, function as a service, and more.

Cloud First vs. Cloud-Native

For many years, there was a misconception among organizations and decision-makers that we should embrace a “cloud first” mindset, meaning that any new application we develop or consume must reside in the public cloud.

Cloud-first mindset is no longer relevant.

Cloud, like any other IT system, is meant to support the business, not to dictate business decisions.

One of the main reasons for any organization to create a cloud strategy is to allow decision-makers to align IT capabilities or services to business requirements.

There might be legacy systems generating value for the organization, and the cost to re-architect and migrate to the cloud is higher than the benefit of migration — in this case, the business should decide how to manage this risk.

When considering developing a new application or migrating an existing application to the cloud, consider the benefits of cloud-native (see below), and in any case where choosing the cloud makes sense (in terms of alignment to business goals, costs, performance, etc.), make it your first choice.

What are the benefits of Cloud-Native?

Since we previously mentioned cloud-native, let us review some of the main benefits of cloud-native:

Automation

One of the prerequisites of cloud-native applications is the ability to deploy an entire workload in an automated manner using Infrastructure as Code (IaC).

In cloud environments, IaC comes naturally, but do not wait until your workloads are migrated or developed in the cloud — begin automating on-premise infrastructure deployments using scripts today.

Scale

Cloud-native applications benefit from the virtually unlimited scale of the public cloud.

Modern applications will scale up or down according to customers’ demand.

Legacy environments may have the ability to add more virtual machines in case of high load, but in most cases, they fail to release unneeded compute resources when the load on the application goes down, increasing resource costs.
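
The difference can be illustrated with a simple proportional scaling rule that works in both directions: it adds instances when the load per instance is above a target and removes them when it drops. This is a simplified sketch under my own assumptions (the function name, the limits, and the rule itself), not the exact algorithm of any cloud provider's auto-scaling service.

```python
import math


def desired_instances(current, load_per_instance, target_per_instance,
                      minimum=1, maximum=20):
    """Return how many instances keep the per-instance load near the target."""
    if current < 1 or target_per_instance <= 0:
        raise ValueError("current must be >= 1 and target must be positive")
    # Scale proportionally to how far the observed load is from the target.
    wanted = math.ceil(current * load_per_instance / target_per_instance)
    # Clamp to the allowed range, so we scale in as well as out.
    return max(minimum, min(maximum, wanted))


if __name__ == "__main__":
    print(desired_instances(4, load_per_instance=90, target_per_instance=60))
    print(desired_instances(6, load_per_instance=10, target_per_instance=60))
```

The second call is the case legacy environments tend to miss: when the load drops, the rule asks for fewer instances, releasing compute resources that are no longer needed.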

Microservice architecture

One of the main benefits of cloud-native applications is the ability to break down complex architecture into small components (i.e., microservices).

Microservices allow development teams to own, develop, and maintain small portions of an application, making upgrading to newer versions an easy task.

If you are building new applications today, start architecting your applications using a microservices architecture, regardless of whether you are developing on-premise or in the public cloud.

It is important to note that microservices architecture increases the overall complexity of an application, by having many small components, so plan carefully.

Managed services

One of the main benefits when designing applications (or migrating an existing application) in the cloud, is to gain the benefit of managed services.

By consuming managed services (such as managed databases, storage, API gateways, etc.), you shift the overall maintenance, security, and stability to the cloud provider, which allows you to consume a service, without having to deal with the underlying infrastructure maintenance.

Whenever possible, prefer to choose a serverless managed service, which completely removes your requirement to deal with infrastructure scale (you simply do not specify how much computing power is required to run a service at any given time).

CI/CD pipeline

Modern applications are developed using a CI/CD pipeline, which creates a fast development lifecycle.

Each development team shares its code in a code repository and runs its own build process, which produces an artifact ready to be deployed in any environment (Dev, Test, or Prod).

Modern compute services

Cloud-native applications allow us to have optimum use of the hardware.

Compute services such as containers and function as a service, make better use of hardware resources, when compared to physical or even virtual machines.

Containers can run on any platform (from on-premise to cloud environments), and although it may take some time for developers and DevOps to learn how to use them, they can suit most workloads (including AI/ML), and be your first step in embracing cloud-native applications.

Function as a Service is a different story: functions suit specific tasks and are in most cases bound to a specific cloud environment, but if used wisely, they offer great efficiency when compared to other types of compute services.

Summary

What does it mean to embrace a cloud-native mindset?

Measuring the benefits of cloud-native applications, consuming cloud-native services, looking into the future of IT services, and wisely adopting the public cloud.

Will the public cloud suit 100% of scenarios? No, but it has more benefits than keeping legacy systems inside traditional data centers.

Whether you are a developer, DevOps, architect, or cybersecurity expert, I invite you to read, take online courses, practice, and gain experience using cloud-native infrastructure and applications, and consider them the better alternatives for running modern applications.

Posted in Cloud Adoption, Cloud computing |

Security challenges with SaaS applications

September 11th, 2023 |

Author: Eyal Estrin

This post was originally published by the Cloud Security Alliance.

According to the Shared Responsibility Model, “The consumer does not manage or control the underlying cloud infrastructure”.

As customers, this leaves us with very little control over services managed by remote service providers, as compared to the amount of control we have over IaaS (Infrastructure as a Service), where we control the operating system and anything inside it (applications, configuration, etc.)

The fact that many modern applications are offered as a SaaS, has many benefits such as:

(Almost) zero maintenance (we are still in charge of authorization)
(Almost) zero requirements to deal with availability or performance issues (depending on business requirements and the maturity of the SaaS vendor)
(Almost) zero requirement to deal with security and compliance (at the end of the day, we are still responsible for complying with laws and regulations and we still have obligations to our customers and employees, depending on the data classification we are about to store in the cloud)
Minimal requirements to handle licensing (depending on the SaaS pricing offers)
As customers, we can consume a service and focus on our business (instead of infrastructure and application maintenance)

While there are many benefits of switching from maintaining servers to consuming (SaaS) applications, there are many security challenges we need to be aware of and risks to control.

In this blog post, I will review some of the security challenges facing SaaS applications.

Identity and Access Management

We may not control the underlying infrastructure, but as customers, we are still in charge of configuring proper authentication and authorization for our customers (internal or external).

As customers, we would like to take advantage of our current identities and leverage a federation mechanism to allow our end-users to log in once and through SSO to be able to access the SaaS application, all using standard protocols such as SAML, OAuth, or OpenID Connect.

Once the authentication phase is done, we need to take care of access permissions, following the role description/requirement.

We must always follow the principle of least privilege.

We should never accept a SaaS application that does not support granular role-based access control.

While working with SaaS applications, we need to make sure we can audit who had access to our data and what actions have been done.

The final phase is to make sure access is granted by business needs – once an employee no longer needs access to a SaaS application, we must revoke the access immediately.

Data Protection

Once we are using SaaS applications, we need to understand we no longer have “physical” control over our data – whether it is employee’s data, customers’ data, intellectual property, or any other type of data.

Once data is stored and processed by an external party, there is always a chance for a data breach, that may lead to data leakage, data tampering, encryption by ransomware, and more.

If we are planning to store sensitive data (PII, financial, healthcare, etc.) in the cloud, we must understand how data is being protected.

We must make sure data is encrypted both in transit and at rest (including backups, logs, etc.) and at any given time, access to data by anyone (from our employees, SaaS vendor employees, or even third-party companies), must be authenticated, authorized, and audited.

Misconfiguration

The most common vulnerability is misconfiguration.

The easiest way for this to happen is for an employee with administrative privileges to make a configuration mistake: granting someone unnecessary access permissions, making data publicly available, forgetting to turn on encryption at rest (depending on the specific SaaS application), and more.

Some SaaS applications allow you to set configuration control using CASB (Cloud Access Security Brokers) or SSPM (SaaS Security Posture Management).

The problem is the lack of standardization in the SaaS industry.

There is no standard for allowing central configuration management using APIs.

If you are using common SaaS applications such as Office 365, Dropbox, SalesForce, or any other common SaaS application, you may be able to find many third-party security solutions that will allow you to mitigate misconfiguration.

Otherwise, if you are working with a small start-up vendor or with an immature SaaS vendor, your only options are a good legal contract (defining the obligations of the SaaS vendor), demand for certifications (such as SOC2 Type II reports) and accepting the risk (depending on the business risk tolerance).

Insecure APIs

Many SaaS applications allow you to connect using APIs (from audit logs to configuration management).

Regardless of the data classification, you must always make sure your SaaS vendor’s APIs support the following:

All APIs require authentication and perform a back-end authorization process.
All traffic to the API is encrypted in transit
All access to the API is audited (for further analysis)
If the SaaS application allows traffic initiation through API back to your organization, make sure you enforce input validation to avoid inserting malicious code into your internal systems
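
To illustrate the last point, here is a minimal sketch of allow-list input validation for an event that a SaaS application sends back into the organization. The field names, event types, and function name are hypothetical; the idea is simply to reject anything that does not match the expected shape before it reaches internal systems.

```python
import re

# Hypothetical allow-list of events and field format we expect to receive.
ALLOWED_EVENTS = {"user.created", "user.deleted"}
USER_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def validate_event(payload):
    """Return a cleaned copy of the payload, or raise ValueError."""
    if not isinstance(payload, dict):
        raise ValueError("payload must be a JSON object")
    unexpected = set(payload) - {"event", "user_id"}
    if unexpected:
        raise ValueError(f"unexpected fields: {sorted(unexpected)}")
    event = payload.get("event")
    user_id = payload.get("user_id")
    if not isinstance(event, str) or event not in ALLOWED_EVENTS:
        raise ValueError("unknown event type")
    # fullmatch rejects anything outside the allowed character set.
    if not isinstance(user_id, str) or not USER_ID_PATTERN.fullmatch(user_id):
        raise ValueError("invalid user_id")
    return {"event": event, "user_id": user_id}


if __name__ == "__main__":
    print(validate_event({"event": "user.created", "user_id": "u-1024"}))
```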

I recommend you never rely solely on the assurances of third-party SaaS vendors – always coordinate penetration testing on exposed APIs to mitigate the risk of insecure APIs.

Third-Party Access

Some SaaS vendors allow (or rely on) third-party vendor access.

When conducting due diligence with SaaS vendors, make sure to check whether the vendor allows any third-party access to customers’ data and how the data is protected.

Also, make sure the contract specifies whether data is transferred to third-party vendors, who they are, and for what purpose.

Make sure everything is written in the contract with the SaaS vendor and that the SaaS vendor must notify you of any change regarding data access or transfer to third-party vendors.

Patch Management and System Vulnerabilities

Since we are only consumers of a managed service, we have no control over or visibility into the infrastructure or application layers.

Everything is made of software and software is vulnerable by design.

We may be able to coordinate vulnerability scanning or even short-term penetration testing with the SaaS vendor (depending on the SaaS vendor maturity), but we are still dependent on the transparency of the SaaS vendor and this is a risk we need to accept (depending on the business risk tolerance).

Lack of SaaS Vendor Transparency

This is very important.

Mature SaaS vendors will make sure we are up to date with information such as breach notifications, outages, and scheduled maintenance (at least when everybody on the Internet talks about critical software vulnerabilities requiring immediate patching, and assuming downtime is required).

As part of vendor transparency, I would expect the legal contract to force the SaaS vendor to keep us up to date with data breach incidents or potential unauthorized access to customers’ data.

Since in most cases we do not have a real way to audit SaaS vendors’ security controls, I recommend working only with mature vendors who can provide proof of their maturity level (such as SOC 2 Type II reports every year), and coordinating your own assessments of the SaaS vendor.

Mature SaaS vendors will allow us access to audit logs, to query who has access to our data and what actions have been done with the data.

Regulatory Compliance

Regardless of the cloud service model, we are always responsible for our data and we must always comply with laws and regulations, wherever our customers reside or wherever our SaaS vendor stores our data.

Mature SaaS vendors allow us to comply with data residency and make sure data does not leave a specific country or region.

Compliance goes for the entire lifecycle of our data – from upload/store, process, data backup or retention, to finally data destruction.

Make sure the legal contract specifies data residency and the vendor’s obligations regarding compliance.

From a customer’s point of view, make sure you get legal advice on how to comply with all relevant laws and regulations.

Summary

In this blog post, I have reviewed some of the most common security challenges working with SaaS applications.

SaaS applications have many benefits (from a customer point of view), but they also contain security risks that we need to be aware of and manage regularly.

Posted in Cloud Adoption, Cloud computing |

Identity and Access Management in Multi-Cloud Environments

June 19th, 2023 |

Author: Eyal Estrin

IAM (Identity and Access Management) is a crucial part of any cloud environment.

As organizations evolve, they may look at multi-cloud as a solution to consume cloud services in different cloud providers’ technologies (such as AI/ML, data analytics, and more), to have the benefit of using different pricing models, or to decrease the risk of vendor lock-in.

Before we begin the discussion about IAM, we need to understand the following fundamental concepts:

Identity – An account that represents a person (human) or a service (non-interactive account)
Authentication – The act where an identity proves itself to a system (such as providing a username and password, certificate, API key, and more)
Authorization – The act of validating an identity’s privileges to take actions on a system (such as view configuration, read database content, upload a file to object storage, and more)
Access Management – The entire lifecycle of IAM – from account provisioning, granting access, and validating privileges until account or privilege revocation.

Identity and Access Management Terminology

Authorization in the Cloud

Although all cloud providers have the same concept of identities, when we deep dive into the concept of authorization or access management to resources/services, we need to understand the differences between cloud providers.

Authorization in AWS

AWS has two concepts for managing permissions to resources:

IAM Role – Permissions assigned to an identity temporarily.
IAM Policy – A document that defines a set of permissions, attached to an identity (such as an IAM role) or to a resource.

Permissions in AWS can be assigned to:

Identity – A policy attached to a user, group, or role.
Resource – A policy attached to a resource (such as Amazon S3 bucket).
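
The difference between the two is easiest to see side by side. In the sketch below, both policies allow reading objects from the same bucket; the resource-based policy additionally names a Principal, because it is attached to the bucket and not to an identity. The bucket name, role ARN, and function names are placeholders for illustration.

```python
import json


def identity_policy(bucket):
    """Identity-based policy: attached to a user, group, or role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
        }],
    }


def resource_policy(bucket, role_arn):
    """Resource-based policy: attached to the bucket, so it names a Principal."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": role_arn},
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket}/*",
        }],
    }


if __name__ == "__main__":
    # Placeholder names; replace with real values.
    print(json.dumps(identity_policy("example-bucket"), indent=2))
    print(json.dumps(
        resource_policy("example-bucket",
                        "arn:aws:iam::111122223333:role/report-reader"),
        indent=2,
    ))
```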

Authorization in Azure

Permissions in Azure AD are controlled by roles.

A role defines the permissions an identity has over an Azure resource.

Within Azure AD, you control permissions using RBAC (Role-based access control).

Azure AD supports the following types of roles:

Built-in roles – A pre-defined role according to job function (as you can read on the link).
Custom roles – A role that we create ourselves to match the principle of least privilege.

Authorization in Google Cloud

Permissions in Google Cloud IAM are controlled by IAM roles.

Google Cloud IAM supports the following types of IAM roles:

Basic roles – The most permissive type of roles (Owner, Editor, and Viewer).
Predefined roles – Roles managed by Google, which provides granular access to specific services (as you can read on the link).
Custom roles – User-specific roles, which provide the most granular access to resources.

Authorization – Default behavior

As we can see below each cloud provider takes a different approach to default permissions:

AWS – By default, new IAM users have no permission to access any resource in AWS.
To allow access to resources or take actions, you need to attach an IAM policy to the user (directly or through a group), or allow the user to assume an IAM role.
Azure – By default, all Azure AD users are granted a set of default permissions (such as listing all users, reading all properties of users and groups, registering new applications, and more).
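Google Cloud – By default, the default service accounts (such as the Compute Engine default service account) are granted the Editor role at the project level, unless this behavior is disabled by an organization policy.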

Identity Federation

When we are talking about identity federation, there are two concepts:

Service Provider (SP) – Provide access to resources
Identity Provider (IdP) – Authenticate the identities

Identities (user accounts, service accounts, groups, etc.) are managed by an Identity Provider (IdP).

An IdP can exist in the local data center (such as Microsoft Active Directory) or the public cloud (such as AWS IAM, Azure AD, Google Cloud IAM, etc.)

Federation is the act of creating trust between separate IdPs.

Federation allows us to keep identity in one repository (i.e., Identity Provider).

Once we set up an identity federation, we can grant an identity the privileges to consume resources in a remote repository.

Example: a worker with an account in Microsoft Active Directory, reading a file from object storage in Azure, once a federation trust was established between Microsoft Active Directory and Azure Active Directory.

When federating between the on-premise and cloud environments, we need to recall the use of different protocols.

On-premise environments are using legacy authentication protocols such as Kerberos or LDAP.

In the public cloud, the common authentication protocols are SAML 2.0, OpenID Connect (OIDC), and OAuth 2.0.

Each cloud provider has a list of supported external third-party identity providers to federate with, as you can read in the list below:

Single Sign-On

The concept behind SSO is to allow identities (usually end-users) access to resources in the cloud while having to sign in (to an identity provider) only once.

Over the past couple of years, the concept of SSO was extended and now it is possible to allow a single identity (who authenticated to a specific identity provider), access to resources over federated login to an external (mostly SAML) identity provider.

Each cloud provider has its own SSO service, supporting federation with external identity providers:

Steps for creating a federation between cloud providers

The process below explains (at a high level) the steps required to set up identity federation between different cloud providers:

Choose an IdP (where identities will be created and authenticated to).
Create a SAML identity provider.
Configure roles for your third-party identity provider.
Assign roles to the target users.
Create trust between SP and IdP.
Test the ability to authenticate and identify (user) to a resource in a remote/external cloud provider.
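
As a minimal sketch of the "configure roles" and "create trust" steps above, assuming AWS is the service provider and a SAML identity provider has already been registered in the account, the trust policy attached to the federated role could be built like this. The account ID and provider name are placeholders for illustration, not values from this post.

```python
import json


def build_saml_trust_policy(account_id: str, provider_name: str) -> dict:
    """Return an IAM role trust policy that lets SAML-federated users assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The SAML identity provider registered in the AWS account.
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:saml-provider/{provider_name}"
                },
                "Action": "sts:AssumeRoleWithSAML",
                # Only accept assertions issued for the AWS sign-in endpoint.
                "Condition": {
                    "StringEquals": {"SAML:aud": "https://signin.aws.amazon.com/saml"}
                },
            }
        ],
    }


# Placeholder values, for illustration only.
policy = build_saml_trust_policy("123456789012", "ExampleIdP")
print(json.dumps(policy, indent=2))
```

The permissions granted to the federated users are defined separately, in the permission policies attached to the same role; the trust policy only controls who may assume it.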

Summary

In this blog post, we took a deep dive into identity and access management in the cloud, comparing different aspects of IAM in AWS, Azure, and GCP.

After reviewing how authentication and authorization work for each of the three cloud providers, we explained how federation and SSO work in a multi-cloud environment.

Important to keep in mind:

When we are building systems in the cloud, whether they are publicly exposed or even internal, we need to follow some basic rules:

All access to resources/systems/applications must be authenticated
Permissions must follow the principle of least privilege and business requirements
All access must be audited (for future analysis, investigation purposes, etc.)

Posted in Authentication, AWS, Azure, Cloud computing, GCP

Privacy by Design and Privacy by Default in the Cloud

June 12th, 2023

Author: Eyal Estrin

This post was originally published by the Cloud Security Alliance.

When we are talking about building new systems, in the context of privacy or data protection, we often hear two concepts – Privacy by Design (PbD) and Privacy by Default.

Dealing with human privacy is not something new.

We build applications that store and process personal data – from e-commerce sites, banking, healthcare, advertisement, and more.

The concept of Privacy by Design (PbD) was embraced by the GDPR (General Data Protection Regulation) in Article 5 and Article 25, the CCPA (California Consumer Privacy Act) in W410-1, the LGPD (Brazilian General Data Protection Law) in Article 46, and the Canadian PIPEDA (Personal Information Protection and Electronic Documents Act) in Recommendation 14.

When designing systems in the cloud, we must remember the Shared Responsibility Model.

The cloud provider is responsible for the underlying infrastructure layers and offers us many built-in security controls, but it is our responsibility, as companies developing systems in the cloud, to use the security controls and design applications to meet all privacy requirements.

In this blog post, I will provide insights about how to implement those concepts when building new systems in the cloud.

What is Privacy by Design?

Privacy by Design (PbD) is based on seven “foundational principles”:

Principle 1: Proactive not reactive; preventive not remedial

To achieve this principle, we need to implement proactive security controls.

Examples of security controls that come built-in as part of major cloud providers:

Identity and Access Management – Enforce authentication (who the person claims to be) and authorization (what actions can be done by an authenticated identity).

Examples of services: AWS Identity and Access Management (IAM), Azure Active Directory (Azure AD), Google Cloud Identity and Access Management (IAM), and Oracle Cloud Infrastructure Identity and Access Management (IAM).

Network Protection – Enforce inbound/outbound access to services using access control mechanisms.

Examples of services: AWS Security groups, Azure Network security groups (NSG), GCP VPC firewall rules, and Oracle Cloud Infrastructure Security Lists.

Data Encryption – Enforce confidentiality by encrypting data in transit and at rest.

Examples of services: AWS Key Management Service (AWS KMS), Azure Key Vault, Google Cloud KMS, and Oracle Cloud Infrastructure Vault.

Principle 2: Privacy as the default setting

To achieve this principle, we need to implement default settings at the application level and on the infrastructure level.

Data minimization – When designing an application, we need to decide what is the minimum number of fields that will be stored (and perhaps processed) on data subjects in the application.
Data location – When designing an application, we need to take into consideration data residency, by selecting the target region to store data according to relevant laws and regulations.
Data retention – We need to set our application to keep data for as long as it is required and either delete or archive data when it is no longer needed (according to application/service capabilities).

Examples: Amazon S3 lifecycle management, Amazon EFS lifecycle management, Azure Storage lifecycle management, Google Cloud Storage Lifecycle Management, and Oracle Cloud Infrastructure Object Storage Lifecycle Management.
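
As a minimal sketch of the data retention idea, the snippet below builds an Amazon S3 lifecycle configuration that archives objects after one period and deletes them after another. The prefix and the day counts are assumptions chosen for illustration; the real values should come from the applicable laws and business requirements.

```python
def build_retention_rule(prefix: str, archive_after_days: int, delete_after_days: int) -> dict:
    """Build an S3 lifecycle configuration: archive first, then delete."""
    if delete_after_days <= archive_after_days:
        raise ValueError("delete_after_days must be greater than archive_after_days")
    return {
        "Rules": [
            {
                "ID": f"retention-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Move objects to an archive storage class once they are rarely needed.
                "Transitions": [{"Days": archive_after_days, "StorageClass": "GLACIER"}],
                # Delete objects when they are no longer required.
                "Expiration": {"Days": delete_after_days},
            }
        ]
    }


# Hypothetical retention periods, for illustration only.
lifecycle = build_retention_rule("customer-data/", archive_after_days=90, delete_after_days=365)
print(lifecycle)
```

With boto3, a dictionary of this shape can be passed as the LifecycleConfiguration argument of the S3 client's put_bucket_lifecycle_configuration call.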

Keeping Audit Trail – By default, administrative actions (usually performed through APIs) are logged by all major cloud providers. If we want to increase log retention or include data actions (what an identity did with the data), we need to manually enable it.

Examples of services: AWS CloudTrail, Azure Monitor, Google Cloud Audit Logs, and Oracle Cloud Infrastructure Logging service.

Data Encryption – Enforce confidentiality by encrypting data in transit and at rest.

Principle 3: Privacy Embedded into Design

To achieve this principle, we need to embed privacy safeguards as part of the design.

Most data protection or data privacy regulations offer data subjects the following rights:

The right to be informed about the collection and use of their data.
The right to view and request copies of their data.
The right to request inaccurate or outdated personal information be updated or corrected.
The right to request their data be deleted.
The right to ask for their data to be transferred to another controller or provided to them.

When we design an application, we need to develop it to support the above data subject rights from day one, so once we need to use those functionalities, we will have them prepared, even before collecting information about the first data subject.
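
To make the data subject rights above concrete, here is a minimal sketch of an application-side interface that supports them from day one. The class and method names are hypothetical, and an in-memory dictionary stands in for the real data store.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SubjectStore:
    """In-memory stand-in for the application's personal-data store."""
    records: dict = field(default_factory=dict)

    def access(self, subject_id: str) -> dict:
        """Right to view: return a copy of everything held about the subject."""
        return dict(self.records.get(subject_id, {}))

    def rectify(self, subject_id: str, **changes) -> None:
        """Right to correction: create or update the subject's fields."""
        self.records.setdefault(subject_id, {}).update(changes)

    def erase(self, subject_id: str) -> bool:
        """Right to deletion: remove the subject; True if data existed."""
        return self.records.pop(subject_id, None) is not None

    def export(self, subject_id: str) -> str:
        """Right to portability: return the data in a standard format (JSON)."""
        return json.dumps(self.access(subject_id), sort_keys=True)


store = SubjectStore()
store.rectify("subject-1", email="old@example.com")
store.rectify("subject-1", email="new@example.com")
exported = store.export("subject-1")
erased = store.erase("subject-1")
print(exported, erased)
```

A real implementation would also need to cover backups, logs, and any third-party systems that hold copies of the data.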

Principle 4: Full functionality – positive-sum, not zero-sum

To achieve this principle, we need to look at the bigger picture.

Privacy safeguards should be embedded as part of the application design without weakening security controls or degrading the performance of other services.

An example is the security requirement to audit all actions in the system (for the incident response process), which must be weighed against the privacy requirement to keep only a minimum amount of information about data subjects, not to mention the cost of long-term audit log storage.

In the case of audit logs, we need to find the balance between having logs for investigation, while removing unnecessary information about data subjects, and perhaps moving old logs to an archive tier to save costs.
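
One way to strike this balance is to strip personal fields from audit records before long-term storage, while keeping enough to investigate an incident. The sketch below is an assumption about how this could look: the field names are hypothetical, and replacing the user ID with a salted hash is one possible pseudonymization technique, not a complete anonymization scheme.

```python
import hashlib

# Hypothetical field names; adjust to the application's own log schema.
PERSONAL_FIELDS = {"email", "full_name", "ip_address"}


def minimize_audit_record(record: dict, salt: str) -> dict:
    """Return a copy with personal fields removed and the user ID pseudonymized."""
    cleaned = {k: v for k, v in record.items() if k not in PERSONAL_FIELDS}
    if "user_id" in cleaned:
        raw = (salt + str(cleaned["user_id"])).encode("utf-8")
        # The same user always maps to the same token, so events can still be correlated.
        cleaned["user_id"] = hashlib.sha256(raw).hexdigest()[:16]
    return cleaned


event = {
    "action": "DeleteObject",
    "user_id": "u-1001",
    "email": "jane@example.com",
    "ip_address": "203.0.113.7",
}
minimized = minimize_audit_record(event, salt="example-salt")
print(minimized)
```

The salt should be stored as a secret; anyone who holds it can recompute the mapping for a known user ID.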

Principle 5: End-to-end security – full lifecycle protection

To achieve this principle, we need to make sure data is kept private throughout its entire lifecycle, from collection and storage through retirement and disposal (when it is no longer required).

When talking about data security, we must always remember to follow the CIA triad: Confidentiality, Integrity, and Availability.

The data lifecycle management contains the following:

Data generation or collection – We need to take into consideration automatic data classification.
Storage – We need to take into consideration data retention and archiving, including storage capacity and archiving capabilities.
Data use and sharing – We need to implement strong authentication and authorization processes to protect the data we store and process.
Data archive – We should take advantage of built-in storage archive capabilities that exist with all major cloud providers.
Data disposal – We should design mechanisms to allow us to destroy data no longer needed.

Principle 6: Visibility and transparency – keep it open

To achieve this principle, we need to create and publish a privacy policy that is available to our customers for each application or website we publicly expose to the Internet.

The privacy policy should contain information about:

The data we collect.
The purpose for collecting data from our customers.
Whether we share private data with third parties.
The data subject rights (such as viewing which data is being collected, and updating or deleting the data subject’s data).
How data subjects can contact us (to view, update, delete, or export their data).

Visibility and transparency are crucial, and as such, the privacy policy must be kept up to date.

Principle 7: Respect for user privacy – keep it user-centric

To achieve this principle, we need to put our customers (or data subjects) first.

User experience is an important factor – how will our customers know that we are collecting private data? How will they be able to consent to data collection, view the data we are collecting, or ask us to delete it?

We need to configure our application with privacy settings enabled by default and allow our customers an easy way to opt in to (subscribe) or opt out of (unsubscribe from) our service.
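
As a small illustration of privacy-protective defaults, the sketch below models per-customer settings in which every optional use of personal data starts disabled and changes only through an explicit opt-in. The setting names are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PrivacySettings:
    # Every optional use of personal data starts disabled (hypothetical names).
    marketing_emails: bool = False
    usage_analytics: bool = False
    share_with_partners: bool = False


def opt_in(settings: PrivacySettings, name: str) -> PrivacySettings:
    """Return new settings with one option enabled after explicit consent."""
    return replace(settings, **{name: True})


def opt_out(settings: PrivacySettings, name: str) -> PrivacySettings:
    """Return new settings with one option disabled again."""
    return replace(settings, **{name: False})


defaults = PrivacySettings()
subscribed = opt_in(defaults, "marketing_emails")
unsubscribed = opt_out(subscribed, "marketing_emails")
print(defaults, subscribed, unsubscribed)
```

Because the settings object is immutable, each change produces a new value, which makes it straightforward to record when consent was given or withdrawn.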

We need to design our system to allow customers an easy way to export their private data and support the portability of the data we collect to another third-party system in a standard readable format.

Summary

When designing applications that will store or process private data in the cloud, we should remember the shared responsibility model, together with the seven principles of privacy by design.

Some of the principles can be achieved using services offered by cloud providers, some using third-party solutions, and for some we are responsible for the implementation ourselves, in order to comply with privacy laws or regulations and to keep our customers’ private data safe.

For any organization designing new applications in the cloud, I recommend creating teams containing representatives of both the technology department (such as DevOps, architects, and security personnel) and the legal department (such as lawyers, data privacy, compliance, and risk), to be able to design an end-to-end solution.

I invite anyone designing new applications to read and get more information about the privacy laws and regulations affecting their customers.

Disclaimer – This blog post contains my opinion. It does not replace any legal advice for complying with privacy obligations or regulations.

Posted in Cloud computing, Privacy

Introduction to AWS Resilience Hub

May 17th, 2023

Author: Eyal Estrin

When deploying a new application in the public cloud, we need to ask the business owner what are the resiliency (or SLA) requirements – How long can the business survive while our application is down and does not serve customers?

There are various answers to that question – from 24/7 availability (not realistic) to uptime of 99.9%, etc.

The domain of resiliency has two main concepts:

RTO (Recovery Time Objective) – the maximum acceptable amount of time to recover a system after a disruption
RPO (Recovery Point Objective) – the maximum acceptable amount of data loss, measured in time
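
To put numbers behind these terms, the sketch below converts an availability target into a downtime budget and checks measured recovery figures against RTO and RPO targets. The function names and the sample targets are assumptions for illustration.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: int) -> float:
    """Minutes of downtime permitted in the period at the given availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


def meets_objectives(recovery_minutes: float, data_loss_minutes: float,
                     rto_minutes: float, rpo_minutes: float) -> bool:
    """True when measured recovery time and data loss stay within RTO and RPO."""
    return recovery_minutes <= rto_minutes and data_loss_minutes <= rpo_minutes


monthly_budget = allowed_downtime_minutes(99.9, period_days=30)
yearly_budget_hours = allowed_downtime_minutes(99.9, period_days=365) / 60
print(round(monthly_budget, 1), round(yearly_budget_hours, 2))

# Hypothetical targets: RTO of 60 minutes, RPO of 15 minutes.
print(meets_objectives(recovery_minutes=45, data_loss_minutes=20,
                       rto_minutes=60, rpo_minutes=15))
```

An uptime target of 99.9% leaves roughly 43 minutes of downtime in a 30-day month, or a little under 9 hours per year. In the second check the recovery time is within the RTO, but 20 minutes of lost data exceeds the 15-minute RPO, so the objectives are not met.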

To achieve high resiliency, or follow business SLA requirements, there are technical and cost consequences.

Naturally, we want to provision resources in a highly available configuration (such as a farm of front-end web servers behind a load balancer) or in a cluster (such as a cluster of database instances), deployed in multiple availability zones or perhaps in multiple regions, and avoid any single point of failure.

We need to plan an architecture that will support our business resiliency requirements.

In theory, an architect can look at a proposed architecture and say whether or not they see potential availability failures, but this approach does not scale to large and complex architectures.

In 2021, AWS announced the general availability of the AWS Resilience Hub.

In this blog post, I will review the purpose of this service and how we can use it regularly, as part of our CI/CD process.

How does AWS Resilience Hub work?

Source: https://docs.aws.amazon.com/resilience-hub/latest/userguide/how-it-works.html

To work with AWS Resilience Hub, follow the steps below:

Add an application

AWS Resilience Hub allows you to assess an application by scanning the following resources:

Set resilience targets

AWS Resilience Hub supports the following built-in tiers:

Foundational IT core services
Mission critical
Critical
Important
Non-critical

Choose the target policy according to the application business requirements of RTO and RPO.

Select one of the predefined suggested policies:

Non-critical application
Important Application
Critical Application
Global Critical Application
Mission Critical Application
Global Mission Critical Application
Foundational Core Service

AWS Resilience Hub allows you to evaluate the resiliency of an application against the following types of disruption:

Customer Application RTO and RPO
AWS Infrastructure RTO and RPO
Cloud Infrastructure Availability Zone (AZ) disruption
AWS Region disruption

Run an assessment

AWS Resilience Hub allows you to either run manual one-time assessments or schedule a daily assessment.

To get the most value from AWS Resilience Hub, you can integrate it as part of a CI/CD pipeline, as an additional step, once you provision Infrastructure as Code (using CloudFormation templates or Terraform modules).

A common example of integration with a CI/CD pipeline:
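
One way such a pipeline step could be sketched is as a gate that fails the build when the latest assessment breaches the resiliency policy. The field names and values below (assessmentStatus, complianceStatus, "Success", "PolicyMet") are assumptions modeled on the Resilience Hub DescribeAppAssessment response and should be verified against the current API reference; the sample dictionaries are made up.

```python
def assessment_passes(assessment: dict) -> bool:
    """Decide whether a completed assessment should let the pipeline continue."""
    # Assumed field: the assessment must have finished successfully.
    if assessment.get("assessmentStatus") != "Success":
        return False
    # Assumed field: the application must meet its resiliency policy (RTO/RPO targets).
    return assessment.get("complianceStatus") == "PolicyMet"


def gate_exit_code(assessment: dict) -> int:
    """Map the decision to a process exit code: 0 continues the pipeline, 1 stops it."""
    return 0 if assessment_passes(assessment) else 1


# Made-up sample responses, for illustration only.
compliant = {"assessmentStatus": "Success", "complianceStatus": "PolicyMet"}
breached = {"assessmentStatus": "Success", "complianceStatus": "PolicyBreached"}
print(gate_exit_code(compliant), gate_exit_code(breached))
```

In a real pipeline, the dictionary would come from the Resilience Hub API after the infrastructure-as-code deployment step (for example, through the boto3 resiliencehub client), and the script would end with sys.exit(gate_exit_code(...)).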

In a mature environment, you can take one step further and integrate AWS Resilience Hub with the built-in chaos engineering service AWS Fault Injection Simulator to conduct controlled experiments on your application and evaluate its resiliency.

Review results and continue improvements

Once an assessment is complete, it is time to review the results, to make sure your application meets the business resiliency requirements (in terms of RTO/RPO).

The results will be written in a report, with recommendations for improvements to your application resiliency, such as adding another node to an RDS cluster, deploying another EC2 instance in another availability zone, enabling S3 bucket versioning, etc.

To make things easy to understand and improve over time, you can build dashboards using Amazon QuickSight and send alerts using CloudWatch, as explained in the blog post:

https://aws.amazon.com/blogs/mt/resilience-reporting-dashboard-aws-resilience-hub/

For continuous and automated improvement, you can integrate AWS Resilience Hub with AWS Systems Manager to efficiently recover your application in the event of outages, as explained in the documentation:

https://docs.aws.amazon.com/resilience-hub/latest/userguide/create-custom-ssm-doc.html

Summary

In this blog post, we learned about the purpose of AWS Resilience Hub, the various steps for using it, and perhaps most importantly – how to automate the assessment as part of a CI/CD pipeline for continuous improvement.

I encourage anyone who builds applications on top of AWS to learn about the benefits of this service, providing insights into the resiliency of applications to meet business requirements.

Posted in AWS, Chaos Engineering, Cloud computing, Resiliency