Archive for the ‘Azure’ Category
Checklist for designing cloud-native applications – Part 2: Security aspects
This post was originally published by the Cloud Security Alliance.
In Chapter 1 of this series about considerations when building cloud-native applications, we introduced various topics such as business requirements, infrastructure considerations, automation, resiliency, and more.
In this chapter, we will review security considerations when building cloud-native applications.
IAM Considerations – Authentication
Identity and Access Management plays a crucial role when designing new applications.
We need to ask ourselves – Who are our customers?
If we are building an application that will serve internal customers, we need to make sure our application will be able to sync identities from our identity provider (IdP).
On the other hand, if we are planning an application that will serve external customers, in most cases we would not want to manage the identities ourselves, but rather allow authentication based on SAML, OAuth, or OpenID Connect, and manage the authorization in our application.
Examples of managed cloud-native identity services: AWS IAM Identity Center, Microsoft Entra ID, and Google Cloud Identity.
IAM Considerations – Authorization
Authorization is also an important factor when designing applications.
When our application consumes services (such as compute, storage, database, etc.) from a CSP's ecosystem, each CSP has its own mechanisms for managing permissions to access services and take actions, and its own way of implementing Role-Based Access Control (RBAC).
Regardless of the built-in mechanisms to consume cloud infrastructure, we must always follow the principle of least privilege (i.e., minimal permissions to achieve a task).
On the application layer, we need to design an authorization mechanism that checks each identity authenticated to our application (whether through interactive authentication, non-interactive authentication, or API-based access) against an authorization engine.
Although it is possible to manage authorization using our own home-grown RBAC mechanism, it is worth considering cloud-agnostic authorization policy engines such as Open Policy Agent (OPA).
One of the major benefits of using OPA is the fact that its policy engine is not limited to authorization to an application – you can also use it for Kubernetes authorization, for Linux (using PAM), and more.
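As a sketch, the application-layer authorization decision described above can be reduced to a role-to-permission lookup. The roles, permission names, and identity shape below are hypothetical; in practice, a policy engine such as OPA would evaluate a policy like this externally:

```python
# A minimal, hypothetical sketch of an application-layer authorization
# check, illustrating the kind of decision a policy engine returns.
# Role and permission names are invented for illustration.

ROLE_PERMISSIONS = {
    "viewer": {"orders:read"},
    "editor": {"orders:read", "orders:write"},
    "admin":  {"orders:read", "orders:write", "orders:delete"},
}

def is_allowed(identity: dict, action: str) -> bool:
    """Return True if any of the identity's roles grants the action."""
    return any(action in ROLE_PERMISSIONS.get(role, set())
               for role in identity.get("roles", []))

alice = {"user": "alice", "roles": ["editor"]}
bob = {"user": "bob", "roles": ["viewer"]}

print(is_allowed(alice, "orders:write"))  # True
print(is_allowed(bob, "orders:delete"))   # False
```

The same check applies regardless of how the identity authenticated – interactively, non-interactively, or via an API key.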
Policy-as-Code Considerations
Policy-as-Code allows you to configure guardrails on various aspects of your workload.
All major cloud providers offer guardrails, enforced outside the boundary of a cloud account, that constrain the maximum allowed resource consumption or configuration.
Examples of guardrails:
- Limitation on the allowed region for deploying resources (compute, storage, database, network, etc.)
- Enforce encryption at rest
- Forbid the ability to create publicly accessible resources (such as a VM with public IP)
- Enforce the use of specific VM instance sizes (number of CPUs and amount of memory allowed)
Guardrails can also be enforced as part of a CI/CD pipeline when deploying resources using Infrastructure as Code (IaC) for automation purposes – the IaC code is evaluated before the actual deployment phase, and assuming it does not violate the Policy-as-Code rules, the resources are deployed or updated.
Examples of Policy-as-Code: AWS Service control policies (SCPs), Azure Policy, Google Organization Policy Service, HashiCorp Sentinel, and Open Policy Agent (OPA).
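To illustrate, a guardrail evaluation step in a CI/CD pipeline might look like the following sketch. The resource fields and rules are hypothetical stand-ins for what tools such as OPA or Sentinel evaluate against real IaC code:

```python
# A hedged sketch of evaluating IaC resource definitions against
# guardrails before deployment. Field names are hypothetical.

ALLOWED_REGIONS = {"eu-west-1"}

def check_resource(resource: dict) -> list:
    """Return a list of policy violations for one resource definition."""
    violations = []
    if resource.get("region") not in ALLOWED_REGIONS:
        violations.append("region not allowed")
    if not resource.get("encrypted_at_rest", False):
        violations.append("encryption at rest not enforced")
    if resource.get("public_ip", False):
        violations.append("publicly accessible resource forbidden")
    return violations

vm = {"type": "vm", "region": "us-east-1",
      "encrypted_at_rest": True, "public_ip": True}
print(check_resource(vm))  # ['region not allowed', 'publicly accessible resource forbidden']
```

If any resource returns violations, the pipeline stops before the deployment phase.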
Data Protection Considerations
Almost any application contains valuable data, whether the data has business or personal value, and as such we must protect the data from unauthorized parties.
A common way to protect data is to store it in encrypted form:
- Encryption in transit – done using protocols such as TLS (where the latest supported version is 1.3)
- Encryption at rest – done on a volume, disk, storage, or database level, using algorithms such as AES
- Encryption in use – done using hardware supporting a trusted execution environment (TEE), also referred to as confidential computing
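As a small example of enforcing encryption in transit, a client can refuse any protocol older than TLS 1.3 using Python's standard ssl module:

```python
import ssl

# Create a client-side TLS context that only negotiates TLS 1.3.
context = ssl.create_default_context()
context.minimum_version = ssl.TLSVersion.TLSv1_3

# Any connection created from this context will fail the handshake
# against servers that only support TLS 1.2 or older.
print(context.minimum_version == ssl.TLSVersion.TLSv1_3)  # True
```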
When encrypting data, we need to handle key generation, a secured vault for key storage, key retrieval, and key destruction.
All major CSPs have their key management service to handle the entire key lifecycle.
If your application is deployed on top of a single CSP infrastructure, prefer to use managed services offered by the CSP.
For encryption in use, select services (such as VM instances or Kubernetes worker nodes) that support confidential computing.
Secrets Management Considerations
Secrets are equivalent to static credentials, allowing access to services and resources.
Examples of secrets are API keys, passwords, database credentials, etc.
Secrets, similarly to encryption keys, are sensitive and need to be protected from unauthorized parties.
From the initial application design process, we need to decide on a secured location to store secrets.
All major CSPs have their own secrets management service to handle the entire secret lifecycle.
As part of a CI/CD pipeline, we should embed an automated scanning process to detect secrets embedded in code, scripts, and configuration files, to avoid storing any secrets as part of our application (i.e., outside the secured secrets management vault).
Examples of secrets management services: AWS Secrets Manager, Azure Key Vault, Google Secret Manager, and HashiCorp Vault.
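A secret-scanning step in the pipeline can be as simple as the following sketch. The two patterns are illustrative only; real scanners (gitleaks, for example) ship far richer rule sets:

```python
import re

# A simplified, hypothetical secret scanner of the kind embedded in
# a CI/CD pipeline to catch secrets before they reach a repository.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                 # AWS access key ID shape
    re.compile(r"(?i)password\s*=\s*['\"].+['\"]"),  # hardcoded password
]

def scan(text: str) -> list:
    """Return all suspected secrets found in the given text."""
    return [m.group(0) for p in SECRET_PATTERNS for m in p.finditer(text)]

config = 'db_user = "app"\npassword = "s3cr3t!"\n'
print(scan(config))  # ['password = "s3cr3t!"']
```

Any non-empty result should fail the pipeline and route the finding to the developer.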
Network Security Considerations
Applications must be protected at the network layer, whether we expose them to internal customers or to customers over the public internet.
The fundamental way to protect infrastructure at the network layer is using access controls, which are equivalent to layer 3/layer 4 firewalls.
All CSPs have access control mechanisms to restrict access to services (from access to VMs, databases, etc.)
Examples of Layer 3 / Layer 4 managed services: AWS Security groups, Azure Network security groups, and Google VPC firewall rules.
Some cloud providers support private access to their services by placing a network load-balancer with an internal IP from the customer's private subnet in front of various services, forcing all traffic to pass over the CSP's backbone rather than the public internet.
Examples of private connectivity solutions: AWS PrivateLink, Azure Private Link, and Google Private Service Connect.
Some CSPs offer managed layer 7 firewalls, which allow customers to filter traffic based on protocols (not just ports), inspect TLS traffic for malicious content, and more, in case your application or business requires those capabilities.
Examples of Layer 7 managed firewalls: AWS Network Firewall, Azure Firewall, and Google Cloud NGFW.
Application Layer Protection Considerations
Any application accessible to customers (internal or over the public Internet) is exposed to application layer attacks.
Attacks range from malicious code injection and data tampering to data exfiltration (or data leakage), unauthorized access, and more.
Whether you are exposing an API, a web application, or a mobile application, it is important to implement application layer protection, such as a WAF service.
All major CSPs offer managed WAF services, and there are many SaaS solutions by commercial vendors that offer managed WAF services.
Examples of managed WAF services: AWS WAF, Azure WAF, and Google Cloud Armor.
DDoS Protection Considerations
Denial-of-Service (DoS) or Distributed Denial-of-Service (DDoS) is a risk for any service accessible over the public Internet.
Such attacks try to consume all available resources (from network bandwidth to CPU/memory), directly impacting the service's availability to customers.
All major CSPs offer managed DDoS protection services, and there are many DDoS protection solutions by commercial vendors that offer managed DDoS protection services.
Examples of managed DDoS protection services: AWS Shield, Azure DDoS Protection, Google Cloud Armor, and Cloudflare DDoS protection.
Patch Management Considerations
Software tends to contain vulnerabilities, and as such it must be patched regularly.
For applications deployed on top of virtual machines:
- Create a “golden image” of a virtual machine, and regularly update the image with the latest security patches and software updates.
- For running VMs, create a regular patch update process.
For applications wrapped inside containers, create a “golden image” of each of the application components, and regularly update the image with the latest security patches and software updates.
Embed software composition analysis (SCA) tools to scan and detect vulnerable third-party components – in case vulnerable components (or their dependencies) are detected, begin a process of replacing the vulnerable components.
Examples of patch management solutions: AWS Systems Manager Patch Manager, Azure Update Manager, and Google VM Manager Patch.
Compliance Considerations
Compliance is an important security factor when designing an application.
Some applications contain personally identifiable information (PII) about employees or customers, which requires compliance with privacy and data residency laws and regulations (such as the GDPR in Europe, the CPRA in California, the LGPD in Brazil, etc.)
Some organizations choose to comply with industry or security best practices, such as the Center for Internet Security (CIS) Benchmarks for hardening infrastructure components; compliance can later be evaluated using compliance services or Cloud Security Posture Management (CSPM) solutions.
References for compliance: AWS Compliance Center, Azure Service Trust Portal, and Google Compliance Resource Center.
Incident Response
When designing an application in the cloud, it is important to be prepared to respond to security incidents:
- Enable logging from both infrastructure and application components, and stream all logs to a central log aggregator. Make sure logs are stored in a central, immutable location, with access privileges limited to the SOC team.
- Select a tool to be able to review logs, detect anomalies, and be able to create actionable insights for the SOC team.
- Create playbooks for the SOC team, to know how to respond in case of a security incident (how to investigate, where to look for data, who to notify, etc.)
- To be prepared for a catastrophic event (such as a network breach or ransomware), create automated solutions that allow you to quarantine the impacted services and deploy a new environment from scratch.
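The quarantine automation mentioned in the last point can be sketched as follows. All identifiers and data structures are invented for illustration; a real playbook would call the CSP's APIs to detach the instance and swap its security group:

```python
# A hypothetical sketch of an automated quarantine step from an
# incident response playbook: detach the impacted instance from the
# load balancer and swap its security group for an isolation group.

ISOLATION_SG = "sg-quarantine"

def quarantine(instance: dict, load_balancer: dict) -> None:
    """Isolate a compromised instance while preserving it for forensics."""
    load_balancer["targets"].remove(instance["id"])   # stop serving traffic
    instance["security_groups"] = [ISOLATION_SG]      # block all access
    instance["tags"]["status"] = "quarantined"        # flag for the SOC team

lb = {"targets": ["i-1", "i-2"]}
vm = {"id": "i-2", "security_groups": ["sg-web"], "tags": {}}
quarantine(vm, lb)
print(lb["targets"], vm["security_groups"])  # ['i-1'] ['sg-quarantine']
```

Keeping the quarantined instance (rather than terminating it) preserves evidence for the investigation.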
References for incident response documentation: AWS Security Incident Response Guide, Azure Incident response, and Google Data incident response process.
Summary
In the second blog post in this series, we talked about many security-related aspects that organizations should consider when designing new applications in the cloud.
In this part of the series, we have reviewed various aspects, from identity and access management to data protection, network security, patch management, compliance, and more.
It is highly recommended to use the topics discussed in this series of blog posts as a baseline when designing new applications in the cloud, and to continuously improve this checklist of considerations when documenting your projects.
About the Author
Eyal Estrin is a cloud and information security architect, and the author of the book Cloud Security Handbook, with more than 20 years in the IT industry. You can connect with him on Twitter.
Opinions are his own and not the views of his employer.
Checklist for designing cloud-native applications – Part 1: Introduction
This post was originally published by the Cloud Security Alliance.
When organizations built legacy applications in the past, they aligned the infrastructure and application layers to business requirements, reviewing hardware requirements and limitations, team knowledge, security, legal considerations, and more.
In this series of blog posts, we will review considerations when building today’s cloud-native applications.
Readers of this series of blog posts can use the information shared, as a checklist to be embedded as part of a design document.
Introduction
Building a new application requires a thorough design process.
It is ok to try, fail, and fix mistakes during the process, but you still need to design.
Since technology keeps evolving, new services are released every day, and many organizations now begin using multiple cloud providers, it is crucial to avoid biased decisions.
During the design phase, avoid locking yourself into a specific cloud provider; instead, fully understand the requirements and constraints, and only then begin selecting the technology and services you will use to architect your application's workload.
Business Requirements
The first thing we need to understand is what is the business goal. What is the business trying to achieve?
Business requirements will impact architectural decisions.
Below are some of the common business requirements:
- Service availability – If an application needs to be available for customers around the globe, design a multi-region architecture.
- Data sovereignty – If there is a regulatory requirement to store customers' data in a specific country, make sure it is possible to deploy all infrastructure components in a cloud region located in that country. Examples of data sovereignty services: AWS Digital Sovereignty, Microsoft Cloud for Sovereignty, and Google Digital Sovereignty.
- Response time – If the business requirement is to allow fast responses to customer requests, you may consider the use of APIs or caching mechanisms.
- Scalability – If the business requirement is to provide customers with highly scalable applications, to be able to handle unpredictable loads, you may consider the use of event-driven architecture (such as the use of message queues, streaming services, and more)
Compute Considerations
Compute may be the most important part of any modern application, and today there are many alternatives for running the front-end and business logic of our applications:
- Virtual Machines – Offering the same experience we used to have running legacy applications on-premises, but also suitable for running applications in the cloud. In most cases, use VMs if you are migrating an application from on-premises to the cloud. Examples of services: Amazon EC2, Azure Virtual Machines, and Google Compute Engine.
- Containers and Kubernetes – Most modern applications are wrapped inside containers, and very often are scheduled using the Kubernetes orchestrator. Migrating container-based workloads between cloud providers is considered a medium-level challenge (you still need to take into consideration the integration with other managed services in each CSP's ecosystem). Examples of Kubernetes services: Amazon EKS, Azure AKS, and Google GKE.
- Serverless / Functions-as-a-Service – A modern way to run various parts of applications. The underlying infrastructure is fully managed by the cloud provider (no need to deal with scaling or maintenance of the infrastructure). Considered a form of vendor lock-in, since the unique characteristics of each CSP's offering make it impractical to migrate between CSPs. Examples of FaaS: AWS Lambda, Azure Functions, and Google Cloud Functions.
Data Store Considerations
Most applications require a persistent data store, for storing and retrieval of data.
Cloud-native applications (and specifically microservice-based architectures) allow selecting the most suitable back-end data store for your application.
In a microservice-based architecture, you can select different data stores for each microservice.
Alternatives for persistent data can be:
- Object storage – The most common managed storage service that most cloud applications are using to store data (from logs, archives, data lake, and more). Examples of object storage services: Amazon S3, Azure Blob Storage, and Google Cloud Storage.
- File storage – Most CSPs support managed NFS services (for Unix workloads) or SMB/CIFS (for Windows workloads). Examples of file storage services: Amazon EFS, Azure Files, and Google Filestore.
When designing an architecture, consider your application requirements such as:
- Fast data retrieval requirements – Requirements for fast read/write (measured in IOPS)
- File sharing requirements – Ability to connect to the storage from multiple sources
- Data access pattern – Some workloads require constant access to the storage, while others connect to the storage only occasionally (such as file archives)
- Data replication – Ability to replicate data over multiple AZs or even multiple regions
Database Considerations
It is very common for most applications to have at least one backend database for storing and retrieval of data.
When designing an application, understand the application requirements to select the most suitable database:
- Relational database – Database for storing structured data stored in tables, rows, and columns. Suitable for complex queries. When selecting a relational database, consider using a managed database that supports open-source engines such as MySQL or PostgreSQL over commercially licensed database engine (to decrease the chance of vendor lock-in). Examples of relational database services: Amazon RDS, Azure SQL, and Google Cloud SQL.
- Key-value database – Database for storing structured or unstructured data, with requirements for storing large amounts of data, with fast access time. Examples of key-value databases: Amazon DynamoDB, Azure Cosmos DB, and Google Bigtable.
- In-memory database – Database optimized for sub-millisecond data access, such as caching layer. Examples of in-memory databases: Amazon ElastiCache, Azure Cache for Redis, and Google Memorystore for Redis.
- Document database – Database suitable for storing JSON documents. Examples of document databases: Amazon DocumentDB, Azure Cosmos DB, and Google Cloud Firestore.
- Graph database – Database optimized for storing and navigating relationships between entities (such as a recommendation engine). Example of Graph database: Amazon Neptune.
- Time-series database – Database optimized for storing and querying data that changes over time (such as application metrics, data from IoT devices, etc.). Examples of time-series databases: Amazon Timestream, Azure Time Series Insights, and Google Bigtable.
One of the considerations when designing highly scalable applications is data replication – replicating data across multiple AZs is straightforward, while the more challenging task is replicating data across multiple regions.
A few managed database services support global tables, i.e., the ability to replicate across multiple regions, while most databases will require a separate mechanism for replicating database updates between regions.
Automation and Development
Automation allows us to perform repetitive tasks in a fast and predictable way.
Automation in cloud-native applications allows us to create a CI/CD pipeline for taking developed code, integrating the various application components and underlying infrastructure, performing various tests (from QA to security tests), and eventually deploying new versions of our production application.
Whether you are using a single cloud provider, managing environments on a large scale, or even across multiple cloud providers, you should align the tools that you are using across the different development environments:
- Code repositories – Select a central place to store all your development teams' code; ideally, it will allow you to use the same code repository for both on-prem and multiple cloud environments. Examples of code repositories: AWS CodeCommit, Azure Repos, and Google Cloud Source Repositories.
- Container image repositories – Select a central image repository, and sync it between regions, and if needed, also between cloud providers, to keep the same source of truth. Examples of container image repositories: Amazon ECR, Azure ACR, and Google Artifact Registry.
- CI/CD and build process – Select a tool to allow you to manage the CI/CD pipeline for all deployments, whether you are using a single cloud provider, or when using a multi-cloud environment. Examples of CI/CD build services: AWS CodePipeline, Azure Pipelines, and Google Cloud Build.
- Infrastructure as Code – Mature organizations choose an IaC tool to provision infrastructure for both single- and multi-cloud scenarios, lowering the burden on the DevOps, IT, and development teams. Examples of IaC: AWS CloudFormation, Azure Resource Manager, Google Cloud Deployment Manager, and HashiCorp Terraform.
Resiliency Considerations
Although many managed services in the cloud are resilient by design, you should still consider resiliency when designing production applications.
Design all layers of the infrastructure to be resilient.
Regardless of the computing service you choose, always deploy VMs or containers in a cluster, behind a load-balancer.
Prefer to use a managed storage service, deployed over multiple availability zones.
For a persistent database, prefer a managed service deployed in a cluster over multiple AZs, or even better, look for a serverless database offering, so you won't need to maintain the database's availability.
Do not leave things in the hands of fate; embed chaos engineering experiments as part of your workload resiliency tests, to gain a better understanding of how your workload will survive a failure. Examples of managed chaos engineering services: AWS Fault Injection Service, and Azure Chaos Studio.
Business Continuity Considerations
One of the most important requirements from production applications is the ability to survive failure and continue functioning as expected.
It is crucial to design for business continuity in advance.
For any service that supports backups or snapshots (from VMs, databases, and storage services), enable scheduled backup mechanisms, and randomly test backups to make sure they are functioning.
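Randomly testing backups can be automated along these lines; the in-memory backup catalog is a hypothetical stand-in for a real backup inventory:

```python
import hashlib
import random

# A sketch of randomly testing backups: pick one backup at random and
# verify its checksum against the value recorded at backup time.

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

backups = [
    {"name": "db-2024-01-01", "data": b"dump-1", "recorded": checksum(b"dump-1")},
    {"name": "db-2024-01-02", "data": b"dump-2", "recorded": checksum(b"dump-2")},
]

def verify_random_backup(catalog: list) -> bool:
    """Spot-check one backup; a mismatch means the backup is corrupt."""
    backup = random.choice(catalog)
    return checksum(backup["data"]) == backup["recorded"]

print(verify_random_backup(backups))  # True
```

A real test should go further and actually restore the backup into a scratch environment.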
For objects stored inside an object storage service that requires resiliency, configure cross-region replication.
For container registries that require resiliency, configure image replication across regions.
For applications deployed in a multi-region architecture, use DNS records to allow traffic redirection between regions.
Observability Considerations
Monitoring and logging give you insights into your application and infrastructure behavior.
Telemetry allows you to collect real-time information about your running application, such as customer experience.
While designing an application, consider all the options available for enabling logging, both from infrastructure services and from the application layer.
It is crucial to stream all logs to a central system, aggregated and time-synchronized.
Logging by itself is not enough – you need to be able to gain actionable insights, to be able to anticipate issues before they impact your customers.
It is crucial to define KPIs for monitoring an application’s performance, such as CPU/Memory usage, latency and uptime, average response time, etc.
Many modern tools use machine learning capabilities to review large numbers of logs, correlate events from multiple sources, and provide recommendations for improvements.
Cost Considerations
Cost is an important factor when designing architectures in the cloud.
As such, it must be embedded in every aspect of the design, implementation, and ongoing maintenance of the application and its underlying infrastructure.
Cost should be the responsibility of every team member (IT, developers, DevOps, architects, security staff, etc.), covering both initial service cost and operational aspects.
A FinOps mindset helps make sure we choose the right service for the right purpose – from the right compute service to the right data store or the right database.
It is not enough to select a service – make sure any service selected is tagged, monitored for cost regularly, and perhaps even replaced with better, more cost-effective alternatives during the lifecycle of the workload.
Sustainability Considerations
The architectural decisions we make have an environmental impact.
When developing modern applications, consider the environmental impact.
Choosing the right computing service allows running a workload with a minimal carbon footprint – for example, containers or serverless/FaaS waste less energy in the cloud provider's data centers.
The same applies when selecting a data store, according to an application's data access patterns (from a hot or real-time tier up to an archive tier).
Designing event-driven applications, adding caching layers, shutting down idle resources, and continuously monitoring workload resources will allow you to design an efficient and sustainable workload.
Sustainability related references: AWS Sustainability, Azure Sustainability, and Google Cloud Sustainability.
Employee Knowledge Considerations
The easiest thing is to decide to build a new application in the cloud.
The challenging part is to make sure all teams are aligned in terms of the path to achieving business goals and the knowledge to build modern applications in the cloud.
Organizations should invest the necessary resources in employee training, making sure all team members have the required knowledge to build and maintain modern applications in the cloud.
It is crucial to make sure all team members have the necessary knowledge to maintain applications and infrastructure in the cloud before beginning the actual project, to avoid unpredictable costs, long learning curves while running in production, or building an inefficient workload due to knowledge gaps.
Training related references: AWS Skill Builder, Microsoft Learn for Azure, and Google Cloud Training.
Summary
In the first blog post in this series, we talked about many aspects organizations should consider when designing new applications in the cloud.
In this part of the series, we have reviewed various aspects, from understanding business requirements to selecting the right infrastructure, automation, resiliency, cost, and more.
When creating the documentation for a new development project, organizations can use the information in this series to form a checklist, making sure all important aspects and decisions are documented.
In the next chapter of this series, we will discuss security aspects when designing and building a new application in the cloud.
About the Author
Eyal Estrin is a cloud and information security architect, and the author of the book Cloud Security Handbook, with more than 20 years in the IT industry. You can connect with him on Twitter.
Opinions are his own and not the views of his employer.
Building Resilient Applications in the Cloud
When building an application for serving customers, one of the questions raised is: how do I know if my application is resilient and will survive a failure?
In this blog post, we will review what it means to build resilient applications in the cloud, and we will review some of the common best practices for achieving resilient applications.
What does it mean to build resilient applications?
AWS provides us with the following definition for the term resiliency:
“The ability of a workload to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions, such as misconfigurations or transient network issues.”
Resiliency is part of the Reliability pillar for cloud providers such as AWS, Azure, GCP, and Oracle Cloud.
AWS takes it one step further, and shows how resiliency is part of the shared responsibility model:
- The cloud provider is responsible for the resilience of the cloud (i.e., hardware, software, computing, storage, networking, and anything related to their data centers)
- The customer is responsible for the resilience in the cloud (i.e., selecting the services to use, building resilient architectures, backup strategies, data replication, and more).
How do we build resilient applications?
This blog post assumes that you are building modern applications in the public cloud.
We have all heard of RTO (Recovery Time Objective).
A resilient workload (a combination of application, data, and the infrastructure that supports it) should not only recover automatically, but must recover within a pre-defined RTO, agreed upon with the business owner.
Below are common best practices for building resilient applications:
Design for high availability
The public cloud allows you to easily deploy infrastructure over multiple availability zones.
Examples of implementing high availability in the cloud:
- Deploying multiple VMs behind an auto-scaling group and a front-end load-balancer
- Spreading container load over multiple Kubernetes worker nodes, deployed across multiple AZs
- Deploying a cluster of database instances in multiple AZs
- Deploying global (or multi-regional) database services (such as Amazon Aurora Global Database, Azure Cosmos DB, Google Cloud Spanner, and Oracle Global Data Services (GDS))
- Configuring DNS routing rules to send customers' traffic to more than a single region
- Deploying a global load-balancer (such as AWS Global Accelerator, Azure Cross-region Load Balancer, or Google Global external Application Load Balancer) to spread customers' traffic across regions
Implement autoscaling
Autoscaling is one of the biggest advantages of the public cloud.
Assuming we built a stateless application, we can add or remove compute nodes using the autoscaling capability, adjusting capacity to the actual load on our application.
In a cloud-native infrastructure, we will use a managed load-balancer service to receive traffic from customers, with scaling metrics triggering the autoscaling group to add or remove compute nodes.
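The scaling decision itself can be sketched as a target-tracking calculation; the target utilization and node limits below are illustrative, not any specific provider's algorithm:

```python
import math

# A toy sketch of the scaling decision an autoscaling group makes:
# target-tracking on average CPU utilization across the fleet.

def desired_capacity(current_nodes: int, avg_cpu: float,
                     target_cpu: float = 50.0,
                     min_nodes: int = 2, max_nodes: int = 10) -> int:
    """Scale the node count so average CPU moves toward the target."""
    desired = math.ceil(current_nodes * avg_cpu / target_cpu)
    return max(min_nodes, min(max_nodes, desired))

print(desired_capacity(4, avg_cpu=90.0))  # 8 - scale out under load
print(desired_capacity(4, avg_cpu=20.0))  # 2 - scale in when idle
```

The min/max bounds prevent both a scale-in to zero and runaway scale-out costs.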
Implement microservice architecture
Microservice architecture is meant to break a complex application into smaller parts, each responsible for certain functionality of the application.
By implementing microservice architecture, we are decreasing the impact of failed components on the rest of the application.
In case of high load on a specific component, it is possible to add more compute resources to that component, and if we discover a bug in one of the microservices, we can roll back to a previous functioning version of that microservice, with minimal impact on the rest of the application.
Implement event-driven architecture
Event-driven architecture allows us to decouple our application components.
Event-driven architecture contributes to resiliency because even if one component fails, the rest of the application continues to function.
Components are loosely coupled by using events that trigger actions.
Event-driven architectures are usually (but not always) based on services managed by cloud providers, who are responsible for the scale and maintenance of the managed infrastructure.
Event-driven architectures are based on models such as message queues and pub/sub (services such as Amazon SQS, Azure Web PubSub, Google Cloud Pub/Sub, and the OCI Queue service), or on event delivery (services such as Amazon EventBridge, Azure Event Grid, Google Eventarc, and the OCI Events service).
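A minimal in-process sketch of the pub/sub model shows why it improves resiliency: a publisher does not know (or care) which subscribers consume an event, so one failing consumer does not break the rest:

```python
from collections import defaultdict

# An in-process pub/sub sketch. Real systems use managed brokers,
# but the decoupling principle is the same.

subscribers = defaultdict(list)

def subscribe(topic: str, handler) -> None:
    subscribers[topic].append(handler)

def publish(topic: str, event: dict) -> None:
    for handler in subscribers[topic]:
        try:
            handler(event)
        except Exception:
            pass  # one failing consumer must not break the others

received = []
subscribe("orders", lambda e: received.append(e))
subscribe("orders", lambda e: 1 / 0)  # a faulty consumer

publish("orders", {"id": 42})
print(received)  # [{'id': 42}]
```

In a managed broker, the failed delivery would typically be retried or routed to a dead-letter queue rather than silently dropped.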
Implement API Gateways
If your application exposes APIs, use API Gateways (services such as Amazon API Gateway, Azure API Management, Google Apigee, or OCI API Gateway) to allow incoming traffic to your backend APIs, perform throttling to protect the APIs from spikes in traffic, and perform authorization on incoming requests from customers.
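The throttling behavior of an API gateway is commonly implemented as a token bucket; the following is a hedged sketch with illustrative limits, not any specific gateway's implementation:

```python
import time

# A token bucket: admits bursts up to its capacity, then refills at
# a fixed rate. API gateways apply this per client or per API key.

class TokenBucket:
    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the gateway would return HTTP 429 here

bucket = TokenBucket(capacity=3, refill_per_sec=1.0)
results = [bucket.allow() for _ in range(5)]
print(results)  # [True, True, True, False, False]
```

The burst of five requests exhausts the three-token capacity; subsequent requests are throttled until the bucket refills.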
Implement immutable infrastructure
Immutable infrastructure (such as VMs or containers) is meant to run application components without storing session state inside the compute nodes.
In case of a failed component, it is easy to replace it with a new one with minimal disruption to the rest of the application, allowing fast recovery.
Data Management
Find the most suitable data store for your workload.
A microservice architecture allows you to select different data stores (from object storage to backend databases) for each microservice, decreasing the risk of complete failure due to availability issues in one of the backend data stores.
Once you select a data store, replicate it across multiple AZs, and if the business requires it, replicate it across multiple regions, to allow better availability, closer to the customers.
Implement observability
By monitoring all workload components, and sending logs from both infrastructure and application components to a central logging system, it is possible to identify anomalies, anticipate failures before they impact customers, and act.
Examples of actions can be sending a command to restart a VM, deploying a new container instead of a failed one, and more.
It is important to keep track of measurements — for example, what is considered normal response time to a customer request, to be able to detect anomalies.
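The baseline idea can be sketched as a simple z-score check over recent response times (threshold and sample values are illustrative; real observability platforms use far more sophisticated detectors):

```python
# A sketch of latency anomaly detection: flag a response time as anomalous
# when it deviates strongly from the recent mean.

from statistics import mean, stdev

def is_anomalous(baseline_ms, sample_ms, threshold=3.0):
    mu, sigma = mean(baseline_ms), stdev(baseline_ms)
    if sigma == 0:
        return sample_ms != mu
    return abs(sample_ms - mu) / sigma > threshold

baseline = [98, 102, 101, 99, 100, 103, 97, 100]   # "normal" response times (ms)
ok = is_anomalous(baseline, 104)     # within normal variation
alert = is_anomalous(baseline, 450)  # likely a failing component
```

The alert condition is what would trigger an automated action such as restarting a VM or replacing a failed container.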
Implement chaos engineering
The base assumption is that everything will eventually fail.
Implementing chaos engineering allows us to conduct controlled experiments, injecting faults into our workloads to test what will happen in case of failure.
This allows us to better understand if our workload will survive a failure.
Examples can be adding load on disk volumes, injecting timeout when an application tier connects to a backend database, and more.
Examples of services for implementing chaos engineering are AWS Fault Injection Simulator, Azure Chaos Studio, and Gremlin.
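A controlled experiment such as the database-timeout example above can be sketched like this (the "database" is a stand-in function, and the failure rate and seed are illustrative):

```python
# A sketch of fault injection: wrap a dependency call and inject a timeout
# for a fraction of requests, then check whether the caller survives.

import random

def inject_fault(func, failure_rate, seed=None):
    rng = random.Random(seed)
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise TimeoutError("injected: database connection timed out")
        return func(*args, **kwargs)
    return wrapped

def query_database(sql):
    return ["row1", "row2"]            # stand-in for a real backend query

flaky_query = inject_fault(query_database, failure_rate=0.5, seed=7)

survived, failures = 0, 0
for _ in range(100):
    try:
        flaky_query("SELECT 1")
    except TimeoutError:
        failures += 1                  # fallback path, e.g., serve cached results
    survived += 1                      # the caller itself keeps working
```

The experiment's success criterion is that the caller handles every injected timeout through its fallback path instead of crashing.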
Create a failover plan
In an ideal world, your workload will be designed for self-healing, meaning, it will automatically detect a failure and recover from it, for example, replace failed components, restart services, or switch to another AZ or even another region.
In practice, you need to prepare a failover plan, keep it up to date, and make sure your team is trained to act in case of major failure.
A disaster recovery plan without proper and regular testing is worth nothing: your team must practice repeatedly and adjust the plan, so that during an emergency they can execute it with minimal impact on customers.
Resilient applications tradeoffs
Failure can happen in various ways, and when we design our workload, we need to limit the blast radius on our workload.
Below are some common failure scenarios, and possible solutions:
- Failure in a specific component of the application — By designing a microservice architecture, we can limit the impact of a failed component to a specific area of our application (depending on the criticality of the component, as part of the entire application)
- Failure of a single AZ — By deploying infrastructure over multiple AZs, we can decrease the chance of application failure and impact on our customers
- Failure of an entire region — Although this scenario is rare, cloud regions also fail, and by designing a multi-region architecture, we can decrease the impact on our customers
- DDoS attack — By implementing DDoS protection mechanisms, we can decrease the risk of impacting our application with a DDoS attack
Whatever solution we design for our workloads, we need to understand that there is a cost and there might be tradeoffs for the solution we design.
Multi-region architecture aspects
A multi-region architecture provides the most highly available and resilient solution for your workloads; however, it adds high cost for cross-region egress traffic, many services are limited to a single region, and your staff needs to know how to support such a complex architecture.
Another limitation of multi-region architecture is data residency — if your business or regulator demands that customers’ data be stored in a specific region, a multi-region architecture is not an option.
Service quota/service limits
When designing a highly resilient architecture, we must take into consideration service quotas or service limits.
Sometimes we are bound to a service quota on a specific AZ or region, an issue that we may need to resolve with the cloud provider’s support team.
Sometimes we need to understand that there is a service limit in a specific region, such as a service that is not available there, or a shortage of hardware in that region.
Autoscaling considerations
Horizontal autoscaling (the ability to add or remove compute nodes) is one of the fundamental capabilities of the cloud; however, it has its limitations.
Provisioning a new compute node (whether a VM, container instance, or even a database instance) may take a couple of minutes to spin up (which may impact customer experience) or to spin down (which may impact service cost).
Also, to support horizontal scaling, you need to make sure the compute nodes are stateless, and that the application supports such capability.
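The scale-out/scale-in decision can be sketched as follows (thresholds and node bounds are illustrative; real autoscalers also use cooldown periods and multiple metrics):

```python
# A sketch of horizontal autoscaling logic: add nodes under load, remove
# them when idle, within configured min/max bounds.

def desired_nodes(current, cpu_percent, min_nodes=2, max_nodes=10):
    if cpu_percent > 70:                           # scale out under load
        return min(current + 1, max_nodes)
    if cpu_percent < 30 and current > min_nodes:   # scale in when idle
        return current - 1
    return current                                 # steady state

plan = [
    desired_nodes(3, 85),    # high load: add a node
    desired_nodes(10, 95),   # already at the quota ceiling: stay
    desired_nodes(3, 10),    # idle: remove a node
]
```

Note that this only works if the nodes are stateless, as mentioned above; otherwise removing a node would lose session data.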
Failover considerations
One of the limitations of database failover is the ability to switch between the primary node and one of the secondary nodes, either in case of failure or during scheduled maintenance.
We need to take data replication into consideration, making sure transactions were saved and replicated from the primary to the read-replica node.
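The replication check can be sketched as comparing log positions before promoting a replica (the position numbers and the zero-lag requirement are illustrative):

```python
# A sketch of a pre-failover safety check: only promote a replica once it
# has applied all transactions up to the primary's last committed position.

def safe_to_fail_over(primary_commit_pos, replica_applied_pos, max_lag=0):
    """Return True when the replica is caught up enough to promote."""
    return primary_commit_pos - replica_applied_pos <= max_lag

ready = safe_to_fail_over(primary_commit_pos=1042, replica_applied_pos=1042)
lagging = safe_to_fail_over(primary_commit_pos=1042, replica_applied_pos=987)
```

Promoting the lagging replica would lose the 55 unapplied transactions, which is exactly the risk the paragraph above warns about.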
Summary
In this blog post, we have covered many aspects of building resilient applications in the cloud.
When designing new applications, we need to understand the business expectations (in terms of application availability and customer impact).
We also need to understand the various architectural design considerations, and their tradeoffs, to be able to match the technology to the business requirements.
As I always recommend — do not stay on the theoretical side of the equation, begin designing and building modern and highly resilient applications to serve your customers — There is no replacement for actual hands-on experience.
References
- Understand resiliency patterns and trade-offs to architect efficiently in the cloud
- Building resilience to your business requirements with Azure
- Success through culture: why embracing failure encourages better software delivery
- Building Resilient Solutions in OCI
About the Author
Eyal Estrin is a cloud and information security architect, and the author of the book Cloud Security Handbook, with more than 20 years in the IT industry. You can connect with him on Twitter.
Opinions are his own and not the views of his employer.
Why choosing “Lift & Shift” is a bad migration strategy
One of the first decisions organizations make before migrating applications to the public cloud is deciding on a migration strategy.
For many years, the most common and easy way to migrate applications to the cloud was choosing a rehosting strategy, also known as “Lift and shift”.
In this blog post, I will review some of the reasons, showing that strategically this is a bad decision.
Introduction
When reviewing the landscape of possibilities for migrating legacy or traditional applications to the public cloud, rehosting may look like the best option as a short-term solution.
Taking an existing monolith application, and migrating it as-is to the cloud, is supposed to be an easy task:
- Map all the workload components (hardware requirements, operating system, software and licenses, backend database, etc.)
- Choose similar hardware (memory/CPU/disk space) to deploy a new VM instance(s)
- Configure network settings (including firewall rules, load-balance configuration, DNS, etc.)
- Install all the required software components (assuming no license dependencies exist)
- Restore the backend database from the latest full backup
- Test the newly deployed application in the cloud
- Expose the application to customers
From a time and required-knowledge perspective, this is considered a quick-win solution, but how efficient is it?
Cost-benefit
Using physical or even virtual machines does not guarantee us close to 100% of hardware utilization.
In the past, organizations used to purchase hardware and commit to it for 3–5 years (for vendor support purposes).
Although organizations could use the hardware 24×7, there were many cases where purchased hardware was consuming electricity and floor-space, without running at full capacity (i.e., underutilized).
Virtualization did allow organizations to run multiple VMs on the same physical hardware, but even then, it did not guarantee 100% hardware utilization — think about Dev/Test environments or applications that were not getting traffic from customers during off-peak hours.
The cloud offers organizations new purchase/usage methods (such as on-demand or Spot), allowing customers to pay just for the time they used compute resources.
Keeping a traditional data-center mindset, using virtual machines, is not efficient enough.
Switching to modern ways of running applications, such as the use of containers, Function-as-a-Service (FaaS), or event-driven architectures, allows organizations to make better use of their resources, at much better prices.
Right-sizing
On day 1, it is hard to predict the right VM instance size for the application.
When migrating applications as-is, organizations tend to select similar hardware (mostly CPU/Memory), to what they used to have in the traditional data center, regardless of the application’s actual usage.
After a legacy application is running for several weeks in the cloud, we can measure its actual performance, and switch to a more suitable VM instance size, gaining better utilization and price.
Tools such as AWS Compute Optimizer, Azure Advisor, or Google Recommender will allow you to select the most suitable VM instance size, but the VM still does not utilize 100% of the possible compute resources, compared to containers or Function-as-a-Service.
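The right-sizing idea can be sketched as picking the smallest size whose capacity covers observed peak usage plus headroom (the instance catalog, sizes, and headroom factor below are illustrative, not any provider's actual offering):

```python
# A sketch of right-sizing: after measuring actual peak usage for several
# weeks, select the smallest instance size that covers it with headroom.

SIZES = [                    # (name, vCPUs, memory GiB), smallest first
    ("small", 2, 4),
    ("medium", 4, 8),
    ("large", 8, 16),
    ("xlarge", 16, 32),
]

def recommend(peak_vcpus, peak_mem_gib, headroom=1.2):
    need_cpu, need_mem = peak_vcpus * headroom, peak_mem_gib * headroom
    for name, vcpus, mem in SIZES:
        if vcpus >= need_cpu and mem >= need_mem:
            return name
    return SIZES[-1][0]      # nothing bigger available

choice = recommend(peak_vcpus=3.1, peak_mem_gib=5.5)   # measured over weeks
```

This is the kind of analysis the recommender tools above automate, driven by real utilization metrics instead of the original on-premises hardware spec.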
Scaling
Horizontal scaling is one of the main benefits of the public cloud.
Although it is possible to configure multiple VMs behind a load balancer with autoscaling capability, adding or removing VMs according to the load on the application, legacy applications may not always support horizontal scaling. Even if they do support scale-out (adding compute nodes), there is a very good chance they do not support scale-in (removing unneeded compute nodes).
VMs do not support the ability to scale to zero — i.e., completely removing all compute nodes when there is no customer demand.
Cloud-native applications deployed on top of containers, using a scheduler such as Kubernetes (such as Amazon EKS, Azure AKS, or Google GKE), can horizontally scale according to need (scale out as much as needed, or as many compute resources the cloud provider’s quota allows).
Functions as part of FaaS (such as AWS Lambda, Azure Functions, or Google Cloud Functions) are invoked as a result of triggers, and erased when the function’s job completes — maximum compute utilization.
Load time
Spinning up a new VM as part of auto-scaling activity (such as AWS EC2 Auto Scaling, Azure Virtual Machine Scale Sets, or Google Managed Instance Groups), an upgrade, or a reboot takes a long time — specifically for large workloads such as Windows VMs, databases (deployed on top of VMs), or application servers.
Provisioning a new container (based on Linux OS), including all the applications and layers, takes a couple of seconds (depending on the number of software layers).
Invoking a new function takes a few seconds, even if you take into consideration cold start issues when downloading the function’s code.
Software maintenance
Every workload requires ongoing maintenance — from code upgrades, third-party software upgrades, and let us not forget security upgrades.
Every software upgrade requires a lot of overhead from the IT, development, and security teams.
Performing upgrades of a monolith, where various components and services are tightly coupled together, increases the complexity and the chances that something will break.
Switching to a microservice architecture allows organizations to upgrade specific components (for example, scaling out, deploying a new version of code, or replacing a third-party software component), with small to zero impact on the other components of the application.
Infrastructure maintenance
In the traditional data center, organizations used to deploy and maintain every component of the underlying infrastructure supporting the application.
Maintaining services such as databases or even storage arrays requires a dedicated trained staff, and requires a lot of ongoing efforts (from patching, backup, resiliency, high availability, and more).
In cloud-native environments, organizations can take advantage of managed services, from managed databases, storage services, caching, monitoring, and AI/ML services, without having to maintain the underlying infrastructure.
Unless an application relies on a legacy database engine, in most cases you will be able to replace a self-maintained database server with a managed database service.
For storage services, most cloud providers already offer all the commodity storage services (from a managed NFS, SMB/CIFS, NetApp, and up to parallel file system for HPC workloads).
Most modern cloud-native services use object storage (such as Amazon S3, Azure Blob Storage, or Google Cloud Storage), providing scalable storage for large amounts of data (from backups and log files to data lakes).
Most cloud providers offer managed networking services for load-balancing, firewalls, web application firewalls, and DDoS protection mechanisms, supporting workloads with unpredictable traffic.
SaaS services
Up until now, we have discussed lift & shift from on-premises to VMs (mostly IaaS) and managed services (PaaS), but let us not forget there is another migration strategy — repurchasing, meaning replacing an existing application with a Software-as-a-Service offering, allowing organizations to consume fully managed services without having to take care of ongoing maintenance and resiliency.
Summary
Keeping a static data center mindset and migrating using “lift & shift” to the public cloud is the least cost-effective strategy, and in most cases will end up with medium to low performance for your applications.
It may have been the common strategy a couple of years ago when organizations just began taking their first step in the public cloud, but as more knowledge is gained from both public cloud providers and all sizes of organizations, it is time to think about more mature cloud migration strategies.
It is time for organizations to embrace a dynamic mindset of cloud-native services and cloud-native applications, which provide organizations with many benefits: (almost) infinite scale, automated provisioning (using Infrastructure-as-Code), a rich cloud ecosystem (with many managed services), and (if managed correctly) the ability to match workload costs to actual consumption.
I encourage all organizations to expand their knowledge about the public cloud, assess their existing applications and infrastructure, and begin modernizing their existing applications.
Re-architecture may demand a lot of resources (both cost and manpower) in the short term but will provide an organization with a lot of benefits in the long run.
References:
- 6 Strategies for Migrating Applications to the Cloud
- Overview of application migration examples for Azure
- Migrate to Google Cloud
Introduction to Serverless Container Services
When developing modern applications, we almost immediately think about wrapping our application components inside Containers — it may not be the only architectural alternative, but a very common one.
Assuming our developers and DevOps teams have the required expertise to work with Containers, we still need to think about maintaining the underlying infrastructure — i.e., the Container hosts.
If our application has a steady and predictable load, and assuming we do not have experience maintaining Kubernetes clusters, and we do not need the capabilities of Kubernetes, it is time to think about an easy and stable alternative for deploying our applications on top of Containers infrastructure.
In the following blog post, I will review the alternatives of running Container workloads on top of Serverless infrastructure.
Why do we need Serverless infrastructure for running Container workloads?
Container architecture is made of a Container engine (such as Docker, CRI-O, etc.) deployed on top of a physical or virtual server, and on top of the Container engine, we deploy multiple Container images for our applications.
The diagram below shows a common Container architecture:
If we focus on the Container engine and the underlying operating system, we understand that we still need to maintain the operating system itself.
Common maintenance tasks for the operating system:
- Make sure it has enough resources (CPU, memory, storage, and network connectivity) for running Containers
- Make sure the operating system is fully patched and hardened from external attacks
- Make sure our underlying infrastructure (i.e., Container host nodes), provides us with high availability in case one of the host nodes fails and needs to be replaced
- Make sure our underlying infrastructure provides us the necessary scale our application requires (i.e., scale out or in according to application load)
Instead of having to maintain the underlying host nodes, we should look for a Serverless solution, that allows us to focus on application deployment and maintenance and decrease as much as possible the work on maintaining the infrastructure.
Comparison of Serverless Container Services
Each of the hyperscale cloud providers offers us the ability to consume a fully managed service for deploying our Container-based workloads.
Below is a comparison of AWS, Azure, and Google Cloud alternatives:
Side notes for Azure users
While researching for this blog post, I had a debate about whether to include Azure Containers Apps or Azure Container Instances.
Although both services allow customers to run Containers in a managed environment, Azure Container Instances is more suitable for running a single Container application, while Azure Container Apps allows customers to build a full microservice-based application.
Summary
In this blog post, I have compared alternatives for deploying microservice architecture on top of Serverless Container services offered by AWS, Azure, and GCP.
While designing your next application based on microservice architecture, and assuming you don’t need a full-blown Kubernetes cluster (with all of its features and complexities), consider using Serverless Container service.
References
Introduction to Break-Glass in Cloud Environments
Using modern cloud environments, specifically production environments, decreases the need for human access.
It makes sense for developers to have access to Dev or Test environments, but in a properly designed production environment, everything should be automated – from deployment, and observability to self-healing. In most cases, no human access is required.
Production environments serve customers, require zero downtime, and in most cases contain customers’ data.
There are cases such as emergency scenarios where human access is required.
In mature organizations, this type of access is done by the Site reliability engineering (SRE) team.
The term break-glass is an analogy to breaking a glass to pull a fire alarm, which is supposed to happen only in case of emergency.
In the following blog post, I will review the different alternatives each of the hyperscale cloud providers gives their customers to handle break-glass scenarios.
Ground rules for using break-glass accounts
Before talking about how each of the hyperscale cloud providers handles break-glass, it is important to be clear – break-glass accounts should be used in emergency cases only.
- Authentication – All access through the break-glass mechanism must be authenticated, preferred against a central identity provider, and not using local accounts
- Authorization – All access must be authorized using role-based access control (RBAC), following the principle of least privilege
- MFA – Since most break-glass scenarios require highly privileged access, it is recommended to enforce multi-factor authentication (MFA) for any interactive access
- Just-in-time access – All access through break-glass mechanisms must be granted temporarily and must be revoked after a pre-defined amount of time or when the emergency is declared over
- Approval process – Access through a break-glass mechanism should be manually approved
- Auditing – All access through break-glass mechanisms must be audited and kept as evidence for further investigation
- Documented process – Organizations must have a documented and tested process for requesting, approving, using, and revoking break-glass accounts
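Several of the rules above (just-in-time access, approval, auditing) can be sketched together as a temporary grant object (field names and the TTL are illustrative, not any provider's API):

```python
# A sketch of a just-in-time break-glass grant: it records the approver and
# reason as audit evidence, and expires automatically after a pre-defined TTL.

from datetime import datetime, timedelta, timezone

class BreakGlassGrant:
    def __init__(self, user, role, approver, reason, ttl_minutes=60):
        self.user, self.role = user, role
        self.approver, self.reason = approver, reason      # audit evidence
        self.expires_at = datetime.now(timezone.utc) + timedelta(minutes=ttl_minutes)
        self.revoked = False

    def is_active(self, now=None):
        now = now or datetime.now(timezone.utc)
        return not self.revoked and now < self.expires_at

    def revoke(self):          # emergency declared over
        self.revoked = True

grant = BreakGlassGrant("sre-oncall", "prod-admin", approver="ciso",
                        reason="region failover", ttl_minutes=60)
active_now = grant.is_active()
after_ttl = grant.is_active(now=datetime.now(timezone.utc) + timedelta(hours=2))
```

The key property is that access lapses on its own even if nobody remembers to revoke it, while revocation is still available the moment the emergency ends.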
Handling break-glass scenarios in AWS
Below is a list of best practices provided by AWS for handling break-glass scenarios:
Identity Management
Identities in AWS are managed using AWS Identity and Access Management (IAM).
When working with AWS Organizations, customers have the option for central identity management for the entire AWS Organization using AWS IAM Identity Center – a single-sign-on (SSO) and federated identity management service (working with Microsoft Entra ID, Google Workspace, and more).
Since there might be a failure with a remote identity provider (IdP) or with AWS IAM Identity Center, AWS recommends creating two IAM users on the root of the AWS Organizations tree, and an IAM break-glass role on each of the accounts in the organization, to allow access in case of emergency.
The break-glass IAM accounts need to have console access, as explained in the documentation.
Authentication Management
When creating IAM accounts, enforce the use of a strong password policy, as explained in the documentation.
Passwords for the break-glass IAM accounts must be stored in a secured vault, and once the work on the break-glass accounts is over, the passwords must be replaced immediately to avoid reuse.
AWS recommends enforcing the use of MFA for any privileged access, as explained in the documentation.
Access Management
Access to resources inside AWS is managed using AWS IAM Roles.
AWS recommends creating a break-glass IAM role, as explained in the documentation.
Access using break-glass IAM accounts must be temporary, as explained in the documentation.
Auditing
All API calls within the AWS environment are logged into AWS CloudTrail by default, and stored for 90 days.
As a best practice, it is recommended to send all CloudTrail logs from the entire AWS Organization to a central S3 bucket, as explained in the documentation.
Since audit trail logs contain sensitive information, it is recommended to encrypt all data at rest using customer-managed encryption keys (as explained in the documentation) and limit access to the log files to the SOC team for investigation purposes.
Audit logs stored inside AWS CloudTrail can be investigated using Amazon GuardDuty, as explained in the documentation.
Resource Access
To allow secured access to EC2 instances, AWS recommends using EC2 Instance Connect or AWS Systems Manager Session Manager.
To allow secured access to Amazon EKS nodes, AWS recommends using AWS Systems Manager Agent (SSM Agent).
To allow secured access to Amazon ECS container instances, AWS recommends using AWS Systems Manager, and for debugging purposes, AWS recommends using Amazon ECS Exec.
To allow secured access to Amazon RDS, AWS recommends using AWS Systems Manager Session Manager.
Handling break-glass scenarios in Azure
Below is a list of best practices provided by Microsoft for handling break-glass scenarios:
Identity Management
Although identities in Azure are managed using Microsoft Entra ID (formerly Azure AD), Microsoft recommends creating two cloud-only accounts that use the *.onmicrosoft.com domain, to allow access in case of emergency or when there are problems logging in using federated identities from the on-premises Active Directory, as explained in the documentation.
Authentication Management
Microsoft recommends enabling password-less login for the break-glass accounts using a FIDO2 security key, as explained in the documentation.
Microsoft recommends not enforcing MFA for emergency or break-glass accounts, to prevent tenant-wide account lockout, and excluding the break-glass accounts from Conditional Access policies, as explained in the documentation.
Access Management
Microsoft allows customers to manage privileged access to resources using Microsoft Entra Privileged Identity Management (PIM) and recommends assigning the break-glass accounts permanent access to the Global Administrator role, as explained in the documentation.
Microsoft Entra PIM allows customers to control requests for privileged access, as explained in the documentation.
Auditing
Activity logs within the Azure environment are logged into Azure Monitor by default, and stored for 90 days.
As a best practice, it is recommended to enable diagnostic settings with the “allLogs” category and send the logs to a central Log Analytics workspace, from the entire Azure tenant, as explained in the documentation.
Since audit trail logs contain sensitive information, it is recommended to encrypt all data at rest using customer-managed encryption keys (as explained in the documentation) and limit access to the log files to the SOC team for investigation purposes.
Audit logs stored inside a Log Analytics workspace can be queried for further investigation using Microsoft Sentinel, as explained in the documentation.
Microsoft recommends creating an alert when break-glass accounts perform sign-in attempts, as explained in the documentation.
Resource Access
To allow secured access to virtual machines (using SSH or RDP), Microsoft recommends using Azure Bastion.
To allow secured access to the Azure Kubernetes Service (AKS) API server, Microsoft recommends using Azure Bastion, as explained in the documentation.
To allow secured access to Azure SQL, Microsoft recommends creating an Azure Private Endpoint and connecting to the Azure SQL using Azure Bastion, as explained in the documentation.
Another alternative to allow secured access to resources in private networks is to use Microsoft Entra Private Access, as explained in the documentation.
Handling break-glass scenarios in Google Cloud
Below is a list of best practices provided by Google for handling break-glass scenarios:
Identity and Access Management
Identities in GCP are managed using Google Workspace or using Google Cloud Identity.
Access to resources inside GCP is managed using IAM Roles.
Google recommends creating a dedicated Google group for the break-glass IAM role, and configuring temporary access to this Google group as explained in the documentation.
The temporary access is done using IAM conditions, and it allows customers to implement Just-in-Time access, as explained in the documentation.
For break-glass access, add dedicated Google identities to the mentioned Google group, to gain temporary access to resources.
Authentication Management
Google recommends enforcing the use of MFA for any privileged access, as explained in the documentation.
Auditing
Admin Activity logs (configuration changes) within the GCP environment are logged into Google Cloud Audit Logs by default, and stored for 400 days.
It is recommended to manually enable data access audit logs to get more insights about break-glass account activity, as explained in the documentation.
As a best practice, it is recommended to send all Cloud Audit logs from the entire GCP Organization to a central Google Cloud Storage bucket, as explained in the documentation.
Since audit trail logs contain sensitive information, it is recommended to encrypt all data at rest using customer-managed encryption keys (as explained in the documentation) and limit access to the log files to the SOC team for investigation purposes.
Audit logs stored inside Google Cloud Audit Logs can be sent to the Google Security Command Center for further investigation, as explained in the documentation.
Resource Access
To allow secured access to Google Compute Engine instances, Google recommends using an Identity-Aware Proxy, as explained in the documentation.
To allow secured access to Google App Engine instances, Google recommends using an Identity-Aware Proxy, as explained in the documentation.
To allow secured access to Google Cloud Run service, Google recommends using an Identity-Aware Proxy, as explained in the documentation.
To allow secured access to Google Kubernetes Engine (GKE) instances, Google recommends using an Identity-Aware Proxy, as explained in the documentation.
Summary
In this blog post, we have reviewed what break-glass accounts are, and how AWS, Azure, and GCP recommend securing them (covering authentication, authorization, auditing, and secure access to cloud resources).
I recommend any organization that manages cloud production environments follow the vendors’ security best practices and keep the production environment secured.
Introduction to AI Code Generators
The past couple of years brought us tons of examples of using generative AI to improve many aspects of our lives.
We can see vendors, with strong community and developers’ support, introducing more and more services for almost any aspect of our lives.
The two most famous examples are ChatGPT (AI Chatbot) and Midjourney (Image generator).
Wikipedia provides us with the following definition for Generative AI:
“Generative artificial intelligence (also generative AI or GenAI) is artificial intelligence capable of generating text, images, or other media, using generative models. Generative AI models learn the patterns and structure of their input training data and then generate new data that have similar characteristics.”
Source: https://en.wikipedia.org/wiki/Generative_artificial_intelligence
In this blog post, I will compare some of the alternatives for using Gen AI to assist developers in producing code.
What are AI Code Generators?
AI code generators are services that use AI/ML engines, are integrated as part of the developer’s Integrated Development Environment (IDE), and provide the developer with code suggestions based on the programming language and the project’s context.
In most cases, AI code generators come as a plugin or an addon to the developer’s IDE.
Mature AI code generators support multiple programming languages, integrate with most popular IDEs, and can provide valuable code samples by understanding both the context of the code and the cloud provider’s ecosystem.
AI Code Generators Terminology
Below are a couple of terms to know when using AI code generators:
- Suggestions – The output of AI code generators is code samples
- Prompts – Collection of code and supporting contextual information
- User engagement data / Client-side telemetry – Events generated at the client IDE (error messages, latency, feature engagement, etc.)
- Code snippets – Lines of code created by the developer inside the IDE
- Code References – Code originated from open-source or externally trained data
AI Code Generators – Alternative Comparison
The table below provides a comparison between the alternatives the largest cloud providers offer their customers:
AI Code Generators – Security Aspects
AI code generators can provide many benefits for developers, but at the end of the day we need to remember that these are still cloud-based services, deployed in multi-tenant environments, and, as with any AI/ML service, the vendor is aiming to train its engines to provide better answers.
Code may contain sensitive data – from static credentials (secrets, passwords, API keys) and hard-coded IP addresses or DNS names (for accessing back-end or even internal services) to intellectual property embedded in the code itself.
Before consuming AI code generators, it is recommended to thoroughly review the vendors’ documentation, understand what data (such as telemetry) is transferred from the developer’s IDE back to the cloud, and how data is protected at all layers.
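One way to reduce the risk of sensitive data leaving the IDE is to scan source code for credential-like patterns before it is shared. Below is a minimal sketch of such a scanner; the regex patterns and the `scan_source` function are illustrative only (real scanners such as gitleaks or truffleHog use far more comprehensive rule sets), not part of any vendor's tooling:

```python
import re

# Illustrative patterns for credential-like strings; real secret scanners
# ship hundreds of rules and entropy checks on top of patterns like these.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "generic_api_key": re.compile(r"(?i)api[_-]?key\s*=\s*['\"][A-Za-z0-9]{16,}['\"]"),
    "hardcoded_ip": re.compile(r"\b(?:10|172|192)\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"),
}

def scan_source(text: str) -> list[str]:
    """Return the names of the patterns found in the given source text."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]

findings = scan_source('api_key = "abcd1234efgh5678"\nhost = "10.0.0.12"')
```

Running such a check as a pre-commit hook (or an IDE extension) catches obvious leaks before code reaches the AI service or a repository.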
The table below provides a comparison between the alternatives the largest cloud providers offer their customers from a security point of view:
Summary
In this blog post, we have reviewed alternatives of AI code generators, offered by AWS, Azure, and GCP.
Although those services offer many benefits, allowing developers faster coding, they are still a work in progress.
Customers should perform their risk analysis before using those services, and limit as much as possible the amount of data shared with the cloud providers (since they are all built on multi-tenant environments).
As with any code developed, it is recommended to embed security controls, such as Static application security testing (SAST) tools, and invest in security training for developers.
References
- What is Amazon CodeWhisperer?
https://docs.aws.amazon.com/codewhisperer/latest/userguide/what-is-cwspr.html
- GitHub Copilot documentation
https://docs.github.com/en/copilot
- Duet AI in Google Cloud overview
Identity and Access Management in Multi-Cloud Environments
IAM (Identity and Access Management) is a crucial part of any cloud environment.
As organizations evolve, they may look at multi-cloud as a way to consume services from different cloud providers (such as AI/ML, data analytics, and more), to benefit from different pricing models, or to decrease the risk of vendor lock-in.
Before we begin the discussion about IAM, we need to understand the following fundamental concepts:
- Identity – An account representing a persona (human) or a service (non-interactive account)
- Authentication – The act by which an identity proves itself to a system (for example, by providing a username and password, a certificate, or an API key)
- Authorization – The act of granting and validating an identity's privileges to take actions on a system (such as viewing configuration, reading database content, or uploading a file to object storage)
- Access Management – The entire lifecycle of IAM: from account provisioning, granting access, and validating privileges, until account or privilege revocation.
Identity and Access Management Terminology
Authorization in the Cloud
Although all cloud providers have the same concept of identities, when we deep dive into the concept of authorization or access management to resources/services, we need to understand the differences between cloud providers.
Authorization in AWS
AWS has two concepts for managing permissions to resources:
- IAM Role – An identity with a set of permissions that can be assumed temporarily by a user or service.
- IAM Policy – A document that defines a set of permissions and can be attached to an IAM role, user, or group.
Permissions in AWS can be assigned to:
- Identity – A policy attached to a user, group, or role.
- Resource – A policy attached to a resource (such as Amazon S3 bucket).
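The principle of least privilege can be made concrete with an identity-based policy. Below is a minimal sketch of an IAM policy document, built as a Python dictionary, that allows only reading objects from a single S3 bucket (the bucket name is hypothetical):

```python
import json

# Hypothetical bucket name; the policy grants read-only access to one bucket
# and nothing else, following the principle of least privilege.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowReadOnlyBucketAccess",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-app-bucket",      # bucket itself (ListBucket)
                "arn:aws:s3:::example-app-bucket/*",    # objects inside (GetObject)
            ],
        }
    ],
}

policy_json = json.dumps(policy, indent=2)
```

The resulting JSON is what you would attach to a role, user, or group; anything not explicitly allowed is denied by default.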
Authorization in Azure
Permissions in Azure AD are controlled by roles.
A role defines the permissions an identity has over an Azure resource.
Within Azure AD, you control permissions using RBAC (Role-based access control).
Azure AD supports the following types of roles:
- Built-in roles – A pre-defined role according to job function (as you can read on the link).
- Custom roles – A role that we create ourselves to match the principle of least privilege.
Authorization in Google Cloud
Permissions in Google Cloud IAM are controlled by IAM roles.
Google Cloud IAM supports the following types of IAM roles:
- Basic roles – The most permissive type of roles (Owner, Editor, and Viewer).
- Predefined roles – Roles managed by Google, which provide granular access to specific services (as you can read on the link).
- Custom roles – User-specific roles, which provide the most granular access to resources.
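A custom role in Google Cloud is simply a named list of permissions. The sketch below shows an illustrative custom role definition (the role ID, title, and description are my own choices) expressed as a Python dictionary, matching the fields a role definition carries:

```python
import json

# Illustrative custom role: read-only access to Cloud Storage objects,
# granting only the permissions the task actually requires.
custom_role = {
    "roleId": "storageObjectReader",
    "title": "Storage Object Reader (custom)",
    "description": "Read-only access to objects, following least privilege",
    "stage": "GA",
    "includedPermissions": [
        "storage.objects.get",
        "storage.objects.list",
    ],
}

role_json = json.dumps(custom_role, indent=2)
```

Compared with the Basic roles (Owner, Editor, Viewer), a custom role like this one limits the blast radius of a compromised account to exactly two permissions.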
Authorization – Default behavior
As we can see below, each cloud provider takes a different approach to default permissions:
- AWS – By default, new IAM users have no permission to access any resource in AWS. To allow access to resources or take actions, you need to explicitly attach an IAM policy or role to the user.
- Azure – By default, all Azure AD users are granted a set of default permissions (such as listing all users, reading all properties of users and groups, registering new applications, and more).
- Google Cloud – By default, a new service account is granted the Editor role on the project level.
Identity Federation
When we are talking about identity federation, there are two concepts:
- Service Provider (SP) – Provides access to resources
- Identity Provider (IdP) – Authenticates the identities
Identities (user accounts, service accounts, groups, etc.) are managed by an Identity Provider (IdP).
An IdP can exist in the local data center (such as Microsoft Active Directory) or in the public cloud (such as AWS IAM, Azure AD, or Google Cloud IAM).
Federation is the act of creating trust between separate identity providers (IdPs).
Federation allows us to keep identity in one repository (i.e., Identity Provider).
Once we set up an identity federation, we can grant an identity privilege to consume resources in a remote repository.
Example: a worker with an account in Microsoft Active Directory can read a file from object storage in Azure, once a federation trust has been established between Microsoft Active Directory and Azure Active Directory.
When federating between on-premises and cloud environments, we need to remember that different protocols are in use.
On-premises environments use legacy authentication protocols such as Kerberos or LDAP.
In the public cloud, the common authentication protocols are SAML 2.0, OpenID Connect (OIDC), and OAuth 2.0.
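In OIDC, the identity provider issues a signed JSON Web Token (JWT) whose payload segment carries the identity's claims. The sketch below decodes a JWT payload using only the standard library; the token itself is self-generated for illustration, and signature verification is deliberately omitted (in production, always verify the signature, e.g., with a library such as PyJWT):

```python
import base64
import json

def decode_jwt_payload(token: str) -> dict:
    """Decode the (unverified!) payload segment of a JWT."""
    payload_b64 = token.split(".")[1]
    # JWTs use base64url without padding; restore the padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Build a toy token (header.payload.signature) for demonstration only.
header = base64.urlsafe_b64encode(b'{"alg":"RS256","typ":"JWT"}').rstrip(b"=").decode()
payload = base64.urlsafe_b64encode(
    json.dumps({"sub": "user@example.com", "iss": "https://idp.example.com"}).encode()
).rstrip(b"=").decode()
token = f"{header}.{payload}.signature"

claims = decode_jwt_payload(token)
```

The `sub` (subject) and `iss` (issuer) claims are what the service provider uses to map the federated identity to local authorization rules.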
Each cloud provider has a list of supported external third-party identity providers to federate with, as you can read in the list below:
- Integrating third-party SAML solution providers with AWS
- Azure AD Identity Provider Compatibility Docs
- Google Cloud IAM – Configure workforce identity federation
Single Sign-On
The concept behind SSO is to allow identities (usually end users) access to resources in the cloud while signing in (to an identity provider) only once.
Over the past couple of years, the concept of SSO has been extended, and it is now possible to allow a single identity (authenticated to a specific identity provider) access to resources through a federated login to an external (mostly SAML-based) identity provider.
Each cloud provider has its own SSO service, supporting federation with external identity providers:
- AWS IAM Identity Center
- Azure Active Directory single sign-on
- Google Cloud Workload identity federation
Steps for creating a federation between cloud providers
The process below explains (at a high level) the steps required to set up identity federation between different cloud providers:
- Choose an IdP (where identities will be created and authenticated to).
- Create a SAML identity provider.
- Configure roles for your third-party identity provider.
- Assign roles to the target users.
- Create trust between SP and IdP.
- Test the ability to authenticate a user and access a resource in the remote/external cloud provider.
Additional References:
- AWS IAM Identity Center and Azure AD as IdP
- How to set up IAM federation using Google Workspace
- Azure AD SSO integration with AWS IAM Identity Center
- Azure AD SSO integration with Google Cloud / G Suite Connector by Microsoft
- Federating Google Cloud with Azure Active Directory
- Configure Google workload identity federation with AWS or Azure
Summary
In this blog post, we took a deep dive into identity and access management in the cloud, comparing different aspects of IAM in AWS, Azure, and GCP.
After reviewing how authentication and authorization work in each of the three cloud providers, we explained how federation and SSO work in a multi-cloud environment.
Important to keep in mind:
When we are building systems in the cloud, whether they are publicly exposed or even internal, we need to follow some basic rules:
- All-access to resources/systems/applications must be authenticated
- Permissions must follow the principle of least privilege and business requirements
- All access must be audited (for future analysis, investigation purposes, etc.)
Introduction to Chaos Engineering
In the past couple of years, we have heard the term “Chaos Engineering” more and more in the context of the cloud.
Mature organizations have already begun to embrace the concepts of chaos engineering, and perhaps the most famous use of chaos engineering began at Netflix when they developed Chaos Monkey.
To quote Werner Vogels, Amazon CTO: “Everything fails, all the time”.
What is chaos engineering and what are the benefits of using chaos engineering for increasing the resiliency and reliability of workloads in the public cloud?
What is Chaos Engineering?
“Chaos Engineering is the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production.” (Source: https://principlesofchaos.org)
Large-scale production workloads are built from multiple services, forming distributed systems.
When we design large-scale workloads, we think about things such as:
- Building highly available systems
- Creating disaster recovery plans
- Eliminating single points of failure
- Having the ability to scale up and down quickly according to the load on our application
One thing we usually do not stop to think about is the connectivity between various components of our application and what will happen in case of failure in one of the components of our application.
What will happen if, for example, a web server tries to access a backend database and fails to do so due to network latency on the way to the database?
How will this affect our application and our customers?
What if we could test such scenarios on a live production environment, regularly?
Do we trust our application or workloads infrastructure so much, that we are willing to randomly take down parts of our infrastructure, just so we will know the effect on our application?
How will this affect the reliability of our application, and how will it allow us to build better applications?
History of Chaos Engineering
In 2010, Netflix developed a tool called “Chaos Monkey”, whose goal was to randomly take down compute instances in the Netflix production environment and test the impact on the overall Netflix service experience.
In 2011, Netflix released a toolset called “The Simian Army”, which extended Chaos Monkey with capabilities across reliability, security, and resiliency (e.g., Chaos Kong, which simulates an entire AWS region going down).
In 2012, Chaos Monkey became an open-source project (under Apache 2.0 license).
In 2016, a company called Gremlin released the first “Failure-as-a-Service” platform.
In 2017, the LitmusChaos project was announced, which provides chaos jobs in Kubernetes.
In 2019, Alibaba Cloud announced ChaosBlade, an open-source Chaos Engineering tool.
In 2020, Chaos Mesh 1.0 was announced as generally available, an open-source cloud-native chaos engineering platform.
In 2021, AWS announced the general availability of AWS Fault Injection Simulator, a fully managed service to run controlled experiments.
In 2021, Azure announced the public preview of Azure Chaos Studio.
What exactly is Chaos Engineering?
Chaos Engineering is about experimentation based on real-world hypotheses.
Think about Chaos Engineering, as one of the tests you run as part of a CI/CD pipeline, but instead of a unit test or user acceptance test, you inject controlled faults into the system to measure its resiliency.
Chaos Engineering can be used for both modern cloud-native applications (built on top of Kubernetes) and for the legacy monolith, to achieve the same result – answering the question – will my system or application survive a failure?
At a high level, Chaos Engineering consists of the following steps:
- Create a hypothesis
- Run an experiment
- Analyze the results
- Improve system resiliency
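The four steps above can be sketched as a tiny experiment loop. In this toy sketch the "system" is a hypothetical function and the fault is injected in-process; real platforms (AWS Fault Injection Simulator, Azure Chaos Studio, Chaos Mesh, etc.) inject faults at the infrastructure level, but the hypothesis-experiment-analysis cycle is the same:

```python
import random

def handle_request(dependency_up: bool) -> str:
    """A toy service that degrades gracefully when its dependency is down."""
    if dependency_up:
        return "ok"
    return "degraded"  # serve a cached/fallback response instead of failing

def run_experiment(failure_rate: float, requests: int = 1000) -> float:
    """Inject dependency failures at the given rate and return the ratio of
    fully successful requests (hypothesis: the service never hard-fails,
    it only degrades)."""
    random.seed(42)  # deterministic, so the experiment is repeatable
    results = [
        handle_request(random.random() > failure_rate) for _ in range(requests)
    ]
    return results.count("ok") / requests

# Step 1: hypothesis = "at 20% dependency failure, no request hard-fails".
# Step 2: run the experiment.
ok_ratio = run_experiment(failure_rate=0.2)
# Steps 3-4: analyze ok_ratio vs. the SLO, then improve the fallback path.
```

If the analysis shows the success ratio dropping below the service-level objective, that finding feeds directly into the "improve system resiliency" step.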
As an example, here is AWS’s point of view regarding the shared responsibility model, in the context of resiliency:
Source: https://aws.amazon.com/blogs/architecture/chaos-engineering-in-the-cloud
Chaos Engineering managed platform comparison
In the table below we can see a comparison between AWS and Azure-managed services for running Chaos Engineering experiments:
Summary
In this post, I have explained the concept of Chaos Engineering and compared alternatives to cloud-managed services.
Using Chaos Engineering as part of a regular development process will allow you to increase the resiliency of your applications, by studying the effect of failures and designing recovery processes.
Chaos Engineering can also be used as part of a disaster recovery and business continuity process, by testing the resiliency of your systems.
Additional References
- Chaos engineering (Wikipedia)
- Principles of Chaos Engineering
- Chaos Engineering in the Cloud
- What Chaos Engineering Is (and is not)
- AWS re:Invent 2022 – The evolution of chaos engineering at Netflix (NFX303)
- What is AWS Fault Injection Simulator?
- What is Azure Chaos Studio?
- Public Chaos Engineering Stories / Implementations
Introduction to Day 2 Kubernetes
Over the years, I have shared several blog posts about Kubernetes (What are Containers and Kubernetes, Modern Cloud deployment and usage, Introduction to Container Operating Systems, and more).
Kubernetes became the de facto standard for running container-based workloads (both on-premises and in the public cloud), but most organizations tend to struggle with what is referred to as Day 2 Kubernetes operations.
In this blog post, I will review what “Day 2 Kubernetes” means and how to prepare your workloads for the challenges of Day 2 operations.
Ready, Set, Go!
In the software lifecycle, or the context of this post, the Kubernetes lifecycle, there are several distinct stages:
Day 0 – Planning and Design
In this stage, we focus on designing our solution (application and underlying infrastructure), understanding business needs, budget, required skills, and more.
For the context of this post, let us assume we have decided to build a cloud-native application, made of containers, deployed on top of Kubernetes.
Day 1 – Configuration and Deployment
In this stage, we focus on deploying our application using the Kubernetes orchestrator and setting up the configurations (number of replicas, public ports, auto-scale settings, and more).
Most organizations taking their first steps deploying applications on Kubernetes are stuck at this stage.
They may have multiple environments (such as Dev, Test, UAT) and perhaps even production workloads, but they are still on Day 1.
Day 2 – Operations
Mature organizations have reached this stage.
This is about ongoing maintenance, observability, and continuous improvement of security aspects of production workloads.
In this blog post, I will dive into “Day 2 Kubernetes”.
Day 2 Kubernetes challenges
Below are the most common Kubernetes challenges:
Observability
Managing Kubernetes at a large scale requires insights into the Kubernetes cluster(s).
It is not enough to monitor the Kubernetes cluster by collecting performance logs, errors, or configuration changes (of nodes, Pods, containers, etc.).
We need the ability to truly understand the internals of the Kubernetes cluster (from logs, metrics, traces, etc.), to diagnose its behavior (not just performance issues, but also debugging problems and detecting anomalies), and (hopefully) to anticipate problems before they affect customers.
Prefer to use cloud-native monitoring and observability tools to monitor Kubernetes clusters.
Without proper observability, we will not be able to do root cause analysis and understand problems with our Kubernetes cluster or with our application deployed on top of Kubernetes.
Common tools for observability:
- Prometheus – An open-source systems monitoring and alerting toolkit for monitoring large cloud-native deployments.
- Grafana – An open-source query, visualization, and alerting tool (resource usage, built-in and customized metrics, alerts, dashboards, log correlation, etc.)
- OpenTelemetry – A collection of open-source tools for collecting and exporting telemetry data (metrics, logs, and traces) for analyzing software performance and behavior.
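To make the observability discussion concrete: Prometheus scrapes metrics over HTTP in a simple text exposition format. Below is a dependency-free sketch that renders a single counter in that format; in practice you would use the official prometheus_client library, which also serves the HTTP scrape endpoint for you:

```python
def render_counter(name: str, help_text: str, labels: dict, value: float) -> str:
    """Render one counter metric in the Prometheus text exposition format:
    a HELP line, a TYPE line, and the sample itself with sorted labels."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} counter\n"
        f"{name}{{{label_str}}} {value}\n"
    )

metric = render_counter(
    "http_requests_total",
    "Total HTTP requests handled.",
    {"method": "GET", "code": "200"},
    1027,
)
```

This is exactly the kind of output a `/metrics` endpoint exposes and Grafana later visualizes after Prometheus stores it.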
Additional references for managed services:
- Amazon Managed Grafana
- Amazon Managed Service for Prometheus
- AWS Distro for OpenTelemetry
- Azure Monitor managed service for Prometheus (still in preview as of April 2023)
- Azure Managed Grafana
- OpenTelemetry with Azure Monitor
- Google Cloud Managed Service for Prometheus
- Google Cloud Logging plugin for Grafana
- OpenTelemetry Collector (Part of Google Cloud operations suite)
Security and Governance
On the one hand, it is easy to deploy a Kubernetes cluster in private mode, meaning, the API server or the Pods are on an internal subnet and not directly exposed to customers.
On the other hand, many challenges in the security domain need to be solved:
- Secrets Management – A central and secure vault for generating, storing, retrieving, rotating, and eventually revoking secrets (instead of hard-coded static credentials inside our code or configuration files).
- Access control mechanisms – Ability to control what persona (either human or service account) has access to which resources inside the Kubernetes cluster and to take what actions, using RBAC (Role-based access control) mechanisms.
- Software vulnerabilities – Any vulnerabilities related to code: from programming languages (such as Java, PHP, .NET, or NodeJS) and the use of open-source libraries with known vulnerabilities, to vulnerabilities inside Infrastructure-as-Code (such as Terraform modules).
- Hardening – Ability to deploy a Kubernetes cluster at scale, using secured configuration, such as CIS Benchmarks.
- Networking – Ability to set isolation between different Kubernetes clusters or even between different development teams using the same cluster, not to mention multi-tenancy where using the same Kubernetes platform to serve different customers.
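Least privilege inside the cluster is expressed through RBAC objects such as a namespaced Role. Below is a sketch of a read-only Role manifest (the namespace and name are illustrative), built as a Python dictionary and serialized to JSON, which the Kubernetes API accepts just like YAML:

```python
import json

# Illustrative read-only Role: can only get/list/watch Pods in the "dev"
# namespace; no write or delete verbs, and no access to other namespaces.
role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "Role",
    "metadata": {"namespace": "dev", "name": "pod-reader"},
    "rules": [
        {
            "apiGroups": [""],  # "" denotes the core API group
            "resources": ["pods"],
            "verbs": ["get", "list", "watch"],
        }
    ],
}

manifest = json.dumps(role, indent=2)
```

A RoleBinding would then attach this Role to a specific user, group, or service account, completing the access control chain described above.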
Additional Reference:
- Securing the Software Supply Chain in the Cloud
- OPA (Open Policy Agent) Gatekeeper
- Kyverno – Kubernetes Native Policy Management
- Foundational Cloud Security with CIS Benchmarks
- Amazon EKS Best Practices Guide for Security
- Azure security baseline for Azure Kubernetes Service (AKS)
- GKE Security Overview
Developers experience
Mature organizations have already embraced DevOps methodologies for pushing code through a CI/CD pipeline.
The entire process needs to be automated, without developers having direct access to production environments (for emergencies, build break-glass mechanisms for the SRE teams).
The switch to applications wrapped inside containers, allowed developers to develop locally or in the cloud and push new versions of their code to various environments (such as Dev, Test, and Prod).
Integrating a CI/CD pipeline together with containers allows organizations to continuously ship new software versions, but it requires expanding developers' knowledge through training.
The use of GitOps and tools such as Argo CD allowed a continuous delivery process for Kubernetes environments.
To give developers the best experience, you need to integrate the CI/CD process into the development environment, providing the development team the same experience they had developing legacy applications on-premises, which can speed up developer onboarding.
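In the GitOps model mentioned above, the desired state lives in Git and a controller such as Argo CD continuously reconciles the cluster toward it. Below is a sketch of an Argo CD Application manifest (the repository URL, path, and namespaces are hypothetical), expressed as a Python dictionary:

```python
import json

# Hypothetical repo, path, and namespaces; this object declares
# "keep the cluster in sync with whatever is committed to Git".
application = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Application",
    "metadata": {"name": "my-app", "namespace": "argocd"},
    "spec": {
        "project": "default",
        "source": {
            "repoURL": "https://github.com/example-org/my-app-manifests.git",
            "targetRevision": "main",
            "path": "overlays/prod",
        },
        "destination": {
            "server": "https://kubernetes.default.svc",
            "namespace": "my-app",
        },
        # Automated sync: prune removed resources and undo manual drift.
        "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
    },
}

manifest = json.dumps(application, indent=2)
```

With `selfHeal` enabled, any manual change in the cluster is reverted to the Git-declared state, which is what removes the need for developers to touch production directly.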
Additional References:
- GitOps 101: What is it all about?
- Argo CD – Declarative GitOps CD for Kubernetes
- Continuous Deployment and GitOps delivery with Amazon EKS Blueprints and ArgoCD
- Getting started with GitOps, Argo, and Azure Kubernetes Service
- Building a Fleet of GKE clusters with ArgoCD
Storage
Any Kubernetes cluster requires persistent storage – whether organizations choose to begin with an on-premise Kubernetes cluster and migrate to the public cloud, or provision a Kubernetes cluster using a managed service in the cloud.
Kubernetes supports multiple types of persistent storage – from object storage (such as Azure Blob storage or Google Cloud Storage), through block storage (such as Amazon EBS, Azure Disk, or Google Persistent Disk), to file-sharing storage (such as Amazon EFS, Azure Files, or Google Cloud Filestore).
The fact that each cloud provider has its implementation of persistent storage adds to the complexity of storage management, not to mention a scenario where an organization is provisioning Kubernetes clusters over several cloud providers.
Succeeding in managing Kubernetes clusters over the long term, and knowing which storage type to use for each scenario, requires storage expertise.
High Availability
High availability is a common requirement for any production workload.
The fact that we need to maintain multiple Kubernetes clusters (for example, one cluster per environment, such as Dev, Test, and Prod), sometimes on top of multiple cloud providers, makes things challenging.
We need to design in advance where to provision our cluster(s), thinking about constraints such as multiple availability zones, and sometimes thinking about how to provision multiple Kubernetes clusters in different regions, while keeping HA requirements, configurations, secrets management, and more.
Designing and maintaining HA in Kubernetes clusters requires a deep understanding of Kubernetes internals, combined with knowledge about specific cloud providers’ Kubernetes management plane.
Additional References:
- Designing Production Workloads in the Cloud
- Amazon EKS Best Practices Guide for Reliability
- AKS – High availability Kubernetes cluster pattern
- GKE best practices: Designing and building highly available clusters
Cost optimization
Cost is an important factor in managing environments in the cloud.
It can be very challenging to design and maintain multiple Kubernetes clusters while trying to optimize costs.
To monitor cost, we need to deploy cost management tools (either the basic services provided by the cloud provider) or third-party dedicated cost management tools.
For each Kubernetes cluster, we need to decide on node instance size (amount of CPU/Memory), and over time, we need to review the node utilization and try to right-size the instance type.
For non-production clusters (such as Dev or Test), we need to understand from the cloud vendor documentation, what are our options to scale the cluster size to the minimum, when not in use, and be able to spin it back up, when required.
Each cloud provider has its own pricing options for provisioning Kubernetes clusters – for example, we might choose reserved instances or savings plans for production clusters that run 24/7, while for temporary Dev or Test environments, we might choose Spot instances and save cost.
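The pricing trade-off above can be quantified with simple arithmetic. Below is a sketch comparing monthly node cost under hypothetical hourly rates (the numbers are illustrative only, not any provider's actual pricing):

```python
HOURS_PER_MONTH = 730  # average hours in a month

def monthly_cost(hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Monthly cost of one node at the given hourly rate, rounded to cents."""
    return round(hourly_rate * hours, 2)

# Hypothetical hourly rates for the same instance size.
on_demand = monthly_cost(0.10)             # production, 24/7 on-demand
reserved = monthly_cost(0.06)              # production, 24/7 reserved/savings plan
spot_dev = monthly_cost(0.03, hours=200)   # dev cluster: spot, scaled down off-hours

savings = round(on_demand - reserved, 2)   # per node, per month
```

Even with these made-up rates, the pattern is clear: commit to discounted pricing for always-on clusters, and combine Spot capacity with aggressive scale-down for non-production ones.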
Additional References:
- Cost optimization for Kubernetes on AWS
- Azure Kubernetes Service (AKS) – Cost Optimization Techniques
- Best practices for running cost-optimized Kubernetes applications on GKE
- 5 steps to bringing Kubernetes costs in line
- 4 Strategies for Kubernetes Cost Reduction
Knowledge gap
Running Kubernetes clusters requires a lot of knowledge.
From the design, provision, and maintenance, usually done by DevOps or experienced cloud engineers, to the deployment of new applications, usually done by development teams.
It is crucial to invest in employee training, in all aspects of Kubernetes.
Constant updates using vendor documentation, online courses, blog posts, meetups, and technical conferences will enable teams to gain the knowledge required to keep up with Kubernetes updates and changes.
Additional References:
- Kubernetes Blog
- AWS Containers Blog
- Azure Kubernetes Service (AKS) issue and feature tracking
- Google Cloud Blog – Containers & Kubernetes
Third-party integration
Kubernetes solves part of the problems related to container orchestration.
As an open-source solution, it can integrate with complementary open-source solutions (for monitoring, security and governance, cost management, and more).
Every organization might wish to use a different set of tools to achieve each task relating to the ongoing maintenance of the Kubernetes cluster or for application deployment.
Selecting the right tools can be challenging as well, due to various business or technological requirements.
It is recommended to evaluate and select Kubernetes-native tools to accomplish these tasks and address the challenges mentioned above.
Summary
In this blog post, I have reviewed the most common Day 2 Kubernetes challenges.
I cannot stress enough the importance of employee training in deploying and maintaining Kubernetes clusters.
It is highly recommended to evaluate and look for a centralized management platform for deploying, monitoring (using cloud-native tools), and securing the entire fleet of Kubernetes clusters in the organization.
Another important recommendation is to invest in automation – from policy enforcement to application deployment and upgrade, as part of the CI/CD pipeline.
I recommend you continue learning and expanding your knowledge in the ever-changing world of Kubernetes.