Qualities of a Good Cloud Architect

In 2020 I published a blog post called “What makes a good cloud architect?”, where I tried to lay out some of the main qualities required to become a good cloud architect.

Four years later, I still believe most of those qualities are crucial, but here I would like to focus on what I believe is most critical today to succeed as a cloud architect.

Be able to see the bigger picture

A good cloud architect must be able to see the bigger picture.

Before digging into details, an architect must understand what the business is trying to achieve (a chatbot, an e-commerce mobile app, a reporting system for business analytics, etc.)

Next, it is important to understand technological constraints (such as service cost, resiliency, data residency, service/quota limits, etc.)

A good architect will be able to translate business requirements, together with technology constraints, into architecture.

Being multi-lingual

An architect should be able to speak multiple languages – talk to business decision-makers to understand their goals, and translate them for technical teams (such as developers, DevOps, IT, engineers, etc.)

The business never comes with a requirement such as “We would like to expose an API to end-customers”. They will more likely say “We would like to provide customers valuable information about their investments” or “Provide patients with insights about their health”.

Being multi-disciplinary

There is always a debate between being a specialist in a certain area (whether an expert in a specific cloud provider’s ecosystem or in a specific technology such as containers or Serverless) and being a generalist (having broad knowledge of cloud technology across multiple cloud providers).

I am always in favor of being a generalist: having hands-on experience with multiple services from multiple cloud providers, and knowing the pros and cons of each service, makes it easier to decide later which technologies and services to implement as part of an architecture.

Being able to understand modern technologies

The days of architectures based on VMs are almost gone.

A good cloud architect will be able to understand what an application is trying to achieve, and be able to embed modern technologies:

  • Microservice architecture, to split a complex workload into small pieces, developed and owned by different teams
  • Containerization solutions, from managed Kubernetes services to simpler alternatives such as Amazon ECS, Azure Container Apps, or Google Cloud Run
  • Function-as-a-Service, able to handle specific tasks such as image processing, user registration, error handling, and much more.

Note: Although FaaS is vendor-specific, and there is no straightforward way to migrate functions between cloud providers, once a specific CSP has been chosen, a good architect should be able to identify where FaaS adds value within an application architecture.

  • Event-driven architecture, which brings many benefits to modern applications – decoupling complex architectures, letting components operate independently, allowing specific components to scale (according to customer demand) without impacting other components of the application, and more.

Microservices, Containers, or FaaS do not have to be the answer for every architecture, but a good cloud architect will be able to find the right tools to achieve the business goals, sometimes by combining different technologies.
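
To make the FaaS idea more concrete, here is a minimal sketch (my own illustration, not tied to a specific project) of an AWS Lambda function triggered by an S3 upload event; the bucket, key, and function names are hypothetical, and error handling is intentionally left out.

```python
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    """Triggered by an S3 "ObjectCreated" event; inspects each newly uploaded object."""
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Example task: look at the uploaded object (e.g., an image waiting to be processed)
        head = s3.head_object(Bucket=bucket, Key=key)
        print(json.dumps({"bucket": bucket, "key": key, "size_bytes": head["ContentLength"]}))

    return {"status": "processed", "records": len(records)}
```

The same pattern applies to Azure Functions or Google Cloud Functions with their respective storage triggers.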

We must remember that technology and architecture change and evolve. A good cloud architect should reassess past architecture decisions, to see if, over time, a different architecture can provide better results (in terms of cost, security, resiliency, etc.)

Understanding cloud vs. on-prem

As much as I admire organizations that can design, build, and deploy production-scale applications in the public cloud, I admit the public cloud is not a solution for 100% of the use cases.

A good cloud architect will be able to understand the business goals and the technological constraints (such as cost, resiliency requirements, regulations, team knowledge, etc.), and determine which workloads can be developed as cloud-native applications and which should remain, or even be built from scratch, on-prem.

I do believe that to gain the full benefits of modern technologies (elasticity, near-infinite scale, use of GenAI technology, etc.), an organization should choose the public cloud, but for simple or stable workloads, an organization can find suitable solutions on-prem as well.

Thoughts of Experienced Architects

Make unbiased decisions

“A good architecture allows major decisions to be deferred (to a time when you have more information). A good architecture maximizes the number of decisions that are not made. A good architecture makes the choice of tools (database, frameworks, etc.) irrelevant.”

Source: Allen Holub

Beware the Assumptions

“Unconscious decisions often come in the form of assumptions. Assumptions are risky because they lead to non-requirements, those requirements that exist but were not documented anywhere. Tacit assumptions and unconscious decisions both lead to missed expectations or surprises down the road.”

Source: Gregor Hohpe

Cloud building blocks – putting things together

“A cloud architect is a system architect responsible for putting together all the building blocks of a system to make an operating application. This includes understanding networking, network protocols, server management, security, scaling, deployment pipelines, and secrets management. They must understand what it takes to keep systems operational.”

Source: Lee Atchison

Being a generalist

“Good generalists need to cast a wider net to define the best-optimized technologies and configurations for the desired business solution. This means understanding the capabilities of all cloud services and the trade-offs of deploying a heterogeneous cloud solution.”

Source: David Linthicum

The importance of cost considerations

“By considering cost implications early and continuously, systems can be designed to balance features, time-to-market, and efficiency. Development can focus on maintaining lean and efficient code. And operations can fine-tune resource usage and spending to maximize profitability.”

Source: Dr. Werner Vogels

Summary

There are many more qualities of a good and successful cloud architect (from understanding cost decisions, cybersecurity threats and mitigations, designing for scalability, high availability, resiliency, and more), but in this blog post, I have tried to mention the qualities that in 2024 I believe are the most important ones.

Whether you just entered the role of a cloud architect, or if you are an experienced cloud architect, I recommend you keep learning, gain hands-on experience with cloud services and the latest technologies, and share your knowledge with your colleagues, for the benefit of the entire industry.

About the author

Eyal Estrin is a cloud and information security architect, and the author of the books Cloud Security Handbook and Security for Cloud Native Applications, with more than 20 years in the IT industry.

You can connect with him on social media (https://linktr.ee/eyalestrin).

Opinions are his own and not the views of his employer.

Unpopular opinion about “Moving back to on-prem”

Over the past year, I have seen a lot of posts on social media about organizations moving back from the public cloud to on-prem.

In this blog post, I will explain why I believe it is nothing more than a myth, and why the public cloud is the future.

Introduction

Anyone who follows my posts on social media knows that I am a huge advocate of the public cloud.

Back in 2023, I published a blog post called “How to Avoid Cloud Repatriation”, where I explained why I believe organizations rushed to the public cloud without a clear strategy to guide them on which workloads are suitable for the public cloud, and without investing in cost management, employee training, etc.

I am aware of the Barclays report from mid-2024 claiming that, based on conversations with CIOs, 83% of the surveyed organizations plan to move workloads back to private cloud, and of another report from Synergy Research Group (published in August 2024) stating that “hyper-scale operators are 41% of the worldwide capacity of all data centers”, and that “Looking ahead to 2029, hyper-scale operators will account for over 60% of all capacity, while on-premises will drop to just 20%”.

Analysts claim there is a trend of organizations moving back to on-prem, but the newspapers are far from being filled with customer stories (specifically from enterprises) of moving production workloads from the public cloud back to on-prem.

You may be able to find some stories about small companies (with stable workloads and highly skilled personnel) that decided to move back to on-prem, but it is far from being a trend.

I do not disagree that large workloads in the public cloud will cost an organization a lot of money, but it raises a question:

Has the organization embedded cost in every architecture decision from day 1, or has it ignored cost for a long time, only to realize now that cloud resources cost a lot of money if not managed properly?

Why do I believe the future is in the public cloud?

I am not looking at the public cloud as a solution for all IT questions/issues.

As with any (kind of) new field, an organization must invest in learning the topic from the bottom up, consult with experts, create a cloud strategy, and invest in cost management, security, sustainability, and employee training, to be able to get the full benefits of the public cloud.

Let us dig deeper into some of the main areas where we see benefits of the public cloud:

Scalability

One of the huge benefits of the public cloud is the ability to scale horizontally (i.e., add or remove compute, storage, or network resources according to customer demand).

Were you able to horizontally scale using the traditional virtualization on-prem? Yes.

Did you have virtually unlimited capacity to scale? No. Organizations are always limited by the amount of hardware they purchase and deploy in their on-prem data centers.
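
For illustration, here is a minimal sketch of that elasticity, assuming an EC2 Auto Scaling group named web-tier (a hypothetical name) and the boto3 SDK; Azure VM Scale Sets and Google Managed Instance Groups offer the same idea.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Let the (hypothetical) "web-tier" Auto Scaling group grow and shrink with demand,
# instead of being capped by the hardware installed in an on-prem data center.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="web-tier",
    MinSize=2,
    MaxSize=50,
)

# Target-tracking policy: keep average CPU around 60%, adding or removing instances automatically.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-tier",
    PolicyName="cpu-target-60",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
        "TargetValue": 60.0,
    },
)
```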

Data center management

Regardless of what people may believe, most organizations do not have the experience to build and maintain data centers that are physically secure, energy-efficient, and highly available at the level of a hyperscale cloud provider.

Data centers do not produce any business value (unless you are in the data center or hosting industry), and in most cases, moving the responsibility to a cloud provider will be more beneficial for most organizations.

Hardware maintenance

Let us assume your organization decided to purchase expensive hardware for their SAP HANA cluster, or an NVIDIA cluster with the latest GPUs for AI/ML workloads.

In this scenario, your organization will need to pay in advance for several years, train your IT staff on deploying and maintaining the purchased hardware (do not forget the cooling of GPUs…), and the moment you complete deploying the new hardware, your organization is in charge of the ongoing maintenance until the hardware becomes outdated (probably a few weeks or months after you purchased it) – and now you are stuck with old hardware that will not be able to suit your business needs (such as the latest GenAI LLMs).

In the public cloud, you pay for the resources that you need, scale as needed, and pay only for the resources being used (and you can use Spot or savings plans to lower the total cost even further).

Using or experimenting with new technology

In the traditional data center, we are stuck with a static data center mentality, i.e., use what you currently have.

One of the greatest capabilities the public cloud offers us is switching to a dynamic mindset. Business managers would like their organizations to provide new services to their customers, in a short time-to-market.

A new mindset encourages experimentation, allowing development teams to build new products, experiment with them, and if the experiment fails, switch to something else.

One example of experimentation is the spiky usage of GenAI technology. Suddenly everyone is using (or planning to use) LLMs to build solutions (from chatbots, through text summarization, to image or video generation).

Only the public cloud will allow organizations to experiment with the latest hardware and the latest LLMs for building GenAI applications.

If you try to experiment with GenAI on-prem, you will have to purchase dedicated hardware (which will soon become outdated and will not be sufficient for your business needs for long), and you will suffer from resource limitations (at least when using the latest LLMs).

Storage capacity

In the traditional data center, organizations (almost) always suffer from limited storage capacity.

The more data organizations collect (for business analytics, customer added value, research, AI/ML, etc.), the more data will be produced and need to be stored.

On-prem, you are eventually limited by the amount of storage you can purchase and physically deploy in your data center.

Once organizations (usually large enterprises) store PBs of data in the public cloud, the cost and time required to move such amounts of data out of the public cloud to on-prem (or even to another cloud provider) will be so high that most organizations will keep their data where it is, and moving away from their existing cloud provider becomes a hard decision.

Modern / Cloud-native applications

Building modern applications changes the way organizations develop and deploy new applications.

Most businesses would like to move faster and provide new solutions to their customers.

Although you could develop new applications based on Kubernetes on-prem, the cost and complexity of managing the control plane, together with the limited scale capabilities, will make your solution a wannabe cloud – a small and pale version of the public cloud.

You could find Terraform/OpenTofu providers for some of the resources that exist on-prem (mostly for the legacy virtualization), but how do you implement infrastructure-as-code (not to mention policy-as-code) in legacy systems? How will you benefit from automated system deployment capabilities?

Conversation about data residency/data sovereignty

This is a hot topic, at least since the GDPR in the EU became effective in 2018.

Today most public cloud providers have regions in most (if not all) countries with data regulation laws.

Not to mention that 85-90 percent of all IaaS/PaaS services are regional, meaning the CSP will not transfer your data from the EU to the US unless you specifically design your workloads that way (due to egress data costs and built-in service limitations).

If you want to add an extra layer of assurance, choose cloud services that allow you to encrypt your data using customer-managed keys (i.e., keys whose generation and rotation the customer controls).
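
As a minimal sketch of this idea, assuming Amazon S3 with a customer-managed KMS key and the boto3 SDK (the bucket name and key alias below are hypothetical), an upload encrypted at rest with a key the customer controls might look like this:

```python
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")

# Upload an object encrypted with a customer-managed KMS key (hypothetical alias),
# so the data at rest in the EU region is protected by a key the customer controls.
s3.put_object(
    Bucket="example-eu-customer-data",         # hypothetical bucket name
    Key="patients/record-123.json",
    Body=b'{"example": "payload"}',
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/customer-managed-key",  # hypothetical key alias
)
```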

Summary

I am sure we can continue and deep dive into the benefits of the public cloud vs. the limitations of the on-prem data center (or what people sometimes refer to as “private cloud”).

For the foreseeable future (and I am not saying this is a good thing), we will continue to see hybrid clouds, while more and more organizations see the benefits of the public cloud and migrate their production workloads and data to it.

We will continue to find scenarios where on-prem and legacy applications provide value for organizations, but as technology evolves (see GenAI, for example), we will see more and more organizations consuming public cloud services.

To gain the full benefit of the public cloud, organizations need to understand how the public cloud can support their business, allowing them to focus on what matters (such as developing new services for their customers), and reduce the effort spent on data center maintenance.

Organizations should not neglect cost management, security, sustainability, and employee training if they want to gain the full benefit of the public cloud.

I strongly believe that the public cloud is the future for developing new and innovative solutions, while shifting the hardware and data center responsibility to companies that specialize in this field.

Why do I call it an “unpopular opinion”? Because people are reluctant to change; they would rather stick with what they know and are familiar with. Change can be challenging, but if organizations embrace the change, look strategically into the future, embed cost in their decisions, and invest in employee training, they will be able to adapt and see its benefits.

About the author

Eyal Estrin is a cloud and information security architect, and the author of the books Cloud Security Handbook and Security for Cloud Native Applications, with more than 20 years in the IT industry.

You can connect with him on social media (https://linktr.ee/eyalestrin).

Opinions are his own and not the views of his employer.

Cybersecurity burnout is a real risk

Business leaders around the world understand the importance of cybersecurity for supporting the business, complying with laws and regulations, and earning customers’ trust.

Good CISOs know how to lead cybersecurity efforts – securing the cybersecurity budget, taking part in incident investigations, recruiting talent to support the security efforts, and making sure their organizations remain safe (as much as possible).

There is one topic not getting enough attention – employee burnout.

No doubt working in cybersecurity is stressful – and it impacts all levels, from CISOs/CSOs in top management down to every practitioner in the industry.

To keep up in a cybersecurity role, you need to have passion for what you do – finding the time to keep up with evolving technology and the new attacks published every day, while still doing your everyday job of protecting the organization you work for.

Let us talk about some statistics:

  • 67% of respondents say “My organization has a significant shortage of cybersecurity staff to prevent and troubleshoot cybersecurity issues” (Source: ISC2 2023 cybersecurity workforce study)
  • 90% of organizations have skills gaps within their security teams (Source: ISC2 2024 cybersecurity workforce study)
  • 90% of CISOs globally say they are concerned about the impact of stress, fatigue, and burnout on their workforce’s well-being (Source: Hack the Box)
  • 89% of cybersecurity professionals globally say the workload, volume of projects to deliver, and the time needed to deliver tasks are the key causes of burnout (Source: Hack the Box)
  • 74% of cybersecurity professionals globally say that they have taken time off due to work-related mental well-being problems (Source: Hack the Box)
  • 32% of CISOs or IT Cybersecurity Leaders in the UK and US are considering leaving their current organization (Source: BlackFog)
  • 30% cited the lack of work-life balance (Source: BlackFog)
  • 27% stated that too much time was spent on firefighting rather than focusing on strategic issues (Source: BlackFog)

We can see that cybersecurity employees (at all levels) suffer from huge stress as part of their daily work, struggling to keep up with their ongoing tasks and to balance work with personal and family time.

Good CISOs/CSOs will know how to do their job, pushing the boundaries and protecting their organizations, but the big question is – do CISOs/CSOs have the emotional intelligence to focus on their most important asset – employees?

Can cybersecurity leaders find the time to speak with their employees, to sense when the tension is too much for an employee to handle, and do something about it?

The work of cybersecurity teams is crucial for organizations (keep the organization safe and secure, comply with regulations, and earn customers’ trust), but if organizations ignore the human factor, they will lose valuable employees, and we already have a talent shortage in the cybersecurity industry.

CISOs/CSOs – do not wait until your talented employees reach burnout and resign. Have personal conversations with them, try to lower their load (among other things, by securing budget for more positions on the cybersecurity teams), and never neglect your employees.

About the author

Eyal Estrin is a cloud and information security architect, and the author of the books Cloud Security Handbook and Security for Cloud Native Applications, with more than 20 years in the IT industry.

You can connect with him on social media (https://linktr.ee/eyalestrin).

Opinions are his own and not the views of his employer.

Comparison of Serverless Development and Hosting Platforms

When designing solutions in the cloud, there is (almost) always more than one alternative for achieving the same goal.

One of the characteristics of cloud-native applications is the ability to have an automated development process (such as the use of CI/CD pipelines).

In this blog post, I will compare serverless solutions for developing and hosting web and mobile applications in the cloud.

Why choose a serverless solution?

From a developer’s point of view, there is (almost) no value in maintaining infrastructure – the whole purpose is to enable developers to write new applications/features and provide value to the company’s customers.

Serverless platforms allow us to focus on developing new applications for our customers, without the burden of maintaining the lower layers of the infrastructure, i.e., virtual machine scale, patch management, host machine configuration, and more.

Serverless development and hosting platforms give us a CI/CD workflow – from the Git repository, through the build stage, to deployment to the various application stages (Dev, Test, Prod) – in a single solution (the Git repository itself is still outside the scope of such services).

Serverless development and hosting platforms allow us to deploy fully functional applications at any scale – from a small test environment to a large-scale production application – which we can put behind a content delivery network (CDN) and a WAF, and make accessible to external or internal customers.

Serverless development platform workflow

Below is a sample workflow for developing and deploying an application based on a Serverless platform:

  1. A developer writes code and pushes the code to a Git repository
  2. A new application is configured using AWS Amplify, based on the code from the Git repository
  3. AWS Amplify pulls secrets from AWS Secrets Manager to connect to AWS resources
  4. The new application is configured to connect to Amazon S3 for uploading static content
  5. The new application is configured to connect to Amazon DynamoDB for storing and retrieving data
  6. The new application is deployed using AWS Amplify

Note: The example above is based on AWS services but can be implemented similarly on the other cloud platforms mentioned in this blog post.
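
To illustrate steps 3-5 above, here is a minimal sketch (assuming the boto3 SDK; the secret, bucket, and table names are hypothetical) of backend code that reads a secret and writes to Amazon S3 and Amazon DynamoDB:

```python
import json

import boto3

secrets = boto3.client("secretsmanager")
s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")


def save_user_profile(user_id: str, profile: dict, avatar_png: bytes) -> None:
    # Step 3: pull an application secret (e.g., a third-party API key) at runtime
    secret = json.loads(
        secrets.get_secret_value(SecretId="app/prod/api-key")["SecretString"]
    )
    api_key = secret["key"]  # hypothetical secret structure; would be used to call external services

    # Step 4: upload static content to Amazon S3
    s3.put_object(
        Bucket="example-app-static-content",   # hypothetical bucket name
        Key=f"avatars/{user_id}.png",
        Body=avatar_png,
        ContentType="image/png",
    )

    # Step 5: store structured data in Amazon DynamoDB
    dynamodb.Table("example-user-profiles").put_item(
        Item={"user_id": user_id, **profile}
    )
```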

Service Comparison

The table below provides a high-level comparison of commonly used Serverless development and hosting platforms, from the major cloud providers:

Service comparison (development languages, framework, and platform support)

The table below provides a comparison of development languages and frameworks supported by Serverless development and hosting platforms, from the major cloud providers:

Service comparison (security features)

The table below provides a comparison of security features supported by Serverless development and hosting platforms, from the major cloud providers:

Summary

Serverless development and hosting platforms offer us an alternative for automating the development lifecycle of cloud-native applications, with built-in integration with the cloud providers’ ecosystems.

For simple web or mobile applications, I recommend considering one of the services discussed in this blog post, compared to the alternative of learning and maintaining an entire suite of services for running a CI/CD pipeline, and of deciding where to deploy and host applications in production (VMs, container platforms, and other hosting solutions).

About the author

Eyal Estrin is a cloud and information security architect, and the author of the books Cloud Security Handbook and Security for Cloud Native Applications, with more than 20 years in the IT industry.

You can connect with him on social media (https://linktr.ee/eyalestrin).

Opinions are his own and not the views of his employer.

Time to move on to Day 2 cloud operations

Anyone who has been following my content on social media knows that I am a huge advocate of cloud adoption, and that I have been focusing on various cloud-related topics for almost a decade.

While taking their first steps in the public cloud, or rushing into it, organizations make a lot of mistakes, to name a few:

  • Failing to understand why they are using the cloud in the first place, and what value the public cloud can bring to their business
  • Bringing legacy data center mindset and practices and trying to implement them in the public cloud, which results in inefficiencies
  • Not embedding cost as part of architecture decisions, which results in high cloud usage costs

In this post, we will focus on the next steps in embracing the public cloud, or what is sometimes referred to as Day 2 cloud operations.

What do all those dates mean?

Compared to software engineering, Day 0 is known as the design phase. You collect requirements for moving an application to the cloud, or for developing a new application in the cloud.

Day 1 in cloud operations is where most organizations are stuck. They begin migrating several applications to the cloud, deploying some workloads directly into cloud environments, and perhaps even running the first production applications for several months, or even a year or two. This is the phase where development and DevOps teams are still debating the most appropriate infrastructure (VMs, containers, perhaps even Serverless, managed vs. self-managed services, etc.)

Day 2 in cloud operations is where things are getting interesting. Teams begin to realize the ongoing cost of services, the amount of effort required to deploy and maintain workloads manually, security aspects in cloud environments, and various troubleshooting and monitoring of production incidents.

What does Day 2 cloud operations mean?

When organizations reach Day 2 of their cloud usage, they begin to revisit previously made mistakes and fine-tune their cloud operations.

Automation is king

Unless your production contains one or two VMs with a single database, manual work is no longer an option.

Assuming your applications are not static, it is time to switch to a CI/CD process and automate (almost) the entire development lifecycle – from code review, static or dynamic application security testing, and quality tests, through build creation (for example, packaging source code into container images), up to the final deployment of a fully functional version of an application.

This is also the time to invest in learning and using automated infrastructure deployment using an Infrastructure as Code (IaC) language such as Terraform, OpenTofu, Pulumi, etc.

Using IaC will allow you to benefit from code practices such as versioning, rollback, audit (who made which change), and naturally the ability to reuse the same code for different environments (dev, test, prod) while getting the same results.
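
As a minimal sketch of IaC in practice, assuming Pulumi with its pulumi_aws provider (Terraform or OpenTofu would express the same idea in HCL); the resource name and tags below are illustrative:

```python
import pulumi
import pulumi_aws as aws

# One definition, reused per stack (dev/test/prod); changes are versioned in Git,
# reviewed like any other code, and can be rolled back.
bucket = aws.s3.Bucket(
    "app-artifacts",
    versioning=aws.s3.BucketVersioningArgs(enabled=True),
    tags={
        "environment": pulumi.get_stack(),
        "managed-by": "pulumi",
    },
)

pulumi.export("artifact_bucket", bucket.id)
```

Running `pulumi up` against different stacks (dev, test, prod) reuses the same definition with per-stack configuration, which is exactly the repeatability described above.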

Rearchitecting and reusing cloud-native capabilities

On Day 1, it may be OK to keep traditional architectures (such as manually maintained VMs), but on Day 2 it is time to take full advantage of cloud-native services.

The easiest way is to replace any manual maintenance of infrastructure with managed services – in most cases, switching to a managed database, storage, or even load-balancer and API gateway service will provide many benefits (such as lower maintenance and automatic resource allocation), while allowing IT and DevOps teams to focus on supporting and deploying new application versions instead of on operating system and server maintenance.

If you are already re-evaluating past architecture decisions, it is time to think about moving to a microservices architecture, decoupling complex workloads into smaller and more manageable components, owned by the development teams that develop them.

For predictable workloads (in terms of spikes in customer demand), consider using containers.

If your developers and DevOps teams are familiar with packaging applications inside containers, but lack experience with Kubernetes, consider using services such as Amazon ECS, Azure App Service, or Google Cloud Run.

If your developers and DevOps teams have experience using Kubernetes, consider using one of the managed flavors of Kubernetes such as Amazon EKS, Azure AKS, or Google GKE.

Do not stop at container technologies – if your workload is unpredictable (in terms of customer load), consider taking the architecture one step further and using Function-as-a-Service (FaaS), such as AWS Lambda, Azure Functions, or Google Cloud Functions, or event-driven architectures, using services such as Amazon EventBridge, Azure Event Grid, or Google Eventarc.

Resiliency is not wishful thinking

The public cloud, and the use of cloud-native services, allow us to raise the bar in terms of building highly resilient applications.

In the past, we needed to purchase solutions such as load-balancers, API gateways, DDoS protection services, and more, and we had to learn how to maintain and configure them.

Cloud providers offer us managed services, making it easy to design and implement resilient applications.

Customers’ demand has also raised the bar – customers are no longer willing to accept downtime or availability issues while accessing applications – they expect (almost) zero downtime, which forces us to design applications while keeping resiliency in mind from day 1.

We need to architect our applications as clusters, deployed in multiple availability zones (and in rare cases even in multiple regions), but also make sure we constantly test the resiliency of our workloads.

We should consider implementing chaos engineering as part of the application development and test phases, and conduct controlled experiments (at the bare minimum in the test stage, and ideally also in production) to understand the impact of failures on our applications.
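
A minimal, tool-agnostic sketch of the idea (managed services such as AWS Fault Injection Simulator can do this at the infrastructure level) – a wrapper that injects latency or errors into a fraction of calls in a test environment, so we can observe how the rest of the workload behaves:

```python
import functools
import random
import time


def inject_faults(latency_s: float = 2.0, failure_rate: float = 0.1, enabled: bool = True):
    """Chaos-style wrapper: randomly adds latency and raises an error for a fraction of calls."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled and random.random() < failure_rate:
                time.sleep(latency_s)  # simulate a slow, degraded dependency
                raise TimeoutError(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(failure_rate=0.2, enabled=True)  # enable only in controlled experiments
def call_payment_service(order_id: str) -> str:
    return f"charged order {order_id}"
```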

Observability to the aid

The traditional monitoring of infrastructure and applications is no longer sufficient in modern and dynamic applications.

The dynamic nature of modern applications – where new components (from containers to functions) are deployed, run for a short time (according to application demand and configuration), and are decommissioned when no longer needed – cannot be handled by traditional monitoring tools (commonly deployed as agents).

We need to embed monitoring in every aspect of our workloads, at all layers – from the network layer (such as flow logs), through the infrastructure layer (load-balancer, OS, container, and function logs), all the way to application and even customer-experience logs.

Storing logs is not enough – we need managed services that continuously review logs from various sources (ideally aggregated into a central log system), use machine learning capabilities to anticipate issues before they impact customer experience, and provide insights and near real-time recommendations for fixing the problems that arise.
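
At the application layer, this starts with emitting structured logs that a central log platform can aggregate, correlate, and analyze. A minimal sketch (the service and field names are illustrative):

```python
import json
import logging
import time

logger = logging.getLogger("checkout-service")
logging.basicConfig(level=logging.INFO, format="%(message)s")


def log_event(event: str, **fields) -> None:
    """Emit one JSON log line per event, ready for a central log/observability platform."""
    logger.info(json.dumps({"ts": time.time(), "event": event, **fields}))


# Example: correlate a customer request across components using a request_id field
log_event("order_submitted", request_id="r-123", customer_id="c-42", latency_ms=87)
```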

Cost and efficiency

In the public cloud, almost every service has its own pricing – sometimes it is the time a compute resource was running, the number of invocations of a function, a storage service storing files, database queries, or even egress data from the cloud environment back to on-prem or to the public Internet.

Understanding the pricing of each component in a complex architecture is crucial, but not enough.

We need to embed cost in every architecture decision, understand which pricing option is the most valuable (for example, choosing between on-demand, savings plans, or Spot), and monitor each workload’s cost regularly.
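
A minimal sketch of such a regular review, assuming the boto3 SDK and AWS Cost Explorer (Azure Cost Management and Google Cloud Billing expose similar reports):

```python
from datetime import date, timedelta

import boto3

ce = boto3.client("ce")  # AWS Cost Explorer

end = date.today().replace(day=1)                  # first day of the current month
start = (end - timedelta(days=1)).replace(day=1)   # first day of the previous month

# Break last month's unblended cost down by service, as a starting point for
# regular per-workload cost reviews (tags or cost categories can refine this further).
response = ce.get_cost_and_usage(
    TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    service = group["Keys"][0]
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{service}: ${amount:.2f}")
```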

Cost is very important, but not enough.

We need to embed efficiency in every architecture decision – are we using the most suitable compute service, the most suitable storage tier (from real-time to archive), the most suitable function resources (in terms of memory/CPU), etc.

We need to combine an architect’s view (being able to see the bigger picture), with an engineer or developer’s experience (being able to write efficient code), to meet the business requirements.

Security is job zero

I cannot stress enough how important security is in today’s world.

I have mentioned the dynamic nature of modern cloud-native applications before; this, together with the evolving threats identified every day, requires a new mindset when talking about security.

First, we need to embed automation – from testing new versions of code, regularly scanning for vulnerable open-source libraries, embedding SBOM (Software Bill of Materials) solutions (to know which components we are using), and automatically deploying security patches, to running automated vulnerability scanning tools to detect vulnerabilities as soon as possible.

We should consider implementing immutable infrastructure, switching from ever-changing VMs (containing libraries, configuration, code, and data) to read-only immutable images of VMs or containers, updated (to new versions) in an automated CI/CD process.

Data must be encrypted end-to-end, to protect its confidentiality and integrity.

Mature cloud providers allow us to manage both encryption keys (using customer-managed keys), and secrets (i.e., static credentials) using managed services, fully supported by (almost) all cloud-native services, which makes it extremely easy to protect data.

Lastly, we should embrace a zero-trust mindset. We should always assume breach, and with this mindset, verify any request coming from any customer, over any medium (mobile, public Internet, Wi-Fi, etc.). We need to authenticate every customer request and grant each customer only the privileges required to access our applications and act, following the principle of least privilege.

Training, training, and training

It may be acceptable on Day 1 for developers and operational teams to make mistakes while taking their first steps in the public cloud.

To allow organizations to move to day 2 cloud operations, we need to heavily invest in employee training.

Encouraging a culture of experimentation, opening cloud environments for employee training, using the many available online courses, and taking advantage of the fact that most cloud documentation is publicly available will allow both developers and operational teams to gain confidence in using cloud services.

As more and more organizations begin to use more than a single cloud provider (not necessarily a multi-cloud environment, but more than a single vendor), employees need hands-on experience working with several cloud providers, with different platforms, services, and capabilities. The best way to achieve this experience is to train and work with the different platforms.

Summary

It is time for organizations to move on from the Day 1 cloud operations phase (initial application deployment and configuration) to the Day 2 cloud operations phase (fine-tuning and ongoing maintenance).

It is a change in mindset, but it is crucial for maintaining production applications, in the modern and cloud-native era.

About the author

Eyal Estrin is a cloud and information security architect, and the author of the books Cloud Security Handbook and Security for Cloud Native Applications, with more than 20 years in the IT industry.

You can connect with him on social media (https://linktr.ee/eyalestrin).

Opinions are his own and not the views of his employer.

The Container Orchestration vs Function-as-a-Service (FaaS) Debate

When designing modern applications in the cloud, there is always the debate – should we base our application on a container engine, or should we go with a fully serverless solution?

In this blog post, I will review some of the pros and cons of each alternative, trying to understand which solution we should consider.

The containers alternative

Containers have been with us for about 10 years.

The Docker engine was released in 2013, and Kubernetes was released in 2015.

The concept of packaging an application inside a container image brought many benefits:

  • Portability – The ability to run the same code on any system that supports a container engine.
  • Scalability – The ability to add or remove container instances according to application load.
  • Isolation – The ability to limit the blast radius to a single container, instead of the whole running server (which in many cases used to run multiple applications).
  • Resource Efficiency – A container image is usually made of the bare minimum required binaries and libraries (compared to a full operating system).
  • Developer experience – The ability to integrate container development processes with developers’ IDE, and with CI/CD pipelines.
  • Consistency – Once you have completed creating the container image and fully tested it, it will be deployed and run in the same way every time.
  • Fast deployment time – It takes a short amount of time to deploy a new container (or to delete a running container when it is no longer needed).

Containers are not perfect – they have their disadvantages, to name a few:

  • Security – The container image is made of binaries, libraries, and code. Each of them may contain vulnerabilities and must be regularly scanned and updated, under the customer’s responsibility.
  • Storage challenges – Containers are by default stateless. They should not hold any persistent data, which forces them to connect to external (usually managed) storage services (such as object storage, managed NFS, managed file storage, etc.)
  • Orchestration – When designing a container-based solution, you need to consider the networking side – how to separate a publicly facing interface (for receiving inbound traffic from customers) from private subnets (for deploying containers or Pods, and for communication between them).

Containers are very popular in many organizations (from small startups to large enterprises), and today organizations have many alternatives for running containers – from Amazon ECS, Azure Container Apps, and Google Cloud Run, to managed Kubernetes services such as Amazon EKS, Azure AKS, and Google GKE.

The Serverless alternative

Serverless, at a high level, is any solution that does not require end-users to deploy or maintain the underlying infrastructure (mostly servers).

There are many services under this category, to name a few:

The Serverless alternative usually means the use of FaaS, together with other managed services, in the cloud provider ecosystem (such as running functions based on containers, mounting persistent storage, database, etc.)

FaaS has been with us for nearly 10 years.

AWS Lambda became available in 2015, Azure Functions became available in 2016, and Google Cloud Functions became available in 2018.

The use of FaaS has advantages, to name a few:

  • Infrastructure maintenance – The cloud provider is responsible for maintaining the underlying servers and infrastructure, including resiliency (i.e., deploying functions across multiple AZs).
  • Fast Auto-scaling – The cloud provider is responsible for adding or removing running functions according to the application’s load. Customers do not need to take care of scale.
  • Fast time to market – Customers can focus on what is important to their business, instead of the burden of taking care of the server provisioning task.
  • Cost – You pay for the amount of time a function runs and for the number of invocations (executions).

FaaS is not perfect – it has its disadvantages, to name a few:

  • Vendor lock-in – Each cloud provider has its implementation of FaaS, making it almost impossible to migrate between cloud providers.
  • Maximum execution time – Functions have hard limits in terms of maximum execution time – AWS Lambda is limited to 15 minutes, Azure Functions (in the Consumption plan) are limited to 10 minutes, and Google Cloud Functions (HTTP functions) are limited to 9 minutes.
  • Cold starts – The extra time it takes a function that has not been invoked recently to load and respond, which adds seconds to its execution.
  • Security – Each cloud provider implements isolation between different functions running for different customers. Customers have no visibility on how each deployed function is protected by the cloud provider, at the infrastructure level.
  • Observability – Troubleshooting a running function in real-time is challenging in a fully managed environment, managed by cloud providers, in a distributed architecture.
  • Cost – Workloads with a constant, predictable load, or bugs in the function’s code that end up in an endless loop, may generate high costs for running FaaS.

How do we know what to choose?

The answer to this question is not black or white, it depends on the use case.

Common use cases for choosing containers or Kubernetes:

  • Legacy application modernization – The ability to package legacy applications inside containers, and run them inside a managed infrastructure at scale.
  • Environment consistency – The ability to run containers consistently across different environments, from Dev, Test, to Prod.
  • Hybrid and Multi-cloud – The ability to deploy the same containers across hybrid or multi-cloud environments (with adjustments such as connectivity to different storage or database services).

Common use cases for choosing Functions as a Service:

  • Event-driven architectures – The ability to trigger functions by events, such as file upload, database change, etc.
  • API backends – The ability to use functions to handle individual API requests and scale automatically based on demand (see the sketch after this list).
  • Data processing – Functions are suitable for data processing tasks such as batch processing, stream processing, ETL operations, and more because you can spawn thousands of them in a short time.
  • Automation tasks – Functions are perfect for tasks such as log processing, scheduled maintenance tasks (such as initiating backups), etc.
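
As a minimal sketch of the API backend use case, assuming AWS Lambda behind an API Gateway proxy integration (the parameter and field names are illustrative):

```python
import json


def handler(event, context):
    """Handle a single API request (API Gateway proxy integration); each request scales independently."""
    params = event.get("queryStringParameters") or {}
    customer_id = params.get("customer_id", "unknown")

    # Business logic would go here (e.g., fetch the customer's data from a managed database)
    body = {"customer_id": customer_id, "status": "ok"}

    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }
```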

One of the benefits of using microservices architecture is the ability to choose different solutions for each microservice.

Customers can mix between containers, and FaaS in the same architecture.

Below is a sample microservice architecture:

  1. A customer logs into an application using an API gateway.
  2. API calls are sent from the API gateway to a Kubernetes cluster (deployed with 3 Pods).
  3. User access logs are sent from the Kubernetes cluster to Microservice A.
  4. Microservice A sends the logs to Amazon Data Firehose.
  5. The Amazon Data Firehose converts the logs to JSON format and stores them in an S3 bucket.
  6. The Kubernetes cluster sends an API call to Microservice B.
  7. Microservice B sends a query for information from DynamoDB.
  8. A Lambda function pulls information from DynamoDB tables.
  9. The Lambda function sends information from DynamoDB tables to OpenSearch, for full-text search, which will later be used to respond to customers’ queries (see the sketch after the note below).

Note: Although the architecture above mentions AWS services, the same architecture can be implemented on top of Azure, or GCP.
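
To illustrate steps 7 and 8 above, here is a minimal sketch of the Lambda function, assuming the boto3 SDK and a hypothetical DynamoDB table name; forwarding the results to OpenSearch (step 9) is indicated only as a comment:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("example-customer-orders")  # hypothetical table name


def handler(event, context):
    """Steps 7-8: pull a customer's items from DynamoDB on behalf of Microservice B."""
    customer_id = event["customer_id"]

    response = table.query(
        KeyConditionExpression=Key("customer_id").eq(customer_id)
    )
    items = response["Items"]

    # Step 9 (not shown): forward the items to OpenSearch for full-text indexing.
    return {"customer_id": customer_id, "item_count": len(items), "items": items}
```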

Summary

In this blog post, I have reviewed the pros and cons of using containers and Serverless.

Some use cases are more suitable for choosing containers (such as modernization of legacy applications), while others are more suitable for choosing serverless (such as event-driven architecture).

Before designing an application using containers or serverless, understand what you are trying to achieve, which services will allow you to accomplish your goal, and what the services’ capabilities, limitations, and pricing are.

The public cloud allows you to achieve similar goals using different methods, based on different services – never stop questioning your architecture decisions over time, and if needed, adjust to gain better results (in terms of performance, cost, etc.)

About the authors

Efi Merdler-Kravitz is an AWS Serverless Hero and the author of ‘Learning Serverless in Hebrew’. With over 15 years of experience, he brings extensive expertise in cloud technologies, encompassing both hands-on development and leadership of R&D teams.

You can connect with him on social media (https://linktr.ee/efimk).

Eyal Estrin is a cloud and information security architect, and the author of the books Cloud Security Handbook and Security for Cloud Native Applications, with more than 20 years in the IT industry.

You can connect with him on social media (https://linktr.ee/eyalestrin). Opinions are his own and not the views of his employer.

The Rise of AI in Cyber Threats: Key Challenges and How to Respond

While artificial intelligence (AI) can greatly increase productivity in the workplace, it can also be exploited to launch complex and sophisticated cyber-attacks. A recent report from the UK’s National Cyber Security Center (NCSC) claims that AI will “almost certainly increase the volume and heighten the impact of cyber-attacks over the next two years”.

Generative AI models, which can create new content such as text, images, and videos, have sparked controversy as they can be easily exploited to carry out malicious activities. For example, threat actors can use Generative AI to generate convincing phishing emails to lure people into handing over credentials, or other types of sensitive information. Likewise, AI can be used to create deepfake videos to manipulate public opinion on a variety of matters, including elections.

In this article we will explore some of the ways that AI has made it possible for even inexperienced hackers to join the ranks, allowing them to orchestrate sophisticated attacks with relative ease.

Polymorphic Viruses

Artificial Intelligence (AI) has significantly accelerated the development of polymorphic viruses, making it easier for hackers to create and deploy these malicious programs. AI-powered tools can rapidly generate countless code variants and code strings, allowing polymorphic viruses to evade detection by antivirus software and adapt to new environments. By leveraging machine learning algorithms and mutation engines, virus strains can be effortlessly created which continuously mutate and evade detection. As a result, polymorphic viruses have become a significant threat to cybersecurity, capable of infecting files on any operating system. While security technologies and methods, such as behavior-based analytics and application whitelisting, can help detect these viruses, will they be enough to adequately safeguard against such threats in the future?

The Use of Deepfakes for Social Engineering

Deepfakes are artificially created digital content that can deceive people into believing they’re seeing or hearing something that never actually occurred. According to the World Economic Forum, an alarming 66% of cybersecurity professionals encountered deepfake attacks within their own organizations in 2022, highlighting the prevalence of this type of threat. 

These highly realistic forgeries can be easily produced using generative AI tools (mentioned above), and they have already been used to create fake videos of public figures, as well as unauthorized pornographic content. Unfortunately, deepfakes have also been employed to spread propaganda and influence political and social outcomes, and they can even be used to add credibility to social engineering attacks, such as impersonating senior executives on video and phone calls.

In recent years, deepfakes have been used to trick people into sending large sums of money to cybercriminals, with criminals using deepfakes to impersonate colleagues and initiate fraudulent payments. To prevent similar attacks, organizations should prepare by implementing robust governance mechanisms, such as requiring multiple sign-offs for payments. 

AI Voice Cloning

Alongside the growing menace of visual deepfakes, AI voice cloning has emerged as a major concern. The widespread use of voice biometrics in various devices and systems, touted as a robust security measure, has now been rendered vulnerable to hacking. This is because AI has advanced to the point where it can accurately replicate audio fingerprints and mimic voice clips from mere sample vocals. The implication is that voice-protected systems are no longer secure, leaving them susceptible to manipulation by hackers. This can lead to a range of nefarious consequences, as hackers manipulate audio files to convincingly perpetuate false narratives.  

AI Keylogging

AI Keylogging tools can actively record every keystroke, collecting sensitive information such as passwords, with astonishing accuracy, boasting a success rate of nearly 95%. This means that even the most cautious and security-conscious individuals can be vulnerable to having their sensitive information compromised by this type of malware. To defend against AI-powered keyloggers, it is essential to implement a multi-layered approach. One effective strategy is to monitor user behavior to identify and respond to unusual typing patterns. Additionally, a robust endpoint security solution can detect and prevent malware-driven keyloggers from infiltrating systems. Multi-factor authentication (MFA) adds an extra layer of protection, requiring an additional authentication factor even if keystrokes are intercepted. To ensure the integrity of keystrokes, encryption can be used to safeguard captured data, making it indecipherable without the encryption key. Finally, regular updates and patches to software, operating systems, and security applications are crucial to maintaining a secure environment and addressing known vulnerabilities exploited by attackers.    

Better Spelling and Grammar To Evade Spam Filters

Cybercriminals have traditionally used poor spelling and grammar to mask their phishing emails, but with the advent of AI-powered writing tools, they can now create convincing social engineering campaigns in any language in a matter of seconds. This new approach has made it increasingly difficult for spam and malicious content filters to detect and block these emails. According to a recent report by cybersecurity firm SlashNext, the use of AI-generated content has led to a 1,265% surge in phishing emails since 2022. As a result, AI-generated content has become a widespread and effective tactic used by cybercriminals on a large scale, making it a crucial concern for individuals and organizations seeking to protect themselves from cyber threats.

AI Brute Force Attacks & CAPTCHA Cracking

AI-powered brute force attacks have emerged as a significant threat to online security. These attacks use machine learning to analyze user behavior and patterns to crack passwords faster. Additionally, AI has also been able to outsmart CAPTCHA systems, which were previously designed to distinguish between human and bot interactions. By leveraging patterns learned from human behavior, AI can now accurately solve CAPTCHA forms, rendering these security measures less effective in preventing bots from accessing secured locations.

Specialized Language Models Are on The Rise

While not a threat in itself, the rise of large language models (LLMs) has transformed the field of organizational cybersecurity, arming security teams with the power to sift through large amounts of data and generate actionable insights with simple queries. While these models have shown remarkable capabilities in understanding and generating human-like text, they are still limited in their ability to comprehend the intricacies of specialized cybersecurity datasets. However, in the coming years security teams can expect to transition to smaller language models that offer tailored and actionable insights, real-time data training, and the ability to adapt quickly to the ever-evolving threat landscape. These small language models will provide more focused and effective solutions for cybersecurity teams, enabling them to stay ahead of the curve in the fight against cyber threats.

Conclusion

As AI becomes increasingly pervasive in our daily lives, the way cybersecurity defenders respond to its emergence will be crucial. The rise of generative AI has sparked a heated debate about its ethical implications and potential uses, but what’s clear is that organizations must act quickly to harness its power before threat actors exploit it. It’s likely that threat actors will use AI to launch sophisticated phishing campaigns, create swarms of deepfakes, and gain access to detailed information about targets, ultimately bypassing endpoint security defenses. To stay ahead of the curve, security leaders must prepare for the inevitable wave of AI-generated threats and develop strategies to mitigate their impact.

Author bio

Aidan Simister
Aidan Simister is the CEO of Lepide, a leading provider of data security and compliance solutions. With over two decades of experience in the IT industry, he is recognized for his expertise in cybersecurity and his commitment to helping organizations safeguard their sensitive data.

Comparison of Cloud Storage Services

When designing workloads in the cloud, it is rare to have a workload without persistent storage, for storing and retrieving data.

In this blog post, we will review the most common cloud storage services and the different use cases for choosing specific cloud storage.

Object storage

Object storage is perhaps the most commonly used cloud-native storage service.

It is used for a wide range of use cases, from simple storage or archiving of logs and snapshots to more sophisticated use cases such as storage for data lakes or AI/ML workloads.

Object storage is used by many cloud-native applications – from Kubernetes-based workloads using a CSI driver (such as Amazon EKS, Azure AKS, and Google GKE) to Serverless / Function-as-a-Service (such as AWS Lambda and Azure Functions).

As a cloud-native service, access to object storage is done via a REST API, over HTTP or HTTPS.

Unstructured data is stored inside object storage services as objects, in a flat hierarchy that most cloud providers call buckets.

Data is automatically synced between availability zones in the same region (unless we choose otherwise), and if needed, buckets can be synced between regions (using cross-region replication).

To support different data access patterns, each of the hyperscale cloud providers offers its customers different storage classes (or storage tiers) – from real-time, through near real-time, to archive storage – and the ability to configure rules for moving data between storage classes (also known as lifecycle policies).
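
As a minimal sketch of a lifecycle policy, assuming Amazon S3 and the boto3 SDK (the bucket name, prefix, and day thresholds are illustrative; Azure and GCP offer equivalent lifecycle management rules):

```python
import boto3

s3 = boto3.client("s3")

# Move objects under "logs/" to an infrequent-access class after 30 days,
# to an archive class after 90 days, and delete them after one year.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-log-archive",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```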

As of 2023, all hyperscale cloud providers enforce data encryption at rest in all newly created buckets.

Comparison between Object storage alternatives:

As you can read in the comparison table above, most features are available in all hyper-scale cloud providers, but there are still some differences between the cloud providers:

  • AWS – Offers a cheap storage tier called S3 One Zone-IA for scenarios where data access patterns are less frequent, and data availability and resiliency are not highly critical, such as secondary backups. AWS also offers a tier called S3 Express One Zone for single-digit millisecond data access requirements, with low data availability or resiliency, such as AI/ML training, Amazon Athena analytics, and more.
  • Azure – Most storage services in Azure (Blob, files, queues, pages, and tables), require the creation of an Azure storage account – a unique namespace for Azure storage data objects, accessible over HTTP/HTTPS. Azure also offers a Premium block blob for high-performance workloads, such as AI/ML, IoT, etc.
  • GCP – Cloud Storage in Google is not limited to a single region – buckets can be provisioned and synced automatically across dual-regions and even multi-regions.

Block storage

Block storage is the disk volume attached to various compute services – from VMs, managed databases, and Kubernetes worker nodes, to volumes mounted inside containers.

Block storage can be used as the storage for transactional databases, data warehousing, and workloads with high volumes of read and write.

Block storage is not limited to traditional workloads deployed on top of virtual machines – volumes can also be mounted as persistent volumes for container-based workloads (such as Amazon ECS) and for Kubernetes-based workloads using a CSI driver (such as Amazon EKS, Azure AKS, and Google GKE).

Block storage volumes are usually limited to a single availability zone within the same region and should be mounted to a VM in the same AZ.
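To illustrate the single-AZ constraint, the following sketch uses boto3 to create an encrypted gp3 volume in one availability zone and attach it to an instance that must run in that same zone; the region, instance ID, and device name are illustrative assumptions.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Create an encrypted gp3 volume in a specific availability zone.
# The AZ, size, instance ID, and device name are illustrative assumptions.
volume = ec2.create_volume(
    AvailabilityZone="eu-west-1a",
    Size=100,              # GiB
    VolumeType="gp3",
    Encrypted=True,
)

# Wait until the volume is ready before attaching it.
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

# The volume can only be attached to an instance running in the same AZ (eu-west-1a).
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",
    Device="/dev/xvdf",
)
```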

Comparison between Block storage alternatives:

As you can read in the comparison table above, most features are available in all hyper-scale cloud providers, but there are still some differences between the cloud providers.

File storage

File storage services are the cloud equivalent of traditional Network Attached Storage (NAS).

All major hyperscale cloud providers offer managed file storage services, allowing customers to share files between multiple Windows (CIFS/SMB) and Linux (NFS) virtual machines.

File storage is not limited to traditional workloads sharing files between multiple virtual machines – file shares can also be mounted as persistent volumes for container-based workloads (such as Amazon ECS, Azure Container Apps, and Google Cloud Run), for Kubernetes-based workloads using a CSI driver (such as Amazon EKS, Azure AKS, and Google GKE), and for Serverless / Function-as-a-Service (such as AWS Lambda and Azure Functions).
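As a minimal sketch of provisioning a managed NFS file system, the example below uses boto3 to create an Amazon EFS file system and a mount target inside a VPC subnet; the creation token, subnet ID, and security group are illustrative assumptions, and Azure Files or Google Filestore would be the rough equivalents on the other providers.

```python
import boto3

efs = boto3.client("efs", region_name="eu-west-1")

# Create an encrypted, regional EFS (NFS) file system.
# The creation token, subnet ID, and security group are illustrative assumptions.
fs = efs.create_file_system(
    CreationToken="shared-app-data",
    PerformanceMode="generalPurpose",
    ThroughputMode="elastic",
    Encrypted=True,
)

# In practice, wait until the file system's LifeCycleState is "available"
# before creating mount targets.
# A mount target exposes the file system inside a VPC subnet, so VMs,
# containers, or Lambda functions in that VPC can mount it over NFS.
efs.create_mount_target(
    FileSystemId=fs["FileSystemId"],
    SubnetId="subnet-0123456789abcdef0",
    SecurityGroups=["sg-0123456789abcdef0"],
)
```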

Other than the NFS or CIFS/SMB file storage services, major cloud providers also offer a managed NetApp file system (for customers who wish to have the benefits of NetApp storage) and a managed Lustre file system (for HPC workloads or workloads that require extremely high throughput).

Comparison between NFS File storage alternatives:

As you can read in the comparison table above, most features are available in all hyper-scale cloud providers, but there are still some differences between the cloud providers:

  • AWS – Offers a lower-cost storage tier called EFS One Zone, for scenarios where data access is infrequent and data availability and resiliency are not highly critical. By default, data inside the One Zone file system is automatically backed up using AWS Backup.
  • Azure – Offers an additional security protection mechanism such as malware scanning and sensitive data threat detection, as part of a service called Microsoft Defender for Storage.
  • GCP – Offers enterprise-grade tier for critical applications such as SAP or GKE workloads, with regional high-availability and data replication called Enterprise tier.

Comparison between CIFS/SMB File storage alternatives:

Comparison between managed NetApp File storage alternatives:

Comparison between File storage for HPC workloads alternatives:

Summary

Persistent storage is required by almost any workload, including cloud-native applications.

In this blog post, we have reviewed the various managed storage options offered by the hyperscale cloud providers.

As a best practice, it is crucial to understand the application’s requirements when selecting the right storage option.

About the Author

Eyal Estrin is a cloud and information security architect, and the author of the books Cloud Security Handbook and Security for Cloud Native Applications, with more than 20 years in the IT industry.

You can connect with him on social media (https://linktr.ee/eyalestrin).

Opinions are his own and not the views of his employer.


Poor architecture decisions when migrating to the cloud

When organizations are designing their first workload migrations to the cloud, they tend to mistakenly look at the public cloud as the promised land that will solve all their IT challenges (from scalability, availability, cost, and more).

On the way to achieving their goals, organizations tend to make poor architectural decisions.

In this blog post, we will review some of the common architectural mistakes made by organizations.

Lift and Shift approach

Migrating a legacy monolithic workload from on-premises and moving it as-is to the public cloud might work (unless you have licensing or specific hardware requirements), but it will result in poor outcomes.

Although VMs can run perfectly well in the cloud, in most cases you will have to measure the performance of the VMs over time and right-size the instances to match the actual running workload and customers’ demand.
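A simple way to ground right-sizing decisions is to look at actual utilization metrics. The sketch below uses boto3 and Amazon CloudWatch to pull two weeks of CPU utilization for a single instance; the instance ID is an illustrative assumption.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# Pull the average and maximum CPU utilization of an instance over the last
# 14 days, one data point per day. The instance ID is an illustrative assumption.
end = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=end - timedelta(days=14),
    EndTime=end,
    Period=86400,
    Statistics=["Average", "Maximum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].date(), round(point["Average"], 1), round(point["Maximum"], 1))
```

If the average utilization stays in the single digits for weeks, that is usually a strong signal the instance can be downsized.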

The lift and shift approach is suitable as an interim solution until the organization has the time and resources to re-architect the workload, and perhaps choose a different architecture (for example, migrating from VMs to containers or even Serverless).

In the long run, lift and shift will be a costly solution (compared to on-premises) and will not gain the full capabilities of the cloud (such as horizontal scaling, scale to zero, the resiliency of managed services, and more).

Using Kubernetes for small/simple workloads

When designing modern applications, organizations tend to follow industry trends.

One of the hottest trends is to choose containers for deploying various application components, and in many cases, organizations choose Kubernetes as the container orchestrator engine.

Although Kubernetes does have many benefits, and all hyper-scale cloud providers offer a managed Kubernetes control plane, Kubernetes creates many challenges.

The learning curve for fully understanding how to configure and maintain Kubernetes is long and steep.

For small or predictable applications, built from a small number of different containers, there are better and easier-to-deploy-and-maintain alternatives, such as Amazon ECS, Azure Container Apps, or Google Cloud Run – all of them are fully capable of running production workloads, and are much easier to learn and maintain than Kubernetes.

Using cloud storage for backup or DR scenarios

When organizations began to search for their first use cases for using the public cloud, they immediately thought about using cloud storage as a backup location or perhaps even for DR scenarios.

Although both use cases are valid options, they both tend to miss the bigger picture.

Even if we use object storage (or managed NFS/CIFS storage services) for the organization’s backup site, we must always take into consideration the restore phase.

Pulling large binary backup files from the cloud environment back to on-premises will take a lot of time, not to mention the egress data cost, the cost of the read (GET) API calls, and more.
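As a back-of-the-envelope illustration (the data volume, object count, unit prices, and link speed below are assumptions for illustration only, not actual price-list figures), the restore phase can be roughly estimated like this:

```python
# Rough, illustrative estimate of restoring 10 TB of backups from object storage
# back to on-premises. All prices and figures are assumptions -- always check the
# current pricing pages of your cloud provider and region.

restore_size_gb = 10 * 1024          # 10 TB of backup data
object_count = 500_000               # number of objects to read

egress_price_per_gb = 0.09           # assumed internet egress price (USD/GB)
get_requests_price_per_1k = 0.0004   # assumed GET request price (USD per 1,000 requests)

egress_cost = restore_size_gb * egress_price_per_gb
request_cost = (object_count / 1000) * get_requests_price_per_1k
transfer_hours = (restore_size_gb * 8) / 3600   # assuming a sustained 1 Gbps link

print(f"Egress cost:   ~${egress_cost:,.0f}")
print(f"Request cost:  ~${request_cost:,.2f}")
print(f"Transfer time: ~{transfer_hours:.0f} hours at 1 Gbps")
```

Even under these optimistic assumptions, the restore takes roughly a day of sustained transfer and close to a thousand dollars in egress fees – numbers worth knowing before an incident, not during one.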

The same goes for DR scenarios – if we back up our on-premises VMs or even databases to the cloud, but we don’t have a similar infrastructure environment in the cloud, how will a cold DR site help us in case of a catastrophic disaster?

Separating the application and the back-end data-store tiers

Most applications are built from a front-end/application tier and a back-end persistent storage tier.

In a legacy or tightly coupled architecture, there is a requirement for low latency between the application tier and the data store tier, specifically when reading or writing to a backend database.

A common mistake is creating a hybrid architecture where the front-end is in the cloud, pulling data from an on-prem database, or (in a rarer scenario) an architecture where a legacy on-prem application connects to a managed database service in the cloud.

Unless the target application is insensitive to network latency, it is always recommended to architect all components close to each other, decreasing the network latency between the various application components.

Going multi-cloud in the hope of resolving vendor lock-in risk

A common risk many organizations look into is vendor lock-in (i.e., customers being locked into the ecosystem of a specific cloud provider).

When digging into this risk, vendor lock-in is about the cost of switching between cloud providers.

Multi-cloud will not resolve the risk, but it will create many more challenges, from skills gaps (teams need to be familiar with different cloud providers’ ecosystems), to central identity and access management, incident response across multiple cloud environments, egress traffic costs, and more.

Instead of designing complex architectures to mitigate theoretical or potential risk, design solutions to meet the business needs, familiarize yourself with a single public cloud provider’s ecosystem, and over time, once your teams have enough knowledge about more than a single cloud provider, expand your architecture — don’t run to multi-cloud from day 1.

Choosing the cheapest region in the cloud

As a rule of thumb, unless you have a specific data residency requirement, choose a region close to your customers, to lower the network latency.

Cost is an important factor, but you should design an architecture where your application and data reside close to customers.

If your application serves customers all around the globe, or in multiple locations, consider adding a CDN layer to keep all static content closer to your customers, combined with multi-region solutions (such as cross-region replication, global databases, global load-balancers, etc.)

Failing to re-assess the existing architecture

In the traditional data center, we used to design an architecture for the application and keep it static for the entire lifecycle of the application.

When designing modern applications in the cloud, we should embrace a dynamic mindset – keep re-assessing the architecture, revisit past decisions, and see if new technologies or new services can provide more suitable solutions for running the application.

The dynamic nature of the cloud, combined with evolving technologies, gives us the ability to make changes and find better ways to run applications – faster, more resilient, and more cost-effective.

Biased architecture decisions

This is a pitfall that many architects fall into – coming from a background in a specific cloud provider and designing architectures around that provider’s ecosystem, embedding biased decisions and service limitations into the architecture design.

Instead, architects should fully understand the business needs, the entire spectrum of cloud solutions, service costs, and limitations, and only then begin to choose the most appropriate services to take part in the application’s architecture.

Failure to add cost to architectural decisions

Cost is a huge factor when consuming cloud services; one of the main reasons is the ability to consume services on demand (and stop paying for unused services).

Each decision you make (from selecting the right compute nodes, to the storage tier, database tier, and more) has a cost impact.

Once we understand each service’s pricing model and the workload’s potential growth, we can estimate the potential cost.

As we previously mentioned, the dynamic nature of the cloud may result in different costs each month; as a result, we need to keep evaluating service costs regularly, replace services from time to time, and adjust them to suit the specific workload.

Summary

The public cloud presents many challenges when picking the right services and architectures to meet specific workload requirements and use cases.

Although there is no right or wrong answer when designing architecture, in this blog post, we have reviewed many “poor” architectural decisions that can be avoided by looking at the bigger picture and designing for the long term, instead of looking at short-term solutions.

Recommendation for the post readers – keep expanding your knowledge of cloud and architecture-related technologies, and keep questioning your current architectures to see, over time, if there are more suitable alternatives to your past decisions.

About the Author

Eyal Estrin is a cloud and information security architect, and the author of the books Cloud Security Handbook and Security for Cloud Native Applications, with more than 20 years in the IT industry. You can connect with him on social media (https://linktr.ee/eyalestrin).

Opinions are his own and not the views of his employer.

Checklist for designing cloud-native applications – Part 2: Security aspects

This post was originally published by the Cloud Security Alliance.

In Chapter 1 of this series about considerations when building cloud-native applications, we introduced various topics such as business requirements, infrastructure considerations, automation, resiliency, and more.

In this chapter, we will review security considerations when building cloud-native applications.

IAM Considerations – Authentication

Identity and Access Management plays a crucial role when designing new applications.

We need to ask ourselves – Who are our customers?

If we are building an application that will serve internal customers, we need to make sure our application will be able to sync identities from our identity provider (IdP).

On the other hand, if we are planning an application that will serve external customers, in most cases we would not want to manage the identities ourselves, but rather allow authentication based on SAML, OAuth, or OpenID Connect, and manage the authorization in our application.

Examples of managed cloud-native identity services: AWS IAM Identity Center, Microsoft Entra ID, and Google Cloud Identity.
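As a minimal sketch of the external-customer scenario, the snippet below validates an OpenID Connect ID token issued by an external IdP using the PyJWT library; the issuer, JWKS URL, and audience values are illustrative assumptions for a hypothetical IdP – use the values published in your provider’s OpenID Connect discovery document.

```python
import jwt
from jwt import PyJWKClient

# Illustrative assumptions -- replace with your IdP's published values.
ISSUER = "https://idp.example.com"
JWKS_URL = "https://idp.example.com/.well-known/jwks.json"
AUDIENCE = "my-application-client-id"

def validate_id_token(token: str) -> dict:
    """Verify the token signature, expiry, audience, and issuer claims."""
    signing_key = PyJWKClient(JWKS_URL).get_signing_key_from_jwt(token)
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience=AUDIENCE,
        issuer=ISSUER,
    )
```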

IAM Considerations – Authorization

Authorization is also an important factor when designing applications.

When our application consumes services (such as compute, storage, database, etc.) from a CSP ecosystem, each CSP has its own mechanisms to manage permissions for accessing services and taking actions, and its own way of implementing role-based access control (RBAC).

Regardless of the built-in mechanisms to consume cloud infrastructure, we must always follow the principle of least privilege (i.e., minimal permissions to achieve a task).

On the application layer, we need to design an authorization mechanism that checks each identity authenticated to our application against an authorization engine (whether access is interactive, non-interactive, or API-based).

Although it is possible to manage authorization using our own developed RBAC mechanism, it is time to consider more cloud-agnostic authorization policy engines such as Open Policy Agent (OPA).

One of the major benefits of using OPA is the fact that its policy engine is not limited to authorization to an application – you can also use it for Kubernetes authorization, for Linux (using PAM), and more.
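A minimal sketch of calling OPA’s REST Data API from application code is shown below; the OPA endpoint and the policy path (httpapi/authz/allow) are assumptions that depend on how your Rego policies are packaged and loaded.

```python
import requests

# Assumed local OPA server and policy path -- adjust to your deployment.
OPA_URL = "http://localhost:8181/v1/data/httpapi/authz/allow"

def is_allowed(user: str, method: str, path: str) -> bool:
    """Ask OPA whether the given request should be allowed."""
    response = requests.post(
        OPA_URL,
        json={"input": {"user": user, "method": method, "path": path}},
        timeout=2,
    )
    response.raise_for_status()
    # OPA returns {"result": true/false}; a missing result means the rule is undefined.
    return response.json().get("result", False)

print(is_allowed("alice", "GET", "/reports/finance"))
```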

Policy-as-Code Considerations

Policy-as-Code allows you to configure guardrails on various aspects of your workload.

Guardrails are offered by all major cloud providers, are enforced outside the boundary of a cloud account, and limit the maximum allowed resource consumption or configuration.

Examples of guardrails:

  • Limitation on the allowed region for deploying resources (compute, storage, database, network, etc.)
  • Enforce encryption at rest
  • Forbid the ability to create publicly accessible resources (such as a VM with public IP)
  • Enforce the use of specific VM instance size (number of CPUs and memory allowed)

Guardrails can also be enforced as part of a CI/CD pipeline when deploying resources using Infrastructure as Code for automation purposes – the IaC code is evaluated before the actual deployment phase, and assuming it does not violate the Policy-as-Code rules, the resources are deployed or updated.

Examples of Policy-as-Code: AWS Service control policies (SCPs), Azure Policy, Google Organization Policy Service, HashiCorp Sentinel, and Open Policy Agent (OPA).
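As a hedged sketch of the first guardrail in the list above (limiting the allowed regions), the example below creates an AWS Service Control Policy using boto3; the approved regions, exempted global services, and policy name are illustrative assumptions.

```python
import json
import boto3

organizations = boto3.client("organizations")

# A Service Control Policy that denies actions outside two approved regions.
# Global services such as IAM are exempted via NotAction.
# The allowed regions and the policy name are illustrative assumptions.
scp_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyOutsideApprovedRegions",
            "Effect": "Deny",
            "NotAction": ["iam:*", "organizations:*", "support:*"],
            "Resource": "*",
            "Condition": {
                "StringNotEquals": {"aws:RequestedRegion": ["eu-west-1", "eu-central-1"]}
            },
        }
    ],
}

organizations.create_policy(
    Name="region-guardrail",
    Description="Allow resource creation only in approved EU regions",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp_document),
)
```

The policy still has to be attached to an organizational unit or account before it takes effect.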

Data Protection Considerations

Almost any application contains valuable data, whether the data has business or personal value, and as such we must protect the data from unauthorized parties.

A common way to protect data is to store it in encrypted form:

  • Encryption in transit – done using protocols such as TLS (where the latest supported version is 1.3)
  • Encryption at rest – done on a volume, disk, storage, or database level, using algorithms such as AES
  • Encryption in use – done using hardware supporting a trusted execution environment (TEE), also referred to as confidential computing

When encrypting data, we need to deal with key generation, a secured vault for key storage, key retrieval, and key destruction.

All major CSPs have their key management service to handle the entire key lifecycle.

If your application is deployed on top of a single CSP infrastructure, prefer to use managed services offered by the CSP.

For encryption in use, select services (such as VM instances or Kubernetes worker nodes) that support confidential computing.
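A minimal sketch of the key lifecycle using a managed key management service (AWS KMS via boto3 in this example) is shown below; for large objects you would typically generate a data key and perform envelope encryption rather than calling Encrypt directly, and the key description is an illustrative assumption.

```python
import boto3

kms = boto3.client("kms")

# Create a symmetric customer-managed key and use it to encrypt a small payload.
# KMS Encrypt is limited to small payloads; for large objects, generate a data
# key and encrypt the data locally (envelope encryption).
key = kms.create_key(Description="application data key (illustrative)")
key_id = key["KeyMetadata"]["KeyId"]

ciphertext = kms.encrypt(KeyId=key_id, Plaintext=b"sensitive value")["CiphertextBlob"]
plaintext = kms.decrypt(CiphertextBlob=ciphertext)["Plaintext"]

assert plaintext == b"sensitive value"
```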

Secrets Management Considerations

Secrets are equivalent to static credentials, allowing access to services and resources.

Examples of secrets are API keys, passwords, database credentials, etc.

Secrets, similarly to encryption keys, are sensitive and need to be protected from unauthorized parties.

From the initial application design process, we need to decide on a secured location to store secrets.

All major CSPs have their own secrets management service to handle the entire secret’s lifecycle.

As part of a CI/CD pipeline, we should embed an automated scanning process to detect secrets embedded as part of code, scripts, and configuration files, to avoid storing any secrets as part of our application (i.e., outside the secured secrets management vault).

Examples of secrets management services: AWS Secrets Manager, Azure Key Vault, Google Secret Manager, and HashiCorp Vault.
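As a minimal sketch of fetching a secret at runtime instead of embedding it in code, the example below uses AWS Secrets Manager via boto3; the secret name and its JSON structure are illustrative assumptions.

```python
import json
import boto3

secretsmanager = boto3.client("secretsmanager")

def get_database_credentials(secret_id: str = "prod/app/db") -> dict:
    """Fetch database credentials at runtime instead of hard-coding them.

    The secret name and its JSON structure ({"username": ..., "password": ...})
    are illustrative assumptions.
    """
    response = secretsmanager.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])

creds = get_database_credentials()
# creds["username"] and creds["password"] are then passed to the database driver,
# never written to code, configuration files, or logs.
```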

Network Security Considerations

Applications must be protected at the network layer, whether we expose our application to internal customers or customers over the public internet.

The fundamental way to protect infrastructure at the network layer is using access controls, which are equivalent to layer 3/layer 4 firewalls.

All CSPs have access control mechanisms to restrict access to services (from access to VMs, databases, etc.)

Examples of Layer 3 / Layer 4 managed services: AWS Security groups, Azure Network security groups, and Google VPC firewall rules.
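A minimal sketch of a layer 3/4 access control rule (an AWS security group ingress rule via boto3) is shown below; the security group ID and CIDR range are illustrative assumptions.

```python
import boto3

ec2 = boto3.client("ec2")

# Allow inbound HTTPS (TCP/443) only from an internal corporate range.
# The security group ID and CIDR range are illustrative assumptions.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "IpRanges": [
                {"CidrIp": "10.20.0.0/16", "Description": "internal clients"}
            ],
        }
    ],
)
```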

Some cloud providers support private access to their services by adding a network load-balancer in front of various services, with an internal IP from the customer’s private subnet, forcing all traffic to pass over the CSP’s backbone rather than the public internet.

Examples of private connectivity solutions: AWS PrivateLink, Azure Private Link, and Google Private Service Connect.

Some of the CSPs offer managed layer 7 firewalls, allowing customers to filter traffic based on application protocols (and not just ports), inspect TLS traffic for malicious content, and more, in case your application or business requires those capabilities.

Examples of Layer 7 managed firewalls: AWS Network Firewall, Azure Firewall, and Google Cloud NGFW.

Application Layer Protection Considerations

Any application accessible to customers (internal or over the public Internet) is exposed to application layer attacks.

Attacks can include malicious code injection, data exfiltration (or data leakage), data tampering, unauthorized access, and more.

Whether you are exposing an API, a web application, or a mobile application, it is important to implement application layer protection, such as a WAF service.

All major CSPs offer managed WAF services, and there are also many commercial SaaS solutions that offer managed WAF capabilities.

Examples of managed WAF services: AWS WAF, Azure WAF, and Google Cloud Armor.

DDoS Protection Considerations

Denial-of-Service (DoS) or Distributed Denial-of-Service (DDoS) is a risk for any service accessible over the public Internet.

Such attacks try to consume all the available resources (from network bandwidth to CPU/memory), directly impacting the availability of the service to its customers.

All major CSPs offer managed DDoS protection services, and there are also many commercial vendors that offer managed DDoS protection solutions.

Examples of managed DDoS protection services: AWS Shield, Azure DDoS Protection, Google Cloud Armor, and Cloudflare DDoS protection.

Patch Management Considerations

Software tends to have vulnerabilities, and as such it must be regularly patched.

For applications deployed on top of virtual machines:

  • Create a “golden image” of a virtual machine, and regularly update the image with the latest security patches and software updates.
  • Create a regular patch update process for the running VMs.

For applications wrapped inside containers, create a “golden image” of each of the application components, and regularly update the image with the latest security patches and software updates.

Embed software composition analysis (SCA) tools to scan and detect vulnerable third-party components – in case vulnerable components (or their dependencies) are detected, begin a process of replacing the vulnerable components.

Examples of patch management solutions: AWS Systems Manager Patch Manager, Azure Update Manager, and Google VM Manager Patch.
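As a hedged sketch of automating the patch process with one of the services above (AWS Systems Manager), the example below triggers the built-in AWS-RunPatchBaseline document against instances selected by tag; the tag key and value are illustrative assumptions.

```python
import boto3

ssm = boto3.client("ssm")

# Trigger a patch install run on instances tagged for patching, using the
# built-in AWS-RunPatchBaseline document. The tag key/value are illustrative assumptions.
ssm.send_command(
    Targets=[{"Key": "tag:PatchGroup", "Values": ["web-servers"]}],
    DocumentName="AWS-RunPatchBaseline",
    Parameters={"Operation": ["Install"]},
    Comment="Monthly patch window",
)
```

In practice, this command would typically be scheduled through a maintenance window rather than run ad hoc.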

Compliance Considerations

Compliance is an important security factor when designing an application.

Some applications contain personally identifiable information (PII) about employees or customers, which requires compliance against privacy and data residency laws and regulations (such as the GDPR in Europe, the CPRA in California, the LGPD in Brazil, etc.)

Some organizations decide to be compliant with industry or security best practices, such as the Center for Internet Security (CIS) Benchmarks for hardening infrastructure components; compliance can later be evaluated using compliance services or Cloud Security Posture Management (CSPM) solutions.

References for compliance: AWS Compliance Center, Azure Service Trust Portal, and Google Compliance Resource Center.

Incident Response

When designing an application in the cloud, it is important to be prepared to respond to security incidents:

  • Enable logging from both infrastructure and application components, and stream all logs to a central log aggregator. Make sure logs are stored in a central, immutable location, with access privileges limited to the SOC team.
  • Select a tool to be able to review logs, detect anomalies, and be able to create actionable insights for the SOC team.
  • Create playbooks for the SOC team, to know how to respond in case of a security incident (how to investigate, where to look for data, who to notify, etc.)
  • To be prepared for a catastrophic event (such as a network breach or ransomware), create automated solutions that allow you to quarantine the impacted services and deploy a new environment from scratch.

References for incident response documentation: AWS Security Incident Response Guide, Azure Incident response, and Google Data incident response process.

Summary

In the second blog post in this series, we talked about many security-related aspects that organizations should consider when designing new applications in the cloud.

In this part of the series, we have reviewed various aspects, from identity and access management to data protection, network security, patch management, compliance, and more.

It is highly recommended to use the topics discussed in this series of blog posts as a baseline when designing new applications in the cloud, and to continuously improve this checklist of considerations when documenting your projects.

About the Author

Eyal Estrin is a cloud and information security architect, and the author of the book Cloud Security Handbook, with more than 20 years in the IT industry. You can connect with him on Twitter.

Opinions are his own and not the views of his employer.