Guest Post - Resilience in the cloud

posted 7 May 2019, 09:23 by Lee Newcombe   [ updated 7 May 2019, 10:22 ]
Here's the latest guest blog post - another one courtesy of Leron Zinatullin (@le_rond), this time on considerations when it comes to operating resiliently if using cloud services.  Don't forget, feel free to contact us if you feel that you have a burning issue that you'd like to get off your chest via a guest blog post. Practical lessons learned that you'd like to share would be most gratefully received by ourselves (as a Chapter) and also by our readers.  And now, over to Leron...

Resilience in the Cloud

Modern digital technology underpins the shift that enables businesses to implement new processes, scale quickly and serve customers in a whole new way.

Historically, organisations would invest in their own IT infrastructure to support their business objectives and the IT department's role would be focused on keeping the ‘lights on’.

To minimise the chance of failure of the equipment, engineers traditionally introduced an element of redundancy in the architecture. That redundancy could manifest itself on many levels. For example, it could be a redundant datacentre, which is kept as a ‘hot’ or ‘warm’ site with a complete set of hardware and software ready to take the workload in case of the failure of a primary datacentre. Components of the datacentre, like power and cooling, can also be redundant to increase the resiliency.

On a lesser scale, within a single datacentre, networking infrastructure elements can be redundant. It is not uncommon to procure two firewalls instead of just one to configure them to balance the load or just to have a second one as a backup. Power and utilities companies still stock up on critical industrial control equipment to be able to quickly react to a failed component.

The majority of effort, however, went into protecting the data storage. Magnetic disks were assembled in RAIDs to reduce the chances of data loss in case of failure and backups were relegated to magnetic tapes to preserve less time-sensitive data and stored in separate physical locations.

Depending on specific business objectives or compliance requirements, organisations had to heavily invest in these architectures. One-off investments were, however, only one side of the story. On-going maintenance, regular tests and periodic upgrades were also required to keep these components operational. Labour, electricity, insurance and other costs were adding to the final bill. Moreover, if a company was operating in a regulated space, for example if they processed payments and cardholder data then external audits, certification and attestation were also required.

With the advent of cloud computing, companies were able to abstract away a lot of this complexity and let someone else handle the building and operation of datacentres and dealing with compliance issues relating to physical security.

The need for the business resilience, however, did not go away.

Cloud providers can offer options that far exceed (at comparable costs) the traditional infrastructure; but only if configured appropriately.

One example of this is the use of 'zones' of availability, where your resources can be deployed across physically separate datacentres. In this scenario, your service can be balanced across these availability zones and can remain running even if one of the 'zones' goes down. Capital investment required to achieve such functionality is much greater if you would to build your own infrastructure for this. In essence, you would have to build two or more datacentres. -. You better have a solid business case for this.

Additional resiliency in the cloud, however, is only achieved if you architect your solutions well: running your service in a single zone or, worse still, on a single virtual server can prove less resilient than running it on a physical machine.

It is important to keep this in mind when deciding to move to the cloud from the traditional infrastructure. Simply lifting and shifting your applications to the cloud may, in fact, reduce the resiliency. These applications are unlikely to have been developed to work in the cloud and take advantage of these additional resiliency options. Therefore, I advise against such migration in favour of re-architecting.

Cloud Service Provider SLAs should also be considered. Compensation might be offered for failure to meet these, but it’s your job to check how this compares to the traditional “5 nines” of availability in a traditional datacentre – alongside the financial differences between service credits as recompense and business losses from lack of availability.

You should also be aware of the many differences between cloud service models.

When procuring a SaaS, for example, your ability to manage resilience is significantly reduced. In this case you are relying completely on your provider to keep the service up and running, potentially raising the provider outage concern. In this scenario, archiving and regular data extraction might be your only options apart from reviewing the SLAs and accepting the residual risk. Even with the data, however, your options are limited without a second application on-hand to process that data, which may also require data transformation. Study the historical performance and pick your SaaS provider carefully.

IaaS gives you more options to design an architecture for your application, but with this great freedom comes great responsibility. The provider is responsible for fewer layers of the overall stack when it comes to IaaS, so you must design and maintain a lot of it yourself. When doing so, assume failure rather than thinking of it as a (remote) possibility.  Availability zones are helpful, but not always sufficient.  What scenarios require consideration of the use of a separate geographical region? Do any scenarios or requirements justify a need for a second cloud services provider? The European Banking Authority recommendations on Exit and Continuity can be an interesting example to look at from a testing and deliverability perspective.

Finally, PaaS, as always, is somewhere in-between SaaS and IaaS. I find that a lot of the times it depends on a particular platform; some of them will give you options you can play with when it comes to resiliency and others will retain full control. Be mindful of characteristics of SaaS that also affect PaaS from a redundancy perspective. For example, if you’re using a proprietary PaaS then you can’t just lift and shift your data and code.

Above all, when designing for resiliency, take a risk-based approach. Not all your assets have the same criticality. Understand the priorities, know your RPO and RTO. Remember that SaaS can be built on top of AWS or Azure, exposing you to supply chain risks.

Even when assuming the worst, you may not have to keep every single service running should the worst actually happen. For one thing, it's too expensive - just ask your business stakeholders. The very worst time to be defining your approach to resilience is in the middle of an incident, closely followed by shortly after an incident.  As with other elements of security in the cloud, resilience should “shift left” and be addressed as early in the delivery cycle as possible.  As the Scout movement is fond of saying – “be prepared”.

About the author

Leron Zinatullin is an experienced risk consultant, specialising in cyber-security strategy, management and delivery. He has led large-scale, global, high-value security transformation projects with a view to improving cost performance and supporting business strategy. He has extensive knowledge and practical experience in solving information security, privacy and architectural issues across multiple industry sectors. Leron is the author of The Psychology of Information Security.

Twitter: @le_rond

www.zinatullin.com

Comments