What The Cloud is not and why you should care to know it
We have all been using cloud services for years now. This session does not pretend to teach anything technical: there are many cloud services and, even assuming I had the skills to teach you what they are and how to operate each of them (and that assumption would be wrong), we would need to stay in one room for months to complete such an exercise. Once we had finished, we would discover that what we had learned was already old and that new services had been created in the meantime.
What I try to achieve with this story is a common understanding of what it means to work with the Cloud, how it changes the paradigms we were following before, and how it changes the definition of our roles and interactions.
This session aims to give a common context and approach for considering Cloud services, so that we avoid common mistakes during Cloud adoption programs and are ready to choose this journey for the right reasons.
From the candle to electricity
Before mentioning Cloud services I would like to draw your attention to other transformations we went through in the past.
We used to light our houses with candles; these have been replaced by electricity powering light bulbs. Moving from candles to light bulbs is not cost effective. Candles are really inexpensive and the only dependency is an initial flame; light bulbs are more expensive than candles and require power plants and a subscription to a provider that brings electricity to your house.
But with light bulbs it is much easier to operate the lighting and to vary how much light you want in one room compared to another, you can easily light multiple areas of the house at the same time and, finally, the electrical system can be used for many things other than providing light.
Electricity was not just replacing candles; it was creating a completely new set of capabilities.
Let’s now consider a more recent change. When we all moved from legacy mobile phones to smartphones we got a product that was seriously more expensive, bigger, harder to fit into a pocket, often with worse antennas than the phones we were using before, and with batteries that lasted far less (we used to go days without charging a legacy phone; with smartphones this became hours).
But we all moved from legacy phones to smartphones, because they enabled us to do much more and to become more efficient. We used to travel with a phone, an agenda, a computer, a camera, a navigation system and a music player. All these things merged into a single product capable of doing everything they did and something more.
But it means we had to start using the smartphone in a way that is quite different from the way we used legacy phones.
“I move to The Cloud because it is less expensive!”
The Cloud providers have very good marketing people, and marketing has been very good at drawing attention to The Cloud even when the concept was very far from what people were used to.
Marketing communicates using simple messages that people tend to absorb quickly, and this has created a series of myths that it is important to debunk.
The most common mistake is to believe that the Cloud will cost less: “I can use it without any problem because it is very inexpensive”.
This is not true. If you take the same computational power in an On Premises setup and in the Cloud (with any of the major Cloud providers) and compare the costs, in the long term On Premises will always cost less.
This is because you are comparing two different things, which is easy to understand with an example. If you make your own pizza, you get some flour, yeast, salt, oil and tomato sauce and you do the job using your own oven with your own electricity. This will cost you around 2 euro per pizza. If you order a pizza in one of the best pizzerias in town, the same pizza will cost you 5 euro.
But if you need to make 100 pizzas at home, this will be incredibly slow and difficult, or very expensive, because you will have to buy additional ovens, tables and space, while in the pizzeria each pizza will still cost 5 euro, or maybe less because they will give you a volume discount. You could decide to buy or build your own big Pizza Machine to be able to prepare 100 pizzas very quickly, and in the long term this will cost you less than continuing to order from the pizzeria. But what if you discover that you need to make only 50 pizzas and add 50 lasagne? Your investment in the big Pizza Machine will be underused and you will need a big Lasagna Machine too. In the pizzeria any pizza will keep costing you 5 euro, and you will be able to ask for lasagna or spaghetti as well and change your mind at any moment without the need to plan in advance.
Common mistakes when moving to The Cloud | Cost Evaluation
Cloud is not less expensive; it just has a different usage model. I insist a lot on this aspect because my experience so far has been one of wasted money, and it is not only me: the most common failure for companies migrating from traditional solutions to the Cloud is that they end up spending more money than in the past.
It is extremely important for everyone to remember that the Cloud has a “per usage” model and a very detailed list of explicit costs. Take the simplest example, creating a Virtual Machine. You pay a cost per hour based on the CPU and memory size of the VM. On top of that you pay for the disk storage used by the VM, based on the size of the space allocated and on the kind of disk (different models have different scalability, security and performance characteristics). You also pay for the number of operations you make on the disk (these are not included in the pure cost of the disk space), for the network traffic generated by the VM and for any snapshot you take of the VM. If you enable some special agent or add-on, it will also have a separate cost line.
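To make the idea of separate cost lines concrete, here is a minimal sketch of how a monthly bill for one VM adds up. Every rate, the `RATES` table and the `VmUsage` fields are invented for illustration and do not come from any real provider's price list; the only point is that the hourly compute price is one line among several.

```python
from dataclasses import dataclass

# Hypothetical unit prices in euro; real price lists differ per provider,
# region and service tier.
RATES = {
    "compute_per_hour": 0.10,       # depends on CPU and memory size
    "disk_per_gb_month": 0.05,      # depends on allocated size and disk type
    "disk_per_million_ops": 0.40,   # disk operations are billed separately
    "egress_per_gb": 0.08,          # network traffic generated by the VM
    "snapshot_per_gb_month": 0.03,  # every snapshot that is kept
}


@dataclass
class VmUsage:
    hours_existing: float     # hours the VM existed during the month
    disk_gb_allocated: float  # allocated space, not space actually filled
    disk_million_ops: float
    egress_gb: float
    snapshot_gb: float


def monthly_bill(usage: VmUsage) -> dict:
    """Return one cost line per billed item, plus the total."""
    lines = {
        "compute": usage.hours_existing * RATES["compute_per_hour"],
        "disk space": usage.disk_gb_allocated * RATES["disk_per_gb_month"],
        "disk operations": usage.disk_million_ops * RATES["disk_per_million_ops"],
        "network": usage.egress_gb * RATES["egress_per_gb"],
        "snapshots": usage.snapshot_gb * RATES["snapshot_per_gb_month"],
    }
    lines["total"] = sum(lines.values())
    return lines


bill = monthly_bill(VmUsage(hours_existing=730, disk_gb_allocated=500,
                            disk_million_ops=50, egress_gb=200, snapshot_gb=500))
for item, cost in bill.items():
    print(f"{item:16s} {cost:8.2f}")
```

With these made-up numbers the compute line is only about half of the total, which is why comparing the hourly VM price alone against an On Premises server gives a misleading picture.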
What always has to be kept in mind is that you pay as long as what you created exists, whether you are using it or not. You pay for the minutes it exists, so it makes a very big difference whether you shut down or delete a service now or in two hours, and whether you sized it appropriately for what you need or oversized it just to be sure and have contingency. The business model of the Cloud takes serious advantage of the fact that many people will over-provision and will forget to turn off services when not in use.
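The effect of over-provisioning and of leaving things running can be sketched with simple arithmetic. The two hourly rates and the working pattern (10 hours a day, 22 days a month) are assumptions chosen only to show the order of magnitude.

```python
HOURS_PER_MONTH = 730   # common approximation of one month

# Hypothetical hourly rates in euro
RATE_RIGHT_SIZED = 0.10
RATE_OVERSIZED = 0.20   # twice the size, "just to be sure"


def compute_cost(hourly_rate: float, hours_existing: float) -> float:
    """Billing follows the time the resource exists, not the time it is used."""
    return hourly_rate * hours_existing


# Oversized and never turned off
always_on_oversized = compute_cost(RATE_OVERSIZED, HOURS_PER_MONTH)

# Right sized and existing only during working hours
working_hours = 10 * 22
on_demand_right_sized = compute_cost(RATE_RIGHT_SIZED, working_hours)

print(f"oversized, always on : {always_on_oversized:.2f} euro")
print(f"right sized, on demand: {on_demand_right_sized:.2f} euro")
print(f"ratio: {always_on_oversized / on_demand_right_sized:.1f}x")
```

Under these assumptions the same workload costs several times more when it is oversized and forgotten, and that difference is multiplied by every deployment you run.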
The Cloud providers are constantly creating new services that may suit the business cases of your projects better than the ones you were using before. They are also constantly updating their APIs and UIs, and they change the prices of existing services on the basis of how the market moves and how capacity and capabilities are used in the datacentre.
The value of the Cloud also lies in the ability to create multiple systems, isolated from each other in terms of deployment. If you were used to having a cluster of 9 database servers to run the whole business, when moving to the Cloud you will easily find yourself with hundreds of them.
For all these reasons together, it is clear that the work required to manage and operate the Cloud is greater than the work that was required to manage a traditional On Premises solution. The number of people required to manage a given number of deployments (applications and their hosting components) has decreased thanks to automation, but the number of deployments is increasing exponentially. A Cloud Engineer (or a DevOps Engineer) does a very different job from the System or Network Engineer you used to have On Premises and can be much more effective and efficient, but the total number of Cloud Engineers you need keeps increasing because of the growing number of deployments to manage.
To conclude on this point: if you want to take what you have On Premises and move it to the Cloud “as is”, preserving the same operational model, just don’t do it. It will only cost you much more.
You have to move to the Cloud operating things as the Cloud requires (so continuous resizing, deletion and creation based on actual needs).
Common mistakes when moving to The Cloud | No need for capacity planning
Another common mistake is to believe that capacity and scalability, like backup, resiliency or security, are available by default. You move from your On Premises setup to The Cloud and you immediately start to forget about all the limitations you were experiencing On Premises, considering The Cloud an infinite space for growth.
The reality is that The Cloud is made of hardware physically installed in physical datacentres and, like anything physical, it is subject to limits.
It means that each service in each region has its own, quite finite, capacity.
For most services you have to explicitly configure the capacity and scalability you need, defining and activating the rules, and you will always find some limit.
If you cannot accept the limits of the datacentre, you have to design your solution to be able to run from multiple regions.
Common mistakes when moving to The Cloud | No need to manage availability and reliability
Clouds are very complex distributed systems, operated by humans and running on physical hardware. For this reason Clouds have incidents and failures that sometimes, even if rarely, make an entire region unavailable for many hours.
This is common to each and every Cloud provider and can easily be verified by looking at the incident reports of each of the main vendors. Azure, for example, had incidents on the following days between the beginning of October and the beginning of November 2018: 8 November, 2 November, 27 October, 24 October, 17 October, 16 October, 13 October, 11 October, 8 October, 4 October, 3 October, 2 October. During the same period AWS had 40 incidents (measured over a period of 30 days).
When it comes to SLAs, pay attention to the fact that each single atomic service has its own independent SLA. Suppose your service depends on the availability of the DNS, a set of App Services, the network, a couple of Virtual Machines and a PaaS-based database. If the DNS is down, and so your service is completely unavailable to your end users, the Cloud provider will pay you only the credits related to the DNS service, calculated on the difference between the actual uptime of the DNS service and the SLA on that service. There is no penalty on the other components, which you will keep paying for in full without any reimbursement.
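A short sketch of the arithmetic behind this point. The SLA percentages, the monthly costs and the 25% credit are invented for illustration and are not taken from any provider's contract. Multiplying the individual figures assumes that the components fail independently and that the service needs all of them to be up.

```python
from math import prod

# Hypothetical per-component SLA targets and monthly costs in euro
components = {
    "dns":              {"sla": 0.9999, "monthly_cost": 20.0},
    "app_services":     {"sla": 0.9995, "monthly_cost": 400.0},
    "network":          {"sla": 0.9999, "monthly_cost": 100.0},
    "virtual_machines": {"sla": 0.9990, "monthly_cost": 600.0},
    "paas_database":    {"sla": 0.9999, "monthly_cost": 800.0},
}


def composite_availability(parts: dict) -> float:
    """Availability of a service that needs every component to be up."""
    return prod(part["sla"] for part in parts.values())


def credit_for_outage(parts: dict, failed: str, credit_fraction: float) -> float:
    """The credit is computed only on the component that missed its SLA."""
    return parts[failed]["monthly_cost"] * credit_fraction


composite = composite_availability(components)
total_bill = sum(part["monthly_cost"] for part in components.values())
dns_credit = credit_for_outage(components, "dns", credit_fraction=0.25)

print(f"composite availability: {composite:.4%}")
print(f"monthly bill: {total_bill:.2f} euro, credit after DNS outage: {dns_credit:.2f} euro")
```

Two things show up even with made-up numbers: the composite figure is lower than the SLA of the weakest component, and the credit for an outage that stopped the whole service is a tiny fraction of what you pay for the deployment.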
This, as with the capacity and scalability limitations, means that you have to design your solution to be capable of running in multiple regions and to cope with multiple kinds of failure.
Ideally, and in the extreme case, you should also design your solution to run using services from multiple different cloud providers.
Please also always remember that the Cloud business model is, at least partially, based on the concepts of overbooking and abstraction. When the Cloud refers to 1 CPU, this is not equal to the capacity you get from 1 CPU in your On Premises setup, and the same goes for network throughput. Even if it is rare, thanks to the protections the vendors are trying to implement, it is also possible, as experienced by some projects I have worked on, that activities executed by other customers of the same Cloud provider and region will impact the capacity available for your deployment.
Cloud platforms have all that is needed to manage data replication, data backup, security, monitoring. All these features are available but often none of them is activated by default. The deployment has to be designed in order to implement the required configurations for each of these areas.
It is also important to draw a clear distinction between data backup and data replication. People often confuse the two and believe that geographically redundant storage is all they need to keep their data safe.
Data replication means that there are multiple distributed copies of the data. This is necessary to assure data availability, but if you need to assure data reliability you also need to implement a backup policy. When you replicate data you replicate every action done on the data, including data corruption caused by bad editing or deletion.
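The difference can be shown with a toy model. `ReplicatedStore` below is a hypothetical in-memory class written for this example, not the API of any Cloud storage service: every write and delete goes to all replicas, while a backup is a point-in-time copy kept apart from them.

```python
import copy


class ReplicatedStore:
    """Toy key-value store: each change is applied to every replica."""

    def __init__(self, replica_count: int = 3):
        self.replicas = [{} for _ in range(replica_count)]
        self.backups = []  # point-in-time copies, untouched by later changes

    def put(self, key, value):
        for replica in self.replicas:
            replica[key] = value

    def delete(self, key):
        # A mistaken deletion is replicated as faithfully as a correct one.
        for replica in self.replicas:
            replica.pop(key, None)

    def take_backup(self):
        self.backups.append(copy.deepcopy(self.replicas[0]))

    def restore_latest_backup(self):
        snapshot = self.backups[-1]
        self.replicas = [copy.deepcopy(snapshot) for _ in self.replicas]


store = ReplicatedStore()
store.put("invoice-42", "paid")
store.take_backup()

store.delete("invoice-42")  # human mistake
lost_everywhere = all("invoice-42" not in r for r in store.replicas)

store.restore_latest_backup()
recovered = all(r.get("invoice-42") == "paid" for r in store.replicas)

print("lost on every replica:", lost_everywhere)
print("recovered from backup:", recovered)
```

Three replicas gave no protection against the deletion; only the separate backup made recovery possible.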
Different mechanisms and levels of data replication and data backup policies are available in each of the Cloud services. It is important to define the business requirements for the specific solution and explicitly implement what is best for the case, taking into consideration the related costs.
Common mistakes when moving to The Cloud | No need to manage security
One of the most common mistakes with Cloud provisioning is to forget about security. Thousands of Kubernetes clusters, Redis clusters, MongoDB replica sets and other services are compromised every year because they were deployed into the Cloud using simple templates, without taking the time to restrict network access or to change default usernames and passwords.
As for the other elements we just discussed, services are available in the Cloud to correctly manage any security concern, but they are rarely activated and configured by default.
While resources On Premises are by default available only on private office networks, services in the public Cloud are accessed via the public internet. It means that when you open access you are potentially opening access to everyone.
Almost every company has already experienced services and Virtual Machines that had been configured as completely open to the Internet. One thing is certain, and you can easily check it by looking at the logs of the ADSL router you have at home: any public IP is constantly subject to scans and attacks. Do not think “it has a public IP but no one knows about it, so why should someone attack it”. The simple fact of having a public IP means that the service or server is under attack and will constantly be under attack.
Encrypt traffic and apply as much network access filtering as possible on any system or service you deploy in the Cloud. Change any default username or password. Always make a personal copy of any template you want to use and make your own changes, avoiding the direct usage of public images and templates, and always select only highly trusted repositories.
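One simple check that can be automated is looking for access rules that are open to the whole internet. The sketch below assumes a hypothetical list of rules with `name`, `port` and `source` fields; real providers expose this information in their own formats, so the structure here is only illustrative. It uses the standard `ipaddress` module to recognise a source range that covers every address.

```python
import ipaddress

# Hypothetical firewall rules, as they might be exported from a deployment
rules = [
    {"name": "ssh-admin", "port": 22,    "source": "203.0.113.0/24"},
    {"name": "redis",     "port": 6379,  "source": "0.0.0.0/0"},
    {"name": "https",     "port": 443,   "source": "0.0.0.0/0"},
    {"name": "mongodb",   "port": 27017, "source": "::/0"},
]

# Ports this example treats as intentionally public
PUBLIC_BY_DESIGN = {80, 443}


def is_open_to_everyone(cidr: str) -> bool:
    """True when the range covers every IPv4 or IPv6 address."""
    return ipaddress.ip_network(cidr, strict=False).prefixlen == 0


def risky_rules(rule_list: list) -> list:
    """Names of rules open to everyone on ports not meant to be public."""
    return [
        rule["name"]
        for rule in rule_list
        if is_open_to_everyone(rule["source"]) and rule["port"] not in PUBLIC_BY_DESIGN
    ]


print(risky_rules(rules))
```

A check like this covers only network exposure. Default credentials, unencrypted traffic and untrusted images need their own controls.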
Logging and monitoring tools are also available from each Cloud vendor; again, these are not necessarily configured by default. Take your time to analyse the solution you have to deploy into the Cloud and properly configure log collection, log analysis and monitoring tools, for both security and availability. This is a constant activity: monitoring has to be tuned continuously.
How to approach The Cloud?
The usage of Cloud services requires a new paradigm; it cannot be approached with the same toolset and mindset we had while working On Premises or with traditional infrastructure.
Cloud adoption history is full of big successes, but it is also full of incredible, and very expensive, failures.
As with everything in technology (and organisation or methodology), there is no absolute “right thing to do” versus “wrong thing to do”. Cloud is not “good” and traditional hosting “bad”; the best solution depends on each specific context (the context being the architecture, processes and tools, project methodology, team organisation, available skills, and business requirements and objectives).
It is possible to use Cloud services with traditional business, processes and solutions too, but the best achievements are visible when a DevOps-organised team, using Agile methodologies, uses automation tools to deploy a Micro Services (or at least modular) architecture.
Micro Services and Modularity
If you have a big monolithic application that requires a fixed, immutable capacity, necessarily based on an IaaS (virtual machines) model, please do yourself a favour and keep it in traditional hosting; with Cloud you will just get higher costs and probably also lower performance.
The benefits of the Cloud are the ability to deploy continuously, to use dynamically allocated resources and to change deployment topologies across services and geographies. This can be achieved with modular solutions, where you can manage a fully distributed design composed of a series of small independent deployments.
Automation and CI/CD
With the Cloud you will find yourself creating multiple independent deployments, and the number of services and instances will constantly grow.
The interaction between services will also gradually become more complex.
It is not possible to manage this complexity using documents (such as detailed configuration, network or deployment schemas) or by applying configurations or updates manually.
Automation tools, orchestrators and service mesh management solutions must be used to keep the deployments under control.
Only by operating exclusively via these tools is it possible to guarantee that the state of each deployment is what it was meant to be (the documentation is inside the tools, as part of the pipelines, scripts and variables that were actually used to build the deployment), and only by knowing the starting state is it possible to apply changes safely.
Different skills (and so people) from the multidisciplinary team taking care of a project will have to cooperate to define each step of the automation and orchestration pipelines, and almost every member of the team will be able to operate them.
Cloud means elasticity. It is easy and quick to activate a new service or resource or to change an existing one. In the Cloud you pay for what you created, so it is important to be effective: create, test and destroy in quick cycles until you find the right setup, without leaving unused and unneeded resources active, knowing you will always be able to recreate them when needed.
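The create, test and destroy cycle can be sketched in a few lines. Nothing here talks to a real provider: `create_resource`, `destroy_resource` and the pass rule inside `run_experiment` are hypothetical stand-ins, and the set `active_resources` represents whatever you would be billed for. The point is the shape of the code, where destruction is guaranteed even if the test fails.

```python
from contextlib import contextmanager

active_resources = set()  # stands in for what the provider would bill


def create_resource(name: str) -> str:
    """Hypothetical provisioning call."""
    active_resources.add(name)
    return name


def destroy_resource(name: str) -> None:
    """Hypothetical teardown call."""
    active_resources.discard(name)


@contextmanager
def ephemeral(name: str):
    """Create a resource for the duration of a block, then always destroy it."""
    resource = create_resource(name)
    try:
        yield resource
    finally:
        destroy_resource(resource)


def run_experiment(size: int) -> bool:
    """Pretend load test: in this toy example only sizes of 4 or more pass."""
    with ephemeral(f"test-db-{size}") as db:
        assert db in active_resources  # exists only while we are testing
        return size >= 4


results = {size: run_experiment(size) for size in (2, 4, 8)}
smallest_ok = min(size for size, ok in results.items() if ok)

print("smallest size that passed:", smallest_ok)
print("resources left running:", active_resources)
```

In a real pipeline the same pattern is usually expressed through the automation tool itself, so that the teardown step is part of the pipeline and not something a person has to remember.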
DevOps is the practice of operations and development engineers participating together in the entire service lifecycle, from design through the development process to production support. DevOps is also characterized by operations staff using, for their systems work, the same techniques as developers. “DevOps is the application of Agile Methodology to System Administration” (Tom Limoncelli, “The Practice of Cloud System Administration”).
Core values (CAMS): Culture (people, process, tools); Automation (infrastructure as code); Measurement (measure everything); Sharing (collaboration and feedback).
DevOps is mostly about breaking down barriers between teams. An enormous amount of time is wasted with tickets sitting in queues, or individuals writing handoff documentation for the person sitting right next to them.
The complexity of a Cloud deployment cannot be managed by separate, siloed organisations.
Cloud Market and Services
Cloud is here to stay: adoption of Cloud services keeps growing at a rate of more than 50% each quarter.
Amazon AWS, Microsoft Azure and Google GCP are clearly recognised globally as the dominant providers. Alibaba, Oracle and IBM (now including Red Hat) are other important competitors. Many other companies, for example VMware or Rackspace, have created their own space in this market, providing full Cloud services, integrated solutions or professional services.
Cloud services are constantly evolving at a speed that has no precedent. New services are made available every month.
It is very important to spend effort (and time), for new projects but also for existing ones, analysing what has changed in the services we were using (new capabilities or a new cost model) and which new services could be a better fit for the projects’ needs.
Each service exists in multiple flavours. For example, something as basic as disk space can be provided with very different capabilities (local or geographical redundancy, the ability to take snapshots, online resizing, native HTTPS, SFTP or Rsync access, etc.), with very different attributes (in terms of number of operations supported, capacity and available space, etc.) and with different SLAs (for example, one disk could support up to 500 IOPS with no guaranteed service while another could support the same 500 IOPS but guaranteed).
You need to spend time getting the design of the deployment right in order to match the business requirements.
I strongly believe in four assumptions to be kept always in mind when designing a solution.
Infrastructure failures are a given. No matter if you are On Premises, physical or virtual, or in the Cloud, and no matter which vendor or provider you are using, you will always experience them. For this reason you have to design your deployment to cope with failures (multi-region, multi-cloud, graceful degradation).
You must have methods and tools to reduce as much as possible the probability of a bug reaching the production environment, but bugs will always exist and they will always find their way to the end user. For this reason you need to have tools and procedures in place to react and correct quickly. Do not think “zero bugs”; think “fast recovery”.
You need to design your processes and tools to minimize the possibility and the impact of human mistakes, but this will never get you to a “zero mistakes” situation. Processes and tools need to be in place to intercept the mistake and apply corrections.
You will design your solution to be used in a certain way, but one day a person will use it differently. This happens in frontend applications and backoffice applications, but also in the tools and pipelines for development and deployment. Be ready to recognise a different usage pattern and to support or even embrace it.
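As an illustration of the first assumption, designing for infrastructure failure, here is a minimal sketch of failover across regions with graceful degradation. The region names, the `DOWN_REGIONS` set and `fetch_from_region` are hypothetical placeholders for real calls to a deployed service.

```python
DOWN_REGIONS = {"region-a"}  # hypothetical outage used for the demo


def fetch_from_region(region: str) -> str:
    """Stand-in for a real call to a service deployed in one region."""
    if region in DOWN_REGIONS:
        raise ConnectionError(f"{region} is unavailable")
    return f"live data from {region}"


def fetch_with_failover(regions: list, fallback: str) -> str:
    """Try each region in order; degrade to a fallback if all of them fail."""
    for region in regions:
        try:
            return fetch_from_region(region)
        except ConnectionError:
            continue  # infrastructure failure: try the next region
    return fallback   # graceful degradation instead of an error page


primary_down = fetch_with_failover(["region-a", "region-b"], "cached data")

DOWN_REGIONS.add("region-b")  # now every region is unavailable
all_down = fetch_with_failover(["region-a", "region-b"], "cached data")

print(primary_down)
print(all_down)
```

The fallback here is a fixed string. In practice it could be cached content or a reduced feature set, chosen according to what the business can accept during an outage.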
Always remember that a technology is never “good” or “bad” in essence; the starting point for evaluating a technical choice is the business requirements and priorities.
No IT project can succeed if it does not have a business need behind it; a purely technological refactoring should be undertaken only if it is directly linked to business value.
The current business context demands speed and a shorter time to market; the current technology context is constantly changing and increasingly complex, removing any clear distinction between software and infrastructure. The only way to respond to such demands from business and technology is to run small teams that include all the required skills (and it is more about skills than roles now), teams that can design, execute and operate without depending on external teams.
The giants in the market, such as Amazon, Microsoft or Google, are constantly making new services available, and other companies are building tools and components. If you have a business need, first check whether something already exists on the market that you can use, but always remember that you will have to analyse it in depth and integrate it, building something around it or learning how best to use it (please remember the previous examples about missing backup or monitoring).
Reliability and security are key in Cloud environments because of the public nature of the deployments (so they are even more important than On Premises). It is essential to keep these two elements in mind from the design to the operation phase and to constantly evolve the deployment as things change.
The Cloud is an open space, where it is very quick, easy and inexpensive to experiment. Don’t miss this opportunity!
The content of this article was part of a presentation given at Deltatre S.p.A. in Turin on 15th November 2018. The slides are available on the Newesis YouTube channel https://www.youtube.com/watch?v=7nxKX9xRRGw&t=69s or on SlideShare at https://www.slideshare.net/RaunoDePasquale/newesis-introduction-to-the-cloud