Introduction to Cloud FinOps
What is FinOps #
Fin & Ops: Fin stands for the financial aspect. What people usually mean by FinOps is cloud cost management or cloud cost optimization, but it sounds better as FinOps.
Cloud can be expensive, and savings can make a real impact on the final bill. It is fair to say that poor cloud design can lead to overspending. But changing that design can be costly as well, which means FinOps can sometimes be as simple as knowing some good practices and tricks to save. More broadly, we can speak of a cultural practice around the cloud.
Here are some core principles.
- Only use what you really need.
- Small amounts add up quickly.
- You don't need a million-dollar cloud bill to start thinking about FinOps.
- Good spending vs bad spending: if high spending gives high value to the company, then it's fine. After all, business value is what should drive the cost of cloud.
- Everyone in the company should be aware of costs and play a part in optimization.
Two situations will influence the approach. In the first, you have been in the company long enough to grasp the context, maybe as a Cloud Architect who wants to optimize costs. In the second, you are a consultant who needs to understand the context at the same time as providing a FinOps analysis. In that case, you need to figure out where you stand: what the total spend and its forecast are, and what the context and business value look like.
But is saving your goal? Is time to market or innovation more important?
If you are not sure whether you should learn FinOps, Why You Should Learn FinOps as a DevOps could be for you.
Understanding the Cloud Bill #
This might sound surprising to someone who has never seen the level of detail of a cloud bill: it goes into so much detail that it can be difficult to spot where your spending goes. The granularity can seem overwhelming at first, but it is particularly useful for FinOps.
Tagging resources is another powerful tool to help understand where the money goes.
Budgets, Alerts and Forecast #
Budgets and alerts are good practices. They should be seen as ways to stay aware of overspending, not as limits you must not reach. Don't limit innovation based on a budget; it's just useful feedback on spending.
On personal accounts, set a monthly budget alert as low as $5. It will be triggered if you forget some expensive learning experiment running. You want to be aware fast enough to do something about it. This advice applies as soon as you open your free account.
I set the budget to my average monthly cost, with targets at 50% and 80% that each trigger an email. If the 50% email arrives on the 5th of the month, something is wrong.
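The "something is wrong" check above can be sketched as a tiny pacing rule. This is plain Python, not a GCP API; the linear pace and the 30-day month are assumptions for illustration:

```python
# Sketch of the pacing logic behind budget alerts (assumed values, not a GCP API).
# If an alert threshold fires much earlier in the month than a linear spending
# pace would reach it, something is wrong.

def expected_spend_fraction(day_of_month: int, days_in_month: int = 30) -> float:
    """Fraction of the monthly budget you expect to have spent by a given day."""
    return day_of_month / days_in_month

def alert_is_suspicious(threshold: float, day_of_month: int) -> bool:
    """True if crossing `threshold` (e.g. 0.5 for the 50% alert) on this day
    means spending is ahead of a linear monthly pace."""
    return threshold > expected_spend_fraction(day_of_month)

# The 50% alert arriving on the 5th of the month: clearly ahead of pace.
print(alert_is_suspicious(0.5, day_of_month=5))   # True
# The 80% alert arriving on the 27th: roughly on pace, nothing alarming.
print(alert_is_suspicious(0.8, day_of_month=27))  # False
```

Real spending is rarely perfectly linear, but even this crude rule catches the obvious spikes.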
Being alerted early about a credit spike gives you credibility, much better than someone reporting it to you 3 months later based on the financial sheets. Good alerts should be a priority.
Forecasts based on machine learning are a good indicator of where your spending is and where it is going.
Wasted Usage #
"Only use what you really need", as seen in the core principles above, means that everything that is not needed is wasted usage, and wasted money. We'll discuss this more deeply, but some examples are:
- Compute Engine instances left running: you are paying for CPU and RAM that could simply be turned off.
- Wrong sizing: oversized Compute Engine instances, Persistent Disks too big for no reason.
- SQL instances that could be shared.
- Network resources that are not needed.
Small charges matter #
It depends on your total amount, and the numbers are not the same with a $1 million cloud bill, but small daily charges add up quickly.
If you have 10 resources at around $0.50 per day, it seems like very little, but that is 10 × 0.5 × 30 = $150 per month. What seems small at first can be a real part of your bill.
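The arithmetic above, as a tiny helper (the $0.50/day figure is the illustrative price from the example, not a real quote):

```python
# "Small" daily charges over a month. The 30-day month is a simplification.
def monthly_cost(n_resources: int, daily_cost: float, days: int = 30) -> float:
    return n_resources * daily_cost * days

print(monthly_cost(10, 0.50))  # 150.0 — ten "tiny" resources become $150/month
```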
Resources listing #
The approach differs if you are the cloud person who does everything: in that case you most likely know every resource that has been created. In a larger company, it can be difficult to keep track of useless test resources created by developers. It helps to keep test projects separated from real production.
- Keep track of what each resource is used for; tags or labels are the way to go.
- Find resources that are not needed. This can mean investigating the resources, or the people around them.
- Keep track of what is created by others for testing purposes.
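One reason labels pay off: once every resource carries one, you can aggregate spend per team or purpose, and the untagged leftovers stand out. A sketch with made-up resources and costs (not a billing-export API):

```python
# Aggregate monthly cost per label; untagged resources surface immediately.
# Resource names, labels and costs below are invented for the example.
from collections import defaultdict

resources = [
    {"name": "web-1",    "label": "production", "monthly_cost": 120.0},
    {"name": "runner-1", "label": "ci",         "monthly_cost": 45.0},
    {"name": "test-vm",  "label": None,         "monthly_cost": 30.0},  # forgotten test VM
]

def spend_by_label(resources):
    totals = defaultdict(float)
    for r in resources:
        totals[r["label"] or "UNTAGGED"] += r["monthly_cost"]
    return dict(totals)

print(spend_by_label(resources))
# {'production': 120.0, 'ci': 45.0, 'UNTAGGED': 30.0}
```

The `UNTAGGED` bucket is exactly the pile of "who created this?" resources worth investigating first.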
Get people involved #
Give visibility on what you do and share your culture. Informing teams about cloud usage and good practices can help optimize. You might not know the entire architecture of the current cloud; getting people involved helps to point out waste.
Cultural, because it is not a one-shot effort but a shift of mindset. If someone is used to choosing oversized resources or always keeping draft resources running, the underlying problem won't go away.
Sometimes just talking about a resource, like "hey, do you really need this?", is enough. It might be overwhelming to review every single resource; that is where the tagging system comes in.
Turn off Compute Engine #
As a reminder, when a Compute Engine instance is turned off, you only pay for the disk: CPU and RAM are deallocated and not billed. So if you don't need it, turn it off; don't use it, don't pay for it.
Here is a concrete example: a GitLab runner on a separate instance from GitLab itself, in a 9am-to-5pm company. Let's keep it on from 7am to 7pm in case you have early or late workers, 5 days a week. In case of emergency, if someone needs the GitLab runner outside those hours, you can give them specific access to turn it on.
The mechanism is easy: in GCP and other cloud providers, you have a scheduler to start and stop Compute Engine instances.
Let's do the math.
- 7 × 24 = 168 hours in a full week
- 12 × 5 = 60 hours actually scheduled
This means you pay about a third of the price, saving almost two-thirds. Nice discount!
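The schedule math above as code. The $0.10/hour price is an assumption for illustration, not a real GCP rate:

```python
# Weekly cost of an always-on instance vs one scheduled 7am-7pm, 5 days a week.
HOURS_PER_WEEK = 7 * 24    # 168 hours in a full week
SCHEDULED_HOURS = 12 * 5   # 60 hours actually on

def weekly_cost(hourly_price: float, hours: float) -> float:
    return hourly_price * hours

always_on = weekly_cost(0.10, HOURS_PER_WEEK)   # $0.10/h is an assumed price
scheduled = weekly_cost(0.10, SCHEDULED_HOURS)
print(f"always on: ${always_on:.2f}, scheduled: ${scheduled:.2f}")
print(f"saving: {1 - SCHEDULED_HOURS / HOURS_PER_WEEK:.0%}")  # 64%
```

The saving ratio doesn't depend on the price, only on the schedule, which is why the trick scales to any instance size.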
You might consider this for every non-production or administrative Compute Engine instance:
- Bastion host (turn it on only for night calls; schedule a launch for when you start work)
- GitLab runner
- Staging and development environments
Launch compute only for a task #
Billing granularity, down to seconds and minutes, matters.
Compute Engine instances are billed for the time you use them, measured in seconds. This means you can launch an instance for just a 5-minute task and pay only for that amount of time, which makes things far cheaper than being billed a full hour every time you launch an instance.
It makes perfect sense to launch a big instance every night for heavy calculations and turn it off afterwards, or even a not-so-big instance if the result is not needed quickly.
This could be a good use case for serverless computing.
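To see why the billing granularity matters, compare per-second and per-hour billing for the 5-minute task (the $0.20/hour price is an assumed figure for illustration):

```python
import math

# Cost of a short task under per-second billing vs hypothetical per-hour billing.
def cost_per_second_billing(hourly_price: float, seconds: float) -> float:
    return hourly_price * seconds / 3600

def cost_per_hour_billing(hourly_price: float, seconds: float) -> float:
    # You would pay for every started hour.
    return hourly_price * math.ceil(seconds / 3600)

task = 5 * 60  # a 5-minute task, in seconds
print(cost_per_second_billing(0.20, task))  # ~$0.017: pay for 5 minutes only
print(cost_per_hour_billing(0.20, task))    # $0.20: a full hour every launch
```

At hourly granularity, running the task 12 times a day would cost 12 full hours; at per-second granularity, one hour total.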
Preemptible or Spot #
This is the opposite of commitment. In GCP, a preemptible instance lives 24h at most and can be killed at any time (with a catchable signal).
You also have this concept in a Kubernetes cluster: nodes can be tagged. This is a good use case for pods that are fault tolerant and able to resume a calculation where it left off. It can easily be done with, for example, a single cluster with two node pools, one of them preemptible, and a constraint on the deployment.
Committed use discounts #
If you know servers will run 24/7, you might want to commit to resources, meaning you reserve them. In GCP, you commit independently to CPU and RAM in a specific region; 1- or 3-year commitments give you up to a 50% discount.
The commitment is not tied to a specific Compute Engine instance but to a total number of CPUs and amount of RAM, which means resizing an instance does not break your commitment. The cost is not upfront but a monthly fee that you pay even if it's unused, so don't waste reservations. This could be useful for servers like:
- FTP
- Kubernetes, if you know you have at least a few nodes always running.
- Services like Redis, MongoDB or RabbitMQ that run all the time.
For the above, you could make a 20 CPU and 80 GB RAM commitment.
You can do the same with SUSE Linux Enterprise Server for SAP Applications, if you have a server that runs 24/7.
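A rough sketch of what such a commitment is worth. The per-CPU and per-GB prices and the discount rate below are assumptions for illustration; check current GCP pricing before deciding:

```python
# Rough value of a resource commitment. ALL prices here are assumed placeholders.
ON_DEMAND_CPU_MONTH = 16.0   # assumed $/vCPU/month
ON_DEMAND_GB_MONTH = 2.0     # assumed $/GB RAM/month
DISCOUNT_1Y = 0.37           # assumed 1-year commitment discount

def monthly_on_demand(cpus: int, ram_gb: int) -> float:
    return cpus * ON_DEMAND_CPU_MONTH + ram_gb * ON_DEMAND_GB_MONTH

def monthly_committed(cpus: int, ram_gb: int) -> float:
    # Paid every month whether the capacity is used or not.
    return monthly_on_demand(cpus, ram_gb) * (1 - DISCOUNT_1Y)

# The 20 CPU / 80 GB example from the text:
print(monthly_on_demand(20, 80))            # 480.0
print(round(monthly_committed(20, 80), 2))  # 302.4 — billed even if unused
```

The flip side is visible in the code: `monthly_committed` has no usage parameter, so the commitment only pays off if the resources actually run.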
Network Costs #
VPCs are free to create, but networking can be costly; a good understanding of the billing system can help you reduce costs.
Communication between servers should use internal IP addresses. Avoid egress as much as possible. Gzipping HTTP responses is a good practice for speed, and it also decreases the amount of data you send: less egress, less cost.
Network fees apply between 2 instances in 2 different zones, but there are no fees inside a region for Cloud Storage: an instance in zone 1 can write to a bucket and an instance in zone 2 can read it. Sharing data via a bucket could reduce the cost for instances in 2 different zones.
In Google Cloud Platform, you have 2 network tiers: Premium (the default) and Standard. You get about 30% off by switching to the Standard tier; do you really need the giant worldwide backbone?
Persistent Disk vs Snapshot #
To go further on turning off Compute Engine: once the instance is off, the only expensive part left is the Persistent Disk (the hard drive). This is even more true if it's an SSD.
Here are two examples from blog posts I wrote for a personal need.
I automated the process with Terraform. The idea is to have a nice but costly SSD. The benefit is double:
- Snapshots are cheaper than persistent disks.
- A 50 GB SSD that is not full can be saved as a 20 GB snapshot.
The steps:
- Turn off
- Snapshot
- Delete the Persistent Disk
- Create an instance from the snapshot
- Turn on
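The numbers behind the trick, with assumed per-GB prices (placeholders in the right order of magnitude, not official GCP rates):

```python
# Keeping a stopped instance's SSD vs keeping only a snapshot of it.
# Both per-GB prices are assumptions for illustration.
SSD_GB_MONTH = 0.17       # assumed $/GB/month for SSD persistent disk
SNAPSHOT_GB_MONTH = 0.026 # assumed $/GB/month for snapshot storage

disk_cost = 50 * SSD_GB_MONTH           # the 50 GB SSD kept while the VM is off
snapshot_cost = 20 * SNAPSHOT_GB_MONTH  # same data saved as a 20 GB snapshot

print(f"disk: ${disk_cost:.2f}/month, snapshot: ${snapshot_cost:.2f}/month")
print(f"saving: {1 - snapshot_cost / disk_cost:.0%}")
```

Two effects stack: the cheaper per-GB price of snapshots, and the fact that a snapshot only stores the used part of the disk.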
Insight recommendations and cost optimization #
This might be the easiest part: Compute Engine gives you recommendations when a resource is underused, directly in the Compute Engine tab.
Something similar exists for Kubernetes: GCP has a "Cost optimization" tab showing the CPU and RAM requests and limits of GKE workloads, a good indication of whether your cluster can be optimized.
Cloud Storage #
Cloud Storage can be mounted into a Compute Engine instance, and unlike a Persistent Disk, you don't need to provision more than you actually use.
Set the storage to the right class: Nearline, Coldline and Archive are cheaper to store, but retrieval fees are higher. You can follow rules of thumb: if you access a file about once a month, go for Nearline; once a quarter, Coldline; once a year, Archive.
Keep in mind that Archive is cheap to store but can be expensive to download. It's a good fit for data you keep one copy of and leave another copy in archive, hoping to never need it (needing it would mean something bad happened). But the day you really need it, you are ready to spend big money to get it back.
In GCP, you have "lifecycle" rules that change the storage class of objects inside a bucket. After xx days, when you assume nobody will access a file anymore, set a lifecycle rule that downgrades its class or even deletes it.
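A lifecycle configuration is just a small JSON document. A sketch of its shape, built in Python (the day counts and target class are example values; apply such a config to a bucket with the gsutil/gcloud tooling):

```python
import json

# Example GCS lifecycle configuration: downgrade the storage class after 30 days,
# delete after a year. The ages and COLDLINE target are illustrative choices.
lifecycle = {
    "rule": [
        {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
         "condition": {"age": 30}},
        {"action": {"type": "Delete"},
         "condition": {"age": 365}},
    ]
}
print(json.dumps(lifecycle, indent=2))
```

Once set, the downgrade and deletion happen automatically: the savings continue without anyone remembering to clean the bucket.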
Logging #
Two parts in this one.
- Log volume can be costly; keep it minimal. A simple example is HTTP statuses: do you really need the 200 statuses, or would 5xx and 4xx be enough?
- Logs and monitoring help you spot wasted resources, and therefore wasted money, and help you spot anomalies and spikes.
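The "keep only 4xx/5xx" idea is a one-line predicate. This is a generic sketch applied to dicts, not a Cloud Logging API, and the field name `status` is an assumption:

```python
# Keep client and server errors, drop successful and redirect responses.
def keep_log_entry(entry: dict) -> bool:
    return entry.get("status", 0) >= 400

entries = [{"status": 200}, {"status": 404}, {"status": 500}, {"status": 301}]
kept = [e for e in entries if keep_log_entry(e)]
print(kept)  # [{'status': 404}, {'status': 500}]
```

In practice you would express the same condition as an exclusion filter in your logging pipeline, so the 200s are never ingested or billed at all.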
Conclusion #
This is really a light introduction to the subject, more a set of recipes than a real deep dive. If what you are looking for is a deep understanding of FinOps, you'll find other resources on the subject.
As cloud adoption gets bigger and bigger, and cloud costs can be an important part of a company's financial sheet, it seems FinOps still has good days ahead. FinOps considerations should be everyone's concern.
I hope this helps and gives you ideas for potential savings. I also made a quick checklist for FinOps.