Introduction to Cloud FinOps
What is FinOps #
Fin & Ops: Fin stands for the financial aspect. What people usually mean by FinOps is cloud cost management or cloud cost optimization, but it sounds better as FinOps.
Cloud can be expensive, and savings can make a real impact on the final bill. It is fair to say that poor cloud design can lead to overspending. But changing that design can be costly as well, which means FinOps can sometimes be as simple as knowing some good practices and tricks to save. More broadly, we can speak of a cultural practice around the cloud.
Here are some core principles.
- Only use what you really need.
- Small amounts add up quickly.
- You don't need a million-dollar cloud bill to start thinking about FinOps.
- Good spending vs bad spending: if high spending gives high value to the company, then it's fine. After all, business value is what should drive the cost of cloud.
- Everyone in the company should be aware of costs and play a part in optimization.
Two situations will influence the approach. In the first, you have been in the company long enough to grasp the context, maybe as a Cloud Architect who wants to optimize costs. In the second, you are a consultant who needs to understand the context at the same time as providing a FinOps analysis. In that case, you need to figure out where you stand: what the total spend and its forecast are, and what the context and business value look like.
But is saving your goal? Is time to market or innovation more important?
If you are not sure whether you should learn FinOps, Why You Should Learn FinOps as a DevOps could be for you.
Understanding the Cloud Bill #
This might sound surprising to someone who has never seen the level of detail of a cloud bill: it goes into so much detail that it can be difficult to spot where your spending goes. The granularity can seem overwhelming at first, but it is particularly useful for FinOps.
Tagging resources is another powerful tool to help understand where the money goes.
Budgets, Alerts and Forecast #
Budgets and alerts are good practices. They should be seen as ways to stay aware of overspending, not as limits you must not reach. Don't limit innovation based on a budget; it's just useful feedback on spending.
On personal accounts, set a monthly budget alert as low as $5. It will be triggered if you forget some expensive learning experiment running. You want to be aware fast enough to do something about it. This advice applies as soon as you open your free account.
I set the budget to my average monthly cost, with targets at 50% and 80% that each trigger an email. If the 50% email arrives on the 5th of the month, something is wrong.
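The "something is wrong" check above can be sketched as a tiny pacing rule. This is plain Python, not a GCP API; the linear pace and the 30-day month are assumptions for illustration:

```python
# Sketch of the pacing logic behind budget alerts (assumed values, not a GCP API).
# If an alert threshold fires much earlier in the month than a linear spending
# pace would reach it, something is wrong.

def expected_spend_fraction(day_of_month: int, days_in_month: int = 30) -> float:
    """Fraction of the monthly budget you expect to have spent by a given day."""
    return day_of_month / days_in_month

def alert_is_suspicious(threshold: float, day_of_month: int) -> bool:
    """True if crossing `threshold` (e.g. 0.5 for the 50% alert) on this day
    means spending is ahead of a linear monthly pace."""
    return threshold > expected_spend_fraction(day_of_month)

# The 50% alert arriving on the 5th of the month: clearly ahead of pace.
print(alert_is_suspicious(0.5, day_of_month=5))   # True
# The 80% alert arriving on the 27th: roughly on pace, nothing alarming.
print(alert_is_suspicious(0.8, day_of_month=27))  # False
```

Real spending is rarely perfectly linear, but even this crude rule catches the obvious spikes.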
Being alerted early about a credit spike gives you credibility, much better than someone reporting it to you 3 months later based on the financial sheets. Good alerts should be a priority.
Forecasts based on machine learning are a good indicator of where your spending is and where it is going.
Wasted Usage #
"Only use what you really need", as seen in the core principles above, means that everything that is not needed is wasted usage, and wasted money. We'll discuss this more deeply, but some examples are:
- Compute Engine instances left running: you are paying for CPU and RAM that could simply be turned off.
- Wrong sizing: oversized Compute Engine instances, Persistent Disks too big for no reason.
- SQL instances that could be shared.
- Network resources that are not needed.
Small charges matter #
It depends on your total amount, and the numbers are not the same with a $1 million cloud bill, but small daily charges add up quickly.
If you have 10 resources at around $0.50 per day, it seems like very little, but that is 10 × 0.5 × 30 = $150 per month. What seems small at first can be a real part of your bill.
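The arithmetic above, as a tiny helper (the $0.50/day figure is the illustrative price from the example, not a real quote):

```python
# "Small" daily charges over a month. The 30-day month is a simplification.
def monthly_cost(n_resources: int, daily_cost: float, days: int = 30) -> float:
    return n_resources * daily_cost * days

print(monthly_cost(10, 0.50))  # 150.0 — ten "tiny" resources become $150/month
```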
Resources listing #
The approach differs if you are the cloud person who does everything: in that case you most likely know every resource that has been created. In a larger company, it can be difficult to keep track of useless test resources created by developers. It helps to keep test projects separated from real production.
- Keep track of what each resource is used for; tags or labels are the way to go.
- Find resources that are not needed. This can mean investigating the resources, or the people around them.
- Keep track of what is created by others for testing purposes.
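One reason labels pay off: once every resource carries one, you can aggregate spend per team or purpose, and the untagged leftovers stand out. A sketch with made-up resources and costs (not a billing-export API):

```python
# Aggregate monthly cost per label; untagged resources surface immediately.
# Resource names, labels and costs below are invented for the example.
from collections import defaultdict

resources = [
    {"name": "web-1",    "label": "production", "monthly_cost": 120.0},
    {"name": "runner-1", "label": "ci",         "monthly_cost": 45.0},
    {"name": "test-vm",  "label": None,         "monthly_cost": 30.0},  # forgotten test VM
]

def spend_by_label(resources):
    totals = defaultdict(float)
    for r in resources:
        totals[r["label"] or "UNTAGGED"] += r["monthly_cost"]
    return dict(totals)

print(spend_by_label(resources))
# {'production': 120.0, 'ci': 45.0, 'UNTAGGED': 30.0}
```

The `UNTAGGED` bucket is exactly the pile of "who created this?" resources worth investigating first.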
Get people involved #
Give visibility on what you do and share your culture. Informing teams about cloud usage and good practices can help optimize. You might not know the entire architecture of the current cloud; getting people involved helps to point out waste.
Cultural, because it is not a one-shot effort but a shift of mindset. If someone is used to choosing oversized resources or always keeping draft resources running, the underlying problem won't go away.
Sometimes just talking about a resource, like "hey, do you really need this?", is enough. It might be overwhelming to review every single resource; that is where the tagging system comes in.
Turn off Compute Engine #
As a reminder, when a Compute Engine instance is turned off, you only pay for the disk: CPU and RAM are deallocated and not billed. So if you don't need it, turn it off; don't use it, don't pay for it.
Here is a concrete example: a GitLab runner on a separate instance from GitLab itself, in a 9am-to-5pm company. Let's keep it on from 7am to 7pm in case you have early or late workers, 5 days a week. In case of emergency, if someone needs the GitLab runner outside those hours, you can give them specific access to turn it on.
The mechanism is easy: in GCP and other cloud providers, you have a scheduler to start and stop Compute Engine instances.
Let's do the math.
- 7 × 24 = 168 hours in a full week
- 12 × 5 = 60 hours actually scheduled
This means you pay about a third of the price, saving almost two-thirds. Nice discount!
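The schedule math above as code. The $0.10/hour price is an assumption for illustration, not a real GCP rate:

```python
# Weekly cost of an always-on instance vs one scheduled 7am-7pm, 5 days a week.
HOURS_PER_WEEK = 7 * 24    # 168 hours in a full week
SCHEDULED_HOURS = 12 * 5   # 60 hours actually on

def weekly_cost(hourly_price: float, hours: float) -> float:
    return hourly_price * hours

always_on = weekly_cost(0.10, HOURS_PER_WEEK)   # $0.10/h is an assumed price
scheduled = weekly_cost(0.10, SCHEDULED_HOURS)
print(f"always on: ${always_on:.2f}, scheduled: ${scheduled:.2f}")
print(f"saving: {1 - SCHEDULED_HOURS / HOURS_PER_WEEK:.0%}")  # 64%
```

The saving ratio doesn't depend on the price, only on the schedule, which is why the trick scales to any instance size.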
You might consider this for every non-production or administrative Compute Engine instance:
- Bastion host (turn it on only for night calls; schedule a launch for when you start work)
- GitLab runner
- Staging and development environments
Launch compute only for a task #
Billing granularity, down to seconds and minutes, matters.
Compute Engine instances are billed for the time you use them, measured in seconds. This means you can launch an instance for just a 5-minute task and pay only for that amount of time, which makes things far cheaper than being billed a full hour every time you launch an instance.
It makes perfect sense to launch a big instance every night for heavy calculations and turn it off afterwards, or even a not-so-big instance if the result is not needed quickly.
This could be a good use case for serverless computing.
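To see why the billing granularity matters, compare per-second and per-hour billing for the 5-minute task (the $0.20/hour price is an assumed figure for illustration):

```python
import math

# Cost of a short task under per-second billing vs hypothetical per-hour billing.
def cost_per_second_billing(hourly_price: float, seconds: float) -> float:
    return hourly_price * seconds / 3600

def cost_per_hour_billing(hourly_price: float, seconds: float) -> float:
    # You would pay for every started hour.
    return hourly_price * math.ceil(seconds / 3600)

task = 5 * 60  # a 5-minute task, in seconds
print(cost_per_second_billing(0.20, task))  # ~$0.017: pay for 5 minutes only
print(cost_per_hour_billing(0.20, task))    # $0.20: a full hour every launch
```

At hourly granularity, running the task 12 times a day would cost 12 full hours; at per-second granularity, one hour total.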
Preemptible or Spot #
This is the opposite of commitment. In GCP, a preemptible instance lives 24h at most and can be killed at any time (with a catchable signal).
You also have this concept in a Kubernetes cluster: nodes can be tagged. This is a good use case for pods that are fault tolerant and able to resume a calculation where it left off. It can easily be done with, for example, a single cluster with two node pools, one of them preemptible, and a constraint on the deployment.
Committed use discounts #
If you know servers will run 24/7, you might want to commit to resources, meaning you reserve them. In GCP, you commit independently to CPU and RAM in a specific region; 1- or 3-year commitments give you up to a 50% discount.
The commitment is not tied to a specific Compute Engine instance but to a total number of CPUs and amount of RAM, which means resizing an instance does not break your commitment. The cost is not upfront but a monthly fee that you pay even if it's unused, so don't waste reservations. This could be useful for servers like:
- FTP
- Kubernetes, if you know you have at least a few nodes always running.
- Services like Redis, MongoDB or RabbitMQ that run all the time.
For the above, you could make a 20 CPU and 80 GB RAM commitment.
You can do the same with SUSE Linux Enterprise Server for SAP Applications, if you have a server that runs 24/7.
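A rough sketch of what such a commitment is worth. The per-CPU and per-GB prices and the discount rate below are assumptions for illustration; check current GCP pricing before deciding:

```python
# Rough value of a resource commitment. ALL prices here are assumed placeholders.
ON_DEMAND_CPU_MONTH = 16.0   # assumed $/vCPU/month
ON_DEMAND_GB_MONTH = 2.0     # assumed $/GB RAM/month
DISCOUNT_1Y = 0.37           # assumed 1-year commitment discount

def monthly_on_demand(cpus: int, ram_gb: int) -> float:
    return cpus * ON_DEMAND_CPU_MONTH + ram_gb * ON_DEMAND_GB_MONTH

def monthly_committed(cpus: int, ram_gb: int) -> float:
    # Paid every month whether the capacity is used or not.
    return monthly_on_demand(cpus, ram_gb) * (1 - DISCOUNT_1Y)

# The 20 CPU / 80 GB example from the text:
print(monthly_on_demand(20, 80))            # 480.0
print(round(monthly_committed(20, 80), 2))  # 302.4 — billed even if unused
```

The flip side is visible in the code: `monthly_committed` has no usage parameter, so the commitment only pays off if the resources actually run.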
Network Costs #
VPCs are free to create, but networking can be costly; a good understanding of the billing system can help you reduce costs.
Communication between servers should use internal IP addresses. Avoid egress as much as possible. Gzipping HTTP responses is a good practice for speed, and it also decreases the amount of data you send: less egress, less cost.
Network fees apply between 2 instances in 2 different zones, but there are no fees inside a region for Cloud Storage: an instance in zone 1 can write to a bucket and an instance in zone 2 can read it. Sharing data via a bucket could reduce the cost for instances in 2 different zones.
In Google Cloud Platform, you have 2 network tiers: Premium (the default) and Standard. You get about 30% off by switching to the Standard tier; do you really need the giant worldwide backbone?
Persistent Disk vs Snapshot #
To go further on turning off Compute Engine: once the instance is off, the only expensive part left is the Persistent Disk (the hard drive). This is even more true if it's an SSD.
Here are two examples from blog posts I wrote for a personal need.
I automated the process with Terraform. The idea is to have a nice but costly SSD. The benefit is double:
- Snapshots are cheaper than persistent disks.
- A 50 GB SSD that is not full can be saved as a 20 GB snapshot.
The steps:
- Turn off
- Snapshot
- Delete the Persistent Disk
- Create an instance from the snapshot
- Turn on
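The numbers behind the trick, with assumed per-GB prices (placeholders in the right order of magnitude, not official GCP rates):

```python
# Keeping a stopped instance's SSD vs keeping only a snapshot of it.
# Both per-GB prices are assumptions for illustration.
SSD_GB_MONTH = 0.17       # assumed $/GB/month for SSD persistent disk
SNAPSHOT_GB_MONTH = 0.026 # assumed $/GB/month for snapshot storage

disk_cost = 50 * SSD_GB_MONTH           # the 50 GB SSD kept while the VM is off
snapshot_cost = 20 * SNAPSHOT_GB_MONTH  # same data saved as a 20 GB snapshot

print(f"disk: ${disk_cost:.2f}/month, snapshot: ${snapshot_cost:.2f}/month")
print(f"saving: {1 - snapshot_cost / disk_cost:.0%}")
```

Two effects stack: the cheaper per-GB price of snapshots, and the fact that a snapshot only stores the used part of the disk.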
Insight recommendations and cost optimization #
This might be the easiest part: Compute Engine gives you recommendations when a resource is underused, directly in the Compute Engine tab.
Something similar exists for Kubernetes: GCP has a "Cost optimization" tab showing the CPU and RAM requests and limits of GKE workloads, a good indication of whether your cluster can be optimized.
Cloud Storage #
Cloud Storage can be mounted into a Compute Engine instance, and unlike a Persistent Disk, you don't need to provision more than you actually use.
Set the storage to the right class: Nearline, Coldline and Archive are cheaper to store, but retrieval fees are higher. You can follow rules of thumb: if you access a file about once a month, go for Nearline; once a quarter, Coldline; once a year, Archive.
Keep in mind that Archive is cheap to store but can be expensive to download. It's a good fit for data you keep one copy of and leave another copy in archive, hoping to never need it (needing it would mean something bad happened). But the day you really need it, you are ready to spend big money to get it back.
In GCP, you have "lifecycle" rules that change the storage class of objects inside a bucket. After xx days, when you assume nobody will access a file anymore, set a lifecycle rule that downgrades its class or even deletes it.
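A lifecycle configuration is just a small JSON document. A sketch of its shape, built in Python (the day counts and target class are example values; apply such a config to a bucket with the gsutil/gcloud tooling):

```python
import json

# Example GCS lifecycle configuration: downgrade the storage class after 30 days,
# delete after a year. The ages and COLDLINE target are illustrative choices.
lifecycle = {
    "rule": [
        {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
         "condition": {"age": 30}},
        {"action": {"type": "Delete"},
         "condition": {"age": 365}},
    ]
}
print(json.dumps(lifecycle, indent=2))
```

Once set, the downgrade and deletion happen automatically: the savings continue without anyone remembering to clean the bucket.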
Logging #
Two parts in this one.
- Log volume can be costly; keep it minimal. A simple example is HTTP statuses: do you really need the 200 statuses, or would 5xx and 4xx be enough?
- Logs and monitoring help you spot wasted resources, and therefore wasted money, and help you spot anomalies and spikes.
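The "keep only 4xx/5xx" idea is a one-line predicate. This is a generic sketch applied to dicts, not a Cloud Logging API, and the field name `status` is an assumption:

```python
# Keep client and server errors, drop successful and redirect responses.
def keep_log_entry(entry: dict) -> bool:
    return entry.get("status", 0) >= 400

entries = [{"status": 200}, {"status": 404}, {"status": 500}, {"status": 301}]
kept = [e for e in entries if keep_log_entry(e)]
print(kept)  # [{'status': 404}, {'status': 500}]
```

In practice you would express the same condition as an exclusion filter in your logging pipeline, so the 200s are never ingested or billed at all.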
Conclusion #
This is really a light introduction to the subject, more a set of recipes than a real deep dive. If what you are looking for is a deep understanding of FinOps, you'll find other resources on the subject.
As cloud adoption gets bigger and bigger, and cloud costs can be an important part of a company's financial sheet, it seems FinOps still has good days ahead. FinOps considerations should be everyone's concern.
I hope this helps and gives you ideas for potential savings. I also made a quick checklist for FinOps.