Meet your Instructor - Constanze - Staff Engineer at the VSC-Research Center
Last Updated: 2023-12-24
The "Austrian DataLAB and Services" and the associated "AustrianOpenCloudCommunity" demonstrates how digital technologies can be used in teaching and research. So, in 2022, we asked the ministry to approve a prototypical lecture to also teach the cloud technologies on top of our newly created cloud stacks. Summer 2024 is the second round of lectures, which will partially be recorded and converted into self-service labs.
The infrastructure belongs exclusively to TU Wien and the VSC; however, any student from any university is welcome to co-enroll.
The lectures follow the Open Source philosophy and aim to convey the open source mentality, mindset and culture, and their value (cf. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4693148 ).
This class is for students who are reasonably skilled at overall software engineering and who would like to leverage cloud computing and cloud capabilities to widen the impact of their engineering creativity.
For students who are toying with the idea of founding a tech startup in particular, this class will give a solid overview of the challenges of operations, security, product placement and teamwork. Cloud native is one option for realizing a product idea without too much up-front investment; however, it is not without its own challenges. To this end, this class is designed to give a practical, experience-driven overview covering the most crucial aspects of bringing a software product to market.
The topics covered span Architecture, Security, Development, Operations, Management and Psychology; every class touches on all of them. The students will be treated as mature adults and are expected to use their own best judgment in how to solve problems. Exams are open-book. Constructive behaviour and active problem solving are encouraged, and participants can actively contribute to the content of these labs.
(Comic licensed under the Creative Commons Attribution-NonCommercial 2.5 license, retrieved from xkcd.)
Each week of the course is dedicated to one topic. The pedagogical approach is based on "agency" and "creativity"; teamwork is explicitly required, as it is close to real life.
You will explore fundamental concepts (from scratch) and real-life problems
You will be given homework to adequately prepare you for the hands-on class.
You will be working on live systems and you will receive a cloud account. You are responsible for treating your accounts with adequate due diligence, and be aware that we live in an age where everything that is on the internet is fair game (Freiwild).
By the end of the semester, your startup (team) will have a running service, including its deliverables.
Over the semester, you will form teams that will build a very simple product on your "own" cloud account.
In this first week, we will focus on "Cloud, Automation and Infrastructure as Code":
Evolutionary architecture is the art of constraining the parameter space such that entropy evolves into consistent patterns
(Credit to Forrest Brazeal)
Today's class is about starting your cloud journey in such a way that you don't have technical debt from Day 1. The setup is necessarily tedious, but it will stay with you for the duration of the semester, so your diligence will pay off.
We must realise that in the cloud "everything moves". So, if we aim to build a foundation, it must be a foundation that is rock solid while being able to handle constant change.
From Day 1, you are encouraged to create a mental practice of internally visualizing all components that you have in play and their mutual interactions and interdependencies. You may use any productivity tools that you find helpful.
The minimal set of ingredients for working with the cloud is:
Take your paper and your pen and sketch how you could set up the "bootstrap segment" of your cloud account.
The cloud bootstrap is where we handle the most powerful credentials. Unfortunately, this is also where most companies make mistakes, because it can be a bit of a chicken/egg problem, or because an account is set up quickly and never cleaned up.
From a threat-modelling perspective, we will defend against our worst opponent: ourselves as cloud admins. How can I protect my own accounts if I am compromised, tired, sick (Covid fog, real story) or otherwise acting against my own best interests?
It is thus absolutely paramount that we securely bootstrap our automation with so-called break-glass access and remove ourselves (as admins) from direct contact with our crown jewels.
Shared ownership
The fundamental design principle behind modern cloud efficiency is abstraction, i.e. to each their own: each part of the stack is owned, maintained and developed by a dedicated entity. In the case of the large hyperscalers, this means that a developer is treated as an end user by e.g. Microsoft, whereas for the company that employs the developer, Microsoft may be only one of many service providers, and their customers may never know that Microsoft is involved.
This leads to a highly distributed web of shared ownership. There is no one way to define any one relationship; as consumers, we must read and stay up to date with terms and conditions and SLOs (service level objectives). This is relevant especially for liabilities, but even for individuals just playing around: accidentally violating TnCs can e.g. lead to you being silently shunned.
It depends...
One thing is for sure though: you need to be very aware of what pieces of your platform or service you own and control, and which you don't. And what you will do if a service fails.
For our Microsoft example, see the illustration (make sure to check that it is up to date if you want to use it for decision making).
It can be tempting to manually generate things by clicking on buttons in UIs, but it always backfires. This is why this whole lab is in code. One of your key learnings this week will be to build a mental image of "who is doing what on behalf of whom"; it is important to draw this mental image and compare it with the code and the currently existing state, until they all match.
"Cloud" in the very beginning was just "Infrastructure as a Service" . Nowadays it is so much more and - if correctly leveraged - cloud can be extremely powerful. One of the biggest reasons is the open source nature of "How to do something" and being able to replicate other peoples projects, combine them, modify them and build on top "of the shoulders of giants".
This means that *everything* is written "as Code", lives in git, and gets applied via some form of automation.
Do yourself a favour, and keep your infrastructure, your configuration, your init files, your patches, your policies, your users, your role-bindings and so on ... in git.
Please read: CNCF Landscape - Automation Tile
For the HandsOn portion, we will use OpenTofu ("Terraform") and Ansible, today on Azure. We will meet OpenStack later in class. Note that this landscape doesn't include all tools and projects; some have been declared "winners" and are de-facto standards, whereas others may not survive much longer or have been de-facto deprecated. One notable absence in the landscape is crossplane.io, which may be included in future versions of this class.
In order to fulfill the promise of build once, run anywhere, our automation needs to be foolproof. Commands sent over the network to some API can fail. Thus, many designs are eventually consistent and failure tolerant from first principles.
Most importantly, when we design our infrastructure as code, the state in git must match the state on the cloud: running the code multiple times must yield the same result. This is called idempotency, and we will meet this principle again in message queues and in k8s.
When choosing how to create and manage large infrastructures, we thus "declare" our desired target state, which is committed to version control; this file is translated via automation into API calls that make sure that the cloud state equals the text in the file. We as humans have no control over the order in which the state is eventually achieved.
Sometimes, we need configuration (daemons, mounts etc.) that just doesn't work this way; for that we use (also idempotent, please) imperative methods.
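As a small taste of the declarative style, here is a minimal sketch (the resource and group names are made up for illustration, assuming the azurerm Terraform provider):

```hcl
# A declarative resource: we describe the desired end state, not the steps.
# All names below are hypothetical examples, not part of the lab templates.
resource "azurerm_resource_group" "example" {
  name     = "rg-student-example"
  location = "westeurope"
}
```

Running `terraform apply` once creates the group; running it again compares desired against actual state and reports no changes: that is idempotency in practice.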
Pay close attention to your Service Connection, your Library Contexts and your current stage, and be intentional. It will be very important to run from the correct branch and the correct stage of a pipeline.
git clone git@ssh.dev.azure.com:v3/AOCC-Lectures/HandsOnCloudNative/HandsOnCloudNative
cd HandsOnCloudNative
git checkout features/example
git checkout -b features/<myname>
Some branches from last year are still there; you can use them for reference. You need to make sure that your SSH key is known to Azure DevOps.
Now, we will face the first problem in multi-tenant environments: finding the right tenant (or directory). Whether in the IdPs, the Kubernetes clusters or the cloud portals, everything looks the same in each tenant, but it is crucial to be in the right one; this applies to the CLI as well as the UI.
Navigate around in the Azure Portal. Your most useful feature is "All Resources"; unfortunately, neither AWS nor GCP have this feature. It is great for getting an overview.
You should see two resource groups; in one there is a key vault. Check it out, click around, explore your options and compare with the suggestions listed above in the Security Considerations. In the vault, there are the credentials to your logical section of this cloud account. Treat this as if it were your root key. You do *not* need to ever take them out of the vault, as you'll see below.
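To make "never take them out of the vault" concrete, here is a hedged sketch of how terraform itself can read a secret by reference (vault, group and secret names are hypothetical; adapt them to what you actually see in the portal):

```hcl
# Hypothetical names: look them up in your assigned resource groups.
data "azurerm_key_vault" "bootstrap" {
  name                = "kv-student-bootstrap"
  resource_group_name = "rg-student-bootstrap"
}

data "azurerm_key_vault_secret" "sp_secret" {
  name         = "service-principal-secret"
  key_vault_id = data.azurerm_key_vault.bootstrap.id
}

# Referenced as data.azurerm_key_vault_secret.sp_secret.value in code;
# the provider marks it sensitive, and no human ever copies it out.
```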
Now, we need to solve the chicken-and-egg problem of "creating state", i.e. we need a place to safely store a statefile that the automation agent can access, but we can't yet run the agent to create that place. Your statefile(s) are absolutely crucial: they contain sensitive information and hold the current truth. But they are just JSON files, so we use some cloud-native features to make sure we don't accidentally mess them up:
iac.tfstate
An empty statefile looks like this:
{
"version": 4,
"terraform_version": "1.6.6",
"serial": 23,
"lineage": "8dfadfe-1853-19d1-6c6d-7bfde3191b5e",
"outputs": {},
"resources": [],
"check_results": null
}
So now you need to manually create such an Azure Storage Account, and inside it a (Blob Storage) Container, with the right properties, using the UI. This is an anti-pattern exercise, so you will please not do manual UI ever again 😀
Your settings are:
| Issue | Setting | Why | Comment |
|---|---|---|---|
| Location | West Europe | we want GDPR to apply | typically put in the same location as your assets, since this is mostly a legal question |
| Encryption** | Yes (provided by Microsoft) | you need to protect this file! | it is one of the most sensitive files you have, in terms of powerful credentials, especially if it's a bootstrap statefile |
| Immutability | Yes (in Production) | audit and traceability | also debugging and protection against accidental overrides |
| Replication | LRS (for this demo); GRS in Prod | in Prod, you need this redundant, in case Azure has an outage | depends on your reliability requirements and cost-willingness |
**Encryption: You could bring your own key; that is quite feasible. Please do not use an HSM (Hardware Security Module): they are very expensive and have purge protection. Please do not switch on Microsoft Defender (we will discuss what it does in class).
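For later reference only (this exercise is deliberately manual), the same storage account and container expressed as code could look roughly like this hedged sketch, with hypothetical names and azurerm 3.x provider syntax assumed:

```hcl
# Hypothetical names; storage account names must be globally unique.
resource "azurerm_storage_account" "tfstate" {
  name                     = "ststudenttfstate"
  resource_group_name      = "rg-student-bootstrap"
  location                 = "westeurope" # GDPR: keep it in the EU
  account_tier             = "Standard"
  account_replication_type = "LRS" # GRS in production

  blob_properties {
    versioning_enabled = true # protection against accidental overrides
  }
}

resource "azurerm_storage_container" "tfiac" {
  name                  = "tfiac"
  storage_account_name  = azurerm_storage_account.tfstate.name
  container_access_type = "private"
}
```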
In order for your build agent to reach the statefile stored in the storage account, you need to allowlist the IP ranges of the Azure-hosted build agents on which your terraform will be running.
Microsoft publishes these ranges regularly, e.g. here
https://www.microsoft.com/en-us/download/details.aspx?id=56519
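If you later want to manage this allowlist as code too, a hedged sketch could look like the following (the IP range shown is a documentation placeholder, not a real agent range; use the ranges from the download above):

```hcl
# Placeholder range: replace with the current Azure-hosted agent ranges.
resource "azurerm_storage_account_network_rules" "agents" {
  storage_account_id = azurerm_storage_account.tfstate.id
  default_action     = "Deny"
  ip_rules           = ["203.0.113.0/24"] # hypothetical example
  bypass             = ["AzureServices"]
}
```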
"Container" here is Azure terminology and means that we partition our blob storage. This is useful because each Container can have different properties and access rights.
Here comes the next piece, a pipeline: the client code that accesses the storage backend. This pipeline agent executes the terraform command and directs it to the correct storage account, wherein terraform will then store its state and compare IS versus SHOULD. It is your task to make the names match between the client (pipeline.yml in git) and the backend (Resource Group Name, Storage Account and Blob Storage Container).
Please update your paper drawing at this point to reflect the "agent" and the "backend".
You will notice a value called key: this is the name of the statefile inside the Container. We will demonstrate later in this lab what happens when you mismatch it (hint: disaster).
templates/iac-student.yaml
export TF_CLI_ARGS_init=" -backend-config=\"resource_group_name=$(AZURERM_resource_group_name)\" -backend-config=\"key=$(key).tfstate\" -backend-config=\"storage_account_name=$(AZURERM_storage_account_name)\" -backend-config=\"container_name=tfiac\" -backend-config=\"subscription_id=$(AZURERM_subscription_id)\" -backend-config=\"tenant_id=$(AZURERM_tenant_id)\" "
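These -backend-config flags pair with an intentionally (almost) empty backend block in the terraform code; Terraform calls this "partial configuration". A sketch:

```hcl
terraform {
  backend "azurerm" {
    # Intentionally left (almost) empty: resource_group_name,
    # storage_account_name, container_name, key etc. are injected at
    # `terraform init` time via the -backend-config flags above.
  }
}
```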
I'm sure you have heard of least privilege. Click on the Access Control (IAM) button and go to Roles to get an idea of what options you have. After all, we first need to allow the automation to do anything at all to your Storage Account; the key is to find the role that has just enough rights.
You should also check the Networking tab of the Storage Account: In a production setup, you need additional controls, so you would run bootstrap from a single controlled machine with a well known IP, and only this IP can access the storage account.
We are skipping the network security portion in the exercise, since previous experience shows that this will cause the debugging to be endless.
Your automation, as declared inside a pipeline.yml file, will be executed on a temporary runtime referred to as an agent. This agent needs to non-interactively obtain an access token that is correctly scoped for its automation task (in this case terraform).
Thus, you need to give the service-principal that terraform will assume the correct roles to do its job, whilst observing least privilege:
Via the UI, assign the role of Storage Blob Data Contributor to service-student-bootstrap- on the storage account and container level. Otherwise you will get 403 or Role Mismatch errors. (For reference, the equivalent in code is sketched below.)
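The UI step above corresponds to roughly the following hedged sketch (the principal's object ID is a placeholder you would have to look up):

```hcl
# Placeholder object ID: use that of your service-student-bootstrap- principal.
resource "azurerm_role_assignment" "tfstate_writer" {
  scope                = azurerm_storage_account.tfstate.id # or narrower: the container
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = "00000000-0000-0000-0000-000000000000"
}
```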
Add this to your drawing. Bonus exercise: how would you restrict the network access?
Service Connection
In this lab we use Azure, so we pair it with the cloud-native CI/CD tool that seamlessly integrates with Azure (Azure DevOps Pipelines) and allows us to connect to the Key Vault without ever touching any values inside the vaults.
Go to Pipelines -> Libraries, create a Library iac-admin-student- and link your service-principal into it.
In order to authorize it, you need to correctly scope the service connection by pressing the blue button's down arrow and filling it out like so.
The secret "key" is what TF is looking for as a "name" inside the storage container (called tfiac, see next screenshot).
Library Contexts
They need to match your line 42 in iac-student.yml. And do not set Enable System Diagnostics -> it will leak all your secrets and you will have to rotate all of them.
Well, it's a balance; or: you need to find one. That, or meditation.
The settings are scattered all over the place, but you can find them, for example, here:
Just because the terraform syntax can mostly be copy/pasted from HashiCorp's docs (https://registry.terraform.io) doesn't mean terraform is easy:
you need to understand the infrastructure that underlies the syntax. If, for example, you are dealing with DNS, you need to be aware that DNS is typically a very sensitive thing, even though changing records via terraform is extremely easy.
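To illustrate just how easy (and therefore dangerous) that is, a hedged sketch with a made-up zone, group and documentation-range IP:

```hcl
# Hypothetical zone and names: changing production DNS is a one-liner,
# which is exactly why it deserves extra review.
resource "azurerm_dns_a_record" "web" {
  name                = "www"
  zone_name           = "example.org"
  resource_group_name = "rg-dns"
  ttl                 = 300
  records             = ["203.0.113.10"]
}
```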
Once it's set up, it's a joy. But in its seeming simplicity lies a lot of danger, and even though you were likely annoyed by the amount of approval pop-ups -> they can save your production system.
Always test IaC workflows by creating entirely identical systems on branches, before accepting a PR.
IaC testing is time-consuming and often requires multiple cloud accounts (some of which you might want to occasionally completely delete).
Here is a classic Terraform situation, where you have a resource that cannot directly be destroyed, and the typical solution is: run terraform destroy twice 😛
You survived the setup: awesome, take a moment to be grateful.
Now, we can use the power of terraforming to create pretty much anything. We will be using Azure for this exercise, as the network (compared to GCP) and the IAM (compared to AWS) are the most straightforward. Most other clouds are similar enough, but network and IAM are always the most proprietary parts. Later, in May, we will compare the network setup with that on OpenStack and on GCP for an equivalent cluster deployment, and in June we will work with hybrid cloud deployments. The following exercise is not a best-practice architecture; in fact, the network and the SSH access via a traditional firewall are decidedly NOT a good idea.
In the homework, you'll work towards a more best-practice design (buzzword: "Zero Trust").
Behind the scenes: where is our build agent?
We will now set up a simple VM step by step, and then we will set up a destroy pipeline for it. (A rough sketch of what the VM portion can look like follows below.)
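To give you an idea of the shape of such a VM in terraform, here is a hedged sketch; all names, the size and the subnet variable are assumptions for illustration, and the lab's actual templates are authoritative:

```hcl
variable "subnet_id" {
  type        = string
  description = "ID of an existing subnet (assumed to be created elsewhere)"
}

# A NIC in the existing subnet; all names below are hypothetical.
resource "azurerm_network_interface" "vm" {
  name                = "nic-student-vm"
  location            = "westeurope"
  resource_group_name = "rg-student"

  ip_configuration {
    name                          = "internal"
    subnet_id                     = var.subnet_id
    private_ip_address_allocation = "Dynamic"
  }
}

resource "azurerm_linux_virtual_machine" "vm" {
  name                  = "vm-student"
  resource_group_name   = "rg-student"
  location              = "westeurope"
  size                  = "Standard_B1s" # one of the smallest SKUs
  admin_username        = "student"
  network_interface_ids = [azurerm_network_interface.vm.id]

  admin_ssh_key {
    username   = "student"
    public_key = file("~/.ssh/id_rsa.pub")
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Standard_LRS"
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts"
    version   = "latest"
  }
}
```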
Write a loop using count and one using for_each, and report (or write down) the difference.
Note: Please use a very small array to loop over (see the sketch after this note).
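Here is a hedged sketch of the two loop styles over a deliberately tiny array (the resource type and names are chosen only for illustration):

```hcl
variable "names" {
  type    = list(string)
  default = ["a", "b"] # a very small array, as requested
}

# count indexes by position: removing "a" shifts "b" to index 0 and
# forces a destroy/recreate of that resource.
resource "azurerm_resource_group" "by_count" {
  count    = length(var.names)
  name     = "rg-count-${var.names[count.index]}"
  location = "westeurope"
}

# for_each keys by value: removing "a" leaves the "b" resource untouched.
resource "azurerm_resource_group" "by_for_each" {
  for_each = toset(var.names)
  name     = "rg-foreach-${each.key}"
  location = "westeurope"
}
```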
To compare with Ansible: to run a simple script on the VM that creates a cron job which updates it and shuts it off, you would have to add a pipeline step that runs Ansible, in iac-student.yaml, only on your branch.
(Taken from *Terraform: Up & Running*, 3rd edition)
We will discuss billing before your homework; make sure that you can see your budget in the Azure subscription (maybe bookmark it).
Remember the discussion on billing: https://learn.microsoft.com/en-us/azure/cost-management-billing/costs/tutorial-acm-create-budgets
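For completeness, a budget alert can itself be declared as code. A hedged sketch follows (all IDs and the email are placeholders, and whether you have permission to create budgets in this shared subscription is not guaranteed):

```hcl
# Placeholder subscription ID and email; amount is in the billing currency.
resource "azurerm_consumption_budget_subscription" "student" {
  name            = "budget-student"
  subscription_id = "/subscriptions/00000000-0000-0000-0000-000000000000"

  amount     = 10
  time_grain = "Monthly"

  time_period {
    start_date = "2024-03-01T00:00:00Z"
  }

  notification {
    enabled        = true
    threshold      = 80 # alert at 80% of the budget
    operator       = "GreaterThan"
    contact_emails = ["student@example.org"]
  }
}
```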
In this homework, you will NOT use any SKUs other than the ones specified and no public IPs; you will create no new resource groups and you will not use any licensed products from the marketplace.
In your homework, you will design a private network, a subnetwork and a network security group that has a "default deny" (a code sketch of this part follows after the homework description).
In this subnet lives a Linux host of the smallest SKU you can find, with priority = Spot. You must configure it such that you can connect to it from your laptop via Tailscale (you can use https://www.merrell.dev/using-tailscale-with-an-azure-linux-vm-and-terraform/ as inspiration).
Your Linux VM must be sleeping by default. Please monitor your budget consumption, and report the cost and how to use it in a Readme.md on your branch.
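To get you started on the "default deny" part only, here is a hedged sketch with hypothetical names and address ranges; the Spot VM and the Tailscale wiring are deliberately left to you:

```hcl
# Hypothetical names and address ranges.
resource "azurerm_virtual_network" "hw" {
  name                = "vnet-homework"
  address_space       = ["10.10.0.0/24"]
  location            = "westeurope"
  resource_group_name = "rg-student"
}

resource "azurerm_subnet" "hw" {
  name                 = "snet-homework"
  resource_group_name  = "rg-student"
  virtual_network_name = azurerm_virtual_network.hw.name
  address_prefixes     = ["10.10.0.0/26"]
}

resource "azurerm_network_security_group" "deny_all" {
  name                = "nsg-default-deny"
  location            = "westeurope"
  resource_group_name = "rg-student"

  # Explicit catch-all deny, evaluated last (highest priority number).
  security_rule {
    name                       = "deny-all-inbound"
    priority                   = 4096
    direction                  = "Inbound"
    access                     = "Deny"
    protocol                   = "*"
    source_port_range          = "*"
    destination_port_range     = "*"
    source_address_prefix      = "*"
    destination_address_prefix = "*"
  }
}

resource "azurerm_subnet_network_security_group_association" "hw" {
  subnet_id                 = azurerm_subnet.hw.id
  network_security_group_id = azurerm_network_security_group.deny_all.id
}
```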
You are expected to have all of this done by Tuesday, March 12, 2024, such that you can do the next class on it.
Congratulations, you've successfully set up Terraform and deployed your first VM using 100% automation!