Infrastructure as Code Notes
Study notes based on the reference.
infrastructure as code
an important shift in mindset: you can manage almost everything in code, including servers, databases, networks, log files, application configuration, documentation, automated tests, deployment processes, and so on.
categories of IaC tools:
Ad hoc scripts
An ad hoc script is a one-off script that performs a specific series of actions, e.g. a Bash or Python script that installs and starts software.
Configuration management tools
Typical example: Ansible. These tools manage many machines through configuration and make the work that ad hoc scripts do idempotent.
Server templating tools
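Compared with the configuration management tools above, this is a different approach: package the server into an image and install that image directly on the host, with no further Ansible-style configuration.
Two broad categories:
Virtual machines: emulate a real OS with CPU, memory, network, and so on. They are heavyweight, because all the hardware underneath the OS has to be virtualized. Examples: Packer / Vagrant
Containers: a special kind of isolated process. Lightweight. Example: Docker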
Orchestration tools
Once you have images, you still need to orchestrate them:
- deployment
- monitoring / auto healing / auto scaling
- load balancing
- service discovery
- …
Tools that handle these: Kubernetes / Amazon ECS / Nomad / Docker Swarm
Provisioning tools
The configuration management, server templating, and orchestration tools above only define how things run on a server.
Creating the servers and the rest of the infrastructure requires a provisioning tool such as Terraform or CloudFormation.
Terraform
Open-sourced by HashiCorp. Under the hood it provisions infrastructure by making API calls to the different cloud providers (AWS, Azure, GCP).
For example, a single Terraform configuration can create an AWS instance and configure its DNS record on Google Cloud, spanning multiple cloud providers.
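A minimal sketch of what such a multi-cloud configuration might look like: an AWS instance whose DNS record lives on Google Cloud. The region, AMI ID, project, managed zone, and domain are all placeholder assumptions, not values from the reference.

```hcl
provider "aws" {
  region = "us-east-2" # assumed region
}

provider "google" {
  project = "example-project" # placeholder project ID
}

# The server itself is created on AWS.
resource "aws_instance" "example" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI ID
  instance_type = "t2.micro"
}

# The DNS record is created on Google Cloud and points at the AWS instance.
resource "google_dns_record_set" "example" {
  name         = "demo.example.com." # placeholder domain (trailing dot required)
  managed_zone = "example-zone"      # placeholder, must already exist
  type         = "A"
  ttl          = 300
  rrdatas      = [aws_instance.example.public_ip]
}
```

Because the DNS record references `aws_instance.example.public_ip`, Terraform orders the two resources itself: the instance is created first, then the record.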
tips
EOF: heredoc syntax, which lets you write a multi-line string without `\n`
interpolation: `${...}` lets you insert a variable into a string, e.g. `${var.server_port}`
terraform output [output_name]: view outputs without applying any changes
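The heredoc, interpolation, and output tips can be combined in one small sketch. The variable default, the local name `user_data`, and the script body are illustrative assumptions.

```hcl
variable "server_port" {
  description = "Port the demo HTTP server listens on"
  type        = number
  default     = 8080 # assumed default
}

locals {
  # <<-EOF starts an indented heredoc: a multi-line string with no \n needed.
  # ${var.server_port} is interpolated into the string.
  user_data = <<-EOF
    #!/bin/bash
    echo "Hello, World" > index.html
    nohup busybox httpd -f -p ${var.server_port} &
  EOF
}

output "user_data" {
  value = local.user_data
}
```

Once the configuration has been applied, `terraform output user_data` prints the rendered script without running another apply.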
data sources: a piece of read-only information that is fetched from the provider. The difference from a resource: resources cause Terraform to create, update, and delete infrastructure objects, while data sources cause Terraform only to read objects.
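As a sketch of the read-only nature of data sources, the following looks up the default VPC and its subnets without creating anything. It assumes an AWS account with a default VPC in the chosen region and an AWS provider version that ships the `aws_subnets` data source.

```hcl
provider "aws" {
  region = "us-east-2" # assumed region
}

# Read-only lookup: nothing is created, updated, or deleted.
data "aws_vpc" "default" {
  default = true
}

# A second lookup that depends on the result of the first.
data "aws_subnets" "default" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.default.id]
  }
}

output "default_subnet_ids" {
  value = data.aws_subnets.default.ids
}
```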
manage Terraform state: Terraform provides the concept of workspaces, with `default` as the default one. Workspaces make it possible to keep a separate state per environment.
- The point here is that workspaces should not be how we separate environments. The official documentation says: A common use for multiple workspaces is to create a parallel, distinct copy of a set of infrastructure in order to test a set of changes before modifying the main production infrastructure.
- Non-default workspaces are often related to feature branches in version control. The default workspace might correspond to the “master” or “trunk” branch, which describes the intended state of production infrastructure.
- Instead, use one or more re-usable modules to represent the common elements, and then represent each instance as a separate configuration that instantiates those common elements in the context of a different backend
- To isolate environments we rely on file layout instead: one directory per environment.
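A sketch of the file-layout approach: each environment directory instantiates the same reusable module against its own backend. The directory names, bucket, state key, and module inputs are hypothetical.

```hcl
# Assumed layout (hypothetical names):
#   modules/webserver-cluster/   reusable module with the common elements
#   stage/main.tf                this file
#   prod/main.tf                 same module, different backend key and inputs

terraform {
  backend "s3" {
    bucket = "example-terraform-state"                    # placeholder bucket
    key    = "stage/webserver-cluster/terraform.tfstate" # one state file per environment
    region = "us-east-2"
  }
}

module "webserver_cluster" {
  source = "../modules/webserver-cluster"

  # Hypothetical inputs; prod/main.tf would pass its own values.
  cluster_name  = "webservers-stage"
  instance_type = "t2.micro"
}
```

With this layout a mistake in `stage/` cannot touch the production state, because the two directories never share a backend key.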
dynamic: a Terraform keyword for loops; with `for_each` it generates `tag` blocks dynamically, as follows:

    # Generate one tag block per entry in var.custom_tags.
    dynamic "tag" {
      for_each = var.custom_tags

      content {
        key                 = tag.key
        value               = tag.value
        propagate_at_launch = true
      }
    }

the problem with count: Terraform requires that it can compute `count` and `for_each` during the `plan` phase, before any resources are created or modified. This means that `count` and `for_each` can reference hardcoded values, variables, data sources, and even lists of resources (so long as the length of the list can be determined during `plan`), but not computed resource outputs.
terraform plan: `plan` compares against the state file, so manual out-of-band changes can cause problems at apply time. Either make every infra change through Terraform, or remediate with the `terraform import` command (a ready-made tool for this is Terraforming).
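To illustrate the plan-time constraint on `count` and `for_each`, here is a sketch with one allowed and one disallowed use. The variable, user names, and AMI ID are illustrative assumptions.

```hcl
variable "user_names" {
  type    = list(string)
  default = ["neo", "trinity", "morpheus"] # assumed example values
}

# Allowed: the set of keys comes from a variable, so it is known during plan.
resource "aws_iam_user" "example" {
  for_each = toset(var.user_names)
  name     = each.value
}

# Not allowed: count would depend on a value that is only known after apply,
# so Terraform rejects this during plan. Left commented out on purpose.
#
# resource "random_integer" "num_instances" {
#   min = 1
#   max = 3
# }
#
# resource "aws_instance" "example" {
#   count         = random_integer.num_instances.result
#   ami           = "ami-0123456789abcdef0" # placeholder AMI ID
#   instance_type = "t2.micro"
# }
```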
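Refactoring Can Be Tricky: refactoring infrastructure is very different from refactoring application code. For example, when you change a name, Terraform by default deletes the old resource and then creates a new one, which inevitably causes downtime in between.
- plan: carefully scan the output manually.
- create_before_destroy: add `create_before_destroy = true` to the `lifecycle` block.
- terraform state: applies only when a resource is renamed. Run it by hand, e.g. `terraform state mv aws_security_group.instance aws_security_group.cluster_instance`.
- immutable parameters: some resource parameters are immutable, so changing them means destroy + create. Read the documentation carefully.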
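A minimal sketch of the `create_before_destroy` setting from the list above. The variable name and its default are assumptions made for illustration.

```hcl
variable "security_group_name" {
  type    = string
  default = "example-instance-sg" # placeholder name
}

resource "aws_security_group" "instance" {
  # Changing the name forces a replacement of the security group.
  name = var.security_group_name

  lifecycle {
    # Create the replacement first, then destroy the old resource,
    # instead of the default destroy-then-create order.
    create_before_destroy = true
  }
}
```

The setting changes the order of a replacement but does not avoid it, so reading the `plan` output is still worthwhile.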
production-grade infra
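This covers a lot: servers, data stores, load balancers, security functionality, monitoring and alerting tools, building pipelines, and all the other pieces of your technology that are necessary to run a business.
In most cases, however, effort estimates turn out to be wrong, especially for DevOps work. Why?
- DevOps as an industry is still young, and there are many pitfalls left to hit. Terraform itself only appeared in the 2010s (first released in 2014).
- DevOps work is prone to yak shaving: pull one thread and everything moves. You need to deploy one service, but it depends on configuration, SFTP, TLS, DNS, login, and so on; or deploying an app triggers a bug that sets off a chain reaction of TLS issues, timeouts, and more.
- accidental complexity: DevOps touches everything from build to deployment to security and so on, so how deep any given problem goes is unknown in advance. Examples: pipeline agent or network issues, production timeouts, OOM.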
Production-Grade Infrastructure Checklist, excerpted from Terraform: Up & Running
| Task | Description | Example tools |
|---|---|---|
| Install | Install the software binaries and all dependencies. | Bash, Chef, Ansible, Puppet |
| Configure | Configure the software at runtime. Includes port settings, TLS certs, service discovery, leaders, followers, replication, etc. | Bash, Chef, Ansible, Puppet |
| Provision | Provision the infrastructure. Includes servers, load balancers, network configuration, firewall settings, IAM permissions, etc. | Terraform, CloudFormation |
| Deploy | Deploy the service on top of the infrastructure. Roll out updates with no downtime. Includes blue-green, rolling, and canary deployments. | Terraform, CloudFormation, Kubernetes, ECS |
| High availability | Withstand outages of individual processes, servers, services, data centers, and regions. | Multidatacenter, multiregion, replication, auto scaling, load balancing |
| Scalability | Scale up and down in response to load. Scale horizontally (more servers) and/or vertically (bigger servers). | Auto scaling, replication, sharding, caching, divide and conquer |
| Performance | Optimize CPU, memory, disk, network, and GPU usage. Includes query tuning, benchmarking, load testing, and profiling. | Dynatrace, valgrind, VisualVM, ab, JMeter |
| Networking | Configure static and dynamic IPs, ports, service discovery, firewalls, DNS, SSH access, and VPN access. | VPCs, firewalls, routers, DNS registrars, OpenVPN |
| Security | Encryption in transit (TLS) and on disk, authentication, authorization, secrets management, server hardening. | ACM, Let’s Encrypt, KMS, Cognito, Vault, CIS |
| Metrics | Availability metrics, business metrics, app metrics, server metrics, events, observability, tracing, and alerting. | CloudWatch, DataDog, New Relic, Honeycomb |
| Logs | Rotate logs on disk. Aggregate log data to a central location. | CloudWatch Logs, ELK, Sumo Logic, Papertrail |
| Backup and Restore | Make backups of DBs, caches, and other data on a scheduled basis. Replicate to separate region/account. | RDS, ElastiCache, replication |
| Cost optimization | Pick proper Instance types, use spot and reserved Instances, use auto scaling, and nuke unused resources. | Auto scaling, spot Instances, reserved Instances |
| Documentation | Document your code, architecture, and practices. Create playbooks to respond to incidents. | READMEs, wikis, Slack |
| Tests | Write automated tests for your infrastructure code. Run tests after every commit and nightly. | Terratest, inspec, serverspec, kitchen-terraform |
module tips
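small modules: beginners tend to put every environment into one module or file. The downsides are the same as in application code, and just as obvious; for infra they are even worse. We need small, independent units.
composable modules: Unix philosophy, function composition, minimal side effects.
releasable modules: use Git tags with semantic versioning. Modules can be released to https://registry.terraform.io/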
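A sketch of consuming a module pinned to a Git tag. The repository URL, module path, tag, and input are all hypothetical.

```hcl
module "webserver_cluster" {
  # Hypothetical repository; ?ref= pins the module to the Git tag v0.0.1.
  source = "git::https://github.com/example-org/example-modules.git//webserver-cluster?ref=v0.0.1"

  # Hypothetical input defined by the module.
  cluster_name = "webservers-stage"
}
```

Pinning to a tag lets each environment upgrade the module on its own schedule by changing only the `ref` value.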
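beyond Terraform modules: the shift in thinking matters here. Modules are usually discussed as Terraform code, but a module folder can hold other infra code as well; see the run-vault Bash script. In other words, some non-Terraform code is unavoidable to make up for Terraform's declarative nature. There are workarounds, such as null_resource.
provisioners: run scripts on the local or a remote machine. A provisioner can be combined with null_resource to run a script as part of the Terraform life cycle.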
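A minimal sketch of a provisioner attached to a `null_resource`. The echoed command is an arbitrary example, and the `uuid()` trigger is one assumed way to force the script to run on every apply.

```hcl
resource "null_resource" "example" {
  # uuid() yields a new value on each run, so the resource is replaced
  # and the provisioner re-executes on every apply.
  triggers = {
    uuid = uuid()
  }

  # Runs on the machine where terraform apply is executed.
  provisioner "local-exec" {
    command = "echo \"Hello, World from $(uname -smp)\""
  }
}
```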
external data source: passes data from Terraform to an external program, and the external program passes data back to Terraform as JSON.
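A sketch of the external data source round trip, assuming `bash` and `cat` are available on the machine running Terraform. The program simply echoes the JSON query it receives on stdin back to stdout.

```hcl
data "external" "echo" {
  # Terraform sends the query below to this program as JSON on stdin;
  # the program must print a JSON object of string values on stdout.
  program = ["bash", "-c", "cat /dev/stdin"]

  query = {
    foo = "bar"
  }
}

# The whole map returned by the external program.
output "echo" {
  value = data.external.echo.result
}

# A single key from that map.
output "echo_foo" {
  value = data.external.echo.result.foo
}
```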
testing
The DevOps world is full of fear: fear of downtime; fear of data loss; fear of security breaches
Infra changes need tests to build confidence. So how should infra code be tested?
- manual testing:
  - Test by hand. Note that some private subnets can only be tested manually through a jump host.
  - cleaning up: cloud-nuke and aws-nuke are tools that can quickly delete everything in an AWS account
- automated testing:
  - Terratest deploys real infrastructure in a real environment, then validates that infrastructure via API/SSH/…
  - some tools: pre-commit-terraform / goss
benefits
Persuading people to adopt IaC is hard, especially non-developers, because IaC brings many extra costs. Here are a few talking points:
I have an idea for how to reduce our outages in half.
The deployment process is fully automated, reliable, and repeatable.