Infrastructure as Code Notes

2020-05-07

These notes are based on the reference material.

infrastructure as code

an important shift in mindset: you can manage almost everything in code, including servers, databases, networks, log files, application configuration, documentation, automated tests, deployment processes, and so on.

categories of IaC tools:

Ad hoc scripts

An ad hoc script is a one-off script that carries out a specific series of actions, for example a Bash or Python script that installs and starts software.

Configuration management tools

Typical example: Ansible. You manage many machines through configuration and get an idempotent version of what ad hoc scripts do.

Server templating tools

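Compared with the configuration management tools above, this is a different approach: package the server into an image and install that image directly on the host, with no further Ansible configuration needed.

There are two broad categories:

Virtual machines: emulate a complete machine, including CPU, memory, and network. They are heavyweight because all of the hardware has to be virtualized underneath the OS. Examples: Packer, Vagrant.

Containers: a special kind of isolated process. Lightweight. Example: Docker.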

Orchestration tools

Once the images exist, they need to be orchestrated:

deployment
monitoring / auto healing / auto scaling
load balancing
service discovery
…

Tools that handle these: Kubernetes / Amazon ECS / Nomad / Docker Swarm

Provisioning tools

The configuration management, server templating, and orchestration tools above only define what runs on a server.

Creating the servers and the rest of the infrastructure is the job of provisioning tools such as Terraform / CloudFormation.

Terraform

Open source, from HashiCorp. Under the hood it provisions resources by making API calls to different cloud providers (AWS, Azure, GCP).

resource "aws_instance" "example" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"
}

resource "google_dns_record_set" "a" {
  name         = "demo.google-example.com"
  managed_zone = "example-zone"
  type         = "A"
  ttl          = 300
  rrdatas      = [aws_instance.example.public_ip]
}

The example above uses Terraform to create an AWS instance and then sets up its DNS record in Google Cloud, so a single configuration spans multiple cloud providers.
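The snippet above presumes that both providers are already configured. A minimal sketch of that setup might look like the following; the region and the project ID are placeholders chosen for illustration, and the required_providers syntax assumes Terraform 0.13 or later.

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
    google = {
      source = "hashicorp/google"
    }
  }
}

# AWS provider. The region is an assumption; the AMI ID in the example is region-specific.
provider "aws" {
  region = "us-east-2"
}

# Google Cloud provider. The project ID and region are placeholders.
provider "google" {
  project = "my-example-project"
  region  = "us-central1"
}
```

Credentials for both clouds are expected to come from the environment rather than from the configuration itself.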

tips

EOF: heredoc syntax; lets you write multi-line strings without \n
interpolation: ${...} inserts a value into a string, for example ${var.server_port}
terraform output [output_name]: show outputs without applying any changes
data sources: a piece of read-only information that is fetched from the provider. The difference from a resource: resources cause Terraform to create, update, and delete infrastructure objects, data resources cause Terraform only to read objects.
manage terraform state: Terraform provides workspaces; the default one is named default. They give you the ability to keep separate state for different environments.
- The point here is that workspaces should not be the way we separate environments. The official documentation says: A common use for multiple workspaces is to create a parallel, distinct copy of a set of infrastructure in order to test a set of changes before modifying the main production infrastructure.
- Non-default workspaces are often related to feature branches in version control. The default workspace might correspond to the “master” or “trunk” branch, which describes the intended state of production infrastructure.
- Instead, use one or more re-usable modules to represent the common elements, and then represent each instance as a separate configuration that instantiates those common elements in the context of a different backend
- To isolate environments we rely on file layout instead: one directory per environment.
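The tips above can be put together in one sketch: a hypothetical stage/main.tf that has its own backend (one directory and one state file per environment), reads a data source, builds a user_data script with a heredoc and interpolation, and exposes an output. The bucket name, state key, region, and resource names are assumptions for illustration only.

```hcl
# Hypothetical file: stage/main.tf (one directory per environment).
# Bucket, key, and region are placeholders.
terraform {
  backend "s3" {
    bucket = "example-terraform-state"
    key    = "stage/terraform.tfstate"
    region = "us-east-2"
  }
}

provider "aws" {
  region = "us-east-2"
}

variable "server_port" {
  description = "Port the demo web server listens on"
  type        = number
  default     = 8080
}

# Data source: only reads the existing default VPC, never creates or changes it.
data "aws_vpc" "default" {
  default = true
}

resource "aws_security_group" "instance" {
  name   = "stage-example-instance"
  vpc_id = data.aws_vpc.default.id

  ingress {
    from_port   = var.server_port
    to_port     = var.server_port
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "example" {
  ami                    = "ami-0c55b159cbfafe1f0"
  instance_type          = "t2.micro"
  vpc_security_group_ids = [aws_security_group.instance.id]

  # Heredoc: a multi-line string without \n.
  # The server port variable is interpolated into the script.
  user_data = <<-EOF
    #!/bin/bash
    echo "Hello, World" > index.html
    nohup busybox httpd -f -p ${var.server_port} &
  EOF
}

output "public_ip" {
  description = "Public IP of the example instance"
  value       = aws_instance.example.public_ip
}
```

After an apply, terraform output public_ip reads the value back from state. Under this layout a prod/ directory would hold its own configuration with a different backend key.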

dynamic: a Terraform keyword for handling loops; it uses for_each to generate tag blocks dynamically, as follows

dynamic "tag" {
  for_each = var.custom_tags
  
  content {
    key                 = tag.key
    value               = tag.value
    propagate_at_launch = true
  }
}
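For context, a dynamic block has to sit inside a resource that accepts the nested block being generated. The sketch below places it in an aws_autoscaling_group, whose tag blocks take key, value, and propagate_at_launch; the variable defaults, the subnet_ids input, and the launch template are assumptions added so the fragment is complete.

```hcl
variable "custom_tags" {
  description = "Extra tags to set on the instances in the ASG"
  type        = map(string)
  default = {
    Owner      = "team-foo"
    DeployedBy = "terraform"
  }
}

variable "subnet_ids" {
  description = "Subnets to launch instances into (hypothetical input)"
  type        = list(string)
}

resource "aws_launch_template" "example" {
  name_prefix   = "example-"
  image_id      = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"
}

resource "aws_autoscaling_group" "example" {
  min_size            = 1
  max_size            = 2
  vpc_zone_identifier = var.subnet_ids

  launch_template {
    id      = aws_launch_template.example.id
    version = "$Latest"
  }

  # One tag block is generated per entry in var.custom_tags.
  # Inside content, tag.key and tag.value refer to the current map entry.
  dynamic "tag" {
    for_each = var.custom_tags

    content {
      key                 = tag.key
      value               = tag.value
      propagate_at_launch = true
    }
  }
}
```

With the default map above, the plan would show two tag blocks on the group.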

limitations of count: Terraform requires that it can compute count and for_each during the plan phase, before any resources are created or modified. This means that count and for_each can reference hardcoded values, variables, data sources, and even lists of resources (so long as the length of the list can be determined during plan), but not computed resource outputs.
terraform plan: plan only considers what is recorded in the state file, so if something was changed manually, apply can run into problems. Either make every infrastructure change through Terraform, or repair the situation afterwards with the terraform import command (Terraforming is a ready-made tool for this).
Refactoring Can Be Tricky: refactoring infrastructure is very different from refactoring ordinary code. For example, to Terraform a name change by default means deleting the old resource and then creating a new one, which inevitably causes downtime in between.
- plan: carefully scan the output manually
- create_before_destroy: add create_before_destroy = true to the lifecycle block.
- terraform state: only for the case where a resource is renamed. You can manually run terraform state mv aws_security_group.instance aws_security_group.cluster_instance.
- immutable parameters: some resource parameters are immutable, so changing them means destroy + create. Read the documentation carefully.
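A short sketch of two of the points above: count driven by a value that is known at plan time, and a lifecycle block with create_before_destroy so the replacement exists before the old object is removed. The variable and the name prefix are illustrative assumptions.

```hcl
variable "instance_count" {
  description = "Number of instances; a plain variable is known at plan time"
  type        = number
  default     = 2
}

resource "aws_security_group" "cluster_instance" {
  # name_prefix (rather than a fixed name) lets the replacement be created
  # while the old security group still exists.
  name_prefix = "cluster-instance-"

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_instance" "example" {
  # Allowed: var.instance_count can be computed during plan.
  # A value that only exists after apply (for example a generated ID)
  # cannot be used for count.
  count                  = var.instance_count
  ami                    = "ami-0c55b159cbfafe1f0"
  instance_type          = "t2.micro"
  vpc_security_group_ids = [aws_security_group.cluster_instance.id]

  tags = {
    Name = "example-${count.index}"
  }
}
```

If the security group had been renamed from aws_security_group.instance in an existing deployment, the terraform state mv command from the list would be run before the next plan so that Terraform does not treat the rename as a delete plus a create.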

production-grade infra

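This covers a lot: servers, data stores, load balancers, security functionality, monitoring and alerting tools, building pipelines, and all the other pieces of your technology that are necessary to run a business.

In most cases, though, effort estimates turn out to be wrong, especially for DevOps work. Why?

DevOps as an industry is still young, and there are many pitfalls still to run into. Terraform itself only appeared in the 2010s (first released in 2014).
DevOps work is very prone to yak shaving: touch one thing and everything else moves. For example, you need a deployment service, but it depends on configuration, SFTP, TLS, DNS, login, and so on; or deploying an app triggers a bug that sets off a chain reaction of TLS issues, timeouts, and more.
accidental complexity: DevOps touches everything from build to deployment to security and so on, so the depth of any problem you might hit is unknown in advance. Examples: pipeline agents and networking, production timeouts, OOM, and so on.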

Production-Grade Infrastructure Checklist, excerpted from Terraform: Up & Running

Task	Description	Example tools
Install	Install the software binaries and all dependencies.	Bash, Chef, Ansible, Puppet
Configure	Configure the software at runtime. Includes port settings, TLS certs, service discovery, leaders, followers, replication, etc.	Bash, Chef, Ansible, Puppet
Provision	Provision the infrastructure. Includes servers, load balancers, network configuration, firewall settings, IAM permissions, etc.	Terraform, CloudFormation
Deploy	Deploy the service on top of the infrastructure. Roll out updates with no downtime. Includes blue-green, rolling, and canary deployments.	Terraform, CloudFormation, Kubernetes, ECS
High availability	Withstand outages of individual processes, servers, services, data centers, and regions.	Multidatacenter, multiregion, replication, auto scaling, load balancing
Scalability	Scale up and down in response to load. Scale horizontally (more servers) and/or vertically (bigger servers).	Auto scaling, replication, sharding, caching, divide and conquer
Performance	Optimize CPU, memory, disk, network, and GPU usage. Includes query tuning, benchmarking, load testing, and profiling.	Dynatrace, valgrind, VisualVM, ab, Jmeter
Networking	Configure static and dynamic IPs, ports, service discovery, firewalls, DNS, SSH access, and VPN access.	VPCs, firewalls, routers, DNS registrars, OpenVPN
Security	Encryption in transit (TLS) and on disk, authentication, authorization, secrets management, server hardening.	ACM, Let’s Encrypt, KMS, Cognito, Vault, CIS
Metrics	Availability metrics, business metrics, app metrics, server metrics, events, observability, tracing, and alerting.	CloudWatch, DataDog, New Relic, Honeycomb
Logs	Rotate logs on disk. Aggregate log data to a central location.	CloudWatch Logs, ELK, Sumo Logic, Papertrail
Backup and Restore	Make backups of DBs, caches, and other data on a scheduled basis. Replicate to separate region/account.	RDS, ElastiCache, replication
Cost optimization	Pick proper Instance types, use spot and reserved Instances, use auto scaling, and nuke unused resources.	Auto scaling, spot Instances, reserved Instances
Documentation	Document your code, architecture, and practices. Create playbooks to respond to incidents.	READMEs, wikis, Slack
Tests	Write automated tests for your infrastructure code. Run tests after every commit and nightly.	Terratest, inspec, serverspec, kitchen-terraform

module tips

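small module: beginners tend to put every environment into a single module or file. The downsides are the same as in application code and just as obvious, and they matter even more for infrastructure: we need small, independent units.
composable modules: Unix philosophy, function composition, minimize side effects.
releasable module: use Git tags with semantic versioning. Modules can be released to https://registry.terraform.io/
beyond terraform modules: the shift in thinking matters. Although modules are always discussed in terms of Terraform code, a module folder can also hold other infrastructure code; see the run-vault Bash script. In other words, some non-Terraform code is unavoidable to make up for the limits of the declarative model. There are some workarounds, such as null_resource.
provisioners: run scripts on the local or a remote machine. A provisioner can be combined with null_resource to run scripts as part of the Terraform lifecycle.
external data source: pass data from Terraform to an external program; the external program passes data back to Terraform as JSON. For example: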

data "external" "echo" {
  program = ["bash", "-c", "cat /dev/stdin"]

  query = {
    foo = "bar"
  }
}

output "echo" {
  value = data.external.echo.result
}

output "echo_foo" {
  value = data.external.echo.result.foo
}
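The provisioner and null_resource combination mentioned in the tips can be sketched in the same spirit. This minimal example runs a command on the machine executing Terraform; the trigger key and the echoed text are made up for illustration.

```hcl
resource "null_resource" "example" {
  # Changing any value in triggers replaces the null_resource,
  # which makes the provisioner run again. The key name is arbitrary.
  triggers = {
    script_version = "1"
  }

  # local-exec runs on the machine where terraform apply is executed.
  provisioner "local-exec" {
    command = "echo \"Hello from the machine running Terraform\""
  }
}
```

Since the resource creates nothing in the cloud, it only acts as a hook for running a script at a defined point in the Terraform lifecycle.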

testing

The DevOps world is full of fear: fear of downtime; fear of data loss; fear of security breaches

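Infrastructure changes need testing to build confidence. So how should infrastructure code be tested?

manual testing:
- Test by hand. Note that resources in some private subnets can only be tested manually through a jump host.
- cleaning up: cloud-nuke and aws-nuke can both quickly delete everything in an AWS account
automated testing:
- Terratest deploys real infrastructure in a real environment, then validates that infrastructure via API calls, SSH, and so on
- some tools: pre-commit-terraform / goss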

benefits

Persuading people to adopt IaC is hard, especially non-developers, because IaC brings many additional costs. Here are a few starting points:

I have an idea for how to reduce our outages in half.

deployment process is fully automated, reliable, and repeatable