Road to startup infra - Part 2 - Infrastructure as Code

In this post, we continue digging into a series about the journey of creating a modern information technology infrastructure.

In my previous post, Road to startup infra - Part I, I talked about how to create a Certificate Authority. That step is optional; you can start with this post.

We will create a secure, robust environment where all systems will be deployed. I call this infrastructure the barebones: it is the lowest layer, the one that keeps everything running.

On top of the barebones, we will add containers. On top of that, we will take advantage of architecture, processes, and orchestration to deploy customer-facing systems, create automation, and bring competitive value to any company.

This is not a theory blog series, and I will go deep into some technical stuff. I'm building and documenting, putting together my best talents. Hopefully, this will make a difference and help you.

Let's start. But first, tell me one thing: can you manage 10 servers? What about 100? 1,000?

Currently, I have 10. I would be glad to have 100; that would mean serious value is being created. But can you grow? It takes a lot of effort to manage even one or two servers. Ten is not that crazy, but it is the point where automation really pays off.

For a small startup, ten servers may seem like an exaggeration. Some IT folks have criticized me, saying I spend too much money or waste my time: "I worked on a project with 10,000 users on one server, and the company makes much more money than you." But I hope you will understand the rationale behind it.

So, pick an automation tool and keep your sanity. I started with Ansible; nowadays I'm using SaltStack, and I will keep it as the reference here. My reasons for picking salt are in this blog post.
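
To give you a taste of what that buys you: once the master and minions are in place (covered below), one command reaches every server. Two trivial examples, assuming a working master:

# Ping every minion connected to the master
salt '*' test.ping

# Upgrade all packages on every minion
salt '*' pkg.upgrade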

The cloud provider

One thing I'm not doing is managing hardware. I did that in the past, and the performance per dollar is the best, but cheap cloud providers do exist and some offer decent services.

One thing I do on a regular basis is check what is available and test providers. I'm not afraid to move everything to another provider if that saves me money. Different workloads have different notions of value; you may find a server that is more valuable at another provider, and I don't like too much dependency on any single one.

This is why multicloud practices appeal to me. It is not easy, and moving takes effort, but if you can do it per server, or in small steps, if you can spread your infrastructure across data centers and different providers, you have the power.

It turns out you can. One central part of it is owning your network. Mesh VPN is the game-changing technology here.

  • It has no single point of failure
  • It has little impact on latency and performance
  • It is secure and encrypted
  • It scales really well

The solution I use is ZeroTier.
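
Once the agent is installed on a node, a quick sanity check looks like this (the join itself is covered in the cloud-init section below):

# Show this node's ZeroTier status and address
sudo zerotier-cli status

# List the networks this node has joined
sudo zerotier-cli listnetworks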

Pick a cloud provider. I recommend Hetzner and Scaleway: crazy value for money. Hetzner really impressed me with its stability and robustness. It's the new kid in town, with fewer features, but these Germans did a great engineering job, at a price that is difficult to compete against. Maybe fewer is more in the end.

The majority of things in this blog series are cloud-agnostic.

Provisioning servers

Terraform is really useful, and salt-cloud is promising.

But I have to say: with 10 servers you can still provision by hand.

I do like cloud-init. You could provision the saltstack minion agent on each server using cloud-init and manage everything else after that; check cloudinit.readthedocs.io/en/latest/topics/m..

This is the approach I will investigate, especially when it's time to scale:

#cloud-config
apt_sources:
  - source: "ppa:saltstack/salt"
packages:
  - salt-minion
package_upgrade: true
package_reboot_if_required: true
salt_minion:
  conf:
    master: xx.xx.xx.xxx

The piece I would like to add: since I'm paranoid about security, and since I want to own my IP addressing, I want the master IP in that conf to be on the ZeroTier network. That means provisioning ZeroTier with cloud-init as well, and then accepting the new client in the ZeroTier console.

runcmd:
  - curl -s 'https://pgp.mit.edu/pks/lookup?op=get&search=0x1657198823E52A61' | gpg --import && if z=$(curl -s 'https://install.zerotier.com/' | gpg); then echo "$z" | sudo bash; fi
  - zerotier-cli join xxxxxxxxx

This adds the manual step of accepting the new client in the ZeroTier console and then waiting for it to appear on the saltstack master (or rebooting).
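
If that manual click bothers you, ZeroTier Central also exposes an HTTP API, so the authorization can be scripted. A hedged sketch (the network and member IDs and the API token are placeholders; check the ZeroTier Central API docs for the current endpoint):

# Authorize a new member on the network via the ZeroTier Central API
# ZT_TOKEN is an API token from my.zerotier.com; both IDs are placeholders
curl -s -X POST "https://my.zerotier.com/api/network/xxxxxxxxx/member/yyyyyyyyyy" \
  -H "Authorization: bearer $ZT_TOKEN" \
  -d '{"config": {"authorized": true}}'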

Get a salt master server ready and we should be able to start.
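
If you don't have a master yet, the official bootstrap script can install one. A minimal sketch (as always, inspect a downloaded script before running it as root):

# Download the salt bootstrap script and install only the master
# -M installs the master, -N skips the minion on this host
curl -L https://bootstrap.saltstack.com -o bootstrap_salt.sh
sudo sh bootstrap_salt.sh -M -N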

The servers

My current infra is composed of the following servers (VMs):

  • salt - The salt master
  • atlas - Project management and wiki
  • gitlab - Git repository and tools (CI, issues, etc.). GitLab server installed
  • vault - Security vault and certificates
  • postgres - Postgresql server
  • master0 - Kubernetes master
  • master1 - Kubernetes master
  • master2 - Kubernetes master
  • worker0 - Kubernetes worker
  • worker1 - Kubernetes worker

Install the salt minion on all VMs with the help of cloud-init, Terraform, or salt-cloud; choose what works best for you.

In my case, the gitlab server was provisioned outside salt. I have also configured salt gitfs to point to my barebones git repository. This step is optional; you can write your infrastructure directly on the server.

This repository will have pillar and salt folders with configuration keys and recipes.

Pillar is designed to hold sensitive information, but I write the most sensitive files, like passwords, in the pillar folder on the salt server itself, not in Git. These sources are merged.
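
For reference, a minimal sketch of the gitfs part of the master configuration, with a hypothetical repository URL (keeping the roots backend enabled lets local files coexist with the git-backed ones):

# /etc/salt/master.d/gitfs.conf (sketch; the repository URL is a placeholder)
fileserver_backend:
  - gitfs
  - roots
gitfs_remotes:
  - https://gitlab.example.com/infra/barebones.git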

Note: My first goal was to get EVERYTHING on Kubernetes. Unfortunately, state is not a mature aspect of Kubernetes yet. There are solutions, but the ones that are not expensive are not production-ready. After trying hard and testing a lot of solutions, I settled on the more conventional approach of using traditional VMs for state-heavy applications (all the machines above except vault, which is easy to move to Kubernetes). That means the gitlab, postgres, atlas, and vault machines may not be necessary for a full Kubernetes solution.

I think "the state of state" :-) on Kubernetes is worth a post of its own.

Preparing the servers

Disclaimer: As I mentioned, I'm migrating from Ansible to SaltStack. With lots of things to do, this is a work in progress, so parts of these scripts will be in Ansible and parts in SaltStack. On the bright side, this makes it possible for you to compare the two a little.

If I had managed to migrate everything to salt, you would not be reading this.

Salt barebones git folder structure:

pillar
  backup.sls
  security.sls
  top.sls
salt
  common
  backup
  monitoring
  security
  top.sls

Inside the salt folder there are the folders common, backup, monitoring, and security. Each folder represents an aspect of your infra. The top.sls file includes all aspects and targets machines: when you apply all states with salt '*' state.apply, salt uses this file.

Common Things

These are things all servers have in common:

  • zip, vim, python, screen, sudo, curl packages installed. All packages upgraded to latest
  • minion grains. Using salt itself to deliver minion information to all hosts; this keeps server roles and locations checked into git, easy to change and maintain (I think it is a clever trick ;-)
  • certificates. The public certificate authority (custom certificate) deployed to all machines.

The common folder has:

files
deploy-certificate-authority-ca.sls
init.sls
minion-grains.sls
packages.sls

The files folder has your public root certificate.

The init.sls file includes everything else. If you run salt '*' state.apply common, this is the file that is executed inside the common folder.
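
A minimal sketch of what that init.sls could contain, assuming the state file names listed above:

# common/init.sls (sketch)
include:
  - common.packages
  - common.minion-grains
  - common.deploy-certificate-authority-ca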

I will not explain all the salt states, but let's see what packages.sls and minion-grains.sls are doing:

'install common packages':
  pkg.installed:
    - pkgs:
      - python
      - zip
      - vim
      - screen
      - sudo
      - curl

Very simple and self-explanatory: it ensures the list of packages is installed. The minion-grains.sls is a little more complex:

/etc/salt/minion.d/grains.conf:
  file.managed:
    - source: salt://common/files/minions_grains/{{grains.id}}.conf
    - user: root
    - group: root
salt-minion:
  service.running:
    - watch:
      - file: /etc/salt/minion.d/grains.conf

If you are wondering what minions and grains are: a minion is a machine controlled by salt, the machine with the salt agent. Grains are pieces of information about the minion, like the operating system, domain name, IP address, memory, and many others.

First, it copies the file files/minions_grains/id.conf, where id is the name of the minion, to /etc/salt/minion.d/grains.conf on the machine.

You can list all ids with the salt-key command, or with salt-run manage.allowed.

Then it restarts the salt-minion service if the file changes.

This file has custom information about the system.

For each minion in your system, you can create a file in the repository and put custom grains information in it. I use that to set system roles and server location.

Example file:

grains:
  roles:
    - confluence-server
    - jira-server
    - backup-client
  server_location:
    region: germany
    city: nuremberg
    provider: bestcloudproviderever
    continent: europe
    datacenter: xpto

Roles are used to target machines. For instance, all machines with the backup-client role will have the backup strategy applied.

But how does targeting work?

There are two ways:

  1. The salt command can target machines. For example, salt 'worker*' state.apply foo will target all machines whose ids start with worker (a grain-based example follows this list).
  2. The top.sls file in the salt folder can target machines, filtering which states get applied to which machines. That way you simply run salt '*' state.apply and all states that match your machines are executed.
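
Grain-based targeting also works directly from the command line; for example:

# Apply the backup state to every minion with the backup-client role
salt -G 'roles:backup-client' state.apply backup.install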

This is my top.sls file:

base:
  '*':
    - common
production:
  '*':
    - common
    - security
    - monitoring.netdata
  'roles:backup-client':
    - match: grain
    - backup.install
  'roles:etcd':
    - match: grain
    - backup.etcd

All machines in the production environment get the states common, security, and monitoring.netdata; all machines in production with the backup-client role also get the state backup.install.

There are a huge number of options for targeting machines; this is one of the great features of salt. Check it out.
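
As one more taste, matchers can be combined with compound targeting:

# Compound matching: grain AND glob in a single expression
salt -C 'G@roles:backup-client and worker*' test.ping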

For the next states and configurations, I will not explain how they work, only what they do.

The files will be provided on Github.

Hardening Security

  • Install fail2ban.
  • Change a few SSH settings.
  • Install and configure the vault-ssh-helper agent. This adds One-Time Passwords to all systems, authenticating via vault login, which in turn authenticates against LDAP. This is quite handy: even super admins don't need to know the root password. It's way better than a shared password or SSH key, and access can be revoked. You still need ultimate trust in those who access as root, but it is both user-friendly and a security improvement (see the configuration sketch after this list).
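
To make the last item more concrete, here is a sketch of the helper's configuration, with a hypothetical vault address and the default SSH secrets mount (the full setup also touches PAM and sshd; check the vault-ssh-helper docs):

# /etc/vault-ssh-helper.d/config.hcl (sketch; the address is a placeholder)
vault_addr = "https://vault.example.com:8200"
ssh_mount_point = "ssh"
ca_cert = "/etc/vault-ssh-helper.d/vault-ca.pem"
tls_skip_verify = false
allowed_roles = "*"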

Backup

All servers have a backup to offline storage. To achieve this, I install Restic on each machine and configure systemd to execute periodic backups.

The backups are fully encrypted and stored in an external cloud provider.

There is also a custom way to configure multiple backups (like etcd and the system itself) and to execute pre- and post-actions, like dumping databases; a sketch follows.
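
To make the mechanism concrete, here is a hedged sketch of such a systemd service/timer pair (the paths, environment file, and the database-dump pre-action are hypothetical placeholders, not my exact files):

# /etc/systemd/system/restic-backup.service (sketch)
[Unit]
Description=Restic backup

[Service]
Type=oneshot
# RESTIC_REPOSITORY and RESTIC_PASSWORD are kept in this file
EnvironmentFile=/etc/restic/restic.env
# Example pre-action: dump the database before backing it up
ExecStartPre=/usr/local/bin/dump-postgres.sh
ExecStart=/usr/bin/restic backup /etc /home /var/backups

# /etc/systemd/system/restic-backup.timer (sketch)
[Unit]
Description=Run restic backup daily

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target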

This is all deployed with salt, and it really shows the power of this kind of system, for a simple reason:

"I spend a lot of time tweaking the backup on one system, experimenting and running the state many times, once it gets perfect, in a matter of seconds all others servers got the same"

The same thing happened with monitoring, and every time I distribute a new state with one command it makes me think: oh yeah, this is fucking cool.

This is a subject for a separate post, don't you think?

Monitoring

We will start by simply using Netdata. This is not a complete monitoring solution, but we will build upon it as the blog series evolves.

I did not install it using the system package; instead, I used the script. There is a bug in the packaged installation that prevents plugins from working properly.
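
For reference, the script install was a one-liner at the time of writing (verify the URL against the current Netdata docs):

# Install Netdata via the official kickstart script
bash <(curl -Ss https://my-netdata.io/kickstart.sh)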

There is also a PostgreSQL plugin configuration.
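
A hedged sketch of that python.d plugin configuration, with placeholder credentials (the full set of keys is in Netdata's python.d/postgres.conf reference):

# /etc/netdata/python.d/postgres.conf (sketch; credentials are placeholders)
local:
  name: 'postgres'
  user: 'netdata'
  database: 'postgres'
  host: '127.0.0.1'
  port: 5432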

Putting everything together

You may have noticed that I have omitted information, and there is no step-by-step how-to. This is intentional.

To compensate for that, the repository github.com/giovannicandido/barebones has all the salt files. They are as generic as I could make them, so you can use them on any infra. Tested on Ubuntu Server 18.04.

Conclusion

Here is what we achieved:

  • We provisioned servers quickly
  • We installed common packages on all of them
  • We improved security
  • We set up backups
  • We set up initial monitoring
  • All these days (weeks, even months) of work can be redone with a single command

With these things in place, we can continuously manage and evolve our barebones infrastructure with almost the same ease no matter the number of servers (hundreds, or thousands).

I did not talk about installing PostgreSQL, vault, gitlab, or Kubernetes. We can for sure install and maintain all of this using salt, but I leave that to you; maybe you don't use these things.

Equally important, we can bring people on board, and they can quickly start contributing to and auditing all the work we have done. They don't necessarily need access to the production servers, as long as they can replicate the environment in QA or even on their local computers, if they have a proper system to work on (i7, 16GB, SSD as a minimum requirement; lots of VMs, dude). Pull requests can be used to control the application of infra changes. This can be automated.

Hmm... now you should get the phrase Infrastructure as Code: we are literally coding the state of our infra and applying it automatically. We can even use Git and pull requests to improve teamwork.

As for the Kubernetes installation, I will not use Ansible or Salt, because each Kubernetes distribution has its own provisioning tools; here we only prepared the terrain for the installation.

Any feedback is appreciated. We will keep going in the next posts:

  • We will install a highly available Kubernetes cluster.
  • We will install the Helm package manager on Kubernetes and use our certificates.
  • We will install ingress and external DNS and use Let's Encrypt with Kubernetes.
  • We will deploy our first service on Kubernetes and show it to the world.
  • We will tweak our backup to include the etcd system.
  • We will improve our monitoring solution.
  • We will design systems for scale and take advantage of Kubernetes.
  • We will migrate workloads to Kubernetes over the long term.

I plan to also do a quick post explaining the backup solution I put together.

I hope this post has given you a glimpse of what a DevOps culture can do for your organization.

I dream that, with enough resources, some organizations will be able to create self-service workflows, with interfaces through which any tech staff can provision, monitor, grow, decommission, scale, and bill all their multicloud/hybrid infrastructure, with a small team behind it. That may sound scary for those left in the dust; what we did here with limited resources already challenges the status quo of many organizations, and some inflated jobs with it. Don't be scared: embrace the change, or die without it :-)

That is all. Thanks.