This post introduces the Calvinverse project which provides the source code for the different resources required to create the infrastructure for a build pipeline. The Calvinverse resources have been developed for two main reasons:
- To provide me with a way to experiment with and learn more about immutable infrastructure and infrastructure-as-code as applied to build infrastructure.
- To provide resources that can be used to set up the infrastructure for a complete on-prem build system. The system should provide a build controller with build agents, artefact storage and all the necessary tools to monitor the different services and diagnose issues.
The Calvinverse resources can be configured and deployed for different sizes of infrastructure, from small setups only used by a few developers to a large setup used by many developers for the development of many products. How to configure the resources for small, medium or large environments and their hardware requirements will be discussed in future posts.
The resources in the Calvinverse project are designed to be run as a self-contained system. While daily maintenance is minimal it is not a hosted system so some maintenance is required. For instance OS updates will be required on a regular basis. These can either be applied to existing resources, through the automatic updates, or by applying the new updates to the templates and then replacing the existing resources with a new instance. The latter approach case can be automated, however there is no code in any of the Calvinverse repositories to do this automatically.
The different resources in the Calvinverse project contain a set of tools and applications which provide all the necessary capabilities to create the infrastructure for a build pipeline. Amongst these capabilities are service discovery, build execution, artefact storage, metrics, alerting and log processing.
The following applications and approaches are used for service discovery and configuration storage:
- Using Consul to provide service discovery and machine discovery via DNS inside an environment. An environment is defined as all machines that are part of a consul datacenter. It is possible to have multiple environments where the machines may all be on the same network but in general will not be communicating across environments. This is useful for cases where having multiple environments makes sense, for instance when having a production environment and a test environment. The benefit of using Consul as a DNS is that it allows a resource to have a consistent name across different environments without the DNS names clashing. For instance if there is a production environment and a test environment then it is possible to use the same DNS name for a resource, even though the actual machine names will be different. This allows using the Consul DNS name in tools and scripts without having to keep in mind the environment the tool is deployed in. Finally Consul is also used for the distributed key-value store that all applications can obtain configuration information from thereby centralizing the configuration information.
- Using one or more Vault instance to handle all the secrets required for the environment. Vault provides authenticated access for resources to securely access secrets, login credentials and other information that should be kept secure. This allows centralizing the storage and distribution of secrets
For the build work Calvinverse uses the following applications:
- Jenkins is used as the build controller.
- Build executors connect to jenkins using the swarm plugin so that agents can connect when it starts. In the Calvinverse project there are currently only Windows based executors.
For artefact storage Calvin verse uses the Nexus application. The image is configured such that a new instance of the image will create artefact repositories for NuGet, Npm, Docker and general ZIP artefacts.
For message distribution Calvinverse uses the RabbitMQ application. The image is configured such that a new instance of the image will try to connect to the existing cluster in the environment. If no cluster exists then the first instance of the image will form the start of the cluster in the environment.
Metrics, monitoring and alerting capabilities are provided by the Influx stack, consisting of:
- InfluxDB for metrics collection.
- Grafana for dashboards.
- Telegraf is installed on each resource to collect metrics and send them to InfluxDb.
- Chronograf for alert configurations.
- Kapacitor for alerting.
Build and system logs are processed by the Elastic stack consisting off:
- Elasticsearch for log storage, both system logs and build logs
- Kibana for log dashboards.
- Logs collected via syslog-ng on linux and a modified version of filebeat on windows. Logs are sent to rabbitmq to ensure that the unprocessed logs aren't lost when something any part of the log stack goes offline.
- Logstash for processing logs from RabbitMQ to Elasticsearch using filters
It should be noted that while the Calvinverse resources combine to create a complete build environment the resources might need some alterations to fit in with the work flow and processes that are being followed. After all each company is different and applies different workflows. Additionally users might want to replace some of the resources with versions of their own, e.g. to replace Influx with Prometheus.
Continuous integration (CI) systems originally and build pipelines recently have traditionally been available on-prem only with systems like Jenkins, TeamCity, Bamboo and TFS. This is possibly due to the fact that these systems needs relatively powerful hardware, mostly consisting of powerful CPU and fast IO, something which was not easily available in the cloud until the last few years.
However in the last few years a number of cloud based CI systems have appeared e.g. Azure DevOps, AppVeyor, CircleCi), CloudShip, Google cloud build and Travis CI. This has lead to the question of where to locate a CI system? Should it be on-prem or in the cloud or potentially even a combination of the two. This post should provide some suggestions on how to make the selection between the different options.
Cloud-based CI systems
As with other cloud systems when using a cloud based CI system the user gets the benefits of not having to worry about the underlying infrastructure and resources and having the ability to scale the CI system to the size required, provided one pays for the additional resources.
The other side of the coin is that because the user has no influence on the infrastructure of the CI system there is also no direct control over the hardware or the controller software. Thus the user cannot increase the hardware specs for the controller or the agents and the user cannot determine which plugins or capabilities are available in the CI system. As a side effect this also means that the user does not have access to the logs, metrics and file system for the underlying system, which provide information that may be useful when issues arise. In general the controller specific logs and metrics are only useful if you have access to the controller, however the build specific information is useful either for diagnostics or future planning.
Besides the CI part of the system in some cases the entire pipeline will require other resources, e.g artefact storage or test systems. Some cloud systems provide these additional systems as well, for a price of course. Other systems require that these additional resources are provided in some other way.
On-prem CI systems
When running the CI system on-prem one has to both provide and maintain the infrastructure, hardware and networking etc., and the controller and executor software. This increases the overhead for running a CI system. Additionally scaling the system either requires manual intervention or building the scaling capabilities.
On the other hand having control over the infrastructure means that the CI system can be configured so that it fits the use case for the development teams, the desired plugins installed, executors with all the right tools, full control over executor workspaces and with that the ability to lock down sensitive information. Additionally logs and metrics can be collected from everywhere which helps diagnostics, alerting and predictive capabilities on both the infrastructure side and the build capacity side.
Finally having full control over the CI system means that it is possible to extend the system if that is required with custom capabilities, either directly added to the CI system or as additional services. It should of course be noted that this requires resources and is thus not free.
Selecting a location for your CI system
So how does one select a location for a CI system. Both cloud and on-prem have pros and cons and in the end the location of the system depends very much on the situation fo the development team. If the team works for a company where there is no on-prem server infrastructure then a cloud based system will be the only sensible approach. However there will also be cases where an on-prem system is the only sensible option.
In order to decide for one system or the other the first thing that should be done is a cost comparison, comparing the total cost of ownership, i.e. initial purchasing costs, running costs, staff costs, training costs etc.. As part of the cost comparison the costs for additional parts of the system should also be included, e.g. artefact storage or test systems. One should also note that while cloud systems reduce maintenance, they are not maintenance free. The maintenance of the infrastructure disappears but the maintenance of the builds and the workflow does not, after all no matter where the build pipeline is located it is still important that it delivers the accuracy, performance, resilience and flexibility.
Once the cost comparison is done there are other things to bring into the decision process. Because while costs are important they are not the only reason to select one system or another. For instance other comparison elements could be related to regulations that specify how source needs to be treated or specific processes that should be followed, or the capabilities of the different CI systems for example is the ability to execute builds on a specific OS. Not all cloud CI systems provide executors for all the different OSes.
In the end the decision to select a cloud build system or a on-prem build system depends very strongly on the situation the company is in. It is even possible that as time progresses the best type of system may change from on-prem to cloud or visa versa. Both systems have their own advantages and disadvantages. In the end all that matters is that a system that fits the development process is selected, independent of what the different vendors say is the best thing.
In one of the post from a while ago we discussed what a software development pipeline is and what the most important characteristics are. Given that the pipeline is used during the large majority of the development, test and release process it is fair to say that for a software company the build and deployment pipeline infrastructure should be considered critical infrastructure because without it the development team will be more limited in their ability to perform their tasks. Note that at no stage should any specific tool, including the pipeline, be the single point of failure. More on how to reduce the dependency on CI systems and the pipeline will follow in another post.
Just like any other piece of infrastructure the development pipeline will need to be updated and improved on a regular basis, either to fix bugs, patch security issues or to add new features that will make the development team more productive. Because the pipeline falls in the cricital infrastructure category it is important to keep disturbances to a minimum while performing these changes. There are two main parts to providing (nearly) continuous service while still providing improvements and updates. The first is to ensure that the changes are tracked and tested properly, the second is to deploy the exact changes that were tested to the production system in a way that no or minimal interruptions occur. A sensible approach to the first part is to follow a solid software development process so that the changes are controlled, verified and monitored which can be achieved by creating infrastructure resources completely from information stored in source control, i.e. using infrastructure-as-code, making the resources as immutable as possible and performing automated tests on these resources after deploying them in a test environment using the same deployment process that will be used to deploy the resources to the production environment.
Using this approach should allow the creation of resources that are thoroughly tested and can be deployed in a sensible fashion. It should be noted that no amount of automated testing will guarantee that the new resources are free of any issues so it will always be important use deployment techniques that allow for example quick roll-back or staged roll-outs. Additionally deployed resources should be carefully monitored so that issues will be discovered quickly.
To achieve the goal of being able to deploy updates and improvements to the development infrastructure the following steps can be taken
- Using infrastructure-as-code to create new resource images each time a resource needs to be updated. Trying to create resources by hand will drastically reduce the ease at which they can be build and be made consistently. Resources that are deployed into an environment should never be changed. If bugs need to be fixed or new features need to be added then a new version of the resource image should be created, tested and deployed. That way changes can be tested before deployment and the configuration of the deployed resources will be known.
- Resources should be placed on virtual machines or in (Docker) containers. Both technologies provide an easy way to create one or more instances of the resource which is required in order to test or scale a service. The general idea is to have one resource per VM / container instance. One resource may contain multiple services or daemons but it always serves a single goal. Note that in some cases people will state that you should only use containers and not VMs but there are still cases where a VM works better, e.g. in some cases executing software builds works better in a VM or running a service that stores large quantities of data. Additionally if all or a large part of the infrastructure is running on VMs then using VMs might make more sense. In all cases the correct approach, container or VMs, is the one that makes sense for the environment the resources will be deployed into.
- Some way of getting configurations into the resource. Some configurations can be hard-coded into
the resource, if they are never expected to be changed. The draw-back of encoding a configuration
into a resource is that this configuration cannot be changed if the resource is used in different
environments, e.g. a test environment and a production environment. Configurations which are
different between environments should not be encoded in the resource since that may prevent the
resource from being deployed in a test environment for testing. Provisioning a resource requires
that you can apply all the environment specific information to a resource which is a difficult
problem to solve especially for the initial set of configurations, e.g. the configurations which
determine where to get the remaining configurations. Several options are:
- For VMs you can use DVD / ISO files that are linked on first start-up of the resource.
- Systems like consul-template can generate configurations from a distributed key-value store.
- Resources can be pull their own configurations from a shared store.
- For containers often environment variables are used. These might be sufficient but note that they are not secure, both inside the container and outside the container.
- Configurations that should be provided when a resource is provisioned should be stored
in source control, just like the resource code is, in order to be able to automate the verification
and delivery of the configuration values.
- The infrastructure should have it's own shared storage for configurations so that the ‘build’ process can push to the shared storage and configurations are distributed from there. That ensures that the build process doesn't need to know where to deliver exactly (which can change as the infrastructure changes). One option is to use SQL / no-SQL type storage (e.g. Elasticsearch), another option is to use a system like consul which has a distributed key-value store
- Automatic testing of a resource once it is deployed into an environment. For the very least the smoke tests should be run automatically when the resource is deployed to a test environment.
- Automatic deployments when a new resource becomes available or approved for an environment, for the very least to the test environment but ideally to all environments. Using the same deployment system for all environments is highly recommended because this allows testing the deployment process as well as the resource.
A general workflow for the creation of a new resource or to update a resource could be
- Update the code for the resource. This code can consist of Docker files, Chef or Puppet, scripts etc.. The most important thing is that the files are stored in source control and a sensible source control strategy is used.
- Once the changes are made a new resource can be created from the code.
- It is sensible to validate the sources using one or more suitable linters. Especially for infrastructure resources it is sensible to validate the sources before trying to create the resource because it potentially takes a long time to build a resource. Any errors that can be found sooner in the process will reduce the cycle time.
- Execute unit tests, e.g. ChefSpec, against the sources. Again, building a resource can take a long time so validation before trying to create the resource will reduce the cycle time.
- Actually create the new resource. For Docker containers this can be done from a docker file. For a VM this can be done with Packer. Building a VM will take longer than building a docker container in most cases. Note that building resources will in general take longer than building applications it is sensible to use the build / deployment pipeline to build the resources that make up the build / deployment pipeline. By using the pipeline it is possible to create the artefacts for the services and then use these artefacts to create the resource.
- Deploy the resource to a (small) test environment and execute the tests against the newly created resource.
- Once the tests have passed the newly made image can be ‘promoted’, i.e. approved for use in the production environment.
Using the approaches mentioned above it is possible to improve the development pipeline without causing unnecessary disturbances for the development team.
Edit: Changed the title from
software delivery pipeline to
software development pipeline to match
the other posts.
Over the last few years the use of build pipelines has been gaining traction backed by the ever growing use of Continuous Integration (CI) and Continuous Delivery and Deployment (CD) processes. By using a build pipeline the development team get benefits like being able to execute parts of the build, test, release and deployment processes in parallel, being able to restart the process part way through in case of environmental issues, and vastly improved feedback cycles which improve the velocity at which features can be delivered to the customer.
Most modern build systems have the ability to create a build pipelines in one form or another, e.g. VSTS / Azure Devops builds, Jenkins pipeline, GitLab, BitBucket and TeamCity. With these capabilities built into the build system it is easy for developers to quickly create a new pipeline from scratch. While this is quick and easy often the pipeline for a product is created by the development team without considering if this is the best way to achieve their goal, which is to deliver their product faster with higher quality. Before using the built-in pipeline capability in the build system the second question a development team should ask is when should one use this ability and when should one not use this ability? Obviously the first question is, do we need a pipeline at all, which is a question for another post.
The advantages of creating a pipeline in your build system are:
- It is easy to quickly create pipelines. Either the is a click and point UI of some form or the pipeline is defined by a, relatively, simple configuration file. This means that a development team can configure a new build pipeline quickly when one is desired.
- Pipelines created in a build system can often use multiple build executors or have a job move from one executor to another if different capabilities are required for a new step, for instance if different steps in the pipeline need different operating systems to be executed.
- In many cases, but not all, the build system provides a way for humans to interact with a running pipeline, for instance to approve the continuation of the pipeline in case of deployments or to mark a manual test phase as passed or failed.
- If the configuration of the pipeline is stored in a file it can generally be stored in a source control system, thus providing all the benefits of using a source control system. In these cases the build system can generally update the build configurations in response to a commit / push notification from the version control system. Thus ensuring that the active build configuration is always up to date.
- The development team has nearly complete control over the build configuration which ensures that it is easy for the development teams to have a pipeline that suits their needs.
Based on the advantages of having a pipeline in the build system it seems pretty straight forward to say that having the pipeline in the build system is a good thing. However as with all things there are also drawbacks to having the pipeline in the build system.
Having the pipeline in the build system makes some assumptions that may not be correct in certain cases.
- The first assumption is that the build system is the center of all the work being done because the pipeline is controlled by the build system, thus requiring that all actions feed back into said build system. This however shouldn't be a given, after all why would the build system be the core system and not the source control system or the issue tracker. In reality all systems are required to deliver high quality software. This means in most cases that none of these systems have enough knowledge by themselves to make decisions about the complete state of the pipeline. By making the assumption that the build system is at the core of the pipeline the result will be that the knowledge of the pipeline work flow will end up being encoded in the build configurations and the build scripts. For simple pipelines this is a sensible thing to do but as the pipeline gets more complex this approach will be sub-optimal at best and more likely detrimental due to the complexity of providing all users with the overview of how the pipeline functions.
- The second, but potentially more important, assumption is that the item the development teams care most about is ‘build’ or 'build job'. This however is not the case most of the time because a ‘build’ is just a way to create or alter an artefact, i.e. the package, container, installer etc.. It is artefacts that people care about most because artefacts are the carrying vehicle for the features and bug fixes that the customer cares about. From this perspective it makes sense to track the artefacts instead of builds because the artefact flows through the entire pipeline while builds are only part of the pipeline.
- A third assumption is that every task can somehow be run through the build system, but this is not always the case and even when it is possible it is not necessarily sensible. For instance builds and deploys are fundamentally different things, one should be repeatable (builds) and can just be stopped on failure and restarted if necesary and the other is often not exactly repeatable (because artefacts can only be moved from a location once etc.) and should often not just be stopped (but rolled-back or not ‘committed’). Another example is long running tests for which the results may be fed back into the build system if required but that doesn't necessarily make sense.
If the build system is the the center of the pipeline then that means that the build system has to start storing persistent data about the state of the pipeline with all the issues that come with this kind of data, for instance:
- The data stored in the pipeline is valuable to the development team both at the current time and in the future when the development team needs to determine where an artefact comes from. This means that the data potentially needs to be kept safe for much longer than build information is normally kept. In order to achieve this the standard data protection rules apply for instance access controls and backups.
- The information about the pipeline needs to be easily accessible and changable both by the build system and by systems external to the build system. It should be possible to add additional information, e.g. the versions / names of artefacts created by a build. The status of the artefact as it progresses through the pipeline etc.. All this information is important either during the pipeline process or after the artefacts have been delivered to the customer. Often build systems don't have this capability, they store just enough information that they can do what they need to do, and in general they are not database systems (and if they are it is recommended that you don't tinker with them and in general it is made difficult to append or add information).
- Build systems work much better if they are immutable, i.e. created from standard components (e.g. controller and agents) with automatically generated build jobs (more about the reasons both of these will follow in future posts). This allows a build system to be expanded or replaced really easily (cattle not pets even for build systems). That is much harder if the build system is the core of your pipeline and stores all the data for it.
Having the pipeline in the build system in general provides more control for the development teams, which is a great benefit, but less control for the administrators. Because the pipeline provides the development teams with all the abilities there is, in general, less ability for the admins to guide things in the right direction or to block developers from doing things that they shouldn't be doing or have access to. While this may seem to be a benefit for the developers, no more annoying admins getting in the way, it is in fact a drawback because this behaviour means that the developers take on the responsibility to administer some or all of the underlying build system. Examples of the change of control are for instance in the Jenkins pipeline it is possible for developers to use all the credentials that jenkins has access to. However this might not be desirable for high power credentials or credentials for highly restricted resources. An other example is that the selection of the build executor is done in the pipeline configuration, however in some cases it may make sense to limit access to executors, after all having a build that can migrate from node to node makes sense in some cases but it's not free. Further the ease with which parallel steps can be created will lead to many parallel jobs. This might be great for one pipeline but isn't necessarily the best for the overall system. In some cases serializing the steps for a single pipeline can lead to greater overall throughput if there are many different jobs for many different teams.
Based on all the advantages and disadvantages that are listed here it may be difficult to decide whether or not a development team should use the pipeline in their build system or not. In general it will be sensible to use the pipeline capabilities that are build into your build system in cases where you either have a fairly simple pipeline that is easy to reason about or where no external systems need to interact with the data in the pipeline.
Once the pipeline gets more complicated, external systems need access to the metadata describing the pipeline or the pipeline gets stages that are incompatible with being executed by a build system it will be time to migrate to a different approach to the build and deployment pipeline. In this case it is worth it to develop some custom software that tracks artefacts through the pipeline. This makes it possible to treat the pipeline system as the critical infrastructure that it is, with the appropriate separation of data and business rule processing, data security and controlled access to the data for external systems.
The fourth property to consider is flexibility, i.e. the ability of the pipeline to be able to be modified or adapted without requiring large changes to be made to the underlying pipeline code and services.
A pipeline should be flexible because the products being build, tested and deployed with that pipeline may require different workflows or processes in order for them to complete all the stages in the pipeline. For example building and packaging a library will require a different approach then building, testing and deploying a cloud service. Additionally the different stages in the pipeline will require different approaches, e.g. build steps will in general be executed by a build system returning the results in a synchronous way, however test steps might run on a different machine from the process that controls the test steps so those results might come back via an asynchronous route. Finally flexibility in the pipeline also improves resilience since in case of a disruption an adaptable or flexible pipeline will allow restoring services through alternate means.
Making a flexible pipeline is achieved in the same way flexibility is achieved in other software products, by using modular parts, standard inputs and outputs and carefully considered design. Some of the appropriate options are for instance:
- Split the pipeline into stages that take standard inputs and deliver standard outputs. There might be many different types of inputs and outputs but they should be known and easily shared between processes and applications. There can be one or more stages, e.g. build, test and deploy, which are dependent on each other only through their inputs and outputs. This allows adding more stages if required.
- Allow steps or stages in the pipeline to be started through a response to a standard notification. That allows each step to determine what information it needs to start execution. Additional information can be downloaded from the appropriate sources upon receiving a notification. This approach allows notifications to be generic while steps can still acquire the information they need to execute. Additionally having pipeline steps respond to notifications means that it is very easy to add new steps in the process because a new executor only has to be instantiated and connected to the message source, e.g. a distributed queue.
- If a stage consists of multiple, dependent steps, then it should be easy to add and remove steps based on the requirements. In these cases it would generally be preferred that a stage like this executes one or more scripts as they are easier to extend than services. As with the stages steps should ideally use well-known inputs and produce well-known outputs.
- Inputs for stages and steps are for instance
- Source information, e.g. a commit ID
- Artefacts, e.g. packages installers, zip files etc.
- Meta data, additional information attached to a given output or input, e.g. build or test results
- Outputs generated by stages and steps are for instance
Flexibility of the workflow is can further be improved by making sure that the artefacts generated in the pipeline are not created, tested and deployed in a single monolithic process even if the end result should be a single artefact. In many cases artefacts can be assembled from smaller components. Using this approach improves the workflow for the development teams because smaller components can be created much quicker and in general assembly of a larger piece from components is quicker and more flexible than regeneration of the entire piece from scratch. In many cases only a few components will be recreated which both saves time and allows much of the process to be executed in parallel.
The exact implementation of the pipeline determines how flexible and easy to extend it will be. Given that the use and implementation of the pipeline vary quite a lot it is hard to provide detailed implementation details, however some standard suggestions are:
- Keep the build part of the pipeline described in the scripts given that scripts are, in general, easier to adapt. By pulling the scripts from a package, e.g. a NuGet or NPM package, it is quick and easy to update to a later version of these scripts. An additional benefit of keeping the process in the scripts is that developers can execute the individual steps of the pipeline from their local machines. That allows them to ensure builds / tests work before pushing to the pipeline and provides a means of building things if the pipeline is not available.
- Any part of the process that cannot be done by a script, e.g. test systems, items that need services, e.g. certificate signing, which require that the certificates are present on the current machine, something which might not be possible to do on every machine etc., should have a service that is available to both the pipeline and the developers executing the scripts locally. For any services that should only be provided to the build server, e.g. signing, the scripts should allow skipping the steps that need the service.
- For stages that execute scripts, e.g. the build stage, jobs can be automatically generated from information stored in source control. This makes it easy to update the actions executed by these stages without requiring developers to perform the configuration manually.
As a final note one should consider how the pipeline will be described. It is easier to reason about a pipeline if the entire description of that pipeline is stored in a single file, ideally in source control. However as the pipeline evolves and more steps and stages are executed in parallel it will become increasingly difficult to capture the entire pipeline in a single file. While harder to reason about it is in the end simpler and more flexible to let the pipeline layout, as in the stages, steps and orders of these items, be determined by the executors that are available and listening for notifications. That way it's easy to change the layout of the pipeline.
And with that we have come to the end of this journey into the guiding principles of designing a build and release pipeline. There are of course many additions that can be made with regards to the general design process and even more additions for specific use cases. Those however will have to wait until another post.
- December 3rd 2018: Fixed a typo in the post title