One of the final chapters in the description of the development pipeline deals with security. In this case I specifically mean the security of the pipeline and the underlying infrastructure, not the security of the applications which are created using the pipeline.
The first question is: why should you care about the security of the pipeline? After all, developers use the development pipelines via secured networks and their access permissions are set at the source control level. Additionally, high trust levels exist between the pipeline processes, the infrastructure and the source repository. In general this leads to the security of the pipeline being placed lower on the priority list.
Which issues could you run into if you deem the security of the pipeline less critical? One argument comes from pen tests, which show that CI/CD systems are a great way into corporate networks. Additionally there have been a number of attacks aimed at distributing malicious code through trusted software packages. These so-called supply chain attacks try to compromise the user by inserting malicious code into third-party dependencies, i.e. the source code supply chain.
In essence the problem comes down to the fact that the build pipeline and its associated infrastructure have access to many different systems and resources which are normally not easily accessible to their users. This makes your pipeline a target for malicious actors, who could abuse some of the following situations for their own purposes.
- The development pipeline runs all tasks with the same user account on the executors, and thereby with the same permissions. Obviously the worst case scenario is running as an administrator.
- Multiple pipeline invocations executed on a single machine, either in parallel or sequentially, which allows a task in one pipeline to access the workspace of another pipeline. This ability can for instance be used to bypass access controls on source code.
- Downloading packages directly from external package repositories, e.g. NPM or Docker.
- Direct access to the internet, which allows downloading of malicious code and uploading of artefacts to undesired locations.
- The development pipeline has the ability to update or overwrite existing artefacts.
- The executors have direct access to resources that normal pipeline users don't have access to. This is especially problematic if the same infrastructure is used both to build artefacts and to deploy them to the production environment.
One of the problems with securing the development pipeline is that all the actions mentioned above are in one way or another required for the pipeline to function; after all, the pipeline needs to be able to build and distribute artefacts. The follow-up question then becomes: can you distinguish between normal use and malicious use?
It turns out that making this distinction is difficult because both forms of use are essentially the same: they both use the development pipeline for its intended purpose. So, in order to prevent malicious use, put up as many barriers to malicious use as possible, also known as defence in depth. The following are a number of possible ways to add barriers:
- Grant the minimum permissions possible to the executors, both on the executor itself and from the executor to external resources. It is better to run the pipeline actions as a local user on the executor rather than as a domain user, and to grant permissions on a specific resource only to the action that interacts with that resource.
- Execute a single pipeline per executor and never reuse the executor.
- Limit network connections to and from executors. In general executors do not need internet access, save for a few pre-approved sites, e.g. an artefact storage. There is also very little reason for executors to connect to each other, especially if executors are short lived.
- Pull packages, e.g. NPM or Docker, only from an internal feed. Additions to the internal feed are made after the specific package has been reviewed.
- The artefacts created with the pipeline should be tracked so that you know the origin, creation time, storage locations and other data that can help identify an exact instance of an artefact. Under ideal circumstances you would also know exactly which sources and packages were used to create the artefact.
- Artefacts should be immutable and should never be allowed to be overwritten.
- Do not use the executors that perform builds for deployments; use a separate set of executors that only have deployment permissions but no permissions to source control etc.
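As an illustration of the immutability barrier, here is a minimal sketch (in-memory, with hypothetical names) of an artefact store that refuses to overwrite a published version:

```python
class ArtefactStore:
    """In-memory stand-in for an artefact repository that enforces immutability."""

    def __init__(self):
        self._artefacts = {}

    def publish(self, name, version, payload):
        key = (name, version)
        if key in self._artefacts:
            # A published artefact version is immutable: refuse the overwrite.
            raise PermissionError(
                f"{name}:{version} already exists and cannot be overwritten")
        self._artefacts[key] = payload

    def fetch(self, name, version):
        return self._artefacts[(name, version)]
```

Real artefact repositories typically offer an equivalent policy, e.g. disabling redeployment for release repositories, so the barrier can be enforced server-side rather than by convention.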
Beyond these changes there are many other ways to reduce the attack surface as documented in the security literature. In the end the goal of this post is more to point out that the security of the development pipeline is important, rather than providing ways to make a pipeline more secure. The exact solutions for pipeline security depend very heavily on the way the pipeline is constructed and what other forms of security validation have been placed around the pipeline.
In the last post I explained a few ways to improve the development pipeline infrastructure while keeping downtime minimal. One important consideration for the resilience of the pipeline is reducing the dependencies between the pipeline and the infrastructure.
So what does unwanted coupling between the pipeline and the infrastructure mean? After all the pipeline code makes assumptions about the capabilities of the infrastructure it runs on, just like every other piece of code. For development pipelines examples of unwanted dependencies are:
- The pipeline is created by using tasks specific to a CI/CD system which means that the artefacts can only be created on the infrastructure
- Pipeline stages expect the outputs from previous stages to be available on the executor, or worse, they expect the outputs from other pipelines to be on the executor
- A pipeline task assumes that certain tools have been installed on the executors
- A pipeline task assumes that it has specific permissions on an executor
The first two items impact the velocity at which development can take place. By relying completely on the CI/CD system for artefact creation, cycle times increase. Additionally, making changes to the pipeline often requires executing it multiple times to debug it.
The second set of items relates to the ease of switching tooling, e.g. when changing vendors or during disaster recovery. These may or may not be of concern depending on the direction of development and technology. In my experience vendor lock-in is something to keep in mind, if for no other reason than that switching vendors can be prohibitively complicated if pipeline processes are too tightly coupled to the CI/CD system.
If any of the issues mentioned are concerning to you then partially or completely decoupling the development pipeline from the infrastructure will be a worthwhile exercise. This can be achieved with some of the following steps.
Versions for everybody
Ensure that all code, tools, scripts and resources used in the pipeline are versioned. That way you know exactly what is required for a specific pipeline execution, which makes executions repeatable in the future when newer versions of tools and resources have been released.
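A sketch of what this could look like: a manifest, kept in source control, that pins every tool and script a pipeline execution depends on. All names and versions here are hypothetical.

```python
# Hypothetical manifest, stored in source control next to the pipeline code.
MANIFEST = {
    "tools": {"packer": "1.9.4", "terraform": "1.6.2"},
    "scripts": {"build": "git:3f9a2c1"},
}

def required_version(manifest, kind, name):
    """Look up the pinned version; a missing entry is an error, not a silent default."""
    try:
        return manifest[kind][name]
    except KeyError:
        raise KeyError(f"{kind}/{name} is not pinned in the manifest") from None
```

Making an unpinned dependency a hard error, rather than falling back to "latest", is what keeps old executions reproducible.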
Only the workspace is yours
Pipeline actions should work only in their own workspace. Any code, scripts or tools they require are put in the workspace. This reduces scattering of data during the execution of a pipeline and reduces pollution of the executors.
Make your pipeline actions assume they will run on a bare minimum runtime. This will ensure that your pipelines obtain their preferred version of the tools they need and install them in their own workspace. The benefit of doing this is that a pipeline can be executed on any available executor, either in your CI/CD system or on a developer machine, as it makes no assumptions about the capabilities of an executor.
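A minimal sketch of this bootstrap step, assuming a caller-supplied download function; the paths and names are illustrative:

```python
from pathlib import Path

def tool_path(workspace, name, version):
    """Tools live inside the workspace, keyed by version, so nothing leaks onto the executor."""
    return Path(workspace) / ".tools" / name / version

def ensure_tool(workspace, name, version, fetch):
    """Install the pinned tool into the workspace if it is not already there.

    `fetch` is a hypothetical caller-supplied function that downloads the
    requested tool version into the given directory.
    """
    target = tool_path(workspace, name, version)
    if not target.exists():
        target.mkdir(parents=True)
        fetch(name, version, target)
    return target
```

Because everything lands under the workspace, deleting the workspace removes every trace of the run, which is exactly the "no pollution" property described above.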
Single use executors are cleaner
Configure the CI/CD system to use a clean executor for every job. This will ensure that changes made during a previous job don't interfere with the current job. Additionally it will enforce the use of immutable infrastructure for the executors, thus allowing versioning of the executors.
Use the CI/CD system as a task executor
Keep the configuration for the jobs in the CI/CD system to a minimum. Ideally the entire configuration is the execution of a script or tool with a simple set of arguments. By reducing the job of the CI/CD system to executing a script or tool, it becomes simple to execute the pipeline actions somewhere else, e.g. on a developer machine or on a different CI/CD system.
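A sketch of how thin such a job can become; the `pipeline/` directory of scripts inside the workspace is an assumption for illustration:

```python
import sys

def step_command(step, workspace):
    """Build the single command a CI job needs to run.

    Because the job is just 'run this script with these arguments', the
    same command works on a developer machine or on a different CI system.
    """
    return [sys.executable, f"{workspace}/pipeline/{step}.py",
            "--workspace", workspace]
```

The CI job definition then reduces to invoking this command, e.g. `subprocess.run(step_command("build", ws), check=True)`, and nothing about the pipeline logic lives in the CI system itself.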
Treat pipeline services as critical
All services used by the pipeline should be treated as mission critical systems; after all, a failure in one of these systems can stop all your pipelines from executing. So reduce the number of services you use in your pipeline and improve the resilience of the services you have to rely on using the standard approaches.
This post introduces the Calvinverse project which provides the source code for the different resources required to create the infrastructure for a build pipeline. The Calvinverse resources have been developed for two main reasons:
- To provide me with a way to experiment with and learn more about immutable infrastructure and infrastructure-as-code as applied to build infrastructure.
- To provide resources that can be used to set up the infrastructure for a complete on-prem build system. The system should provide a build controller with build agents, artefact storage and all the necessary tools to monitor the different services and diagnose issues.
The Calvinverse resources can be configured and deployed for different sizes of infrastructure, from small setups only used by a few developers to a large setup used by many developers for the development of many products. How to configure the resources for small, medium or large environments and their hardware requirements will be discussed in future posts.
The resources in the Calvinverse project are designed to be run as a self-contained system. While daily maintenance is minimal, it is not a hosted system, so some maintenance is required. For instance OS updates will be required on a regular basis. These can either be applied to existing resources through automatic updates, or by applying the updates to the templates and then replacing each existing resource with a new instance. The latter approach can be automated; however, there is currently no code in any of the Calvinverse repositories to do this automatically.
The different resources in the Calvinverse project contain a set of tools and applications which provide all the necessary capabilities to create the infrastructure for a build pipeline. Amongst these capabilities are service discovery, build execution, artefact storage, metrics, alerting and log processing.
The following applications and approaches are used for service discovery and configuration storage:
- Using Consul to provide service discovery and machine discovery via DNS inside an environment. An environment is defined as all machines that are part of a single Consul datacenter. It is possible to have multiple environments whose machines may all be on the same network but in general will not communicate across environments; this is useful for instance when having both a production environment and a test environment. The benefit of using Consul for DNS is that a resource can have a consistent name across different environments without the DNS names clashing: even though the actual machine names differ between production and test, the same DNS name can be used for the resource in both. This allows using the Consul DNS name in tools and scripts without having to keep in mind which environment the tool is deployed in. Finally, Consul also provides the distributed key-value store from which all applications can obtain configuration information, thereby centralizing the configuration information.
- Using one or more Vault instances to handle all the secrets required for the environment. Vault provides authenticated access for resources to securely obtain secrets, login credentials and other information that should be kept secure. This allows centralizing the storage and distribution of secrets.
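The naming benefit can be illustrated with a small stand-in for Consul's DNS view; the service name and addresses below are made up:

```python
# One logical name per service; only the address differs per environment
# (i.e. per Consul datacenter).
CATALOG = {
    "production": {"artefacts.service.consul": "10.0.1.15"},
    "test": {"artefacts.service.consul": "10.1.1.15"},
}

def resolve(environment, name):
    """Resolve a Consul-style DNS name within a given environment."""
    return CATALOG[environment][name]
```

A script deployed into either environment can simply address `artefacts.service.consul` without knowing which environment it runs in; the environment's own DNS resolves it to the right machine.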
For the build work Calvinverse uses the following applications:
- Jenkins is used as the build controller.
- Build executors connect to Jenkins using the swarm plugin, so that agents can register themselves with the controller when they start. In the Calvinverse project there are currently only Windows based executors.
For artefact storage Calvinverse uses the Nexus application. The image is configured such that a new instance will create artefact repositories for NuGet, NPM, Docker and general ZIP artefacts.
For message distribution Calvinverse uses the RabbitMQ application. The image is configured such that a new instance of the image will try to connect to the existing cluster in the environment. If no cluster exists then the first instance of the image will form the start of the cluster in the environment.
Metrics, monitoring and alerting capabilities are provided by the Influx stack, consisting of:
- InfluxDB for metrics collection.
- Grafana for dashboards.
- Telegraf, installed on each resource, to collect metrics and send them to InfluxDB.
- Chronograf for alert configurations.
- Kapacitor for alerting.
Build and system logs are processed by the Elastic stack, consisting of:
- Elasticsearch for log storage, both system logs and build logs.
- Kibana for log dashboards.
- syslog-ng on Linux and a modified version of Filebeat on Windows to collect logs. Logs are sent to RabbitMQ to ensure that unprocessed logs aren't lost when any part of the log stack goes offline.
- Logstash for processing logs from RabbitMQ to Elasticsearch using filters.
It should be noted that while the Calvinverse resources combine to create a complete build environment, the resources might need some alterations to fit in with the workflow and processes that are being followed. After all, each company is different and applies different workflows. Additionally users might want to replace some of the resources with versions of their own, e.g. replacing Influx with Prometheus.
Continuous integration (CI) systems, and more recently build pipelines, have traditionally been available on-prem only, with systems like Jenkins, TeamCity, Bamboo and TFS. This is possibly due to the fact that these systems need relatively powerful hardware, mostly consisting of a powerful CPU and fast IO, something which was not easily available in the cloud until the last few years.
However, in the last few years a number of cloud based CI systems have appeared, e.g. Azure DevOps, AppVeyor, CircleCI, CloudShip, Google Cloud Build and Travis CI. This has led to the question of where to locate a CI system: should it be on-prem, in the cloud, or potentially even a combination of the two? This post provides some suggestions on how to make the selection between the different options.
Cloud-based CI systems
As with other cloud systems, when using a cloud based CI system the user gets the benefit of not having to worry about the underlying infrastructure and resources, and the ability to scale the CI system to the size required, provided one pays for the additional resources.
The other side of the coin is that because the user has no influence on the infrastructure of the CI system, there is also no direct control over the hardware or the controller software. The user cannot increase the hardware specs for the controller or the agents, and cannot determine which plugins or capabilities are available in the CI system. As a side effect this also means that the user does not have access to the logs, metrics and file system of the underlying system, which provide information that may be useful when issues arise. In general the controller specific logs and metrics are only useful if you have access to the controller, but the build specific information is useful either for diagnostics or for future planning.
Besides the CI part of the system, in some cases the entire pipeline will require other resources, e.g. artefact storage or test systems. Some cloud systems provide these additional systems as well, for a price of course; other systems require that these additional resources are provided in some other way.
On-prem CI systems
When running the CI system on-prem one has to both provide and maintain the infrastructure (hardware, networking etc.) and the controller and executor software. This increases the overhead of running a CI system. Additionally, scaling the system either requires manual intervention or building the scaling capabilities yourself.
On the other hand having control over the infrastructure means that the CI system can be configured so that it fits the use case for the development teams, the desired plugins installed, executors with all the right tools, full control over executor workspaces and with that the ability to lock down sensitive information. Additionally logs and metrics can be collected from everywhere which helps diagnostics, alerting and predictive capabilities on both the infrastructure side and the build capacity side.
Finally having full control over the CI system means that it is possible to extend the system if that is required with custom capabilities, either directly added to the CI system or as additional services. It should of course be noted that this requires resources and is thus not free.
Selecting a location for your CI system
So how does one select a location for a CI system? Both cloud and on-prem have pros and cons, and in the end the location of the system depends very much on the situation of the development team. If the team works for a company with no on-prem server infrastructure then a cloud based system will be the only sensible approach. However, there will also be cases where an on-prem system is the only sensible option.
In order to decide for one system or the other, the first thing that should be done is a cost comparison, comparing the total cost of ownership, i.e. initial purchasing costs, running costs, staff costs, training costs etc. As part of the cost comparison, the costs for additional parts of the system should also be included, e.g. artefact storage or test systems. One should also note that while cloud systems reduce maintenance, they are not maintenance free: the maintenance of the infrastructure disappears, but the maintenance of the builds and the workflow does not. After all, no matter where the build pipeline is located, it is still important that it delivers accuracy, performance, resilience and flexibility.
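As a sketch of such a comparison, with entirely made-up figures:

```python
def total_cost_of_ownership(initial, yearly_running, yearly_staff,
                            yearly_training, years):
    """Total cost over a planning horizon: one-off purchase plus recurring costs."""
    return initial + years * (yearly_running + yearly_staff + yearly_training)

# Illustrative numbers only: on-prem trades higher up-front and staff costs
# against lower recurring service fees; cloud is the reverse.
onprem = total_cost_of_ownership(50_000, 10_000, 30_000, 5_000, years=5)
cloud = total_cost_of_ownership(0, 40_000, 15_000, 2_000, years=5)
```

The point of the exercise is not the exact numbers but that all cost categories, including the ones that do not disappear in the cloud, appear on both sides of the comparison.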
Once the cost comparison is done there are other things to bring into the decision process, because while costs are important they are not the only reason to select one system or another. Other comparison elements could be related to regulations that specify how source code needs to be treated, specific processes that should be followed, or the capabilities of the different CI systems, for example the ability to execute builds on a specific OS. Not all cloud CI systems provide executors for all the different OSes.
In the end the decision to select a cloud build system or an on-prem build system depends very strongly on the situation the company is in. It is even possible that, as time progresses, the best type of system changes from on-prem to cloud or vice versa. Both systems have their own advantages and disadvantages. In the end all that matters is that a system that fits the development process is selected, independent of what the different vendors say is the best thing.
In one of the posts from a while ago we discussed what a software development pipeline is and what its most important characteristics are. Given that the pipeline is used during the large majority of the development, test and release process, it is fair to say that for a software company the build and deployment pipeline infrastructure should be considered critical infrastructure, because without it the development team will be more limited in their ability to perform their tasks. Note that at no stage should any specific tool, including the pipeline, be a single point of failure. More on how to reduce the dependency on CI systems and the pipeline will follow in another post.
Just like any other piece of infrastructure, the development pipeline will need to be updated and improved on a regular basis, either to fix bugs, patch security issues or add new features that will make the development team more productive. Because the pipeline falls in the critical infrastructure category, it is important to keep disturbances to a minimum while performing these changes. There are two main parts to providing (nearly) continuous service while still delivering improvements and updates. The first is to ensure that the changes are tracked and tested properly; the second is to deploy the exact changes that were tested to the production system in a way that causes no or minimal interruptions. A sensible approach to the first part is to follow a solid software development process so that the changes are controlled, verified and monitored. This can be achieved by creating infrastructure resources completely from information stored in source control, i.e. using infrastructure-as-code, making the resources as immutable as possible, and performing automated tests on these resources after deploying them in a test environment using the same deployment process that will be used for the production environment.
Using this approach should allow the creation of resources that are thoroughly tested and can be deployed in a sensible fashion. It should be noted that no amount of automated testing will guarantee that the new resources are free of issues, so it will always be important to use deployment techniques that allow, for example, quick roll-backs or staged roll-outs. Additionally, deployed resources should be carefully monitored so that issues are discovered quickly.
To achieve the goal of being able to deploy updates and improvements to the development infrastructure, the following steps can be taken:
- Using infrastructure-as-code to create new resource images each time a resource needs to be updated. Trying to create resources by hand drastically reduces the ease with which they can be built consistently. Resources that are deployed into an environment should never be changed; if bugs need to be fixed or new features need to be added, then a new version of the resource image should be created, tested and deployed. That way changes can be tested before deployment and the configuration of the deployed resources will be known.
- Resources should be placed on virtual machines or in (Docker) containers. Both technologies provide an easy way to create one or more instances of the resource, which is required in order to test or scale a service. The general idea is to have one resource per VM / container instance; one resource may contain multiple services or daemons, but it always serves a single goal. Note that in some cases people will state that you should only use containers and not VMs, but there are still cases where a VM works better, e.g. executing software builds or running a service that stores large quantities of data. Additionally, if all or a large part of the infrastructure is running on VMs then using VMs might make more sense. In all cases the correct approach, containers or VMs, is the one that makes sense for the environment the resources will be deployed into.
- Some way of getting configurations into the resource. Some configurations can be hard-coded into the resource if they are never expected to change. The drawback of encoding a configuration into a resource is that it cannot be changed if the resource is used in different environments, e.g. a test environment and a production environment. Configurations which differ between environments should not be encoded in the resource, since that may prevent the resource from being deployed in a test environment for testing. Provisioning a resource requires that you can apply all the environment specific information to a resource, which is a difficult problem to solve, especially for the initial set of configurations, e.g. the configurations which determine where to get the remaining configurations. Several options are:
- For VMs you can use DVD / ISO files that are linked on first start-up of the resource.
- Systems like consul-template can generate configurations from a distributed key-value store.
- Resources can pull their own configurations from a shared store.
- For containers, environment variables are often used. These might be sufficient, but note that they are not secure, either inside or outside the container.
- Configurations that should be provided when a resource is provisioned should be stored in source control, just like the resource code, in order to be able to automate the verification and delivery of the configuration values.
- The infrastructure should have its own shared storage for configurations so that the ‘build’ process can push to the shared storage and configurations are distributed from there. That ensures that the build process doesn't need to know exactly where to deliver configurations (which can change as the infrastructure changes). One option is to use SQL / no-SQL type storage (e.g. Elasticsearch); another is to use a system like Consul, which has a distributed key-value store.
- Automatic testing of a resource once it is deployed into an environment. At the very least smoke tests should be run automatically when the resource is deployed to a test environment.
- Automatic deployments when a new resource becomes available or is approved for an environment, at the very least to the test environment but ideally to all environments. Using the same deployment system for all environments is highly recommended, because this allows testing the deployment process as well as the resource.
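The configuration options above can be sketched as a resource that merges baked-in defaults with environment-specific values pulled from a shared key-value store; here a plain dict stands in for a distributed store such as Consul KV, and all keys are hypothetical:

```python
# Safe to bake into the resource image: identical in every environment.
BAKED_IN = {"log.level": "info"}

def load_config(kv_store, environment, keys):
    """Merge baked-in defaults with environment-specific values.

    `kv_store` is any mapping with a `get` method; in practice this could
    be a distributed key-value store such as Consul KV.
    """
    config = dict(BAKED_IN)
    for key in keys:
        value = kv_store.get(f"{environment}/{key}")
        if value is not None:
            config[key] = value
    return config
```

Because the environment name is the only input that differs between test and production, the same resource image can be deployed unchanged to both, which is the property the bullet points above are after.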
A general workflow for the creation of a new resource or for updating an existing resource could be:
- Update the code for the resource. This code can consist of Docker files, Chef or Puppet code, scripts etc. The most important thing is that the files are stored in source control and a sensible source control strategy is used.
- Once the changes are made a new resource can be created from the code.
- It is sensible to validate the sources using one or more suitable linters. Especially for infrastructure resources it is sensible to validate the sources before trying to create the resource because it potentially takes a long time to build a resource. Any errors that can be found sooner in the process will reduce the cycle time.
- Execute unit tests, e.g. ChefSpec, against the sources. Again, building a resource can take a long time so validation before trying to create the resource will reduce the cycle time.
- Actually create the new resource. For Docker containers this can be done from a Docker file; for a VM this can be done with Packer. Building a VM will in most cases take longer than building a Docker container. Since building resources will in general take longer than building applications, it is sensible to use the build / deployment pipeline to build the resources that make up that same pipeline. By using the pipeline it is possible to create the artefacts for the services and then use these artefacts to create the resource.
- Deploy the resource to a (small) test environment and execute the tests against the newly created resource.
- Once the tests have passed the newly made image can be ‘promoted’, i.e. approved for use in the production environment.
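The workflow above can be sketched as a fail-fast sequence of stages, ordered so that the cheap checks (linting, unit tests) run before the slow image build:

```python
def run_resource_pipeline(stages):
    """Run (name, action) stages in order, stopping at the first failure.

    Each action returns True on success. Returns the names of the completed
    stages and the name of the failed stage (None if everything passed).
    """
    completed = []
    for name, action in stages:
        if not action():
            return completed, name
        completed.append(name)
    return completed, None
```

A caller would wire in the actual steps, e.g. `[("lint", lint), ("unit-test", unit_test), ("build-image", build_image), ("deploy-test", deploy_test), ("smoke-test", smoke_test), ("promote", promote)]`, where each function wraps the tool for that step; stopping at the first failure is what keeps the cycle time down when the slow image build has not yet started.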
Using the approaches mentioned above it is possible to improve the development pipeline without causing unnecessary disturbances for the development team.
Edit: Changed the title from software delivery pipeline to software development pipeline to match the other posts.