Data silos. Cutting corners. A lack of transparency between devs and DevOps.
These are some of the common mistakes dev teams make when working in their staging environments, according to three Chicago technical professionals with 40 years of combined development and IT experience.
Below, you’ll hear staging environment best practices from a site reliability engineer at recruiting platform Hireology; a DevOps engineer at real estate tech company Neighborhoods.com; and the director of cloud operations at e-commerce site Zoro.
All three agree that teams should treat their staging environments the same as their production sites and should avoid using shortcuts. Processes for deployment, monitoring and maintenance between the two areas should be the same to eliminate complexity, maintain accuracy and reduce errors in production, they said.
To keep work consistent, their teams host staging environments on the same Kubernetes cluster as their production locales. They also rely on tools like Prometheus to monitor their infrastructure and ensure dependability from staging to live deployment.
Site Reliability Engineer Andre Joseph said technical teams at recruiting platform Hireology value making deployments easier for all stakeholders. They pursue that goal by breaking down silos across engineering units and sharing processes across staging and production environments to decrease complexity.
Hireology’s staging best practices: Our team strives to keep our staging environment as close to production as possible. This idea is important because it allows us to deploy to production with greater confidence. The closer our staging environment is to production, the more likely it is that problems with the code or the infrastructure will show up early. Engineers need to be confident that after their code passes staging there will be no problems with deployment to production.
How the team maintains staging environments: We monitor our staging and QA environments the same way we monitor production. By monitoring all of our environments in the same way, we reduce operational complexity and make things easier on ourselves.
Our monitoring practices include leveraging metrics and health checks as well as log collection for additional evaluation. And all of our observability user interfaces support filtering by environment. So when we troubleshoot a problem with QA or staging, we’re using the same tools, interfaces and data we would use in production. This philosophy makes troubleshooting issues less intimidating and more democratic.
We use Grafana and Prometheus to monitor our infrastructure. The data we collect from these tools is analyzed to find possible issues in our staging environment. Monitoring allows us to be proactive and these tools help us resolve potential issues quickly so engineers can be productive.
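As an illustration of what environment-aware monitoring can look like, the hypothetical helper below builds the same PromQL expression for any environment, so staging and production are queried with identical tooling and differ only by a label value. The `environment` label, the `http_requests_total` metric and the service names are assumptions for the sketch, not Hireology's actual metric schema.

```python
# Hypothetical helper: build one PromQL query template that works for any
# environment, so staging, QA and production dashboards share the same logic.
def error_rate_query(environment: str, service: str, window: str = "5m") -> str:
    """Return a PromQL expression for a service's HTTP 5xx error rate.

    The `environment` label name is an assumption about how metrics are
    tagged; the real label names at Hireology are not public.
    """
    errors = f'{{environment="{environment}",service="{service}",status=~"5.."}}'
    total = f'{{environment="{environment}",service="{service}"}}'
    return (
        f"sum(rate(http_requests_total{errors}[{window}])) / "
        f"sum(rate(http_requests_total{total}[{window}]))"
    )
```

Calling `error_rate_query("staging", "checkout")` and `error_rate_query("production", "checkout")` yields the same query shape, which is the point: troubleshooting staging uses the same data and interfaces as production.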
Common mistakes, and fixes, for dev teams: Oftentimes, engineering teams are siloed by specialty, with developers having little insight into the infrastructure on which their code runs. We work hard to make our infrastructure transparent to the rest of our engineering team and encourage them to actively participate in troubleshooting problems. This transparency helps ensure that institutional knowledge is shared across teams. It makes us better operators and helps our developers write better code.
Having a central resource the engineers can access to locate key information is a lifesaver. Failing to keep documentation up to date can cause major problems. We regularly revisit our process documentation to ensure that engineers have the resources they need to get the job done.
DevOps Engineer Rich Jackson said developers can run into issues when they treat the staging environment differently than production. So his dev team at real estate resource provider Neighborhoods.com is careful to stick to its best practices no matter what environment they’re working in.
Neighborhoods.com’s staging best practices: The point of having a staging environment is to mimic production as much as possible, so it’s imperative we follow our proven processes diligently. We are really careful about how we define service configuration, which is why we use a singular deployable artifact across environments to minimize configuration differences. In our case, that’s Docker images, but it could also be virtual machine images or Linux distribution packages.
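One common way to realize the single-artifact idea is to ship the exact same image everywhere and let a single environment variable select which settings apply. The sketch below assumes an `ENVIRONMENT` variable and illustrative hostnames; Neighborhoods.com's actual configuration mechanism is not specified in this interview.

```python
import os

# Sketch of one-artifact, many-environments configuration: the same Docker
# image runs everywhere, and only an ENVIRONMENT variable chooses the
# settings. Variable name and values below are illustrative assumptions.
SETTINGS = {
    "staging": {"db_host": "db.staging.internal", "debug": True},
    "production": {"db_host": "db.prod.internal", "debug": False},
}

def load_settings(env=os.environ) -> dict:
    """Select settings for the current environment, failing on unknown ones."""
    name = env.get("ENVIRONMENT", "staging")
    if name not in SETTINGS:
        raise ValueError(f"unknown environment: {name}")
    return SETTINGS[name]
```

Keeping every environment's settings in one declared structure, rather than scattered overrides, is what minimizes configuration differences between staging and production.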
When it comes to secret data, such as passwords and API keys, our tools require defined “slots” that must be bound to secrets at runtime. These slots are defined in a single location and used across all environments, and a service cannot start until every slot is bound. So, for example, if we’re setting up a new staging environment for a service and forget to provision a set of database credentials, the service simply won’t start, so users never encounter a half-configured service. Instead, we get an immediate error message that tells the operator exactly which secret wasn’t bound, which allows us to fix the problem quickly.
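A minimal sketch of the slot pattern described above: every required secret is declared in one place, bound from the environment at startup, and any unbound slot stops the service with an error naming exactly what is missing. The secret names here are hypothetical, not Neighborhoods.com's real ones.

```python
import os

# Hedged sketch of secret "slots": one central declaration of every secret
# a service needs. If any slot is unbound at startup, the service refuses
# to start and reports exactly which secrets are missing (fail fast).
REQUIRED_SECRETS = ["DATABASE_URL", "DATABASE_PASSWORD", "PAYMENTS_API_KEY"]

def bind_secrets(env=os.environ) -> dict:
    """Bind all declared slots from the environment, or raise immediately."""
    missing = [name for name in REQUIRED_SECRETS if not env.get(name)]
    if missing:
        # Mirrors the immediate, specific error message described above.
        raise RuntimeError(f"unbound secret slots: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_SECRETS}
```

Because the slot list is the same in every environment, a staging deployment that is missing credentials fails loudly at startup rather than limping along in a state production would never allow.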
How the team maintains staging environments: Stable and trustworthy build automation tools are critical to maintaining our staging environments. We do several things to ensure minimal configuration drift when a software change is ready to be deployed and tested in staging, including running staging environments on the same Kubernetes cluster as our production environment. Since our entire infrastructure is defined in code, we have the ability to frequently confirm that everything matches its definitions.
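Confirming that running infrastructure matches its definitions amounts to a drift check: diff the desired state in source control against what is actually deployed. The function below is an illustrative sketch of that idea; the team's real tooling (Kubernetes plus their infrastructure-as-code definitions) operates on far richer state than a flat dictionary.

```python
# Hedged sketch of configuration-drift detection: compare the configuration
# a service should have (defined in code) with what is actually deployed,
# and report every mismatch. Keys and values are illustrative.
def find_drift(defined: dict, deployed: dict) -> dict:
    """Return the keys whose deployed value differs from the definition."""
    drift = {}
    for key, want in defined.items():
        have = deployed.get(key)
        if have != want:
            drift[key] = {"defined": want, "deployed": have}
    return drift
```

Running a check like this frequently, as the team describes, turns drift from a silent divergence into an actionable report.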
Common mistakes, and fixes, for dev teams: The biggest way to get into trouble is to think of a staging environment as a place where production rules and procedures don’t need to be followed. It’s easy to think, “Oh, it’s just staging, so it’s fine to get a shell to run a service and tweak some code here.” However, letting discipline slip like that can cause configuration drift. When this drift happens, the code deployment is no longer representative of what you can expect to see in production, meaning the value of the staging environment is completely lost.
Treat staging just like production. Changes should only be made via the automated build and deploy pipelines, using reviewed, versioned configuration.
Zoro’s Director of Cloud Operations Richard Staehler III said a high degree of collaboration between engineers and DevOps is needed at the e-commerce company to ensure reliable staging environments. A lack of teamwork in staging costs time and opens the door to mistakes.
Zoro’s staging best practices: The staging environment for both our general website and microservices stays up around the clock. Doing so allows various teams to preview it and run our automated tests at any time. Code is only deployed to the environment using automatically triggered pipelines and we keep it up to date with infrastructure-as-code in case disaster strikes and we have to rebuild.
How the team maintains staging environments: Our staging environment closely resembles production, and it’s vital that the two share nearly the same processes for monitoring and maintainability. So we monitor cluster health, scaling, server metrics, APM and more at the same levels as production. Before release, we run a battery of tests on an application, and if no metrics fall out of bounds, we give it the green light.
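The pre-release check described above can be sketched as a simple gate: each collected metric is compared against a declared bound, and the release is only green-lit if nothing falls outside. The metric names and thresholds below are assumptions for illustration, not Zoro's actual limits.

```python
# Illustrative release gate: after the test battery, check each metric
# against its allowed range and green-light only if none are out of bounds.
# Metric names and (lower, upper) bounds here are hypothetical.
BOUNDS = {
    "p95_latency_ms": (0, 500),
    "error_rate": (0.0, 0.01),
    "cpu_utilization": (0.0, 0.85),
}

def release_gate(metrics: dict) -> tuple:
    """Return (green_light, violations) for a set of collected metrics."""
    violations = [
        name
        for name, value in metrics.items()
        if name in BOUNDS and not (BOUNDS[name][0] <= value <= BOUNDS[name][1])
    ]
    return (not violations, violations)
```

Encoding the bounds in versioned code, rather than in someone's head, keeps the staging sign-off as strict and repeatable as the production one.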
Common mistakes, and fixes, for dev teams: Frequently, engineering teams deploy a new application or a change to a lower environment and only then transition to working with DevOps practices and pipelines. They soon find out they forgot or missed something, and that the staging environment is stricter than they expected.
To get ahead of this issue, engineering teams need to work in lockstep with DevOps to make sure best practices are followed. They also need to be more in tune with the upper environments and their configuration.