“Treat Yo Self!” – Parks & Recreation

When was the last time you brought in a shiny new upgrade for your development teams (that wasn’t part of your annual Microsoft true-up)? When those budget talks come around, are you always stuck with repeatedly justifying the tools you need to work effectively? Instead of running your entire development pipeline on hand-me-down hardware—or that one oversized virtual machine that you managed to get the IT department to approve last year—consider rethinking your whole development pipeline. Adopting a cloud-native mindset will enable you to enjoy the cool (or warm, if that’s your thing) breeze of what modern software development should be like.

What do we mean by a cloud-native mindset?

According to the CNCF Cloud Native Definition v1.0, “… These techniques enable loosely coupled systems that are resilient, manageable, and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil.” To put it simply: Your software deployments should be so boring and so well-exercised that you almost forget they’re happening all the time.

By applying a cloud-native mindset to your development pipeline and your application environments, you will address the toughest concerns facing your development, process, finance, AND leadership teams, all while empowering your organization to do more with less (I feel gross saying that, but I mean it in a good way here). The best way to start rethinking your development pipeline is to “treat yo self” to a Landing Zone for your cloud landscape, and exploit those benefits to help your organization fail better, make deployments boring, and better control the costs of running your business in the cloud.

Highway to the Landing Zone

When starting your cloud journey, one of the current best practices is to set up a Landing Zone that provides:

  1. An automated method for partitioning your cloud account into discrete, sandboxed areas
  2. A unified security model for controlling access to these accounts based on your organization’s single sign-on solution
  3. Logging and auditing capabilities to monitor the overall integrity of your sandboxed areas
  4. Itemized cost allocation and reporting across sandboxed areas

The Landing Zone enables you to create light-touch, sandboxed environments, with guardrails already built in, that your team can use to build prototypes, test out new configurations, or learn new strategies—without impacting other development or operations environments. Light-touch means that they can quickly set up these environments without first signing a blood-pact with an elder god. The major cloud vendors use the Landing Zone to create secure, disposable, repeatable, cost-transparent environments. These enable your team to experiment and iterate quickly, improve security and repeatability by leveraging Infrastructure as Code (IaC), and manage costs more effectively using the tools you’re already using to monitor your operations environments.

If you haven’t created a Landing Zone already, work with your cloud vendor to see their instructions and best practices for starting one up (e.g., AWS Control Tower and Azure Landing Zone).

Here are some real-world scenarios in which Landing Zone-enabled environments can, and have, come in handy for cloud development teams:

  1. One of your developers is testing out a new third-party tool and wants to be sure it can work peaceably with the other tools in the ecosystem. They access a new sandbox environment and set up containerized versions of all the tools using the vendor’s published cloud automation template (e.g., Running TeamCity Stack in AWS: TeamCity deployment is automated by an AWS CloudFormation template that can be customized and upgraded automatically). Once the testing is complete, the applications are torn down with a few commands and the environment is cleaned out.
  2. Your site reliability engineers want to create a separate instance of an internally-developed, customer-facing application to isolate the data for a particular customer. They request a new sandbox in the “Production” organization of the cloud environment, triggering some internal processes to assure that all the appropriate documentation is updated and ready to support the expected service-level agreements. They then deploy the application using the automated templates built together with the development team and customize the name of the instance along with the initial compute tiers and data sizes. Once deployed, they can verify the application works and open up traffic to the customer.
  3. You are working on budgets for operating expenses for the coming fiscal year and you want to better understand how much cloud spend is attributed to development versus operations, and how to forecast (and optimize) potential spend in the coming year. With the rolled-up cost reporting, you can run reports on the totalized spend across all the cloud accounts in both the “Production” and “Development” organizations. Working at the top level of the organization, you can then select an appropriate pre-purchase level of resources you plan on using to reduce the overall cost significantly.

Landing Zones enable you to interact with your cloud environment more flexibly and effectively than you could by running a single shared account. Once you have your Landing Zone environment ready to go, you can start setting your team up to fail (better).

Allow your team to fail better

“Ever tried. Ever failed. No matter. Try again. Fail again. Fail better.” – Samuel Beckett, Worstward Ho

Software development is all about experimentation and failure. However, successful software development allows for controlled experimentation and fast failure. Documentation for software is rarely kept evergreen, and tutorials only lay so much groundwork. You need to try things out yourself to see if they’ll work the way you want. However, just failing is not useful: you need to fail fast and learn from those failures to keep moving forward.

To “fail better”, you need to establish two things:

  1. A cultural norm wherein failure is always communicated as a learning opportunity, making a safe space for your team to openly and honestly talk about failures.
  2. A process to create semi-isolated environments that can be created and destroyed without fear of affecting any other ongoing work.

Unfortunately, there isn’t an easy, push-button solution to improve your team’s communication. However, a good starting point is a better understanding of empathy.

The process side is more tractable: once a sandbox has been created in your Landing Zone, your developers can deploy any desired resources to it. You’ll enable easy experimentation by deploying a copy of your application without interfering with the existing environments.

It’s important to document the steps needed to set these applications up cleanly and make using them an expectation every time a new application, service, team, or individual is added to the environment. Consider the bus factor of your team and nudge appropriately.

Your development and operations systems should run in the same type of sandboxed environment as well, giving your team a consistent way of deploying internal tools as well as customer-facing applications. Allowing your teams to try out deployment strategies or new combinations of tools in these sandboxes encourages taking more calculated risks. If something goes horribly wrong, they can always delete the sandbox and all its resources and start over.

“Cattle, not pets” applies to your development systems as well as your operations systems. Spinning up a big server and running a lot of applications on it that are manually managed and updated is a recipe for headache and a single point of failure. Embrace the cloud and spread out your tools across compute instances, look into containerized versions of applications, and script your way to a happier future.

[Figure: “Cattle, not pets.” Source: CC BY-ND 4.0, Randy Bias]

For example, if you want to run JetBrains TeamCity for your build automation, you can either choose to create a “pet” virtual machine that you diligently maintain with patches and updates (VERY DILIGENTLY), or you can choose to take the containerized “cattle” approach and install the “stack” version: Running TeamCity Stack in AWS. In the case of the “pet,” you’ll need to maintain the underlying operating system as well as the version of TeamCity and perform the required upgrades when a new version is released. With the “cattle” configuration, you can spin up a copy of the new server in a separate account to try out new features or upgrade the system with a few clicks. JetBrains updates their images with security updates and operating system upgrades periodically, and you can easily set up an automation task to determine the next minor release and perform the upgrade automatically.
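As a rough illustration of the "determine the next minor release" automation task, here is a minimal Python sketch. It assumes release tags are dotted numbers and that "minor release" means any newer tag sharing the same first component; both are assumptions for illustration, not a statement about how JetBrains numbers or publishes TeamCity releases. The function names and sample tags are hypothetical.

```python
from typing import Optional


def parse_version(tag: str) -> tuple[int, ...]:
    """Turn a dotted numeric tag such as '2023.05.4' into (2023, 5, 4)."""
    return tuple(int(part) for part in tag.split("."))


def next_minor_release(current: str, available: list[str]) -> Optional[str]:
    """Return the newest available tag that is newer than `current` and shares
    its first (major) component, or None if there is nothing to upgrade to."""
    cur = parse_version(current)
    candidates = [
        tag
        for tag in available
        if parse_version(tag)[0] == cur[0] and parse_version(tag) > cur
    ]
    return max(candidates, key=parse_version) if candidates else None


# Hypothetical tags, e.g. as listed by a container registry.
available_tags = ["2023.05", "2023.05.4", "2023.11.1", "2024.03"]
upgrade_to = next_minor_release("2023.05.4", available_tags)
```

A scheduled job could run a check like this and, when it returns a tag, redeploy the stack in a sandbox first before touching the real build server.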

Enable your teams to experiment freely and fail quickly, then discuss and learn from those failures to keep moving forward. While setting up this process is not easy, the benefits to your team far outweigh the time it takes to integrate this strategy into your organization. The best part is that, once configured, you’re only billed for the resources you use, and billing stops when you tear down the sandbox. The alternative is to find a spare desktop machine or spare virtual machine in your datacenter to reimage and use each time you want to try out a new experiment, which is not a good use of anyone’s time.

As you refine your process for creating software sandboxes, you can start the task of making your software deployments as boring as you possibly can—and then make them even MORE boring!

Make deployments as boring as possible

“Simple things should be simple, complex things should be possible.” – Alan Kay

Software development is all about making the complex simpler. However, successful software development tends to breed more and more complexity as new requirements are implemented and the team is turned over little by little as top performers either move up, move across, or get promoted to the role of customer. As your software architecture evolves, so does your development pipeline to handle new versions of your application runtime as well as the old version that you haven’t had time to upgrade yet (nor will you ever). “Making the complex simpler,” therefore, must include your development pipeline.

Few things are more soul-crushing than creating a 30-page wiki guide on how to click through the AWS or Azure Console to create your development environment and customize or configure every piece of software. The only thing worse is keeping that same documentation evergreen and maintained.

Every step of your pipeline can be automated, scripted, and updated through Infrastructure as Code. Azure DevOps, TeamCity, Octopus Deploy, Jenkins, GitHub, Bitbucket (along with Jira, Bamboo, and Confluence), and nearly every other tool on the market right now has an API that can be integrated with your scripting or build processes to create projects, apply templates, check configurations, and execute automatically. Build and deployment agents can be defined and rebuilt from pre-patched base operating systems (on both Windows and Linux). Better yet, you can invest in full Infrastructure as Code ecosystems like Terraform or Pulumi, or more specialized ones such as AWS CDK or Azure Resource Manager (ARM) templates and Bicep.

Infrastructure as Code is not easy (in fact, it’s quite difficult due to lackluster documentation from the vendors, incomplete or incompatible examples on Stack Overflow, and terrible error messages all over the place). However, the journey will force you to codify your processes and systems and put them in source control. Tribal knowledge will evolve into common knowledge that can be looked up and kept up to date (instead of manually making sure all your wiki pages are updated). Infrastructure as Code doesn’t absolve you from needing to find at least one good systems engineer, but does let you make the development team more responsible for understanding how their code will run in the operating environments. The ends, of course, justify the means, and you’ll end up with a repeatable, diff-able, automated deployment process for your applications.

All IT organizations benefit from well-oiled, agile change management processes (even if we all love to complain about them). These processes endeavor to surface any unexpected interactions or conflicts between systems and networks to reduce the likelihood of a service level agreement breach or major incident due to a regularly planned change. Heavy processes tend to make people more resistant to changes, and deployments end up happening only as often as change advisory board meetings.

However, a cloud-native mindset towards application management urges smaller, more frequent, less risky deployments. Further, each deployment, if defined using an Infrastructure as Code configuration, will describe exactly what changes are occurring and can be reviewed in pull requests as they are working through lower environments (such as development or testing). If you deploy often enough, the action of deploying becomes a boring, everyday occurrence instead of a special ceremony. This shift allows you to quickly address customer change requests, resolve defects, and maintain customer confidence in the system.
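To make "describe exactly what changes are occurring" a little more concrete, here is a small Python sketch that compares two versions of an environment definition and produces a reviewable summary. It is only an illustration of the idea: the flat dictionary format and the setting names are hypothetical, and real Infrastructure as Code tools produce far richer plans than this.

```python
def diff_environment(current: dict, proposed: dict) -> list[str]:
    """List the settings that a deployment would add (+), remove (-), or change (~)."""
    lines = []
    for key in sorted(current.keys() | proposed.keys()):
        if key not in proposed:
            lines.append(f"- {key}: {current[key]!r}")
        elif key not in current:
            lines.append(f"+ {key}: {proposed[key]!r}")
        elif current[key] != proposed[key]:
            lines.append(f"~ {key}: {current[key]!r} -> {proposed[key]!r}")
    return lines


# Hypothetical settings for illustration only.
deployed = {"app_instances": 2, "db_tier": "small", "legacy_queue": True}
pull_request = {"app_instances": 3, "db_tier": "small", "cache_enabled": True}

for line in diff_environment(deployed, pull_request):
    print(line)
```

A small, frequent deployment yields a short list like this one, which is much easier to review in a pull request than a quarterly batch of changes.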

For example, any software product that lasts for more than a year will likely undergo a risky, breaking change in its lifetime (typically around changing your database vendor, integration mechanism, or authentication provider). Unless you’ve already got the budget set aside for a second staging environment, testing out this risky change most likely means creating a reasonable approximation from spare (and some not-so-spare) resources. Spoiler Alert: this won’t be a completely accurate validation and will likely miss some issues. By leveraging Infrastructure as Code and using the sandboxed environments, you can create a clean starting point to validate your changes as well as simulate (over and over again) the upgrade and migration process until you work out all the kinks.

Assuming your software applications are being deployed by Terraform on every change, retargeting your stack to a new sandbox environment is as simple as adding a new target configuration and running the script to build up the environment from scratch. Terraform has integrations for multiple cloud vendors and other supporting services, allowing you to deploy a service into your sandbox, then use the same Terraform process to configure it. Terraform has providers for Azure AD, Kubernetes, HashiCorp Vault, and Cloudflare, to name a few. A single set of scripts can coordinate deployments between all these services.
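One way "adding a new target configuration" might look is sketched below in Python. It assumes your Terraform code already reads its settings from input variables; the variable names, target names, and values are all hypothetical. The script only renders a JSON variables file and lists the Terraform CLI commands you would run, it does not execute anything.

```python
import json

# Shared defaults plus per-target overrides (all names and values are hypothetical).
BASE_VARIABLES = {"instance_size": "small", "instance_count": 1, "db_storage_gb": 20}
TARGETS = {
    "development": {},
    "sandbox-migration-test": {"db_storage_gb": 50},  # the newly added target
}


def render_variables(target: str) -> str:
    """Build the contents of a '<target>.tfvars.json' file for one target."""
    variables = {**BASE_VARIABLES, **TARGETS[target], "environment_name": target}
    return json.dumps(variables, indent=2, sort_keys=True)


def terraform_commands(target: str) -> list[list[str]]:
    """Commands to build the target from scratch, assuming one workspace per target."""
    return [
        ["terraform", "init"],
        ["terraform", "workspace", "new", target],
        ["terraform", "apply", f"-var-file={target}.tfvars.json"],
    ]


print(render_variables("sandbox-migration-test"))
```

Keeping the overrides in one small table means a new sandbox is a one-line change that can be reviewed like any other commit.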

Embracing Infrastructure as Code is a challenging process, but will pay back dividends when you can completely define your software environment in a source control repository: easily checking differences between versions, documenting pull request approvals and reviews, and parameterizing applications to support multiple parallel configurations. As your deployments become more boring and predictable, you can start better managing your costs across your project portfolio in the cloud.

Set budgets and transparently manage costs

“Don’t tell me what you value, show me your budget, and I’ll tell you what you value.” – Joe Biden

Software development, like any other project, costs money. However, teams have been trending in the direction of “free” options rather than paying for services and tools that help them be productive. In the cloud, everything costs something: if a particular resource is “free,” it usually means that you’re getting billed for how that resource ends up being used rather than the resource itself.

For example, AWS typically only bills you on network bandwidth that leaves the AWS datacenter (outbound) and not the data that goes into the AWS datacenter (inbound). AWS knows that inbound data is either going to be stored (and you get billed for storage), transmitted (and you get billed for outbound transmission), or processed (and you get billed for compute).

Teams that understand the value and cost of resources tend to work more effectively and cost-effectively in the cloud. Cloud environments are powerful and flexible, but a wrong-sized VM or an inappropriate database configuration can run up a hefty bill if it isn’t noticed quickly. Setting budgets and limits on the sandbox accounts, even something as low as $100 a month, provides enough space to work freely. You can start a conversation when the costs are projected to exceed the budget. Being open and transparent with the budgets and enlisting your team to manage them with their sandboxes and development tools will help them make more informed decisions for operations.
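The "projected to exceed the budget" check can be sketched in a few lines of Python. This version assumes a naive linear projection (spend so far, divided by days elapsed, times days in the month); the cloud vendors' own budgeting tools use their own forecasting, so treat the formula and the function names as illustrative assumptions.

```python
import calendar
from datetime import date


def projected_month_end_spend(spend_to_date: float, today: date) -> float:
    """Linearly extrapolate month-to-date spend to the end of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def needs_conversation(spend_to_date: float, today: date, monthly_budget: float = 100.0) -> bool:
    """True when the projection exceeds the sandbox budget ($100 by default)."""
    return projected_month_end_spend(spend_to_date, today) > monthly_budget


# Example: $40 spent by April 10 projects to $120 for the 30-day month.
print(needs_conversation(40.0, date(2024, 4, 10)))
```

The point is less the arithmetic than the trigger: an early, low-stakes nudge to talk about a sandbox before the bill arrives.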

On-premises costs tend to be more fixed, whereas cloud costs tend to be more variable. Leverage cost management tools in the cloud, like reservations, spot pricing, and resource scheduling. Reservations let you pre-pay for compute resources (your development tooling is running all the time, so make sure you have it sized well and are pre-paying for compute to reduce your hourly cost). Spot pricing allows you to distribute larger, less critical tasks (like distributed calculations or data conversions) and bid for spare compute time across the datacenter. Consider how and when your development resources are used: does your team work evenings or weekends? Which environments are actively being used at any time? Each cloud platform has solutions for monitoring and shutting down unused resources.
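For resource scheduling, the core decision is simply "should this environment be running right now?" Here is a minimal Python sketch of that decision, assuming a team that works weekdays between 07:00 and 19:00 local time. The hours, the workday set, and the function name are assumptions to adjust for your own team; the actual start and stop actions would go through your cloud platform's own tooling.

```python
from datetime import datetime

WEEKDAYS = frozenset(range(5))  # Monday=0 through Friday=4


def should_be_running(now: datetime, start_hour: int = 7, end_hour: int = 19,
                      workdays: frozenset = WEEKDAYS) -> bool:
    """Return True if a development environment should be up at `now`."""
    return now.weekday() in workdays and start_hour <= now.hour < end_hour


# A Wednesday morning, a Wednesday night, and a Saturday morning.
for moment in (datetime(2024, 4, 10, 10, 0),
               datetime(2024, 4, 10, 22, 0),
               datetime(2024, 4, 13, 10, 0)):
    print(moment, should_be_running(moment))
```

If your team does work evenings or weekends, widen the window rather than abandoning the schedule; even a partial shutdown window trims the hours you pay for.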

As with agile processes, regular retrospective meetings are useful to discuss any concerns from the leadership regarding cloud spend. In the spirit of Infrastructure as Code, these concerns should be codified as rules that can be run against cloud environments to flag or correct violations.

The most popular example of this is Netflix’s Simian Army: these are the bots that Netflix runs on their infrastructure to ensure that rules are followed and any resources that don’t comply will get destroyed, forcing the development teams to account for these in their design and deployment practices. While this seems like a draconian policy (and it is), it correctly enforces the values of the organization and builds better cloud-native habits in the development teams.
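Codified rules do not have to start out as sophisticated as a bot army. The Python sketch below shows the general shape: each rule is a named check, and a scan reports every resource that fails one. The resource fields, tag names, and cost threshold are hypothetical, and this is not a description of how Netflix's tooling works.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Resource:
    name: str
    kind: str
    tags: dict[str, str] = field(default_factory=dict)
    monthly_cost: float = 0.0


# Each rule returns True when the resource complies (all rules are illustrative).
RULES: dict[str, Callable[[Resource], bool]] = {
    "has an owner tag": lambda r: "owner" in r.tags,
    "has a cost-center tag": lambda r: "cost-center" in r.tags,
    "costs under 500 per month": lambda r: r.monthly_cost < 500,
}


def find_violations(resources: list[Resource]) -> list[tuple[str, str]]:
    """Return (resource name, rule name) for every failed rule."""
    return [
        (res.name, rule_name)
        for res in resources
        for rule_name, rule in RULES.items()
        if not rule(res)
    ]


inventory = [
    Resource("build-agent", "vm", {"owner": "dev-team", "cost-center": "rnd"}, 120.0),
    Resource("mystery-db", "database", {}, 900.0),
]
print(find_violations(inventory))
```

Starting with "flag" and moving to "correct" (or "destroy") later gives teams time to adapt before enforcement gets draconian.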

In Landing Zone-enabled cloud environments, costs are allocated per partition and are rolled up in groups based on the structure of your organization. These can then be attributed to certain departments or projects, or written off as research and development expenses. The IT leadership team looking at a hefty cloud bill wants to see areas for improvement and how they can reduce costs while still creating sufficient value. Rolling up all the separate cloud costs and linking them at the top lets you apply more cost-saving options, such as AWS Savings Plans or Azure Reservations, across all partitions. As with other consumable resources, if you can adequately plan your consumption, you can negotiate a better rate with your vendors as they can better plan for your usage.
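The roll-up itself is straightforward bookkeeping. Here is a minimal Python sketch, assuming you can export a cost per account and know which organizational group each account belongs to; the account names, group names, and amounts are made up for illustration.

```python
from collections import defaultdict


def roll_up_costs(account_costs: dict[str, float],
                  account_groups: dict[str, str]) -> dict[str, float]:
    """Sum per-account costs into their organizational groups."""
    totals: dict[str, float] = defaultdict(float)
    for account, cost in account_costs.items():
        # Accounts without a known group are surfaced rather than silently dropped.
        totals[account_groups.get(account, "Unassigned")] += cost
    return dict(totals)


# Hypothetical monthly costs per sandboxed account.
costs = {
    "prod-customer-a": 1200.0,
    "prod-customer-b": 800.0,
    "sandbox-alice": 40.0,
    "sandbox-bob": 60.0,
}
groups = {
    "prod-customer-a": "Production",
    "prod-customer-b": "Production",
    "sandbox-alice": "Development",
    "sandbox-bob": "Development",
}
print(roll_up_costs(costs, groups))
```

The "Unassigned" bucket is a deliberate choice: spend that nobody claims is exactly what a budget conversation should start with.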

Now that you and your teams have a better handle on what your organization costs to run, you can feed this back into your planning processes and adjust your contract with your cloud vendors to remove variability from your monthly bill. When your costs are predictable and transparent, you can better manage your automated infrastructure and allow your team to fail better while they work on providing value for the organization at large.

Landing your cloud-native strategy

Implementing Landing Zones is a best practice for managing your cloud infrastructure today that enables your organization to better manage its cloud footprint. Your software development teams can leverage these capabilities to build sandboxed cloud areas, which they can then boringly deploy their applications and tools into, using Infrastructure as Code, and proactively manage and report costs. Your development teams will enjoy the agility and productivity of being able to fail better and deploy infrastructure automatically. Your IT process and change management stakeholders will enjoy the predictability and sheer boredom of repeatedly deploying applications and tools, and the unparalleled visibility into the changes proposed. Meanwhile, your IT leadership team will enjoy complete transparency regarding the costs of running the entire organization, as well as the flexibility of having more strategic pricing options available with the cloud. Finally, as the person or team who has to make the impossible possible, you can enjoy saying “yes” to your stakeholders knowing that it won’t cost you and your team a weekend to move things forward. Landing Zones are the way to go if you’re considering cloud-native development… and you can quote me on that.
