Software Production Engineering: Build & Deploy Decoupling (and get some tests running in between!)

Pre-preface: We have about 50 heavily linked builds that make up our “core” environment. We’re trying to cut "major" release times from ~3 months to ~3 weeks. Check out the prior blog post for more background information.

---

In a continuous integration/continuous deployment shop, if you decouple the deployments away from the builds you will improve tester productivity, decrease test environment outages throughout the day, find defects earlier, speed up the builds, you are encouraged to invest and rewarded by investing in automated tests and ultimately position yourself for continuous production deployments ala Push Engineering. (scary, but woo!) That’s a lot of claims to make, so let’s get on with it.

Preface - this isn’t a step away from the Continuous Integration or Continuous Deployment methodology, it’s a refinement to keep it from becoming Continuous Interruption or Continuous Regression.

Productive testers

I’m willing to bet that if you build on every commit, 95% of your test environment stakeholders (development, QA, product owners, etc) that are using the environment in question aren’t looking for that specific fix/feature/revision at that moment. They want it eventually – usually after they are done testing the story they picked up after they sent the first story back to development with a defect. While they want that fix/feature/etc, they want a Quality change; they want the commit to build successfully so the build isn’t dead on arrival, they want it to pass the unit/system test suite (hopefully you’re at 100% coverage) so there is a high confidence you aren’t regressing existing critical functionality, and they want their test environment to get that revision at an expected time – either scheduled or on demand. The last thing testers want is to have their valuable time spent setting up a test wasted when the environment is reset on them for a defect fix or change they could not care less about.

Outages

A moderately sized software development unit – say, two teams adding up to 10 developers with Quality Assurance, Business Analysts, Project Managers, Database Admins and Developer Operations people – can generate a lot of environmental churn on a daily basis. Your busiest environment will probably be your lowest environment; I frequently see 10 developers able to commit 20-30 revisions every day to the dev branches, and then those revisions ( once they pass whatever gateway test/merge qualifications necessary) being selectively merged to the next more-mature environment – generally it divides by 2 or 3 each environment. So your 30 revisions in Dev turns to 10 in the next environment up, then 5, then 2, then maybe you’re up to Prod and it’s hopefully just one build ;) If you clump these deployment cycles into “a” number – without knowing your business or processes, there is no magic number – but let’s assume for a moment we go to four deployments a day in non-production – 9AM-Noon-3PM-6PM, and only when there are changes to push – you’re now limited to 4 deployment interruptions in your lowest environment, to one or two in all other non-prod environments, and it’s on an expected timetable. From ~50 continuous interruption deployments a day down to ~8 – that’s a slight improvement.

If you have frequent “hot” builds that need to be slammed in as soon as possible, your people environment is probably lacking in software development process maturity. People need to learn the old fashioned way – experience – that there is never going to be a silver bullet fix that solves everything in one build & deploy . You can alleviate the need for “War-room ASAP” builds by working in as close to 100% test coverage percentage as you can afford, level-set the environment stakeholders by explaining how many people/groups are being impacted with a rebuild, communicate the end to end outage length, and put visibility on the rebuild & deployment, each and every one. When the world is watching, that super-important fix can wait 45 minutes until the scheduled, automated deploy happens. And if you’re in a dynamic place where you have multiple groups working on multiple stories simultaneously, it’s extremely rare for a critical group of people to be blocked & waiting for a deployment. They may say they are, but in reality they most likely have several other items ready to test, or items marked resolved that they need to sign off on. Don’t get me wrong – you will need to allow for manual deployment pushes – just make sure the button is pushable by the right stakeholders and there is clear visibility on the deployment occurring, why it’s occurring and who pushed the button/authorized it. Listen to the users too – don’t let a pushy manager terrorize everyone by smashing their builds in. You need to listen and push back if need be.

Defects & defect cost

If your deployments are coupled with your CI builds, even with stepped down commits as you go up in environments, that’s a lot of interruptions in each environment. The closer the environment is to Production makes the outages more expensive in the short-term, due to the perceived instability of the higher environments by senior leadership and external clients and the interruptions of client UAT, sales demos, etc. These outages/interruptions tend to chase the legitimate testers away, and because the lower environments (closest to Development) are being rebuilt most often, you’re training your most valuable testers to stay out and stay away from the environments you actually need them in the most – the sooner you find a defect, the cheaper it is to fix it. When the people who you are making these changes for finally tests a few weeks before release or in an environment right before Production, having the “you didn’t understand our requirements” card get played is somewhere on a level of fun between playing “Sorry” with your significant other (they show no mercy!) and having the Draw-Four card played on you in Uno, except in a large release hundreds of thousands of dollars in time & effort are at stake. That is the long-term cost of low-environment outage and over time will greatly outweigh the short-term cost of your close-to-production environments recycling during the day. Too often we let the “rough” development environments be wild-wild-west, not realizing it’s your most valuable environment to test in.

Speeding up the build

If you have any tests baked into the build, you know they can take just a few seconds to several minutes, depending where they sit between pre-canned, self-contained nunit tests or if you are setting up data and comitting transactions. When you have high percentage levels of test coverage your tests can often exceed the compile times of the build itself; there’s nothing wrong with that. This is where you can successfully argue for more build server horsepower. The time spent on deployments is often out of your hands; you are constrained by network speeds, SAN or local disk speeds, and hard-timed delays waiting for services to stop and restart. The deployment tasks in your build quickly add up; even if deployment tasks are only 10% of your build time, it’s increasing your environment outage by 10%. If you asked QA “Hey, would it be worth trimming 10% off a complete environment rebuild?” they would jump on it. Time is money; a 10% decrease in time is literally dollars – often it’s many dollars added up over dozens of builds a day – being saved. That’s something senior leadership can definitely understand and get behind.

Test coverage

Not slamming the build hot into the environment provides incentive to build up the automated test coverage percentage. Nobody wants a DOA build, and if they are now getting them 4 times a day max (let’s say) they want them to count. Assuming a 9AM/Noon/3PM/6PM deployment cycle, developers at 10AM might have an easier time making sure their change is covered in unit tests when they know the build won’t be deployed until Noon. I have seen project managers and directors literally standing behind a developer waiting for them to commit because “everyone is blocked!” or “The client is waiting to test!”. That’s not a conducive work environment, nor one anyone would want to work in. Taking the deployments out of the build also gives you the ability to put in longer, slower, deeper automated tests and post-build reports, like test coverage reports, in each and every build. It sounds counterintuitive to speed up the build by removing the deploy stage and then slow it down with more automated tests and post-compile reports, but those tests and reports will increase quality, which will decrease the amount of time spent in the whole “code it-throw it over the wall-have it thrown back-fix and throw it over the wall again” cycle.

Continuous Production Deployments

That’s where I am trying to get to, and that’s why I’m removing deployments from our build cycle. I need two major results out of this – our automated tests need to inc rease in quality & importance, and the decoupled deployment stage - broken out into a separate stage – can be perfected in the lower environments and then used in Production. The only reason I haven’t strung together our piecemeal deployment scripts from each build and run them against Production already is we don’t have good automated tests running (so we can’t automatically trust the build), and we don’t always have good visibility on what needs to be released due to how much shared code we use. Yanking deployments out and relying on and improving our test coverage and perfecting our deployment stage gives me half of this; my next post will be on how I’m going to solve the release build of materials question.

-Kelly Schoenhofen

Software Production Engineering

Monday, November 7, 2011

Build & Deploy Decoupling (and get some tests running in between!)

No comments:

Post a Comment