Pre-preface: We have about 50 heavily linked builds that
make up our “core” environment. We’re trying to cut "major" release times from ~3
months to ~3 weeks. Check out the prior blog post for more background
information.
---
In a continuous integration/continuous deployment shop, if
you decouple the deployments from the builds you will improve tester
productivity, decrease test environment outages throughout the day, find
defects earlier, speed up the builds, create an incentive (and a reward) for
investing in automated tests, and ultimately position yourself for continuous
production deployments à la Push Engineering. (scary, but woo!) That’s a lot
of claims to make, so let’s get on with it.
Preface - this isn’t a step away from the Continuous
Integration or Continuous Deployment methodology; it’s a refinement to keep it
from becoming Continuous Interruption or Continuous Regression.
Productive testers
I’m willing to bet that if you build on every commit, 95% of
your test environment stakeholders (development, QA, product owners, etc.) who
are using the environment in question aren’t looking for that specific
fix/feature/revision at that moment. They want it eventually – usually after they are done testing the story they
picked up after they sent the first story back to development with a defect. While they want that fix/feature/etc., what
they really want is a Quality change: they want the commit to build successfully so the build
isn’t dead on arrival, they want it to pass the unit/system test suite
(hopefully you’re at 100% coverage) so there is high confidence you aren’t
regressing existing critical functionality, and they want their test
environment to get that revision at an expected time – either scheduled or on
demand. The last thing testers want is
to have the valuable time they spent setting up a test wasted when the environment
is reset out from under them for a defect fix or change they could not care less about.
Outages
A moderately sized software development unit – say, two teams adding up to 10 developers
with Quality Assurance, Business Analysts, Project Managers, Database Admins
and Developer Operations people – can generate a lot of environmental churn on
a daily basis. Your busiest environment will probably be your lowest
environment; I frequently see 10 developers commit 20-30 revisions every day to
the dev branches, and then those revisions (once they pass whatever gateway
test/merge qualifications are necessary) get selectively merged to the next
more-mature environment – generally the count divides by 2 or 3 at each environment. So
your 30 revisions in Dev turn into 10 in the next environment up, then 5, then
2, then maybe you’re up to Prod and it’s hopefully just one build ;) Now
clump these deployment cycles into “a” number – there is no magic number
without knowing your business or processes, but let’s assume for a moment we go to
four deployments a day in non-production (9AM-Noon-3PM-6PM), and only when
there are changes to push. You’re now limited to 4 deployment interruptions in
your lowest environment, one or two in all other non-prod environments, and
it’s all on an expected timetable. From ~50
continuous-interruption deployments a day down to ~8 – that’s a slight
improvement.
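To make the “only when there are changes to push” part concrete, here’s a minimal sketch of a scheduled deploy gate in C#. It assumes something (cron, a CI trigger, whatever you have) fires it at each of the four windows; BuildServer and Deployer are hypothetical stand-ins for your real CI server and deployment tooling, stubbed out so the example is self-contained.

// A minimal sketch of a scheduled deploy gate. BuildServer and Deployer are
// hypothetical stubs standing in for your actual CI and deployment tooling.
using System;

static class BuildServer
{
    // Stub: pretend the CI server says revision 142 is the newest passing build.
    public static int LatestPassingRevision(string environment) => 142;
}

static class Deployer
{
    // Stub: pretend the dev environment is currently running revision 138.
    public static int CurrentRevision(string environment) => 138;

    public static void Deploy(string environment, int revision) =>
        Console.WriteLine($"Deploying r{revision} to {environment}...");
}

class ScheduledDeployGate
{
    static void Main()
    {
        const string environment = "dev";

        int latestGood = BuildServer.LatestPassingRevision(environment);
        int deployed = Deployer.CurrentRevision(environment);

        // Only interrupt the environment when there is actually something new to push.
        if (latestGood > deployed)
            Deployer.Deploy(environment, latestGood);
        else
            Console.WriteLine("Nothing new since the last window - skipping this deploy.");
    }
}

The point isn’t the code, it’s the behavior: a 3PM window with nothing new in it costs your testers nothing.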
If you have frequent “hot” builds that need to be slammed in
as soon as possible, your organization is probably lacking in software
development process maturity. People need to learn the old-fashioned way –
experience – that there is never going to be a silver-bullet fix that solves
everything in one build & deploy. You can alleviate the need for “War-room
ASAP” builds by working toward as close to 100% test coverage as you can
afford, level-setting the environment stakeholders by explaining how many
people/groups are impacted by a rebuild, communicating the end-to-end
outage length, and putting visibility on each and every rebuild &
deployment. When the world is watching, that super-important fix can wait 45
minutes until the scheduled, automated deploy happens. And if you’re in a dynamic place where you
have multiple groups working on multiple stories simultaneously, it’s extremely
rare for a critical group of people to be blocked & waiting for a
deployment. They may say they are, but
in reality they most likely have several other items ready to test, or items
marked resolved that they need to sign off on. Don’t get me wrong – you will
need to allow for manual deployment pushes – just make sure the button is
pushable by the right stakeholders and there is clear visibility on the
deployment occurring, why it’s occurring and who pushed the button/authorized
it (see the sketch below). Listen to the users too – don’t let a pushy manager terrorize everyone by
smashing their builds in. You need to listen and push back if need be.
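Since I just claimed the manual button should be both pushable and visible, here’s a minimal sketch of what that path might look like, again in C#. The AuthorizedRequesters list and the console logging are illustrative assumptions; in practice the audit trail goes to wherever your team actually looks.

// A sketch of a manual, out-of-band deploy: no requester and reason, no deploy,
// and both get recorded before the outage starts. Names here are made up.
using System;
using System.Collections.Generic;

class ManualDeploy
{
    // Hypothetical list of people allowed to trigger an out-of-band deploy.
    static readonly HashSet<string> AuthorizedRequesters =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "qa-lead", "release-manager" };

    static void Main(string[] args)
    {
        if (args.Length < 3)
        {
            Console.WriteLine("Usage: ManualDeploy <environment> <requester> <reason>");
            return;
        }

        string environment = args[0], requester = args[1], reason = args[2];

        if (!AuthorizedRequesters.Contains(requester))
        {
            Console.WriteLine($"{requester} is not authorized to push a manual deploy - push back.");
            return;
        }

        // Visibility first: who asked, why, and where, before anything is interrupted.
        Console.WriteLine($"[{DateTime.Now:u}] MANUAL DEPLOY to {environment} requested by {requester}: {reason}");

        // ...then hand off to the same deployment stage the scheduled windows use.
    }
}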
Defects & defect cost
If your deployments are coupled with your CI builds, even
with stepped down commits as you go up in environments, that’s a lot of
interruptions in each environment. The closer an environment is to Production,
the more expensive its outages are in the short term, due to how senior
leadership and external clients perceive instability in the higher environments,
and the interruptions to client UAT, sales demos, etc.
These outages/interruptions tend to chase the legitimate testers away, and
because the lower environments (closest to Development) are being rebuilt most
often, you’re training your most valuable testers to stay out and stay away
from the environments you actually need them in the most – the sooner you find
a defect, the cheaper it is to fix it. When the people you are making these
changes for finally test a few weeks before release, or in an environment right before Production, having the “you
didn’t understand our requirements” card played is somewhere on a level of
fun between playing “Sorry” with your significant other (they show no mercy!) and
having the Draw-Four card played on you in Uno, except that in a large release hundreds of thousands of dollars in time
& effort are at stake. That is the
long-term cost of low-environment outages, and over time it will greatly outweigh
the short-term cost of your close-to-production environments recycling during
the day. Too often we let the “rough” development environments be the wild-wild-west, not realizing
they are the most valuable environments to test in.
Speeding up the build
If you have any tests baked into the build, you know they
can take anywhere from a few seconds to several minutes, depending on whether they are
pre-canned, self-contained NUnit tests or tests that set up data
and commit transactions. When you have a high percentage of test
coverage, your tests can often take longer than the compile time of the build itself;
there’s nothing wrong with that. This is where you can successfully argue for
more build server horsepower. The time spent on deployments is often out of
your hands; you are constrained by network speeds, SAN or local disk speeds,
and hard-timed delays waiting for services to stop and restart. The deployment
tasks in your build quickly add up; even if deployment tasks are only 10% of
your build time, they’re increasing your environment outage by 10%. If you asked
QA “Hey, would it be worth trimming 10% off a complete environment rebuild?”
they would jump on it. Time is money; a 10% decrease in time is literally
dollars saved – often many dollars, added up over dozens of builds a day. That’s something senior
leadership can definitely understand and get behind.
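To put rough numbers on that 10%, here’s a back-of-the-envelope sketch; every figure in it is an illustrative assumption, not a measurement.

// Back-of-the-envelope arithmetic for trimming deploy tasks out of the build.
// All of the inputs are illustrative assumptions.
using System;

class DeployTimeSavings
{
    static void Main()
    {
        double buildMinutes = 20;     // assumed end-to-end build time, deploy tasks included
        double deployFraction = 0.10; // deploy tasks as a share of that time
        int buildsPerDay = 30;        // assumed builds per day across your environments

        double savedPerBuild = buildMinutes * deployFraction;
        double savedPerDay = savedPerBuild * buildsPerDay;

        Console.WriteLine($"Minutes saved per build: {savedPerBuild}");  // 2
        Console.WriteLine($"Minutes saved per day:   {savedPerDay}");    // 60
    }
}

An hour of build and outage time a day, even at those made-up numbers, is the kind of figure you can actually put in front of leadership.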
Test coverage
Not slamming the build hot into the environment provides
incentive to build up the automated test coverage percentage. Nobody wants a
DOA build, and if they are now getting them 4 times a day max (let’s say) they
want them to count. Assuming a 9AM/Noon/3PM/6PM deployment cycle, developers at
10AM might have an easier time making sure their change is covered in unit
tests when they know the build won’t be deployed until Noon. I have seen
project managers and directors literally standing behind a developer waiting
for them to commit because “everyone is blocked!” or “The client is waiting to
test!”. That’s not a conducive work environment, nor one anyone would want to
work in. Taking the deployments out of the build also gives you the ability to
put in longer, slower, deeper automated tests and post-build reports, like test
coverage reports, in each and every build. It sounds counterintuitive to speed
up the build by removing the deploy stage and then slow it down with more
automated tests and post-compile reports, but those tests and reports will
increase quality, which will decrease
the amount of time spent in the whole “code it-throw it over the wall-have it
thrown back-fix and throw it over the wall again” cycle.
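For that 10AM developer, “covered in unit tests” can be as small as this: a minimal NUnit-style sketch, where DiscountCalculator and its behavior are made up purely for illustration.

// A minimal NUnit sketch of covering a morning change before the Noon deploy.
// DiscountCalculator and its rule are invented examples, not real code.
using NUnit.Framework;

public class DiscountCalculator
{
    // Hypothetical change being committed this morning: orders over $100 get 10% off.
    public decimal Apply(decimal orderTotal) =>
        orderTotal > 100m ? orderTotal * 0.90m : orderTotal;
}

[TestFixture]
public class DiscountCalculatorTests
{
    [Test]
    public void Orders_over_100_get_ten_percent_off()
    {
        var calc = new DiscountCalculator();
        Assert.AreEqual(180m, calc.Apply(200m));
    }

    [Test]
    public void Small_orders_are_unchanged()
    {
        var calc = new DiscountCalculator();
        Assert.AreEqual(50m, calc.Apply(50m));
    }
}

Tests like these run in milliseconds inside the build, and they are exactly the kind of coverage that makes a Noon deploy worth trusting.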
Continuous Production Deployments
That’s where I am trying to get to, and that’s why I’m
removing deployments from our build cycle. I need two major results out of this
– our automated tests need to increase in quality & importance, and the
deployment stage, broken out on its own, needs to be
perfected in the lower environments and then used in Production. The only
reasons I haven’t already strung together our piecemeal deployment scripts from each
build and run them against Production are that we don’t have good automated
tests running (so we can’t automatically trust the build), and we don’t always
have good visibility on what needs to be released due to how much shared code
we use. Yanking deployments out of the build, relying on and improving our test coverage,
and perfecting our deployment stage gives me half of this; my next post will be
on how I’m going to solve the release bill-of-materials question.
-Kelly Schoenhofen