Sunday, October 30, 2011

3AM Deployment? Why not?

Tip o' the hat to Bryan Gertonson for sending me this from Brian Crescimanno's blog - Why are you still deploying overnight?

It's a good article; I can see a lot of overlap between Brian Crescimanno's article and what Facebook is actually doing on a daily basis (see the Push Engineering video). Unfortunately, I'm going to bet that for most of us the article comes across as an ivory-tower, pie-in-the-sky, wishes-were-fishes manifesto. That doesn't mean Brian doesn't hit a lot of true notes, but pragmatically most of us are stuck doing midnight deployments to avoid customer impact. For me, midnight deployments have gotten really old, and as we expand outside North America, our global customers don't appreciate what is, for them, middle-of-the-day downtime. I want to change this.

A project manager where I work passed the Push Engineering video around to management, and the positive end result Facebook is reaping really resonated with them. But the problem with silver bullets is that what makes one a werewolf killer at one company doesn't slow down the vampires at another. One part in particular was FB's rich gatekeeper system for gradually enabling features across the customer base; that would be useful to us, but it would be a huge effort to implement for a limited payback in our specific e-commerce niche. The video may have resonated with our management, but it didn't really resonate with the engineers.

In my opinion, the hardest challenge Facebook faced (no pun intended) and overcame is divorcing database changes from code compatibility. They have achieved one of the holy grails of release engineering by successfully decoupling the two: database changes and code changes ship entirely independently of each other. That gives them a tremendous amount of flexibility and makes their top-to-bottom Push Engineering concept a reality.

So after watching the Push Engineering video, management has challenged us (with some funding) to do some of the same things. Our stated goal is to get our software release cycle down from 3-4 months to 3-4 weeks. Our current process & technology has evolved over the last few years; we last made major changes to our develop-build-deploy process/platform/methodology about 2 years ago and we've been coasting on it since, reaping the rewards and writing, testing and releasing software as fast as we can.

So while I don't think we're going to achieve database & code separation any time soon, I think we can pull the achievable bits from the Push Engineering concept that best apply to how we work and accomplish a few notable things. A handful of us leads sat down and sketched out, at a high level, what it would take.

Here are a few of "my" issues that I intend to address.
  1. Our "core" book of sites and services comprise 50 linked builds in a very complex parent-child relationship with a handful of pseudo-circular references thrown in for fun. A full core rebuild on a build server cluster made from virtual machines takes at least 60 long minutes. Big changes or small changes, it takes 60 minutes to see them.  
  2. For those 60 minutes, the environment is intermittently unavailable, which testers just love. There is always something unavailable during the rebuild process; the environment may look like it's up, but often the core build cycle is single-threaded on building a back-end process or service that is largely transparent to end users, so just when testers think the environment is stable, whammo, they're kicked out. While testers have the ability to check build status, it can be difficult to spot pending builds on a build controller hosting 250+ automated builds. I like to tell them we're a CI shop - it just stands for Continuous Interruptions ;)  They don't laugh :(
  3. Sorta-falsely broken builds - builds often go "broken" when, after an atomic commit, a child-of-a-child build needs a change from a parent-of-a-parent build. Because each build polls source control for changes on its own timer, and I can only define a single layer of parent-child relationships, a grandchild build will often kick off before the grandparent change it needs has built, and the build goes red.
  4. We use a tremendous number of post-build events at the project level to keep developers working locally; these events can be fragile, and they constantly copy unnecessary binaries around. We need them because developers build out of Visual Studio, not from an msbuild script, so none of the msbuild-scripted events that push shared dll's into a common library folder ever run. Why do we use library folders for these common build artifacts? Because the build server software isolates each build away from all other source to generate a clean build that will stand up to DoD-level regulations, you can't reference projects outside your top-level solution. You have to use static references and maintain a huge library of binary artifacts built from your own codebase (see the first sketch after this list).
  5. Because we go 3-4 months between major releases, when we do drop a major release of code & db into Production, we have days of intense firefighting, then a couple weeks of medium firefighting, then a few more weeks of putting out lingering brushfires and smoldering stumps. Then we gear up for our next major release ;)
  6. Our releases also end up being so large that they take 4 hours to deploy and 8 more hours to validate.
  7. Because we use all static references and no project references outside the solution, I've seen developers keep eight Visual Studio instances open simultaneously to make a change in one webservice; we don't have the funding for hot-rod workstations, so we're burning expensive developer time having them wait on their machines. There's only so much performance you can get out of a $600 PC from Dell or HP, even after you add another $400 of RAM and a $50 video card.
  8. We don't have unit tests running consistently, due to how our build system evolved. The developers have written more than 1000 unit tests, but the automated build server only runs about a quarter of them reliably (see the second sketch after this list).
  9. Our test coverage tools are manual-run only; they don't really fit into our current build structure.
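
To make items 4 and 7 concrete, here's roughly what the pattern looks like in a .csproj today versus what a project reference would look like. This is a sketch only - the project names, paths and events are made up for illustration, not lifted from our codebase:

<!-- Item 4: a typical fragile post-build event, shuffling a freshly-built
     dll into a shared library folder so other solutions can pick it up.
     (Hypothetical names and paths.) -->
<PropertyGroup>
  <PostBuildEvent>xcopy /y "$(TargetPath)" "$(SolutionDir)..\lib\"</PostBuildEvent>
</PropertyGroup>

<!-- Item 7: every consuming project then takes a static file reference to
     that copied binary... -->
<ItemGroup>
  <Reference Include="Acme.Common">
    <HintPath>..\..\lib\Acme.Common.dll</HintPath>
  </Reference>
</ItemGroup>

<!-- ...where a project reference would let Visual Studio and msbuild work
     out the build order and copy outputs for us, but that only works when
     the project lives inside the same top-level solution, which our
     isolated build setup doesn't allow. -->
<ItemGroup>
  <ProjectReference Include="..\..\Acme.Common\Acme.Common.csproj" />
</ItemGroup>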
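
And for item 8, one way I can picture fixing it is to stop depending on each of the 250+ automated builds having tests wired in by hand, and instead have a single scripted step sweep up every test assembly in the build output and run it. A minimal sketch, assuming NUnit's console runner and made-up $(BuildDropFolder)/$(NUnitConsolePath) properties - not our actual setup:

<!-- Hypothetical msbuild target: find every *.Tests.dll under the build
     drop folder and run them all through the NUnit console runner. -->
<Target Name="RunUnitTests" AfterTargets="Build">
  <ItemGroup>
    <TestAssemblies Include="$(BuildDropFolder)\**\*.Tests.dll" />
  </ItemGroup>
  <Exec Condition="'@(TestAssemblies)' != ''"
        Command="&quot;$(NUnitConsolePath)\nunit-console.exe&quot; @(TestAssemblies->'&quot;%(FullPath)&quot;', ' ') /xml:TestResults.xml" />
</Target>

If every build funnels its outputs into one place like that, the same target could also hand the assemblies to a coverage tool and chip away at item 9 while we're at it.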
So to sum it up: we have a long rebuild cycle and the environment is down for virtually the entire thing, Continuous Integration really means Continuous Interruptions, the build frequently breaks because of the actual build order, we spend a lot of our time copying large binaries around and storing them in source control, we release too infrequently and our major releases take 12 hours to deploy and validate, developers have to run several Visual Studio instances and constantly trigger complex build orders just to get changes to show up locally on PCs that aren't equipped for any of it, only a quarter of our unit tests get run outside the developers' workstations, and test coverage reports aren't being generated.

There's more, but let's stop with these major ones. We have a lot of challenges! My next post will lay out my plans for the next 3 months to resolve them.

Think 1 stone, 20 birds. Unfortunately, I have about 100 birds to take out. But that's only 5 stones if I throw them just right!

-Kelly

Wednesday, October 12, 2011

Facebook "Push" system

Chuck Rossi, a head engineer at Facebook, gives a rough description of FB's build and release process as part of the developer onboarding session. It's about 52 minutes long, and well worth it.