I’ll start by saying that this is an honest question, and you will not find an answer in this post.
The question was posed to me by the manager of our Tier II support team a few months ago. We had just had a short-lived production outage caused by rolling all our traffic over to a new deployment. We didn’t have canary releases working yet, and our alternative was to set a 1s TTL on the DNS for this service so we could switch quickly between new and old if the new one wasn’t working.
In our post-incident review, I called out that moving towards Continuous Delivery (CD) was a goal of the engineering department. We weren’t going to try to cause outages, but once we got to canary deployments, we would kinda-sorta purposefully be putting things into production that might be broken: things that had certainly been tested, but could only be tested faithfully in a production setting.
Canary deploys and testing-in-prod are sexy things in the DevOps/SRE space these days, but they operate on the assumption that sometimes you’re gonna break prod, in a minor, reversible way. That means you’re gonna generate support tickets sometimes…on purpose.
That manager, who is a very cool dude, took all this in stride, and went and spent a couple days researching CD. He came back and said, “Dave, I found lots of articles on CD from an engineering perspective, but none of them talked about how you support CD. And none of them were written by a support team.”
One of the best things that has happened to our team over the last year has been better teamwork and communication with our support team. The building of that relationship was started by a Tier II support team member, and I have thanked her for it many times. Before I make any change that affects production, I now think from a support perspective about what effect the change will have on customers. And yeah…CD is great for engineers…it’s great for customers in the long run…but if you divert 1-5% of traffic to a new version of a service that might be broken, you’re probably gonna give someone a bad experience, and you’re probably going to get some tickets, sometimes.
Since all these articles are written from engineering’s perspective, the perspective is of course that the advantages are worth the harm. But has that been negotiated with support? Has support been considered at all? If it has, has the consideration been simply “yep, this will affect support!”? Or was there buy-in from support teams on adopting CD/canary/testing-in-prod/etc?
We’re not at CD yet with most of our services, but as we get closer I believe we will negotiate this as a team. We already have a preliminary window during which pushes are allowed to trigger deployments. It’s not a “don’t deploy on Fridays” sort of schedule, or a “we release at 1:00pm each day” sort of schedule. It is a window when support has the most coverage, so they can respond to potential tickets within their target response time.
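To make the idea of a support-coverage window concrete, here is a minimal sketch of a gate that a deploy pipeline could call before rolling out a push. The hours, the weekdays, and the function name `deploy_allowed` are all made up for illustration; our real window and tooling are not described in this post.

```python
from datetime import datetime, time, timezone

# Hypothetical window, expressed in UTC. These values are assumptions
# for the sketch, not our actual support schedule.
WINDOW_START = time(14, 0)
WINDOW_END = time(21, 0)
COVERED_WEEKDAYS = {0, 1, 2, 3, 4}  # Monday=0 ... Friday=4


def deploy_allowed(now: datetime) -> bool:
    """Return True if `now` falls inside the support-coverage window."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    utc_now = now.astimezone(timezone.utc)
    in_covered_day = utc_now.weekday() in COVERED_WEEKDAYS
    in_covered_hours = WINDOW_START <= utc_now.time() < WINDOW_END
    return in_covered_day and in_covered_hours


if __name__ == "__main__":
    if deploy_allowed(datetime.now(timezone.utc)):
        print("Inside the support window: deploy may proceed.")
    else:
        print("Outside the support window: hold the deploy.")
```

A pipeline step could run a check like this and hold the push until the window opens, rather than failing it outright.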
We’re also considering:
- an internal status page that is automatically updated (our public one is carefully managed)
- deployment notifications going directly to a support-oriented channel
- education around how to access and read our internal observability dashboards
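As a rough illustration of the deployment-notification idea, the sketch below builds a message written for a support reader rather than an engineer. Everything in it is hypothetical: the `Deployment` fields, the wording, and the example dashboard URL are assumptions, and actually posting the text to a chat channel is left out.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    # All fields are illustrative; a real pipeline would supply its own.
    service: str
    version: str
    canary_percent: int
    dashboard_url: str


def support_message(d: Deployment) -> str:
    """Format a deploy notice in plain language for a support channel."""
    lines = [
        f"Deploy started: {d.service} {d.version}",
        f"About {d.canary_percent}% of traffic is on the new version.",
        "If related tickets come in, tell engineering so we can roll back.",
        f"Dashboard: {d.dashboard_url}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    example = Deployment(
        service="checkout",
        version="v2",
        canary_percent=5,
        dashboard_url="https://dashboards.example.internal/checkout",
    )
    print(support_message(example))
```

The point of the design is that the message leads with what support needs to know (what changed, how many customers might see it, what to do about tickets) and puts the dashboard link last.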
I don’t have the answer to how to support CD. But we’re going to need to find one in the next year or so, and I know the way we’ll do it is through communication and collaboration. That’s the only way important things get done.
P.S. If I find an answer, I’ll post it!