Over the past few days our Dial-A-Service phone number has been unavailable. This blog post takes a quick look at what happened, why it happened, how we fixed it, and how we’re stopping it from happening again.
We store all of our service audio, along with a list of all our services and the times they happen, in a place called GitHub. We then use a platform called Twilio to do the actual heavy lifting of connecting to the telephone network.
GitHub isn’t really designed for serving up static content, so we use an intermediary layer called a CDN(1), which helps lighten the load by temporarily storing copies of our service audio.
On the internet, most things (including our CDN) use something called a certificate to prove who they are and to encrypt traffic. Unfortunately, due to a misconfiguration, the CDN we were using was no longer presenting a valid certificate. When our code on Twilio asked for the latest list of services and the CDN responded with an invalid certificate, our code behaved exactly as it was supposed to and refused to accept the content.
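For the technically curious, here is a minimal sketch of that refusal. It is not our actual Twilio code: it is written in Python purely for illustration, and the function name `fetch_service_list`, the `UntrustedSourceError` class, and the example URL are all assumptions. It relies on the fact that Python's standard HTTPS client checks certificates by default.

```python
import json
import ssl
import urllib.error
import urllib.request


class UntrustedSourceError(Exception):
    """Raised when the server's certificate cannot be verified."""


def fetch_service_list(url, opener=urllib.request.urlopen):
    """Download and parse the list of services from `url`.

    `opener` defaults to urllib's client, which verifies HTTPS certificates.
    It is a parameter only so the function can be tried without a network.
    """
    try:
        with opener(url, timeout=10) as response:
            return json.loads(response.read().decode("utf-8"))
    except ssl.SSLCertVerificationError as exc:
        # Certificate failure raised directly.
        raise UntrustedSourceError(f"refusing content from {url}: {exc}") from exc
    except urllib.error.URLError as exc:
        # urllib normally wraps the certificate failure in a URLError.
        if isinstance(exc.reason, ssl.SSLCertVerificationError):
            raise UntrustedSourceError(
                f"refusing content from {url}: {exc.reason}"
            ) from exc
        raise  # some other network problem; not a certificate issue
```

The important point is that nothing is parsed or played back unless the certificate check passes. A bad certificate stops the request before any content is accepted.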
Why did this happen?
We don’t know. We’re not responsible for maintaining this CDN, but we do know from experience that this problem can be caused by any one of many things.
How did we fix it?
We swapped to an alternative CDN provider with a valid certificate. This was a very small configuration change, with no impact on how the service itself works.
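To give a feel for how small a change like this can be, here is a hedged sketch in Python. The host names are placeholders (this post does not name either CDN), and `CDN_BASE` and `asset_url` are made-up names for illustration, not our real configuration.

```python
from urllib.parse import urljoin

# Placeholder hosts: the real CDN providers are not named in this post.
OLD_CDN_BASE = "https://old-cdn.example.invalid/dial-a-service/"
NEW_CDN_BASE = "https://new-cdn.example.invalid/dial-a-service/"

# Swapping provider means changing this one line.
CDN_BASE = NEW_CDN_BASE


def asset_url(path, base=CDN_BASE):
    """Build the full address of a file (service list or audio) on the CDN."""
    return urljoin(base, path)
```

Because every file address is built from the one base value, the rest of the code does not need to know which provider is in use.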
How are we making sure it doesn’t happen in future?
We’re doing a couple of things to help make sure this doesn’t happen again or, if it does, that the impact is better contained:
Better error handling
At the moment, when things go wrong, our code tends to fail entirely, which causes a very generic and very American-sounding message to be played back on the phone line. We’re going to tweak this behaviour so that callers instead hear a more polite message letting them know that we’re experiencing some technical difficulties.
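As a rough idea of what this could look like, the sketch below catches any failure and replies with a short spoken apology instead. It is an illustration in Python under stated assumptions, not the change itself: `handle_call`, `load_services`, `build_menu` and the wording of the message are all hypothetical. The reply uses Twilio's `<Say>` instruction with its `language` attribute set to British English.

```python
from xml.sax.saxutils import escape

# Hypothetical wording; the final message has not been decided.
FALLBACK_MESSAGE = (
    "Sorry, we are experiencing some technical difficulties. "
    "Please try again later."
)


def say_twiml(message, language="en-GB"):
    """Return a TwiML document that reads `message` aloud to the caller."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<Response><Say language="{language}">{escape(message)}</Say></Response>'
    )


def handle_call(load_services, build_menu):
    """Answer a call, falling back to a polite message if anything fails.

    `load_services` fetches the list of services and `build_menu` turns it
    into TwiML; both are stand-ins for whatever the real code does.
    """
    try:
        return build_menu(load_services())
    except Exception:
        # Any failure (bad certificate, missing file, bad data) ends up here,
        # so the caller hears our message rather than a generic error.
        return say_twiml(FALLBACK_MESSAGE)
```

Catching every kind of error is deliberately broad here: for a phone line, a polite apology is a better outcome than a crash, whatever the underlying cause.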
Seeing if we can move things to our website
Our own website already contains all of the data and audio files that are needed to put together Dial-A-Service, but this is a relatively new change. We’ll think about ways we might be able to remove a whole third-party dependency, as well as simplify our administration.