Oops: Dial-A-Service outage

Over the past few days our Dial-A-Service phone number has been unavailable. This blog post takes a quick look at what happened, why it happened, how we fixed it, and how we’re stopping it from happening again.

What happened?

We store all of our service audio, along with a list of all our services and the times they happen, in a place called GitHub. We then use a platform called Twilio to do the actual heavy lifting of connecting to the telephone network.

GitHub isn’t really designed for serving up static content, so we use an intermediary layer called a CDN(1), which helps lighten the load by temporarily storing copies of our service audio.

On the internet, most things (including our CDN) use something called a certificate to identify who they are and encrypt traffic. Unfortunately, due to a misconfiguration, the CDN we were using was no longer presenting a valid certificate. When our code on Twilio asked for the latest list of services and the CDN responded with an invalid certificate, our code behaved exactly as it was supposed to and refused to accept the content.
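For the curious, here is a minimal sketch in Python of what "refusing to accept the content" can look like. It is not our actual Twilio code: the function name, the URL handling and the injectable opener parameter are assumptions made purely for illustration. The point it shows is that a default TLS context checks both the certificate and the hostname, so a bad certificate stops the download before any content is read.

```python
import json
import ssl
import urllib.error
import urllib.request


def fetch_service_list(url, opener=urllib.request.urlopen):
    """Fetch a JSON list of services, or return None if the connection is refused.

    `opener` is a hypothetical hook so the function can be exercised
    without a real network connection.
    """
    # The default context requires a valid certificate that matches the hostname.
    context = ssl.create_default_context()
    try:
        with opener(url, context=context, timeout=10) as response:
            return json.loads(response.read().decode("utf-8"))
    except (ssl.SSLError, urllib.error.URLError):
        # An invalid certificate ends up here: nothing from the server is trusted.
        return None
```

Returning None rather than crashing is a design choice for this sketch; it leaves the caller free to decide what to tell the person on the phone.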

Why did this happen?

We don’t know. We’re not responsible for maintaining this CDN, but we do know from experience that this problem can be caused by any one of many things.

How did we fix it?

We swapped to an alternative CDN provider with a valid certificate. This was a very small configuration change, with no impact on how the service itself works.

How are we making sure it doesn’t happen in future?

We’re doing a couple of things to help make sure this doesn’t happen again or, if it does, that the impact is smaller:

Better error handling

At the moment, when things go wrong, our code tends to fail entirely, which causes a very generic and very American-sounding message to be played back on the phone line. We’re going to tweak this behaviour so that instead people will get a more polite message letting them know that we’re experiencing some technical difficulties.
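As a rough sketch of the idea (again in Python, and again not our real code), the change amounts to catching the failure and choosing what the caller hears. The message wording, the dictionary shape and the `audio_url` field are all hypothetical; the real instructions sent to the phone platform will look different.

```python
# Hypothetical wording for the polite fallback message.
FALLBACK_MESSAGE = (
    "Sorry, we're experiencing some technical difficulties. "
    "Please try again later."
)


def build_call_response(load_services):
    """Decide what a caller hears, given a function that loads the service list."""
    try:
        services = load_services()
    except Exception:
        # Broad on purpose in this sketch: any failure gets the polite message.
        return {"say": FALLBACK_MESSAGE}
    if not services:
        # An empty or missing list is treated the same way as a failure.
        return {"say": FALLBACK_MESSAGE}
    # Assumed data shape: each service is a dict with an "audio_url" entry.
    return {"play": services[0]["audio_url"]}
```

The important part is that every failure path ends in a message we wrote ourselves, rather than whatever the platform plays by default.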

Seeing if we can move things to our website

Our own website already contains all of the data and audio files that are needed to put together Dial-A-Service, but this is a relatively new change. We’ll look at whether Dial-A-Service can use our website directly, which would remove a whole third-party dependency and simplify our administration.

References

Oops: Excessive buffering during the service

In this morning’s service a few people noticed that their connection was having to buffer more than usual, and that the quality of the video and audio kept fluctuating. This blog post takes a quick look at what happened, why it happened, and what we’re doing to fix it.

What happened?

The internet connection to the church wasn’t capable of sustaining the speeds needed for smooth streaming of video. This meant that people watching would see the video stutter or sometimes pause entirely.

Why did this happen?

We’re not sure. The path to get an internet connection into the church is a relatively complex one compared to what you might have at home, involving several points where things might slow down. We weren’t able to quickly identify what was wrong during the service.

What are we doing to fix it?

There are two problems we’re fixing:

Making sure we can stream services

We’re reducing the threshold at which we decide to use a backup mobile connection to stream services. It’s difficult to change our connection mid-service (although we can in an emergency), so for the time being we’ll be using our backup connection unless we’re absolutely certain our main one is behaving as expected.
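To give a feel for the "backup unless we’re certain" rule, here is a small Python sketch. The function name, the idea of taking speed samples before the service, and the 10 Mbps figure are all assumptions for illustration rather than our real numbers.

```python
def choose_connection(upload_samples_mbps, required_mbps=10.0):
    """Pick "main" only if every recent speed sample is good enough.

    `required_mbps` is an illustrative figure, not our real threshold.
    """
    if not upload_samples_mbps:
        # No measurements means no certainty, so play it safe.
        return "backup"
    if min(upload_samples_mbps) >= required_mbps:
        return "main"
    # A single slow sample is enough to fall back.
    return "backup"
```

Using the slowest sample rather than the average is deliberate here: a connection that is fast on average but occasionally dips is exactly the kind that causes buffering.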

Making sure the internet connection to the church is stable

We’ve put some extra monitoring in place to see if we can narrow down which bit of the chain is at fault and then investigate further, but since the problem doesn’t appear all the time it might take us a few weeks before we’re able to properly identify it.
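One simple way to think about narrowing down the faulty bit of the chain is to record whether regular checks against each point succeed, then see which point fails most often. The Python sketch below assumes that kind of data; the point names and the format of the results are hypothetical and don’t describe our actual monitoring setup.

```python
from collections import defaultdict


def worst_hop(results):
    """Return (hop, failure_rate) for the hop that fails most often, or None.

    `results` is an iterable of (hop_name, succeeded) pairs, one per check.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for hop, succeeded in results:
        totals[hop] += 1
        if not succeeded:
            failures[hop] += 1
    if not totals:
        return None
    # Compare failure rates rather than raw counts, in case some hops are checked more often.
    hop = max(totals, key=lambda name: failures[name] / totals[name])
    return hop, failures[hop] / totals[hop]
```

Because the problem is intermittent, a tally like this only becomes meaningful after a good number of checks, which is why it may take a few weeks.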

Depending on where the problem lies, the solution could be as simple as a quick configuration change, or it might need replacement hardware for our network or the involvement of our connection provider.

Oops: Audio issues during service

In this morning’s service, the sound quality at the beginning of the service wasn’t up to our usual standard. This blog post takes a quick look at what happened, why it happened, why we didn’t catch it sooner, what we did to fix it, and how we’re making sure it doesn’t happen again.

What happened?

Just after the service started, some of our viewers reported that the sound was “choppy” or “bumpy”. Thanks to those who let us know – as you’ll read later, without this feedback we wouldn’t have known there was a problem.

This problem only affected the service when we were using our wide-angle camera; when we switched to the lectern camera the problem disappeared.
