Drivers Are Getting Sent to Africa

Ivan Chebykin

December 7, 2024

The title of this article is the name of a support ticket I once received. But what happened?

Everyone in tech has that one story: the time they broke prod, deleted something, messed things up. For me, it happened when I was a junior engineer at a company that was building something like an Uber app. I was working on a service that suggested drivers move to a place with better orders… aaand I sent them to Africa.

In particular, I was refactoring a feature in our app, the component that helped drivers find high-demand areas where rides were needed. Our team was slowly moving away from an old monolithic structure to new microservices, and my job was to pull this feature out of the old code and get it running in its own service. It seemed straightforward, and I was excited to tackle something this important. What could go wrong, right?

Getting Started: Moving the Feature to Microservices

Our codebase was initially divided into two large sets of services by language: Python and C++. Don’t ask why we used C++; that’s a story for another article. Fortunately, C++ wasn’t the culprit behind this bug. But once a C++ service gets large enough, it takes a really long time to compile: ours took at least 30 minutes on each new branch. It also didn’t help that we used templates a lot, and the service binary took up multiple gigabytes!

Deploys were super slow. Because of this, and because we were moving to a new deployment infra that was way faster, we decided to split the feature out into a new microservice.

So, the feature itself was relatively simple:

1. Get orders from the map relatively close to drivers. We used a spatial index similar to Uber’s H3.
2. Filter and sort places with orders based on the driver-to-order ratio and some other factors.
3. Return the coordinates that will be sent to the driver in the app. Oh, and before sending the coordinates, make sure that they “snap” to the correct roadside.
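The filter-and-sort step could be sketched roughly like this. It is a minimal illustration, not the original service code: the `CellStats` type, the `rank_cells` function, the `min_orders` threshold, and the sample numbers are all made up for the example, and the “other factors” are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellStats:
    cell_id: str   # hypothetical spatial-index cell identifier
    lat: float     # cell center latitude
    lon: float     # cell center longitude
    orders: int    # open orders in the cell
    drivers: int   # free drivers already in the cell


def rank_cells(cells, min_orders=1):
    """Drop cells without demand, then sort by orders per driver, best first."""
    candidates = [c for c in cells if c.orders >= min_orders]
    # The +1 avoids division by zero for cells that have no drivers at all.
    return sorted(candidates, key=lambda c: c.orders / (c.drivers + 1), reverse=True)


# Made-up sample data.
cells = [
    CellStats("a", 10.00, 20.00, orders=10, drivers=9),
    CellStats("b", 10.01, 20.01, orders=6, drivers=1),
    CellStats("c", 10.02, 20.02, orders=0, drivers=0),
]
best = rank_cells(cells)
```

Here cell `b` ranks above cell `a` because it has more orders per driver, and cell `c` is dropped because it has no orders.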

We had a fancy service for snapping the coordinates, which I was supposed to integrate with.

Integrating the "snap" API

A bit more about integrations: our services communicated through REST APIs, and all of our APIs were generated via OpenAPI. We had guidelines for how an API should behave, and everything was smooth. With the new infra, we moved from openapi-generator to our in-house version, so the responses had changed slightly.

Going back to that snap service: as it turned out, it had a funny quirk. It didn’t follow the guidelines. If it couldn’t snap coordinates to a road, it didn’t return an error like most services would. Instead, it quietly returned zeroed-out coordinates, with latitude and longitude both set to 0. For anyone keeping track, those coordinates just happen to be off the coast of Africa, at the so-called Null Island.

[Image: map]

I missed this in my tests because I didn’t bother checking what happened when the service failed. The response actually had a way to tell whether an error had happened, but it was an uncommon one, and I relied on the HTTP status code instead. I just assumed the service would give us a clear error. But nope: there it was, quietly sending back a big zero.
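In hindsight, the client-side handling could have looked something like the sketch below. The field names `error`, `lat`, and `lon` are assumptions for illustration; the real ones would come from the snap service’s OpenAPI schema, which is not reproduced here.

```python
class SnapError(Exception):
    """Raised when the snap service could not place a point on a road."""


def parse_snap_response(status_code, body):
    """Return (lat, lon) from a snap response, or raise SnapError."""
    # An HTTP error is the obvious failure signal, but not the only one.
    if status_code != 200:
        raise SnapError(f"snap service returned HTTP {status_code}")
    # Hypothetical in-body failure flag: check it even when the status is 200.
    if body.get("error"):
        raise SnapError(f"snap service reported: {body['error']}")
    lat, lon = body["lat"], body["lon"]
    # Last line of defense (assumption: no real road snaps to exactly 0, 0).
    if lat == 0 and lon == 0:
        raise SnapError("snap service returned zeroed coordinates")
    return lat, lon
```

The point is that a 200 status alone does not mean the snap succeeded, so the body gets checked too.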

“Drivers Are Getting Sent to Africa”

Once we started rolling this out, it was a smooth launch for the first few hours. Then, out of nowhere, we received a support ticket called: “Drivers Are Getting Sent to Africa”.

Every time you receive a ticket like that from support, it feels like that scene from Chernobyl:

[Image: chernobyl]

Hello, is this the fire department #2?
Yes.
What is on fire?

Everything was on fire, and I was pretty shocked. My team lead and I dove into the logs, and it didn’t take long to spot the problem. Those zeroed coordinates were coming through as if they were actual locations, and our service was happily sending drivers off to the middle of the ocean. Thankfully, we were able to stop the rollout fast and switch off the feature flag before it caused too much damage.

What I Learned (the Hard Way)

Looking back, this whole mess taught me a few things that I’ll carry into every project from now on:

Always Test for Negative Edge Cases

It’s not enough to assume things will fail gracefully. In this case, testing for empty or default responses would have saved us from a very strange bug.
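A negative test for this can be tiny. The sketch below uses pytest-style test functions and a fake snap function that mimics the silent failure; `suggest_destination` and both fakes are hypothetical names invented for the example.

```python
def suggest_destination(snap, lat, lon):
    """Return snapped coordinates, or None when snapping fails.

    `snap` is any callable that takes (lat, lon) and returns (lat, lon).
    """
    snapped = snap(lat, lon)
    if snapped == (0.0, 0.0):
        return None  # treat the silent "zero" answer as a failure
    return snapped


def fake_snap_ok(lat, lon):
    # Pretend the road is a tiny offset away from the requested point.
    return (lat + 0.0001, lon + 0.0001)


def fake_snap_silent_failure(lat, lon):
    # Mimics the quirk from the story: no error, just zeroes.
    return (0.0, 0.0)


def test_happy_path():
    assert suggest_destination(fake_snap_ok, 10.0, 20.0) is not None


def test_silent_failure_is_not_forwarded():
    assert suggest_destination(fake_snap_silent_failure, 10.0, 20.0) is None
```

The second test is the one I never wrote: it feeds the code the failure response and checks that nothing gets forwarded to the driver.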

Validate Every Response

Even when something “looks” like a valid response, it doesn’t mean it’s usable. I learned to never trust data from an external service without running it through some checks.
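One possible check, sketched under assumptions: a snapped point should be a valid coordinate and should not land far from the point that was requested. The 1 km threshold and the function names are made up for illustration; the distance uses the standard haversine formula.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # min() guards against floating-point values slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def is_plausible_snap(requested, snapped, max_shift_km=1.0):
    """True if `snapped` is a valid coordinate close to `requested`."""
    lat, lon = snapped
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return False
    return haversine_km(*requested, *snapped) <= max_shift_km
```

With a check like this, a response of (0, 0) for a driver anywhere far from Null Island is rejected without special-casing zero, because the shift is thousands of kilometers.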

Read the Documentation

Don’t assume that a service will behave in a “reasonable” way. First, everyone has a different definition of reasonable; second, it’s just better to be sure. In our case, everything was written in the OpenAPI schema, to which I should’ve paid more attention.

Feature Flags and Gradual Rollouts Are Lifesavers

Without a feature flag controlling the rollout, this would’ve hit every single driver all at once. Gradual rollouts are key for catching issues before they become massive problems.
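For readers who have not built one, a percentage-based rollout can be sketched with stable hashing. This is a generic illustration, not our actual flag system; `in_rollout`, the flag name, and the driver IDs are hypothetical.

```python
import hashlib


def in_rollout(driver_id: str, flag_name: str, percent: int) -> bool:
    """Deterministically place a driver into one of 100 buckets for a flag."""
    digest = hashlib.sha256(f"{flag_name}:{driver_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    # Drivers in buckets below `percent` get the new behavior.
    return bucket < percent


# The same driver always gets the same answer for the same flag and percentage.
enabled = in_rollout("driver-42", "new-relocation-service", 10)
```

Because the bucket depends only on the flag name and the driver ID, raising `percent` only ever adds drivers, and setting it to 0 switches the feature off for everyone at once.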

Looking Back

This experience taught me more than a hundred tutorials ever could. Breaking things is part of the learning process, especially when you’re new and working with complex systems. Now, I test everything a little more carefully, especially error handling. I’m a lot more aware of those “weird” edge cases I might’ve ignored before. I roll things out gradually, and if the infra for that is not in place, I’d rather build it first (if the product already has users, of course).

So, yeah, that’s how I accidentally sent drivers to Africa. I’ve learned from it, and that’s what counts.