To code or not to code

Harsh S Kulshrestha
3 min read · Jun 22, 2021

“I’m not able to access the database. We probably have to redo everything again.”

I got this text at 23:30 IST from a very close client of ours. I jumped out of bed and tried to understand what exactly had gone wrong.

One of his team members wasn’t able to connect to the database. Having set up the database inside a private subnet for added security, I knew it was not going to be straightforward for anyone to access it. That was the intent. We had discussed this with the client, and we were all on the same page that it was the right thing to do.

Except that someone had tried accessing it the usual way and couldn’t. It just wouldn’t connect. And it shouldn’t have, as it wasn’t public. While I was sure this was an issue we could deal with, I received another text:

“I’m shutting down elastic beanstalk”

“The servers keep coming up even when I shut them down.” This didn’t shock me. The infra was set up to keep the app running all the time and to self-heal from unexpected outages. I was sure this was not as big a deal as it seemed. But I was nervous. We had to go live the day after, I had to leave for another city in 5 hours, and I only had those 5 hours to set everything back to its original state. Why the original state, you ask? Because no change had been required to connect to the database in the first place.

If you are aware of IaC (infrastructure as code), you probably know where I am going with this. It’s the thing that automates your infrastructure setup via code. I was quite relieved we had automated the infra setup from day 1. And it hadn’t taken us long either. Only a day’s effort, I’d say. As a result, I was confident everything would be quite easy to restore.

But there was a catch. The infra had been torn down manually. Our infra monitoring stack was not aware of this change and would expect the resources to still be there. What this means is that if I were to redeploy the infra, nothing would change, since the observer (for example, the tfstate file in Terraform) hadn’t been informed of the change in the physical state of the infrastructure.
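To make the catch concrete, here is a minimal sketch of a toy deployment tool that plans only by comparing the desired configuration against its recorded state. The resource names and the `plan` function are hypothetical, and real tools differ in whether and when they refresh recorded state against the cloud provider, so treat this as an illustration of the idea rather than a description of any specific tool.

```python
def plan(desired, recorded):
    """Diff the desired config against the recorded state only.

    Assumption: this toy tool never checks what actually exists in the cloud.
    """
    to_create = sorted(desired - recorded)
    to_destroy = sorted(recorded - desired)
    return to_create, to_destroy


# Hypothetical resource names, for illustration only.
desired = {"beanstalk-env", "database", "vpc"}
recorded_state = {"beanstalk-env", "database", "vpc"}  # what the tool believes exists
actual_cloud = set()  # everything was deleted by hand

create, destroy = plan(desired, recorded_state)
print("to create:", create)    # nothing, the tool thinks all is well
print("to destroy:", destroy)  # nothing

# What is really missing, which the tool cannot see from its state alone.
missing = sorted(desired - actual_cloud)
print("actually missing:", missing)
```

The plan comes back empty even though every resource is gone, which is why a plain redeploy would not have helped.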

There was definitely a risk. But it was worth a try. We started by simulating a teardown via the infra code so that the current state could be mimicked in the state files. But our deployment got stuck. Probably because, now that the state file had been informed of a teardown, it went ahead and attempted to tear down some resources. But those resources no longer existed in the first place.

We had to manually abort the deployment and initiate another one. Except this time, instead of purely tearing down the existing resources, we requested a rename of the resources. This would first create new resources with the new names and then attempt to tear down the existing ones. Surprisingly, this worked like a charm. All we had to do now was undo the rename and deploy the stack again. That way it would destroy the newly named resources and create fresh resources with the original names. It sounds complicated, but believe me, it isn’t.
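The rename round trip can be pictured with a toy model. This is only a sketch: the `deploy` function and resource names are hypothetical, and it assumes that cleaning up an already-missing resource is a harmless no-op, which the story above suggests but does not confirm.

```python
def deploy(desired, recorded, cloud):
    """Toy deploy: create what is new first, then clean up what is no longer wanted.

    Assumption: removing a resource that is already gone is a no-op.
    Returns the new recorded state.
    """
    for name in sorted(desired - recorded):
        cloud.add(name)        # create-first step
    for name in sorted(recorded - desired):
        cloud.discard(name)    # cleanup step, tolerant of missing resources
    return set(desired)


original = {"app", "db"}        # hypothetical original names
renamed = {"app-v2", "db-v2"}   # hypothetical temporary names

cloud = set()                   # resources were deleted by hand
recorded = set(original)        # stale state still lists them

# Step 1: deploy with renamed resources.
recorded = deploy(renamed, recorded, cloud)
after_rename = set(cloud)

# Step 2: undo the rename and deploy again.
recorded = deploy(original, recorded, cloud)

print(sorted(after_rename))  # renamed resources exist after step 1
print(sorted(cloud))         # original names are back after step 2
```

After the second deploy, the recorded state and the real resources agree again, both under the original names.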

Within 5 minutes we were able to restore Elastic Beanstalk, the database, VPCs, subnets, monitoring, connectivity, everything! Oh, and about connecting to a privately hosted database: we just needed a bastion host to act as a hop.
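For readers who have not used a bastion before, the hop is typically an SSH local port forward. The sketch below only builds the command. The hostnames, the `ec2-user` login, and port 5432 (which assumes a PostgreSQL database) are all placeholders, since the post does not say which engine or hosts were involved.

```python
import shlex


def tunnel_command(bastion, db_host, db_port, local_port, user="ec2-user"):
    """Build an ssh command that forwards a local port to the private database.

    -N: do not run a remote command, just forward ports.
    -L: local_port on this machine -> db_host:db_port as seen from the bastion.
    """
    forward = f"{local_port}:{db_host}:{db_port}"
    return ["ssh", "-N", "-L", forward, f"{user}@{bastion}"]


# Placeholder values, not taken from the original setup.
cmd = tunnel_command("bastion.example.com", "db.internal.example", 5432, 15432)
print(shlex.join(cmd))
```

With the tunnel running, a database client would connect to `localhost:15432` instead of the private address, and the database itself never has to be made public.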

Our client had been worried it would be hard to restore everything, since a lot of random changes had been tried by the team at his end, and he was surprised by how quickly and gracefully everything was restored. It didn’t need to be torn down in the first place. But mistakes do happen. It’s okay! And how easy it was to recover from such an accident made the effort worth it.

I did have doubts initially about whether investing in IaC was worth the effort at such an early stage of the project. Two weeks later, it was the key thing that brought us back from scratch. So I guess that one day of effort was worth it to save n days of restoring things.

Harsh S Kulshrestha

Developer, consultant, helping startups with their tech