A few days ago, AWS DMS tasks in our environment unexpectedly stopped working due to an issue with the AWS Secrets Manager integration. I only noticed the issue the next day, when engineers started complaining.
The tasks were configured to extract data from a database and load it into Amazon S3 using credentials stored in Secrets Manager. I tried restarting one of the tasks and it worked fine. At that moment, it was not clear whether the issue was on the AWS side or whether something had changed in our infrastructure. There were no notifications in the AWS Health Dashboard.
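For readers who want to script the "find the failed tasks and restart them" step, here is a minimal sketch using boto3. It is not what I ran at the time (I restarted from the console). The function names are hypothetical, and `resume-processing` is an assumption about the right restart type; a full-load task may need `reload-target` instead.

```python
from typing import Any


def failed_tasks(describe_response: dict[str, Any]) -> list[dict[str, str]]:
    """Pick failed tasks out of a DescribeReplicationTasks response."""
    failed = []
    for task in describe_response.get("ReplicationTasks", []):
        if task.get("Status") == "failed":
            failed.append({
                "id": task["ReplicationTaskIdentifier"],
                "arn": task["ReplicationTaskArn"],
                # LastFailureMessage is only present when DMS recorded one
                "reason": task.get("LastFailureMessage", ""),
            })
    return failed


def restart_failed_tasks(region: str) -> list[str]:
    """Restart every failed task in the region; returns the task IDs."""
    import boto3  # imported lazily so failed_tasks() works without AWS access

    dms = boto3.client("dms", region_name=region)
    restarted = []
    for page in dms.get_paginator("describe_replication_tasks").paginate():
        for task in failed_tasks(page):
            print(f"Restarting {task['id']}: {task['reason']}")
            dms.start_replication_task(
                ReplicationTaskArn=task["arn"],
                # Assumption: resuming is appropriate for these tasks
                StartReplicationTaskType="resume-processing",
            )
            restarted.append(task["id"])
    return restarted
```

Printing the failure message before restarting is deliberate: it is often the only place where the Secrets Manager error text shows up.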
I was about to start debugging myself, but then I thought it was a great opportunity to try AWS DevOps Agent, which went GA recently. I opened the DevOps Agent console in a browser and started creating an Agent Space.
Since this was a first-time setup, I didn't have any IAM roles for it, so I chose to have the IAM roles created automatically.
And that's pretty much it. The Agent was set up quickly and started building a topology of the entire account. That was great, since we didn't have a topology at the account level. Once the Agent Space is set up, you get Operator access, through which you can start chatting with the Agent.
I opened the Operator access while it was still building the topology. However, the chat function was already available. When I described my issue, it suggested that the error was stale, since the secret was accessible. It had checked this using recent events from CloudTrail. It also provided recommended actions. This is a great start!
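The same kind of check can be done by hand. The sketch below assumes that looking at `GetSecretValue` events is a reasonable stand-in for what the agent did (I don't know exactly which CloudTrail events it inspected), and the function names are hypothetical. It pulls recent events with boto3 and counts how many succeeded versus failed.

```python
import json
from collections import Counter
from datetime import datetime, timedelta, timezone


def summarize_secret_access(events):
    """Count GetSecretValue outcomes; 'ok' means the record has no errorCode."""
    outcomes = Counter()
    for event in events:
        # Each lookup_events item carries the full record as a JSON string
        detail = json.loads(event["CloudTrailEvent"])
        outcomes[detail.get("errorCode", "ok")] += 1
    return dict(outcomes)


def recent_secret_events(region, hours=24):
    """Fetch GetSecretValue events from the last `hours` hours."""
    import boto3  # imported lazily so the summary helper runs without AWS access

    cloudtrail = boto3.client("cloudtrail", region_name=region)
    end = datetime.now(timezone.utc)
    pages = cloudtrail.get_paginator("lookup_events").paginate(
        LookupAttributes=[
            {"AttributeKey": "EventName", "AttributeValue": "GetSecretValue"}
        ],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
    )
    return [event for page in pages for event in page["Events"]]
```

If the summary shows only `ok` for the recent window, that matches the agent's conclusion that the error was stale.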
Then I told the agent that the issue was fixed but that I wanted to find its root cause. It analysed the problem again, but this time it gave seemingly random answers that were not correct. It suggested a cross-account issue with account 925701010101. I asked for clarification about this account, as it's not ours. The agent then confirmed that it's an AWS internal service account that customers can access anyway. So this part of the analysis was bizarre and incorrect.
It also confirmed that we had not made any recent changes to the infrastructure, and therefore the issue was most likely on the AWS side. Finally, it confirmed that some tasks failed at ~02:11 UTC while another set of tasks failed at ~04:03 UTC. It suggested two possible causes, as per below.
My expectation was that AWS failures are communicated through the Health Dashboard, but there was nothing there for this issue. At the same time, we hadn't changed anything.
Finally, this morning, an alert came through in the Health Dashboard saying that, at the exact times mentioned by DevOps Agent, there were API issues between DMS and Secrets Manager.
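If you want to do this kind of correlation yourself, a simple first step is to group task failure times into clusters and compare each cluster with the times in the Health notice. Here is a minimal sketch. The timestamps are hypothetical: only the 02:11 and 04:03 UTC times come from the incident above, the date is a placeholder, and the 15-minute gap is an arbitrary assumption.

```python
from datetime import datetime, timedelta


def cluster_failures(times, max_gap=timedelta(minutes=15)):
    """Group timestamps so that neighbours within max_gap share a cluster."""
    clusters = []
    for t in sorted(times):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)  # close enough to the previous failure
        else:
            clusters.append([t])  # start a new cluster
    return clusters


# Hypothetical failure times on a placeholder date (all UTC)
day = datetime(2000, 1, 1)
failures = [
    day.replace(hour=2, minute=11),
    day.replace(hour=4, minute=3),
    day.replace(hour=2, minute=12),
    day.replace(hour=4, minute=5),
]

for group in cluster_failures(failures):
    start, end = group[0].strftime("%H:%M"), group[-1].strftime("%H:%M")
    print(f"{start} to {end} UTC: {len(group)} failure(s)")
```

Two tight clusters with nothing in between point towards two short external incidents rather than a configuration change, which would more likely fail continuously.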
Wow! That's amazing, and I'm impressed. DevOps Agent can identify issues and troubleshoot in the right direction even before AWS announces them publicly. In this scenario, I'm certain AWS Support would have been much slower than DevOps Agent, which gave me answers within a few minutes.
One more thing about the topology: the diagram that DevOps Agent created is quite impressive. I can't share it, as it is from my work account, but it found all the different applications and connected the related ones. It even identified the external vendors that are connected. That's pretty special, because it will be updated automatically.
In the next article, we’ll look at additional ways the DevOps Agent can help troubleshoot and automate operational tasks.