Resolved -
We have experienced an instability in Kubernetes CoreDNS that caused our internal name resolution to fail, and as a consequence, our alerting pipeline was impacted, since alerts rely on webhooks that could not be delivered. As an improvement action, we will deploy CoreDNS as a DaemonSet to ensure resilience and consistent availability across all nodes. We will also add an internal DNS probe within the cluster to monitor DNS-related failures and provide early detection even when CoreDNS is degraded.
Dec 8, 12:27 GMT-03:00
Update -
Communication between our services and the workload manager has been restored, and the environment is stable. We continue to monitor the situation and will update this page soon with more information about the incident.
Dec 8, 12:05 GMT-03:00
Monitoring -
A fix has been implemented and we are monitoring the results.
Dec 8, 11:59 GMT-03:00
Update -
We have identified a communication failure between some of our servers and our workload manager, which affected the application distribution process.
Dec 8, 11:38 GMT-03:00
Identified -
The issue has been identified and a fix is being implemented.
Dec 8, 11:22 GMT-03:00
Investigating -
We are currently investigating this issue.
Dec 8, 11:07 GMT-03:00