Azure US Region - BSS - Random failures

Major incident: BSS sites, Azure US region
2025-09-30 22:22 EEST · 23 minutes

Updates

Retroactive

On 30 September 2025, between 19:22 and 19:44 UTC, one of our application servers experienced network connectivity failures that resulted in approximately 22 minutes of intermittent failures for the BSS Portal application. While our load balancer is designed to automatically reroute traffic to healthy servers, in this case the switchover process did not complete in time, leading to a brief service interruption. During this period, users may have experienced difficulty accessing the BSS Portal, as well as issues while navigating or executing actions. Service was fully restored at 19:44 UTC, and all systems have been operating normally since then.

Impact

  • Users were unable to log in to the BSS application.
  • Some users experienced slower than usual performance.
  • A portion of requests failed to load during the incident window.
  • The issue was resolved within 22 minutes without further action by our staff.

How We Fixed It
Our team was alerted by our monitoring systems, engaged, and quickly identified the problematic server. During our investigation on the server, the random network connectivity failures no longer occurred, but we found logs that revealed the underlying cause of the failures. The investigation then extended to the availability of critical backbone services (database servers, queuing), where nothing abnormal was found. The network connectivity failures ceased on their own after 22 minutes, resolving the failures users experienced.

What We’re Doing Next
To prevent this from happening again, we are:

  • Optimizing our internal processes so that problematic servers are removed from the workload more quickly during similar incidents.
  • Enhancing load balancer health probes to include more internal health checks.
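To illustrate the second remediation point, here is a minimal sketch of a "deep" health check that a load balancer probe could call, one that reports unhealthy when an internal dependency (such as the database or queuing service) fails rather than only when the server itself is down. The function and check names (`health_status`, `check_database`, `check_queue`) are hypothetical and not the actual BSS implementation.

```python
def check_database():
    # Placeholder: in production this would open a short-lived DB connection.
    return True

def check_queue():
    # Placeholder: in production this would ping the queuing service.
    return True

def health_status(checks):
    """Run each named dependency check; report 200 only if all pass,
    otherwise 503 so the load balancer pulls the server from rotation."""
    results = {name: fn() for name, fn in checks.items()}
    status = 200 if all(results.values()) else 503
    return status, results

status, results = health_status({
    "database": check_database,
    "queue": check_queue,
})
```

A probe that treats any non-200 response as unhealthy would then remove the affected server from rotation automatically, even when basic network reachability still looks fine.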

Current Status
All systems are operating normally. No further problems have been detected since resolution.

October 2, 2025 · 18:32 EEST
