Temporary Service Interruption
Updates
On 16 October 2025, between 06:31 and 07:08 UTC, we experienced system-wide performance degradation across multiple applications due to regressed database performance on our primary production instance. The issue resulted in sustained CPU saturation, increased blocking and waits, and a 37-minute period of elevated response times and partial unavailability.
During this period, users may have experienced difficulty accessing the BSS & Storefront Portal, as well as issues while navigating or executing actions. Service was fully restored at 07:08 UTC, and all systems have been operating normally since.
Impact
- Users were not able to log in to the BSS application.
- Some users experienced slower-than-usual performance.
- A portion of requests failed to load during the incident window.
- The issue was resolved within 37 minutes; no action was required from affected users.
Root Cause Analysis
A high-frequency transactional query experienced an execution plan change following recent statistics updates. Instead of using an indexed seek strategy, the optimizer generated a plan with full table scans on a large dataset, causing the following cascading effects:
- CPU utilization spiked to >95% on the database instance
- Locking and latch contention across several dependent queries
- Connection pool exhaustion at the application layer
- Latency propagation to upstream services and APIs
The suboptimal plan was persisted in the plan cache, amplifying the impact across sessions until manual intervention.
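For illustration, this kind of regression is usually visible in the engine's own plan telemetry. The sketch below is a minimal, hypothetical example, assuming a SQL Server instance with Query Store enabled and a pyodbc connection string (both assumptions; this is not the exact tooling used during the incident). It lists queries that carry more than one recent execution plan and compares average CPU cost per plan, so a sudden plan switch stands out.

```python
# Hypothetical sketch: surface queries with more than one recent execution
# plan and compare average CPU cost per plan. Assumes SQL Server with
# Query Store enabled and a valid pyodbc connection string -- both are
# assumptions, not details taken from this report.
import pyodbc

PLAN_COMPARISON_SQL = """
SELECT q.query_id,
       p.plan_id,
       rs.avg_cpu_time,      -- average CPU per execution (microseconds)
       rs.count_executions
FROM sys.query_store_query AS q
JOIN sys.query_store_plan AS p
  ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats AS rs
  ON rs.plan_id = p.plan_id
ORDER BY q.query_id, rs.avg_cpu_time DESC;
"""


def plans_by_query(conn_str: str) -> dict:
    """Group (plan_id, avg_cpu_time, count_executions) rows by query_id."""
    grouped = {}
    conn = pyodbc.connect(conn_str)
    try:
        for query_id, plan_id, avg_cpu, execs in conn.execute(PLAN_COMPARISON_SQL):
            grouped.setdefault(query_id, []).append((plan_id, avg_cpu, execs))
    finally:
        conn.close()
    return grouped


def report_plan_switches(conn_str: str) -> None:
    """Print queries whose plans differ widely in average CPU cost."""
    for query_id, plans in plans_by_query(conn_str).items():
        costs = [avg_cpu for _, avg_cpu, _ in plans]
        if len(plans) > 1 and max(costs) > 2 * min(costs):  # crude regression heuristic
            print(f"query_id={query_id}: {plans}")
```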
How We Fixed It
Our team was alerted by our monitoring systems to increased latency and dropped throughput, engaged immediately, and quickly traced the issue to the database server.
SQL telemetry indicated CPU saturation and high wait times. We identified a core transactional query with a regressed execution plan, flushed its cached plan, and forced the query plan to a known-good baseline.
After that, CPU utilization on the database server began to normalize, application services fully recovered, and response times stabilized.
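For illustration only, the general remediation pattern looks like the hypothetical sketch below, again assuming SQL Server with Query Store; the plan handle, query_id, and plan_id values are placeholders, not the identifiers from this incident. The idea is to evict the regressed plan from the cache and then pin the query to a previously captured, known-good plan.

```python
# Hypothetical remediation sketch (placeholder identifiers, assumed SQL Server):
# evict the regressed plan from the plan cache, then force the query back to
# a previously captured, known-good Query Store plan.
import pyodbc


def evict_and_force_plan(conn_str: str, plan_handle_hex: str,
                         query_id: int, good_plan_id: int) -> None:
    conn = pyodbc.connect(conn_str, autocommit=True)
    try:
        # Remove only the offending plan; DBCC FREEPROCCACHE with no argument
        # would flush the entire cache. DBCC cannot be parameterized, hence
        # the string formatting of the operator-supplied handle.
        conn.execute(f"DBCC FREEPROCCACHE ({plan_handle_hex});")
        # Pin the query to the known-good plan captured by Query Store.
        conn.execute(
            "EXEC sp_query_store_force_plan @query_id = ?, @plan_id = ?;",
            query_id, good_plan_id,
        )
    finally:
        conn.close()


# Usage (all values below are placeholders, not this incident's identifiers):
# evict_and_force_plan(CONN_STR, "0x06000500A1B2C3D4", 4821, 117)
```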
What We’re Doing Next
To prevent this from happening again, we are going to:
- add more alerts for database server CPU spikes
- improve our monitoring systems for execution plan cache anomalies (plan cost deltas >30%; see the sketch after this list)
- revise our scheduled index maintenance and statistics update automation to avoid unexpected plan recompilations.
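The plan-cost-delta check in the second item can be prototyped independently of any specific engine. The sketch below uses invented sample numbers and a hypothetical cost_delta_alerts helper; in practice the inputs would come from plan telemetry collected by the monitoring pipeline.

```python
# Minimal sketch of the ">30% plan cost delta" check described above.
# Baseline and current numbers are invented sample values; in practice
# they would come from the monitoring pipeline's plan telemetry.

PLAN_COST_DELTA_THRESHOLD = 0.30  # 30%, per the remediation item above


def cost_delta_alerts(baseline: dict, current: dict,
                      threshold: float = PLAN_COST_DELTA_THRESHOLD) -> list:
    """Return descriptions of queries whose average plan cost grew past the threshold."""
    alerts = []
    for query, base_cost in baseline.items():
        now_cost = current.get(query)
        if now_cost is None or base_cost <= 0:
            continue
        delta = (now_cost - base_cost) / base_cost
        if delta > threshold:
            alerts.append(f"{query}: cost up {delta:.0%} ({base_cost} -> {now_cost})")
    return alerts


if __name__ == "__main__":
    baseline = {"get_order_status": 12.0, "list_invoices": 40.0}  # sample data
    current = {"get_order_status": 95.0, "list_invoices": 41.5}   # sample data
    for line in cost_delta_alerts(baseline, current):
        print("ALERT:", line)
```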
Current Status
All systems are operating normally. No further problems have been detected since resolution.