Chameleon Incident Report - Feb 10, 2025
Summary
Dashboard pages loaded slowly or failed to load, and end-user Experiences did not show in a timely manner. This was caused by elevated traffic that produced high latency and errors, primarily affecting database writes to the profiles collection in MongoDB. The issue cascaded into other collections and degraded overall system health and performance.
Main Impacts
- Dashboard availability was reduced, leading to errors on page loads or slowness in loading data
- End-user Experiences failed to show, or were delayed, where they depended on updated user attributes (because Chameleon was slow to accept, or not accepting, updates to user profiles via the client side)
Root Causes
- Unexpectedly high database “write” traffic exceeded the capacity of our MongoDB instance, causing operations to queue beyond expected thresholds
- Our utilization of database capacity was higher than best practices recommend on certain metrics, leaving a smaller buffer before we exceeded capacity
- Write locks were held longer than expected, causing significant queuing.
- The disk was overutilized due to concurrent read and write operations, further increasing response times.
- Insufficient RAM allocation to MongoDB exacerbated performance bottlenecks.
Resolution
This is what we did to resolve the incident as fast as possible:
- Upgraded MongoDB instance RAM and disk capacity.
- Monitored database performance post-upgrade to ensure stability.
- Lowered the concurrency of background processing for certain operations that, if left to run in parallel, would cause a flood of contending write operations.
- Identified operations that may not need to be run live (this will be incorporated into “Future Prevention” below)
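As a minimal sketch of the concurrency-lowering step above (the function names, job shape, and worker count are illustrative assumptions, not Chameleon’s actual code), capping the worker pool bounds how many background writes can contend for database locks at once:

```python
from concurrent.futures import ThreadPoolExecutor

def update_profile(profile_id):
    """Hypothetical background job that would write to the profiles
    collection; unbounded parallel runs of jobs like this are the
    contention scenario described in the incident."""
    # ... perform the MongoDB write here ...
    return profile_id

profile_ids = list(range(100))

# max_workers caps concurrent writers; previously-unbounded
# parallelism is what flooded the write queue.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(update_profile, profile_ids))
```

The cap trades some background-job throughput for predictable write latency on the hot collection.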
Future prevention
This is what we are doing, and intend to do, to prevent this type of incident (and similar or related ones) in future:
Improving our capacity to handle a growing volume of database traffic
- Evaluate further MongoDB scaling options to handle peak loads (short-term)
- Improve our auto-scaling rules and capacity management, e.g. conduct additional load testing to anticipate future demands (short-term)
- Improve database monitoring to detect and mitigate similar issues before they escalate (e.g. we’ve already added “total write time” as a new metric); we found that our monitoring did not indicate exactly where the problem lay, which delayed finding a resolution
- Optimize “write” operations to reduce contention and queue buildup, especially those that do not need to run live and can instead run in the background with batching, for example when updating company attribute data
- Explore adding rate limiting to our Segment integration to prevent batch flooding of data
- Accelerate our plans to upgrade our infrastructure to a more modern and state-of-the-art database system (ClickHouse) to supplement MongoDB in certain cases
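To illustrate the batching idea above (a sketch only; the class, batch size, and update shape are assumptions for illustration, not our production code), non-live attribute updates can be accumulated and flushed as one batched write instead of one write per event:

```python
class BatchedWriter:
    """Accumulates updates and flushes them in fixed-size batches,
    reducing the number of individual write operations contending
    for the database (illustrative sketch)."""

    def __init__(self, flush, batch_size=50):
        self.flush = flush            # e.g. a wrapper around a MongoDB bulk write
        self.batch_size = batch_size
        self.pending = []

    def add(self, update):
        self.pending.append(update)
        if len(self.pending) >= self.batch_size:
            self.flush_pending()

    def flush_pending(self):
        """Send any accumulated updates as one batch."""
        if self.pending:
            self.flush(self.pending)
            self.pending = []

# Usage: collect batches in a list instead of hitting a real database.
batches = []
writer = BatchedWriter(batches.append, batch_size=3)
for i in range(7):
    writer.add({"company_id": i})
writer.flush_pending()  # flush the remainder
# batches now holds three batches of sizes 3, 3, and 1
```

With a real database, the `flush` callback would issue a single bulk write (e.g. MongoDB’s `bulk_write`), so seven events cost three write operations rather than seven.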
Improving our processes and communication around issues
Partly due to the infrequent nature of incidents like this, we uncovered some material gaps in our approach to communicating them effectively, and will accordingly undertake significant improvements, including:
- Clearer notices in-product, in our help center, and via email when an incident is occurring
- Proactively suggesting to new customers (and all existing customers) to join the subscription list for incidents and security updates (we already have this available via our dashboard and status page but want to increase visibility of it)
- Adding expected timelines for regular updates to our Status page and for completion of the post-mortem report (this document)
- A proactive email when an incident is resolved (it appears that “resolution” of an incident does not trigger an email update to subscribers from our Status tool, unlike other incident categories), plus an email sending out the post-mortem report
- Formalizing support communication (as we did in this incident), including a “status ticket” in Intercom to which we attach all messages about the incident, and a note in our support bot
- Clearer ownership of tasks (and accountability for these) during an incident response