Chameleon Incident Report - Feb 10, 2025
Summary
Dashboard pages loaded slowly or failed to load, and end-user Experiences did not show in a timely manner. This was caused by elevated traffic that produced high latency and errors, primarily affecting database writes to the profiles collection in MongoDB. The issue cascaded into other collections and degraded overall system health and performance.
Main Impacts
- Dashboard availability was reduced, leading to errors on page loads or slowness in loading data
- End-user Experiences failed to show, or were delayed, where they depended on updated user attributes (because Chameleon was slow to accept, or not accepting, updates to user profiles via the client side)
Root Causes
- Unexpectedly high database “write” traffic exceeded the capacity of our MongoDB instance, causing operations to queue beyond expected thresholds
- Our utilization of database capacity was higher than best practices recommend on certain metrics, leaving a smaller buffer before we exceeded capacity
- Write locks were held longer than expected, causing significant queuing.
- The disk was overutilized due to concurrent read and write operations, further increasing response times.
- Insufficient RAM allocation to MongoDB exacerbated performance bottlenecks.
Resolution
This is what we did to resolve the incident as fast as possible:
- Upgraded MongoDB instance RAM and disk capacity.
- Monitored database performance post-upgrade to ensure stability.
- Lowered the concurrency of background processing for certain operations that, if left to run in parallel, would cause a flood of contending write operations.
- Identified operations that may not need to be run live (this will be incorporated into “Future Prevention” below)
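As a minimal sketch of the concurrency-lowering step above (the function names, job shape, and worker count are illustrative assumptions, not Chameleon’s actual code), capping the worker pool bounds how many background writes can contend for database locks at once:

```python
from concurrent.futures import ThreadPoolExecutor

def update_profile(profile_id):
    """Hypothetical background job that would write to the profiles
    collection; unbounded parallel runs of jobs like this are the
    contention scenario described in the incident."""
    # ... perform the MongoDB write here ...
    return profile_id

profile_ids = list(range(100))

# max_workers caps concurrent writers; previously-unbounded
# parallelism is what flooded the write queue.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(update_profile, profile_ids))
```

The cap trades some background-job throughput for predictable write latency on the hot collection.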
Future prevention
This is what we are doing, and intend to do, to prevent this type of incident (and similar or related ones) in future:
Improving our capacity to handle a growing volume of database traffic
- Evaluate further MongoDB scaling options to handle peak loads (short-term)
- Improve our auto-scaling rules and capacity management, e.g. conduct additional load testing to anticipate future demands (short-term)
- Improve database monitoring to detect and mitigate similar issues before they escalate (e.g. we’ve already added “total write time” as a new metric); we found that our monitoring did not indicate exactly where the problem lay, which delayed finding a resolution
- Optimize “write” operations to reduce contention and queue buildup, especially those that do not need to run live and can instead run in the background with batching, for example when updating company attribute data
- Explore adding rate limiting to our Segment integration to prevent batch flooding of data
- Accelerate our plans to upgrade our infrastructure to a more modern and state-of-the-art database system (ClickHouse) to supplement MongoDB in certain cases
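To illustrate the batching idea above (a sketch only; the class, batch size, and update shape are assumptions for illustration, not our production code), non-live attribute updates can be accumulated and flushed as one batched write instead of one write per event:

```python
class BatchedWriter:
    """Accumulates updates and flushes them in fixed-size batches,
    reducing the number of individual write operations contending
    for the database (illustrative sketch)."""

    def __init__(self, flush, batch_size=50):
        self.flush = flush            # e.g. a wrapper around a MongoDB bulk write
        self.batch_size = batch_size
        self.pending = []

    def add(self, update):
        self.pending.append(update)
        if len(self.pending) >= self.batch_size:
            self.flush_pending()

    def flush_pending(self):
        """Send any accumulated updates as one batch."""
        if self.pending:
            self.flush(self.pending)
            self.pending = []

# Usage: collect batches in a list instead of hitting a real database.
batches = []
writer = BatchedWriter(batches.append, batch_size=3)
for i in range(7):
    writer.add({"company_id": i})
writer.flush_pending()  # flush the remainder
# batches now holds three batches of sizes 3, 3, and 1
```

With a real database, the `flush` callback would issue a single bulk write (e.g. MongoDB’s `bulk_write`), so seven events cost three write operations rather than seven.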
Improving our processes and communication around issues
Partly due to the infrequent nature of incidents like this, we uncovered some material gaps in our approach to communicating them effectively, and will accordingly undertake significant improvements, including:
- Clearer notices in-product, in our help center, and via email when an incident is occurring
- Proactively suggesting to new customers (and all existing customers) to join the subscription list for incidents and security updates (we already have this available via our dashboard and status page but want to increase visibility of it)
- Adding expected timelines for regular updates to our Status page and for completion of the post-mortem report (this document)
- A proactive email when an incident is resolved (it appears that “resolution” of an incident does not trigger an email update to subscribers from our Status tool, unlike other incident categories), plus an email sending out the post-mortem report
- Formalizing support communication (as we did in this incident), including a “status ticket” in Intercom to which we attach all messages about the incident, and a note in our support bot
- Clearer ownership of tasks (and accountability for these) during an incident response