Date: 2022-08-06

Kamala Ramasubramanian

Computer Science PhD Candidate

Join us in-person: Engineering 2, Rm 398

Description: Distributed systems are ubiquitous but remain challenging to understand, build, and troubleshoot. Fundamentally, reasoning about distributed system behaviors is hard because of nondeterminism in system executions. Designers also need to reason about whether a system behaves consistently in the face of failures of the same class; for example, a system should not exhibit divergent behaviors when two different replicas fail. These problems are exacerbated by the dynamic nature and scale of today's production systems. Tooling support has lagged behind the pace at which systems are being deployed, and more research in this space is urgently needed.

Building and maintaining internet-scale systems today involves gathering and harnessing system observability data to address cross-cutting problems. Prior work uses observability infrastructure to aggregate information from many executions, using metrics or logs from the system as inputs, to solve problems such as fault detection, localization, and anomaly detection. Other work compares pairs of executions for interactive debugging, performance diagnostics, and workload and capacity modeling. The former approach either disregards the causality of event interactions within executions or attempts to infer it, producing sub-par results. The latter falls short because it considers only a single pair of executions, whereas many varied execution paths are observed in practice.

Our key insight is that we need to aggregate information from many executions while preserving the causal relationships within individual executions, in order to build models of domain knowledge and reason about systems. To do so, we use provenance graphs and distributed traces as observations of system executions, since they capture the causality of event interactions within executions, and we normalize them so that information can be aggregated across many executions. Over the last several years, distributed traces have seen increased adoption in industry, and the use of provenance is a growing area of research.

In our work, we have developed and evaluated techniques for understanding and improving fault tolerance behavior, troubleshooting systems, and identifying behaviors that have applications in feature development and debugging performance issues. We explore how the problems that can be solved are constrained differently, or change entirely, depending on factors such as the granularity and format of system observations, the timeline of the expected response, how interactive the techniques are expected to be, and the level of detail in the result produced.