Trustworthy Online Controlled Experiments - Metrics

Table of contents

  1. Introduction

  2. Metrics taxonomy

  3. Metrics formulation

  4. Metrics evaluation

  5. Metrics evolvement

Introduction

Randomized online controlled experiments are considered the gold standard for establishing causality, which is the key to connecting a new feature we’d like to launch with its business impact, especially for customer-facing websites and applications. Good metrics are essential for tracking progress toward the final goals and for making data-driven decisions.

Metrics taxonomy

Goal metrics

Goal metrics, also called success metrics or true north metrics, reflect what the organization ultimately cares about. They are usually a single metric or a small set of metrics that best capture the ultimate success of the project. They may be hard to move in the short term, because each initiative may have only a very small impact on the metric or because the impact takes a long time to materialize.

Driver metrics

Driver metrics are also called signpost metrics, surrogate metrics, or indirect or predictive metrics. Driver metrics should reflect a mental causal model of what it takes for the organization to succeed. A useful framework for finding what drives success is HEART:

  • H - happiness

  • E - engagement

  • A - adoption

  • R - retention

  • T - task success

Driver metrics tend to be shorter-term, faster-moving, and more sensitive than the goal metrics above.

Guardrail metrics

Guardrail metrics guard against violated assumptions and come in two types:

  • Metrics that protect the business

  • Metrics that assess the trustworthiness and internal validity of experiment results
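One widely used guardrail of the second type is a sample ratio mismatch (SRM) check, which tests whether the observed split of users between variants matches the designed split. The sketch below is an illustration, not part of the original notes: the function name `srm_p_value`, the 50/50 default split, and the 0.001 alert threshold are all assumptions. It uses a chi-squared goodness-of-fit test with one degree of freedom, whose p-value can be computed from the complementary error function.

```python
import math


def srm_p_value(control_users: int, treatment_users: int,
                expected_control_share: float = 0.5) -> float:
    """P-value of a chi-squared goodness-of-fit test (1 degree of freedom)
    comparing the observed user split with the designed split."""
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical alert threshold; teams choose their own.
SRM_ALERT_THRESHOLD = 0.001

if __name__ == "__main__":
    p = srm_p_value(50_000, 51_500)
    if p < SRM_ALERT_THRESHOLD:
        print(f"Possible sample ratio mismatch (p = {p:.2e}); "
              "do not trust the results yet")
```

A very small p-value suggests that the assignment or logging pipeline is biased, so the other metrics of the experiment should not be interpreted until the cause is found.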

In summary, the goal, driver, and guardrail metrics offer the right amount of granularity and comprehensiveness. There are also other types of metrics:

  • Asset vs. engagement metrics, such as the total number of Facebook users (asset) and user pageviews (engagement)

  • Business vs. operational metrics, such as daily active users (business) and queries per second (operational)

  • Data quality metrics such as data quantiles

  • Diagnosis or debug metrics

Metrics formulation

Key principles when developing goal and driver metrics are:

  1. Ensure goal metrics are:
  • Simple

  • Stable

  2. Ensure driver metrics are:
  • Aligned with the goal

  • Actionable and relevant

  • Sensitive

  • Resistant to gaming

Metrics evaluation

Most metric evaluation and validation happens during the formulation phase, but continuous work is needed over time to determine whether the metrics still satisfy the above principles.

The most challenging part of evaluation is to establish the causal relationship between driver metrics and organizational goal metrics. Some high-level approaches to tackle causal validation include:

  • Use other data sources such as surveys, focus groups, or user experience research if available

  • Analyze observational data: even though it is difficult to validate a causal relationship with observational data, such data can help invalidate hypotheses.

  • Check whether similar validation is done at other companies

  • Conduct an experiment with a primary goal of evaluating metrics

  • Use a corpus of historical experiments as “golden” samples for evaluating new metrics
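The last approach can be sketched in code. The example below is a minimal illustration under stated assumptions, not a prescribed method: each historical experiment is assumed to carry a trusted label (+1 for a known good change, -1 for a known bad one), together with the candidate metric's observed delta and its standard error. The names `PastExperiment` and `evaluate_candidate` and the 1.96 z-score cutoff are hypothetical. It reports how often the candidate metric moved significantly (sensitivity) and how often that movement pointed in the labeled direction (directional agreement).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PastExperiment:
    label: int        # +1 = known positive change, -1 = known negative change
    delta: float      # treatment minus control on the candidate metric
    std_error: float  # standard error of that delta


def evaluate_candidate(experiments: List[PastExperiment],
                       z_threshold: float = 1.96) -> Tuple[float, float]:
    """Return (sensitivity, directional agreement) of a candidate metric
    over a labeled corpus of historical experiments."""
    if not experiments:
        raise ValueError("the corpus of experiments is empty")
    # Experiments in which the candidate metric moved significantly.
    moved = [e for e in experiments
             if abs(e.delta / e.std_error) >= z_threshold]
    # Of those, the ones that moved in the direction of the trusted label.
    agree = [e for e in moved if (e.delta > 0) == (e.label > 0)]
    sensitivity = len(moved) / len(experiments)
    agreement = len(agree) / len(moved) if moved else float("nan")
    return sensitivity, agreement


if __name__ == "__main__":
    corpus = [
        PastExperiment(label=+1, delta=0.5, std_error=0.1),
        PastExperiment(label=+1, delta=0.1, std_error=0.1),
        PastExperiment(label=-1, delta=-0.4, std_error=0.1),
        PastExperiment(label=-1, delta=0.3, std_error=0.1),
    ]
    sensitivity, agreement = evaluate_candidate(corpus)
    print(f"sensitivity={sensitivity:.2f}, agreement={agreement:.2f}")
```

A candidate metric that rarely moves, or that often moves against the trusted labels, is a weak driver metric under this kind of check.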

Metrics evolvement

Metric definitions evolve over time for various reasons, such as:

  • The business evolved

  • The environment evolved

  • Your understanding of the metrics evolved

Regarding the relationship between correlation and causation:

  1. Correlation doesn’t imply causation, and it can fool the metrics determination process.

  2. Relationships observed in historical data cannot be considered structural, or causal. Policy decisions can alter the structure of economic models, and the correlations that held historically will no longer hold.

  3. Finding correlations in historical data does not imply that you can pick a point on a correlational curve by modifying one of the variables and expect the other to change, unless the relationship is proved to be causal.
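The points above can be illustrated with a small simulation. This is a toy sketch with made-up variables, not data from any real product: a hidden confounder (called `intent` here) drives both feature usage and revenue, while usage itself has no effect on revenue. In the observational data the two are clearly correlated (about 0.5 by construction), but once usage is set by randomization the correlation with revenue is close to zero.

```python
import math
import random
import statistics
from typing import List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(n: int = 20_000, seed: int = 0) -> Tuple[float, float]:
    """Return (observational correlation, correlation under randomization)
    between feature usage and revenue in a toy world with a confounder."""
    rng = random.Random(seed)
    intent = [rng.gauss(0, 1) for _ in range(n)]  # hidden confounder

    # Observational world: usage and revenue both depend on intent,
    # but usage has no causal effect on revenue.
    usage_obs = [i + rng.gauss(0, 1) for i in intent]
    revenue_obs = [i + rng.gauss(0, 1) for i in intent]

    # Experimental world: usage is assigned at random, independent of intent.
    usage_rand = [rng.gauss(0, 1) for _ in range(n)]
    revenue_rand = [i + rng.gauss(0, 1) for i in intent]

    return pearson(usage_obs, revenue_obs), pearson(usage_rand, revenue_rand)


if __name__ == "__main__":
    observational, randomized = simulate()
    print(f"observational correlation: {observational:.3f}")
    print(f"correlation under randomization: {randomized:.3f}")
```

Picking usage as a driver metric based on the historical correlation alone would be a mistake in this toy world, because pushing usage up would not move revenue.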

All of the above make picking proper metrics a challenge.