Probability and Statistics (DRAFT)

Probability Foundations

A random variable is a variable with an unknown value, but known possible values. An event is an observed value of a random variable. An event space contains all possible values of the random variable. Probabilities are values assigned to each event in an event space which represent how likely each event is relative to all other events in the event space.

I’ll use \(p(e)\) to represent the probability of an event \(e\). Some special probabilities include the probability of any event in the event space occurring, which is 1, and the probability of any event other than \(e\), also called the complement of \(e\), which I’ll denote with \(\neg e\). \(p(\neg e)\) is defined as;

\[ p(\neg e) = 1 - p(e) \]

That is, the probability of any event other than \(e\) (\(p(\neg e)\)) is what’s left of the probability of any event (\(1\)) after removing the probability of event \(e\) (\(p(e)\)).

Probability Rules

A joint event refers to two or more events occurring together. They are colloquially talked about as one event ‘and’ another event occurring together. More formally, joint events are called intersections (represented with the \(\cap\) symbol) between events. For events \(e_1\) and \(e_2\), the probability of their joint event will be represented with \(p(e_1 \cap e_2)\). The probability of an intersection of events is the same regardless of which event is considered first;

\[ p(e_1 \cap e_2) = p(e_2 \cap e_1) \]

For two events, \(e_1\) and \(e_2\), their intersection is found by; \[ p(e_1 \& e_2) = p(e_1\cap e_2) = p(e_1|e_2)p(e_2) \] where \(p(e_1|e_2)\) is a conditional probability, discussed in the next section.

Conditional events refer to one event, \(e_1\), after another event, \(e_2\), is known. \(p(e_1|e_2)\) represents the probability of \(e_1\) given, or after knowing, \(e_2\). These can be defined by rearranging the multiplication rule as follows; \[ p(e_1|e_2)p(e_2)=p(e_1\cap e_2) \Rightarrow p(e_1|e_2)= \frac{p(e_1\cap e_2)}{p(e_2)} \]

Noting that, by the symmetry of joint events and the definition of the multiplication rule, \(p(e_1 \cap e_2) = p(e_2 \cap e_1)= p(e_2 | e_1) p(e_1)\), and so the above equation becomes; \[ p(e_1|e_2)=\frac{p(e_2 | e_1) p(e_1)}{p(e_2)} \] This equation is known as Bayes’ Theorem or Bayes’ Rule.
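As a quick numeric sanity check of Bayes’ Theorem, here’s a minimal Python sketch; the probabilities 0.30, 0.80, and 0.10 are made up for illustration:

```python
# Numeric sanity check of Bayes' Theorem, using made-up probabilities.
p_e1 = 0.30            # p(e1)
p_e2_given_e1 = 0.80   # p(e2|e1)
p_e2_given_not_e1 = 0.10

# Law of total probability: p(e2) = p(e2|e1)p(e1) + p(e2|~e1)p(~e1).
p_e2 = p_e2_given_e1 * p_e1 + p_e2_given_not_e1 * (1 - p_e1)

# Bayes' Theorem: p(e1|e2) = p(e2|e1)p(e1) / p(e2).
p_e1_given_e2 = p_e2_given_e1 * p_e1 / p_e2
print(f"p(e2) = {p_e2:.3f}, p(e1|e2) = {p_e1_given_e2:.3f}")  # 0.310, 0.774
```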

A union (represented with the \(\cup\) symbol) of events refers to at least one of multiple events occurring. For events \(e_1\) and \(e_2\), their union includes the probability that \(e_1\) occurs, the probability that \(e_2\) occurs, and the probability that both \(e_1\) and \(e_2\) occur. Colloquially this is talked about as the probability of \(e_1\) ‘or’ \(e_2\). Formally this is expressed and computed as; \[ p(e_1\text{ or } e_2) = p(e_1 \cup e_2) = p(e_1) + p(e_2) - p(e_1 \cap e_2) \]

Where \(p(e_1 \cap e_2)\) is a joint probability.

Independence is a common assumption in many statistical techniques. Statisticians assume independence primarily because it simplifies the computation of certain probabilities. Events are said to be independent if knowing one event does not change the probability of the other event. Formally, if events \(e_1\) and \(e_2\) are independent, this would mean that; \[ \begin{aligned} p(e_1|e_2) &= p(e_1) \\ p(e_2|e_1) &= p(e_2) \end{aligned} \]

If \(e_1\) and \(e_2\) are independent, then their joint probability simplifies as follows; \[ p(e_1 \cap e_2) = p(e_1|e_2)p(e_2) = p(e_1)p(e_2) \] Where the last equality is only true if \(e_1\) and \(e_2\) are independent (i.e. \(p(e_1|e_2)=p(e_1)\)).

Then the probability of their union simplifies to; \[ \begin{aligned} p(e_1 \cup e_2) &= p(e_1) + p(e_2) - p(e_1|e_2)p(e_2) \\ &= p(e_1) + p(e_2) - p(e_1)p(e_2) \end{aligned} \] Where again, the last part of this equality is only true if \(e_1\) and \(e_2\) are independent.

Events \(e_1\) and \(e_2\) are said to be mutually exclusive if the occurrence of either event precludes the occurrence of the other event. Formally mutual exclusivity means that; \[ p(e_1 | e_2) = 0 \text{ and } p(e_2 | e_1) = 0 \]

If two events are mutually exclusive, then the probability of their joint event is; \[ \begin{aligned} p(e_1 \cap e_2) &= p(e_1|e_2)p(e_2) \\ &= 0*p(e_2) \\ &= 0 \end{aligned} \]

And the probability of their union simplifies to; \[ \begin{aligned} p(e_1 \cup e_2) &= p(e_1) + p(e_2) - p(e_1 \cap e_2) \\ &= p(e_1) + p(e_2) - 0 \\ &=p(e_1) + p(e_2) \end{aligned} \]
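These rules are easy to verify by simulation. Here’s a minimal sketch with a fair six-sided die, where the two events are my arbitrary choices (they are neither independent nor mutually exclusive, so every term matters):

```python
import numpy as np

# Monte Carlo check of the joint, union, and conditional rules above.
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # fair six-sided die
e1 = rolls % 2 == 0                       # event: roll is even
e2 = rolls > 3                            # event: roll is greater than 3

p_joint = np.mean(e1 & e2)   # p(e1 ∩ e2), exactly 2/6
p_union = np.mean(e1 | e2)   # p(e1 ∪ e2), exactly 4/6
p_cond = np.mean(e1[e2])     # p(e1|e2) = p(e1 ∩ e2)/p(e2), exactly 2/3
print(p_joint, p_union, p_cond)
print(np.mean(e1) + np.mean(e2) - p_joint)  # union rule: matches p_union
```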

Probability Functions

Here I’ll describe the properties and use cases for probability functions and cumulative probability functions. Conventionally, \(f(x)\) is used to represent a probability function, and \(F(x)\) to represent a cumulative probability function. \(f(x)\) is commonly called a probability mass function (pmf) if \(x\) is discrete, and a probability density function (pdf) if \(x\) is continuous. \(f(x)\) and \(F(x)\) are related through differentiation and integration:

\[ \begin{aligned} f(x) &= \frac{d}{dx}F(x) \\ F(x) &= \int_{-\infty}^{x} f(t) d_{t} \end{aligned} \]

For discrete variables, the probability function represents the probability that \(X\) takes a specific value \(x\): \(f(x) = p(X=x)\). For continuous variables, the probability that the variable \(X\) takes any particular value is technically \(0\); \(p(X=x)=0\). However, probabilities are still related to the probability function \(f(x)\) through areas. For an arbitrarily small \(\epsilon\), the probability that \(X\) takes a value between \(x-\epsilon\) and \(x+\epsilon\) is the area of the distribution over that range;

\[ \begin{aligned} p( x - \epsilon< X < x+\epsilon) &= F(x+\epsilon) - F(x-\epsilon) \\ &= \int_{-\infty}^{x+\epsilon}f(t)d_{t} - \int_{-\infty}^{x-\epsilon}f(t)d_{t} \end{aligned} \]
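A minimal numeric check of these relationships, using a standard normal distribution as the arbitrary example:

```python
import numpy as np
from scipy import stats

# The pdf is the derivative of the CDF: a finite difference of F
# should approximate f.
x, eps = 0.5, 1e-4
deriv = (stats.norm.cdf(x + eps) - stats.norm.cdf(x - eps)) / (2 * eps)
print(deriv, stats.norm.pdf(x))  # nearly identical

# The probability of a small interval is an area under the pdf,
# approximately height times width: 2*eps*f(x).
prob = stats.norm.cdf(x + eps) - stats.norm.cdf(x - eps)
print(prob, 2 * eps * stats.norm.pdf(x))
```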

Transformations of Random Variables

Given \(X\sim F_X(x)\) and \(Y=g(X)\), where \(g\) is monotonic;

\[ \begin{aligned} F_Y(y) = \begin{cases} F_X(g^{-1}(y)), & \text{ if g is increasing wrt x}\\ 1-F_X(g^{-1}(y)), & \text{ if g is decreasing wrt x} \end{cases} \end{aligned} \]

By the chain rule, the pdf is;

\[ \begin{aligned} f_Y(y)=\frac{d}{dy}F_Y(y) = f_X(g^{-1}(y))\left| \frac{d}{dy} g^{-1}(y) \right| \end{aligned} \]

A special case of this is the probability integral transformation;

\[ \begin{aligned} X&\sim F_X(x) \\ Y&=F_X(X) \\ P(Y\leq y) &= P(F_X(X)\leq y) \\ &= P(F^{-1}_X(F_X(X))\leq F_X^{-1}(y)) \\ &= P(X \leq F_X^{-1}(y)) \\ &= F_X(F_X^{-1}(y)) \\ &= y \end{aligned} \]

Which shows that \(Y=F_X(X)\) follows a Uniform(0, 1) distribution – the relationship between a uniform random variable and the CDF of any other random variable.
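A minimal sketch of the probability integral transform, assuming an exponential distribution as the arbitrary example:

```python
import numpy as np
from scipy import stats

# Pushing draws from a continuous distribution through its own CDF
# should yield Uniform(0, 1) draws.
rng = np.random.default_rng(1)
draws = rng.exponential(scale=2.0, size=10_000)
u = stats.expon(scale=2.0).cdf(draws)
print(stats.kstest(u, "uniform"))  # large p-value: consistent with uniform

# The reverse direction is inverse transform sampling: F^{-1}(U) ~ F_X.
x = stats.expon(scale=2.0).ppf(rng.uniform(size=10_000))
print(x.mean())  # ≈ 2.0, the exponential's mean
```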

Expectation

For any function of a random variable, \(g(x)\), the expectation of that function is;

\[ E_{x}(g(x))=\int_{-\infty}^{\infty}g(x)f(x)d_{x} \]

For discrete random variables, \(E(x)=\sum_xxp(x)\). For continuous random variables, \(f(x)dx\) is conceptually \(p(x)\), since it is the area under an infinitesimally small segment of the probability function, with \(f(x)\) representing the height and \(dx\) an infinitesimally small width.

Expectation is a linear operator, so for constants \(a\), \(b\), and \(c\) and random variables \(x\) and \(y\); \(E(ax+by+c) = aE(x)+bE(y)+c\).

Expectation for joint random variables

\[ E_{x,y}(g(x,y))=\int_{s=x}\int_{t=y}g(s,t)f(s,t)d_{s}d_{t} \]

Marginalizing a distribution from a joint probability function (\(f(x,y)\)) or a conditional probability function (\(f(x|y)\)): \[ \begin{aligned} f_x(x)&=\int_yf_{x,y}(x,y)d_{y}\\ &=\int_y f_{x|y}(x|y)f_y(y)d_{y}\\ &=E_y(f(x|y)) \end{aligned} \]

Law of total expectation, and rough proof

\[ \begin{aligned} E_{x}(x)&=E_{y}(E_{x|y}(x|y))\\ &=\int_y\int_{x}xf_{x|y}(x|y)d_{x}f(y)d_{y} \\ &=\int_{x}x\int_{y}f_{x|y}(x|y)f(y)d_{y}d_{x} \\ &=\int_{x}xf(x)d_{x}\\ &=E_{x}(x) \end{aligned} \]

The last steps reflect that the internal integral is just marginalizing across y; \(f_x(x)=\int_y f_{x|y}(x|y)f_y(y)d_{y}\).
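A Monte Carlo check of the law of total expectation, assuming a simple made-up hierarchy where \(y\) is exponential with mean 1 and \(x|y\) is normal with mean \(y\):

```python
import numpy as np

# E(x) should equal E_y(E(x|y)); here E(x|y) = y, so both equal E(y) = 1.
rng = np.random.default_rng(2)
y = rng.exponential(scale=1.0, size=200_000)
x = rng.normal(loc=y, scale=1.0)

print(x.mean())  # direct estimate of E(x)
print(y.mean())  # E_y(E(x|y)) = E_y(y); both ≈ 1
```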

Variance

Covariance and variance

\[ \begin{aligned} Cov(X,Y)&=E_{x,y}((X-E_{x}(X))(Y-E_{y}(Y)))\\ Var(X)&=Cov(X,X)=E_{x}((X-E_{x}(X))^2) \\ &= E(X^2)-E(X)^2 \end{aligned} \]

The Bias-Variance tradeoff decomposes the theoretical variance obtained when estimating a function \(g(x)\) with a model \(\hat{g}(x)\) into three components that can be used to motivate changes to the model. Specifically, the Bias-Variance tradeoff is a decomposition of the expected squared prediction error, \(E((g(x)-\hat{g}(x))^2)\). To see its relation to variance, first rearrange the variance equation above:

\[ \begin{aligned} Var(X)= E(X^2)-E(X)^2 \\ \Rightarrow E(X^2) = Var(X)+E(X)^2 \\ \end{aligned} \]

Then define the random variable \(X\) to be the errors of the model;

\[ \begin{aligned} E((g(x)-\hat{g}(x))^2) &= Var(g(x)-\hat{g}(x))+E(g(x)-\hat{g}(x))^2 \\ &= (g(x)-E(\hat{g}(x)))^2 + E((E(\hat{g}(x))-\hat{g}(x))^2) + Var(g(x))\\ &= Bias(\hat{g}(x))^2 + Var(\hat{g}(x)) + Var(g(x)) \end{aligned} \]

Here bias represents the squared error between the true function \(g(x)\) and the expected model \(E(\hat{g}(x))\). Variance represents the squared error between the expected model and the obtained model. And \(Var(g(x))\) represents the true noise or irreducible error in the process \(g(x)\).
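A simulation sketch of this decomposition, assuming \(g(x)=\sin(x)\), a noise sd of 0.3, and polynomial models of two degrees (all arbitrary choices). The low-degree fit should show more bias and the high-degree fit more variance:

```python
import numpy as np

# Decompose expected squared error at a test point into
# bias^2 + variance + irreducible noise.
rng = np.random.default_rng(3)
g = np.sin
sigma = 0.3                      # sd of irreducible noise; Var(g(x)) = sigma^2
x_train = np.linspace(0, 3, 20)
x0 = 1.5                         # test point

for degree in (1, 5):
    preds = []
    for _ in range(2_000):       # many hypothetical training sets
        y_train = g(x_train) + rng.normal(0, sigma, x_train.size)
        coef = np.polyfit(x_train, y_train, degree)
        preds.append(np.polyval(coef, x0))
    preds = np.array(preds)
    bias2 = (g(x0) - preds.mean()) ** 2   # squared bias
    var = preds.var()                     # variance of the model
    # Expected squared error against noisy test observations should
    # match bias^2 + var + sigma^2.
    y0 = g(x0) + rng.normal(0, sigma, preds.size)
    print(f"degree {degree}: bias^2={bias2:.4f}, var={var:.4f}, "
          f"decomposed={bias2 + var + sigma**2:.4f}, "
          f"direct={np.mean((y0 - preds) ** 2):.4f}")
```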

Variance of linear combinations of random variables:

\[ Var(aX+bY+c)=a^2Var(X)+b^2Var(Y)+2abCov(X,Y) \]

Variance is also crucial for designing experiments. For example, if \(a\) and \(b\) have different signs and \(Cov(X,Y)>0\), then the variance of the combination is reduced because the covariance is subtracted off. This is a justification for within-participant designs, where \(X\) represents the first measurement, \(Y\) represents a future measurement, and \(a=-1\) and \(b=1\) to reflect a difference comparing the second and first measurements (\(Y-X\)). Since the measurements come from the same person, it is reasonable to assume they are correlated (\(Cov(X,Y)>0\)). In this case, \(Var(Y-X)=Var(X)+Var(Y)-2Cov(X,Y)\), which is smaller than the variance that would be obtained if the measurements were from independent samples.
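A minimal simulation of this argument, assuming correlated measurements with \(\rho=0.6\) and unit variances (arbitrary choices):

```python
import numpy as np

# Var(Y - X) = Var(X) + Var(Y) - 2 Cov(X, Y): positive correlation
# shrinks the variance of a within-participant difference.
rng = np.random.default_rng(4)
n, rho = 100_000, 0.6
x, y = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T

print(np.var(y - x))                   # paired: ≈ 1 + 1 - 2*0.6 = 0.8
print(np.var(y - rng.permutation(x)))  # independent samples: ≈ 2.0
```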

Law of total variance: The variance of a variable \(X\) can be decomposed into the variation that remains when \(Y\) is known, or irreducible error (\(E(Var(X|Y))\)), and the variation that carries through from uncertainty in \(Y\) (\(Var(E(X|Y))\)). \[ \begin{aligned} Var(X)&=E(Var(X|Y))+Var(E(X|Y)) \end{aligned} \]

This arises in hierarchical models and prediction intervals. For example, the variance of a new observation \(x_{n+1}\) after estimating the mean \(\hat{\mu}_n\) from \(n\) samples;

\[ \begin{aligned} x &\sim N(\mu,\sigma)\\ E(x) &=\mu\\ Var(x) &=\sigma^2\\ \hat{\mu}_n &\sim N\left(\mu,\frac{\sigma}{\sqrt{n}}\right)\\ E(\hat{\mu}_n) &=\mu\\ Var(\hat{\mu}_n)&=\frac{\sigma^2}{n}\\ E(x_{n+1}) &= E(E(x_{n+1}|\hat{\mu}_n))= E(\hat{\mu}_n)=\mu \\ Var(x_{n+1}) &=E(Var(x_{n+1}|\hat{\mu}_n)) + Var(E(x_{n+1}|\hat{\mu}_n))\\ &=\sigma^2+\frac{\sigma^2}{n}\\ (100-\alpha)\% \text{ PI}: \ & E(x_{n+1}|\hat{\mu}_n) \pm T_\alpha \sqrt{Var(x_{n+1})}\\ \end{aligned} \]
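A minimal sketch of this prediction interval on simulated data, estimating \(\sigma\) from the sample and using a t critical value; the \(\mu=10\), \(\sigma=2\), \(n=25\) values are made up:

```python
import numpy as np
from scipy import stats

# Total predictive variance combines irreducible variance (sigma^2)
# and estimation variance (sigma^2 / n).
rng = np.random.default_rng(5)
mu, sigma, n, alpha = 10.0, 2.0, 25, 0.05
sample = rng.normal(mu, sigma, n)

mu_hat = sample.mean()
s = sample.std(ddof=1)
se_new = s * np.sqrt(1 + 1 / n)               # sqrt of estimated Var(x_{n+1})
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
print(mu_hat - t_crit * se_new, mu_hat + t_crit * se_new)  # 95% PI
```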

Information Theory

Information or Surprise is a measure of how unexpected an event was: \[ I(x)=-log(f(x))=log\left(\frac{1}{f(x)} \right) \]

Entropy is the expected surprise – higher entropy relates to more uncertainty: \[ H(X)=E(I(x)) = E(-log(f(x))) = \int_X f(x)I(x)dx = -\int_X f(x)log(f(x))d_x \]

K-L Divergence, or relative entropy, is the expected difference in log probability between two distributions, taken with respect to one of the distributions: \[ \begin{aligned} D_{KL}(f_{x}(x);f_{y}(x)) &= \int_{X} f_{x}(x)log\left(\frac{f_{x}(x)}{f_{y}(x)} \right)d_x\\ &= \int_X f_{x}(x)(log(f_{x}(x))-log(f_{y}(x)))d_x \\ \end{aligned} \]

Divergence is an important concept in machine learning because it can be a loss function when estimating a distribution \(f(x)\) with a model \(\hat{f}(x)\). From a Bayesian perspective, the divergence between a prior and posterior measures the information gained by observing the data. Bayesian experimental design focuses on collecting data that maximizes the divergence between the posterior and prior, i.e. the most informative data.

Mutual Information is the divergence from the joint distribution \(f_{x,y}(x,y)\) to the joint under an assumption of independence, \(f_x(x)f_y(y)\) – i.e. a measure of non-independence; \[ I(x;y)=\int_{y} \int_{x} f_{x,y}(x,y)log\left(\frac{f(x,y)}{f(x)f(y)} \right)d_yd_x \]
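Discrete versions of these quantities are easy to compute directly; a minimal sketch with made-up distributions:

```python
import numpy as np
from scipy import stats

f_x = np.array([0.5, 0.25, 0.25])  # made-up pmf
f_y = np.array([0.4, 0.4, 0.2])    # made-up pmf

print(stats.entropy(f_x))        # H(X) = -sum f_x log f_x, in nats
print(stats.entropy(f_x, f_y))   # D_KL(f_x; f_y)

# Mutual information: divergence of a joint pmf from the product of
# its marginals.
joint = np.array([[0.3, 0.1],
                  [0.1, 0.5]])
indep = np.outer(joint.sum(axis=1), joint.sum(axis=0))
print(np.sum(joint * np.log(joint / indep)))  # > 0: x and y are dependent
```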

Causal Graphs

  • Expressions of dependence between random variables
  • Add some info from Judea Pearl’s papers relevant to experimentation and inference?

Causal Inference

The ultimate goal of most statistical work is to discover the causes of some outcome so that the outcome can be controlled. Hence, it is important to understand when causality can be inferred.

From this perspective, there is a treatment assignment \(T\) and an outcome of interest that depends on the treatment, \(Y(T)\). Ideally, we would be able to compute the expected treatment effect for any individual, \(E(Y_i(T=1)-Y_i(T=0))\), but it is impossible to assign an individual to both variants of the treatment under the exact same conditions – e.g. one must be done at a later time than the other. Causal inference approaches specify the assumptions, models, and designs needed to make causal statements.

Most methods like t-tests and ANOVAs compare the observed outcomes in the treatment group to the observed outcomes in the control group. These methods are based on correlations but can be interpreted causally if additional assumptions are made.

Framework

The general setting I’ll refer to is a causal graph where individuals \(i\) are assigned to a particular condition indicated with \(Z_i\), experience a condition indicated by \(W_i\), and have an observed outcome \(Y_i(Z_i,W_i)\) for the assigned and experienced conditions.

The causal graph is;

\[ Z_i \rightarrow W_i \rightarrow Y_i(Z_i,W_i) \]

Where the assignment and experience of a treatment or control condition are represented with;

\[ \begin{aligned} Z_i &= \begin{cases} 1, & \text{Assigned to treatment} \\ 0 & \text{Assigned to control} \end{cases} \\ W_i &= \begin{cases} 1, & \text{Experienced treatment} \\ 0 & \text{Experienced control} \end{cases} \end{aligned} \]

And there are potential outcomes \(Y\) under any combination of assigned and experienced conditions;

\[ \begin{aligned} Y_i(Z_i,W_i) &= \begin{cases} Y_i(0,0), & \text{Outcome when assigned C and experienced C}\\ Y_i(0,1), & \text{Outcome when assigned C and experienced T} \\ Y_i(1,0), & \text{Outcome when assigned T and experienced C} \\ Y_i(1,1) & \text{Outcome when assigned T and experienced T} \end{cases} \end{aligned} \]

In most lab studies this setup simplifies because \(Z_i=W_i\) since there is no possibility that a lab participant does not experience the condition they are assigned to. However, in online experimentation users can often opt out of a treatment condition, and users assigned to a control condition could end up getting access to features of the treatment condition. Methods to address these complications are discussed below.

Key Assumptions

  • Exclusion: \(Y(0,W_i)=Y(1,W_i)=Y(W_i)\), or probabilistically \(p(Y_i(Z_i,W_i)|Z_i,W_i)=p(Y_i(Z_i,W_i)|W_i)\). The treatment assignment \(Z_i\) influences the outcome only through the experienced condition \(W_i\), i.e. conditional on the experienced condition, the observed outcome \(Y_i\) is independent of the assigned condition.

To simplify notation, I’ll assume exclusion and refer to observed outcomes as \(Y_i(W_i)\).

  • Stable units (SUTVA):
  • units experience only one version of the treatment
  • one unit’s assignment doesn’t influence another unit’s outcome

  • Endogeneity refers to a relationship between covariates and the error terms, e.g. through unobserved confounds.
  • Back door exclusion, or no endogeneity, i.e. no direct \(Z\rightarrow Y\) link.

  • Unconfoundedness: \(Y_i(0),Y_i(1) \perp Z_i | X_i\). Conditional on observed covariates (\(X_i\)), assignments (\(Z_i\)) and potential outcomes (\(Y(W_i)\)) are independent.

If unconfoundedness isn’t satisfied, i.e. the distribution of \(X_i\) differs between levels of \(W_i\), then the overlap assumption can be made to allow for the use of propensity scores.

  • Overlap: \(0< p(W_i|X_i) < 1\), \(\forall X_i\): For all levels of the covariates \(X_i\), the probability of experiencing treatment is nonzero, and the probability of experiencing control is nonzero.

This assumption allows the use of propensity scores \(\hat{p}(W_i|X_i)\), which are estimates of \(p(W_i|X_i)\) that are used to match units with similar propensity, or to reweight observed outcomes to account for propensity.

Common Methods

Some summarized in this Medium Post

Diff in Diff

Contexts:

  • A treatment group and a comparable control group can be observed before and after the treatment is applied.

Assumptions and requirements:

  • Requires observed data on a control group pre and post treatment
  • Parallel trends: Assumes the trend in the control group is an appropriate proxy for the trend the treatment group would have followed, had the treatment not occurred. This way the mean difference can be accounted for without modeling the trend other than through the control group.

Model:

\[ \begin{aligned} I_{T}&= \begin{cases} 0, & \text{Experienced Control} \\ 1, & \text{Experienced Treatment} \end{cases}\\ I_{Post}&= \begin{cases} 0, &\text{Before Time of Treatment}\\ 1, &\text{After Time of Treatment} \end{cases}\\ Y &= \beta_0 + \beta_1 I_{T} + \beta_2 I_{Post} + \beta_3 I_{T} I_{Post} + \epsilon \end{aligned} \]

Here \(\beta_3\) captures the treatment effect. \(\beta_0\) captures the mean of the control group before the treatment was applied, \(\beta_1\) captures the difference between the treatment and control group means before the treatment was applied, and \(\beta_2\) captures the change in the control group mean after the treatment was applied, due to confounds that are assumed to influence the control and treatment groups equally.
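A minimal diff-in-diff sketch on simulated data, where I set the true treatment effect (\(\beta_3\)) to 2.0; all parameter values here are made up:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate units split by treatment group and by pre/post period.
rng = np.random.default_rng(6)
n = 2_000
treated = rng.integers(0, 2, n)   # I_T
post = rng.integers(0, 2, n)      # I_Post
y = (5.0                          # beta_0: control mean, pre-treatment
     + 1.0 * treated              # beta_1: pre-treatment group difference
     + 0.5 * post                 # beta_2: shared time trend
     + 2.0 * treated * post       # beta_3: the treatment effect
     + rng.normal(0, 1, n))
df = pd.DataFrame({"y": y, "treated": treated, "post": post})

fit = smf.ols("y ~ treated * post", data=df).fit()
print(fit.params)  # the treated:post coefficient should be ≈ 2.0
```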

Synthetic Control and Causal Impact

Synthetic control approaches build a model of the counterfactual. Causal impact is a specific method for building a synthetic control, and was developed by Google – docs here.

Context:

  • Want to estimate the counterfactual for an observed timeseries, pre and post treatment.

Assumptions and requirements:

  • Requires other time series related to the target time series
  • Assumes the other time series are not influenced by the treatment; generally good candidates include Google Trends time series, the weather, other countries or markets where no action was taken, unemployment indices, stock prices, … .

Model:

  • In the pre-treatment period, train a model to estimate the target time series as a function of the other time series. The model used in the Causal Impact package is a Bayesian structural time series (BSTS) model; the method is described in this video.
  • BSTS uses a spike-and-slab prior for feature selection.
  • In the post-treatment period, use the model to estimate a counterfactual time series (i.e. a synthetic control) – an estimate of what the target would have looked like had the treatment not been assigned.
  • Independent Python implementation here.

Benefits:

  • Provides pointwise estimates of the causal effect over time, as well as a cumulative estimate (summing up the pointwise estimates)
  • Bayesian approach, provides credible intervals

Synthetic Controls

Context:

Assumptions and requirements:

  • Requires multiple related time series
  • Assumes a linear combination of the related time series is a good proxy for the counterfactual

Model:

  • Regress target time series on related time series
  • Use estimates from this model as the counterfactual

When to use vs causal impact:

  • Prefer this approach when the other time series are conceptually comparable to the target – e.g. observed outcomes in untreated market segments subject to the same plausible confounds, such as different nearby counties in a state. A minimal regression-based sketch follows this list.
  • Causal impact might be better if the time series are thought to be components of the target – stock price, unemployment, page impressions, Google Trends, … .
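Here is that sketch on simulated data, where the donor series, their weights, and the treatment effect of 3.0 starting at \(t=70\) are all made up:

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.arange(100)
donors = np.column_stack([          # related series, unaffected by treatment
    10 + 0.1 * t + rng.normal(0, 1, 100),
    5 + np.sin(t / 5) + rng.normal(0, 1, 100),
])
target = 2 + 0.5 * donors[:, 0] + 1.5 * donors[:, 1] + rng.normal(0, 0.5, 100)
target[70:] += 3.0                  # treatment begins at t = 70

# Fit the linear combination on the pre-period only, then project the
# counterfactual into the post-period.
X = np.column_stack([np.ones(100), donors])
w = np.linalg.lstsq(X[:70], target[:70], rcond=None)[0]
counterfactual = X @ w
print((target[70:] - counterfactual[70:]).mean())  # estimated effect ≈ 3.0
```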

IV Analysis

More from Ben Lambert, good explanation here, part II. Case study using quarter of birth as the instrument \(Z\) for years of education [here](https://www.youtube.com/watch?v=pI9YGSJ2qPk), relation to 2SLS.

Mostly from the Gelman chapter “Causal inference with more complicated observational designs”, from Gina.

Context:

  • Goal is to find the causal effect of \(x\rightarrow y\).
  • Issue: \(Cov(x,\epsilon)\neq 0\), a feature is correlated with unobserved confounds, so \(\hat{\beta}_{OLS}\) is biased and inconsistent.

Assumptions and Requirements:

  • An instrument \(z\) is measured so it can be included in the model.
  • \(Cov(z,x)\neq 0\), the instrumental variable is related to \(x\).
  • \(Cov(z,\epsilon)=0\), the instrument is not correlated with unobserved confounds.

Model, 2 stage least squares:

Done to address bias in \(\hat{\beta}_{OLS}\)

\[ \begin{aligned} x&=\gamma z + u \\ y&=\beta_{IV} \hat{x} + \epsilon \\ \beta_{IV} &= (Z^TX)^{-1}Z^TY = \frac{Cov(Z,Y)}{Cov(Z,X)} \end{aligned} \]

Where \(\hat{x}\) is the fitted value from the first-stage regression of \(x\) on \(z\).
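A minimal sketch of the IV estimator on simulated data, where a made-up confound \(u\) enters both \(x\) and \(y\), so OLS is biased while \(\beta_{IV}\) recovers the true \(\beta=1\):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100_000
z = rng.normal(size=n)                # instrument: moves x, not the error
u = rng.normal(size=n)                # unobserved confound
x = 0.8 * z + u + rng.normal(size=n)
y = 1.0 * x + u + rng.normal(size=n)  # true beta = 1

beta_ols = np.cov(x, y)[0, 1] / np.var(x)          # biased upward (≈ 1.38)
beta_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]  # Cov(Z,Y)/Cov(Z,X) ≈ 1.0
print(beta_ols, beta_iv)
```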

Limitation:

  • \(\beta_{IV}\) can still be biased, but is at least consistent (\(\beta_{IV}\overset{p}{\rightarrow} \beta\)).

This relates to the intent-to-treat (ITT) estimate, where the effect of assignment to the treatment group (\(Z\)) is considered, but the effect of actually being treated (\(X\)) may not be known or estimated.

Regression Discontinuity

Context:

Some systematic mechanism forces a discontinuity in what would be expected to be a regression line.

  • E.g. 1: Yelp rounds star ratings, so that businesses close in underlying rating (e.g. 4.24 and 4.26 stars) are assigned different displayed ratings (4 and 4.5 stars).
  • E.g. 2: Schools might filter on exam scores, so that students who are very similar in underlying ability or test score (1 point below the threshold vs at the threshold) might be assigned to different universities.

Fixed Effects Regression

Context:

  • Unobserved heterogeneity \(\alpha_i\) that is related to the features, \(Cov(\alpha_i,x_{i,t})\neq 0\)
  • Issue: this implies that \(\hat{\beta}\) isn’t consistent for \(\beta\), i.e. the sampling distribution of \(\hat{\beta}\) doesn’t converge to \(\beta\).

Assumptions and Requirements:

  • \(Cov(x_{i,t},u_{i,t})=0\), i.e. weak exogeneity
  • No perfect collinearity among the \(x\) variables.
  • These assumptions imply the fixed effects estimates (\(\beta_{FE}\)) are consistent.

Model, from here:

The procedure is to average over time to get \(\bar{y}_i\), and subtract this off, taking advantage of \(\alpha_i=\bar{\alpha}_i\), where the latter is the average across time.

\[ \begin{aligned} \alpha_i &: \text{unobserved heterogeneity} \\ Cov(\alpha_i,x_{i,t}) &\neq 0 \\ u_{i,t} &: \text{error} \\ y_{i,t}&=\beta x_{i,t} + \alpha_i + u_{i,t}\\ \bar{y}_{i}&=\beta \bar{x}_{i} + \bar{\alpha}_i + \bar{u}_{i} \\ y_{i,t}-\bar{y}_{i}&=\beta_{FE} (x_{i,t}-\bar{x}_{i}) + (\alpha_i - \bar{\alpha}_i)+ u_{i,t} - \bar{u}_{i} \\ &= \beta_{FE} (x_{i,t}-\bar{x}_{i}) + u_{i,t} - \bar{u}_{i} \\ \end{aligned} \]
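A minimal sketch of this within transformation on simulated panel data, with a made-up true \(\beta=2\) and unit effects \(\alpha_i\) built to correlate with \(x\):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
n_units, n_periods = 500, 10
alpha = rng.normal(0, 2, n_units)               # unobserved heterogeneity
unit = np.repeat(np.arange(n_units), n_periods)
x = alpha[unit] + rng.normal(size=unit.size)    # x correlated with alpha
y = 2.0 * x + alpha[unit] + rng.normal(size=unit.size)
df = pd.DataFrame({"unit": unit, "x": x, "y": y})

pooled = np.polyfit(df["x"], df["y"], 1)[0]     # biased by alpha (≈ 2.8)

# Demean within each unit: alpha_i - alpha_bar_i drops out.
means = df.groupby("unit")[["x", "y"]].transform("mean")
fe = np.polyfit(df["x"] - means["x"], df["y"] - means["y"], 1)[0]
print(pooled, fe)                               # fe ≈ 2.0
```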

Limitations:

  • removes anything that is constant over time, meaning effects of any time-constant variables can’t be estimated

First differences

Context:

  • similar to fixed effects, but less asymptotically efficient if the errors are serially uncorrelated; see this vid

Random Effects

description in this video.

Context:

Assumptions and requirements:

Model:

Propensity Score Matching

Context:

Treatment assignment is non-random and might be related to the outcome, so units are matched on their propensity to be assigned to the treatment.

Microsoft DoWhy

Efficient Inference

More efficient studies require fewer resources (time, data) to reach a conclusion.

  • Methods to reduce variance
  • Assessing the variance of estimators
  • Efficient models

Efficient Sampling

  • Assigning at a point proximal to the experimental manipulation (see blogs in pinterest deep dive)

Efficient Designs

  • Independent samples
  • matching
  • repeated measures

Efficient Models

  • Precision variables
  • Estimands
  • Wald test vs LRT vs Bayes vs Score test

Experiment Design

Foundations

  • Hypotheses
  • Treatment and control conditions
  • assumptions (including CI assumptions, SUTVA, …?)

Concerns

  • network effects: on the front end these could come from different interactions; on the back end they could come from changes in algorithm behavior for units in C based on different behavior from units in T – e.g. if a new recommender for units in T influences them to watch more of x, then x may also get recommended more to those in C.
  • learning effects (change aversion, novelty seeking)
  • early adopters
  • Non-compliance: Those assigned to treatment may not actually experience the treatment.
  • Crossover: Those assigned to control might gain access to the treatment.
  • Treatment heterogeneity: Some user segments might respond differently than others.

Treatment Units

What is the unit at which we assign treatments?

Ideally independent individuals, but online it is hard to know who is visiting a webpage, and it is crucial to keep experiences comparable across devices. Also, in networks individuals are not independent, and it is important to keep the user experience consistent across connected individuals.

Proxies for individuals include;

  • User ID or account – most clearly tied to a user, but only available when users are signed in.
  • cookies – device and browser specific, so these could differ across a user’s browsers or devices.
  • IP address (a request’s return address) – network specific, so a user’s experience might differ across networks (e.g. home vs. work), and multiple users behind the same IP can look like one unit.

In networks, loosely connected clusters of individuals can be used as experimental units (see the Unofficial Google Data Science blog post on this).

  • a solution to SUTVA violations, but decreases sample size and power dramatically
  • cluster-based, stratified, serial, balanced, …
  • propensity scores and matching

Generative Model and Adjustment Variables

  • Specify the hypothesized generative process
  • Confounds
  • Precision
  • Nuisance

Common Designs

A/B testing

A/A testing to estimate variation

Variance reduction designs (paired designs, matching, …)

Bandits for limited data or maximizing an objective

Assignment Mechanism

  • maps samples \((x_i, y_i)\) into treatment or control conditions.
  • randomized control trials ideal, not always possible
  • other options, …?
  • test assignment validity, could use propensity scores or maybe chi square.

Duration

  • learning effects: initial exploration or novelty seeking
  • learning effects: initial change aversion
  • power analysis for sample size
  • time to run the experiment to mitigate these learning effects

Integrity Checks

  • check that those assigned to treatment received it
  • check that stratifications were implemented correctly
  • Check demographics across buckets; if randomization worked then demographics should be approximately equally represented in the buckets (see the sketch below).
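A minimal sketch of two such checks with made-up counts: a sample-ratio-mismatch test on bucket sizes, and a chi-square test of demographic balance across buckets:

```python
import numpy as np
from scipy import stats

# Sample ratio mismatch: do bucket counts match the planned 50/50 split?
counts = np.array([50_550, 49_450])  # hypothetical users per bucket
print(stats.chisquare(counts))       # small p-value would flag a problem

# Demographic balance: hypothetical segment-by-bucket contingency table.
table = np.array([[20_100, 19_900],
                  [30_450, 29_550]])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p)                       # small p-value suggests imbalance
```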

Maximum Likelihood versus Bayes’ Estimators

likelihood is one term in Bayes’ Theorem

GLMs

Again about the variance – this time it takes a form other than normal.

likelihood, score, Fisher information, robust variance, …?

Common Tests

t-test

  • different variance formulations.
  • one-sample
  • two independent samples
  • two dependent samples

ANOVA

Linear Model

Z test for proportions

Binary data

Chi-Square

Logistic Regression

Likelihood Ratio and Bayes Factors

always valid p-values (see the Optimizely white paper)

Different Estimands For CI

  • Average treatment effect
  • Treatment on treated

more on variance

Empirical Variance Estimation

bootstrapping

A/A testing

Mixed Effects Models

  • Maime’s consulting project? Other use cases?

Efficiency or Variance Reduction

  • look up variance reduction in A/B tests; a common method is to incorporate precision variables like demographic data or user data such as device or browser.

e.g. with a two sample t-test, \(Y_t - Y_{t-1} = \beta_0\), versus a model \(Y_t = \beta_0 + \beta_1 Y_{t-1}\)
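A minimal sketch of this idea, regressing the post-period metric on a treatment indicator with and without the pre-period metric as a precision variable; all parameter values are made up:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Compare the precision of a treatment-effect estimate with and without
# a pre-period covariate (an ANCOVA-style adjustment).
rng = np.random.default_rng(10)
n = 5_000
pre = rng.normal(10, 3, n)            # pre-period metric, Y_{t-1}
treat = rng.integers(0, 2, n)
post = 0.8 * pre + 1.0 * treat + rng.normal(0, 1, n)  # true effect = 1
df = pd.DataFrame({"post": post, "pre": pre, "treat": treat})

raw = smf.ols("post ~ treat", data=df).fit()
adj = smf.ols("post ~ treat + pre", data=df).fit()
print(raw.bse["treat"], adj.bse["treat"])  # adjusted SE is much smaller
```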

  • get table for different tests of means from soc sci 10 notes.

Additional topics to cover:

  • GLMs
  • Stats: descriptive statistics, R-squared, chi-squared
  • Probability: distributions, CLT, sampling distributions, p-values
  • Stats: K-S tests, Q-Q plots, hypothesis testing, experimentation
  • Probability: Bayes, bootstrap
  • Probability: maximum likelihood estimation; time series analysis (ARIMA models), Granger causality
  • Dynamic experimentation (multi-armed bandits)
  • Unit testing; EDA visualization (seaborn, Plotly, Bokeh)

