Basic models in Bayesian Statistics

Bayesian Estimators

  • MAP: The maximum a posteriori estimator is the posterior mode of $\theta|y$, i.e., $\hat\theta_{MAP}=\arg\max_{\theta}p(\theta |y)$.
  • The posterior mean is $\hat\theta =\mathbb E[\theta |y]$.
  • The posterior median is $\hat\theta_{Med}=\mathcal Q_{0.5}(\theta |y)$.

Credible interval

  • Equal-tailed (ET) Interval: $(\theta_L,\theta_R)$ is called the $100\% \times (1-\alpha)$ equal-tailed credible interval if $\mathbb P(\theta<\theta_L|y)=\mathbb P(\theta >\theta_{R}|y)= \frac{\alpha}{2}$.
  • Highest posterior density (HPD) region: $s(y)$ is a $1-\alpha$ HPD if
    • $\mathbb P(\theta\in s(y)|y)=1-\alpha$
    • If $\theta_{a}\in s(y)$ and $\theta_{b}\not\in s(y)$, then $p(\theta_{a}|y)>p(\theta_{b}|y)$.
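As a sketch, both intervals can be approximated from posterior draws. The Beta(3, 9) posterior below is a hypothetical example, and the HPD interval is found by the shortest-window search over sorted samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior: theta | y ~ Beta(3, 9), approximated by draws.
samples = np.sort(rng.beta(3, 9, size=100_000))
alpha = 0.05

# Equal-tailed interval: cut alpha/2 posterior probability from each tail.
et = (np.quantile(samples, alpha / 2), np.quantile(samples, 1 - alpha / 2))

# HPD interval: the shortest window containing (1 - alpha) of the draws.
m = int(np.ceil((1 - alpha) * len(samples)))          # draws inside the window
widths = samples[m - 1:] - samples[:len(samples) - m + 1]
j = int(np.argmin(widths))
hpd = (samples[j], samples[j + m - 1])

print(f"ET : ({et[0]:.3f}, {et[1]:.3f})")
print(f"HPD: ({hpd[0]:.3f}, {hpd[1]:.3f})")
```

For a skewed posterior like this one, the HPD interval is shorter than the ET interval and shifted toward the mode, which matches the defining property that every point inside has higher density than every point outside.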

Conjugate Prior

If $\mathcal F$ is a class of sampling distributions $p(y|\theta)$, and $\mathcal P$ is a class of prior distributions for $\theta$, then $\mathcal P$ is conjugate for $\mathcal F$ if
$$
p(\theta |y)\in\mathcal P \text{ for all }p(y|\theta )\in\mathcal F\text{ and }p(\theta )\in\mathcal P
$$

Binomial Model

A Bayesian model of the process of repeated Bernoulli experiments. The conjugate prior for the binomial model is the Beta distribution.

Beta distribution

Suppose $\theta\sim Beta(\alpha, \beta)$. The PDF of $\theta$ is
$$p(\theta)= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^ {\alpha-1}(1-\theta)^{\beta-1}$$
for $0\le\theta\le 1$.

  • $\mathbb E[\theta ]= \frac{\alpha}{\alpha+\beta}$
  • $Var[\theta ]= \frac{\alpha\beta}{(\alpha+\beta+1)(\alpha+\beta)^{2}}$
  • $Mode[\theta]= \frac{\alpha-1}{\alpha+\beta-2}$ (for $\alpha,\beta>1$)

Posterior

Suppose a Bernoulli experiment with parameter $\theta$ generates i.i.d. samples $y_1,…,y_n$ with $\sum y_i=k$; then $p(y_1,…,y_n|\theta)=\theta^k(1-\theta)^{n-k}$. The posterior distribution is then
$$
p(\theta|y_1,…,y_{n})\propto\theta^k(1-\theta)^{n-k}p(\theta)
$$
If the prior is $\theta\sim Beta(\alpha,\beta )$, the posterior is again a Beta distribution, i.e.,
$$
\theta|y_1,…,y_{n} \sim Beta(\alpha +k, \beta +n-k)
$$
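Because the update only adds counts to the prior parameters, it is a one-liner. The prior Beta(2, 2) and the data ($k=7$ successes in $n=10$ trials) below are illustrative choices:

```python
# Conjugate Beta-Binomial update: prior Beta(alpha, beta), k successes in n trials.
def beta_binomial_posterior(alpha, beta, k, n):
    """Return the parameters of the Beta posterior: (alpha + k, beta + n - k)."""
    return alpha + k, beta + n - k

a, b = beta_binomial_posterior(2, 2, k=7, n=10)   # hypothetical prior Beta(2, 2)
post_mean = a / (a + b)                            # alpha / (alpha + beta)
post_mode = (a - 1) / (a + b - 2)                  # MAP estimate, valid for a, b > 1

print(a, b, post_mean, post_mode)
```

Here the posterior is Beta(9, 5), so the posterior mean $9/14\approx 0.643$ sits between the prior mean $0.5$ and the sample proportion $0.7$, as expected of a conjugate update.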

Poisson Model

Poisson distribution

If $Y\sim Poi(\theta)$, then
$$\mathbb P(Y=y|\theta)= \frac{\theta^ye^{-\theta}}{y!}$$
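The pmf above translates directly into code ($\theta=3$ is just an example value):

```python
import math

# Poisson pmf: P(Y = y | theta) = theta^y * e^(-theta) / y!
def poisson_pmf(y, theta):
    return theta**y * math.exp(-theta) / math.factorial(y)

p = poisson_pmf(2, 3.0)   # probability of observing y = 2 when theta = 3
print(round(p, 4))
```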

A basic concept in Bayesian Statistics

Exchangeable

Let $p(y_1,…,y_n)$ be the joint density of $Y_1,…,Y_n$. If $p(y_1,…,y_n)=p(y_{\pi_1},…,y_{\pi_n})$ for all permutations $\pi$ of $\{1,…,n\}$, then $Y_1,…,Y_n$ are exchangeable.

de Finetti’s Theorem

The model can be written as
$$
P(y_1,…,y_{n})=\int \left\{\prod_{i=1}^{n}p(y_i|\theta) \right\}p(\theta)d\theta
$$
for some parameter $\theta$ if $Y_i$ are exchangeable.

  • $p(\theta)$ represents our beliefs about $\lim_{n\rightarrow\infty}\sum Y_i/n$ in the binary case.
  • $p(\theta)$ represents our beliefs about $\lim_{n\rightarrow\infty}\sum \mathbb 1(Y_{i}\le c)/n$ for each $c$ in the general case.

Bagging is a kind of Ensemble Learning designed to improve stability and accuracy: it reduces the variance of the prediction while retaining the bias.
Given a standard training set $D$, bagging generates $m$ new training sets $D_i$ by sampling from $D$ uniformly with replacement. Sampling with replacement means each bootstrap sample is drawn from $D$ in the same way, independently of the other resamples (given $D$). Then $m$ models are fitted on these $m$ bootstrap samples and combined by averaging the outputs (for regression) or voting (for classification).
After bagging, $m$ weak learners are combined into a strong learner.
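As a minimal sketch of the procedure (the dataset, the stump base learner, and $m=25$ are illustrative assumptions, not from the text), bagging for regression might look like:

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_stump(x, y):
    """Fit a depth-1 regression tree (single threshold split) by brute force."""
    best_sse, best = np.inf, None
    for t in np.unique(x):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = left.var() * len(left) + right.var() * len(right)
        if sse < best_sse:
            best_sse, best = sse, (t, left.mean(), right.mean())
    t, lo, hi = best
    return lambda q: np.where(q <= t, lo, hi)

def bagging(x, y, m=25):
    """Fit m stumps on bootstrap resamples of (x, y); predict by averaging."""
    models = []
    for _ in range(m):
        idx = rng.integers(0, len(x), size=len(x))   # sample with replacement
        models.append(fit_stump(x[idx], y[idx]))
    return lambda q: np.mean([f(q) for f in models], axis=0)

# Toy regression data: noisy sine wave.
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 200)
predict = bagging(x, y)
print(predict(np.array([0.25, 0.75])))   # averaged ensemble prediction
```

Averaging over resampled stumps smooths the single hard threshold of one stump into a less variable predictor, which is exactly the variance-reduction effect described above.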

Suppose we have a training set $D=\{x_1,…,x_n\}$ with a real value $y_i$ associated with each point $x_i$. We assume there is a function $f$ such that $y=f(x)+\varepsilon$, where the noise $\varepsilon\sim N(0,\sigma^2)$.
We seek a function $\hat f(x;D)$ that approximates the true function $f(x)$. A goal of machine learning is to minimize the expected error. Writing $\bar f(x)=\mathbb E_D[\hat f(x;D)]$ for the average prediction over training sets,
$$
\begin{align}
\mathbb{E}_{x,D,\varepsilon}[(y-\hat f(x;D))^{2}] &= \mathbb{E}_{x,D,\varepsilon}[(y- f(x)+ f(x)-\hat f(x;D))^{2}]\\
&=\mathbb{E}_{x,D,\varepsilon}[(y-f(x))^2]+\mathbb{E}_{x,D}[(f(x)-\hat f(x;D))^{2}]\\
&=\mathbb{E}_{x,D,\varepsilon}[(y-f(x))^2]+\mathbb{E}_{x,D}[(f(x)- \bar f(x)+\bar f(x)-\hat f(x;D))^{2}]\\
&=\mathbb{E}_{x,D,\varepsilon}[(y-f(x))^2]+\mathbb{E}_{x}[(f(x)- \bar f(x))^2]+\mathbb{E}_{x,D}[(\bar f(x)-\hat f(x;D))^{2}]\\
&=\text{noise} + \text{bias}^2 + \text{variance}
\end{align}
$$
The cross terms vanish because $\varepsilon$ has mean zero and is independent of $D$, and because $\mathbb E_D[\hat f(x;D)]=\bar f(x)$.
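The decomposition can be checked numerically. In this Monte Carlo sketch, the true function, the learner (least squares through the origin), and all sample sizes are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# True function f(x) = 2x with Gaussian noise; the learner is least squares
# through the origin, refitted on a fresh dataset D each round.
f = lambda x: 2 * x
sigma, n, x0, rounds = 0.5, 20, 1.0, 20_000

preds = np.empty(rounds)
for r in range(rounds):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    beta = (x @ y) / (x @ x)        # hat f(x; D) = beta * x
    preds[r] = beta * x0            # prediction at the fixed test point x0

y0 = f(x0) + rng.normal(0, sigma, rounds)    # fresh noisy test targets
mse = np.mean((y0 - preds) ** 2)             # expected squared error, estimated

noise = sigma ** 2
bias2 = (f(x0) - preds.mean()) ** 2
variance = preds.var()
print(mse, noise + bias2 + variance)         # the two should nearly agree
```

Since this estimator is unbiased, the bias term is essentially zero and the error splits between the irreducible noise $\sigma^2$ and the variance of the fitted slope.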

Boosting is a kind of Ensemble Learning that iteratively and greedily adds weak learners to the ensemble.
A boosting model can be a weighted sum of weak learners:
$$
H_T(x)=\sum^{T}_{t=1}\alpha_th_t(x)
$$
Data points that cause more error receive larger weights when fitting the next weak learner. There are several ways to determine the weights of the data.

Gradient boosting

Let $l$ denote a loss function and write
$$
\mathcal L(H) = \frac{1}{N} \sum_{i=1}^{N}l(H(x_i),y_{i})
$$
Gradient boosting is similar to gradient descent: instead of adding the negative gradient to the parameters, it adds functions to the ensemble.
$$
\begin{align}
h_{i+1} &= {\arg\min}_{h,\alpha} \mathcal L(H_{i}+\alpha \cdot h)\\
&\approx {\arg\min}_{h,\alpha} \mathcal{L}(H_{i})+\alpha\nabla\mathcal{L}(H_{i})\cdot h\\
&={\arg\min}_{h,\alpha} \alpha\nabla\mathcal{L}(H_{i})\cdot h
\end{align}
$$
Sometimes we don’t optimize $\alpha$, so $h$ can be obtained by
$$
h={\arg\min}_h\sum_{i=1}^{n}\frac{\partial\mathcal{L}}{\partial H(x_{i})} h(x_i)
$$
The term $\frac{\partial\mathcal{L}}{\partial H(x_{i})}$ indicates the importance of the observed data point $x_i$. For gradient boosting, a common loss function is the squared loss $l(y,H(x))=(y-H(x))^2$. With this loss, data that contribute more error need linearly more attention when fitting the new learner $h$, since the gradient $-2(y_i-H(x_i))$ is linear in the residual.
If we let $d_{i}= -\frac{\partial\mathcal{L}}{\partial H(x_{i})}$, the vector $\boldsymbol{d}$ can be viewed as a descent direction. Fitting $h$ is the process of finding the vector $(h(x_1),…,h(x_n))$ closest to this direction. In other words, $h$ learns to predict $\boldsymbol d$.
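With the squared loss above, the negative gradient is proportional to the residual $y_i - H(x_i)$, so each round simply fits the weak learner to the current residuals. A minimal sketch follows; the stump learner, the toy dataset, and the hyperparameters ($T=50$, learning rate $0.1$) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_stump(x, y):
    """Least-squares regression stump: the weak learner h."""
    best_sse, best = np.inf, None
    for t in np.unique(x):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = left.var() * len(left) + right.var() * len(right)
        if sse < best_sse:
            best_sse, best = sse, (t, left.mean(), right.mean())
    t, lo, hi = best
    return lambda q: np.where(q <= t, lo, hi)

def gradient_boost(x, y, T=50, lr=0.1):
    """Each round fits a stump to the residuals y - H(x), i.e. the negative
    gradient of the squared loss, then takes a small step of size lr."""
    H = np.zeros_like(y)
    ensemble = []
    for _ in range(T):
        h = fit_stump(x, y - H)       # h learns to predict the direction d
        H = H + lr * h(x)
        ensemble.append(h)
    return lambda q: lr * sum(h(q) for h in ensemble)

x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 200)
predict = gradient_boost(x, y)
print(np.mean((y - predict(x)) ** 2))   # training MSE after boosting
```

Here the fixed learning rate plays the role of a constant $\alpha$ that is not optimized, matching the simplified update above.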

AdaBoost (Adaptive Boosting)

AdaBoost uses the exponential loss
$$
\mathcal L(H)=\sum_{i=1}^{N}e^{-y_iH(x_i)}
$$
and learns $\alpha$ adaptively. The gradient is then $d_i=\frac{\partial\mathcal L}{\partial H(x_i)}=-y_ie^{-y_iH(x_i)}$ (note that here $d_i$ denotes the gradient itself, not its negative). For convenience, let $w_i=\frac{e^{-y_iH(x_i)}}{\mathcal L(H)}$, which is the relative contribution to the overall loss. A point misclassified by $H$ gets a larger weight.
Consider a binary classification problem, i.e., $y_i\in\{-1,1\}$; we have
$$
\begin{align}
h&=\arg \min_h \sum_{i=1}^{N}d_ih(x_i)\\
&=\arg\min_h-\sum_{i=1}^{N}w_iy_ih(x_i)\\
&=\arg\min_h \left(\sum_{y_i\neq h(x_i)}w_i -\sum_{y_i=h(x_i)}w_i\right)\\
&=\arg\min_h \sum_{y_i\neq h(x_i)}w_i
\end{align}
$$
The last equality holds because $\sum w_i=1$.
Now, given $h$, we find $\alpha=\arg\min_\alpha \mathcal{L}(H+\alpha h)=\arg\min_\alpha\sum e^{-y_i(H(x_i)+\alpha h(x_i))}$.
Taking the derivative and setting it to zero, we get
$$
\alpha = \frac{1}{2}\ln \frac{1-\epsilon}{\epsilon}
$$
where $\epsilon=\sum_{y_i\neq h(x_i)}w_i$ is the weighted error. An intuitive interpretation: if even the best $h$ has weighted error above one half ($\epsilon>\frac{1}{2}$), then $\alpha<0$ and the ensemble effectively subtracts $h$ from itself.
AdaBoost is mainly designed for binary classification and is usually used to boost the performance of decision trees (often stumps).
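The whole loop above can be sketched end-to-end. Everything below (the weighted stump learner, the toy interval-labeled dataset, $T=40$ rounds) is an illustrative assumption rather than from the text:

```python
import numpy as np

rng = np.random.default_rng(3)

def weighted_stump(x, y, w):
    """Best decision stump under sample weights w (labels in {-1, +1})."""
    best_err, best = np.inf, None
    for t in np.unique(x):
        for sign in (+1, -1):
            pred = np.where(x <= t, sign, -sign)
            err = w[pred != y].sum()        # weighted error epsilon
            if err < best_err:
                best_err, best = err, (t, sign)
    t, sign = best
    return best_err, lambda q: np.where(q <= t, sign, -sign)

def adaboost(x, y, T=40):
    w = np.full(len(y), 1 / len(y))         # uniform initial weights
    terms = []
    for _ in range(T):
        eps, h = weighted_stump(x, y, w)
        alpha = 0.5 * np.log((1 - eps) / eps)   # the formula derived above
        w = w * np.exp(-alpha * y * h(x))       # up-weight misclassified points
        w = w / w.sum()                         # keep sum(w) = 1
        terms.append((alpha, h))
    return lambda q: np.sign(sum(a * h(q) for a, h in terms))

# Toy 1-D data: positive in the middle, negative at the ends --
# not separable by any single stump, but learnable by boosted stumps.
x = rng.uniform(0, 1, 200)
y = np.where((x > 0.3) & (x < 0.7), 1, -1)
predict = adaboost(x, y)
print((predict(x) == y).mean())   # training accuracy of the ensemble
```

Note how the weight update $w_i \propto e^{-y_i H(x_i)}$ is maintained incrementally: each round multiplies by $e^{-\alpha y_i h(x_i)}$ and renormalizes.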

Summary

Boosting reduces bias and a little variance. However, boosting too much will eventually increase variance (overfitting).

  • Weak learner: a learner that performs relatively poorly, though with accuracy above chance.
  • Strong learner: a learner that achieves arbitrarily good performance, much better than random guessing.
    Ensemble learning combines weak learners into a strong learner, i.e., a learner with lower expected error. The decomposition of the expected error is introduced in Bias-variance Trade-off.
    There are several types of ensemble learning.
    There are several types of ensemble learning.
  • Bagging (Bootstrap aggregating)
  • Boosting
  • Stacking
  • Mixture of Experts (MoE)