## My Mean Square Error Confusions

Mean square error is defined as $E[(\hat{\theta}-\theta)^2]$. To be defined, it requires that $\theta$ is fixed, constant (albeit unknown). In other words, it’s a very frequentist thing, but it’s such a dominant way of evaluating the performance of estimators that it has to be taken very seriously. My confusion arises from the notation; I thought I’d set my thoughts out here to see if it helped with the confusion.

For a continuous random variable $X$ assumed to follow some distribution $F(x)$ the expectation $E[X]$ can be defined as $E[X] = \int_{-\infty}^{\infty} x f(x) dx$ where $f(x)$ is the probability density function of $X$. Here’s where the fun starts.

1. For compact notation, although $X$ is a function itself, and more formally would be denoted $X(\omega)$ where $\omega$ denotes the subset of some sample space. But no-one seems to get confused if we simplify the notation and just use $X$. So far, so good.
2. In a similar way (albeit using the simpler $X$ notation for the random variable) expectation should perhaps more carefully be written as $E_X[X] = \int_{\infty}^{\infty} x f(x) dx$, telling us the expectation is taken over $X$. Also, so far so good.
3. Now for conditional expectation. I now have two random variables, $X$ and $Y$ and I don’t at this stage care about the relationship between them. I do however wish to find the expectation of $Y$ for some unspecified value of the variable $X$. I denote this $Y$ conditional on $X$ as $Y|X$. So my expectation (in short form) will be denoted $E[Y|X]$, because it is meant to be clear what we mean. What I do is find

$E[Y|X] = \int_{-\infty}^{\infty} y f(y|x) dy$.

One thing to note here is that as $X$ is random variable, so this expectation denotes a function of a random variable, $g(x) = \int_{-\infty}^{\infty} f(y|x) dy$ for some value of $x$. In other words, the function defines a random variable. But it is still an expectation of $Y$. Conversely, if $X$ is fixed, this is not a random variable. Nevertheless, it still requires an integration over $Y$. So the same notation can define either a random variable or some unknown value depending on whether $X$ is a random variable or not. So the type of function depends on $X$, but is still an expectation of $Y$. But both definitions are an integration over $Y$. I therefore think the most formal notation here should be $E_{Y|X}[Y|X] = \int_{-\infty}^{\infty} y f(y|x) dy$ to denote we are taking an expectation over $Y$, it’s the value of $f(y|x)$ that describes the conditioning here.

4. It’s not relevant here (yet), but if I now take the expectation of this expectation, i.e. $E[E[Y|X]$ I am going to get a single number, in fact it collapses to $E[Y]$.
5. However, what’s troubling me is the notation of mean square error, defined above as $E[(\hat{\theta}-\theta)^2]$. I have seen this written (by authors who know their stuff) as $E_{\theta}[(\hat{\theta}-\theta)^2]$. This has been confusing me. I think it would be better to write $E_{\hat{\theta}|\theta}[(\hat{\theta}-\theta)^2]$. What we are trying to find is:

$\int_{-\infty}^{\infty} (\hat{\theta}-\theta)^2 f(\hat{\theta}|\theta) d \hat{\theta}$

because $\hat{\theta}$ is the random variable; it has a pdf $f(\hat{\theta})$ (sometimes called a sampling distribution) and here a conditional pdf $f(\hat{\theta}|\theta)$. We are also interested in evaluating a function of that random variable given by $g(\hat{\theta}) = (\hat{\theta}-\theta)^2$. But this $g(\cdot)$ is not a random variable because we are integrating over $\hat{\theta}$ and conditioning on $\theta$.

So I’m still none the wiser as to the $E_{\theta}$ notation for mean square error $\ldots$.

## Mark Glickman’s Music Page

I’d really like to go to “Valencia” sometime, if nothing else just for the music.   There are some great song parodies here:  Mark Glickman’s Music Page.   Seems Bayesians have all the fun here.

## Multilevel Ordinal Logistic Regression in pymc

I think I’m finally getting closer to having the code to fit a proportional odds logistic regression model in pymc.   There are still a few little places where the code could be tidied up (I’ve yet to figure out how to avoid loops when working with the linear predictors in terms of $\zeta_k - \eta_i$ but maybe I just need to stop worrying about duplicating entries in an array).   I also suspect I’m being over-cautious in explicitly sorting the $\zeta$s; and the slicing logic still frightens me.    But so far the results ***look*** plausibly close to the maximum likelihood estimates when tried on a few datasets.

As blogged earlier, I’m thinking of model that has an ordinal response $y_{ik} \sim (\pi_{ik})$ for individuals $i=1, \ldots, n$  for categories $k = 1, \ldots, K$.    (One of my favourite errors is when I forget to set the y values to be 0,1,K. for use in python).    There are constraints such that $\displaystyle{\sum_{k=1}^K \pi_{k}=1}$ for all $i$.

For ordinal logistic regression we relate $\pi_{ik}$ to a linear predictor $\boldsymbol{\beta X}$.   This is done in two stages.   First we can write $\eta_i = \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$.    There’s no intercept; what makes all this tick is that we have a set of cutpoints $\zeta_k$ where $-\infty < \zeta_{k} < \infty$ for cutpoints $1\mbox{ to }K$ ( need to tidy the notation here as we only need $K-1$ cutpoints).   For ordinal logistic regression we take $\mbox{logit} \big( Prob[Y \leq k |\boldsymbol{X}] \big) = \zeta_k -\eta_i$.   This is then a cumulative link model.

When fitting the model we have to estimate both the values of the “fixed effects” $\boldsymbol{\beta }$ and these cutpoints $\zeta_{k}$.   All ran OK below in jags, now I just want it to work in pymc.

I’ve taken some data from Kevin Quinn which has a seven point scale, and (too many) predictor variables.   This is processed into a file clinton.csv referred to below.



brand = branddata[:,0]
X = branddata[:,1:7]

N = brand.size
K = 7 ## y.max

from pymc import Normal, Lambda, stochastic, deterministic, observed, Categorical, invlogit, LinearCombination
import numpy as np

## Here are the priors

#### First, try to control the initial values somewhat

initbeta = np.ones(6)/200
beta = Normal('beta', mu=1.0, tau=0.001, value=initbeta)

#### which gives rise to the Linear predictor (does it need to be lambda functions?)
eta = Lambda('eta', lambda beta=beta, X=X:np.dot(X, beta))

## Ditto, set initial values for zeta

initzeta = [-4,-3,-2,-1,0,1]
zeta = Normal("zeta", mu=0, tau=0.001, value = initzeta)

#### Follow jags trick of simulating K-1 cutpoints and sort them
#### The $0.02 question is whether I can wrap this with the instantiation of zeta ## (or whether I need it at all) @deterministic def zetasort(zeta=zeta): zsort =np.array(zeta, copy = True) zsort.sort return(zsort) ## cumulative log odds ## some legacy code in here when I was sorting zeta, but then couldn't figure ## how to monitor it, commented out in case I can simplify later @deterministic def cumlogit(eta=eta, zetasort=zetasort, brand=brand, K=K,N=N): cl=np.zeros(shape=(N,K-1)) ##np.array(zeta).sort for index in range(N): for jndex in range(K-1): ##cl[index,jndex] = zeta[jndex]-eta[index] cl[index,jndex] = zetasort[jndex]-eta[index] return(invlogit(cl)) ## next I need to decumulate ## I don't like here are the index counters, might have them wrong (set from zero!) @deterministic def p(cumlogit=cumlogit, K=K, N=N): pee=np.zeros(shape=(N,K)) pee[:,0] = cumlogit[:,0] pee[:,1:K-1] = cumlogit[:,1:K-1] - cumlogit[:,0:K-2]## there's a nasty slicing rule here pee[:,K-1] = 1 - cumlogit[:,K-2] return(pee) ##np.sum(p.value, axis=1) ## just there to remind myself how to check y = Categorical('y', p=p, value=brand, observed=True)  It’s the next bit where I get rather impressed by pymc. Driving the model building process is something here I’m really learning to appreciate. import pymc import clinton mymodel = pymc.Model(clinton) mygraph = pymc.graph.graph(mymodel) mygraph.write_png("graph.png") myMAP = pymc.MAP(clinton) myMAP.fit() myMAP.AIC myMAP.zetasort.value myMAP.beta.value M = pymc.MCMC(clinton) M.isample(10000,5000,thin=5,verbose=1)  I do like being able to get the graphs out (and interrogate them). Now, all that remains is to reconcile the results with those you get in R. There does seem to be some vagueness in the relationship between some of model terms and the response. Posted in McMC, pymc | Tagged , | Leave a comment ## Estimating the chances of something that hasn’t happened yet — The Endeavour There’s a Bayesian version of the rule of three given here. Nice. Posted in Uncategorized | Leave a comment ## Rejection Sampler Here’s some code to animate a rejection sampler  x <- runif(1000) u <- runif(1000) hx <- hist(x, plot = FALSE) M <- 3 post <- x post[u >= dbeta(x, 2.7, 5.3)/(M*dunif(x))] <- NA hp <- hist(post, plot = FALSE) require(rpanel) redraw <- function(panel){ rp.tkrreplot(panel, tkrp) panel } draw <- function(panel){ r <- as.numeric(panel$Replicates)
par(oma =c(0,0,0,0))
#par(mfrow = c(2,2))
layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE))

curve(dbeta(x, 2.7, 5.3), col = "red", ylab = "f(x) or g(x)", main = "Density Ratio", bty = "n", lwd = 2)
curve(dunif(x, 0, 1), add = TRUE, col = "green", lwd = 2, lty = 1)## was lty 3
arrows(x[r], 0, x[r],  dbeta(x[r], 2.7, 5.3), col = "red", lwd = 4, length = 0.01)
arrows(x[r], 0, x[r],  dunif(x[r]), col = "green", lwd = 2, lty = 1, length = 0.01)   ## was lty 3
##t1 <- bquote(Ratio == .(round(dbeta(x[r], 2.7, 5.3)/(M*dunif(x[r])),2)))
##t1 <-  bquote( paste(alpha== .(panel$A), ' ', beta==.(panel$B)))
t1 <- bquote(paste(U == .(round(u[r],2)), '>', Ratio == .(round(dbeta(x[r], 2.7, 5.3)/(M*dunif(x[r])),2))) )
plotcol <- "red"
if (u[r] < dbeta(x[r], 2.7, 5.3)/(M*dunif(x[r]))){
t1 <- bquote(paste(U == .(round(u[r],2)), '<', Ratio == .(round(dbeta(x[r], 2.7, 5.3)/(M*dunif(x[r])),2))) )
plotcol <- "green"
}
text(0.7, 1.9, t1, col = plotcol)
##t2 <- bquote(U == .(u[r]))
##text(0.8, 1.1, t2)
##decision <- "Reject"
##if (u[r] < dbeta(x[r], 2.7, 5.3)/(M*dunif(x[r]))) decision <- "Accept"
##plotcol <- "red"
##if (u[r] < dbeta(x[r], 2.7, 5.3)/(M*dunif(x[r]))) plotcol <- "green"
##text(0.8, 0.7, decision)
points(x[r], 0, pch = 16, col = plotcol)

hist(x[1:r], breaks = hx$breaks, ylim = c(0, max(hx$counts)), main = "Proposal")
points(x[r], 0, pch = 16, col = "green")

hist(post[1:r], xlim = c(0, 1), breaks = hx$breaks, ylim = c(0, max(hp$counts)), main = "Accepted sample")
points(x[r], 0, pch = 16, col = plotcol)
#acplot(1,1)

panel

}

rejectplot <- rp.control(title = "Rejection Sampler", Replicates = 1)

rp.tkrplot(rejectplot, tkrp, draw)
rp.slider(rejectplot, Replicates, action = redraw, from = 1, to = 1000, resolution = 1)



## [1201.2590] It is Time to Stop Teaching Frequentism to Non-statisticians

Nicely argued!   Little to disagree with here.

## Batch converting pdf to jpg

Doesn’t unix make you feel inadequate sometimes.   I’ve been fiddling with bash
scripts to convert pdfs into jpgs.   Many of the pdfs have been generated by using the burst command from pdft.

Having spent a few happy hours fiddling with the scripts (and nearly getting there), google locates a much better script.   It seems to work.   All I need to do now is figure out how!


for i in *.pdf; do echo $i convert -quality 100$i echo \$i | sed -e 's/\.pdf/-%d\.jpg/g'
done