## Motorcycle fatalities in the GB (excluding Wales)

(Sorry Wales, I just haven’t figured a way of sorting out the population histories given the local government reorganizations that went with Devolution)

These two figures show Lexis diagrams for (a) the number of reported motorcycle fatalities in England and Scotland between 1985 and the end of 2014 and (b) the per capita fatality rate.   The x axis denotes the year and the y axis denotes the age of the casualty.   The shade of red denotes the number of fatalities (left) or the fatality rate (right).

It seems to be that motorcycle deaths were high amongst young riders in the mid-1980s.   Compulsory Basic Training was made compulsory in both jurisdictions on 1st December 1990.   Fatalities have fallen amongst young adults since then, and it is possible/likely that this is due to the CBT stopping people using a motorbike at all as well as any safety benefits of only allowing youngsters on the road with this minimal training package.

Another strong feature of the data is that deaths seem to have fallen noticeably in the 2010s.   It’s maybe striking that there have been drops in the numbers of registered motorcycles coinciding with the recession which started just before then.

The most interesting thing about these Lexis diagrams is that it is possible to follow Cohorts.   The dotted black lines depict the 1955, 1961, 1967, 1973 and 1979 cohorts respectively.    It is therefore possible to see for example that the middle line (the 1967 cohort) recorded the highest number of fatalities around 1986.   But following the middle line it is possible to claim that as this cohort ages the death count/rate fell dramatically until they were about 25.   But as you continue to follow the cohort, the death count/rate remains roughly constant until about 2010 (when this cohort were 43).   It does appear possible to make the same claim of other cohorts.   After an initial rapid decline in fatalities / the death rate, the numbers / rate remain constant until around 2010.    The reason this is so interesting is because there was a strong narrative in the 2000s of the “Born Again Biker”, people who had ridden when they were younger but took up riding again in their middle ages.   It would be unwise to read too much into these data, but they don’t sit comfortably with that interpretation.   It looks more as if we have some very strong cohort effects.   Biking was very popular among the 1967 cohort.   By the 2000s a lot of bikers on the road (a high proportion given that CBT seems to have deterred youngsters from riding) were from that cohort.   So the large number of middle aged motorcycle casualties in the 2000s could be due to the relative numerical importance of this cohort more than it was due to the “Born Again Biker” phenomena.

If this were just a narrative about interpreting / over interpreting data from some simple visuals it would be slightly interesting.   But the point seems to be that the competing interpretations suggest very different interventions to reduce motorcycle injury.

## Lexis Diagrams for Road Injuries

This is based on the crude road injury rate (all injuries), and a reconstructed population up to 90 for England and Scotland from 1985 to the end of 2014.   The colors show the injury rate (per capita), for all police reported road injuries.    It’s very clear that “around 20 year olds” have the highest per capita injury rate.   Some dotted lines have been added for particular cohorts.   It is possible to persuade yourself there are cohort effects.

It bothers me a little in fitting an Age-Period-Cohort model that this impression will be reinforced by assumptions as much as by data.

When you subset the data a bit, there are some very clear cohort effects, for example for motorcycle injuries.

## Barnard’s views on statistical education (from

I really enjoy reading Deborah Mayo’s blog.   I don’t really understand enough about the underlying philosophy – and even gently reading about ideas such as Popper asking for severe challenges (rather than out and out falsifiability) bring this home.   I guess one reason for the blog was to practice articulating ideas I’m not so comfortable about but use all the time.

But I’ve just seen a post which quotes from George Barnard:

[I]t seems to be useful for statisticians generally to engage in retrospection at this time, because there seems now to exist an opportunity for a convergence of view on the central core of our subject. Unless such an opportunity is taken there is a danger that the powerful central stream of development of our subject may break up into smaller and smaller rivulets which may run away and disappear into the sand.

I shall be concerned with the foundations of the subject. But in case it should be thought that this means I am not here strongly concerned with practical applications, let me say right away that confusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth. It is also responsible for the lack of use of sound statistics in the more developed areas of science and engineering. While the foundations have an interest of their own, and can, in a limited way, serve as a basis for extending statistical methods to new problems, their study is primarily justified by the need to present a coherent view of the subject when teaching it to others. One of the points I shall try to make is, that we have created difficulties for ourselves by trying to oversimplify the subject for presentation to others. It would surely have been astonishing if all the complexities of such a subtle concept as probability in its application to scientific inference could be represented in terms of only three concepts––estimates, confidence intervals, and tests of hypotheses. Yet one would get the impression that this was possible from many textbooks purporting to expound the subject. We need more complexity; and this should win us greater recognition from scientists in developed areas, who already appreciate that inference is a complex business while at the same time it should deter those working in less developed areas from thinking that all they need is a suite of computer programs.

Apparently the comments were made around 1981 and published in a monograph around 1985.   I wonder how many such books have been written since then that have really thumbed their nose at his points.   And I must go away and check whether he was considering less formal issues such as appraising the representativeness of a particular (real) sample to the population it purports to represent.

## Regression notation

Aaargh.   A A R G H.

I like David Freedman’s book on Statistical Modelling.   I like it a lot, and I particularly like the clarity.   It’s very clear that $\beta$ is a population parameter, a fixed but unknown constant (OK, so I prefer to think of these as random variables, but we’ll leave that aside for now).   Whatever your statistical perspective, $\hat{\beta}$ is a random variable and hence has a realisation, a sampling distribution (population distribution), variance and so on.   So it’s quite clear that the hat symbol denotes an estimator which as a function of random variables has to be a random variable.

I therefore got rather queasy about using $\hat{y}$ to denote the projection of $y$ onto $\boldsymbol{X b}$ where $\boldsymbol{b}$ are the least squares solutions.   Many people spend a little time looking at the least squares solutions, if nothing else because the geometry is so cute.   But in this case, neither $\boldsymbol{X}$ nor $\boldsymbol{b} = \boldsymbol{(X}^T\boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y}$ are random variables because we don’t have to regard this as a statistical model.   It’s just a projection of some data.   We may go on to assume that at least $\boldsymbol{y}$ is a realisation of a random variable and make assumptions about it’s properties which let us use $\boldsymbol{b}$ as an estimator $\hat{\boldsymbol{\beta}}$ but we don’t have to.   It’s an exercise in geometry, pure and simple.   So why use the $\hat{y}$ notation?   Doesn’t it imply that $\hat{y}$ is a random variable, which adds a conceptual layer to the development of the material that isn’t necessary yet.

Why not call it Shadow y, and then call the Hat matrix the shadow matrix (projection y would do as well wouldn’t it))?    The Hat matrix would become  Hat matrix only when we are making assumptions about $\boldsymbol{Y}$ instead of looking at projections of $\boldsymbol{y}$?

A large part of my queasiness is about over-emphasis on $\hat{\boldsymbol{y}}$.   I know it’s a nice examinable exercise to give someone a formula and ask them to compute some value of $\hat{y}$ but that seems wrong as well.    If we’re fitting a regression model isn’t the point that we are calculating $E[Y_i | \boldsymbol{x}_i ] = \mu_i|\boldsymbol{x}_i$because that we believe $Y_i \sim Normal(\mu_i|\boldsymbol{x}_i, \sigma^2)$.   Given the probability  $P(Y_i=\hat{y}_i)=0$ it seems a strange thing to get quite so hung up on $\hat{y}$ in a statistical model, whereas $Shadow(y)$ (or whatever else we should call it) seems like a natural thing to consider when we are working with the geometry of linear regression.

## Developing code

The wonder of the blogosphere is that I rediscovered a paper I had been meaning to think about for a long time: the paper reports Cook, Gelman and Rubin’s proposals that we:

1. Specify a model and then simulate parameters from the prior distributions (which is a bit of a nuisance if you have improper flat priors)
2. Now that you have some parameters, simulate data from the likelihood, conditional on the simulated parameters
3. Run the MCMC sampler
4. Compare the true parameter values to the samples. Generate a statistic for each parameter representing the proportion of samples that are greater than the true value
5. Repeat many times

I think what interested me the most is that it highlights a gulf between my work and that of many programmers. And it doesn’t just seem to be me. We don’t seem to develop statistical software the way software developers would automatically do. And the main difference seems to be the lack for formal approaches to testing what we do.

Python for example is very well equipped with Test Drive Development packages. Standard Python comes with Unit testing built in (is can find a rather elderly book athttp://www.onlamp.com/pub/a/python/2004/12/02/tdd_pyunit.html). There is even a doctest facility which works out of your docstrings. Now, it’s certainly true that there is a unit testing package in R (http://cran.r-project.org/web/packages/svUnit/index.html), but I don’t get the sense that it is used as routinely as seems to be the case in Python. Yes, install.packages(“foo”) does download, compile, install and check something. But it doesn’t appear that testing is built into software development to the rigour I see in much of the Python packages that I use. In fact, I think all the packages I installed this year suggested I run nose tests – nose being a Python package that implements Test Driven Development in a more advanced way than is possible in standard Python.

I appreciate that much of what I do is either visual or relies on Monte Carlo methods, hence designing a test is non-trivial. But hunting the blogosphere around TDD does suggest I could usefully learn a lot about more structured approaches to software development. You do innately perform a lot of checking on steps within an algorithm – and of course do the kind of checking envisaged by the gurus mentioned at the start. But I get the impression that TDD provides a discipline to writing code that doesn’t exisit in my work, and that my work could benefit from that in many ways – even just the job satisfaction of giving myself milestones.

(and as it happens, there is another buzzword “Agile Development” that perhaps I should check out).

## Emerging thoughts on the Poisson as a limit of the Binomial

I am trying to set down some thoughts on the relationship between the Binomial and the Poisson. The specific problem emerges from modelling data where I assume $Y_{i1} \sim Poisson(\lambda_i)$ and $Y_{i2} | n_i \sim Binomial(n_i, p_i)$. The specific problem is that I assume that $Y_{i1}$ and $Y_{i2}$ are correlated in some way, and have essentially been modelling this by assuming that $\mbox{logit} (p_i) = \eta_{i1} + u_{i1}$ and $\log(\lambda)=\eta_{i2}+u_{i2}$ and that the correlation can be induced by assuming $U_{i1},U_{i2}$ are bivariate Normal random variables. However the models don’t work the way I think they should.

So the first thing to look at is the variance of estimators of $p$ and $\lambda$ for simple univariate samples. For the Binomial, $E[X] = np$ and $Var(X)=np(1-p)$, so if we have $\hat{p}=\frac{x}{n}$then $Var(\hat{p})=Var(\frac{X}{n}) = \frac{p(1-p)}{n}$. For the Poisson, we have $E[X]=\lambda$$Var(X)=\lambda$ and given that $\hat{\lambda}=\frac{\sum_{i=1}^n x_i}{n}$ we have $Var(\hat{\lambda})=\frac{\lambda}{n}$.

The interesting thing about this is what happens when you think about taking limits. One explanation for the Poisson suggests that if you let $\lambda = np$ in the Binomial and take $n \to \infty$ holding $\lambda$ constant. If you do this you find that:

• Poisson: $Var(X) \lambda$
• Binomial: $Var(X) = np(1-p) = \lambda (1-p)$

However, as $n \to \infty$, to hold $\lambda$ constant then clearly $p \to 0$ in some way and so in the limit $Var(X)$ is identical. Given that I’m usually rather dismissive of asymptotics, it’s interesting to see just how quickly these two converge.

But look what happens if you consider the variance of the estimators.

• Direct $Var(\hat{\lambda}) = \frac{\hat{\lambda}}{n}$
• Derived from the Binomial $Var(\hat{\lambda}) = \hat{p} (1-\hat{p})$ (note that $p=\frac{\lambda}{n}$)

Well, I suppose one thing to say is that the limiting behaviour only applies where we have $np$ held constant where $p \to 0$ so perhaps it’s no surprise that there is a large discrepancy where $p \geq 0.5$.

But the speculation concerns the way the inferred Binomial has lower variance than the equivalent Poisson. As there is no such thing in real life as Poisson or Binomial random variables, this looks to me as if assuming one or the other has a bearing on the assumed precision of their estimators. In my modelling situation, assuming a Binomial will have a stronger influence on the model fit than assuming a Poisson.

## Python Notebook

It has been a busy year, for various reasons. However, one loose end that urgently needs tidying up are an assortment of bits and pieces relating to the use of python as a computing language. Somewhere during the year the iPython notebook seemed to catch on. So, we have reworked the famous Coal Disaster changepoint model as a Bayesian model fitted with a Gibbs sampler using python. There are all sorts of nice features about the notebook:

• Code chunks can be run as a block (great for debugging, and checking out sensitivity to starting values
• The R magic tool lets you send objects to R for work there (I am still taking a long time to learn python – at the moment rather a lot of plotting functions are farmed out that way
• There is an online notebook viewing service available via the nbviewer website. The current version of my notebook can be viewed here: Gibbs Sampler Changepoint

One small issue I’m still struggling with in terms of the nbviewer is the ability to output directly to html – I can’t find a template file html-blogger for example. But still, this looks like a very promising tool indeed.