These two figures show Lexis diagrams for (a) the number of reported motorcycle fatalities in England and Scotland between 1985 and the end of 2014 and (b) the *per capita* fatality rate. The *x* axis denotes the year and the *y* axis denotes the age of the casualty. The shade of red denotes the number of fatalities (left) or the fatality rate (right).

It seems that motorcycle deaths were high amongst young riders in the mid-1980s. Compulsory Basic Training (CBT) was introduced in both jurisdictions on 1st December 1990. Fatalities have fallen amongst young adults since then, and it is plausible that this is due to CBT deterring people from using a motorbike at all, as well as to any safety benefits of only allowing youngsters on the road with this minimal training package.

Another strong feature of the data is that deaths fell noticeably in the 2010s. It is perhaps striking that drops in the numbers of registered motorcycles coincided with the recession which started just before then.

The most interesting thing about these Lexis diagrams is that it is possible to follow cohorts. The dotted black lines depict the 1955, 1961, 1967, 1973 and 1979 birth cohorts respectively. It is therefore possible to see, for example, that the middle line (the 1967 cohort) recorded the highest number of fatalities around 1986. Following that middle line, as this cohort aged the death count/rate fell dramatically until they were about 25, but thereafter it remained roughly constant until about 2010 (when this cohort were 43). The same pattern appears to hold for other cohorts: after an initial rapid decline, the fatality count/rate remains constant until around 2010.

The reason this is so interesting is that there was a strong narrative in the 2000s of the “Born Again Biker”: people who had ridden when they were younger and took up riding again in middle age. It would be unwise to read too much into these data, but they don’t sit comfortably with that interpretation. It looks more as if we have some very strong cohort effects. Biking was very popular among the 1967 cohort, so by the 2000s a high proportion of bikers on the road (especially given that CBT seems to have deterred youngsters from riding) were from that cohort. The large number of middle-aged motorcycle casualties in the 2000s could therefore be due to the relative numerical importance of this cohort rather than to the “Born Again Biker” phenomenon.

If this were just a story about interpreting (or over-interpreting) data from some simple visuals, it would be only slightly interesting. But the point is that the competing interpretations suggest very different interventions to reduce motorcycle injury.


This is based on the crude road injury rate (all injuries) and a reconstructed population up to age 90 for England and Scotland from 1985 to the end of 2014. The colours show the per capita injury rate for all police-reported road injuries. It’s very clear that “around 20 year olds” have the highest per capita injury rate. Dotted lines have been added for particular cohorts, and it is possible to persuade yourself there are cohort effects.

It bothers me a little that, in fitting an Age-Period-Cohort model, this impression will be reinforced as much by assumptions as by data.

When you subset the data a bit, there are some very clear cohort effects, for example for motorcycle injuries.

But I’ve just seen a post which quotes from George Barnard:

[I]t seems to be useful for statisticians generally to engage in retrospection at this time, because there seems now to exist an opportunity for a convergence of view on the central core of our subject. Unless such an opportunity is taken there is a danger that the powerful central stream of development of our subject may break up into smaller and smaller rivulets which may run away and disappear into the sand.

I shall be concerned with the foundations of the subject. But in case it should be thought that this means I am not here strongly concerned with practical applications, let me say right away that confusion about the foundations of the subject is responsible, in my opinion, for much of the misuse of the statistics that one meets in fields of application such as medicine, psychology, sociology, economics, and so forth. It is also responsible for the lack of use of sound statistics in the more developed areas of science and engineering. While the foundations have an interest of their own, and can, in a limited way, serve as a basis for extending statistical methods to new problems, their study is primarily justified by the need to present a coherent view of the subject when teaching it to others. One of the points I shall try to make is, that we have created difficulties for ourselves by trying to oversimplify the subject for presentation to others. It would surely have been astonishing if all the complexities of such a subtle concept as probability in its application to scientific inference could be represented in terms of only three concepts––estimates, confidence intervals, and tests of hypotheses. Yet one would get the impression that this was possible from many textbooks purporting to expound the subject. We need more complexity; and this should win us greater recognition from scientists in developed areas, who already appreciate that inference is a complex business while at the same time it should deter those working in less developed areas from thinking that all they need is a suite of computer programs.

Apparently the comments were made around 1981 and published in a monograph around 1985. I wonder how many such books have been written since then that have really thumbed their nose at his points. And I must go away and check whether he was considering less formal issues, such as appraising the representativeness of a particular (real) sample relative to the population it purports to represent.

I like David Freedman’s book on Statistical Modelling. I like it a lot, and I particularly like the clarity. It’s very clear that $\beta$ is a population parameter, a fixed but unknown constant (OK, so I prefer to think of these as random variables, but we’ll leave that aside for now). Whatever your statistical perspective, $\hat{\beta}$ is a random variable and hence has a realisation, a sampling distribution (population distribution), variance and so on. So it’s quite clear that the hat symbol denotes an estimator which, as a function of random variables, has to be a random variable.

I therefore got rather queasy about using $\hat{y}$ to denote the projection of $y$ onto the column space of $X$, where $\hat{\beta}$ are the least squares **solutions**. Many people spend a little time looking at the least squares solutions, if nothing else because the geometry is so cute. But in this case, neither $y$ nor $\hat{\beta}$ are random variables, because we don’t **have** to regard this as a statistical model. It’s just a projection of some data. We may go on to assume that $y$ at least is a realisation of a random variable and make assumptions about its properties which let us use $\hat{\beta}$ as an estimator, but we don’t have to. It’s an exercise in geometry, pure and simple. So why use the hat notation? Doesn’t it imply that $\hat{y}$ is a random variable, which adds a conceptual layer to the development of the material that isn’t necessary yet?

Why not call it shadow $y$, and then call the Hat matrix the shadow matrix (projection $y$ would do as well, wouldn’t it)? The Hat matrix would only become the Hat matrix when we are making assumptions about $y$ rather than just looking at projections of $y$.

A large part of my queasiness is about over-emphasis on $\hat{y}$. I know it’s a nice examinable exercise to give someone a formula and ask them to compute some value of $\hat{y}$, but that seems wrong as well. If we’re fitting a regression model, isn’t the point that we are calculating $\hat{\beta}$ because we believe $E[y] = X\beta$? Given the probability model, it seems a strange thing to get quite so hung up on $\hat{y}$ in a statistical model, whereas shadow $y$ (or whatever else we should call it) seems like a natural thing to consider when we are working with the geometry of linear regression.
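None of which needs any probability at all; the projection is a few lines of linear algebra. A toy sketch (the data here are invented for illustration):

```python
import numpy as np

# The projection of y onto the column space of X is pure geometry --
# no probabilistic assumptions are required anywhere below.
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(10), np.arange(10.0)])
y = rng.normal(size=10)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # the "hat" (shadow?) matrix
shadow_y = H @ y                        # the projection of y

# H is symmetric and idempotent, as any orthogonal projection must be
assert np.allclose(H, H.T)
assert np.allclose(H @ H, H)
# and the residual y - Hy is orthogonal to the column space of X
assert np.allclose(X.T @ (y - shadow_y), 0)
```

Only when we add assumptions about $y$ does any of this become an estimator.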

- Specify a model and then simulate parameters from the prior
- Now that you have some parameters, simulate data from the likelihood, conditional on the simulated parameters
- Run the MCMC sampler
- Compare the true parameter values to the samples. Generate a statistic for each parameter representing the proportion of samples that are greater than the true value
- Repeat many times
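The recipe above can be sketched end-to-end for a model simple enough that the exact posterior can stand in for the MCMC sampler. All the specifics below (a conjugate normal model, the sample sizes) are my own invented example:

```python
import numpy as np

# Sketch of the checking recipe for theta ~ N(0, 1), y | theta ~ N(theta, 1).
# The closed-form posterior N(mu_n, sd_n) stands in for step 3, the sampler.
# If everything is consistent, the statistic P(sample > true theta) should
# look uniform on (0, 1) over many repeats.
rng = np.random.default_rng(1)
n, n_reps, n_draws = 20, 500, 1000
stats = []
for _ in range(n_reps):
    theta = rng.normal()                   # 1. simulate a parameter from the prior
    y = rng.normal(theta, 1.0, size=n)     # 2. simulate data given that parameter
    sd_n = np.sqrt(1.0 / (1.0 + n))        # 3. "run the sampler" (exact posterior)
    mu_n = sd_n**2 * y.sum()
    draws = rng.normal(mu_n, sd_n, size=n_draws)
    stats.append(np.mean(draws > theta))   # 4. proportion of samples above truth
print(np.mean(stats))                      # 5. over many repeats, should be near 0.5
```

With a real MCMC sampler in step 3, systematic departures of these statistics from uniformity flag a bug somewhere in the model or the sampler.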

I think what interested me most is that it highlights a gulf between my work and that of many programmers. And it doesn’t seem to be just me: we don’t seem to develop statistical software the way software developers would automatically do it, and the main difference seems to be the lack of formal approaches to testing what we do.

Python, for example, is very well equipped with Test Driven Development packages. Standard Python comes with unit testing built in (you can find a rather elderly book at http://www.onlamp.com/pub/a/python/2004/12/02/tdd_pyunit.html). There is even a doctest facility which works out of your docstrings. Now, it’s certainly true that there is a unit testing package in R (http://cran.r-project.org/web/packages/svUnit/index.html), but I don’t get the sense that it is used as routinely as seems to be the case in Python. Yes, install.packages("foo") does download, compile, install and check something. But testing doesn’t appear to be built into software development with the rigour I see in many of the Python packages that I use. In fact, I think all the packages I installed this year suggested I run nose tests – nose being a Python package that implements Test Driven Development in a more advanced way than is possible in standard Python.
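As a tiny illustration of the doctest facility (the function here is my own invented example):

```python
from math import log

def logit(p):
    """Log odds of a probability.

    The examples in the docstring double as tests:

    >>> logit(0.5)
    0.0
    >>> round(logit(0.9), 4)
    2.1972
    """
    return log(p / (1.0 - p))

if __name__ == "__main__":
    import doctest
    doctest.testmod()   # silently passes if the docstring examples hold
```

The examples document the function and verify it in one go, which is exactly the kind of low-friction discipline that seems rarer in statistical code.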

I appreciate that much of what I do is either visual or relies on Monte Carlo methods, hence designing a test is non-trivial. But hunting around the blogosphere on TDD does suggest I could usefully learn a lot about more structured approaches to software development. I do innately perform a lot of checking on steps within an algorithm – and of course do the kind of checking envisaged by the gurus mentioned at the start. But I get the impression that TDD provides a discipline to writing code that doesn’t exist in my work, and that my work could benefit from that in many ways – even just the job satisfaction of giving myself milestones.

(and as it happens, there is another buzzword “Agile Development” that perhaps I should check out).


So the first thing to look at is the variance of estimators of $p$ and $\lambda$ for simple univariate samples. For the Binomial, $E[X] = np$ and $\mathrm{Var}(X) = np(1-p)$, so if we take $\hat{p} = X/n$ then $\mathrm{Var}(\hat{p}) = p(1-p)/n$. For the Poisson, we have $E[X] = \mathrm{Var}(X) = \lambda$, and given that $\hat{\lambda} = X$ we have $\mathrm{Var}(\hat{\lambda}) = \lambda$.

The interesting thing is what happens when you think about taking limits. One derivation of the Poisson lets $n \to \infty$ in the Binomial while holding $np = \lambda$ constant. If you do this you find that:

- Poisson: $P(X = x) = \dfrac{e^{-\lambda}\lambda^x}{x!}$
- Binomial: $P(X = x) = \dbinom{n}{x}\left(\dfrac{\lambda}{n}\right)^x\left(1 - \dfrac{\lambda}{n}\right)^{n-x} \to \dfrac{e^{-\lambda}\lambda^x}{x!}$

However, as $n \to \infty$, to hold $np = \lambda$ constant we clearly need $p \to 0$, and so in the limit the two distributions are identical. Given that I’m usually rather dismissive of asymptotics, it’s interesting to see just how quickly these two converge.

But look what happens if you consider the variance of the estimators.

- Direct: $\mathrm{Var}(\hat{\lambda}) = \lambda$
- Derived from the Binomial (note that $\lambda = np$): $\mathrm{Var}(n\hat{p}) = np(1-p) = \lambda\left(1 - \dfrac{\lambda}{n}\right)$

Well, I suppose one thing to say is that the limiting behaviour only applies where we have held $np = \lambda$ constant as $n \to \infty$, forcing $p \to 0$, so perhaps it’s no surprise that there is a large discrepancy where $p$ is not small.

But the speculation concerns the way the inferred Binomial has lower variance than the equivalent Poisson. As there is no such thing in real life as a Poisson or Binomial random variable, this looks to me as if the choice of assumption has a bearing on the assumed precision of the estimators. In my modelling situation, assuming a Binomial will have a stronger influence on the model fit than assuming a Poisson.
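A quick Monte Carlo sketch of the point (all values invented for illustration): the same $\lambda$ estimated directly from Poisson draws, or as $n\hat{p}$ from Binomial draws, has visibly different variance.

```python
import numpy as np

# Estimate lam = np either directly from Poisson draws or as the Binomial
# count X = n * p-hat, and compare the empirical variances.
rng = np.random.default_rng(0)
lam, n, reps = 5.0, 20, 200_000
pois_est = rng.poisson(lam, size=reps)             # lam-hat = X,  Var = lam
binom_est = rng.binomial(n, lam / n, size=reps)    # lam-hat = n * p-hat
print(pois_est.var())    # close to lam = 5
print(binom_est.var())   # close to lam * (1 - lam/n) = 3.75
```

The deflation factor $(1 - \lambda/n)$ is exactly where the extra assumed precision comes from.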

- Code chunks can be run as a block (great for debugging, and for checking sensitivity to starting values)
- The R magic tool lets you send objects to R for work there (I am still taking a long time to learn Python – at the moment rather a lot of plotting functions are farmed out that way)
- There is an online notebook viewing service available via the nbviewer website. The current version of my notebook can be viewed here: Gibbs Sampler Changepoint

One small issue I’m still struggling with is getting nbviewer to output directly to html – I can’t find a template file such as html-blogger, for example. But still, this looks like a very promising tool indeed.

For a continuous random variable $X$ assumed to follow some distribution, the expectation can be defined as $E[X] = \int x f(x)\,dx$, where $f(x)$ is the probability density function of $X$. Here’s where the fun starts.

- For compact notation: although $f$ is a function itself, and more formally would be denoted $f_X(x)$, where the subscript denotes the random variable on some sample space, no-one seems to get confused if we simplify the notation and just use $f(x)$. So far, so good.
- In a similar way (albeit using the simpler notation for the random variable), the expectation should perhaps more carefully be written as $E_X[X]$, telling us the expectation is taken over $X$. Also, so far so good.
- Now for conditional expectation. I now have two random variables, $X$ and $Y$, and I don’t at this stage care about the relationship between them. I do however wish to find the expectation of $Y$ for some unspecified value of the variable $X$. I denote this, conditional on $X = x$, as $E[Y \mid X = x]$. So my expectation (in short form) will be denoted $E[Y \mid X]$, because it is meant to be clear what we mean. What I do is find
$E[Y \mid X = x] = \int y \, f(y \mid x) \, dy$.

One thing to note here is that as $X$ is a random variable, this expectation denotes a function of a random variable: the function $g(X) = E[Y \mid X]$ defines a random variable. But it is still an expectation of $Y$. Conversely, if $x$ is fixed, this is not a random variable. Nevertheless, it still requires an integration over $y$. So the same notation can define either a random variable or some unknown value, depending on whether $X$ is treated as a random variable or not. The type of object thus depends on $X$, but it is still an expectation of $Y$, and both definitions are an integration over $y$. I therefore think the most formal notation here should be $E_Y[Y \mid X]$, to denote that we are taking an expectation over $Y$; it’s the value of $X$ that describes the conditioning here.

- It’s not relevant here (yet), but if I now take the expectation of this expectation, i.e. $E_X[E_Y[Y \mid X]]$, I am going to get a single number; in fact it collapses to $E[Y]$.
- However, what’s troubling me is the notation for mean square error, defined above as $E[(\hat{\theta} - \theta)^2]$. I have seen this written (by authors who know their stuff) as $E_\theta[(\hat{\theta} - \theta)^2]$. This has been confusing me. I think it would be better to write $E_{\hat{\theta}}[(\hat{\theta} - \theta)^2 \mid \theta]$. What we are trying to find is:
$\int (\hat{\theta} - \theta)^2 f(\hat{\theta} \mid \theta) \, d\hat{\theta}$,
because $\hat{\theta}$ is the random variable; it has a pdf (sometimes called a sampling distribution) and here a conditional pdf $f(\hat{\theta} \mid \theta)$. We are also interested in evaluating a function of that random variable given by $g(\hat{\theta}) = (\hat{\theta} - \theta)^2$. But this is not a random variable, because we are integrating over $\hat{\theta}$ and conditioning on $\theta$.

So I’m still none the wiser as to the best notation for mean square error.
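Whatever we call it, the integral above is easy to pin down numerically. A small Monte Carlo sketch (the estimator, the sample mean of normal draws, and all values are my own choices for illustration):

```python
import numpy as np

# For theta held fixed, simulate the sampling distribution of theta-hat
# (the mean of n normal draws) and average (theta-hat - theta)^2 over it.
# That average is the mean square error: an integral over theta-hat,
# conditional on theta.
rng = np.random.default_rng(3)
theta, sigma, n, reps = 2.0, 1.0, 10, 100_000
theta_hat = rng.normal(theta, sigma, size=(reps, n)).mean(axis=1)
mse = np.mean((theta_hat - theta) ** 2)
print(mse)   # close to sigma**2 / n = 0.1
```

The randomness is entirely in $\hat{\theta}$; $\theta$ just sits there as a conditioning value, which is the point of the notational complaint.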

As blogged earlier, I’m thinking of a model that has an ordinal response $y_i$ for individuals $i = 1, \dots, N$ and categories $k = 1, \dots, K$. (One of my favourite errors is forgetting to recode the $y$ values as $0, 1, \dots, K-1$ for use in Python.) There are constraints such that $\sum_{k=1}^{K} p_{ik} = 1$ for all $i$.

For ordinal logistic regression we relate $p_{ik}$ to a linear predictor $\eta_i = x_i^T\beta$. This is done in two stages. First we can write $P(y_i \le k) = F(\zeta_k - \eta_i)$. There’s no intercept; what makes all this tick is that we have a set of ordered cutpoints $\zeta_1 < \zeta_2 < \dots < \zeta_{K-1}$ (I need to tidy the notation here, as we only need $K-1$ cutpoints). For ordinal logistic regression we take $F$ to be the inverse logit, so $P(y_i \le k) = \mathrm{logit}^{-1}(\zeta_k - \eta_i)$. This is then a cumulative link model.
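The two stages (cumulative probabilities, then decumulating into category probabilities) can be sketched in plain numpy. The cutpoints and linear predictor values below are invented for a toy example:

```python
import numpy as np

def invlogit(z):
    return 1.0 / (1.0 + np.exp(-z))

def category_probs(eta, zeta):
    """Cumulative link model: P(y <= k) = invlogit(zeta_k - eta),
    decumulated into K category probabilities per individual."""
    cum = invlogit(zeta[None, :] - eta[:, None])            # shape (N, K-1)
    pad0 = np.zeros((len(eta), 1))                          # P(y <= 0) = 0
    pad1 = np.ones((len(eta), 1))                           # P(y <= K) = 1
    return np.diff(np.hstack([pad0, cum, pad1]), axis=1)    # shape (N, K)

# toy example: K = 4 categories needs K - 1 = 3 ordered cutpoints
eta = np.array([-1.0, 0.0, 1.5])
zeta = np.array([-1.0, 0.0, 1.0])
p = category_probs(eta, zeta)
assert np.allclose(p.sum(axis=1), 1.0)   # each row is a probability vector
assert (p > 0).all()                      # valid as long as zeta is ordered
```

The ordering constraint on the cutpoints is what guarantees the differences are positive, which is why the sorting trick appears in the model code below.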

When fitting the model we have to estimate both the values of the “fixed effects” $\beta$ and the cutpoints $\zeta$. It all ran OK in jags; now I just want it to work in pymc.

I’ve taken some data from Kevin Quinn which has a seven point scale, and (too many) predictor variables. This is processed into a file clinton.csv referred to below.

```python
from numpy import loadtxt
import numpy as np
from pymc import Normal, Lambda, deterministic, Categorical, invlogit

branddata = loadtxt('clinton.csv', delimiter=',')
brand = branddata[:, 0]
X = branddata[:, 1:7]
N = brand.size
K = 7  ## y.max

## Here are the priors
#### First, try to control the initial values somewhat
initbeta = np.ones(6) / 200
beta = Normal('beta', mu=1.0, tau=0.001, value=initbeta)

#### which gives rise to the linear predictor (does it need to be a lambda function?)
eta = Lambda('eta', lambda beta=beta, X=X: np.dot(X, beta))

## Ditto, set initial values for zeta
initzeta = [-4, -3, -2, -1, 0, 1]
zeta = Normal("zeta", mu=0, tau=0.001, value=initzeta)

#### Follow the jags trick of simulating K-1 cutpoints and sorting them
#### The $0.02 question is whether I can wrap this with the instantiation of zeta
#### (or whether I need it at all)
@deterministic
def zetasort(zeta=zeta):
    zsort = np.array(zeta, copy=True)
    zsort.sort()  ## nb: the parentheses matter -- zsort.sort alone does nothing
    return zsort

## cumulative log odds
## some legacy code in here from when I was sorting zeta but couldn't figure
## out how to monitor it; commented out in case I can simplify later
@deterministic
def cumlogit(eta=eta, zetasort=zetasort, K=K, N=N):
    cl = np.zeros(shape=(N, K - 1))
    for index in range(N):
        for jndex in range(K - 1):
            ## cl[index, jndex] = zeta[jndex] - eta[index]
            cl[index, jndex] = zetasort[jndex] - eta[index]
    return invlogit(cl)

## next I need to decumulate
## what I don't like here are the index counters, might have them wrong (set from zero!)
@deterministic
def p(cumlogit=cumlogit, K=K, N=N):
    pee = np.zeros(shape=(N, K))
    pee[:, 0] = cumlogit[:, 0]
    pee[:, 1:K-1] = cumlogit[:, 1:K-1] - cumlogit[:, 0:K-2]  ## there's a nasty slicing rule here
    pee[:, K-1] = 1 - cumlogit[:, K-2]
    return pee

## np.sum(p.value, axis=1)  ## just there to remind myself how to check
y = Categorical('y', p=p, value=brand, observed=True)
```

It’s the next bit where I get rather impressed by pymc. Driving the model-building process from a script like this is something I’m really learning to appreciate.

```python
import pymc
import clinton

mymodel = pymc.Model(clinton)
mygraph = pymc.graph.graph(mymodel)
mygraph.write_png("graph.png")

myMAP = pymc.MAP(clinton)
myMAP.fit()
myMAP.AIC
myMAP.zetasort.value
myMAP.beta.value

M = pymc.MCMC(clinton)
M.isample(10000, 5000, thin=5, verbose=1)
```

I do like being able to get the graphs out (and interrogate them).

Now, all that remains is to reconcile the results with those you get in R. There does seem to be some vagueness in the relationship between some of the model terms and the response.
