Category Archives: Data

Data: World Statistics Day And Access

20 October 2010 was the UN-sponsored World Statistics Day.

The celebration of the World Statistics Day was meant to acknowledge the service provided by the global statistical system at national and international level, and hoped to help strengthen the awareness and trust of the public in official statistics. The day serves as an advocacy tool to further support the work of statisticians across different settings, cultures, and domains.

Official statistics are data produced and disseminated by national statistics offices, other government departments’ statistical units and indeed by many UN, international and regional statistical units.

In the UK the history of statistics ranges from the Domesday Book in 1086, when William I commissioned a detailed inventory of all the land and property in England and Wales, to the statistical order in 2009 for the Pre-release Access to Official Statistics. The results of that first major statistical enumeration were set out in the Domesday Book.

The latest order decreases the time that journalists, and others, have access to statistics prior to their official release. The five-day period has been cut to a maximum of 24 hours, to be exceeded only in exceptional circumstances.

I mention this last and latest act because data driven journalism has been able to flourish with the advent of available data – free or inexpensive – and with access to software that allows its manipulation. Previous Data columns have explored some ways in which access and exploitation – in the nicest possible sense – have been pursued. This column is a reminder that access to data is governed by those who create it and as such its availability is not always certain.

A useful resource for connecting and meeting other people interested in data driven journalism is the European Journalism Centre and the group Data Driven Journalism.

Data: Almost Elementary

“I have no data yet. It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.” p.189 The complete Sherlock Holmes, Volume 2

A colleague once asked me what the best tools were for doing data analysis. Initially, I wasn’t sure what she meant and then realised that she wanted some kind of script to run and get an answer. Data arrives, you do what you need to do and then report on what you find by following a formula.

It doesn’t work like that and even when it’s easy, it’s not usually that simple.

There is an article in the Guardian about ‘How to be a data journalist’ and, when answering the question of where to begin, they suggest the following:

“So where does a budding data journalist start? An obvious answer would be ‘with the data’ – but there’s a second answer too: ‘With a question’.”

I don’t agree with the above; for me it’s always about the question. When you don’t have a question but you do have data, then you use the data to find a question. It’s always about the question because otherwise what are you writing about, and how will you know when you’re done?

The premise of the Guardian article is that ‘data journalists’ are compartmentalised and do certain things, but to me it seems slightly backwards. You don’t walk around looking for numbers. The Guardian’s Charles Arthur suggests “Find a story that will be best told through numbers”, which sounds a bit like a data journalist is someone who walks around with a metaphorical hammer looking for just the right nail.

The implication is that if someone didn’t know how to use ‘data’ they wouldn’t be able to answer questions that involved analysing numbers. That just can’t be right. You have a question and you use any means (within reason) to answer it.

If the source of information is numerical data then there are certain skills you can use and some of them involve statistics, presentation, context or analysis. Other data can include text, documents, speeches, actions, music, stories etc.

When Sherlock talks about obtaining more data, he is not talking about numbers, although that may be part of it. He is talking about information on which to base a conclusion, to find a solution. I agree with providing knowledge of skills which are helpful: finding data, interrogating data, visualising data, mashing data. The latter concept is a new one to me but I will look it up later.

For now I’ll finish up with what I told my colleague when she wanted to know what to do: What is the question? What is the data? What do you want to do with it? How will you know when you’re done?

Data: all about me, or is it you?

TED is an organisation that promotes itself with the tag line ‘ideas worth spreading’. Presenters have included Tony Robbins, Steve Jobs, Elizabeth Gilbert, Richard Dawkins and Malcolm Gladwell. There are thousands more on the website (which also includes transcripts). TED seems to be a place where all the cool people tend to converge when they want to spread the message about inspiring others and changing the world. This next person suggests that changing the world starts with us as individuals and can be done with the use of a device very close to most of our hearts, the smartphone.

Gary Wolf, contributing editor to Wired and blogger for the Quantified Self, is a journalist who gives a five-minute introduction to an intriguing new pastime: using mobile applications and always-on gadgets to track and analyze your body, mood, diet, spending, just about everything in daily life you can measure in glorious detail.


Wolf talks about how numbers are “useful when we reflect, learn, remember, and want to improve.” He goes on to add that “[t]he self is our operation center, our consciousness, our moral compass. So, if we want to act more effectively, we have to get to know ourselves better.”

The talk ends with the tempting thought that numbers are the way to get to know ourselves better but it doesn’t go on to say how. I suggest that the next step is to find some way to present the data and then to find some meaning in it through analysis.

David McCandless, another TED talker, is a ‘data journalist’ because he uses data and graphics to present a story. I attended his TEDx presentation in Brussels in 2009 and he made the fascinating point that the brain takes time to interpret numbers and text in order to give them context but it can absorb graphics instantly. See the figure below:

The keyword coverage during elections and terror alerts is presented graphically to show an apparent association between the two over time. Most years show an association, although 2008 does not. The question at the end of the article is: ‘any correlation?’ Our eyes tell us there is a correlation, that the two events are related to the extent that when one happens we can better predict the other happening, but what does it mean? The association could be random. A comment made on the site by Tim suggests that he took the ‘time’ factor into account and still found some correlation, which he calls highly significant. He probably means that there is less than a 0.0001% chance that the pattern we see is due to randomness.

However, to use a favourite statistical saying, correlation does not equal causation. Just because we can see an association, it does not mean that one event causes the other. To find out whether the events occurred randomly or have some significant (non-random) relationship, and what the strength, direction and cause of that relationship are, we would have to analyse the data, and there are various techniques available.
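One of those techniques can be sketched in a few lines. The example below is only an illustration: the two series are invented monthly counts, not the figures behind the graphic, and the function names are my own. It computes a Pearson correlation and then shuffles one series many times to estimate how often a correlation at least that strong turns up by chance.

```python
import random
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def permutation_p_value(xs, ys, n_shuffles=10_000, seed=0):
    """Share of random re-pairings whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real pairing between the series
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return hits / n_shuffles

# Hypothetical monthly keyword counts, invented for illustration only.
election_mentions = [3, 5, 8, 13, 21, 18, 9, 4, 2, 6, 11, 16]
terror_mentions = [2, 4, 9, 11, 19, 20, 8, 5, 3, 5, 12, 14]

r = pearson_r(election_mentions, terror_mentions)
p = permutation_p_value(election_mentions, terror_mentions)
print(f"r = {r:.2f}, permutation p-value = {p:.4f}")
```

A small p-value here only says the pairing is unlikely to be random. It says nothing about which event, if either, causes the other.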

That, however, is a matter for another day. For now it is useful to note how seeing numbers in a graphical format can add context very quickly and suggest some pattern or story. If you are collecting information about your caffeine intake and regular moods then you might find an association between them that would not be so easy to spot when you look at them as numbers.

There are three elements to quantifying the self: 1) collecting the data; 2) presenting the data in order to identify associations; and 3) understanding what the associations mean.

A useful source for exploring the data collection stage is the Quantified Self site, which has tools to help. Ideas on presenting data are brilliantly presented on the site and McCandless is also published in the Guardian. The third part, however, is not often presented as easily or as prettily as the rest.

I’ll leave that last one for another day.

Data: in the eye of the beholder

A few weeks ago I was having breakfast at Primrose Cafe in Clifton. The sun was shining, the radio was on too loud, the place was crowded as usual and the conversation was almost flowing. In the midst of all this my companion made the point that there seemed to be more beautiful people in Clifton than there were, say, in Bedminster, and didn’t I think so? I looked around, and as the source of the comment was a single man, I tried to spot and remember how many young, slender, brunettes we had passed on our way.

He insisted that it wasn’t just about young women, so I asked if it was related to age: are there more young people in Clifton? Is it the clothes, the brushed hair, the jewellery, the make-up, the colour of the skin? Were there more white people? At this point he started to get a tad defensive at the suggestion that I might be calling him either shallow or racist or both. We didn’t get very far, as he insisted he knew what beautiful meant and he didn’t have to explain it, while I persisted with the thought that he should learn to quantify these abstract notions.

There’s always a chance that we were both somewhat wrong and right at the same time but I’ll stick to arguments that favour my own particular biases as this will be quicker.

“Nothing is considered to be beautiful by all peoples everywhere” says Desmond Morris. “Every revered object of beauty is considered ugly by someone, somewhere … There is so often the feeling that this, or that, particular form of beauty really does have some intrinsic value, some universal validity that simply must be appreciated by everyone. But the hard truth is that beauty is in the brain of the beholder and nowhere else” (pp 421-2).

Morris goes on to write of how humans are master-classifiers of information. When it comes to identifying beauty and ugliness, he suggests that we have an internal classification and, according to the properties we assign to this category, we call something beautiful when it excels in those particular qualities and ugly when it doesn’t (p423).

This is where data comes into it, because if we can identify characteristics it means that we can measure them and compare Bedminster and Clifton. I didn’t go ahead and measure them, but I do know that when I think of people or places as beautiful or scummy or amazing or poor etc. there are plenty of biases that underlie the concepts.

There are also plenty of sites which make data available on locations and which already provide categories. One such website uses demographic information to provide snapshots of areas. 1.4 miles separate the Royal York Crescent in Clifton from West St in Bedminster, but in terms of household income, interest in current affairs and education there are vast worlds of difference.

Bedminster, West St

Family income, educated to degree level and interest in current affairs are all high in Clifton whereas in Bedminster family income and educated to degree level are medium and interest in current affairs is below medium.

I’m using demographics as examples of what data can add to meaning. There is a lot of information about data journalism at the moment and how it’s the new big thing, and that can’t be a bad thing since apparently “a lot of journalists are innumerate and a lot don’t know much about history” (CJR). What I think it comes down to is adding meaning where facts just aren’t enough and, by the way, without context, facts may be sacred but they are rarely enough.

When the Guardian advertises its credentials in promoting the West Country and suggests that Bristol featured in their [readers’] top ten UK cities in the 2009 Guardian and Observer reader Travel Awards you would probably not need help to figure out that Clifton features more than Bedminster. If you weren’t from the South West or Bristol, however, there is a fair amount of data out there that would help you figure it out and that’s the beauty of it.

Visualising Bristol, Prisoners

Oliver Conner shared some of his favourite free technology tools in a recent post and I thought I would take the time to try some. The first was Tableau Public, a visualisation tool which, as far as I can tell, provides an ‘easy’ way to make graphs and tables out of data.

My first output is a figure of the compensation paid to prisoners in Bristol as claims in civil litigation proceedings by type of claim.

Medical negligence payments of £119k diminish the impact of all other claims which, at their highest (£18,540 for unlawful detention), are £100k less. The cause of this outlier is £112k paid out in 2006-07, which may make an interesting story, but at the moment it hides everything else. Medical negligence is removed for the rest of the examination.

I find graphics a useful way of presenting information that is immediately understandable rather than using data tables which need some translation to gain some meaning. The amount of money paid out by year is presented on its own to note trends over time.

The sum paid out almost triples between 04-05 and 05-06 and then continually decreases in the following years. So where has the change come from?


(Source: Guardian Data Store)

In 2004-05, £4,545 was paid out for Injuries – slips & trips and falls, and property (damaged or lost). In 2005-06, £10,920 was paid out for injury due to assault by other prisoners and £2,600 for unlawful detention.
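The ‘almost triples’ claim can be checked from the amounts quoted above. This is a quick sketch that assumes the payments named are the only ones left for each year once medical negligence is removed; the variable names are mine.

```python
# Figures quoted in the text, in pounds, with medical negligence excluded.
# Assumption: these are the only remaining payments for each year.
paid_2004_05 = 4_545            # injuries (slips, trips and falls) and property
paid_2005_06 = 10_920 + 2_600   # assault by other prisoners + unlawful detention

ratio = paid_2005_06 / paid_2004_05
print(f"2005-06 total: £{paid_2005_06:,}")
print(f"Increase factor: {ratio:.2f}")
```

On those figures the 2005-06 total is £13,520, just under three times the previous year.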

The downward trend in the last three years is split over large payments for Injury (assault by prisoners) in 2005-06 and unlawful detention in 2006-07 and 2007-08.

Four years don’t show much of a trend because of the differences between the first and last two years. The next stage is to obtain data that isn’t available from the Guardian. The Bristol Evening Post has an article on compensation from Bristol prisons for the years 2007 to 2009 based on Freedom of Information requests. There is more useful information, namely the names of the prisons. “The figures were provided by the National Offender Management Service (NOMS) for Horfield Prison and Eastwood Park – a prison for female inmates in Falfield. NOMS does not hold information for Ashfield Young Offenders Institute in Pucklechurch, which is run by a private company.”

I derived the 2008-09 figures by subtracting the 2007-08 sums from the two-year totals in the Evening Post data.
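That subtraction is simple enough to sketch. The amounts below are invented to show the calculation and are not the published figures; the function and variable names are hypothetical too.

```python
def derive_second_year(two_year_totals, first_year):
    """Subtract the known first-year amounts from two-year totals, per claim type."""
    return {
        claim: total - first_year.get(claim, 0)
        for claim, total in two_year_totals.items()
    }

# Hypothetical amounts in pounds, invented to illustrate the calculation.
totals_2007_09 = {"Unlawful detention": 9_000, "Property": 1_200}
known_2007_08 = {"Unlawful detention": 6_500, "Property": 1_200}

derived_2008_09 = derive_second_year(totals_2007_09, known_2007_08)
print(derived_2008_09)
```

The result is only as good as the assumption that both sources cover the same prisons and classify claims the same way, which I have not verified.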

The table shows that injury payments were made in the first two years but not in the following years. Medical negligence payments appeared in the two years after that but not in 2008-09. The most consistent payment, in four out of the five years, is for unlawful detention. £12,600 was paid out between 2007 and 2009 by Eastwood Park.

A report on this prison was presented after an announced visit in 2000, stating that it had many functions to perform, such as drug rehabilitation, housing foreign nationals, child protection and various training programs. However, the report claimed that “it lacks the capacity to carry out any of these tasks completely”. By 2008, a further report stated that the prison “in spite of the considerable challenges, is performing reasonably well in all areas, and is carrying out some innovative and supportive work”.

I could look for further information on the unlawful detention at Eastwood Park but I think that this was enough to show what you can get from some data. The Tableau Public program was interesting to use but I would often have to press the back button to change things when I couldn’t figure out how to fix them. The biggest issue was converting the data from the Guardian into list form so I could import it. A lot of the functionality is already present in Excel pivot tables. A benefit to the free software is the ability to export it to the web and discuss it as part of the community. I used one embedded chart from the site and the rest were easier to use as screenshots.
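For anyone facing the same conversion, here is a minimal sketch of turning a one-column-per-year table into list form using only the Python standard library. The layout and amounts are placeholders I made up; the Guardian spreadsheet may well be arranged differently.

```python
import csv
import io

# Hypothetical wide-format extract: one row per claim type, one column per year.
# The amounts are placeholders, not the published figures.
wide_csv = """Claim type,2004-05,2005-06
Unlawful detention,100,200
Property,300,
"""

def wide_to_long(text, id_column="Claim type"):
    """Turn one-column-per-year rows into (claim type, year, amount) records."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        for column, value in row.items():
            if column == id_column or not value:
                continue  # skip the label column and empty cells
            records.append(
                {"claim_type": row[id_column], "year": column, "amount": int(value)}
            )
    return records

long_rows = wide_to_long(wide_csv)
for record in long_rows:
    print(record)
```

Each record then holds one claim type, one year and one amount, which is the shape that list-based import tools generally expect.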

Nevertheless there is something pleasing about the ability to use software with data and it was fun to try it out. The functionality did not feel as intuitive as I expected and I have a fair amount of experience with statistical software.

I would love to know about anyone else’s experience with the visualisation tool.