Tag Archives: data

Data: Almost Elementary

“I have no data yet. It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.” p.189 The complete Sherlock Holmes, Volume 2

A colleague once asked me what the best tools were for doing data analysis. Initially, I wasn’t sure what she meant and then realised that she wanted some kind of script to run and get an answer. Data arrives, you do what you need to do and then report on what you find by following a formula.

It doesn’t work like that and even when it’s easy, it’s not usually that simple.

There is an article in the Guardian about ‘How to be a data journalist’ and, when answering the question of where to begin, they suggest the following:

“So where does a budding data journalist start? An obvious answer would be “with the data” – but there’s a second answer too: “With a question”.”

I don’t agree with the above. For me, it’s always about the question. When you don’t have a question but you do have data, you use the data to find a question. It’s always about the question because otherwise what are you writing about, and how will you know when you’re done?

The premise of the Guardian article is that ‘data journalists’ are compartmentalised and do certain things, but to me that seems slightly backwards. You don’t walk around looking for numbers. The Guardian’s Charles Arthur suggests “Find a story that will be best told through numbers”, which sounds a bit like a data journalist is someone who walks around with a metaphorical hammer looking for just the right nail.

The implication is that someone who didn’t know how to use ‘data’ wouldn’t be able to answer questions that involved analysing numbers. That just can’t be right. You have a question and you use any means (within reason) to answer it.

If the source of information is numerical data then there are certain skills you can use and some of them involve statistics, presentation, context or analysis. Other data can include text, documents, speeches, actions, music, stories etc.

When Sherlock talks about obtaining more data, he is not talking about numbers, although that may be part of it. He is talking about information on which to base a conclusion, to find a solution. I agree with providing knowledge of skills which are helpful: finding data, interrogating data, visualising data, mashing data. The last of these is a new concept to me but I will look it up later.

For now I’ll finish up with what I told my colleague when she wanted to know what to do: What is the question? What is the data? What do you want to do with it? How will you know when you’re done?


Data: all about me, or is it you?

TED is an organisation that promotes itself with the tag line ‘ideas worth spreading’. Presenters have included Tony Robbins, Steve Jobs, Elizabeth Gilbert, Richard Dawkins and Malcolm Gladwell. There are thousands more on the website (which also includes transcripts). TED seems to be the place where all the cool people converge when they want to spread a message about inspiring others and changing the world. This next person suggests that changing the world starts with us as individuals and can be done with a device very close to most of our hearts, the smartphone.

Gary Wolf, contributing editor to Wired and blogger for the Quantified Self, is a journalist who gives a five minute introduction to an intriguing new pastime: using mobile applications and always-on gadgets to track and analyze your body, mood, diet, spending, just about everything in daily life you can measure in glorious detail.


Wolf talks about how numbers are “useful when we reflect, learn, remember, and want to improve.” He goes on to add that “[t]he self is our operation center, our consciousness, our moral compass. So, if we want to act more effectively, we have to get to know ourselves better.”

The talk ends with the tempting thought that numbers are the way to get to know ourselves better but it doesn’t go on to say how. I suggest that the next step is to find some way to present the data and then to find some meaning in it through analysis.

David McCandless, another TED talker, is a ‘data journalist’ because he uses data and graphics to present a story. I attended his TEDx presentation in Brussels in 2009 and he made the fascinating point that the brain takes time to interpret numbers and text in order to give them context but it can absorb graphics instantly. See the figure below:

The keyword coverage during elections and terror alerts is presented graphically to show an apparent association between the two over time. Most show an association, although 2008 does not. The question at the end of the article is: ‘any correlation?’ Our eyes tell us there is a correlation, that the two events are related to the extent that when one happens we can increase our prediction rate of the other happening, but what does it mean? The association could be random. A comment made on the site by Tim suggests that he took the ‘time’ factor and, even after accounting for it, still found some correlation, which he calls highly significant. He probably means that there is less than a 0.0001% chance that the pattern we see is due to randomness.

However, and to use a favourite statistical saying, correlation does not equal causation. Just because we can see an association, it does not mean that one event causes the other. To find out whether the events occurred together by chance or have some significant (non-random) relationship, and what the strength, direction and cause of that relationship are, we would have to analyse the data, and there are various techniques available.

That’s a matter for another day. For now it is useful to note how seeing numbers in a graphical format can add context very quickly and suggest some pattern or story. If you are collecting information about your caffeine intake and regular moods then you might find an association between them which may not be so easy to spot when you look at them as numbers.

There are three elements to quantifying the self: 1) collecting the data; 2) presenting the data in order to identify associations; and 3) understanding what the associations mean.
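Once a picture suggests an association, one way to check it is to put a number on it. The sketch below is only an illustration: the caffeine and mood figures are made up, and the function names are my own. It computes a Pearson correlation and then shuffles the mood scores many times to see how often a correlation that strong turns up by chance. It assumes each day is independent of the last, which diary-style data may well not be.

```python
import random
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def permutation_p_value(xs, ys, n_shuffles=5000, seed=0):
    """Share of random re-pairings whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return hits / n_shuffles

# Made-up diary entries for twelve days, for illustration only.
cups_of_coffee = [1, 2, 0, 3, 4, 2, 1, 5, 3, 0, 2, 4]
mood_score = [6, 7, 5, 7, 8, 6, 6, 9, 8, 4, 7, 8]

r = pearson_r(cups_of_coffee, mood_score)
p = permutation_p_value(cups_of_coffee, mood_score)
print(f"r = {r:.2f}, permutation p-value = {p:.4f}")
```

A small p-value here only says the pairing is unlikely to be an accident of shuffling. It says nothing about whether the coffee causes the mood or the mood causes the coffee.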

A useful source for exploring the data collection stage is the Quantified Self site, which has tools to help. Ideas on presenting data are brilliantly illustrated on the informationisbeautiful.net site, and McCandless is also published in the Guardian. The third part, however, is rarely presented as easily or as prettily as the rest.

I’ll leave that last one for another day.

Visualising Bristol, Prisoners

Oliver Conner shared some of his favourite free technology tools in a recent post and I thought I would take the time to try some. The first one was Tableau Public, a visualisation tool which, as far as I can tell, provides an ‘easy’ way to make graphs and tables out of data.

My first output is a figure of the compensation paid to prisoners in Bristol as claims in civil litigation proceedings by type of claim.

Medical negligence payments of £119k dwarf all other claims, the highest of which, £18,540 for unlawful detention, is £100k less. The cause of this outlier is £112k paid out in 2006-07, which may make an interesting story, but at the moment it hides everything else. Medical negligence is removed for the rest of the examination.

I find graphics a useful way of presenting information that is immediately understandable rather than using data tables which need some translation to gain some meaning. The amount of money paid out by year is presented on its own to note trends over time.

The sum paid out almost triples between 04-05 and 05-06 and then continually decreases in the following years. So where has the change come from?


(Source: Guardian Data Store http://bit.ly/a4tmIC )

In 2004-05, £4,545 was paid out for Injuries – slips & trips and falls, and property (damaged or lost). In 2005-06, £10,920 was paid out for injury due to assault by other prisoners and £2,600 for unlawful detention.

The downward trend in the last three years is split between large payments for injury (assault by prisoners) in 2005-06 and unlawful detention in 2006-07 and 2007-08.

Four years don’t show much of a trend because of the differences between the first and last two years. The next stage is to obtain data that isn’t available from the Guardian. The Bristol Evening Post has an article on compensation from Bristol prisons for the years 2007 to 2009 based on Freedom of Information requests. There is more useful information, namely the names of the prisons. “The figures were provided by the National Offender Management Service (NOMS) for Horfield Prison and Eastwood Park – a prison for female inmates in Falfield. NOMS does not hold information for Ashfield Young Offenders Institute in Pucklechurch, which is run by a private company.”

I derived the 2008-09 figures by subtracting the 2007-08 sums from the two-year totals in the Evening Post data.

The table shows that injury payments were made in the first two years but not in the years that followed. Medical negligence payments appeared for the two years after that but not for 2008-09. The most consistent payment, in four out of the five years, is for unlawful detention. £12,600 was paid out between 2007 and 2009 by Eastwood Park.

A report on this prison was presented after an announced visit in 2000 stating that it had many functions to perform, such as drug rehabilitation, housing foreign nationals, child protection and various training programs. However, the report claimed that “it lacks the capacity to carry out any of these tasks completely”. By 2008, a further report stated that the prison “in spite of the considerable challenges, is performing reasonably well in all areas, and is carrying out some innovative and supportive work”.
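The arithmetic is simple enough to sketch. The figures below are placeholders rather than the real sums, and the sketch assumes both sources cover the same prisons and use the same claim categories, which is worth checking before trusting the result.

```python
# Placeholder figures in pounds; the real sums are in the Guardian
# and Evening Post sources and are not reproduced here.
guardian_2007_08 = {"Unlawful detention": 5000, "Medical negligence": 2000}
evening_post_2007_09 = {
    "Unlawful detention": 12000,
    "Medical negligence": 2000,
    "Property": 300,
}

def subtract_year(two_year_totals, first_year):
    """Second-year figures: two-year total minus the first year, per claim type."""
    return {
        claim: total - first_year.get(claim, 0)
        for claim, total in two_year_totals.items()
    }

derived_2008_09 = subtract_year(evening_post_2007_09, guardian_2007_08)
for claim, amount in derived_2008_09.items():
    print(f"{claim}: £{amount:,}")
```

A negative result for any claim type would be a sign that the two sources are not counting the same thing.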

I could look for further information on the unlawful detention at Eastwood Park but I think that this was enough to show what you can get from some data. The Tableau Public program was interesting to use but I would often have to press the back button to change things when I couldn’t figure out how to fix them. The biggest issue was converting the data from the Guardian into list form so I could import it. A lot of the functionality is already present in Excel pivot tables. A benefit to the free software is the ability to export it to the web and discuss it as part of the community. I used one embedded chart from the site and the rest were easier to use as screenshots.
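For anyone facing the same conversion, here is a rough sketch of turning a table with one column per year into list form, with one row per claim type and year. The table and column names are made up for illustration and this is not the layout of the Guardian file. It uses only Python’s standard library.

```python
import csv
import io

# A small made-up table in "wide" form: one row per claim type, one column per year.
wide_csv = """claim_type,2004-05,2005-06,2006-07
Injury (assault by prisoners),,100,50
Unlawful detention,20,,80
"""

def wide_to_long(text):
    """Turn each non-empty (claim type, year) cell into its own row."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        claim = record.pop("claim_type")
        for year, amount in record.items():
            if amount:  # skip blank cells, i.e. no payment recorded
                rows.append({"claim_type": claim, "year": year, "amount": int(amount)})
    return rows

long_rows = wide_to_long(wide_csv)
for row in long_rows:
    print(row)
```

Each printed row has a claim type, a year and an amount, which is the one-observation-per-row shape that I needed for the import.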

Nevertheless there is something pleasing about the ability to use software with data and it was fun to try it out. The functionality did not feel as intuitive as I expected and I have a fair amount of experience with statistical software.

I would love to know about anyone else’s experience with the visualisation tool.