Obtaining data is often the most difficult part of any analysis, as access to resources tends to be restricted to academia or to those who can afford to buy them. Data is rarely cheap. Even if you happen to be a student in the highly privileged position of being able to access much of what is available, what you want may not have been collected.
The data.gov.uk project was created in 2010 to make data much easier to obtain. This could be a significant matter for freelance researchers, although it’s hard to imagine why if you don’t use this sort of material. The move towards open access to data was followed by a second one which removed the charge for the National Statistics Postcode Directory (NSPD). This used to cost around £3000 a year and is now available for free. This is a very big deal.
Postcodes are useful as distance is not usually a trivial matter when analysing effects. Where you live matters as does where you shop, travel, send your children to school and work. There are many things associated with location such as funding, control, regulation, health regulations and other top-down measures. Being able to plot these, or map them as the case may be, is incredibly helpful.
There are some examples on the UCL Centre for Advanced Spatial Analysis site of work that is being done with a spatial focus: Virtual London sees the creation of a three-dimensional model of London; Pollution Mapping sees the creation of a 2D/3D interactive air pollution map of London with a system that not only shows current pollution levels, but also predictions through to the year 2010; and the geo-genealogy project which mapped surnames across Britain.
The NSPD can be ordered by completing the order form and returning it to ONS Geography by email.
Information is Beautiful recently won a competition run by Wired US magazine to revisualize a bloodwork test. This particularly appealed to me as my own anaemia was recently presented to me as the number 8.4. Data without context isn’t particularly helpful (see the original image of the results):
The context for my blood results was that the normal range of haemoglobin was 12 to 15 g/dL and mine was 8.4 g/dL.
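To show how little it takes to turn a bare number into something readable, here is a minimal Python sketch that places a reading against its normal range. The function name `describe_reading` is my own invention for illustration, and the figures are simply the ones quoted above, in whatever units the lab reported.

```python
def describe_reading(value, low, high):
    """Return a short sentence placing a reading against its normal range."""
    if value < low:
        shortfall = low - value
        pct = 100 * shortfall / low
        return (f"{value} is {shortfall:.1f} below the normal range "
                f"({pct:.0f}% under the lower limit of {low})")
    if value > high:
        excess = value - high
        pct = 100 * excess / high
        return (f"{value} is {excess:.1f} above the normal range "
                f"({pct:.0f}% over the upper limit of {high})")
    return f"{value} is within the normal range {low}-{high}"


# Figures quoted in the post: normal range 12 to 15, my reading 8.4
print(describe_reading(8.4, 12, 15))
```

Even a single sentence like this says more than the number alone, and a picture does the same job faster still.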
Now see the following image for how the appropriate data visualisation could make such numbers immediately understandable:
Merry Christmas everyone whether you celebrate or not. I couldn’t let a festive data theme opportunity go by so please enjoy the following yuletide infographic. Information on the origins, celebration and enjoyment of the Christmas tree is presented concisely and clearly in the graphic created by All In One Garden & Leisure and presented on Infographics Showcase.
Did you know that England used the Christmas Tree formation in the Euro 96 Football Championship?
Until December 24, the Radar column on the OReilly.com site is offering a bit of a data treat for all interested readers. So far the topics have covered charts, data from Wikipedia, designing your own visualisations and an exploration of where to find data. One of the sites listed is freebase.com, a veritable trove of data, and others include Amazon Public Data Sets, Windows Azure Data Market and Infochimps.
All of these topics are useful if you’re looking for ways to delve and explore the data possibilities out there. The bonus from the O’Reilly website is the good use of technical explanations where appropriate.
One of the most interesting and useful columns is about Processing, the program that is freely available and helps with setting up graphics to visualise data. Explore some of the examples and if the graphics are inspiring enough to make you want to create your own, check out the column and the website. Enjoy.
In line with the aim of being ‘the most open and transparent government in the world’, the Number10 website has released a section called transparency.
Information is listed under the following sections:
Business Plans: Check progress on implementing our policies
Who does what in Whitehall: and how much are they paid
Who ministers are meeting
Government contracts in full – coming in the new year
How your money is spent – coming later this month
An additional section to find all other government data that is not listed in the above sections
There is a list of policies under Business Plans which, combined with actions, would not look amiss in a company’s performance measures.
As data that can be statistically analysed it leaves a lot to be desired, but there is a promise of actual datasets in the future. However, there are some useful ways of examining what is available, and they start with understanding that quantitative data (things that can be measured) is not the only kind of evidence. Textual and image-based information can be analysed using techniques that are not statistical.
It is not like traditional research: you don’t have to test hypotheses statistically, develop scales or seek out representative samples in order to generalise about the entire population. You do need to ensure that you pursue good practices as determined by others, and there are various methodologies such as phenomenology, ethnography, grounded theory and discourse analysis. In fact, the range of research methods is so wide that the third edition of The SAGE Handbook of Qualitative Research has 42 chapters.
The appropriate technique depends on the question and what you would like to discover. In the summer of 2009, as part of an ethnographic project, I interviewed three people who worked in media in order to understand their perceptions of speaking or writing publicly and privately (e.g., writing as a journalist and using Twitter). I collected a series of interview scripts and used these as data.
Some benefits to quantitative analysis include being able to generalise findings from a sample towards the entire relevant population with a measurable degree of confidence. This ability is lost when dealing with qualitative analysis although there is a gain in the level of detail which may be more useful. After my research I don’t claim to know how all journalists communicate publicly or privately but I did get a sense of levels of privacy that vary according to each person and can then investigate this theory with others.
In a similar manner of understanding greater details, the government’s business plans are a series of actions and while they don’t allow for much generalisation across policy regimes and different parties in terms of statistical confidence, they do illustrate the manner of process that is used. The content published by the government highlights a way of working ‘efficiently’ and in a ‘corporate’ manner which may be more familiar to business and human resources managers.
So the qualitative data could lead to a hypothesis about running government as a business and suitable people for such an enterprise could be people with appropriate qualifications such as an MBA. The next task would be to look for other evidence that supports this hypothesis.
The story: Hunting Act convictions are at their highest yet, according to new figures, with 57 convictions in 2009 alone. Animal safety legislation usually has very low rates of offences and convictions; for example, the Deer Act 1991 had three offences and two convictions.
Convictions under the Hunting Act from 2005-2009 can be found on the spreadsheet alongside a comparison with convictions under other animal legislation. The Guardian asks “What can you do with this data?”
My lack of creativity could be at the forefront but I can’t see what else I can do with this data. I am impressed that an article could be written on so few figures, barely a description really, with the most important part being the headline. I do see, however, a starting point for some further questions.
There are three comments that follow the article in the Guardian and the third one by Sparclear seems the most useful.
“Pathetically low figures considering the widespread flouting of the Act.”
=> How much flouting of the rules has there been?
=> Is there a source of data for how many hunts go on? how many are illegal?
Many subsidiary industries evolved to thrive on the Hunt. As well as all the folks who look after the horses (and their vets and saddlers and livery stables and fencing and feedstuffs) there’s a whole class of outdoor workers, kennel keepers, beaters, gamekeepers, gunsmiths, types of forester whose winter economy is dictated by bloodsports. There are clothing shops and pubs and B & B’s, hotels, and farm kitchens which book the catering for particular events a whole year ahead.
=> What information is available on these activities? where would we find it?
Some information is available on the Guardian’s Hunting site but not all of it, especially not related to the extra activities. The additional data would be most useful for providing a context to the numbers provided. 91 offences and 57 convictions do make for the highest figures in the last five years but without any further context they are still quite meaningless.
Comparison to other animal related Acts may be a useful way of pointing out that this is not the same type of story.
TED is an organisation that promotes itself with the tag line ‘ideas worth spreading’. Presenters have included Tony Robbins, Steve Jobs, Elizabeth Gilbert, Richard Dawkins and Malcolm Gladwell. There are thousands more talks on the website (which also includes transcripts). TED seems to be a place where all the cool people tend to converge when they want to spread the message about inspiring others and changing the world. This next person suggests that changing the world starts with us as individuals and can be done with the use of a device very close to most of our hearts, the smartphone.
Gary Wolf, contributing editor to Wired and blogger for the Quantified Self, is a journalist who gives a five minute introduction to an intriguing new pastime: using mobile applications and always-on gadgets to track and analyze your body, mood, diet, spending, just about everything in daily life you can measure in glorious detail.
Wolf talks about how numbers are “useful when we reflect, learn, remember, and want to improve.” He goes on to add that “[t]he self is our operation center, our consciousness, our moral compass. So, if we want to act more effectively, we have to get to know ourselves better.”
The talk ends with the tempting thought that numbers are the way to get to know ourselves better but it doesn’t go on to say how. I suggest that the next step is to find some way to present the data and then to find some meaning in it through analysis.
David McCandless, another TED talker, is a ‘data journalist’ because he uses data and graphics to present a story. I attended his TEDx presentation in Brussels in 2009 and he made the fascinating point that the brain takes time to interpret numbers and text in order to give them context but it can absorb graphics instantly. See the figure below:
The keyword coverage during elections and terror alerts is presented graphically to show an apparent association between the two over time. Most years show an association although 2008 does not. The question at the end of the article is: ‘any correlation?’ Our eyes tell us there is a correlation, that the two events are related to the extent that when one happens we can better predict the other happening, but what does it mean? The association could be random. A comment made on the site by Tim suggests that he took the ‘time’ factor and, even after accounting for it, still found some correlation which he calls highly significant. He probably means that there is less than a 0.0001% chance that the pattern we see is due to randomness.
However, and to use a favourite statistical saying, correlation does not equal causation. Just because we can see an association, it does not mean that one event causes the other. To find out whether events did occur randomly, have some significant (non-random) relationship and what is the strength, direction and cause of this relationship, we would have to analyse the data and there are various techniques available.
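One of those techniques can be sketched in a few lines of Python: a permutation test, which shuffles one series many times and counts how often the shuffled data looks as strongly correlated as the real data. This is only a sketch under loud assumptions: the monthly counts below are made up for illustration and are not the figures behind the graphic, the function names are my own, and this is not necessarily the method the commenter used. Shuffling also ignores the ordering in time, so it says nothing about the ‘time’ factor and certainly nothing about cause.

```python
import random
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(xs, ys, n_shuffles=10_000, seed=0):
    """Share of random shuffles whose correlation is at least as strong
    (in absolute value) as the one actually observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return hits / n_shuffles


# Made-up monthly keyword counts, for illustration only
election_mentions = [2, 3, 5, 8, 13, 9, 4, 3, 2, 6, 11, 7]
terror_mentions = [1, 2, 4, 7, 10, 8, 3, 2, 2, 5, 9, 6]

r = pearson_r(election_mentions, terror_mentions)
p = permutation_p_value(election_mentions, terror_mentions)
print(f"r = {r:.2f}, permutation p-value = {p:.4f}")
```

A small p-value here only says that the pattern is unlikely to be an accident of ordering; whether one event drives the other is a separate question.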
However, that’s a matter for another day. For now it is useful to note how seeing numbers in a graphical format can add context very quickly and suggest some pattern or story. If you are collecting information about your caffeine intake and regular moods then you might find an association between them which may not be so easy to spot when you look at them as numbers.
There are three elements to quantifying the self: 1) collecting the data; 2) presenting the data in order to identify associations; and 3) understanding what the associations mean.
A useful source for exploring the data collection stage is the Quantified Self site which has tools to help. Ideas on presenting data are brilliantly illustrated on the informationisbeautiful.net site and McCandless is also published in the Guardian. The third part, however, is rarely explained as easily or as prettily as the rest.
Oliver Conner shared some of his favourite free technology tools in a recent post and I thought I would take the time to try some. The first one was Tableau Public, a visualisation tool which, as far as I can tell, provides an ‘easy’ way to make graphs and tables out of data.
My first output is a figure of the compensation paid to prisoners in Bristol as claims in civil litigation proceedings by type of claim.
Medical negligence payments of £119k diminish the impact of all other claims which at their highest, £18,540 for unlawful detention, are £100k less. The cause for this outlier is £112k paid out in 2006-07 which may make an interesting story but at the moment it hides everything else. Medical negligence gets removed for the rest of the examination.
I find graphics a useful way of presenting information that is immediately understandable rather than using data tables which need some translation to gain some meaning. The amount of money paid out by year is presented on its own to note trends over time.
The sum paid out almost triples between 04-05 and 05-06 and then continually decreases in the following years. So where has the change come from?
In 2004-05, £4,545 was paid out for Injuries – slips & trips and falls, and property (damaged or lost). In 2005-06, £10,920 was paid out for injury due to assault by other prisoners and £2,600 for unlawful detention.
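The ‘almost triples’ claim can be checked with a few lines of Python. This is a sketch that assumes the amounts quoted above are the full payouts for each year once medical negligence is left out; the variable names are my own.

```python
# Payouts quoted in the post, excluding medical negligence (pounds)
paid_2004_05 = 4_545            # injuries (slips, trips and falls) and property
paid_2005_06 = 10_920 + 2_600   # assault by other prisoners + unlawful detention

ratio = paid_2005_06 / paid_2004_05
print(f"2005-06 total: £{paid_2005_06:,}")
print(f"Ratio to 2004-05: {ratio:.2f}")
```

On those assumptions the 2005-06 total is £13,520, a little under three times the 2004-05 figure.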
The downward trend in the last three years is split between a large payment for injury (assault by prisoners) in 2005-06 and payments for unlawful detention in 2006-07 and 2007-08.
Four years don’t show much of a trend because of the differences between the first and last two years. The next stage is to obtain data that isn’t available from the Guardian. The Bristol Evening Post has an article on compensation from Bristol prisons for the years 2007 to 2009 based on Freedom of Information requests. There is more useful information, namely the names of the prisons. “The figures were provided by the National Offender Management Service (NOMS) for Horfield Prison and Eastwood Park – a prison for female inmates in Falfield. NOMS does not hold information for Ashfield Young Offenders Institute in Pucklechurch, which is run by a private company.”
I derived the 2008-09 figures by subtracting the 2007-08 sums from the two-year totals in the Evening Post data.
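The step is simple enough to sketch in Python: take the two-year total for each type of claim and take away what is already known for the first year. The function name and the figures below are hypothetical placeholders, not the Evening Post or Guardian numbers.

```python
def derive_second_year(two_year_totals, first_year):
    """Subtract known first-year sums from two-year totals, per claim type.

    Claim types missing from the first year are treated as zero.
    """
    return {
        claim_type: total - first_year.get(claim_type, 0)
        for claim_type, total in two_year_totals.items()
    }


# Hypothetical placeholder figures in pounds, for illustration only
totals_2007_09 = {"Unlawful detention": 9_000, "Property (damaged or lost)": 1_200}
sums_2007_08 = {"Unlawful detention": 5_000}

derived_2008_09 = derive_second_year(totals_2007_09, sums_2007_08)
print(derived_2008_09)
```

The result is only as good as the match between the two sources: if they categorise claims differently, or cover different prisons, the subtraction will mislead.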
The table shows that injury payments were made in the first two years but nothing in the following, medical negligence payments appeared for the two years after that but nothing for 2008-09. The most consistent payment for four out of the five years is for unlawful detention. £12,600 was paid out between 2007-09 by Eastwood Park.
A report on this prison was presented after an announced visit in 2000 stating that it had many functions to perform, such as drug rehabilitation, housing foreign nationals, child protection and various training programs. However, the report claimed that “it lacks the capacity to carry out any of these tasks completely”. By 2008, a further report stated that the prison “in spite of the considerable challenges, is performing reasonably well in all areas, and is carrying out some innovative and supportive work”.
I could look for further information on the unlawful detention at Eastwood Park but I think that this was enough to show what you can get from some data. The Tableau Public program was interesting to use but I would often have to press the back button to change things when I couldn’t figure out how to fix them. The biggest issue was converting the data from the Guardian into list form so I could import it. A lot of the functionality is already present in Excel pivot tables. A benefit to the free software is the ability to export it to the web and discuss it as part of the community. I used one embedded chart from the site and the rest were easier to use as screenshots.
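For anyone facing the same conversion, here is a minimal Python sketch of turning a ‘wide’ table (one column per year) into list form (one row per claim type and year). I am assuming that is the layout the import expects, which is how I understood it. The function name, column names and figures are placeholders of my own, not the Guardian’s data.

```python
def wide_to_long(rows, id_column, variable_name="year", value_name="amount"):
    """Turn one-column-per-year rows into one row per (id, year) pair."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the id is repeated on every output row instead
            long_rows.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: value,
            })
    return long_rows


# Placeholder figures in pounds, for illustration only
wide = [
    {"claim_type": "Unlawful detention", "2005-06": 100, "2006-07": 200},
    {"claim_type": "Property (damaged or lost)", "2005-06": 50, "2006-07": 0},
]

for record in wide_to_long(wide, "claim_type"):
    print(record)
```

Each claim type now appears once per year, so a tool can treat the year as a field of its own rather than as a set of separate columns.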
Nevertheless there is something pleasing about the ability to use software with data and it was fun to try it out. The functionality did not feel as intuitive as I expected and I have a fair amount of experience with statistical software.
I would love to know about anyone else’s experience with the visualisation tool.