Big data: is small beautiful?

According to the 2014 Horizon Report for K-12 education (i.e. Kindergarten to Grade 12, roughly ages 5 to 18), big data and analytics will be adopted by education within the next two to three years. Big data. It’s the current buzzword in education (one of several, at least), but what exactly is it? Is it really of any practical use? And should we be using it anyway?

What are big data and analytics?

According to the Horizon Report,

Data are routinely collected, measured, and analyzed in the consumer sector to inform companies about nearly every aspect of customer behaviour and preferences. A number of researchers and companies are working to design similar analytics that can reveal patterns in learning-related data that could be used to improve learning both for individual students, and across institutions and systems.


The predictive aspect of learning analytics is anticipated to transform the very nature of teaching and learning, helping to address growing concerns about overall outcomes like university and job readiness. Predictive analytics assesses student data such as attendance, subjects taken, and testing to help surface early signals that indicate if a student is struggling, allowing teachers and schools to address issues quickly.

2014 Horizon Report K-12

Crispin Weston also provides a succinct summary of what big data and learning analytics aim to do:

Learning analytics aims to apply to learning the same techniques that are now used by online companies to target their online marketing. By spotting patterns in the data produced by students’ online learning activity, learning analytics systems should be able to help:

  • predict student progress;
  • inform adaptive learning strategies (sequencing digital learning activities or recommending human interventions);
  • profile a student’s current capabilities;
  • automatically group students, depending on their learning needs;
  • identify the most effective learning strategies in different situations;
  • aggregate and present complex data in ways which helps administrators, teachers and students manage instructional processes.

MOOCs and other ed-tech bubbles

The barriers to adoption

Sounds wonderful, doesn’t it? Put in a load of data, run an algorithm or two, and out pop predictions of student grades and recommendations (or at least data on which you can base recommendations) for improving them. There are, however, two main barriers to adoption, assuming that we think adopting this approach is a good idea in the first place.

First, as is often the case, there are practical barriers: teachers are busy, schools are busy, the data and analytics need to be packaged in a form that makes them usable, and people need training in using and interpreting them. Even if all these issues were addressed, there would still be the sort of inertia that arises whenever something new is introduced into an established culture.

To quote once again from The Horizon Report,

Companies including Amazon, Google, and Facebook have been built on the insights of complex thinkers and communicators, who have popularized the use of big data to capture user-derived data in real-time, redefined the way we conceptualize consumer behaviour, and have built an entirely new industry based on this work. While data mining skills can be applied across virtually any sector, schools are not yet adept at encouraging greater development of these aptitudes through complex thinking and communication.

Second, there is a technical problem. The kind of educational data one might wish to use resides in different datasets and is formatted (meta-tagged) in different ways. These different sets of data cannot easily be merged into one huge database (and doing so would be unlikely to be scientifically robust even if they could, because the data were collected in different ways and for different purposes). As Crispin Weston says:

analytics is predicated on "big data" but in education, big data will not exist until we sort out the current failure of interoperability

MOOCs and other ed-tech bubbles 
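To make that interoperability problem concrete, here is a toy sketch (the field names and formats are invented for illustration) of the same pupil recorded by two school systems that tag the data differently. Nothing lines up until both records are normalised to a common schema, which is exactly the agreement that does not yet exist:

```python
from datetime import datetime

# Two hypothetical school systems recording the same fact with different tags
system_a = {"pupil_id": "1001", "dob": "2003-09-14"}             # ISO date string
system_b = {"student_ref": 1001, "date_of_birth": "14/09/2003"}  # UK date format

# A naive merge cannot match these up: keys and formats disagree.
# Interoperability means first agreeing a common schema:
def normalise_a(rec):
    return {"id": int(rec["pupil_id"]),
            "dob": datetime.strptime(rec["dob"], "%Y-%m-%d").date()}

def normalise_b(rec):
    return {"id": rec["student_ref"],
            "dob": datetime.strptime(rec["date_of_birth"], "%d/%m/%Y").date()}

# Once normalised, the two records turn out to describe the same pupil
print(normalise_a(system_a) == normalise_b(system_b))  # True
```

Trivial for two records; the point is that someone has to write (and maintain, and agree on) those normalisation rules for every dataset involved before any "big" merging can begin.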

But what about …


Data mining enables you to dig deeper, find hitherto hidden connections and ask different, bigger kinds of questions. For example, you could, in theory, find out how student attainment changes when you change their teacher, their room, where they sit in the room, or any other set of variables.

But... Are those patterns real? Or does big data induce pareidolia, which is defined on the BBC's website as follows:

It's "the imagined perception of a pattern or meaning where it does not actually exist", according to the World English Dictionary. It's picking a face out of a knotted tree trunk or finding zoo animals in the clouds.

Pareidolia: Why we see faces in hills, the Moon and toasties

I can give you an example of how I derived patterns from purely fictional data, using a large dataset and pivot table analysis in Excel. Some years ago I created a super-large database comprising over a thousand records and around 30 fields. The data was purely fictional, and consisted of the names of people and various characteristics such as height, age, colour of hair, taste in music, location and so on. By squirting the data into a pivot table in Excel I was able to detect a hitherto unseen pattern: a higher proportion of people who liked Blues music were divorced. This poses a bit of a problem:

Correlation is not causation

What could account for such a phenomenon? Several things come to mind:

  • People who like Blues music are pretty depressing to be with, so can’t make a marriage work
  • Getting married caused them to get depressed and start listening to Blues music
  • Their spouses got so fed up with hearing Blues music that they filed for divorce

Any of these could be true, or none of them. But it doesn’t matter anyway because the data was fictional, and all I managed to do was use a technique (pivot table analysis) to discern a pattern that may or may not have really been there. For all I know, the colour of people’s hair might have been a better predictor of whether or not they would end up in the divorce courts. And even if there is a correlation, that doesn’t say anything about causation, so it’s not obvious how this finding could be put to practical use even if the data were not fictional and the pattern seen was not an illusion.
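The pivot-table experiment above is easy to reproduce. Here is a minimal sketch in Python rather than Excel, assuming nothing more than uniformly random (i.e. entirely fictional) choices of music taste and marital status. Any "pattern" it surfaces is pure noise, which is precisely the point:

```python
import random
from collections import Counter

random.seed(42)  # fictional but reproducible data

MUSIC = ["Blues", "Pop", "Classical", "Jazz"]
STATUS = ["married", "divorced", "single"]

# A thousand purely fictional records, as in the experiment described above
records = [
    {"music": random.choice(MUSIC), "status": random.choice(STATUS)}
    for _ in range(1000)
]

# A pivot-table-style cross-tabulation: counts of each status per music taste
crosstab = Counter((r["music"], r["status"]) for r in records)

# Proportion divorced within each music taste
for genre in MUSIC:
    total = sum(crosstab[(genre, s)] for s in STATUS)
    divorced = crosstab[(genre, "divorced")]
    print(f"{genre:<10} {divorced / total:.1%} divorced")
```

Run it and the proportions will differ between genres, because random samples always wobble; one genre will look "more divorced" than the others. With 30 fields instead of two, the number of such accidental patterns explodes.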

How comprehensive should the data be?

I could make a really good case for stating that none of the studies I’ve seen into, say, the efficacy of using education technology are any good, because none of them take account of the weather. Any schoolteacher will tell you that when it’s windy, student behaviour deteriorates. If you don’t believe me, pop into any school on a windy day or, far easier to arrange, look at the standard of driving. To be really useful, education research should take account of the weather prevailing on any particular day.

Of course, if the data is meta-tagged in a consistent way, you can "mash up" data from different datasets, or overlay them, so you could factor in such things as weather even if you haven’t collected the data yourself. But this example does serve to indicate, I think, that even if you have lots of data, it may not be comprehensive enough. I also wonder, given the fact that we are all individual human beings, could it ever be comprehensive enough?
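If the meta-tagging really were consistent, the overlay itself is straightforward. A toy sketch, with entirely hypothetical field names and values, joining attainment data to weather data on a shared date key:

```python
from datetime import date

# Hypothetical, consistently tagged datasets: attainment by date, weather by date
attainment = {
    date(2014, 11, 3): {"mean_score": 62},
    date(2014, 11, 4): {"mean_score": 55},
}
weather = {
    date(2014, 11, 3): {"windy": False},
    date(2014, 11, 4): {"windy": True},
}

# "Mash up" the two datasets on their shared key (the date)
merged = {
    d: {**attainment[d], **weather[d]}
    for d in attainment.keys() & weather.keys()
}
print(merged[date(2014, 11, 4)])  # {'mean_score': 55, 'windy': True}
```

The merge is one line; the hard part, as ever, is that both datasets must agree on what the key is and how it is formatted before that line can work.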

Mullah Nasrudin's lost keys

Mullah Nasrudin is a figure in Persian literature. In one of his stories, someone comes across him looking under a lamp post for his keys.

"Where did you lose them?", someone asks.

"Down the road", replies Nasrudin.

"Then why are you looking here?"

"Because it's dark down the road. How would I be able to see anything there?"

I sometimes think research studies can be like that. For instance, lots of research states that there is no clear link between the use of technology and educational attainment. I think what they really mean is that they haven't been able to devise and carry out a study that separates out all the possible influencing factors. Any teacher who has embraced the use of technology in lessons will tell you that it works.

"Offline" comments and grades

What about data that is either not entered into a computer or stored online, or which is in the form of comments? One of the things I used to do with a paper markbook was have a column for whether students asked good questions, and another column for how much help they required. I could have encoded my entries in these columns in order to enter them into a database or spreadsheet, but:

  • The coding would be subjective and not necessarily meaningful to anybody else
  • It would have been pointless anyway: if, say, I thought a test result looked strange, all I had to do was look in my markbook. Then I might say "Ah, yes, she did need a lot of help in that topic" or "Ah yes, she did ask some pretty incisive questions in that topic"

Difficult-to-collect data?

If someone has had an argument that morning, or lost their pet hamster last night, their concentration will be affected. That fact will be lost in "big data": it will be averaged out; it will be regarded as an "outlier", whether consciously or unconsciously.

But excluding outliers can be positively dangerous in some contexts. I am thinking of the health field in particular. Most people do not develop heart problems until their fifties. That means that a person in their twenties who reports certain symptoms may go undiagnosed, because no doctor thinks to check for a heart problem in someone so young. To be honest, I cannot think of an equivalent scenario in education, but I think it's worth making the point that in big data, anything that doesn't conform to the "norm" or average will be ignored. Paradoxically, big data is also, in some respects, narrow data.
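A toy illustration of the point, with invented numbers: a routine "remove outliers" clean-up step discards precisely the case that most needed attention.

```python
import statistics

# Fictional ages of patients reporting a particular symptom; one is unusually young
ages = [54, 58, 61, 55, 59, 57, 62, 24]

mean = statistics.mean(ages)
sd = statistics.pstdev(ages)

# A naive clean-up: discard anything more than 2 standard deviations from the mean
kept = [a for a in ages if abs(a - mean) <= 2 * sd]
dropped = [a for a in ages if abs(a - mean) > 2 * sd]
print(dropped)  # [24] — the young patient, silently removed from the analysis
```

Nothing in the arithmetic is wrong; the danger is that the filter encodes an assumption (heart patients are old) that the one interesting data point contradicted.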

I'm reminded of an observation attributed to Einstein:

"Everything that can be counted does not necessarily count; everything that counts cannot necessarily be counted."

To summarise, here is an infographic representing what I think "big data" is in practice:

A cynical view of big data? Or a realistic one?

Small-scale data

Small data is more important than we ever thought. Tim Hitchcock, in Big data, small data and meaning, presents something called Voice Onset Timing:

My favourite tiny fragment of meaning – the kind of thing I want to find a context for - comes out of Linguistics. It is new to me, and seems a powerful thing: Voice Onset Timing – that breathy gap between when you open your mouth to speak, and when the first sibilant emerges. This apparently changes depending on who you are speaking to – a figure of authority, a friend, a lover. It is as if the gestures of everyday life can also be seen as encoded in the lightest breath. Different VOTs mark racial and gender interactions, insider talk, and public talk.

In other words, in just a couple of milliseconds of empty space there is a new form of close reading that demands radical contextualisation (I am grateful to Norma Mendoza-Denton for introducing me to VOT). And the same kind of thing could be extended to almost anything. The mark left by a chisel is certainly, by definition, unique, but it is also freighted with information about the tool that made it, the wood and the forest from which it emerged; the stroke, the weather on the day, and the craftsman.

So I think back to my days of teaching: was a student's pause before answering a question an indication not of a lack of understanding, but of a deeper answer being thought through? (As Oscar Wilde pointed out, in examinations the foolish ask questions that the wise cannot answer.)

But all is not lost. As Lyndsay Grant says:

The use of big data analytics in education is not necessarily useless or insidious. For one thing, it can provide a useful additional perspective, ‘from the outside in’ about learners’ development. But we need to consider the implications and consequences of using big data analytics as our main way of knowing about education. It tends to simplify big social and political questions about what kinds of learners we are and want to be, or how education should respond to major social and economic challenges, to a simple process of prescribing the next piece of educational software to download.

Understanding Education through Big Data

And as Andrew McCains points out:

Big data can inform creativity but is no substitute for human ingenuity.

It’s what you do with data that counts

Tim Hitchcock again:

One of the great ironies of the moment is that in the rush to big data – in the rush to encompass the largest scale, we are excluding 99% of the data that is there.

Big data, small data and meaning 

Does this mean we should simply give up? No, but we can use our own "big data" on a smaller scale. I'll give a few examples of what I mean.

  • When I was head of department in a secondary school, I created a spreadsheet into which each member of my team entered mock examination results. The spreadsheet displayed the predicted grade for each student based on their coursework grades, and highlighted those students who were set to achieve a lower final grade than expected.
    Most of the data we obtained from the spreadsheet was accurate. But the reason the flagging up of odd-looking results worked so well was that we knew the kids. If a student attained an unusually low mark for him or her, we could talk to the student to find out what happened. Likewise, if the predicted grade was far higher than we'd expected, we'd check to see if there was evidence of cheating.
  • These days I would try to persuade my school to use an application like SIMS-Discover, for quite a big picture of data in general, including assessment data, or Target Tracker[i] for specifically assessment data.
    These sorts of applications present large swathes of data in a visual format, making it easy to spot trends and anomalies. They are sometimes referred to as "dashboard" applications, because you can see at a glance what is going on in your school. It seems to me that this sort of small-scale big data is much more useful in practice than the massive big data we are used to hearing about.
  • I used to use a program called, I think, RM Auditor. What it did was collate a whole array of information from across the school's computer network, and I was able to view the data in a variety of ways. For example, I was able to see that although a student had logged on to the computer network every day after school and started the word processor program, he had also, at the same time, started playing a game. In other words, he had not been working at all, which helped to explain a low examination mark.
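The spreadsheet in the first example above was doing something very simple, which is part of why it worked. A minimal sketch of that flagging logic (the names, marks and threshold here are invented for illustration):

```python
# Fictional student records: a coursework mark and a mock examination mark
students = [
    {"name": "Alice",  "coursework": 72, "mock": 70},
    {"name": "Ben",    "coursework": 68, "mock": 45},  # far below expectation
    {"name": "Chloe",  "coursework": 55, "mock": 80},  # far above expectation
]

THRESHOLD = 15  # how far a mock may drift from coursework before we flag it

def flag(student):
    """Return a note for a human to act on, or None if nothing looks odd."""
    gap = student["mock"] - student["coursework"]
    if gap <= -THRESHOLD:
        return "below expectation: talk to the student"
    if gap >= THRESHOLD:
        return "above expectation: check the paper"
    return None

for s in students:
    note = flag(s)
    if note:
        print(s["name"], "-", note)
```

Note that the output of the code is a prompt for a conversation, not a verdict: the reason the real spreadsheet worked was that the people reading the flags knew the kids.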

Big data may be the flavour of the decade, but it is not unambiguously useful or trustworthy. I think schools should think, instead, of the scale and type of data that would be useful and relevant to them, and not succumb to all the big data hype.

Related articles in this edition that you may find useful are:

  • The things you can do with data! Part 2, and
  • Drowning in data?

These are published in the December 2014 edition of Digital Education.



[i] Note that Target Tracker has been updated to take into account the “demise” of Levels, and the website is being updated to reflect that fact, although references to APP have been retained because some schools find it helpful. Chris Smith, of Target Tracker, told me: "Target Tracker will continue to support levels and APP and reporting which appeals especially for those teaching current year 6 and 2 as well as any schools who feel unsure about moving over. For those schools who have decided to move we have created a major new section of the software which incorporates detailed facilities for formative assessment, gap analysis recording, next steps for children, exemplification, and in April parent reporting, collecting evidence and correlation of test data to teacher assessments within the National Curriculum 2014."

This article first appeared in Digital Education, the free newsletter for those with a professional interest in educational ICT and Computing. One of the benefits of subscribing – apart from access to unique content – is receiving articles in a timely manner. For example, this article was published in the December 2014 edition. To sign up, please complete the short form on our newsletter page. We use a double opt-in system, and you won’t get spammed.