From time to time I am asked to comment on other people’s unpublished research. As part of the evidence offered in the manuscript, it is quite common to see analysis based on anonymous questionnaires conducted before and after a pedagogic intervention. In this post I want to raise some concerns about the significant limitations that arise from the unnecessary anonymisation of survey data.
Why offer anonymity?
Firstly, however, it is worth examining the allure of anonymity. From conversations I’ve held with colleagues, the main attraction of anonymisation is the perception that removal of identifiers will free participants to provide full and frank contributions, secure in the knowledge that there can be no personal come-back.
I want to argue here that there are important research benefits from *avoiding* complete anonymity, except on the vanishingly rare occasions where it is vital that contributors cannot be recognised.
1. Keeping identifiers allows for richer analysis. If you can match pre- and post-intervention data it is possible to report on changes relating to individuals which may have been masked by analysis of the cohort as a whole.
2. Keeping identifiers guards against inappropriate comparison of whole cohort data. There is a temptation to take all of the available pre-intervention data and compare it with the complete set of post-intervention data, thereby ensuring that a minimum of data is “wasted”. I believe that this is wrong-headed. To illustrate the point, consider the following scenario in education research.
You have carried out some new teaching activity with students and you want to know whether their attitude to the topic has changed as a result. Your anonymous survey finds that 37% of the students were enthusiastic about the subject before your novel intervention and that this has gone up to 62% afterwards. An increase of 25 percentage points is clearly a positive outcome, isn’t it? Well – possibly not, and let’s see why.
You conducted your pre-intervention survey in a laboratory practical at the start of term. The class were first years, fresh into the university and compliant. What’s more, the session was compulsory so you got excellent coverage of the cohort, with 323 students completing the survey form.
In contrast, your second survey was collected during a 9am lecture in the last week of term, the day before a major piece of coursework was due (and, it turns out, the morning after the Sports Societies bar crawl). As a consequence of these various factors, attendance at the lecture was down… a lot. In fact, there were only 154 people present.
How does this additional information affect our interpretation of the impact of your new teaching activity? The headline improvement from 37% to 62% looks a bit less glossy when we translate this into actual numbers.
- 37% of 323 in the pre-intervention cohort equates to 119.5 people. Assuming that only whole numbers of persons attended the session, this means 120 people were enthusiastic about the topic before you started teaching them.
- After the intervention, 62% of 154 responded positively to the relevant question. This is 96 people.
So, an apparent increase of 25 percentage points actually converts into a fall of 24 students. Looking at the numbers, which is correct – did your intervention have a positive influence on the students, a negative influence, or neither? You might like to hope it is the first of these but, in the absence of concrete evidence that the drop in the number of respondents is evenly distributed across the spectrum of students on the course, any such inference is wishful speculation. Another potentially useful (and time-consuming) study has been sacrificed on the altar of anonymity.
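To see how little the whole-cohort comparison pins down, the arithmetic above can be sketched in a few lines of Python. The headcounts (120 of 323, and 96 of 154) come from the scenario. Everything else is an assumption made for illustration, in particular the idea that everyone at the end-of-term lecture had also filled in the start-of-term form, which the anonymous data cannot confirm.

```python
# Headcounts taken from the worked example above.
pre_n, pre_yes = 323, 120    # respondents / enthusiastic, start of term
post_n, post_yes = 154, 96   # respondents / enthusiastic, end of term

pre_pct = 100 * pre_yes / pre_n      # roughly 37%
post_pct = 100 * post_yes / post_n   # roughly 62%

# ASSUMPTION (not stated in the scenario): all 154 end-of-term respondents
# also completed the start-of-term survey. Without identifiers we cannot
# tell how many of them were already enthusiastic, only the limits.
pre_no = pre_n - pre_yes                  # not enthusiastic at the start
min_already = max(0, post_n - pre_no)     # fewest who could already have been enthusiastic
max_already = min(pre_yes, post_n)        # most who could already have been enthusiastic

# Change in enthusiasm among those same 154 people, in percentage points.
best_case = 100 * (post_yes - min_already) / post_n
worst_case = 100 * (post_yes - max_already) / post_n

print(f"Whole-cohort figures: {pre_pct:.0f}% before, {post_pct:.0f}% after")
print(f"Already enthusiastic among the 154: between {min_already} and {max_already}")
print(f"Change for those 154: between {worst_case:+.1f} and {best_case:+.1f} points")
```

Under that assumption the same survey results are compatible with anything from a sizeable gain to a clear loss among the students who answered both times. Matched responses would replace the range with a single figure.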
Improving the validity of data
Putting to one side issues regarding the very different contexts in which the before and after data were collected, is there anything you could have done to allow for matching of pre- and post- questionnaires so that only those people known to have participated in both surveys are included? There are several possibilities:
1. Ask students to include their names – clearly the most straightforward route to a unique identifier is to ask the students to use their names. This approach loses any advantage that anonymity may have offered, but if you promise that names are only used to match pre- and post- intervention forms which will then be assigned a number, then students may still be willing to be frank. I think, however, that a range of strategies retain the benefits of anonymisation whilst allowing matching.
2. Candidate number or email username – as with a student’s name, their candidate number or their email username ought to guarantee the capacity to match surveys. Again, these identifiers demand a certain trust from the students that you are not going to reverse engineer the code and determine who’s who, but because you would have to actively seek out that information, these are better options than crudely naming the responses.
3. Date of birth – a student’s date of birth is a possible alternative, but has a couple of limitations. Firstly, you could still – if you were so motivated – break the code and identify them. Secondly, and more significantly, there will almost inevitably be two or more students in the cohort with the same date of birth. If it is a paper-based survey you can generally distinguish between students with the same date of birth on the basis of their handwriting. However, for electronic surveys, without such capacity, this is probably a fatal limitation to this approach.
4. Favourite colour, pop band, etc – although it might be attractive to ask students bland questions such as their favourite colour as an identifier, this runs the obvious risk that more than one student might choose the same combination. Just as likely, they might forget the answer they picked before – after all, who hasn’t experienced that tele-banking moment when the memorable date you told them last time isn’t the memorable date you offered this time? Probably not a workable solution.
5. Other information that is unique to the individual and unchanging – probably best of all is a combination of other factual pieces of information about which students are unlikely to be confused. The use of mother’s maiden name as a security question may make students reluctant to give you this detail. However, a combination of mother’s given name (Christian name) plus place of birth might be sufficiently discriminating. In contexts where this is not enough, a third piece of information (e.g. number of first cousins?) could be added.
6. Combining several factors into a specific code – as I came to the end of this post I realised that I have actually recommended a system before (I won’t repeat it here, follow this link to see the details). At the time my interest in the topic was largely theoretical, but several recent events have suggested that advice on this issue needs to be more broadly available. In addition, the worked example above (a lightly fictionalised version of a real mistake I encountered) shows why anonymity needs to be avoided.
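As a rough illustration of options 5 and 6, here is a minimal Python sketch of a self-generated code and the matching step it makes possible. This is not the system described at the link above. The function names `make_code` and `match_surveys`, the three-letter rule, and the 1 to 5 enthusiasm scores are all hypothetical choices made for the example.

```python
from collections import Counter


def make_code(mother_given_name: str, place_of_birth: str) -> str:
    """Hypothetical code: first three letters of each answer, upper-cased."""
    def part(answer: str) -> str:
        letters = [ch for ch in answer.upper() if ch.isalpha()]
        return "".join(letters[:3]).ljust(3, "X")  # pad short answers with X
    return part(mother_given_name) + part(place_of_birth)


def match_surveys(pre, post):
    """Pair responses by code, keeping only codes that occur exactly once
    in each survey, so ambiguous or unmatched forms are left out."""
    pre_counts = Counter(code for code, _ in pre)
    post_counts = Counter(code for code, _ in post)
    pre_scores = {c: s for c, s in pre if pre_counts[c] == 1}
    post_scores = {c: s for c, s in post if post_counts[c] == 1}
    return {c: (pre_scores[c], post_scores[c])
            for c in pre_scores if c in post_scores}


# Invented responses: (code, enthusiasm score from 1 to 5).
pre = [
    (make_code("Anne", "Leeds"), 2),
    (make_code("Maria", "Cardiff"), 4),
    (make_code("Joan", "York"), 3),
    (make_code("Joanne", "York"), 5),   # collides with the previous code
    (make_code("Li", "Bath"), 1),       # absent from the second survey
]
post = [
    (make_code("Anne", "Leeds"), 4),
    (make_code("Maria", "Cardiff"), 4),
    (make_code("Joan", "York"), 4),
]

matched = match_surveys(pre, post)
changes = {code: after - before for code, (before, after) in matched.items()}
print(matched)
print(changes)
```

Only the two unambiguous, twice-present respondents survive the matching. The colliding code is dropped rather than guessed at, which is the situation where a third piece of information would earn its place.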
Does anyone have any other recommended schemes?