Opinion Article
Revised

Ethical research standards in a world of big data

[version 2; peer review: 3 approved with reservations]
PUBLISHED 21 Aug 2014

This article is included in the Research on Research, Policy & Culture gateway.

Abstract

In 2009 Ginsberg et al. reported using Google search query volume to estimate influenza activity in advance of traditional methodologies. It was a groundbreaking example of digital disease detection, and it remains illustrative of the power of gathering data from the internet for important research. In recent years, the methodologies have been extended to include new topics and data sources; Twitter in particular has been used for surveillance of influenza-like illnesses, political sentiments, and even behavioral risk factors like sentiments about childhood vaccination programs. As the research landscape continuously changes, the protection of human subjects in online research needs to keep pace. Here we propose a number of guidelines for ensuring that the work done by digital researchers is supported by ethical-use principles. Our proposed guidelines include: 1) Study designs using Twitter-derived data should be transparent and readily available to the public. 2) The context in which a tweet is sent should be respected by researchers. 3) All data that could be used to identify tweet authors, including geolocations, should be secured. 4) No information collected from Twitter should be used to procure more data about tweet authors from other sources. 5) Study designs that require data collection from a few individuals rather than aggregate analysis require Institutional Review Board (IRB) approval. 6) Researchers should honor a user’s attempt to control his or her data by respecting privacy settings. As researchers, we believe that a discourse within the research community is needed to ensure protection of research subjects. These guidelines are offered to help start this discourse and to lay the foundations for the ethical use of Twitter data.

Amendments from Version 1

The new version emphasizes that the purpose of the guidelines is to provide actionable steps for researchers engaged in Twitter research. It also clarifies that study designs based on aggregate data do not require informed consent or IRB approval, whereas those with potential privacy concerns should seek approval. Finally, we added a more extensive literature review at the suggestion of reviewer Tristan Henderson, and emphasized potential risks at the suggestion of reviewer Sherry Emery.


Introduction

The growing popularity of social media sites presents a unique opportunity to study human interactions and experiences. Twitter, one of the most popular social media sites, allows users to ‘microblog’ by sharing 140-character messages with their social network. Twitter reports that there are 255 million monthly active users sending 500 million tweets each day [1]. Researchers have begun to use these data to answer questions in a variety of fields [2–8]. Recent reflections on the data collection practices of the US National Security Agency have spurred similar meditations on the ethics of digital research [9]. The concern is that Twitter data could conceivably be used in a way that violates the privacy and rights of the tweet authors. The discussions in the ethics fields have not been made easily accessible to researchers conducting this kind of research, who may be unfamiliar with the ethics literature. To protect the rights of Twitter users, we propose the following guidelines (see Figure 1), which we hope will become the standard among researchers, journal editors, and Institutional Review Boards.


Figure 1. Proposed guidelines for the ethical use of Twitter for research.

Twitter data have already been used in a number of studies to detect influenza-like illness (ILI) [3–6], risky behaviors associated with the transmission of HIV [7], sentiments about childhood vaccination programs [8], and political sentiments [9]. These study designs generally feature count data, rather than user-specific data. For example, there are multiple studies that compare the proportion of tweets about flu-related symptoms to public health data on ILI incidence. An increase in syndromic flu tweets might indicate that an outbreak is occurring. These data are usually reported either at the national level or without any geographic parameters. Another common study design aims to determine public sentiments by counting words, phrases, and emoticons that co-occur with keywords like ‘Obama’. These sentiment indicators can be used to infer public opinions about political elections, mental health, or consumer products.
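The aggregate, count-based design described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited studies: the keyword list, the function name `weekly_flu_proportion`, and the sample tweets are all hypothetical, and the naive substring match would also count words such as ‘influence’, which a real study would need to handle.

```python
from collections import Counter
from datetime import datetime

# Hypothetical keyword list; a real study would validate its own terms.
FLU_TERMS = ("flu", "fever", "cough", "sore throat")

def weekly_flu_proportion(tweets):
    """Return {(iso_year, iso_week): share of tweets mentioning a flu term}.

    `tweets` is an iterable of (timestamp, text) pairs. Usernames and
    locations are not needed for this count, so they are never read.
    """
    totals = Counter()
    matches = Counter()
    for timestamp, text in tweets:
        iso = timestamp.isocalendar()
        week = (iso[0], iso[1])
        totals[week] += 1
        lowered = text.lower()
        if any(term in lowered for term in FLU_TERMS):
            matches[week] += 1
    return {week: matches[week] / totals[week] for week in totals}

# Fabricated sample data for illustration only.
sample = [
    (datetime(2014, 1, 6), "Home sick with the flu"),
    (datetime(2014, 1, 7), "Great game last night"),
    (datetime(2014, 1, 13), "This cough will not go away"),
    (datetime(2014, 1, 14), "Fever and chills all day"),
]
proportions = weekly_flu_proportion(sample)
```

The resulting weekly proportions could then be compared with public health ILI figures for the same weeks. Only counts leave the function, which is what distinguishes this design from user-specific analysis.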

Unlike Facebook, Pinterest and other competitors, Twitter provides several application programming interfaces (APIs) that allow real-time access to vast amounts of content. Data streamed through the APIs include metadata about the authors, including the text location from their profile (e.g. ‘Baltimore’), their time zone, the time they sent the tweet, the number of friends and followers they have, the number of tweets they have ever sent, and more (Figure 2). Approximately 1% of tweets have a geolocation, which uses GPS to append the author’s precise geographic coordinates to the tweet. Geolocations are sufficiently detailed to determine from which wing of a building a tweet was sent. The default privacy settings do not enable geolocation, but do make a user’s tweets and metadata available through the API. Users can modify their settings to make their profile private, which shields their account from public view online and from the API [10].


Figure 2. An excerpt of a single tweet returned through the Twitter API [11].

There are numerous ways to access the data through these APIs. One method is the ‘garden hose’, which is a random sample of approximately 1% of all live-streamed tweets. Other access methods include the search API, which enables searching for particular users, hashtags, or locations, and author-specific queries, which can retrospectively gather up to 3,200 tweets from a single user [10]. Furthermore, in 2010 Twitter donated its entire historical record of tweets to the US Library of Congress. Detailed plans for these data are not yet available, but the Library of Congress has indicated that it intends to collaborate with academic institutions to make the data available to researchers [11,12].

The strength of tweets as a data source is in the volume; collection through the garden hose API brings in approximately 60,000–100,000 tweets per day. However, because tweets are short and often lack context, it is difficult for computers to determine tweet content automatically. For this reason, researchers primarily use tweet data to conduct population-level research concerned with trends and patterns. Study designs rely on large volumes of data to accommodate false positives and negatives. A typical data set contains millions of tweets and many thousands of tweet authors. However, a user-centric use case involving Twitter is not inconceivable. Researchers interested in social network analysis, qualitative research, and rare-event topics may eventually turn to Twitter as a data source. Potential methodologies include building a social network out of @mentions (the @ is Twitter lexicon for referencing another user); mining qualitative data from specific users’ accounts; or conducting prospective research by following a person or small group of people over time. These user-centric approaches are fundamentally different from population-level studies, and may require different ethical considerations than aggregated study designs. Additional methodologies might also involve interacting with Twitter users, which will not be addressed here.

Under non-digital circumstances, ethics guidelines suggest that collecting information from a public space where people could ‘reasonably expect to be observed by strangers’ is considered appropriate even without informed consent [13]. By this reasoning, tweets qualify as public material because they are text that users publish for the purpose of sharing with others. The weakness of this argument is that it fails to distinguish between population-level research and research focused on selected individuals. It would be clearly unethical for a researcher to follow one specific shopper around the mall and gather data exclusively about them without their consent. However, simply counting or observing behavior in aggregate without consent at a mall is an acceptable research practice. The difference is that the latter example adheres to a level of privacy that the observed individual might expect from being in public, whereas the former violates those natural privacy boundaries. A similar distinction is needed in digital research.

As an example of the potential privacy pitfalls of digital research, suppose investigators were interested in the social networks of adolescents suffering from depression. A research plan might look like this: the investigators gather geocoded tweets that contain words relevant to the topic of interest, as shown in Figure 3. They filter for geocodes that correspond to school locations in order to identify adolescent users. From there, a simple query to the Twitter API returns a list of followers for each of those presumably depressed adolescents. They now have a social network. For each member of the network, they mine the user’s tweet history to find identifying details such as a real name. The researchers then use the gathered information to ‘snowball’ data collection by curating from a variety of different sources like Facebook, Tumblr and the White Pages. They can collect birth dates, cell phone numbers, home addresses, favorite hangout spots, “likes” and “dislikes”, etc. The final result would be detailed demographic information for potentially thousands of people who exhibit symptoms of depression or are connected to a depressed adolescent. Current guidelines do not prohibit this kind of research activity. However, if the same information were collected through surveys or other traditional means, Institutional Review Board (IRB) approval would be needed.


Figure 3. Each dot is a geolocated tweet collected through the Twitter API.

The example tweet displayed is fabricated.

According to the US Department of Health and Human Services Policy for Protection of Human Research Subjects, data that are publicly available are exempt from requiring IRB approval [14]. Because Twitter data are public, they technically fall under this exemption. Furthermore, Twitter’s privacy policy makes no secret of the fact that user data are indexed by search engines, archived within the US Library of Congress, and are available through an API [15]. However, it is unlikely that many users follow the link to read the lengthy and complex document. One study found that it would take 244 hours a year for an average internet user to read every privacy policy of the unique sites they visit [16].

To help researchers navigate the landscape of ethics in online research, several principles for ethical conduct have been proposed. The Menlo Report [17] was conceived as a complement to the Belmont Report, tailored to information and communication technology research [18]. The two reports share three basic principles: respect for persons, beneficence, and justice. The Menlo Report adds a fourth principle, respect for law and public interest. These principles are meant to inform and guide researchers and ethicists ‘in ethical analyses and self-regulation’ [17]. The Association of Internet Researchers published a similar document in 2012 outlining questions and considerations for researchers to weigh [19]. Like the Menlo Report, the Association of Internet Researchers report urges a case-by-case approach. For more in-depth discussions about the role of IRBs in digital research, see for example the works of Solberg [20] and Buchanan et al. [21].

Elsewhere in the ethics literature, several themes have emerged. Respecting context, particularly in circumstances when the digital content creators might desire proper attribution, has been central [22,23]. The definition and accompanying expectations of ‘public data’ have been discussed variously by others [24–28]. The necessity and feasibility of obtaining informed consent, particularly when minors might be in the digital study sample, has also been identified as an obstacle [29–31].

Though these documents are useful as foundations, efforts to make their ideas accessible to digital researchers have been few. The US Consumer Privacy Bill of Rights (CPBR) [32] may serve as a useful framework for outlining best practices for researchers conducting research with Twitter. The CPBR was issued by the Obama administration in February 2012 in order to “give consumers clear guidance on what they should expect from those who handle their personal information, and set expectations for companies that use personal data”. The seven principles enumerated by the CPBR are listed in Table 1. The guidelines proposed here echo many of the concepts previously identified as important, but do so in a way that is actionable (see Figure 1).

Table 1. Consumer Privacy Bill of Rights [32].

Transparency: Consumers have a right to easily understandable and accessible information about privacy and security practices.
Respect for context: Consumers have a right to expect that companies will collect, use, and disclose personal data in ways that are consistent with the context in which consumers provide the data.
Security: Consumers have a right to secure and responsible handling of personal data.
Focused collection: Consumers have a right to reasonable limits on the personal data that companies collect and retain.
Accountability: Consumers have a right to have personal data handled by companies with appropriate measures in place to assure they adhere to the Consumer Privacy Bill of Rights.
Individual control: Consumers have a right to exercise control over what personal data companies collect from them.
Access and accuracy: Consumers have a right to access and correct personal data in usable formats.

Proposed guidelines for the ethical use of Twitter data

The objectives, methodologies, and data handling practices of the project are transparent and easily accessible

This information should be published in manuscripts, published on the web for the public to access, and provided to the IRB (when relevant). Going forward, collaboration between the research community and Twitter to provide information to users about ongoing research and relevant results may also be beneficial. Transparency regarding uses of Internet data for research purposes is needed for fostering ‘privacy literacy’ so that users can make informed decisions about participating in Twitter.

Study design and analyses respect the context in which a tweet was sent

Twitter participants can reasonably expect to rely on some anonymity of the crowd to manage privacy. A tweet author discussing his mental health does not do so with the intention of sharing that data with researchers; he does it to communicate with his digital community [25]. Researchers should avoid qualitatively analyzing these communications, without consent, as if they were offered for research consumption, because doing so does not align with the context in which the tweets were created.

The anonymity of tweet authors is protected, ensuring that subjects should not be identifiable in any way

To preserve source anonymity, direct quotes or screen names are not publishable, nor are any details that could be used to identify a subject. Any and all information that could be entered into a search engine to trace back to a human source should be protected. A composite of multiple example tweets may instead be used for illustrative purposes. Geolocations in particular should be scaled to a larger geographic area in order to avoid violating the privacy of those tweet authors. Title 13 of the U.S. Code, the federal law under which the Census Bureau is regulated, expressly forbids publishing GPS coordinates [33]; researchers should adhere to this guideline as well.
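One way to act on this guideline before data are stored or shared is sketched below in Python. It is an illustration only, not a complete de-identification procedure. It assumes each tweet is a dict whose field names (`created_at`, `text`, `user`, and a GeoJSON-style `coordinates` object holding longitude then latitude) follow the tweet objects returned by Twitter's API; the function name `deidentify` and the choice of one decimal place are hypothetical.

```python
def deidentify(tweet, decimals=1):
    """Return a reduced copy of a tweet record with identifying fields left out.

    Assumed input shape (not verified against any particular API version):
    a dict with a "created_at" string and an optional "coordinates" object
    whose "coordinates" entry is [longitude, latitude]. The screen name,
    user id, and tweet text are deliberately not copied to the result.
    """
    reduced = {"created_at": tweet.get("created_at")}
    point = tweet.get("coordinates")
    if point and point.get("coordinates"):
        lon, lat = point["coordinates"]
        # Rounding to one decimal degree coarsens the location to roughly
        # 11 km of latitude; the right scale is a study-specific judgment.
        reduced["lon"] = round(lon, decimals)
        reduced["lat"] = round(lat, decimals)
    return reduced

# Fabricated record for illustration only.
raw = {
    "created_at": "Mon Jan 06 14:03:00 +0000 2014",
    "text": "fabricated example text",
    "user": {"screen_name": "example_user", "id_str": "0"},
    "coordinates": {"type": "Point", "coordinates": [-76.6122, 39.2904]},
}
safe = deidentify(raw)
```

Copying only the fields a study needs, rather than deleting the ones known to be risky, means that unexpected metadata in a record is dropped by default.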

Tweet data are not used to harvest additional information from other sources

Focused collection is also important for preserving anonymity. It is possible to use data collected from Twitter to discern the identities of tweet authors, which can then be used to find and collect further information from other sources. For example, an author’s username, identifying details provided in tweet texts, or geolocations could all be used to collect data about that individual from sources like Facebook, LinkedIn, Flickr, or public records. This methodology should not be pursued without consent or IRB approval.

Twitter users’ efforts to control their personal data are honored

Researchers may not follow a user on Twitter in order to gain access to a protected account. Doing so would violate that user’s efforts to control his or her personal data.
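As a safeguard in a data pipeline, records from protected accounts can also be filtered out before analysis. The Python sketch below assumes each tweet is a dict with a nested `user` dict carrying a boolean `protected` flag, as in the user objects returned by Twitter's API; the function name `drop_protected` is hypothetical. Protected tweets would not normally appear in public API results, so this is a second line of defense rather than the primary control.

```python
def drop_protected(tweets):
    """Keep only tweets whose author is explicitly marked as not protected.

    Records with no "protected" flag are dropped as well, on the cautious
    assumption that an unknown privacy setting should be treated as private.
    """
    return [t for t in tweets if t.get("user", {}).get("protected") is False]

# Fabricated records for illustration only.
batch = [
    {"id_str": "1", "user": {"protected": False}},
    {"id_str": "2", "user": {"protected": True}},
    {"id_str": "3", "user": {}},
]
kept = drop_protected(batch)
```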

Researchers work collaboratively with IRB just as they would for any other human subject data collection

There is not currently an expectation that researchers engaging in research using Twitter will interface with their IRB. As discussed above, studies that could be considered individual-based should require IRB approval, whereas research designs that use data in aggregate (e.g. counts of keywords) may proceed without explicit consent. In turn, review boards should keep abreast of social network mining methodologies and corresponding ethical considerations in order to provide informed guidance to researchers.

Conclusions

Research involving Twitter is growing in popularity, but the issues surrounding the ethics of using it as a data source have not yet been closely examined. There are hypothetical study designs that could use Twitter data in a way that violates the privacy and ethical treatment of participants. In order to avoid those misuses, six guidelines derived from the US Consumer Privacy Bill of Rights are proposed. We welcome discourse in the research community on this topic, and encourage further discussion.

Please use the #TACTICS hashtag on Twitter to participate in this discussion online.

Comments on this article (2)

  • Author Response, 11 Jun 2014
    Caitlin Rivers, Network Dynamics and Simulation Science Laboratory, Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA, 24060, USA
    Ernesto, thank you for contributing your blog post to the discussion about the ethics of Twitter research. This sort of community-wide conversation is exactly what we were hoping. First I’d ...
  • Reader Comment, 27 May 2014
    Ernesto Priego, City University London, UK
    Inspired by this piece and the SA post that indirectly led me to it, I wrote this post: Twitter as Public Evidence and the Ethics of Twitter Research (27/05/14).
    Competing Interests: None.
How to cite this article: Rivers CM and Lewis BL. Ethical research standards in a world of big data [version 2; peer review: 3 approved with reservations]. F1000Research 2014, 3:38 (https://doi.org/10.12688/f1000research.3-38.v2)

Open Peer Review

Key to reviewer statuses:
Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.
Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.

Version 2 (published 21 Aug 2014)
Reviewer Report 16 Dec 2015
Christophe Giraud-Carrier, Data Mining Laboratory, Brigham Young University, Provo, UT, USA 
Approved with Reservations
The paper addresses an important issue and makes a number of interesting points. There are a couple of technical points that the authors should address.
  1. The authors do mention the Twitter Search API, but they say nothing about the Twitter Streaming ...
How to cite this report: Giraud-Carrier C. Reviewer Report For: Ethical research standards in a world of big data [version 2; peer review: 3 approved with reservations]. F1000Research 2014, 3:38 (https://doi.org/10.5256/f1000research.5016.r11384)

Reviewer Report 18 Nov 2014
Tristan Henderson, Department of Computer Science, University of St Andrews, Fife, UK 
Approved with Reservations
The article has been improved since the original submission. The purpose has been clarified somewhat. The literature review has been expanded although there is still no discussion of Neuhaus and Webmoor who have already proposed similar guidelines for Twitter research, ...
How to cite this report: Henderson T. Reviewer Report For: Ethical research standards in a world of big data [version 2; peer review: 3 approved with reservations]. F1000Research 2014, 3:38 (https://doi.org/10.5256/f1000research.5016.r6732)

Version 1 (published 06 Feb 2014)
Reviewer Report 01 Jul 2014
Sherry Emery, Institute for Health Research and Policy, University of Illinois at Chicago, Chicago, IL, USA 
Approved with Reservations
In general, I think this is a very helpful piece, which lays out some practical and legitimate privacy risks that arise when working with social media data. And the authors' suggestions for the code of ethics are thoughtful, yet warrant ...
How to cite this report: Emery S. Reviewer Report For: Ethical research standards in a world of big data [version 2; peer review: 3 approved with reservations]. F1000Research 2014, 3:38 (https://doi.org/10.5256/f1000research.3770.r5282)

Reviewer Report 06 Mar 2014
Tristan Henderson, Department of Computer Science, University of St Andrews, Fife, UK 
Approved with Reservations
This opinion article is interesting and timely, as it is clear that the quantity of research using social media and other sources of personal data is increasing, and the potential concerns around this need to be considered by researchers and ...
How to cite this report: Henderson T. Reviewer Report For: Ethical research standards in a world of big data [version 2; peer review: 3 approved with reservations]. F1000Research 2014, 3:38 (https://doi.org/10.5256/f1000research.3770.r3876)
