Ethical research standards in a world of big data

Caitlin M. Rivers; Bryan L. Lewis

doi:10.12688/f1000research.3-38.v2

Home Browse Ethical research standards in a world of big data

ALL Metrics

-

Views

-

Downloads

Get PDF

Get XML

Export

▬

✚

Opinion Article

Revised

Ethical research standards in a world of big data

[version 2; peer review: 3 approved with reservations]

Caitlin M. Rivers¹, Bryan L. Lewis¹

PUBLISHED 21 Aug 2014

Author details Author details

¹ Network Dynamics and Simulation Science Laboratory, Virginia Bioinformatics Institute, Virginia Tech., Blacksburg, VA, 24060, USA

OPEN PEER REVIEW

REVIEWER STATUS

This article is included in the Research on Research, Policy & Culture gateway.

Abstract

In 2009 Ginsberg et al. reported using Google search query volume to estimate influenza activity in advance of traditional methodologies. It was a groundbreaking example of digital disease detection, and it still remains illustrative of the power of gathering data from the internet for important research. In recent years, the methodologies have been extended to include new topics and data sources; Twitter in particular has been used for surveillance of influenza-like-illnesses, political sentiments, and even behavioral risk factors like sentiments about childhood vaccination programs. As the research landscape continuously changes, the protection of human subjects in online research needs to keep pace. Here we propose a number of guidelines for ensuring that the work done by digital researchers is supported by ethical-use principles. Our proposed guidelines include: 1) Study designs using Twitter-derived data should be transparent and readily available to the public. 2) The context in which a tweet is sent should be respected by researchers. 3) All data that could be used to identify tweet authors, including geolocations, should be secured. 4) No information collected from Twitter should be used to procure more data about tweet authors from other sources. 5) Study designs that require data collection from a few individuals rather than aggregate analysis require Institutional Review Board (IRB) approval. 6) Researchers should adhere to a user’s attempt to control his or her data by respecting privacy settings. As researchers, we believe that a discourse within the research community is needed to ensure protection of research subjects. These guidelines are offered to help start this discourse and to lay the foundations for the ethical use of Twitter data.

Corresponding author: Caitlin M. Rivers

Competing interests: No competing interests were disclosed.

Grant information: Research reported in this publication was supported by the US National Institute of General Medical Sciences of the National Institutes of Health under award number 2U01GM070694-09. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Additional support provided by Defense Threat Reduction Agency Validation Grant HDTRA1-11-1-0016.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2014 Rivers CM and Lewis BL. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Data associated with the article are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).

How to cite: Rivers CM and Lewis BL. Ethical research standards in a world of big data [version 2; peer review: 3 approved with reservations]. F1000Research 2014, 3:38 (https://doi.org/10.12688/f1000research.3-38.v2) First published: 06 Feb 2014, 3:38 (https://doi.org/10.12688/f1000research.3-38.v1) Latest published: 21 Aug 2014, 3:38 (https://doi.org/10.12688/f1000research.3-38.v2)

Revised Amendments from Version 1

The new version emphasizes that the purpose of the guidelines is provide actionable steps for researchers engaged in Twitter research. It also clarifies that study designs based on aggregate study designs do not require informed consent or IRB approval, whereas those with potential privacy concerns should seek approval. Finally, we added a more extensive literature review at the suggestion of reviewer Tristan Henderson, and emphasized potential risks at the suggestion of reviewer Sherry Emery.

To read any peer review reports and author responses for this article, follow the "read" links in the Open Peer Review table.

Introduction

The growing popularity of social media sites presents a unique opportunity to study human interactions and experiences. Twitter, one of the most popular social media sites, allows users to ‘microblog’ by sharing 140 character messages with their social network. Twitter reports that there are 255 million monthly active users sending 500 million tweets each day¹. Researchers have begun to use this data to answer questions in a variety of fields^2–8. Recent reflections on the data collection practices of the US National Security Administration have spurred similar meditations on the ethics of digital research⁹. The concern is that Twitter data could conceivably be used in a way that violates the privacy and rights of the tweet authors. The discussions in the ethics fields have not been made easily accessible to researchers conducting this kind of research, who may be unfamiliar with the ethics literature. To protect the rights of Twitter users, we propose the following guidelines (see Figure 1), which we hope will become the standard among researchers, journal editors, and Institutional Review Boards.

Figure 1. Proposed guidelines for the ethical use of Twitter for research.

Twitter data have already been used in a number of studies to detect influenza-like illness (ILI)^3–6, risky behaviors associated with the transmission of HIV⁷, sentiments about childhood vaccination programs⁸, and political sentiments⁹. These study designs generally feature count data, rather than user-specific data. For example, there are multiple studies that compare the proportion of tweets about flu-related symptoms to public health data on influenza-like incidence (ILI). An increase in syndromic flu tweets might indicate that an outbreak is occurring. These data are usually reported either at the national level or without any geographic parameters. Another common study design aims to determine public sentiments by counting words, phrases, and emoticons that co-occur with keywords like ‘Obama’. These sentiment indicators can be used to infer public opinions about political elections, mental health, or consumer products.

Unlike Facebook, Pinterest and other competitors, Twitter provides several application programming interfaces (APIs) that allow real-time access to vast amounts of content. Data streamed through the APIs include metadata about the authors, including the text location from their profile (e.g. ‘Baltimore’), their time zone, the time they sent the tweet, the number of friends and followers they have, the number of tweets they have ever sent, and more (Figure 2). Approximately 1% of tweets have a geolocation, which uses GPS to append the author’s precise geographic coordinates to the tweet. Geolocations are sufficiently detailed to determine from which wing of a building a tweet was sent. The default privacy settings do not enable geolocation, but do make a user’s tweets and metadata available through the API. Users can modify their setting to make their profile private, which shields their account from public view online and from the API¹⁰.

Figure 2. An excerpt of a single tweet returned through the Twitter API¹¹.

There are numerous ways to access the data through these APIs: one method is through the ‘garden hose’, which is a random sample of approximately 1% of all live-streamed tweets. Other access methods include the search API which enables searching for particular users, hashtags, or locations, and author-specific queries which can retrospectively gather up to 3,200 tweets from a single user¹⁰. Furthermore, in 2010 Twitter donated its entire historical record of tweets to the US Library of Congress. Detailed plans for these data are not yet available, but the Library of Congress has indicated that it intends to collaborate with academic institutions to make the data available to researchers^11,12.

The strength of tweets as a data source is in the volume; collection through the garden hose API brings in approximately 60,000–100,000 tweets per day. However because tweets are short and often lack context, it is difficult for computers to determine tweet content automatically. For this reason, researchers primarily use tweet data to conduct population-level research concerned with trends and patterns. Study designs rely on large volumes of data to accommodate false positives and negatives. A typical data set contains millions of tweets and many thousands of tweet authors. However, a user-centric use case involving Twitter is not inconceivable. Researchers interested in social network analysis, qualitative research, and rare-event topics may eventually turn to Twitter as a data source. Potential methodologies include building a social network out of @mentions (the @ is Twitter lexicon for referencing another user); mining qualitative data from specific user’s accounts; or conducting prospective research by following a person or small group of people over time. These user-centric approaches are fundamentally different from population-level studies, and may require different ethical considerations than aggregated study designs. Additional methodologies might also involve interacting with Twitter users, which will not be addressed here.

Under non-digital circumstances, ethics guidelines suggest that collecting information from a public space where people could ‘reasonably expect to be observed by strangers’ is considered appropriate even without informed consent¹³. According to these guidelines, tweets are text that users publish for the purpose of sharing with others. The weakness of this argument is that it fails to distinguish between population-level research and research focused on selected individuals. It would be clearly unethical for a researcher to follow one specific shopper around the mall and gather data exclusively about them without their consent. However, simply counting or observing behavior in aggregate without consent at a mall is an acceptable research practice. The difference is that the latter example adheres to a level of privacy that the observed individual might expect from being in public, whereas the former violates those natural privacy boundaries. A similar distinction is needed in digital research.

As an example of the potential privacy pitfalls of digital research, suppose investigators were interested in the social networks of adolescents suffering from depression. A research plan might look like this: the investigators gather geocoded tweets that contain words relevant to the topic of interest, as shown in Figure 3. They filter for geocodes that correspond to school locations in order to identify adolescent users. From there, a simple query to the Twitter API returns a list of followers for each of those presumably depressed adolescents. They now have a social network. For each member of the network, they mine the user’s tweet histories to find identifying details such as their real names. The researchers then use the gathered information to ‘snowball’ data collection by curating from a variety of different sources like Facebook, tumblr and the White Pages. They can collect birth dates, cell phone numbers, home addresses, favorite hangout spots, “likes” and “dislikes”, etc. The final result would be detailed demographic information for potentially thousands of people who exhibit symptoms of depression or are connected to a depressed adolescent. Current guidelines do not prohibit this kind of research activity. However, if the same information were collected through surveys or other traditional means, Institutional Review Board (IRB) approval would be needed.

Figure 3. Each dot is a geolocated tweet collected through the Twitter API.

The example tweet displayed is fabricated.

According to the US Department of Health and Human Services Policy for Protection of Human Research Subjects, data that are publicly available are exempt from requiring IRB approval¹⁴. Because Twitter data are public, they technically fall under this exemption. Furthermore, Twitter’s privacy policy makes no secret of the fact that user data are indexed by search engines, archived within the US Library of Congress, and are available through an API¹⁵. However, it is unlikely that many users follow the link to read the lengthy and complex document. One study found that it would take 244 hours a year for an average internet user to read every privacy policy of the unique sites they visit¹⁶.

To help researchers navigate the landscape of ethics in online research, several principles for ethical conduct have been proposed. The Menlo Report¹⁷ was conceived as a complement of the Belmont Report, tailored to information and communication technology research¹⁸. The two reports share three basic principles: respect for people, beneficence, and justice. The Menlo Report adds an additional principle, respect for law and public interest. These principles are meant to inform and guide researchers and ethicists ‘in ethical analyses and self-regulation’¹⁷. The Association of Internet Research published a similar document in 2012 outlining questions and considerations for researchers to weigh¹⁹. Like the Menlo Report, the Association of Internet Research report urges a case by case approach. For more in depth discussions about the role of IRB in digital research, see for example the works of Solberg²⁰ and Buchanon et al.²¹.

Elsewhere in the ethics literature, several themes have emerged. Respecting context, particularly in circumstances when the digital content creators might desire proper attribution, has been central^22,23. The definition and accompanying expectations of ‘public data’ has been discussed variously by others^24–28. The necessity and feasibility of obtaining informed consent, particularly when minors might be in the digital study sample, has also been identified as an obstacle^29–31.

Though useful as foundational documents, efforts to make these ideas accessible to digital researchers have been few. The US Consumer Privacy Bill of Rights (CPBR)³² may serve as a useful framework for outlining best practices for researchers conducting research with Twitter. The CPBR was issued by the Obama administration in February 2012 in order to “give consumers clear guidance on what they should expect from those who handle their personal information, and set expectations for companies that use personal data”. There are seven principles enumerated by CPBR in Table 1. The guidelines proposed here echo many of the concepts previously identified as important, but do so in a way that is actionable (see Figure 1).

Table 1. Consumer Privacy Bill of Rights³².

Transparency	Consumers have a right to easily understandable and accessible information about privacy and security practices.
Respect for context	Consumers have a right to expect that companies will collect, use, and disclose personal data in ways that are consistent with the context in which consumers provide the data.
Security	Consumers have a right to secure and responsible handling of personal data.
Focused collection	Consumers have a right to reasonable limits on the personal data that companies collect and retain.
Accountability	Consumers have a right to have personal data handled by companies with appropriate measures in place to assure they adhere to the Consumer Privacy Bill of Rights.
Individual control	Consumers have a right to exercise control over what personal data companies collect from them.
Access and accuracy	Consumers have a right to access and correct personal data in usable formats.

Proposed guidelines for the ethical use of Twitter data

The objectives, methodologies, and data handling practices of the project are transparent and easily accessible

This information should be published in manuscripts, published on the web for the public to access, and provided to IRB (when relevant). Going forward, collaboration between the research community and Twitter to provide information to users about ongoing research and relevant results may also be beneficial. Transparency regarding uses of Internet data for research purposes is needed for fostering ‘privacy literacy’ so that the users can make informed decisions about participating in Twitter.

Study design and analyses respect the context in which a tweet was sent

Twitter participants can reasonably expect to rely on some anonymity of the crowd to manage privacy. A tweet author discussing his mental health does not do so with the intention of sharing that data with researchers; he does it to communicate with his digital community²⁵. Avoid qualitatively analyzing these communications as if they are offered for research consumption without consent, because it does not align with the context in which the tweets were created.

The anonymity of tweet authors is protected, ensuring that subjects should not be identifiable in any way

To preserve source anonymity, direct quotes or screen names are not publishable, nor are any details that could be used to identify a subject. Any and all information that could be entered into a search engine to trace back to a human source should be protected. A composite of multiple example tweets may instead be used for illustrative purpose. Geolocations in particular should be scaled to a larger geographic area in order to avoid violating the privacy of those tweet authors. The Title 13 of the Data Protection and Privacy Policy, the federal law under which the Census Bureau is regulated, expressly forbids publishing GPS coordinates³³; researchers should adhere to this guideline as well.

Tweet data are not used to harvest additional information from other sources

Focused collection is also important for preserving anonymity. It is possible to use data collected from Twitter to discern the identities of tweet authors, which can then be used to find and collect additional information from additional sources. For example an author’s username, identifying details provided in tweet texts, or geolocations could all be used to collect data about that individual from other sources like Facebook, LinkedIn, Flickr, or public records. This methodology should not be pursued without consent or IRB approval.

Twitter users’ efforts to control their personal data are honored

Researchers may not follow a user on Twitter in order to gain access to a protected account. Doing so would violate that user’s efforts to control his or her personal data.

Researchers work collaboratively with IRB just as they would for any other human subject data collection

There is not currently an expectation that researchers engaging in research using Twitter will interface with their IRB. As discussed above, studies that could be conceived as individual-based should require IRB approval, whereas research designs that use data in aggregate (e.g. counts of keywords) may proceed without explicit consent. In turn, review boards should keep abreast of social network mining methodologies and corresponding ethical considerations in order provide informed guidance to researchers.

Conclusions

Research involving Twitter is growing in popularity, but the issues surrounding the ethics of using it as a data source have not yet been closely examined. There are hypothetical study designs that could use Twitter data in a way that violates the privacy and ethical treatment of participants. In order to avoid those misuses, six guidelines derived from the US Consumer Privacy Bill of Rights are proposed. We welcome discourse in the research community on this topic, and encourage further discussion.

Please use the #TACTICS hashtag on Twitter to participate in this discussion online.

Author contributions

CR conceived of the guidelines and drafted the manuscript. BL contributed to both the development of the guidelines and preparation of the manuscript.

Competing interests

No competing interests were disclosed.

Grant information

Research reported in this publication was supported by the US National Institute of General Medical Sciences of the National Institutes of Health under award number 2U01GM070694-09. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Additional support provided by Defense Threat Reduction Agency Validation Grant HDTRA1-11-1-0016.

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Acknowledgements

We thank our external collaborators and members of the Network Dynamics and Simulation Science Laboratory (NDSSL) for their suggestions and comments.

Faculty Opinions recommended

References

1. About/Company. Twitter. Reference Source
2. Ginsberg J, Mohebbi MH, Patel RS, et al.: Detecting influenza epidemics using search engine query data. Nature. 2009; 457(7232): 1012–4. PubMed Abstract | Publisher Full Text
3. Achrekar H, Lazarus R, Park WC: Predicting Flu Trends using Twitter Data. In The First International Workshop on Cyber-Physical Networking Systems. 2011; (pp. 713–718). Reference Source
4. Lampos V, Cristianini N: Tracking the flu pandemic by monitoring the Social Web. Inf Syst. 2010; 411–416. Publisher Full Text
5. Salathé M, Bengtsson L, Bodnar TJ, et al.: Digital epidemiology. PLoS Comput Biol. 2012; 8(7): e1002616. PubMed Abstract | Publisher Full Text | Free Full Text
6. Young SD, Rivers C, Lewis B: Methods of using real-time social media technologies for detection and remote monitoring of HIV outcomes. Prev Med. 2014; 63: 112–5. PubMed Abstract | Publisher Full Text | Free Full Text
7. Campbell E, Salathé M: Complex social contagion makes networks more vulnerable to disease outbreaks. Sci Rep. 2013; 3: 1905. PubMed Abstract | Publisher Full Text | Free Full Text
8. Tumasjan A, Sprenger TO, Sandner PG, et al.: Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment. Word Journal Of The International Linguistic Association. 2010. Reference Source
9. Omand D: NSA leaks: How to make surveillance both ethical and effective. The Guardian. 2013. Reference Source
10. Twitter Documentation. Twitter. 2013. Reference Source
11. Raymond M: How Tweet It Is!: Library Acquires Entire Twitter Archive. Library of Congress Blog. 2010. Reference Source
12. Allen E: Update on the Twitter Archive at the Library of Congress. Library of Congress Blog. 2013. Reference Source
13. Ethics guidelines for Internet-mediated research. British Psychological Society. 2013. Reference Source
14. Code of Federal Regulation: Protection of human subjects. Department of Health and Human Services. 2009. Reference Source
15. Twitter Privacy Policy. Twitter. 2013. Reference Source
16. McDonald AM, Cranor LF: The Cost of Reading Privacy Policies. I/S: A Journal of Law and Policy for the Information Society. 2008. Reference Source
17. Bailey M, Burstein A, Claffy K, et al.: The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research. 2012. Reference Source
18. The Belmont Report. U.S. Department of Health & Human Services.1979. Reference Source
19. Markham A, Buchanan K: Ethical Decision-Making and Internet Research. 2012; (pp. 1–19). Reference Source
20. Solberg LB: Data mining on Facebook: A free space for researchers or an IRB nightmare? J Law Technol policy. 2010; 311–343. Reference Source
21. Buchanan E, Aycock J, Dexter S, et al.: Computer science security research and human subjects: emerging considerations for research ethics boards. J Empir Res Hum Res Ethics. 2011; 6(2): 71–83. PubMed Abstract | Publisher Full Text
22. Bassett EH, Riordan KO: Ethics of Internet research: Contesting the human subjects research model. 2002; 4(1): 233–247. Publisher Full Text
23. Bruckman A, Luther K, Fiesler C: When Should We Use Real Names in Published Accounts of Internet Research? Digital Research Confidential. 2011. Reference Source
24. Zimmer M: ‘But the data is already public’: on the ethics of research in Facebook. Ethics Inf Technol. 2010; 12(4): 313–325. Publisher Full Text
25. Boyd D: Privacy and Publicity in the Context of Big Data. 2010. Reference Source
26. Page X: Contextual Integrity and Preserving Relationship Boundaries in Location- Sharing Social Media. Proceedings of the ACM CSCW Workshop on Measuring Networked Social Privacy: Qualitative & Quantitative Approaches. San Antonio, TX USA. 2013. Reference Source
27. Marx GT: Murky conceptual waters: The public and the private. Ethics Inf Technol. 2001; 3(3): 157–169. Publisher Full Text
28. Moor JH: Towards a Theory of Privacy in the Information Age. 1997; 27–32. Reference Source
29. Moreno MA, Fost NC, Christakis DA: Research ethics in the MySpace era. Pediatrics. 2008; 121(1), 157–61. PubMed Abstract | Publisher Full Text
30. Wilkinson D, Cybermetrics S, Thelwall M: Researching Personal Information on the Public Web: Methods and Ethics. Soc Sci Comput Rev. 2011; 29(4): 387–401. Publisher Full Text
31. Mcneilly S, Hutton L, Henderson T: Understanding ethical concerns in social media privacy studies. Proceedings of the ACM CSCW Workshop on Measuring Networked Social Privacy: Qualitative & Quantitative Approaches. San Antonio, TX USA. 2013. Reference Source
32. Consumer Data Privacy in a Networked World: A Framework for Protecting Privacy and Promoting Innovation in the Global Digital Economy. 2012. Reference Source
33. Data Protection and Privacy Policy. United States Census Bureau. 2012. Reference Source

Comments on this article Comments (2)

Version 2

VERSION 2 PUBLISHED 21 Aug 2014

Revised

Comment

Version 1

VERSION 1 PUBLISHED 06 Feb 2014

Discussion is closed on this version, please comment on the latest version above.

Author Response 11 Jun 2014

Caitlin RIvers, Network Dynamics and Simulation Science Laboratory, Virginia Bioinformatics Institute, Virginia Tech., Blacksburg, VA, 24060, USA

11 Jun 2014

Author Response

Ernesto, thank you for contributing your blog post to the discussion about the ethics of Twitter research. This sort of community-wide conversation is exactly what we were hoping. First I’d ... Continue reading Ernesto, thank you for contributing your blog post to the discussion about the ethics of Twitter research. This sort of community-wide conversation is exactly what we were hoping. First I’d like to point out that our guidelines are not for mental health research only, but for Twitter research more broadly.

That being said, I think we agree on a lot of points. Twitter is a wonderful datasource for a studies in a variety of fields. We ourselves have used it to study risky behaviors associated with the transmission of HIV. We are simply proposing that researchers take a few simple steps to protect the privacy of the users whose data they curate.

For example, the study designs you mentioned and the guidelines we proposed are not incompatible. Under our suggested framework, large scale network and geo analyses are fine. We just ask that identifying information (user names, high-resolution geolocations) is not published. If you would like to follow specific users or publish identifiable information, then we suggest consulting an Institutional Review Board.

Thanks again for adding your voice, we hope the conversation continues.

-Caitlin Rivers and Bryan Lewis
Ernesto, thank you for contributing your blog post to the discussion about the ethics of Twitter research. This sort of community-wide conversation is exactly what we were hoping. First I’d like to point out that our guidelines are not for mental health research only, but for Twitter research more broadly.

That being said, I think we agree on a lot of points. Twitter is a wonderful datasource for a studies in a variety of fields. We ourselves have used it to study risky behaviors associated with the transmission of HIV. We are simply proposing that researchers take a few simple steps to protect the privacy of the users whose data they curate.

For example, the study designs you mentioned and the guidelines we proposed are not incompatible. Under our suggested framework, large scale network and geo analyses are fine. We just ask that identifying information (user names, high-resolution geolocations) is not published. If you would like to follow specific users or publish identifiable information, then we suggest consulting an Institutional Review Board.

Thanks again for adding your voice, we hope the conversation continues.

-Caitlin Rivers and Bryan Lewis
Competing Interests: We are the article authors. Close
Report a concern
Reader Comment 27 May 2014

Ernesto Priego, City University London, UK

27 May 2014

Reader Comment

Inspired by this piece and the SA post that indirectly led me to it, I wrote this post: Twitter as Public Evidence and the Ethics of Twitter Research (27/05/14).
Competing Interests: None.
Inspired by this piece and the SA post that indirectly led me to it, I wrote this post: Twitter as Public Evidence and the Ethics of Twitter Research (27/05/14).
Inspired by this piece and the SA post that indirectly led me to it, I wrote this post: Twitter as Public Evidence and the Ethics of Twitter Research (27/05/14).
Competing Interests: None. Close
Report a concern
Discussion is closed on this version, please comment on the latest version above.

Author details Author details

¹ Network Dynamics and Simulation Science Laboratory, Virginia Bioinformatics Institute, Virginia Tech., Blacksburg, VA, 24060, USA

Competing interests

No competing interests were disclosed.

Grant information

Research reported in this publication was supported by the US National Institute of General Medical Sciences of the National Institutes of Health under award number 2U01GM070694-09. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Additional support provided by Defense Threat Reduction Agency Validation Grant HDTRA1-11-1-0016.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Article Versions (2)

version 2

Revised

Published: 21 Aug 2014, 3:38

https://doi.org/10.12688/f1000research.3-38.v2

version 1

Published: 06 Feb 2014, 3:38

https://doi.org/10.12688/f1000research.3-38.v1

Copyright

© 2014 Rivers CM and Lewis BL. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Data associated with the article are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).

Download

Export To

metrics

	Views	Downloads
F1000Research	-	-
PubMed Central Data from PMC are received and updated monthly.	-	-

Citations

0

SEE MORE DETAILS

CITE

how to cite this article

Rivers CM and Lewis BL. Ethical research standards in a world of big data [version 2; peer review: 3 approved with reservations]. F1000Research 2014, 3:38 (https://doi.org/10.12688/f1000research.3-38.v2)

NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

track

receive updates on this article

Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?

Key to Reviewer Statuses VIEW HIDE

ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested

Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.

Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions

Version 2

VERSION 2

PUBLISHED 21 Aug 2014

Revised

Views

31

Reviewer Report 16 Dec 2015

Christophe Giraud-Carrier, Data Mining Laboratory, Brigham Young University, Provo, UT, USA

Approved with Reservations

https://doi.org/10.5256/f1000research.5016.r11384

The paper addresses an important issue and makes a number of interesting points. There are a couple of technical points that the authors should address.

The authors do mention the Twitter Search API, but they say nothing about the Twitter Streaming

The paper addresses an important issue and makes a number of interesting points. There are a couple of technical points that the authors should address.

The authors do mention the Twitter Search API, but they say nothing about the Twitter Streaming API. These are two significantly different ways to access the content of Twitter. The Streaming API allows a program to collect all tweets that match a set of keywords as they are issued. In other words, the Search API allows to look backward, while the Streaming API allows to look forward.
The authors suggest that it would be possible to build social networks (e.g., using mentions) from social media data, as if this was a future event. Many researchers have already done that, and this is indeed the strength of social media, i.e., access not only to the content of what one says, but the interaction he/she has with others.

In spite of the points made by the authors, who try very hard to protect users against themselves, it may be valuable to add a discussion about the fact that as time goes on, there may no longer be any reasonable assumption of privacy in the online world. As recognized by the authors themselves, it is easy to collect all kinds of information from users by considering their profiles, the content they post, and the associations they maintain online. Countless studies have shown how political leaning, sexual orientation, and a number of other sensitive and/or private information can be inferred. This is the nature of the digital world. And for several platforms, such as Twitter, there is already no assumption of privacy. Presumably people put their thoughts, ideas, and emotions out there so others can find them. Usually these bits of information are spread out and users typically dot not consider the possibility that someone (or something, i.e., a machine) can piece them together to infer other information about them, or they do not express them explicitly for research purposes. Nevertheless it can be and is routinely done. This may warrant further discussion.

As stated by the authors, when a tweet author (or blog writer) shares data about himself, "he does it to communicate with his digital community." It may be interesting to think that, when the content is public, then among others, researchers are part of that digital community. Unlike Facebook, anyone can follow anyone else on Twitter without the express consent of the individual being followed. In so doing, followers have access to all of the information of the persons they follow. Again, no assumption of privacy there. In turn, it could open the door to exploitation by researchers just the same as others, with the possible advantage that researchers are likely to use the information in a benevolent fashion (like all technologies social media does have two sides). It seems that while possibly protecting the users (although, as stated above, this may be more an illusion than a reality anymore in today's online world), putting too many restrictions on benevolent researchers (that others, especially malevolent individuals readily dispense with) would limit our ability to serve them and address important public health issues. Some discussion of these issues would be welcome.

Competing Interests: No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

CITE

Report a concern

Respond or Comment

Views

55

Reviewer Report 18 Nov 2014

Tristan Henderson, Department of Computer Science, University of St Andrews, Fife, UK

Approved with Reservations

https://doi.org/10.5256/f1000research.5016.r6732

The article has been improved since the original submission. The purpose has been clarified somewhat. The literature review has been expanded although there is still no discussion of Neuhaus and Webmoor who have already proposed similar guidelines for Twitter research, ... Continue reading

The article has been improved since the original submission. The purpose has been clarified somewhat. The literature review has been expanded although there is still no discussion of Neuhaus and Webmoor who have already proposed similar guidelines for Twitter research, nor the BPS guidelines. With so many sets of guidelines, it is difficult for a researcher to know which to follow. It would be nice to see a concerted effort (see my next paragraph).

I still have concerns about the appropriateness of this particular venue for developing guidelines. The authors suggest the use of the Twitter hashtag "#TACTICS" for discussion, but such a generic hashtag is not particularly useful. https://twitter.com/search?q=%23TACTICS just brings up tweets about sports and marketing strategy. I believe that some sort of more focused discussion, e.g. at a workshop, perhaps under the auspices of a relevant professional society, would be more appropriate.

Since some people seemed to find the list of references in my previous review useful, here a couple of more recent ones:

The Horizon Digital Economy Research Institute's (Nottingham, UK) response to a UK parliamentary select committee on social media analytics: http://drdrmc.blogspot.co.uk/2014/04/ethics-and-social-media.html
A talk by Dan O'Connor of the Wellcome Trust (again, UK) on ethics and social media research: http://www.youtube.com/watch?v=5VbQcYt0bkw

Given that this document is by definition a work-in-progress, I keep my decision the same.

Competing Interests: No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

CITE

Report a concern

Respond or Comment

Version 1

VERSION 1

PUBLISHED 06 Feb 2014

Views

108

Reviewer Report 01 Jul 2014

Sherry Emery, Institute for Health Research and Policy, University of Illinois at Chicago, Chicago, IL, USA

Approved with Reservations

https://doi.org/10.5256/f1000research.3770.r5282

In general, I think this is a very helpful piece, which lays out some practical and legitimate privacy risks that arise when working with social media data. And the authors' suggestions for the code of ethics are thoughtful, yet warrant ... Continue reading

In general, I think this is a very helpful piece, which lays out some practical and legitimate privacy risks that arise when working with social media data. And the authors' suggestions for the code of ethics are thoughtful, yet warrant some debate.

My main concern is that corporate interests, most prominently Facebook - but also app developers and marketers by the legion - are using these data to both manipulate and profit from users digital networks and behaviors. More discussion of the appropriateness of applying traditional IRB standards to social media research is needed. In particular, the relative uselessness of the 'terms of use' in the context of IRB standards is notable. But in practice, this is the standard that commercial entities are applying. For researchers to be held to a different standard risks abdicating social media research to corporate interests--and the public good of generating knowledge from these data will be lost unless that knowledge directly serves somebody's bottom line.

I'd like to see a more thorough consideration of the practical implications of potential privacy breaches from social media research. Would these meet the minimal risk criterion used by most IRBs? If so, then some of the suggestions in the proposed guidelines may be unnecessarily 'protective.'

Competing Interests: No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

CITE

Report a concern

Respond or Comment

Views

212

Reviewer Report 06 Mar 2014

Tristan Henderson, Department of Computer Science, University of St Andrews, Fife, UK

Approved with Reservations

https://doi.org/10.5256/f1000research.3770.r3876

This opinion article is interesting and timely, as it is clear that the quantity of research using social media and other sources of personal data is increasing, and the potential concerns around this need to be considered by researchers and ... Continue reading

This opinion article is interesting and timely, as it is clear that the quantity of research using social media and other sources of personal data is increasing, and the potential concerns around this need to be considered by researchers and ethics committees. I found the current form of the paper to be rather unsatisfactory, however: it is not clear how or where the proposed guidelines would be applied, how they were developed, and most importantly there is no contextualisation of how this paper builds on the myriad discussions that have already taken place in this arena.

The introductory section of the paper is fine, and provides a clear description of how Twitter works, which is no doubt of use for the intended audience. But the authors need to clearly describe the purpose of the guidelines. Indeed there needs to be some introductory text before the guidelines are introduced, as the current text is rather incongruent. Is the text on the Consumer Privacy Bill of Rights supposed to be in a separate section following the introduction?

Moreover the aim and applicability of the guidelines is unclear. How would these guidelines be enforced? Would they be applied to all studies? Is there any mechanism for exceptions to this? For instance if I wanted to do a study comparing health behaviour on Twitter and Facebook, I would need to collect data from both networks. But this violates the guidelines. Or are these guidelines only for studies where IRB approval would not be obtained? There is much discussion over whether social media studies require any IRB approval at all ¹^,²^,³. But your last guideline seems to disagree with this. The title of the paper refers to “big data” but the paper itself really concerns itself with Twitter (with some discussion of linkability). Are the guidelines to apply to all big data studies?

Given that this paper really only targets US ethics committees and IRBs, I was surprised that there was no discussion of the Common Rule or the recent Menlo Report ⁴^,⁵. An opinion article in Science last year also discussed US regulation ⁶.

One of the guidelines explicitly describes the context in which a tweet was sent. But there is no mention of contextual integrity ⁷ which seems to be the most common framework for studying this, and has indeed been applied to social media research ⁸^,⁹^,¹⁰.

Danah Boyd has talked widely about the ethics of big data research ¹¹^,¹² and her work should be discussed. Concerns about Twitter specifically have also been raised ¹³^,¹⁴ and Neuhaus and Webmoor specifically propose “agile ethics” for the study of such networks ¹⁵. How would agile ethics fit in with your guidelines?

The implicit assumption in your guidelines is that studies should occur without consent and involve trawling the firehose. But these kinds of studies in themselves are controversial ¹⁶. Moreover is there any reason why informed consent could not be obtained for some of these studies, especially since you touch on consent when discussing your hypothetical shopping centre study? We have looked at this empirically ¹⁷.

One striking difference between research and the collection of data for business purposes is that researchers are typically interested in reproducibility. Indeed many research funding bodies worldwide are now insisting on data archiving and sharing as a requirement, and this would apply to social media studies as well (something that has been criticised in the past ¹⁸). Should your guidelines not address data sharing? Technical challenges exist, including the linkability problems that you discuss ¹⁹ and the difficulty of anonymising data ²⁰, while solutions are still nascent ²¹^,²².

The BPS reference (13) should be updated ²³. Moreover you ought to compare and contrast other ethical guidelines for Internet research ²⁴^,²⁵^,²⁶, and other proposals for social media and health research ²⁷^,²⁸.

At one point you mention “privacy literacy” but there is no reference and it was unclear what you mean here. Privacy “salience” has been discussed in the literature ²⁹ but might be slightly different.

Finally, I qualify my review by stating that I am not a life scientist and that this is the first review that I have written for F1000Research. Indeed, I had not heard of it prior to my review request. I do question the appropriateness of a scientific journal for this work. If the intention of the authors is, as stated, to encourage more discussion then it would be good to focus this in one place. Since F1000Research allows comments from readers, then perhaps this does provide a better avenue for discussion than over Twitter, which can be difficult over long time-scales and is not particularly persistent. But in my opinion an even better mechanism for discussion and debate would be to hold a workshop and then publish a workshop report. Certainly further discussion with other communities, and examination of the discussion that has already taken place, would be useful to inform and improve this document. In the first instance, I think it would be good to see a revised document that shows more awareness of the current work, and where controversy exists, discusses why the authors have taken their chosen stance when proposing their guidelines.

References

1. Buchanan E, Aycock J, Dexter S, Dittrich D, et al.: Computer Science Security Research and Human Subjects: Emerging Considerations for Research Ethics Boards. Journal of Emperical Research on Human Ethics. 2011; 6 (2): 71-83 Publisher Full Text
2. Halavais A: Social science: Open up online research. Nature. 2011; 480 (7376): 174-175 PubMed Abstract | Publisher Full Text
3. Solberg LB: Data Mining on Facebook: A free space for researchers or an IRB nightmare?. University of Illinois Journal of Law, Technology & Policy. 2010; 2010 (2): 311-343 Reference Source
4. Dittrich D, Kenneally E: The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research. Cooperative Association for Internet Data Analysis. 2012. Reference Source
5. Bailey M, Dittrich D, Kenneally E, Maughan D: The Menlo Report. IEEE Security & Privacy Magazine. 2012; 10 (2): 71-75 Publisher Full Text
6. Shapiro RB, Ossorio PN: Research Ethics. Regulation of online social network studies. Science. 2013; 339 (6116): 144-145 PubMed Abstract | Publisher Full Text
7. Nissenbaum HF: Privacy as Contextualo integrity. Washington Law Review. 2004; 79 (1): 119-157 Reference Source
8. Page X: Contextual integrity and preserving relationship boundaries in location- sharing social media In Proceedings of the ACM CSCW Workshop on Measuring Networked Social Privacy: Qualitative & Quantitative Approaches, San Antonio, TX, USA, February. 2013. Reference Source
9. Shi P, Xu H, Chen Y: Using contextual integrity to examine interpersonal information boundary on social network sites. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, PAris, France. 2013. 35-38 Publisher Full Text
10. Hull G, Lipford HR, Latulipe C: Contexual gaps: privacy issues on Facebook. Ethics Inf Technol. 2011; 13 (4): 289-302 Publisher Full Text
11. Boyd D: Privacy and publicity in the context of big data. Keynote at WWW ’10: the 19th International Conference on the World Wide Web, April. 2010. Reference Source
12. Boyd D, Crawford K: Critical questions for big data. Inf Commun Soc. 2012; 15 (5): 662-679 Publisher Full Text
13. Puschmann C, Burgess J: The politics of Twitter data. Social Science Research Network Working Paper Series, January. 2013. Reference Source
14. Vieweg S: The ethics of Twitter research. Revisiting Research Ethics in the Facebook Era: Challenges in Emerging CSCW Research, Savannah, GA, USA, February. 2010. Reference Source
15. Neuhaus F, Webmoor T: Agile ethics for massified research and visualization. Inf Commun Soc. 2012; 15 (1): 43-65 Publisher Full Text
16. Ioannidis JPA: Informed consent, big data, and the oxymoron of research that is not research. Am J Bioeth. 2013; 13 (4): 40-42 PubMed Abstract | Publisher Full Text
17. McNeilly S, Hutton L, Henderson T: Understanding ethical concerns in social media privacy studies.In Proceedings of the ACM CSCW Workshop on Measuring Networked Social Privacy: Qualitative & Quantitative Approaches, San Antonio, TX, USA, February. 2013. Reference Source
18. Huberman BA: Sociology of science: Big data deserve a bigger audience. Nature. 2012; 482 (7385): 308 PubMed Abstract | Publisher Full Text
19. Narayanan A, Shmatikov V: De-anonymizing social networks. Proceedings of the IEEE Symposium on Security and Privacy. 2009. 173-187 Publisher Full Text
20. Zimmer M: “But the data is already public”: on the ethics of research in Facebook. Ethics Inf Technol. 2010; 12 (4): 313-325 Publisher Full Text
21. Dwork C: Differential Privacy. M. Bugliesi, B. Preneel, V. Sassone, and I. Wegener, editors, Automata, Languages and Programming, volume 4052 of Lecture Notes in Computer Science, Springer, Berlin. 2006. 1-12 Publisher Full Text
22. Hutton L, Henderson T: An architecture for ethical and privacy-sensitive social network experiments. SIGMETRICS Perform. Eval. Rev. 2013; 40 (4): 90-95 Publisher Full Text
23. British Psychological Society: Ethics guidelines for Internet-mediated research. 2013. Reference Source
24. Buchanan EA, Ess C: Internet research ethics: The field and its critical issues. The Handbook of Information and Computer Ethics, number 11. 2008. 273-292 Publisher Full Text
25. Eynon R, Schroeder R, Fry J: New techniques in online research: challenges for research ethics. Twenty-first Century Society. 2009; 4 (2): 187-199 Publisher Full Text
26. Felzmann H: Ethical issues in internet research: international good practice and Irish research ethics documents. In C. Fowley, C. English, and S. Thouësny, editors, Internet research, theory, and practice: perspectives from Ireland, pages 11–32. Research-publishing.net, Dublin, 1 edition. 2013. Reference Source
27. McKee R: Ethical issues in using social media for health and health care research. Health Policy. 2013; 110 (2-3): 298-301 PubMed Abstract | Publisher Full Text
28. Moreno MA, Fost NC, Christakis DA: Research ethics in the MySpace era. pediatrics. 2008; 121 (1): 157-161doi PubMed Abstract
29. Bonneau J, Preibusch S: The privacy jungle: On the market for data protection in social networks. Proceedings of The Eighth Workshop on the Economics of Information Security (WEIS 2009). 2009. 121-167

Competing Interests: No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

CITE

Report a concern

Respond or Comment

Comments on this article Comments (2)

Version 2

VERSION 2 PUBLISHED 21 Aug 2014

Revised

Comment

Version 1

VERSION 1 PUBLISHED 06 Feb 2014

Discussion is closed on this version, please comment on the latest version above.

Author Response 11 Jun 2014

Caitlin RIvers, Network Dynamics and Simulation Science Laboratory, Virginia Bioinformatics Institute, Virginia Tech., Blacksburg, VA, 24060, USA

11 Jun 2014

Author Response

Ernesto, thank you for contributing your blog post to the discussion about the ethics of Twitter research. This sort of community-wide conversation is exactly what we were hoping. First I’d ... Continue reading Ernesto, thank you for contributing your blog post to the discussion about the ethics of Twitter research. This sort of community-wide conversation is exactly what we were hoping. First I’d like to point out that our guidelines are not for mental health research only, but for Twitter research more broadly.

That being said, I think we agree on a lot of points. Twitter is a wonderful datasource for a studies in a variety of fields. We ourselves have used it to study risky behaviors associated with the transmission of HIV. We are simply proposing that researchers take a few simple steps to protect the privacy of the users whose data they curate.

For example, the study designs you mentioned and the guidelines we proposed are not incompatible. Under our suggested framework, large scale network and geo analyses are fine. We just ask that identifying information (user names, high-resolution geolocations) is not published. If you would like to follow specific users or publish identifiable information, then we suggest consulting an Institutional Review Board.

Thanks again for adding your voice, we hope the conversation continues.

-Caitlin Rivers and Bryan Lewis
Ernesto, thank you for contributing your blog post to the discussion about the ethics of Twitter research. This sort of community-wide conversation is exactly what we were hoping. First I’d like to point out that our guidelines are not for mental health research only, but for Twitter research more broadly.

That being said, I think we agree on a lot of points. Twitter is a wonderful datasource for a studies in a variety of fields. We ourselves have used it to study risky behaviors associated with the transmission of HIV. We are simply proposing that researchers take a few simple steps to protect the privacy of the users whose data they curate.

For example, the study designs you mentioned and the guidelines we proposed are not incompatible. Under our suggested framework, large scale network and geo analyses are fine. We just ask that identifying information (user names, high-resolution geolocations) is not published. If you would like to follow specific users or publish identifiable information, then we suggest consulting an Institutional Review Board.

Thanks again for adding your voice, we hope the conversation continues.

-Caitlin Rivers and Bryan Lewis
Competing Interests: We are the article authors. Close
Report a concern
Reader Comment 27 May 2014

Ernesto Priego, City University London, UK

27 May 2014

Reader Comment

Inspired by this piece and the SA post that indirectly led me to it, I wrote this post: Twitter as Public Evidence and the Ethics of Twitter Research (27/05/14).
Competing Interests: None.
Inspired by this piece and the SA post that indirectly led me to it, I wrote this post: Twitter as Public Evidence and the Ethics of Twitter Research (27/05/14).
Inspired by this piece and the SA post that indirectly led me to it, I wrote this post: Twitter as Public Evidence and the Ethics of Twitter Research (27/05/14).
Competing Interests: None. Close
Report a concern
Discussion is closed on this version, please comment on the latest version above.

Open Peer Review

Reviewer Status

Reviewer Reports

	Invited Reviewers
	1	2	3
Version 2 (revision) 21 Aug 14	read		read
Version 1 06 Feb 14	read	read

Tristan Henderson, University of St Andrews, Fife, UK
Sherry Emery, University of Illinois at Chicago, Chicago, IL, USA
Christophe Giraud-Carrier, Brigham Young University, Provo, UT, USA

Comments on this article

All Comments(2)

Add a comment

Sign up for content alerts

Browse by related subjects

Back to all reports

Reviewer Report

31 Views

16 Dec 2015 | for Version 2

Christophe Giraud-Carrier, Data Mining Laboratory, Brigham Young University, Provo, UT, USA

31 Views Cite this report Responses(0)

Approved With Reservations

The paper addresses an important issue and makes a number of interesting points. There are a couple of technical points that the authors should address.

The authors do mention the Twitter Search API, but they say nothing about the Twitter Streaming API. These are two significantly different ways to access the content of Twitter. The Streaming API allows a program to collect all tweets that match a set of keywords as they are issued. In other words, the Search API allows to look backward, while the Streaming API allows to look forward.
The authors suggest that it would be possible to build social networks (e.g., using mentions) from social media data, as if this was a future event. Many researchers have already done that, and this is indeed the strength of social media, i.e., access not only to the content of what one says, but the interaction he/she has with others.

In spite of the points made by the authors, who try very hard to protect users against themselves, it may be valuable to add a discussion about the fact that as time goes on, there may no longer be any reasonable assumption of privacy in the online world. As recognized by the authors themselves, it is easy to collect all kinds of information from users by considering their profiles, the content they post, and the associations they maintain online. Countless studies have shown how political leaning, sexual orientation, and a number of other sensitive and/or private information can be inferred. This is the nature of the digital world. And for several platforms, such as Twitter, there is already no assumption of privacy. Presumably people put their thoughts, ideas, and emotions out there so others can find them. Usually these bits of information are spread out and users typically dot not consider the possibility that someone (or something, i.e., a machine) can piece them together to infer other information about them, or they do not express them explicitly for research purposes. Nevertheless it can be and is routinely done. This may warrant further discussion.

As stated by the authors, when a tweet author (or blog writer) shares data about himself, "he does it to communicate with his digital community." It may be interesting to think that, when the content is public, then among others, researchers are part of that digital community. Unlike Facebook, anyone can follow anyone else on Twitter without the express consent of the individual being followed. In so doing, followers have access to all of the information of the persons they follow. Again, no assumption of privacy there. In turn, it could open the door to exploitation by researchers just the same as others, with the possible advantage that researchers are likely to use the information in a benevolent fashion (like all technologies social media does have two sides). It seems that while possibly protecting the users (although, as stated above, this may be more an illusion than a reality anymore in today's online world), putting too many restrictions on benevolent researchers (that others, especially malevolent individuals readily dispense with) would limit our ability to serve them and address important public health issues. Some discussion of these issues would be welcome.

Competing Interests

No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Respond to this report

Responses (0)

Back to all reports

Reviewer Report

55 Views

18 Nov 2014 | for Version 2

Tristan Henderson, Department of Computer Science, University of St Andrews, Fife, UK

55 Views Cite this report Responses(0)

Approved With Reservations

The article has been improved since the original submission. The purpose has been clarified somewhat. The literature review has been expanded although there is still no discussion of Neuhaus and Webmoor who have already proposed similar guidelines for Twitter research, nor the BPS guidelines. With so many sets of guidelines, it is difficult for a researcher to know which to follow. It would be nice to see a concerted effort (see my next paragraph).

I still have concerns about the appropriateness of this particular venue for developing guidelines. The authors suggest the use of the Twitter hashtag "#TACTICS" for discussion, but such a generic hashtag is not particularly useful. https://twitter.com/search?q=%23TACTICS just brings up tweets about sports and marketing strategy. I believe that some sort of more focused discussion, e.g. at a workshop, perhaps under the auspices of a relevant professional society, would be more appropriate.

Since some people seemed to find the list of references in my previous review useful, here a couple of more recent ones:

The Horizon Digital Economy Research Institute's (Nottingham, UK) response to a UK parliamentary select committee on social media analytics: http://drdrmc.blogspot.co.uk/2014/04/ethics-and-social-media.html
A talk by Dan O'Connor of the Wellcome Trust (again, UK) on ethics and social media research: http://www.youtube.com/watch?v=5VbQcYt0bkw

Given that this document is by definition a work-in-progress, I keep my decision the same.

Competing Interests

No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Respond to this report

Responses (0)

Back to all reports

Reviewer Report

108 Views

01 Jul 2014 | for Version 1

Sherry Emery, Institute for Health Research and Policy, University of Illinois at Chicago, Chicago, IL, USA

108 Views Cite this report Responses(0)

Approved With Reservations

In general, I think this is a very helpful piece, which lays out some practical and legitimate privacy risks that arise when working with social media data. And the authors' suggestions for the code of ethics are thoughtful, yet warrant some debate.

My main concern is that corporate interests, most prominently Facebook - but also app developers and marketers by the legion - are using these data to both manipulate and profit from users digital networks and behaviors. More discussion of the appropriateness of applying traditional IRB standards to social media research is needed. In particular, the relative uselessness of the 'terms of use' in the context of IRB standards is notable. But in practice, this is the standard that commercial entities are applying. For researchers to be held to a different standard risks abdicating social media research to corporate interests--and the public good of generating knowledge from these data will be lost unless that knowledge directly serves somebody's bottom line.

I'd like to see a more thorough consideration of the practical implications of potential privacy breaches from social media research. Would these meet the minimal risk criterion used by most IRBs? If so, then some of the suggestions in the proposed guidelines may be unnecessarily 'protective.'

Competing Interests

No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Respond to this report

Responses (0)

Back to all reports

Reviewer Report

212 Views

06 Mar 2014 | for Version 1

Tristan Henderson, Department of Computer Science, University of St Andrews, Fife, UK

212 Views Cite this report Responses(0)

Approved With Reservations

This opinion article is interesting and timely, as it is clear that the quantity of research using social media and other sources of personal data is increasing, and the potential concerns around this need to be considered by researchers and ethics committees. I found the current form of the paper to be rather unsatisfactory, however: it is not clear how or where the proposed guidelines would be applied, how they were developed, and most importantly there is no contextualisation of how this paper builds on the myriad discussions that have already taken place in this arena.

The introductory section of the paper is fine, and provides a clear description of how Twitter works, which is no doubt of use for the intended audience. But the authors need to clearly describe the purpose of the guidelines. Indeed there needs to be some introductory text before the guidelines are introduced, as the current text is rather incongruent. Is the text on the Consumer Privacy Bill of Rights supposed to be in a separate section following the introduction?

Moreover the aim and applicability of the guidelines is unclear. How would these guidelines be enforced? Would they be applied to all studies? Is there any mechanism for exceptions to this? For instance if I wanted to do a study comparing health behaviour on Twitter and Facebook, I would need to collect data from both networks. But this violates the guidelines. Or are these guidelines only for studies where IRB approval would not be obtained? There is much discussion over whether social media studies require any IRB approval at all ¹^,²^,³. But your last guideline seems to disagree with this. The title of the paper refers to “big data” but the paper itself really concerns itself with Twitter (with some discussion of linkability). Are the guidelines to apply to all big data studies?

Given that this paper really only targets US ethics committees and IRBs, I was surprised that there was no discussion of the Common Rule or the recent Menlo Report ⁴^,⁵. An opinion article in Science last year also discussed US regulation ⁶.

One of the guidelines explicitly describes the context in which a tweet was sent. But there is no mention of contextual integrity ⁷ which seems to be the most common framework for studying this, and has indeed been applied to social media research ⁸^,⁹^,¹⁰.

Danah Boyd has talked widely about the ethics of big data research ¹¹^,¹² and her work should be discussed. Concerns about Twitter specifically have also been raised ¹³^,¹⁴ and Neuhaus and Webmoor specifically propose “agile ethics” for the study of such networks ¹⁵. How would agile ethics fit in with your guidelines?

The implicit assumption in your guidelines is that studies should occur without consent and involve trawling the firehose. But these kinds of studies in themselves are controversial ¹⁶. Moreover is there any reason why informed consent could not be obtained for some of these studies, especially since you touch on consent when discussing your hypothetical shopping centre study? We have looked at this empirically ¹⁷.

One striking difference between research and the collection of data for business purposes is that researchers are typically interested in reproducibility. Indeed many research funding bodies worldwide are now insisting on data archiving and sharing as a requirement, and this would apply to social media studies as well (something that has been criticised in the past ¹⁸). Should your guidelines not address data sharing? Technical challenges exist, including the linkability problems that you discuss ¹⁹ and the difficulty of anonymising data ²⁰, while solutions are still nascent ²¹^,²².

The BPS reference (13) should be updated ²³. Moreover you ought to compare and contrast other ethical guidelines for Internet research ²⁴^,²⁵^,²⁶, and other proposals for social media and health research ²⁷^,²⁸.

At one point you mention “privacy literacy” but there is no reference and it was unclear what you mean here. Privacy “salience” has been discussed in the literature ²⁹ but might be slightly different.

Finally, I qualify my review by stating that I am not a life scientist and that this is the first review that I have written for F1000Research. Indeed, I had not heard of it prior to my review request. I do question the appropriateness of a scientific journal for this work. If the intention of the authors is, as stated, to encourage more discussion then it would be good to focus this in one place. Since F1000Research allows comments from readers, then perhaps this does provide a better avenue for discussion than over Twitter, which can be difficult over long time-scales and is not particularly persistent. But in my opinion an even better mechanism for discussion and debate would be to hold a workshop and then publish a workshop report. Certainly further discussion with other communities, and examination of the discussion that has already taken place, would be useful to inform and improve this document. In the first instance, I think it would be good to see a revised document that shows more awareness of the current work, and where controversy exists, discusses why the authors have taken their chosen stance when proposing their guidelines.

References

1. Buchanan E, Aycock J, Dexter S, Dittrich D, et al.: Computer Science Security Research and Human Subjects: Emerging Considerations for Research Ethics Boards. Journal of Emperical Research on Human Ethics. 2011; 6 (2): 71-83 Publisher Full Text
2. Halavais A: Social science: Open up online research. Nature. 2011; 480 (7376): 174-175 PubMed Abstract | Publisher Full Text
3. Solberg LB: Data Mining on Facebook: A free space for researchers or an IRB nightmare?. University of Illinois Journal of Law, Technology & Policy. 2010; 2010 (2): 311-343 Reference Source
4. Dittrich D, Kenneally E: The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research. Cooperative Association for Internet Data Analysis. 2012. Reference Source
5. Bailey M, Dittrich D, Kenneally E, Maughan D: The Menlo Report. IEEE Security & Privacy Magazine. 2012; 10 (2): 71-75 Publisher Full Text
6. Shapiro RB, Ossorio PN: Research Ethics. Regulation of online social network studies. Science. 2013; 339 (6116): 144-145 PubMed Abstract | Publisher Full Text
7. Nissenbaum HF: Privacy as Contextualo integrity. Washington Law Review. 2004; 79 (1): 119-157 Reference Source
8. Page X: Contextual integrity and preserving relationship boundaries in location- sharing social media In Proceedings of the ACM CSCW Workshop on Measuring Networked Social Privacy: Qualitative & Quantitative Approaches, San Antonio, TX, USA, February. 2013. Reference Source
9. Shi P, Xu H, Chen Y: Using contextual integrity to examine interpersonal information boundary on social network sites. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, PAris, France. 2013. 35-38 Publisher Full Text
10. Hull G, Lipford HR, Latulipe C: Contexual gaps: privacy issues on Facebook. Ethics Inf Technol. 2011; 13 (4): 289-302 Publisher Full Text
11. Boyd D: Privacy and publicity in the context of big data. Keynote at WWW ’10: the 19th International Conference on the World Wide Web, April. 2010. Reference Source
12. Boyd D, Crawford K: Critical questions for big data. Inf Commun Soc. 2012; 15 (5): 662-679 Publisher Full Text
13. Puschmann C, Burgess J: The politics of Twitter data. Social Science Research Network Working Paper Series, January. 2013. Reference Source
14. Vieweg S: The ethics of Twitter research. Revisiting Research Ethics in the Facebook Era: Challenges in Emerging CSCW Research, Savannah, GA, USA, February. 2010. Reference Source
15. Neuhaus F, Webmoor T: Agile ethics for massified research and visualization. Inf Commun Soc. 2012; 15 (1): 43-65 Publisher Full Text
16. Ioannidis JPA: Informed consent, big data, and the oxymoron of research that is not research. Am J Bioeth. 2013; 13 (4): 40-42 PubMed Abstract | Publisher Full Text
17. McNeilly S, Hutton L, Henderson T: Understanding ethical concerns in social media privacy studies.In Proceedings of the ACM CSCW Workshop on Measuring Networked Social Privacy: Qualitative & Quantitative Approaches, San Antonio, TX, USA, February. 2013. Reference Source
18. Huberman BA: Sociology of science: Big data deserve a bigger audience. Nature. 2012; 482 (7385): 308 PubMed Abstract | Publisher Full Text
19. Narayanan A, Shmatikov V: De-anonymizing social networks. Proceedings of the IEEE Symposium on Security and Privacy. 2009. 173-187 Publisher Full Text
20. Zimmer M: “But the data is already public”: on the ethics of research in Facebook. Ethics Inf Technol. 2010; 12 (4): 313-325 Publisher Full Text
21. Dwork C: Differential Privacy. M. Bugliesi, B. Preneel, V. Sassone, and I. Wegener, editors, Automata, Languages and Programming, volume 4052 of Lecture Notes in Computer Science, Springer, Berlin. 2006. 1-12 Publisher Full Text
22. Hutton L, Henderson T: An architecture for ethical and privacy-sensitive social network experiments. SIGMETRICS Perform. Eval. Rev. 2013; 40 (4): 90-95 Publisher Full Text
23. British Psychological Society: Ethics guidelines for Internet-mediated research. 2013. Reference Source
24. Buchanan EA, Ess C: Internet research ethics: The field and its critical issues. The Handbook of Information and Computer Ethics, number 11. 2008. 273-292 Publisher Full Text
25. Eynon R, Schroeder R, Fry J: New techniques in online research: challenges for research ethics. Twenty-first Century Society. 2009; 4 (2): 187-199 Publisher Full Text
26. Felzmann H: Ethical issues in internet research: international good practice and Irish research ethics documents. In C. Fowley, C. English, and S. Thouësny, editors, Internet research, theory, and practice: perspectives from Ireland, pages 11–32. Research-publishing.net, Dublin, 1 edition. 2013. Reference Source
27. McKee R: Ethical issues in using social media for health and health care research. Health Policy. 2013; 110 (2-3): 298-301 PubMed Abstract | Publisher Full Text
28. Moreno MA, Fost NC, Christakis DA: Research ethics in the MySpace era. pediatrics. 2008; 121 (1): 157-161doi PubMed Abstract
29. Bonneau J, Preibusch S: The privacy jungle: On the market for data protection in social networks. Proceedings of The Eighth Workshop on the Economics of Information Security (WEIS 2009). 2009. 121-167

Competing Interests

No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Respond to this report

Responses (0)

[1] 1. About/Company. Twitter. Reference Source

[2] 2. Ginsberg J, Mohebbi MH, Patel RS, et al.: Detecting influenza epidemics using search engine query data. Nature. 2009; 457(7232): 1012–4. PubMed Abstract | Publisher Full Text

[3] 3. Achrekar H, Lazarus R, Park WC: Predicting Flu Trends using Twitter Data. In The First International Workshop on Cyber-Physical Networking Systems. 2011; (pp. 713–718). Reference Source

[4] 4. Lampos V, Cristianini N: Tracking the flu pandemic by monitoring the Social Web. Inf Syst. 2010; 411–416. Publisher Full Text

[5] 5. Salathé M, Bengtsson L, Bodnar TJ, et al.: Digital epidemiology. PLoS Comput Biol. 2012; 8(7): e1002616. PubMed Abstract | Publisher Full Text | Free Full Text

[6] 6. Young SD, Rivers C, Lewis B: Methods of using real-time social media technologies for detection and remote monitoring of HIV outcomes. Prev Med. 2014; 63: 112–5. PubMed Abstract | Publisher Full Text | Free Full Text

[7] 7. Campbell E, Salathé M: Complex social contagion makes networks more vulnerable to disease outbreaks. Sci Rep. 2013; 3: 1905. PubMed Abstract | Publisher Full Text | Free Full Text

[8] 8. Tumasjan A, Sprenger TO, Sandner PG, et al.: Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment. Word Journal Of The International Linguistic Association. 2010. Reference Source

[9] 9. Omand D: NSA leaks: How to make surveillance both ethical and effective. The Guardian. 2013. Reference Source

[10] 10. Twitter Documentation. Twitter. 2013. Reference Source

[11] 11. Raymond M: How Tweet It Is!: Library Acquires Entire Twitter Archive. Library of Congress Blog. 2010. Reference Source

[12] 12. Allen E: Update on the Twitter Archive at the Library of Congress. Library of Congress Blog. 2013. Reference Source

[13] 13. Ethics guidelines for Internet-mediated research. British Psychological Society. 2013. Reference Source

[14] 14. Code of Federal Regulation: Protection of human subjects. Department of Health and Human Services. 2009. Reference Source

[15] 15. Twitter Privacy Policy. Twitter. 2013. Reference Source

[16] 16. McDonald AM, Cranor LF: The Cost of Reading Privacy Policies. I/S: A Journal of Law and Policy for the Information Society. 2008. Reference Source

[17] 17. Bailey M, Burstein A, Claffy K, et al.: The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research. 2012. Reference Source

[18] 18. The Belmont Report. U.S. Department of Health & Human Services.1979. Reference Source

[19] 19. Markham A, Buchanan K: Ethical Decision-Making and Internet Research. 2012; (pp. 1–19). Reference Source

[20] 20. Solberg LB: Data mining on Facebook: A free space for researchers or an IRB nightmare? J Law Technol policy. 2010; 311–343. Reference Source

[21] 21. Buchanan E, Aycock J, Dexter S, et al.: Computer science security research and human subjects: emerging considerations for research ethics boards. J Empir Res Hum Res Ethics. 2011; 6(2): 71–83. PubMed Abstract | Publisher Full Text

[22] 22. Bassett EH, Riordan KO: Ethics of Internet research: Contesting the human subjects research model. 2002; 4(1): 233–247. Publisher Full Text

[23] 23. Bruckman A, Luther K, Fiesler C: When Should We Use Real Names in Published Accounts of Internet Research? Digital Research Confidential. 2011. Reference Source

[24] 24. Zimmer M: ‘But the data is already public’: on the ethics of research in Facebook. Ethics Inf Technol. 2010; 12(4): 313–325. Publisher Full Text

[25] 25. Boyd D: Privacy and Publicity in the Context of Big Data. 2010. Reference Source

[26] 26. Page X: Contextual Integrity and Preserving Relationship Boundaries in Location- Sharing Social Media. Proceedings of the ACM CSCW Workshop on Measuring Networked Social Privacy: Qualitative & Quantitative Approaches. San Antonio, TX USA. 2013. Reference Source

[27] 27. Marx GT: Murky conceptual waters: The public and the private. Ethics Inf Technol. 2001; 3(3): 157–169. Publisher Full Text

[28] 28. Moor JH: Towards a Theory of Privacy in the Information Age. 1997; 27–32. Reference Source

[29] 29. Moreno MA, Fost NC, Christakis DA: Research ethics in the MySpace era. Pediatrics. 2008; 121(1), 157–61. PubMed Abstract | Publisher Full Text

[30] 30. Wilkinson D, Cybermetrics S, Thelwall M: Researching Personal Information on the Public Web: Methods and Ethics. Soc Sci Comput Rev. 2011; 29(4): 387–401. Publisher Full Text

[31] 31. Mcneilly S, Hutton L, Henderson T: Understanding ethical concerns in social media privacy studies. Proceedings of the ACM CSCW Workshop on Measuring Networked Social Privacy: Qualitative & Quantitative Approaches. San Antonio, TX USA. 2013. Reference Source

[32] 32. Consumer Data Privacy in a Networked World: A Framework for Protecting Privacy and Promoting Innovation in the Global Digital Economy. 2012. Reference Source

[33] 33. Data Protection and Privacy Policy. United States Census Bureau. 2012. Reference Source

Ethical research standards in a world of big data

Abstract

Revised Amendments from Version 1

Introduction

Figure 1. Proposed guidelines for the ethical use of Twitter for research.

Figure 2. An excerpt of a single tweet returned through the Twitter API11.

Figure 3. Each dot is a geolocated tweet collected through the Twitter API.

Table 1. Consumer Privacy Bill of Rights32.

Proposed guidelines for the ethical use of Twitter data

The objectives, methodologies, and data handling practices of the project are transparent and easily accessible

Study design and analyses respect the context in which a tweet was sent

The anonymity of tweet authors is protected, ensuring that subjects should not be identifiable in any way

Tweet data are not used to harvest additional information from other sources

Twitter users’ efforts to control their personal data are honored

Researchers work collaboratively with IRB just as they would for any other human subject data collection

Conclusions

Author contributions

Competing interests

Grant information

Acknowledgements

References

Comments on this article Comments (2)

Open Peer Review

Comments on this article Comments (2)

Open Peer Review

Reviewer Status

Reviewer Reports

Comments on this article

Browse by related subjects

Competing Interests Policy

Stay Updated

Figure 2. An excerpt of a single tweet returned through the Twitter API¹¹.

Table 1. Consumer Privacy Bill of Rights³².