Digital Humanities Summer Institute: some charts from my #dhsi2014 archive

Please start here first (dated 7 June 2014).

Then read this (dated 9 June 2014).

Below I share some charts I created with my #dhsi2014 archive. As I have indicated, the data might require further refining; the visualisations, including the infographic in the second link above, offer a view of specific data only. This is a work in progress, done for rapid publication and sharing, so I am aware there might be errors, etc.

You can click on the charts to enlarge them.

[The bar chart of #dhsi2014 tweets per day in the archive had a typo, so I have removed it to correct it; I will replace it as soon as I can.]

Tweet Volume per #dhsi2014 User
Proportion of Users by #dhsi2014 Tweets
Top 25 #dhsi2014 Tweeters in Archive

[*I had forgotten to say that 770 tweets from one single user does seem disproportionate in context. I will check the reason: deduplication might be needed, or maybe there were too many RTs?]
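For what it’s worth, the kind of sanity check I have in mind can be sketched in a few lines of Python (a schematic illustration only: the row layout is a simplified stand-in for a TAGS-style export, not the actual archive format):

```python
from collections import Counter

def summarise_archive(rows):
    """Deduplicate by tweet id, then count tweets and retweets per user."""
    seen = {}
    for row in rows:
        seen[row["id_str"]] = row  # later duplicates overwrite earlier ones
    per_user = Counter(row["from_user"] for row in seen.values())
    retweets = Counter(
        row["from_user"] for row in seen.values()
        if row["text"].startswith("RT @")
    )
    return per_user, retweets

rows = [
    {"id_str": "1", "from_user": "alice", "text": "Hello #dhsi2014"},
    {"id_str": "1", "from_user": "alice", "text": "Hello #dhsi2014"},  # duplicate row
    {"id_str": "2", "from_user": "bob", "text": "RT @alice: Hello #dhsi2014"},
]
per_user, retweets = summarise_archive(rows)
print(per_user["alice"], retweets["bob"])  # 1 1
```

Comparing the per-user counts before and after deduplication, and against the per-user RT counts, would show quickly whether the 770 figure is an artefact or genuine activity.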

Number of Tweets Sent with the Top 15 (of a Total of 53) Sources of #dhsi2014 Tweets

Priego, Ernesto (2014): Digital Humanities Summer Institute 2014: A #dhsi2014 Archive. figshare.

For feedback please use this contact form. Cheers!

Some Thoughts on Why You Would Like to Archive and Share [Small] Twitter Data Sets

Twitter ecosystem quadrant, via


This is just a quick snippet to jot down some ideas as a follow-up to my blog post on the ethics of researching Twitter datasets, republished today [28 May 2014] by the LSE Impact blog.

If you have ever tried to keep up with Twitter you will know how hard it is. Tweets are like butterflies: one can only really look at them for long if one pins them down outside their natural environment. The reason we have access to Twitter in any form at all is Twitter’s API, which stands for “Application Programming Interface”. As Twitter explains it,

“An API is a defined way for a program to accomplish a task, usually by retrieving or modifying data. In Twitter’s case, we provide an API method for just about every feature you can see on our website. Programmers use the Twitter API to make applications, websites, widgets, and other projects that interact with Twitter. Programs talk to the Twitter API over HTTP, the same protocol that your browser uses to visit and interact with web pages.”

You might also know that free access to historic Twitter search results is limited to the last 7 days. This is due to several reasons, including the incredible amount of data requested from Twitter’s API and (this is an educated guess) the fact that Twitter’s business model relies on its data being a commodity that can be resold for research. Twitter’s data is stored and managed by at least one well-known third party, Gnip, one of their “certified data reseller partners”.

For the researcher interested in Twitter data, this means that harvesting needs to be done not only in an automated way (needless to say, Storifying won’t cut it, even if your dataset is to be very small), but in real time.
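The polling pattern this implies can be sketched in Python. This is a schematic illustration, not Twitter’s own tooling; the tweet dicts and field names are simplified stand-ins for what the Search API actually returns:

```python
# Each poll would call the Search API, e.g.
# GET https://api.twitter.com/1.1/search/tweets.json?q=%23dhsi2014&since_id=...
# Below, only the archive-side bookkeeping is shown.

def merge_batches(archive, new_batch):
    """Merge a freshly polled batch into the running archive,
    deduplicating on tweet id and keeping newest-first order."""
    known = {t["id"] for t in archive}
    fresh = [t for t in new_batch if t["id"] not in known]
    return sorted(fresh + archive, key=lambda t: t["id"], reverse=True)

def next_since_id(archive):
    """Highest id seen so far; passing it as since_id on the next poll
    asks the API only for tweets newer than anything already archived."""
    return max((t["id"] for t in archive), default=0)

archive = merge_batches([], [{"id": 10, "text": "first"}, {"id": 12, "text": "second"}])
archive = merge_batches(archive, [{"id": 12, "text": "second"}, {"id": 15, "text": "third"}])
print(len(archive), next_since_id(archive))  # 3 15
```

Because the free search window is so short, the loop has to keep running for the whole duration of the event being archived; miss a week and those tweets are, for most practical purposes, gone.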

As Twitter grew, its ability to satisfy the requests from countless services changed. Around August 2012 they announced that version 1.0 of their API would be switched off in March 2013; about a month later they announced the release of a new version of the API, which imposed new limitations and guidelines (what they call their “Developer Rules of the Road”). I am not a developer, so I won’t attempt to explain these changes like one. As a researcher, what this basically means is that there is no way to do proper research on Twitter data without understanding how it works at the API level, and that in turn means understanding the limitations and possibilities the API imposes on researchers.

Taking how the Twitter API works into consideration, it is not surprising that González-Bailón et al. (2012) should alert us that the Twitter Search API isn’t 100% reliable, as it “over-represents the more central users and does not offer an accurate picture of peripheral activity” (“Assessing the bias in communication networks sampled from Twitter”, SSRN 2185134). What’s a researcher to do? The whole butterfly colony cannot be captured with the nets most of us have available.

In April 2010, the Library of Congress and Twitter signed an agreement providing the Library with an archive of public tweets from 2006 through April 2010. The Library and Twitter also agreed that Twitter would provide all public tweets on an ongoing basis under the same terms. On 4 January 2013, the Library of Congress announced an update on their Twitter collection, publishing a white paper [PDF] that summarized the Library’s work collecting Twitter (we haven’t heard of any new updates yet). There they said that

“Archiving and preserving outlets such as Twitter will enable future researchers access to a fuller picture of today’s cultural norms, dialogue, trends and events to inform scholarship, the legislative process, new works of authorship, education and other purposes.”

To get an idea of the enormity of the project, the Library’s white paper says that

“On February 28, 2012, the Library received the 2006-2010 archive through Gnip in three compressed files totaling 2.3 terabytes. When uncompressed the files total 20 terabytes. The files contained approximately 21 billion tweets, each with more than 50 accompanying metadata fields, such as place and description.

As of December 1, 2012, the Library has received more than 150 billion additional tweets and corresponding metadata, for a total including the 2006-2010 archive of approximately 170 billion tweets totaling 133.2 terabytes for two compressed copies.”
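Taken at face value, these figures imply a surprisingly compact per-tweet footprint. A quick back-of-the-envelope check (my arithmetic, not the Library’s):

```python
# Average compressed size per tweet, metadata fields included,
# from the white paper's December 2012 figures.
total_tweets = 170e9           # ~170 billion tweets
storage_bytes = 133.2e12       # 133.2 TB for two compressed copies
bytes_per_tweet = (storage_bytes / 2) / total_tweets
print(round(bytes_per_tweet))  # roughly 392 compressed bytes per tweet
```

A few hundred bytes per tweet sounds small until multiplied by billions; it is the sheer volume, not the individual records, that makes the archive so hard to serve to researchers.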

To date, none of this data is publicly available to researchers. This is why many of us were very excited when, on 5 February 2014, Twitter announced their call for “Twitter Data Grants” [closed on 15 March 2014]. This was/is a pilot programme [we haven’t heard anything about it yet either]. In the call, Twitter clarified that

“For this initial pilot, we’ll select a small number of proposals to receive free datasets. We can do this thanks to Gnip, one of our certified data reseller partners. They are working with us to give selected institutions free and easy access to Twitter datasets. In addition to the data, we will also be offering opportunities for the selected institutions to collaborate with Twitter engineers and researchers.”

As Martin Hawksey pointed out at the time,

“It’s worth stressing that Twitter’s initial pilot will be limited to a small number of proposals, but those who do get access will have the opportunity to “collaborate with Twitter engineers and researchers”. This isn’t the first time Twitter have opened data to researchers having made data available for a Jisc funded project to analyse the London Riot and while I expect Twitter end up with a handful of elite researchers/institutions hopefully the pilot will be extended.”

Most researchers out there are likely not to benefit from access to huge Twitter data dumps. We are working with relatively small data sets, limited by the methods we use to collect, archive and study the data (and by our own disciplinary frameworks, [lack of] funding and other limitations). We are trying to talk the talk whilst walking the walk, and to conduct research both on Twitter and about Twitter.

There should be no question now about how valuable Twitter data can be for researchers in perhaps all disciplines. Given the difficulty of properly collecting and analysing Twitter data as viewed through most Twitter web and mobile clients (which is how most users get it), and the very short lifespan of search results, there is a danger of losing huge amounts of valuable historical material. As Jason Leavey (2013) says, “social media presents a growing body of evidence that can inform social and economic policy”, but

“A more sophisticated and overarching approach that uses social media data as a source of primary evidence requires tools that are not yet available. Making sense of social media data in a robust fashion will require a range of different skills and disciplines to come together. This is a process already taking shape in the research community, but it could be hastened by government.”

At the moment, unlimited access to the data has been the privilege of a few lucky individuals and elite institutions.

So, why collect and share Twitter data?

In my case, Martin Hawksey’s Twitter Archive Google Spreadsheet has provided a relatively simple method to collect some tweets from academic Twitter backchannels (for an example, start with this post and this dataset). I have been steadily collecting them for qualitative and quantitative analysis, and for archival and historical reasons, since at least 2010.

My interest is also to share this data with the participants of the studied networks, in order to encourage collaboration, interest, curiosity, wider dissemination, awareness, reproducibility of my own findings and, ideally, further research. For the individual researcher there is a wealth of data out there that, within the limitations imposed by the Twitter API and the archival methods we have at our disposal, can be saved and made accessible before it disappears.

Figshare has been a brilliant addition to my Twitter data research workflow, enabling me to get a Digital Object Identifier for my uploaded outputs, and useful metrics that give me a smoke signal that I am not completely mad and alone in the world.

I believe that you should cite data in just the same way that you can cite other sources of information, such as articles and books. Sharing research data openly can have several benefits,  not limited to

  • providing static records of social media outputs (which are otherwise not static, being subject to modification and deletion), for public verification and assessment
  • enabling easy research reuse and reproducibility of research source data
  • allowing the reach of data to be measured or tracked
  • strengthening research networks and fostering exchange and collaboration.

Finally, some useful sources of information that have inspired me to share small data sets are:

…and many others…

This coming  academic year with my students at City, University of London I am looking forward to discussing and dealing practically with the challenges and opportunities of researching, collecting, curating, sharing and preserving data such as the kind we can obtain from Twitter.

If you’ve read this far you might be interested to know that James Baker (British Library) and I will lead a workshop at the dhAHRC event ‘Promoting Interdisciplinary Engagement in the Digital Humanities’ [PDF] at the University of Oxford on 13 June 2014.

This session will offer a space to consider the relationships between research in the arts and humanities and the use and reuse of research data.

Some thoughts on what research data is, the difference between available and useable data, mechanisms for sharing, and what types of sharing encourage reuse will open the session.

Through structured group work, the remainder of the session will encourage participants to reflect on their own research data, to consider what they would want to keep, to share with restrictions, or to share for unrestricted reuse, and the reasons for these choices.

Update: for some recent work with a small Twitter dataset,


At the LSE Impact of Social Sciences Blog: Publicly available data from Twitter does not constitute an “ethical dilemma”.


The Impact Blog of the London School of Economics and Political Science has published today my post from yesterday (“Twitter as Public Evidence and the Ethics of Twitter Research”) under the title Publicly available data from Twitter is public evidence and does not constitute an “ethical dilemma”.

With many thanks to Sierra Williams.

Twitter as Public Evidence and the Ethics of Twitter Research

Cartoon by Gregory via the New Yorker


“Twitter to Release All Tweets to Scientists”, says the Scientific American headline. The 344-word post fails to quote a single source at Twitter where this claim can be verified. It is not clear whether it is just a very belated reaction to the 5 February 2014 “Twitter Data Grants” call [now closed]. Please note that the Data Grants call is a pilot programme, and the February post clearly indicated that

“For this initial pilot, we’ll select a small number of proposals to receive free datasets. We can do this thanks to Gnip, one of our certified data reseller partners. They are working with us to give selected institutions free and easy access to Twitter datasets. In addition to the data, we will also be offering opportunities for the selected institutions to collaborate with Twitter engineers and researchers.”

Nothing in there says that “Twitter” will “Release All Tweets to Scientists”, as the über-retweeted Scientific American headline claims. (Note that the above quote and source post were not mentioned or linked to from the Scientific American article in question.) It is frustrating that a publication like Scientific American (or the Smithsonian Magazine, which repeated the same copy) would not cite or link to the sources for such a claim. “All Tweets” means “All Tweets”, and “Scientists” means… who exactly?

Given Twitter’s business model and data storage and management strategy, it seems highly unlikely that the headline would be a reality (taken verbatim); it is more likely that some researchers might be able to get access to large-yet-curated datasets for research; it is also likely most researchers would have to pay for it.

The short Scientific American post reads more like an excuse to mention an opinion article, approved with reservations at F1000Research (which, most unhelpfully, it also fails to link to), by Caitlin Rivers and Bryan Lewis, computational epidemiologists at Virginia Tech. The Scientific American post asks: “Is the use of Twitter as a research tool ethical, given that its users do not intend to contribute to research?”

To be honest, the question makes me want to reply: these days, what’s unethical is not to use Twitter as a research tool. But seriously: it has to be taken into account that Rivers and Lewis’s suggested ethics framework is for Twitter research in mental health [PDF]. This is an essential qualifier for the context in which their framework should be interpreted. Please read Tristan Henderson’s open referee report on the piece, which provides very valuable observations and further reading (Read the Referee Report). The version I read and refer to is version 1 of the paper; click “track” to see updates.

As the Twitter Hive Mind often says, there is no such thing as ‘raw’ data– all data is always-already subject to curation and editing at all stages of the process. Once one has the data it is relatively easy to delete or edit specific columns for the different metadata obtained from the Twitter API. Nevertheless, as someone who has been collecting and sharing Twitter datasets for Library and Information Science research (see my figshare data) I would be worried if the ethical specificities of a particular field (mental health research, epidemiology in this case) were imposed on other fields. Research involving network analysis and geovisualisation often relies on publicly-available metadata obtained from tweets consciously and willingly provided by users publicly online through their public Twitter accounts.
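The column-pruning step described above can be sketched very simply. The field names below are hypothetical stand-ins, not the exact Twitter API column names:

```python
# Before sharing a dataset, drop the metadata columns that the research
# does not need (here, location-related fields, named illustratively).
SENSITIVE = {"geo_coordinates", "user_location"}

def prune(rows, drop=SENSITIVE):
    """Return copies of the rows with the listed metadata columns removed."""
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]

rows = [{"id_str": "1", "text": "Hello #dhsi2014", "geo_coordinates": "51.5,-0.1"}]
shared = prune(rows)
print(sorted(shared[0]))  # ['id_str', 'text']
```

The point is precisely that this curation is trivial once you have the data: which columns to keep is an ethical and disciplinary decision, not a technical obstacle.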

Rivers and Lewis say that “Twitter participants can reasonably expect to rely on some anonymity of the crowd to manage privacy.” Though I can see why this is being said in the context of an opinion piece on ethics of mental health research, I disagree.

Twitter’s Privacy Policy is clear in this respect:

“Tweets, Following, Lists and other Public Information: Our Services are primarily designed to help you share information with the world. Most of the information you provide us is information you are asking us to make public. This includes not only the messages you Tweet and the metadata provided with Tweets, such as when you Tweeted, but also the lists you create, the people you follow, the Tweets you mark as favorites or Retweet, and many other bits of information that result from your use of the Services. Our default is almost always to make the information you provide public for as long as you do not delete it from Twitter, but we generally give you settings to make the information more private if you want. Your public information is broadly and instantly disseminated. For instance, your public user profile information and public Tweets may be searchable by search engines and are immediately delivered via SMS and our APIs to a wide range of users and services, with one example being the United States Library of Congress, which archives Tweets for historical purposes. [My emphasis – EP]. When you share information or content like photos, videos, and links via the Services, you should think carefully about what you are making public.

Location Information: You may choose to publish your location in your Tweets and in your Twitter profile. You may also tell us your location when you set your trend location on or enable your computer or mobile device to send us location information. You can set your Tweet location preferences in your account settings and learn more about this feature here. Learn how to set your mobile location preferences here. We may use and store information about your location to provide features of our Services, such as Tweeting with your location, and to  improve and customize the Services, for example, with more relevant content like local trends, stories, ads, and suggestions for people to follow.”

There is a wealth of information in a tweet’s metadata that can be beneficial for research in fields other than the Life Sciences. The act of archiving and disseminating public information publicly does not have to be cause for an “ethical dilemma”, as long as the archived and disseminated information was public in the first instance. If the researcher were collecting and sharing data impossible to obtain freely and publicly, we would be facing a different situation. Publicly published data is public evidence and it should be subject to public research. Facebook is not Twitter, and Twitter research is not hacking into private mobile phone messages or emails. There is a difference between surveillance and recording for historical, sociological, scientific or other research: surveillance implies the collection and analysis of information that the public did not mean to be publicly accessible. Transparency is a positive consequence of publicness, and the public research of publicly-available data is an exercise in transparency and accountability.

A researcher like me is interested in scholarly and artistic networks online composed of individuals who have willingly set up public accounts on Twitter and who willingly post content using hashtags, precisely so that their postings are organised under particular categories and can be found under them. Individuals worried about the data they publish publicly, freely and openly on Twitter being collected by researchers for purposes other than the ones they intended should perhaps reconsider how Twitter works. Moreover, it seems to me that the likelihood of an individual user’s sensitive data being further disseminated from an academic’s research Twitter dataset is much smaller than the likelihood of it going viral as originally published through a Twitter client.

I suppose the most efficient ethical framework for Twitter use is the simplest one. If you don’t want it found, viewed, collected and potentially researched, don’t tweet it publicly.

More on qualitative research and anonymity: you might also be interested in this post by Mark Carrigan, and in the post by Pat Thompson he links to here.


Wenner MM (2014) “Twitter to Release All Tweets to Scientists”, Scientific American, 1 June 2014. Accessed 27 May 2014.

Rivers CM and Lewis BL (2014) Ethical research standards in a world of big data [v1; ref status: approved with reservations]. F1000Research 2014, 3:38 (doi: 10.12688/f1000research.3-38.v1). Accessed 27 May 2014.

Sharing Research Images in a Networked World

Version 1.14. Written and published quickly… editing is ongoing… comments had been accidentally disabled, now enabled. If you are re-visiting this post, please refresh/reload your browser to ensure you see the latest version.

Update: Via Twitter Amber Thomas recommends Into the wild – Technology for open educational resources (December 2012) edited by  Lorna M. Campbell, Phil Barker, Martin Hawksey and herself (open access). Thank you Amber!


I should perhaps clarify that in this post I am thinking of “research images” in the sense of charts, cartoons, doodles, infographics, posters, etc. created by researchers/teachers/artists and shared online. These images allow the inclusion of contextual text in the form of non-intrusive captions. I appreciate that photographs shared online, particularly when published online immediately after being taken, pose different problems.

I’ve also been thinking that we researchers could be encouraged to share the research images we create on repositories of Open Educational Resources, which could contribute to creating awareness of licensing issues.

Attribution seems to me to be a key currency in scholarship (since direct financial reward for the creation/publishing of open content is rare). Therefore embedded licenses and self-archiving in repositories that offer a clear open licensing framework could be positive developments in the fostering of an academic culture that a) encourages sharing, b) recognises the work involved in sharing open resources, and c) attributes online sources.

Recently I’ve been thinking a lot about attribution in the scholarly context of our days. Having done research for Altmetric, for example, made me very sensitive to the differences in the way different disciplines and cultures behave online in relation to sharing, commenting and attributing research online.

When I conceived The Comics Grid I was primarily concerned with establishing innovative mechanisms for addressing the need for online comics scholarship in which original and annotated comics pages were shown without being deterred by copyright. Part of the project included helping develop critical awareness of how we cite different sources, including ‘non-traditional’ sources like comic books, cartoons, blog posts, online videos…

As I mentioned in my Forms of Innovation workshop session last Saturday in Durham, the World Wide Web is not the Wild Wild West, even if sometimes it definitely feels like that, a kind of no-man’s land where everyone takes whatever they want, even, perhaps surprisingly, in scholarly circles. I believe that Creative Commons licenses are an ideal way to develop a culture of ethical sharing and attribution.

Licenses by themselves cannot stop people from using content created by others in ways the licenses preclude, but they can be used in a court of law if there is evidence of misuse. This means that open licenses cannot by themselves make people act ethically: even when there is due licensing, where attribution and granted or reserved rights are clearly stipulated, people can always potentially act wrongly. The same happens with the law. So using and promoting Creative Commons licenses is only the beginning of helping create a different culture, one in which the World Wide Web is no longer the Wild Wild West; we need this culture to become gradually pervasive for it to be really effective.

In the UK, a new Enterprise and Regulatory Reform Act known as “The Instagram Act” has just been passed. Images found online that do not contain clear attribution can be treated as ‘orphan works’ and licensed for use without the author’s permission. Read about it here.

Earlier today, Amber Thomas from the University of Warwick tweeted a concern about infographics: “my problem with infographic practice is lack of provenance. hard to cite, lacking in publication date, rarely a clear copyright statement.” (Tweet, 1 May 2013, 11:29am GMT).

A.J. Cann replied that “publishing on  would fix all that” (Tweet, 1 May 2013, 11:34am GMT). He is right (I also talked about figshare as a means to ensure content is properly attributed, cited and licensed in my presentation at Durham), but later I thought that perhaps that was not enough: files made to be shared online should include the attribution, citation and licensing information in the file itself.

Indeed, figshare helps by providing a digital object identifier, citation and licensing information, but a downloaded file can be shared further, separated from this context. Once downloaded, the file can be endlessly redistributed, and if clear attribution and licensing are not included in the file, how many will actually trace the file back to the site it was originally made available from, where the attribution and licensing information appears? Hence the need for this information to be included in the file itself, not only on the figshare page from which people download it.

In the case of images, this does not have to be a horrible watermark that compromises the artistic integrity of the image and renders it practically useless, and I’m not talking about some kind of digital rights management or restrictive permissions: simply a clear legend explaining who the author is and on what terms the file is being shared, as a caption at the bottom of the image, in small but legible print. This information can and should ideally be included in the file’s properties too, as metadata.
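A minimal sketch of what such a legend could look like, built from structured fields (the helper and field names are mine, purely illustrative; an imaging library or a tool like exiftool would then be needed to draw it onto the image or write it into the file’s metadata):

```python
# Build the human-readable attribution legend proposed above
# from structured fields, so the same string can be used both
# as a visible caption and as embedded file metadata.
def attribution_legend(author, year, title, licence, source_url):
    return (f'"{title}" \u00a9 {year} {author}. '
            f'Licensed under {licence}. Source: {source_url}')

legend = attribution_legend("Jane Doe", 2013, "Example Chart",
                            "CC BY 4.0", "http://example.org/chart")
print(legend)
```

Keeping the legend generation separate from the rendering means the exact same attribution travels with every export of the image, whatever the format.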

Take this fantastic image, for example. I came across it through a retweet by Melonie Fullick. I loved the image, and I retweeted Melonie’s tweet. I thought: this is awesome! Who did it? Can we do t-shirts? Go on, click on the link again, it’s at

We notice from the URL that the image file is hosted at what happens to be a site made with Tumblr. At that URL, the image file is orphaned from any context beyond ‘Tumblr’, the name of the blog (‘annfriedman’) and the URL itself. I suspect many users will get there, see the image and stop there: they won’t necessarily make the effort to find out who made it or under what kind of license it has been shared online.

Because the image file has its own URL at Tumblr, I argue it is possible not to realise that the image is actually part of a blog post (and linked to it) with its own permalink. On that blog post, Ann Friedman explains she “created The Disapproval Matrix**. (With a deep bow to its inspiration.)” (So please note that, strictly speaking, as the author recognises, the image in question could be considered a “derivative” of another concept or series of images.)

Granted, the image file URL, on its own, shows us the name of a person and the name of the Tumblr blog (“annfriedman”), but what is crucial here is that the image file itself does not contain a caption indicating any authorship, attribution or licensing information, nor descriptive metadata, in human-readable form, of what it is. One has to do a “diligent search” to find the actual blog post with the contextual information, and even then there is no indication whatsoever of how we as readers/visitors/users are allowed to use the image file in question (which has everything it takes to go viral, if you ask me). If one scrolls down, though, one finds the legend “Copyright 2012 Ann Friedman” at the bottom right corner of the website’s footer, but not in the post itself, and, as I’ve said, not in the image file itself.

Copying “The Disapproval Matrix” is as easy as dragging and dropping. Folk are already sharing the link to the image file, not the link to the blog post that contains it and that explains Ann Friedman created it, basing it on the “Approval Matrix” series from New York Magazine.

Now, this post is not about this particular image or its author, and it is not a personal critique. I too have shared lots of images online which do not contain attribution and licensing information in the files themselves. I am using an example to make a point about how easily images are reproduced online and about what authors can do about it, regardless of whether or not they care about being attributed for their work.

This is what the Web does: it makes decontextualising extremely easy, and it demands an effort from users to locate source, authorship, ownership and/or licensing. As authors of content, we cannot assume that everyone surfing the Web will do “diligent research” to find out to whom an image or any other file (say, an academic paper in PDF, or a PowerPoint presentation) belongs and how they can use it. The image file and the blog post providing context are very easily separable; the name in a Web resource’s title or URL is no clear indication of authorship, and we cannot just assume that people will make the effort to do “diligent research”.

The context we live in online is one of attention deficit and speed. Social media platforms allow, encourage and maximise decontextualisation and recontextualisation. Tumblr, Instagram, Twitter, Pinterest: a file that does not indicate in itself its source and the other information required for citation (in the case of an image, as a caption that is part of the image itself and of the file’s metadata, not merely in the HTML of the page hosting the file) will always run the danger of becoming orphaned.

Needless to say, images can be edited using very basic software, PDFs can be annotated, slides containing attribution and license information can be deleted, and so on. People who want to steal content will do so no matter what. But we have to stop acting alarmed when our content ends up being shared and reused endlessly without our name if we don’t take some basic measures to ensure that everyone and anyone will know, easily, directly and very obviously, who created what and in which ways others are allowed to use it.