The complete archive contains 1294 Tweets published publicly and tagged with #IGNCC14 between 18/07/2014 07:25:47 BST and 21/07/2014 10:17:15 BST.
The conference’s Twitter activity at a glance:
The Tweets contained in the archive were collected using Martin Hawksey’s TAGS 5.1. The file contains five sheets:
Sheet 0. A ‘Cite Me’ sheet, including the provenance of this file, citation information, information about its contents, the methods employed and some context.
Sheet 1. Complete #IGNCC14 Archive (Conference days only). 1294 Tweets, from 18/07/2014 07:25:47 BST to 21/07/2014 10:17:15 BST.
Sheet 2. Friday 18 July 2014. 469 Tweets, from 18/07/2014 07:25:47 BST to 18/07/2014 21:27:23 BST.
Sheet 3. Saturday 19 July 2014. 390 Tweets, from 19/07/2014 06:54:24 BST to 19/07/2014 18:01:05 BST.
Sheet 4. Sunday 20 July 2014. 433 Tweets, from 20/07/2014 01:41:11 BST to 21/07/2014 10:17:15 BST.
Tweets were collected under local London, UK time (BST). Times in GMT are also included.
Only users with at least 2 followers were included in the archive. Retweets have been included. An initial automatic deduplication was performed. I manually organised the Tweets in the archive into conference days and quantified them.
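The deduplication and day-by-day organisation described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual process used for this file: it assumes TAGS-style rows with an `id_str` column (the unique tweet id) and a `created_at` column in the DD/MM/YYYY HH:MM:SS local-time format shown in the sheet descriptions above.

```python
from datetime import datetime

def deduplicate(rows):
    """Keep only the first occurrence of each tweet id (TAGS sheets
    carry a unique 'id_str' column per tweet)."""
    seen = {}
    for row in rows:
        seen.setdefault(row["id_str"], row)
    return list(seen.values())

def split_by_day(rows, fmt="%d/%m/%Y %H:%M:%S"):
    """Group tweets into calendar days using the 'created_at' column,
    here assumed to be DD/MM/YYYY HH:MM:SS in local (BST) time."""
    days = {}
    for row in rows:
        day = datetime.strptime(row["created_at"], fmt).date()
        days.setdefault(day, []).append(row)
    return days
```

Each resulting group can then be copied to its own sheet, as with the Friday, Saturday and Sunday sheets above.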
Please note that both research and experience show that the Twitter search API isn’t 100% reliable. Large tweet volumes affect the search collection process. The API might “over-represent the more central users”, not offering “an accurate picture of peripheral activity” (González-Bailón, Sandra, et al. 2012). It is not guaranteed that the file contains each and every Tweet tagged with #IGNCC14 during the indicated period, and it is shared for comparative and indicative educational and research purposes only.
Please note the data in this file is likely to require further refining and even deduplication. The data is shared as is. This dataset is shared to encourage open research into scholarly activity on Twitter. If you use or refer to this data in any way please cite and link back using the citation information above.
Below I share some charts I created with my #dhsi2014 archive (as I have indicated, the data might require further refining; visualisations, including the infographic in the second link above, offer a view of specific data; also, this is a work in progress done for rapid publication and sharing, and I am aware there might be errors, etc.).
You can click on the charts to enlarge them.
[The #dhsi2014 tweets per day in archive bar chart had a typo so have removed to correct; will replace as soon as I can].
[*I had forgotten to say that 770 Tweets from one single user does indeed seem disproportionate in context. I will check the reason; deduplication might be needed, or maybe too many RTs?]
This is just a quick snippet to jot down some ideas as some kind of follow-up to my blog post on the Ethics of researching Twitter datasets republished today [28 May 2014] by the LSE Impact blog.
If you have ever tried to keep up with Twitter you will know how hard it is. Tweets are like butterflies: one can only really look at them for long if one pins them down, out of their natural environment. The reason we have access to Twitter in any form is Twitter’s API, which stands for “Application Programming Interface”. As Twitter explains it,
“An API is a defined way for a program to accomplish a task, usually by retrieving or modifying data. In Twitter’s case, we provide an API method for just about every feature you can see on our website. Programmers use the Twitter API to make applications, websites, widgets, and other projects that interact with Twitter. Programs talk to the Twitter API over HTTP, the same protocol that your browser uses to visit and interact with web pages.”
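To make the “programs talk to the Twitter API over HTTP” idea concrete, here is a small sketch of how a program would build such a request. The endpoint path reflects the v1.1 search API as documented around this period; the function and parameter names are illustrative, and a real call would also need OAuth authentication headers, omitted here.

```python
from urllib.parse import urlencode

def search_url(hashtag, count=100, since_id=None):
    """Build the HTTP URL a program would request from Twitter's
    (v1.1-era) search endpoint to fetch tweets for a hashtag.
    'since_id' asks only for tweets newer than a given tweet id."""
    params = {"q": "#" + hashtag, "count": count}
    if since_id is not None:
        params["since_id"] = since_id
    return ("https://api.twitter.com/1.1/search/tweets.json?"
            + urlencode(params))
```

A tool like TAGS essentially issues requests of this shape on a schedule and writes the JSON responses into spreadsheet rows.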
You might also know that free access to historic Twitter search results is limited to the last 7 days. This is due to several reasons, including the incredible amount of data that is requested from Twitter’s API, and (this is an educated guess) it is not disconnected from the fact that Twitter’s business model relies on its data being a commodity that can be resold for research. Twitter’s data is stored and managed by at least one well-known third party, Gnip, one of their “certified data reseller partners”.
For the researcher interested in researching Twitter data, this means that harvesting needs to be done not only in an automated fashion (needless to say, Storifying won’t cut it, even if your dataset is to be very small), but in real time.
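An automated, real-time harvester boils down to a polling loop that repeatedly asks for tweets newer than the last one archived. The sketch below shows the pattern only: `fetch_page` is a caller-supplied stand-in for an actual Twitter API call (its name and shape are my assumption, not part of any library), returning tweet dicts with an `id_str` field.

```python
def harvest(fetch_page, since_id=None):
    """Collect tweets incrementally. 'fetch_page' stands in for a real
    search API call: it takes the id of the newest tweet already
    archived and returns a list of newer tweets, or an empty list when
    nothing new is available."""
    archive = []
    while True:
        page = fetch_page(since_id)
        if not page:
            break  # a live collector would sleep and try again,
                   # since search results expire after about a week
        archive.extend(page)
        # remember the newest id so the next request skips what we have
        since_id = max(int(t["id_str"]) for t in page)
    return archive
```

This "since_id" pattern is why collection has to run continuously: stop the loop for more than a week and the gap can never be recovered from the free search API.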
As Twitter grew, its ability to satisfy the requests from countless services changed. Around August 2012 they announced that version 1.0 of their API would be switched off in March 2013. About a month later they announced the release of a new version of the API, which imposed new limitations and guidelines (what they call their “Developer Rules of the Road”). I am not a developer, so I won’t attempt to explain these changes like one. For a researcher, this basically means that there is no way to do proper research on Twitter data without understanding how it works at API level, and this means understanding the limitations and possibilities this imposes on researchers.
Taking how the Twitter API works into consideration, it is not surprising that González-Bailón et al (2012) should alert us that the Twitter Search API isn’t 100% reliable, as it “over-represents the more central users and does not offer an accurate picture of peripheral activity” (“Assessing the bias in communication networks sampled from twitter”, SSRN 2185134). What’s a researcher to do? The whole butterfly colony cannot be captured with the nets most of us have available.
In April 2010, the Library of Congress and Twitter signed an agreement providing the Library with an archive of public tweets from 2006 through April 2010. The Library and Twitter agreed that Twitter would provide all public tweets on an ongoing basis under the same terms. On 4 January 2013, the Library of Congress announced an update on their Twitter collection, publishing a white paper [PDF] that summarized the Library’s work collecting Twitter (we haven’t heard of any new updates yet). There they said that
“Archiving and preserving outlets such as Twitter will enable future researchers access to a fuller picture of today’s cultural norms, dialogue, trends and events to inform scholarship, the legislative process, new works of authorship, education and other purposes.”
To get an idea of the enormity of the project, the Library’s white paper says that
“On February 28, 2012, the Library received the 2006-2010 archive through Gnip in three compressed files totaling 2.3 terabytes. When uncompressed the files total 20 terabytes. The files contained approximately 21 billion tweets, each with more than 50 accompanying metadata fields, such as place and description.
As of December 1, 2012, the Library has received more than 150 billion additional tweets and corresponding metadata, for a total including the 2006-2010 archive of approximately 170 billion tweets totaling 133.2 terabytes for two compressed copies.”
To date, none of this data is yet publicly available to researchers. This is why many of us were very excited when on 5 February 2014 Twitter announced their call for “Twitter Data Grants” [closed on 15 March 2014]. This was/is a pilot programme [we haven’t heard anything about it yet either]. In the call, Twitter clarified that
“For this initial pilot, we’ll select a small number of proposals to receive free datasets. We can do this thanks to Gnip, one of our certified data reseller partners. They are working with us to give selected institutions free and easy access to Twitter datasets. In addition to the data, we will also be offering opportunities for the selected institutions to collaborate with Twitter engineers and researchers.”
“It’s worth stressing that Twitter’s initial pilot will be limited to a small number of proposals, but those who do get access will have the opportunity to “collaborate with Twitter engineers and researchers”. This isn’t the first time Twitter have opened data to researchers having made data available for a Jisc funded project to analyse the London Riot and while I expect Twitter end up with a handful of elite researchers/institutions hopefully the pilot will be extended.”
Most researchers out there are unlikely to benefit from access to huge Twitter data dumps. We are working with relatively small datasets, limited by the methods we use to collect, archive and study the data (and by our own disciplinary frameworks, [lack of] funding and other limitations). We are trying to do the talk whilst doing the walk: to conduct research on Twitter and about Twitter.
There should be no question now about how valuable Twitter data can be for researchers of perhaps all disciplines. Given the difficulty of properly collecting and analysing Twitter data as viewable from most Twitter Web and mobile clients (as most users get it), and the very short lifespan of search results, there is the danger of losing huge amounts of valuable historical material. As Jason Leavey (2013) says, “social media presents a growing body of evidence that can inform social and economic policy”, but
“A more sophisticated and overarching approach that uses social media data as a source of primary evidence requires tools that are not yet available. Making sense of social media data in a robust fashion will require a range of different skills and disciplines to come together. This is a process already taking shape in the research community, but it could be hastened by government.”
At the moment, unlimited access to the data has been the privilege of a few lucky individuals and elite institutions.
So, why collect and share Twitter data?
In my case, Martin Hawksey’s Twitter Archive Google Spreadsheet has provided a relatively simple method to collect some tweets from academic Twitter backchannels (for an example, start with this post and this dataset). I have been steadily collecting them for qualitative and quantitative analysis and for archival and historical reasons since at least 2010.
My interest is also to share this data with the participants of the studied networks, in order to encourage collaboration, interest, curiosity, wider dissemination, awareness, reproducibility of my own findings and, ideally, further research. For the individual researcher there is a wealth of data out there that, within the limitations imposed by the Twitter API and the archival methods at our disposal, can be saved and made accessible before it disappears.
Figshare has been a brilliant addition to my Twitter data research workflow, enabling me to get a Digital Object Identifier for my uploaded outputs, and useful metrics that give me a smoke signal that I am not completely mad and alone in the world.
I believe that you should cite data in just the same way that you cite other sources of information, such as articles and books. Sharing research data openly can have several benefits, including:
providing public evidence of social media outputs (not static, subject to modification, deletion) as static records, for public verification, assessment
enabling easy research reuse and reproducibility of research source data
allowing the reach of data to be measured or tracked
strengthening research networks and fostering exchange and collaboration.
Finally, some useful sources of information that have inspired me to share small data sets are:
This coming academic year with my students at City, University of London I am looking forward to discussing and dealing practically with the challenges and opportunities of researching, collecting, curating, sharing and preserving data such as the kind we can obtain from Twitter.
If you’ve read this far you might be interested to know that James Baker (British Library) and I will lead a workshop at the dhAHRC event ‘Promoting Interdisciplinary Engagement in the Digital Humanities’ [PDF] at the University of Oxford on 13 June 2014.
This session will offer a space to consider the relationships between research in the arts and humanities and the use and reuse of research data.
Some thoughts on what research data is, the difference between available and useable data, mechanisms for sharing, and what types of sharing encourage reuse will open the session.
Through structured group work, the remainder of the session will encourage participants to reflect on their own research data, to consider what they would want to keep, to share with restrictions, or to share for unrestricted reuse, and the reasons for these choices.
Update: for some recent work with a small Twitter dataset,
Originally, UKSG stood for the United Kingdom Serials Group. Now that their geographic appeal has grown beyond the UK, and the scope has broadened to include e-books, e-learning and other e-resources as well as serials and e-journals, UKSG have stopped expanding the acronym.
I was honoured to participate in the morning plenary on Tuesday 15 April 2014 9:30-10:30 AM BST. My title was “The Impacts of ‘Impact’: challenges and opportunities of ‘multichannel’ academic work”. You can now see it on UKSG’s YouTube channel… [embedded below].
The conference had a lively backchannel under the #uksglive hashtag. I archived the tweets using Martin Hawksey’s Twitter Archiving Google Spreadsheet (TAGS).
Some insights from the conference’s backchannel:
Number of tweets in archive started 09/04/2014 17:14:33 BST; last tweet in archive 16/04/2014 18:24:45 BST:
Twitter Activity during the 3 days of the conference:
Top tweeters, 9-16 April 2014:
I have shared the source data on figshare as a CSV file containing tweets tagged with #uksglive from Friday 11 April 2014 12:00:51 +0000 to Wednesday 16 April 2014 17:24:45 +0000. The dates in the CSV file are GMT (not BST).
The original archive contained tweets dating back to 9 April 2014 but for relevance this dataset concentrates on the main activity immediately before, during and a few hours after the actual conference. Some of the data has been cleaned but duplications and even one or two spam tweets might have remained. The data is shared as is.
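Since the CSV timestamps are GMT while the conference ran on BST, anyone reusing the data may want to shift them. A minimal sketch of that conversion, assuming timestamps in Twitter’s `created_at` style (e.g. `Wed Apr 16 17:24:45 +0000 2014`) and using the fact that BST is simply GMT+1:

```python
from datetime import datetime, timedelta

def gmt_to_bst(stamp, fmt="%a %b %d %H:%M:%S +0000 %Y"):
    """Parse a GMT timestamp in Twitter's 'created_at' style and shift
    it forward one hour, since British Summer Time is GMT+1."""
    return datetime.strptime(stamp, fmt) + timedelta(hours=1)
```

Applied over the `created_at` column of the CSV, this restores the local conference times.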
Please note there was also some Twitter activity around the conference using the hashtags #uksg and #uksg14, but those tweets were not included in this collection.
If you find this data useful and/or use it for your research, please kindly cite this file as indicated above and share it openly with others. Please feel free to get in touch via Twitter @ernestopriego or by sending me an email via my contact page on this blog.
Research Information published an article based on my UKSG presentation. Read it here.
Version 1.14. Written and published quickly; editing is ongoing. Comments had been accidentally disabled and are now enabled. If you are re-visiting this post, please refresh/reload your browser to ensure you see the latest version.
Update: Via Twitter, Amber Thomas recommends Into the wild – Technology for open educational resources (December 2012), edited by Lorna M. Campbell, Phil Barker, Martin Hawksey and herself (open access). Thank you Amber!
I should perhaps clarify that in this post I am thinking of “research images” in the sense of charts, cartoons, doodles, infographics, posters etc. created by researchers/teachers/artists etc. and shared online. These images allow the inclusion of contextual text in the form of non-intrusive captions. I appreciate that photographs shared online, particularly when published online immediately after being taken, pose different problems.
I’ve also been thinking that researchers could be encouraged to share any research images we create on repositories of Open Educational Resources, which could contribute to creating awareness of licensing issues.
Attribution seems to me to be a key currency in scholarship (since direct financial reward for the creation/publishing of open content is rare). Therefore embedded licenses and self-archiving in repositories that offer a clear open licensing framework could be positive developments in the fostering of an academic culture that a) encourages sharing, b) recognises the work involved in sharing open resources, and c) attributes online sources.
Recently I’ve been thinking a lot about attribution in the scholarly context of our days. Having done research for Altmetric, for example, made me very sensitive to the differences in the way different disciplines and cultures behave online in relation to sharing, commenting and attributing research online.
When I conceived The Comics Grid, I was primarily concerned with establishing innovative mechanisms for addressing the need for online comics scholarship in which original and annotated comics pages were shown without being deterred by copyright. Part of the project included helping develop critical awareness of how we cite different sources, including ‘non-traditional’ sources like comic books, cartoons, blog posts, online videos…
As I mentioned in my Forms of Innovation workshop session last Saturday in Durham, the World Wide Web is not the Wild Wild West, even if sometimes it definitely feels like that, a kind of no-man’s land where everyone takes whatever they want, even, perhaps surprisingly, in scholarly circles. I believe that Creative Commons licenses are an ideal way to develop a culture of ethical sharing and attribution.
Licenses by themselves cannot stop people from using content created by others in ways the licenses preclude, but they can be used in a court of law if there is evidence of misuse. This means that open licenses cannot by themselves make people act ethically: even when there is due licensing, with attribution and granted or reserved rights clearly stipulated, people can still act wrongly. The same happens with the law. So using and promoting Creative Commons licenses is only the beginning of helping create a different culture in which the World Wide Web is no longer the Wild Wild West; for this culture to be really effective, it needs to become gradually pervasive.
In the UK, a new Enterprise and Regulatory Reform Act, known as “The Instagram Act”, has just been passed. Images found online that do not contain clear attribution can be considered ‘orphan works’ and therefore fall into the public domain (so anything goes with that content). Read about it here.
Earlier today, Amber Thomas from the University of Warwick tweeted a concern about infographics: “my problem with infographic practice is lack of provenance. hard to cite, lacking in publication date, rarely a clear copyright statement.” (Tweet, 1 May 2013; 11:29am GMT ).
A.J. Cann replied that “publishing on http://Figshare.org would fix all that” (Tweet, 1 May 2013: 11:34am GMT). He is right (I also talked about Figshare as a means to ensure content is properly attributed, cited and licensed in my presentation at Durham), but later I thought that perhaps that was not enough: files made to be shared online should include the attribution, citation and licensing information in the file itself.
Indeed, figshare helps by providing a digital object identifier, citation and licensing information, but once the file is downloaded it can be shared further, endlessly, separated from that context. If clear attribution and licensing are not included in the file, how many people will actually trace it back to the site it was originally made available from, where the attribution and licensing information appears? Hence the need for this information to be included in the file itself, not only on the figshare page from which people download it.
In the case of images this does not have to be a horrible watermark that compromises the artistic integrity of the image and renders it practically useless, and I’m not talking about some kind of digital rights management thing or restrictive permissions. Simply a clear legend explaining who is the author and in what terms the file is being shared, as a caption at the bottom of the image, in small but legible print. This information can/should be ideally included in the file’s properties too as metadata.
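A caption plus file-level metadata of the kind described above can be added programmatically. Below is a minimal sketch using the Pillow imaging library (an assumption on my part, not a tool mentioned in this post): it pastes the image onto a slightly taller white canvas, draws the attribution legend in the new strip at the bottom, and embeds the same text as a PNG metadata field so it travels with the file.

```python
from io import BytesIO

from PIL import Image, ImageDraw
from PIL.PngImagePlugin import PngInfo

def caption_and_license(img, caption):
    """Return a copy of 'img' with a small attribution caption drawn
    under the image and the same text embedded as PNG metadata."""
    w, h = img.size
    strip = 18  # height in pixels of the white caption strip
    out = Image.new("RGB", (w, h + strip), "white")
    out.paste(img, (0, 0))
    # small but legible print at the bottom of the image
    ImageDraw.Draw(out).text((4, h + 3), caption, fill="black")
    # embed the same legend in the file's own metadata
    meta = PngInfo()
    meta.add_text("Copyright", caption)
    buf = BytesIO()
    out.save(buf, "PNG", pnginfo=meta)
    buf.seek(0)
    return Image.open(buf)
```

Because both the visible caption and the metadata live in the file itself, they survive the drag-and-drop copying discussed below, unlike a legend left only in the hosting page’s HTML.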
We notice from the URL that the image file is hosted at http://annfriedman.com/, which happens to be a site made with Tumblr. At that URL, the image file is orphaned from any context beyond ‘Tumblr’, the name of the blog (‘annfriedman’) and the URL itself. I suspect many users will get there, see the image and stop there: they won’t necessarily make an effort to find out who made it or under what kind of license it has been shared online.
Granted, the image file URL, on its own, shows us the name of a person and the name of the Tumblr log (“annfriedman”), but what is crucial here is that the image file itself does not contain a caption indicating any authorship, attribution or licensing information, nor descriptive metadata, in human-readable form, of what it is. One has to do a “diligent search” to find the actual blog post with the contextual information, and even then there is no indication whatsoever of how we as readers/visitors/users are allowed to use the image file in question (which has everything it takes to go viral, if you ask me). If one scrolls down, though, one finds the legend “Copyright 2012 Ann Friedman” at the bottom right corner of the Web site’s footer, but not in the post itself and, as I’ve said, not in the image file itself.
Copying “The Disapproval Matrix” is as easy as dragging and dropping. Folk are already sharing the link to the image file, not the link to the blog post that contains the image file and explains that Ann Friedman created it, basing it on the “Approval Matrix” series from New York Magazine.
Now, this post is not about this particular image or its author. It is not a personal critique. I have also shared lots of images online which do not contain attribution and licensing information on the files themselves. I am making use of an example to make a point, about how images are easily reproduced online and about what authors can do about it, regardless if they care or not if they are attributed for their work.
This is what the Web does: it makes decontextualising extremely easy, and it demands an effort from users to locate source, authorship, ownership and/or licensing. As authors of content, we cannot assume that people surfing the Web will all do “diligent research” to find to whom an image or any other file (say, an academic paper in PDF or a PowerPoint presentation) belongs and how they can use it. The image file and the blog post providing context are very easily separable; the name in a Web resource’s title or URL is no clear indication of authorship, and we cannot just assume that people will make the effort to do “diligent research”.
The context we live in online is one of attention deficit and speed. Social media platforms allow, encourage and maximise decontextualisation and recontextualisation. Tumblr, Instagram, Twitter, Pinterest: a file that does not itself indicate its source and the other information required for citation (in the case of an image file, as a caption that is part of the image itself and of the file’s metadata, not just of the HTML of the resource hosting it) will always run the danger of becoming orphaned.
Needless to say, images can be edited using very basic software, PDFs can be annotated, slides containing attribution and license can be deleted, and so on. People wanting to steal content will do so no matter what. But we cannot act alarmed when our content ends up being shared and reused endlessly without our name if we have not taken some basic measures to ensure that anyone can know, easily, directly and very obviously, who created what and in which ways others are allowed to use it.