The 2018 Altmetric Top 100 Outputs with ‘Comics’ as Keyword

As it’s that time of the year and Altmetric has released its 2018 Top 100, in this post I share the 2018 Top 100 research outputs with ‘comics’ as a keyword according to Altmetric.

I queried the data from the Altmetric Explorer, looking for all outputs with this keyword between 13/12/2017 and 13/12/2018. I then refined the data to concentrate only on the Top 100 outputs about comics.

To see the complete Top 100, you can download the dataset I shared on figshare at https://doi.org/10.6084/m9.figshare.7467116.v1.

Below you can quickly take a look at the top 20 outputs with keyword “comics”, ordered by their Altmetric Attention Score:

Altmetric Attention Score | Title | Journal/Collection Title | Publication Date | DOI | Altmetric Details Page URL
524 | Ten simple rules for drawing scientific comics | PLoS Computational Biology | 04/01/2018 | 10.1371/journal.pcbi.1005845 | https://www.altmetric.com/details/31266263
286 | Comixify: Transform video into a comics | n/a | 09/12/2018 | n/a | https://www.altmetric.com/details/52485006
154 | Teaching Confidentiality through Comics at One Spanish Medical School | AMA Journal of Ethics | 01/02/2018 | 10.1001/journalofethics.2018.20.2.medu1-1802 | https://www.altmetric.com/details/32564583
99 | Bruised and Battered: Reinforcing Intimate Partner Violence in Comic Books | Feminist Criminology | 17/05/2018 | 10.1177/1557085118772093 | https://www.altmetric.com/details/41904868
84 | Of Microscopes and Metaphors: Visual Analogy as a Scientific Tool | The Comics Grid: Journal of Comics Scholarship | 10/10/2018 | 10.16995/cg.130 | https://www.altmetric.com/details/49471637
79 | The potential of comics in science communication | JCOM – Journal of Science Communication | 23/01/2018 | 10.22323/2.17010401 | https://www.altmetric.com/details/32104944
65 | Alter egos: an exploration of the perspectives and identities of science comic creators | JCOM – Journal of Science Communication | 16/01/2018 | 10.22323/2.17010201 | https://www.altmetric.com/details/31748235
61 | Using comics to change lives | The Lancet | 01/01/2018 | 10.1016/s0140-6736(17)33258-0 | https://www.altmetric.com/details/31292645
50 | The Question Concerning Comics as Technology: Gestell and Grid | The Comics Grid: Journal of Comics Scholarship | 24/09/2018 | 10.16995/cg.133 | https://www.altmetric.com/details/48839521
47 | A survey of comics research in computer science | n/a | 16/04/2018 | n/a | https://www.altmetric.com/details/37717650
41 | Is There a Comic Book Industry? | Media Industries | 05/06/2018 | 10.3998/mij.15031809.0005.102 | https://www.altmetric.com/details/43846275
38 | The Utility of Multiplex Molecular Tests for Enteric Pathogens: a Micro-Comic Strip | Journal of Clinical Microbiology | 24/01/2018 | 10.1128/jcm.01916-17 | https://www.altmetric.com/details/32171741
38 | Farting Jellyfish and Synergistic Opportunities: The Story and Evaluation of Newcastle Science Comic | The Comics Grid: Journal of Comics Scholarship | 20/03/2018 | 10.16995/cg.119 | https://www.altmetric.com/details/34631498
35 | Pitfalls in Performing Research in the Clinical Microbiology Laboratory: a Micro-Comic Strip | Journal of Clinical Microbiology | 25/09/2018 | 10.1128/jcm.01144-18 | https://www.altmetric.com/details/48881364
34 | Neural Comic Style Transfer: Case Study | n/a | 05/09/2018 | n/a | https://www.altmetric.com/details/47890394
31 | Comics and the Ethics of Representation in Health Care … | AMA Journal of Ethics | 01/02/2018 | 10.1001/journalofethics.2018.20.2.fred1-1802 | https://www.altmetric.com/details/32521484
29 | Undemocratic Layout: Eight Methods of Accenting Images | The Comics Grid: Journal of Comics Scholarship | 25/05/2018 | 10.16995/cg.102 | https://www.altmetric.com/details/42619367
29 | Communicating Science through Comics: A Method | Publications | 30/08/2018 | 10.3390/publications6030038 | https://www.altmetric.com/details/47265663
26 | Of Cornopleezeepi and Party Poopers: A Brief History of Physicians in Comics … | AMA Journal of Ethics | 01/02/2018 | 10.1001/journalofethics.2018.20.2.mhst1-1802 | https://www.altmetric.com/details/32529286
26 | On the Significance of the Graphic Novel to Contemporary Literary Studies: A Review of The Cambridge Companion to the Graphic Novel | The Comics Grid: Journal of Comics Scholarship | 19/09/2018 | 10.16995/cg.138 | https://www.altmetric.com/details/48647607
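The table above comes from the Explorer, which is a subscription service. For anyone wanting to look up the attention score of an individual output programmatically, Altmetric also offers a free, rate-limited details endpoint; here is a minimal Python sketch (an illustration of the public API, not how the Explorer data above was obtained):

```python
import requests

def attention_score(doi):
    """Return the current Altmetric Attention Score for a DOI,
    or None if Altmetric has no record of it (HTTP 404)."""
    r = requests.get("https://api.altmetric.com/v1/doi/" + doi)
    if r.status_code == 404:
        return None
    r.raise_for_status()
    return r.json().get("score")

# Example with the top output in the table above
print(attention_score("10.1371/journal.pcbi.1005845"))
```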


I am obviously very pleased to see The Comics Grid included in the Top 100.

It is interesting to note the diversity of countries associated with the profiles (where the metadata was available) giving attention to the outputs. According to Altmetric, there were 4,588 tweets about research outputs with ‘comics’ as keyword between 13/12/17 and 13/12/18, by 2,866 unique tweeters in 98 different countries. The map looks like this:

Countries and Number of Profiles that Gave Attention to Research Outputs with ‘Comics’ Keyword between 13/12/17 and 13/12/18 according to Altmetric. Chart by Altmetric Explorer.

I shared the countries data on figshare at https://doi.org/10.6084/m9.figshare.7467455.v1.

For more information and context on Altmetric and using the Altmetric Explorer, see my 2016 post here. Many other posts about alternative metrics and the Altmetric Explorer can be found throughout my blog.

References

Priego, Ernesto (2018): Altmetric Top 100 Outputs with ‘Comics’ Keyword between 13/12/17 and 13/12/18. figshare. Dataset. https://doi.org/10.6084/m9.figshare.7467116.v1

Priego, Ernesto (2018): Countries and Number of Profiles that Gave Attention to Research Outputs with ‘Comics’ Keyword between 13/12/17 and 13/12/18 according to Altmetric. figshare. Dataset. https://doi.org/10.6084/m9.figshare.7467455.v1

Questions of Access in the Digital Humanities: Data from JDSH

[On 8 August 2017, this post was selected as Editor’s Choice in Digital Humanities Now at http://digitalhumanitiesnow.org/2017/08/questions-of-access-in-the-digital-humanities-data-from-jdsh/]

[N.B. As usual, typos might still be present when you read this; this blog post is likely to be revised post-publication… thanks for understanding. This blog is a sandbox of sorts].

Para Domenico, siempre en deuda (For Domenico, always in debt)

tl;dr, scroll down to the charts

I used The Altmetric Explorer to locate any articles from the Journal of Digital Scholarship in the Humanities that had had any ‘mentions’ online at any time. An original dataset of 82 bibliographic entries was obtained. With the help of Joe McArthur, the Open Access Button API was then employed to detect whether any of the journal articles in the dataset had open access surrogates (for example, self-archived versions in institutional repositories) and, if so, which content they actually provided access to. The API located URLs for 24 of the 82 DOIs corresponding to the articles in the dataset.
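For those curious, here is a minimal Python sketch of how a list of DOIs can be run through the Open Access Button API in bulk. The endpoint and the "url" response field follow my reading of the public OA Button documentation, and the filenames are hypothetical; Joe ran the actual check, so treat this as an approximation only:

```python
import csv
import requests

# Endpoint and response fields follow the public Open Access Button
# API docs as I understand them; filenames are hypothetical.
API = "https://api.openaccessbutton.org/find"

with open("jdsh_dois.csv") as infile, \
     open("oab_results.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["doi", "oa_url"])
    for row in csv.reader(infile):
        doi = row[0]
        data = requests.get(API, params={"id": doi}).json()
        # "url" is only present when an open surrogate was located
        writer.writerow([doi, data.get("url", "")])
```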

I then edited and refined the original dataset to include only the top 60 results. Each result was manually refined and cross-checked to verify that the resulting links matched the correct outputs and what kind of content they provided access to, as well as to identify the type of license and type of access of each article’s version of record.

A breakdown of the findings below:

Visualisation of numeralia from the JDSH 60 Articles Altmetric-OA Button Dataset

(Note numbers re OA Button results will not add up as there are overlaps and some results belong to categories not listed).

It must be highlighted that only one of the links located via the Open Access Button API provided access to an article’s full version.

This disciplinarily-circumscribed example from a leading journal in the field of the digital humanities provides evidence for further investigations into the effects of publishers’ embargoes on the ability of institutional open access repositories to fulfil their mission effectively.

The dataset was openly shared on figshare as:

Priego, Ernesto (2017): A Dataset Listing the Top 60 Articles Published in the Journal of Digital Scholarship in the Humanities According to the Altmetric Explorer (search from 11 April 2017), Annotated with Corresponding License and Access Type and Results, when Available, from the Open Access Button API (search from 15 May 2017). figshare. https://doi.org/10.6084/m9.figshare.5278177.v3

 

The Wordy Thing

Back in 2014, we suggested that “altmetrics services like the Altmetric Explorer can be an efficient method to obtain bibliographic datasets and track scholarly outputs being mentioned online in the sources curated by these services” (Priego et al., 2014). That time, we used the Explorer to analyse a report obtained by searching for the term ‘digital humanities’ in the titles of outputs mentioned at any time up to the time of our query.

It’s been three years since I personally presented that poster at DH2014 in Lausanne, but the topic of publishing practices within the digital humanities remains of great interest to me. It could be thought of as extreme academic navel-gazing, this business of deciding to look into bibliometric indicators and metadata of scholarly publications. For the digital humanities, however, questions of scholarly communications are questions of methodology, as the technologies and practices required for conducting research and teaching are closely related to the technologies and practices required to make the ‘results’ of teaching and research available. For DH insiders, this is closely connected to the good ol’ less-yacking-more-hacking, or rather, no yacking without hacking. Today, scholarly publishing is all about technological infrastructure, or at least about an ever-growing awareness of the challenges and opportunities of ‘hacking’ the modes of scholarly production.

Moreover, the digital humanities have also long been preoccupied with the challenges of getting digital scholarship recognised and rewarded and, just as importantly, with the difficulties of ensuring the human, technical and financial preconditions of sustainability. Scholarly publishing, or more precisely ‘scholarly communications’ as we prefer to say today, is also very much focused on those same concerns. If form and content are unavoidably interlinked and codependent in digital humanities practice, surely issues regarding the so-called ‘dissemination’ of said practice through publications remain vital to its development.

Anyway, I have now finally been able to share a dataset based on a report from the Altmetric Explorer looking into the articles published in the Journal of Digital Scholarship in the Humanities (from now on JDSH), one of the (if not the) leading journals in the field of digital humanities (it was previously titled Literary and Linguistic Computing). I first started looking into which JDSH articles were being tracked by Altmetric as mentioned online for the event organised by Domenico Fiormonte at the University Roma Tre in April this year (the slides from my participation are here).

My motivation was not only to identify which JDSH outputs (and therefore authors, affiliations, topics, methodologies) were receiving online attention according to Altmetric. I wanted, as we had done previously in 2014, to use an initial report to look into what kind of licensing said articles had: whether they were ‘free to read’, paywalled or labelled with the orange open lock that identifies Open Access outputs.

Back in 2014 we had neither the Open Access Button nor its plugin and API. With it, I could now check whether any of the articles in my dataset had openly/freely available versions through the Button. I contacted Joe McArthur from the Button to enquire whether it would be possible to run a list of DOIs through their API in bulk. It was, and we obtained some results.

Here are a couple of very quick charts visualising some insights from the data.

It should also be highlighted that of the 6 links to institutional repository deposits found via the Open Access Button API, only one gave open access to the full version of the article. The rest were either metadata-only deposits or the full versions were embargoed.

As indicated above, the 60 ‘total articles’ refers to the number of entries in the dataset we are sharing. There are many more articles published in JDSH. The numbers presented represent only the data in question, which is in turn the result of particular methods of collection and analysis.

In 2014 we detected that “the 3 most-mentioned outputs in the dataset were available without a paywall”, and we thought that could indicate “the potential of Open Access for greater public impact.” In this dataset, the three articles with the most mentions are also available without a paywall. The most mentioned article is the only one in the set that is licensed with a CC-BY license. The two that follow are ‘free’ articles that require permission for reuse.

The data presented is the result of the specific methods employed to obtain it. In this sense this data represents as much a testing of the technologies employed as of the actual articles’ licensing and open availability. This means that data in columns L-P reflect the data available through the Open Access Button API at the moment of collection. It is perfectly possible that ‘open surrogates’ of the articles listed are available elsewhere through other methods. Likewise, it is perfectly possible that a different corpus of JDSH articles collected through other methods (for example, of articles without any mentions as tracked by Altmetric) has a different proportion of license and access types, etc.

As indicated above, the licensing and access type of each article were identified and added manually and individually. Article DOIs were accessed one by one in a web browser outside university library networks, as the intention was to verify whether any of the articles were available to the general public without university library network/subscription credentials.

This blog post and the deposit of the data are part of a work in progress, shared openly to document ongoing work and to encourage further discussion and analyses. It is hoped that quantitative data on the limited level of adoption of Creative Commons licenses and institutional repositories within a clearly circumscribed corpus can motivate reflection and debate.

Acknowledgements

I am indebted to Joe McArthur for his kind and essential help cross-checking the original dataset with the OA Button API, and to Euan Adie and all the Altmetric team for enabling me to use the Altmetric Explorer to conduct research at no cost.

Previous Work Mentioned

Priego, Ernesto; Havemann, Leo; Atenas, Javiera (2014): Online Attention to Digital Humanities Publications (#DH2014 poster). figshare. https://doi.org/10.6084/m9.figshare.1094345.v1 Retrieved: 18:46, Aug 04, 2017 (GMT).

Priego, Ernesto; Havemann, Leo; Atenas, Javiera (2014): Source Dataset for Online Attention to Digital Humanities Publications (#DH2014 poster). figshare. https://doi.org/10.6084/m9.figshare.1094359.v5 Retrieved: 17:52, Aug 04, 2017 (GMT)

Priego, Ernesto (2017): Aprire l’Informatica umanistica / Abriendo las humanidades digitales / Opening the Digital Humanities. figshare. https://doi.org/10.6084/m9.figshare.4902995.v1 Retrieved: 18:00, Aug 04, 2017 (GMT)

On UK Labour and Conservatives Tweet Sources

 

I’ve been tracking the Twitter accounts of the UK Labour, Conservative, Green, and LibDem parties as we approach June the 8th (General Election). I am interested in what they are saying on Twitter through their official Twitter accounts, how they are saying it, how often, and which apps they choose to do so.

Unfortunately there are still some duplicates in my Twitter data collection, but I can at least share at this point the sources used to tweet from the UK Labour and Conservative Twitter accounts, along with some indicative numbers (which may vary slightly) of tweets per source, based on a sample of 500 Tweets per account from 12/05/2017 to 01/06/2017 so far.

 

from_user: UKLabour

Source | Count
MediaStudio | 279
SproutSocial | 106
TweetDeck | 55
Twibbon | 1
Twitter for Android | 3
Twitter for iPhone | 8
Twitter Web Client | 48
Total | 500

from_user: Conservatives

Source | Count
MediaStudio | 25
TweetDeck | 222
Twitter for iPhone | 73
Twitter Web Client | 180
Total | 500

Even bearing in mind that the sample of 500 tweets from each account may still contain some duplicates, the list of sources alone provides an objective indication of each account’s social media management tool preferences. Something that stands out is that, in comparison to, say, the realdonaldtrump account, none of these tweets were posted from Twitter Ads.

The source list indicates to me that UK Labour has attempted a more professional social media management strategy, with a reduced number of tweets from Android, iPhone and the Web Client, whereas the Conservatives have a majority of tweets coming from free & anyone-can-use apps, with no shortage of tweets coming from an iPhone (but no Android at all).

This short update is part of an ongoing lunchtime pet project for which I wish I had more time, but hey.  I also have data from the other political parties, but no time right now. Anyway, for what it’s worth, I thought I’d share.


N.B. Dear Guardian Data, in case you like what you see here and you ‘borrow’ the idea or any data… please kindly attribute and link back. It’s only polite to do so. Thank you!

Exeunt Android; Enter Ads: An Update on the Sources of Presidential Tweetage

 

A quick update, as something I consider interesting has emerged from the ongoing archiving of the, er, current ‘Trumpian’ tweetage (see a previous post here). In case you do follow this blog you may be aware I’ve been keeping an eye on the ‘source’ of the Tweets: a metadata field pertaining to each published Tweet which Twitter makes publicly visible to anyone through certain applications like TweetDeck and directly through Twitter’s API (for Twitter’s ‘Field Guide’, see this).

Given the diversity of sources detected in the Tweets from the account under scrutiny in the past, hypotheses have been proposed suggesting correlations between type of content and source (the application used to post Tweets); others have suggested that the diversity of sources is also an indication of different people behind the account (though, as we have said previously, it is also possible that the same person tweets from different devices and applications).

Anyway, here are some new insights emerging from the data since the last post:

  • Since Inauguration Day (20 January 2017), the last Tweet coming from Twitter for Android so far was timestamped 25/03/2017 10:41 (AM; DC time). No Tweets from Android have been posted between that Tweet and the time of writing.
  • The last Tweet coming from the Twitter Web Client so far was timestamped 25/01/2017 19:03:33. No more Tweets with the Web Client as source have been posted (or collected by my archive) since then.
  • Since Inauguration Day, the Tweet timestamped 31/03/2017 14:30:38 was the first one to come from Twitter Ads. Since then 21 Tweets have been posted from Twitter Ads, the most recent so far timestamped 17/05/2017 16:36:02.
  • During April and May 2017 Tweets have only come from Twitter for iPhone or Twitter Ads. The account in question has tweeted every single day throughout May until today, 18 May 2017, a total of 90 Tweets so far (including a duplicated one in which a typo was corrected). Below is a breakdown per source:

 

from_user | Month | Source | Count
realDonaldTrump | May | Twitter for iPhone | 81
realDonaldTrump | May | Twitter Ads | 9
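For anyone wanting to reproduce this kind of per-source breakdown, here is a minimal pandas sketch (assuming a CSV archive with created_at and source columns; 'tweets.csv' is a hypothetical filename):

```python
import pandas as pd

# Assumes an archive CSV with 'created_at' and 'source' columns;
# 'tweets.csv' is a hypothetical filename.
df = pd.read_csv("tweets.csv", parse_dates=["created_at"])

# Tweets per source for May 2017
may = df[(df["created_at"].dt.year == 2017) & (df["created_at"].dt.month == 5)]
print(may["source"].value_counts())

# Last Tweet per source, e.g. to spot when Android went quiet
print(df.groupby("source")["created_at"].max())
```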

 

As a keen Twitter user I personally find it interesting that Twitter for Android has stopped being used by the account in question and that Twitter Ads has been used recently (instead?) in alternation with the Tweets from iPhone. Quickly eyeballing the dataset appears to indicate a potential correlation between Twitter Ads and Tweets with links and official announcements (rather than statements/opinions), but that requires closer examination and I will have to leave it for another time.


*Public note to self: I need to get rid of this habit of capitalising ‘Tweets’ as a noun… it becomes annoying.

People, Government: Top 300 Terms in the Conservative and Labour Manifestos 2017 (Counts and Trends)

A word cloud of the most frequent 500 terms in the Conservative Manifesto 2017. Word cloud created with Voyant Tools.

The Labour and Conservative Manifestos 2017 are arguably two of the most important public documents in the UK these days. I have just deposited the following data on figshare:

Priego, Ernesto (2017): Top 300 Terms in the Conservative and Labour Manifestos 2017 (Counts and Trends). figshare. https://doi.org/10.6084/m9.figshare.5016983.v1

I thought some may be interested in practising some distant reading, or in having some fun composing their own Manifesto…

Android vs iPhone: Source Counts and Trends in a Bit More than a Year’s Worth of Trumpian Tweetage

Last month I took a quick look at a month’s worth of Trumpian tweetage (user ID 25073877) using text analysis. Using a similar methodology, I have now prepared and shared a CSV file containing Tweet IDs and other metadata of 3,805 Tweets from user ID 25073877 posted publicly between Thursday February 25 2016 16:35:12 +0000 and Monday April 03 2017 12:51:01 +0000. I deposited the file on figshare, including notes on motivation and methodology, here:

3805 Tweet IDs from User 25073877 [Thu Feb 25 16:35:12 +0000 2016 to Mon Apr 03 12:51:01 +0000 2017].

figshare. https://doi.org/10.6084/m9.figshare.4811284.v1

The dataset allows us to count the sources for each Tweet (i.e. the application used to publish each Tweet according to the data provided by the Twitter Search API). The resulting counts are:

Source Tweet Count
Twitter for iPhone 1816
Twitter for Android 1672
Twitter Web Client 287
Twitter for iPad 22
Twitter Ads 3
Instagram 2
Media Studio 2
Periscope 1

As we have seen in previous posts, the account has alternated between iPhone and Android since the Inauguration. I wanted to look at relative trends throughout the dataset. Having prepared the main dataset, I performed a text analysis of a document comprising the source listing arranged chronologically by date and time of Tweet publication, corresponding to Tweets published between 25 February 2016 and Monday 3 April 2017. Using the Trends tool in Voyant, I divided the document into 25 segments, so that each roughly represents a monthly period covered in the listing, highlighting relative frequency trends per source in each segment.

The Trends tool shows a line graph depicting the distribution of a word’s occurrence across a corpus or document; in this case each word represents the source of a Tweet in the document. Each line in the graph is coloured according to the word it represents; at the top of the graph a legend displays which words are associated with which colours. I only included the most-used sources, leaving iPad there as a reference.
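For readers without access to Voyant, here is a rough Python approximation of what I understand the Trends tool to compute (an assumption of mine, not Voyant's actual implementation):

```python
from collections import Counter

def trend(labels, term, segments=25):
    """Relative frequency of `term` in each of `segments` slices of a
    chronologically ordered list of source labels. A rough approximation
    of my reading of what the Trends tool computes, not Voyant's code."""
    size = max(1, len(labels) // segments)
    chunks = [labels[i:i + size] for i in range(0, len(labels), size)]
    return [Counter(chunk)[term] / len(chunk) for chunk in chunks]

# labels would be e.g. ["Twitter for Android", "Twitter for iPhone", ...]
# in order of Tweet publication
```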

The resulting graph looks like this:

Line graph of the relative frequencies of the four most used sources visualised in 25 segments of a document including 3,805 Tweets from user ID 25073877 dated between Thursday February 25 2016 16:35:12 +0000 and Monday April 03 2017 12:51:01 +0000. Data collected and analysed by Ernesto Priego. CC-BY. Chart made with Trends, Voyant Tools by Stéfan Sinclair & Geoffrey Rockwell (CC-BY 2017).

I enjoyed this article by Christopher Ingraham (Washington Post Weblog, 3 April 2017), and I envy his access to the whole Trumpian tweetage dataset, which would be essential to attempt to reproduce the analysis presented. The piece focuses on the use of exclamation marks (something I took an initial look at in my 6 February 2017 post), but it would be useful to take a closer look at any potentially significant correlations between the use of language in specific Tweets and the sources used to post those Tweets.

The article also has an embedded video titled ‘When it’s actually Trump tweeting, it’s way angrier’, repeating claims that there is a clear difference between those Tweets the account in question published from an iPhone and those published from an Android. I briefly referred to this issue in my 15 March 2017 post, and I have not yet seen evidence that it is a staffer who actually posts from Twitter for iPhone from the account. I may be completely wrong, but I am still not convinced there is data-backed evidence to say for certain that Tweets from different sources are always tweeted by two or more different people, or that the differences in language per source are predictable and reliably attributable to a single specific person (the same people can, after all, tweet from the same account using different devices and applications, and indeed potentially use different language/discourse/tone). Anecdotal, I know, but I have noticed that sometimes my tweetage from the Android mobile app is different from my tweetage from TweetDeck on my Mac, and no regular patterns can be inferred there.

I do not necessarily doubt there is more than one person using the account, nor that the language used may vary significantly depending on the Tweets’ source. What I’d like to see, however, is more robust studies demonstrating and highlighting correlations between language use in Tweets’ texts and Tweets’ sources from the account in question, taking into consideration that the same users can own different devices and use different language strategies depending on a series of contextual variables. Access to the source data of said studies should be considered essential for any assessment of the results or conclusions provided. Limitations on and opposition to more open sharing of Twitter data for research reproducibility are just one hurdle on the way to more scholarship in this area.

Android vs iPhone: Trends in a Month’s Worth of Trumpian Tweetage

What’s in a month’s worth of presidential tweetage?

I prepared a dataset containing a total of 123 public Tweets and corresponding metadata from user_id_str 25073877 between 15 February 2017 06:40:32 and 15 March 2017  08:14:20 Eastern Time (this figure does not factor in any tweets the user may have deleted shortly after publication). Of the 123 Tweets 68 were published from Android; 55 from iPhone. The whole text of the Tweets in the dataset accounts for 2,288 words, or 12,364 characters (no spaces; including URLs).

Using the Trends tools from Voyant Tools by Stéfan Sinclair & Geoffrey Rockwell I visualised the raw frequencies of the terms ‘Android’ and ‘iPhone’ in this dataset over 30 segments (more or less corresponding to the length of the month covered in the dataset) where each timestamped Tweet, sorted in chronological order, had its corresponding source indicated.

The result looked like this:

Raw frequency of Tweets per source in 30 segments by realdonaldtrump between 15 February 2017 06:40:32 and 15 March 2017 08:14:20 Eastern Time. Total: 123 Tweets: 68 from Android; 55 from iPhone. Data collected and analysed by Ernesto Priego. CC-BY. Chart made with Trends, Voyant Tools by Stéfan Sinclair & Geoffrey Rockwell (CC 2017).

The chart does indeed reflect the higher number of Tweets from Android, and it also shows that over the whole document both sources are present throughout, in spite of more frequent absences of Tweets from iPhone. The question, as usual, is what this tells us. Back on 9 August 2016 David Robinson published an insightful analysis in which he concludes that “he [Trump] writes only the (angrier) Android half”. With the source data I have gathered so far it would be possible (given the time and right circumstances) to perform a content analysis of Tweets per source, in order to confirm or reject any potential correlations between types of Tweets (re: tone, function, sentiment, time of day) and the source used to post them.

Eyeballing the data, specifically since Inauguration Day until the present, does not seem to provide unambiguous evidence that the Tweets are undoubtedly written by two different persons (or more). What is factual is that the Tweets do come from different sources (see my previous post), but at the moment, as with everything else this administration has been doing, my cursory analysis has only found conflicting insights, where for example a Tweet one would perhaps have expected to have been posted from iPhone (attributable hypothetically to a potentially less inflammatory aide) was in fact posted from Android, and vice versa.

I may be wrong, but at the moment I cannot see any evidence of any kind of predictable pattern, let alone strategy, behind the alternation between Android and iPhone (the only two types of source used to publish Tweets from the account in question in the last month). Most of the time, Tweets from one source type come in sequences of four or more, but sometimes a random lone Tweet from a different source is sandwiched in between.

More confusingly, all of the Tweets published between 08/03/2017 18:50 and 15/03/2017 08:14:20 have had only iPhone as source, without exception. Attention to detail is required to run robust statistical and content analyses that consider complete timestamps and further code the Tweet text and time data into more discrete categories, attempting a high level of granularity at both the temporal (time of publishing; ongoing documented events) and textual (content; discourse) levels. (If you are reading this and would like to take a look at the dataset, DM me via Twitter.)

Anyway. In case you are curious, here are the top 20 most frequent words in the text of the tweets, per source, in this dataset (15 February 2017 06:40:32 to 15 March 2017 08:14:20 Eastern Time). Analysis courtesy of Voyant Tools, applying a customised English stop words list (excluding Twitter-specific terms like rt, t.co, https, etc., but leaving terms in hashtags).

Android iPhone
Term Count Trend Term Count Trend
fake 11 0.007795889 great 16 0.016129032
great 11 0.007795889 jobs 14 0.014112903
media 10 0.007087172 america 6 0.006048387
obama 10 0.007087172 trump 6 0.006048387
election 9 0.006378455 american 5 0.005040322
just 9 0.006378455 join 5 0.005040322
news 9 0.006378455 big 4 0.004032258
big 8 0.005669738 healthcare 4 0.004032258
failing 6 0.004252303 meeting 4 0.004032258
foxandfriends 6 0.004252303 obamacare 4 0.004032258
president 6 0.004252303 thank 4 0.004032258
russia 6 0.004252303 u.s 4 0.004032258
democrats 5 0.003543586 whitehouse 4 0.004032258
fbi 5 0.003543586 address 3 0.003024194
house 5 0.003543586 better 3 0.003024194
new 5 0.003543586 day 3 0.003024194
nytimes 5 0.003543586 exxonmobil 3 0.003024194
people 5 0.003543586 investment 3 0.003024194
white 5 0.003543586 just 3 0.003024194
american 4 0.002834869 make 3 0.003024194

Android vs iPhone: Most Frequent Words from_user_id_str 25073877 Per Source

I have archived 3,603 public Tweets from_user_id_str 25073877 published between 27/02/2016 00:06 and 27/02/2017 12:06 (GMT -5, Washington DC Time). This is almost exactly a year’s worth of Tweets from the account in question.

Eight source types were detected in the dataset. Most of the Tweets were published either from iPhone (46%) or an Android (45%).

The Tweet counts per source are as follows:

 

Source | Count
Instagram | 2
MediaStudio | 1
Periscope | 1
Twitter Ads | 1
Twitter for Android | 1629
Twitter for iPad | 22
Twitter for iPhone | 1660
Twitter Web Client | 287
Total | 3603

 

The table above visualised as a bar chart, just because:

 

Source of 3603 Tweets from_user_id_str 25073877 (27/02/2016 00:06 to 27/02/2017 12:06) Bar chart.

 

As a follow-up to a previous post, I share in the table below the top 50 most frequent word forms per source (iPhone and Android) in this set of 3,603 Tweets from_user_id_str 25073877, courtesy of a quick text analysis (applying a customised English stop word list globally) made with Voyant Tools:

 

Android iPhone
Term Count Trend Term Count Trend
great 276 0.008124816 thank 417 0.015241785
hillary 252 0.00741831 trump2016 215 0.007858475
trump 184 0.005416544 great 190 0.006944698
crooked 162 0.004768914 makeamericagreatagain 165 0.006030922
people 160 0.004710038 join 160 0.005848167
just 151 0.004445099 rt 144 0.00526335
clinton 120 0.003532529 hillary 119 0.004349574
big 107 0.003149838 clinton 118 0.004313023
media 106 0.0031204 america 111 0.004057166
thank 94 0.002767148 trump 104 0.003801309
bad 89 0.002619959 make 89 0.003253043
president 88 0.002590521 new 88 0.003216492
make 86 0.002531646 tomorrow 82 0.002997186
america 85 0.002502208 people 75 0.002741328
cnn 85 0.002502208 maga 73 0.002668226
country 72 0.002119517 today 73 0.002668226
like 72 0.002119517 americafirst 69 0.002522022
u.s 72 0.002119517 draintheswamp 68 0.002485471
time 71 0.00209008 tonight 67 0.00244892
said 67 0.001972329 ohio 66 0.002412369
jobs 66 0.001942891 vote 63 0.002302716
vote 63 0.001854578 just 61 0.002229614
win 63 0.001854578 florida 59 0.002156512
new 62 0.00182514 crooked 52 0.001900654
going 59 0.001736827 going 49 0.001791001
news 58 0.001707389 imwithyou 49 0.001791001
bernie 56 0.001648513 president 49 0.001791001
foxnews 55 0.001619076 votetrump 49 0.001791001
good 54 0.001589638 tickets 46 0.001681348
wow 53 0.0015602 american 43 0.001571695
job 50 0.001471887 time 43 0.001571695
nytimes 50 0.001471887 pennsylvania 42 0.001535144
republican 50 0.001471887 poll 41 0.001498593
0 49 0.001442449 soon 41 0.001498593
today 49 0.001442449 support 41 0.001498593
totally 49 0.001442449 enjoy 38 0.00138894
enjoy 48 0.001413012 campaign 37 0.001352389
cruz 46 0.001354136 rally 37 0.001352389
election 46 0.001354136 carolina 35 0.001279287
look 46 0.001354136 north 35 0.001279287
want 46 0.001354136 live 34 0.001242735
obama 44 0.001295261 speech 33 0.001206184
dishonest 41 0.001206947 california 18 0.000657919
can’t 39 0.001148072 hillaryclinton 18 0.000657919
night 39 0.001148072 honor 18 0.000657919
really 39 0.001148072 job 18 0.000657919
show 39 0.001148072 nevada 18 0.000657919
way 39 0.001148072 right 18 0.000657919
ted 38 0.001118634 supertuesday 18 0.000657919

 

I thought you’d like to know.

#TheDataDebates: A Quick Twitter Data Summary

Screenshot of an interactive visualisation of a #TheDataDebates archive created with Martin Hawksey’s TAGSExplorer.

1 October 2016 Update: I have now deposited on figshare a CSV file with timestamps, source and user_lang metadata of the archived tweets.

Priego, Ernesto (2016): #TheDataDebates Tweet Timestamps, Source, User Language. figshare. https://dx.doi.org/10.6084/m9.figshare.3976731.v1. Retrieved: 10:03, Oct 01, 2016 (GMT)

‘Social Media Data: What’s the use?’ was the title of a panel discussion held at The British Library, London, on Wednesday 21 September 2016, 18:00 – 20:00. The official hashtag of the event was #TheDataDebates.

I made a collection of Tweets tagged with #TheDataDebates published publicly between 12/09/2016 09:06:52 and 22/09/2016 09:55:03 (BST).

Again I used Tweepy 3.5.0, a Python wrapper for the Twitter API, for the collection. Learning to mine with Python has been fun and empowering. To compare results I also used, as usual, Martin Hawksey’s TAGS, with the results being equal; in both cases I only collected Tweets from accounts with at least one follower. Having the collected data already in a spreadsheet saved me time.
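For context, here is a minimal sketch of a Tweepy 3.x collection along these lines (credentials are placeholders, and this is an illustration rather than my exact script):

```python
import tweepy

# Credentials are placeholders; this illustrates the collection
# described above rather than reproducing the exact script used.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

rows = []
for status in tweepy.Cursor(api.search, q="#TheDataDebates", count=100).items():
    # Keep only Tweets from accounts with at least one follower,
    # matching the filter applied to the collection described above
    if status.user.followers_count >= 1:
        rows.append((status.id_str, status.created_at,
                     status.source, status.user.lang))
```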

Here’s a summary of the collection:

First Tweet in Archive | 12/09/2016 09:06:52
Last Tweet in Archive | 22/09/2016 09:55:03
Number of Tweets | 594
Number of links | 152
Number of RTs | 312
Number of accounts | 152

From the main archive I was able to focus on the number of Tweets per source and on the user language setting.

Source

source | Count
Twitter for iPhone | 246
Twitter Web Client | 131
Twitter for Android | 100
Twitter for iPad | 74
TweetDeck | 12
UK Trends | 11
Mobile Web (M5) | 5
Hootsuite | 5
Twitter for Windows Phone | 3
Big Data news flow | 1
Linkis | 1
Twitterrific | 1
iOS | 1
Flipboard | 1
Lt RTEngine | 1
RoundTeam | 1
Total | 594

User Language Setting (user_lang)

user_lang | Count | Notes
en | 547 |
en-gb | 32 |
fr | 7 | 6 of these are spam
de | 3 |
it | 3 |
ar | 2 | both spam
Total | 594 |

The summary above is of the raw collection, so not all the activity it reflects is either ‘human’ or relevant, as some accounts tweeting have been identified as bots tweeting spam (a less human-readable hashtag could potentially have avoided such spamming, given the relatively low activity). Except where I identified spam Tweets, in this post I have not looked at the Tweets’ text data (i.e. I haven’t shared any text or content analysis here). Maybe if I have time in the near future. As Retweets were counted as Tweets in this archive, a more specific and precise analysis would have to filter them from the dataset.

I am fully aware this would be more interesting and useful if there were opportunities for others to replicate the analysis through access to the source dataset I used. There are lots of interesting types of analysis that could be run and data to focus on in such a dataset as this. As in previous posts about other events, I am simply sharing this post right now as a quick indicative update published only a few hours after the event concluded.

It was pointed out last night that “social media data mining is starting but still has a way to go to catch up with hard analytical methodologies.” A post like this does not claim to employ such methodologies; it simply seeks to contribute to the debate with evidence that may hopefully inspire other studies. Perhaps it’s a two-way process, and “hard analytical methodologies” (and researchers’ and users’ attitudes regarding cultural paradigms around ethics, privacy, consent, statistical significance) also have a way to go to catch up with new, pervasive forms of data creation and dissemination that perhaps require different, media-, community- and content-specific approaches to doing research.

Other Considerations [I am reusing my own text from previous posts here]


Both research and experience show that the Twitter search API is not 100% reliable. Large Tweet volumes affect the search collection process. The API might “over-represent the more central users”, not offering “an accurate picture of peripheral activity” (González-Bailon, Sandra, et al, 2012). Apart from the filters and limitations already declared, it cannot be guaranteed that each and every Tweet tagged with #TheDataDebates during the indicated period was analysed. The dataset was shared for archival, comparative and indicative educational research purposes only.

Only content from public accounts, obtained from the Twitter Search API, was analysed. The source data is also publicly available to all Twitter users via the Twitter Search API, and available to anyone with an Internet connection via the Twitter and Twitter Search web client and mobile apps without the need for a Twitter account. These posts and the resulting dataset contain the results of analyses of Tweets that were published openly on the Web with the queried hashtag; the content of the Tweets is the responsibility of the original authors. Original Tweets are likely to be copyright of their individual authors, but please check individually. This work is shared to archive, document and encourage open educational research into scholarly activity on Twitter.

No private personal information was shared. The collection, analysis and sharing of the data has been enabled and allowed by Twitter’s Privacy Policy. The sharing of the results complies with Twitter’s Developer Rules of the Road. A hashtag is metadata users choose freely to use so their content is associated, directly linked to and categorised with the chosen hashtag.

The purpose and function of hashtags is to organise and describe information/outputs under the relevant label in order to enhance the discoverability of the labeled information/outputs (Tweets in this case). Tweets published publicly by scholars or other professionals during academic conferences or events are often publicly tagged (labeled) with a hashtag dedicated to the event in question. This practice used to be confined to a few ‘niche’ fields; it is increasingly becoming the norm rather than the exception. Though every reason for Tweeters’ use of hashtags cannot be generalised nor predicted, it can be argued that scholarly Twitter users form specialised, self-selecting public professional networks that tend to observe scholarly practices and accepted modes of social and professional behaviour. In general terms it can be argued that scholarly Twitter users willingly and consciously tag their public Tweets with a conference hashtag as a means to network and to promote, report from, reflect on, comment on and generally contribute publicly to the scholarly conversation around conferences.

As Twitter users, conference Twitter hashtag contributors have agreed to Twitter’s Privacy and data sharing policies. Professional associations like the Modern Language Association and the American Psychological Association recognise Tweets as citeable scholarly outputs. Archiving scholarly Tweets is a means to preserve this form of rapid online scholarship that otherwise can very likely become unretrievable as time passes; Twitter’s search API has well-known temporal limitations for retrospective historical search and collection. Beyond individual Tweets as scholarly outputs, the collective scholarly activity on Twitter around a conference or academic project or event can provide interesting insights for the contemporary history of scholarly communications. Though this work has limitations and might not be thoroughly systematic, it is hoped it can contribute to developing new insights into a discipline’s public concerns as expressed on Twitter over time.

References

González-Bailon, Sandra and Wang, Ning and Rivero, Alejandro and Borge-Holthoefer, Javier and Moreno, Yamir, Assessing the Bias in Samples of Large Online Networks (December 4, 2012).  Available at SSRN: http://dx.doi.org/10.2139/ssrn.2185134

Priego, Ernesto (2016): #WLIC2016 Most Frequent Terms Roundup. figshare. https://dx.doi.org/10.6084/m9.figshare.3749367.v2

AHRC [ahrcpress]. (2016, Sep 21). Social media data mining is starting but still has a way to go to catch up with hard analytical methodologies #TheDataDebates [Tweet]. Retrieved from https://twitter.com/ahrcpress/status/778652767636389888

Priego, Ernesto (2016): #TheDataDebates Tweet Timestamps, Source, User Language. figshare. https://dx.doi.org/10.6084/m9.figshare.3976731.v1. Retrieved: 10:03, Oct 01, 2016 (GMT)

Libraries! Most Frequent Terms in #WLIC2016 Tweets (part IV)

IFLA World Library and Information Congress
82nd IFLA General Conference and Assembly
13–19 August 2016, Columbus, Ohio, USA. Copyright by IFLA, CC BY 4.0.

 


 

This is part IV. For necessary context, methodology, limitations, please see here (part 1),  here (part 2), and here (part 3).

Since this was published and shared for the first time I may have done new edits. I often come back to posts once they have been published to revise them.

Throughout the process of performing the day by day text analysis I became aware of other limitations to take into account and I have revised part 3 accordingly.

Summary

Here’s a summary of the counts of the source (unrefined) #WLIC2016 archive I collected:

Number of Links | 12435
Number of RTs (estimate based on occurrence of RT) | 14570
Number of Tweets | 23552
Unique Tweets (used to monitor quality of archive) | 23421
First Tweet in Archive | 14/08/2016 11:29:03 EDT
Last Tweet in Archive | 22/08/2016 04:20:53 EDT
In Reply Ids | 270
In Reply @s | 429
Number of Tweeters | 3035

As previously indicated the Tweet count includes RTs. This count might require further deduplication and it might include bots’ Tweets and possibly some unrelated Tweets.

Here’s a summary of the Tweet count of the #WLIC2016  dataset I refined from the complete archive. As I explained in part 3 I organised the Tweets into conference days, from Sunday 14 to Thursday 18 August. Each day was a different corpus to analyse. I also analysed the whole set as a single corpus to ensure the totals replicated.

Day | Tweet count
Sunday 14 August 2016 | 2543
Monday 15 August 2016 | 6654
Tuesday 16 August 2016 | 4861
Wednesday 17 August 2016 | 4468
Thursday 18 August 2016 | 3801
Sunday to Thursday (total) | 22327

 

The Most Frequent Terms

The text analysis involved analysing each corpus, first obtaining a ‘raw’ output of the 300 most frequent terms and their counts. As described in previous posts, I then applied an edited English stop words list, followed by a manual editing of the top 100 most frequent terms (for the shared dataset) and of the top 50 for this post. Unlike before, in this case I removed ‘barack’ and ‘obama’ from Thursday’s and Monday’s corpora, and tried to remove usernames and hashtags, though it’s possible that further disambiguation and refining might be needed in those top 100 and top 50.
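As a rough illustration of the frequency-plus-stop-words step, here is how it can be approximated in Python (the stop word entries shown are a tiny sample; the edited list actually used was much longer):

```python
import re
from collections import Counter

# Illustrative entries only; the edited stop word list used for the
# analysis was much longer and included Twitter-specific tokens.
STOP = {"the", "a", "an", "and", "of", "to", "in", "is", "rt", "https", "t.co"}

def top_terms(text, n=300):
    """Raw most frequent terms after applying a stop word list,
    approximating the Voyant Terms step described above."""
    words = re.findall(r"[#\w.'-]+", text.lower())
    return Counter(w for w in words if w not in STOP).most_common(n)
```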

The text analysis of the Sun-Thu Tweets as a single corpus gave us the following Top 50:

#WLIC2016 Sun-Thu Top 50 Most Frequent Terms (stop-words applied; edited)

Rank | Term | Count
1 | libraries | 2895
2 | library | 2779
3 | librarians | 1713
4 | session | 1467
5 | access | 872
6 | world | 832
7 | public | 774
8 | copyright | 766
9 | people | 757
10 | need | 750
11 | data | 746
12 | make | 733
13 | privacy | 674
14 | digital | 629
15 | new | 615
16 | wikipedia | 602
17 | indigenous | 593
18 | use | 574
19 | information | 555
20 | great | 539
21 | knowledge | 512
22 | literacy | 502
23 | internet | 481
24 | work | 428
25 | thanks | 419
26 | message | 416
27 | future | 412
28 | change | 379
29 | social | 378
30 | open | 369
31 | just | 354
32 | research | 353
33 | know | 330
34 | community | 323
35 | important | 319
36 | oclc | 317
37 | collections | 312
38 | books | 300
39 | learn | 300
40 | opening | 291
41 | read | 289
42 | impact | 287
43 | place | 282
44 | good | 280
45 | services | 277
46 | national | 276
47 | best | 272
48 | latest | 269
49 | report | 267
50 | users | 266

As mentioned above I also analysed each day as a single corpus. I refined the ‘raw’ 300 most frequent terms per day to a top 100 after stop words and manual editing. I then laid them all out as a single table for comparison.

#WLIC2016 Top 50 Most Frequent Terms per Day Comparison (stop-words applied; edited)

Rank | Sun 14 Aug | Mon 15 Aug | Tue 16 Aug | Wed 17 Aug | Thu 18 Aug
1 | libraries | library | library | libraries | libraries
2 | library | libraries | privacy | library | library
3 | librarians | librarians | libraries | librarians | librarians
4 | session | session | librarians | indigenous | public
5 | access | copyright | session | session | session
6 | world | wikipedia | people | knowledge | need
7 | public | digital | data | access | data
8 | copyright | make | indigenous | data | impact
9 | people | world | make | literacy | new
10 | need | internet | access | need | digital
11 | data | access | wikipedia | great | world
12 | make | new | use | people | thanks
13 | privacy | need | information | research | access
14 | digital | use | world | public | value
15 | new | public | public | new | national
16 | wikipedia | future | knowledge | marketing | change
17 | indigenous | people | copyright | general | privacy
18 | use | message | homeless | open | great
19 | information | collections | literacy | world | work
20 | great | information | oclc | archives | research
21 | knowledge | content | great | just | use
22 | literacy | open | homelessness | national | people
23 | internet | report | need | assembly | knowledge
24 | work | space | freedom | place | social
25 | thanks | trend | like | make | using
26 | message | great | thanks | read | know
27 | future | net | internet | community | make
28 | change | work | info | social | services
29 | social | neutrality | latest | reading | skills
30 | open | making | experiencing | work | award
31 | just | update | theft | information | information
32 | research | books | important | use | learning
33 | know | collection | just | learn | users
34 | community | social | subject | share | book
35 | important | design | change | matters | user
36 | oclc | data | guidelines | key | best
37 | collections | thanks | digital | know | collections
38 | books | librarian | students | global | academic
39 | learn | know | know | government | measure
40 | opening | shaping | online | life | poland
41 | read | google | protect | thanks | community
42 | impact | change | working | important | learn
43 | place | literacy | statement | development | outcomes
44 | good | just | work | love | share
45 | services | technology | future | impact | time
46 | national | online | read | archivist | media
47 | best | poster | award | good | section
48 | latest | info | create | books | important
49 | report | working | services | cultural | service
50 | users | law | good | help | closing

I have shared on figshare a dataset containing the summaries above as well as the raw top 300 most frequent terms for the whole set and divided per day. The dataset also includes the top 100 most frequent terms lists per day that I manually edited after having applied the edited English stop word filter.

You can download the spreadsheet from figshare:

Priego, Ernesto (2016): #WLIC2016 Most Frequent Terms Roundup. figshare.
https://dx.doi.org/10.6084/m9.figshare.3749367.v2

Please bear in mind that, as refining was done manually and the Terms tool does not always seem to apply stop words evenly, there might be errors. This is why the raw output was shared as well. This data should be taken to be indicative only.

As is increasingly recommended for data sharing, the CC-0 license has been applied to the resulting output in the repository. It is important, however, to bear in mind that some terms appearing in the dataset might be individually licensed differently; copyright of the source Tweets (and sometimes of individual terms) belongs to their authors. Authorial/curatorial/collection work has been performed on the shared file as a curated dataset resulting from analysis, in order to make it available as part of the scholarly record. If this dataset is consulted, attribution is always welcome.

Ideally, for proper reproducibility and to encourage other studies, the whole archive dataset should be available. Those wishing to obtain the whole Tweets should still be able to get them themselves via text and data mining methods.
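For instance, Tweet IDs such as those shared in other posts can be ‘hydrated’ back into full Tweets with Tweepy’s statuses_lookup; a minimal sketch, assuming an authenticated tweepy.API object as in Tweepy 3.x:

```python
import tweepy

def hydrate(api, tweet_ids):
    """Fetch full Tweets for a list of Tweet IDs in batches of 100,
    the maximum accepted per statuses_lookup call in Tweepy 3.x."""
    tweets = []
    for i in range(0, len(tweet_ids), 100):
        tweets.extend(api.statuses_lookup(tweet_ids[i:i + 100]))
    return tweets

# `api` would be an authenticated tweepy.API instance, set up as in
# the collection sketch earlier in this blog
```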

Conclusions

Indeed, for us today there is absolutely nothing surprising about the term ‘libraries’ being the most frequent word in Tweets coming from IFLA’s World Library and Information Congress. Looking at the whole dataset, however, provides an insight into other frequent terms used by Library and Information professionals in the context of libraries. These terms might not remain frequent for long, and might not have been frequent words in the past (I can only hypothesise; having evidence would be nice).

A key hypothesis for me guiding this exercise has been that perhaps by looking at the words appearing in social media outputs discussing and reporting from a professional association’s major congress, we can get a vague idea of where a sector’s concerns are/were.

I guess it can be safely said that words become meaningful in context. In an age in which repetition and frequency are key to public constructions of cultural relevance (‘trending topics’ increasingly define the news agenda… and what people talk about and how they talk about things) the repetition and frequency of key terms might provide a type of meaningful evidence in itself.  Evidence, however, is just the beginning– further interpretation and analysis must indeed follow.

One cannot obtain the whole picture from decomposing a collectively, socially, publicly created textual corpus (or perhaps any corpus, unless it is a list of words from the start) into its constituent parts. It could also be said that many tools and methods often tell us more about themselves (and those using them) than about the objects of study.

So far, text analysis (Rockwell 2003) and ‘distant reading’ through automated methods have focused on working with books (Ramsay 2014). However, I’d like to suggest that this kind of text analysis can be another way of reading social media texts and offer another way to contribute to the assessment of their cultural relevance as living documents of a particular setting and moment in time. Who knows, they might also be telling us something about the present perception and activity of a professional field, and might help us to compare it with those in the future.

Other Considerations

Both research and experience show that the Twitter search API is not 100% reliable. Large Tweet volumes affect the search collection process. The API might “over-represent the more central users”, not offering “an accurate picture of peripheral activity” (González-Bailon, Sandra, et al, 2012).

Apart from the filters and limitations already declared, it cannot be guaranteed that each and every Tweet tagged with #WLIC2016 during the indicated period was analysed. The dataset was shared for archival, comparative and indicative educational research purposes only.

Only content from public accounts, obtained from the Twitter Search API, was analysed.  The source data is also publicly available to all Twitter users via the Twitter Search API and available to anyone with an Internet connection via the Twitter and Twitter Search web client and mobile apps without the need of a Twitter account.

These posts and the resulting dataset contain the results of analyses of Tweets that were published openly on the Web with the queried hashtag; the content of the Tweets is responsibility of the original authors. Original Tweets are likely to be copyright their individual authors but please check individually.

This work is shared to archive, document and encourage open educational research into scholarly activity on Twitter. The resulting dataset does not contain complete Tweets nor Twitter metadata. No private personal information was shared. The collection, analysis and sharing of the data has been enabled and allowed by Twitter’s Privacy Policy. The sharing of the results complies with Twitter’s Developer Rules of the Road.

A hashtag is metadata users choose freely to use so their content is associated, directly linked to and categorised with the chosen hashtag. The purpose and function of hashtags is to organise and describe information/outputs under the relevant label in order to enhance the discoverability of the labeled information/outputs (Tweets in this case). Tweets published publicly by scholars or other professionals during academic conferences are often publicly tagged (labeled) with a hashtag dedicated to the conference in question. This practice used to be confined to a few ‘niche’ fields; it is increasingly becoming the norm rather than the exception.

Though every reason for Tweeters’ use of hashtags cannot be generalised nor predicted, it can be argued that scholarly Twitter users form specialised, self-selecting public professional networks that tend to observe scholarly practices and accepted modes of social and professional behaviour.

In general terms it can be argued that scholarly Twitter users willingly and consciously tag their public Tweets with a conference hashtag as a means to network and to promote, report from, reflect on, comment on and generally contribute publicly to the scholarly conversation around conferences. As Twitter users, conference Twitter hashtag contributors have agreed to Twitter’s Privacy and data sharing policies.

Professional associations like the Modern Language Association and the American Psychological Association recognise Tweets as citeable scholarly outputs. Archiving scholarly Tweets is a means to preserve this form of rapid online scholarship that otherwise can very likely become unretrievable as time passes; Twitter’s search API has well-known temporal limitations for retrospective historical search and collection.

Beyond individual Tweets as scholarly outputs, the collective scholarly activity on Twitter around a conference or academic project or event can provide interesting insights for the contemporary history of scholarly communications. Though this work has limitations and might not be thoroughly systematic, it is hoped it can contribute to developing new insights into a discipline’s public concerns as expressed on Twitter over time.

References

González-Bailon, Sandra, Ning Wang, Alejandro Rivero, Javier Borge-Holthoefer and Yamir Moreno (2012) “Assessing the Bias in Samples of Large Online Networks”. Available at SSRN: http://dx.doi.org/10.2139/ssrn.2185134

Priego, Ernesto (2016) #WLIC2016 Most Frequent Terms Roundup. figshare.
https://dx.doi.org/10.6084/m9.figshare.3749367.v2

Ramsay, Stephen (2014) “The Hermeneutics of Screwing Around; or What You Do with a Million Books.” In Pastplay: Teaching and Learning History with Technology, edited by Kevin Kee, 111-20. Ann Arbor: University of Michigan Press. Also available at http://quod.lib.umich.edu/d/dh/12544152.0001.001/1:5/–pastplay-teaching-and-learning-history-with-technology?g=dculture;rgn=div1;view=fulltext;xc=1

Rockwell, Geoffrey (2003) “What is Text Analysis, Really?” Preprint, Literary and Linguistic Computing, vol. 18, no. 2, pp. 209-219.

What’s in a Word? Most Frequent Terms in #WLIC2016 Tweets (part III)

IFLA World Library and Information Congress
82nd IFLA General Conference and Assembly
13–19 August 2016, Columbus, Ohio, USA. Copyright by IFLA, CC BY 4.0.

This is part three. For necessary context please start here (part 1) and here (part 2). The final, fourth part is here.

It’s Friday already and the sessions from IFLA’s WLIC 2016 have finished. I’d like to finish what I started and complete a roundup of my quick (but in practice not-so-quick) collection and text analysis of a sample of #WLIC2016 Tweets. My intention is to finish this with a fourth and final blog post following this one and to share a dataset on figshare as soon as possible.

As previously, I customised the spreadsheet settings to collect only Tweets from accounts with at least one follower and to reflect the Congress’ location and time zone. Before exporting as CSV I did a basic automated deduplication, but I did not do any further data refining (which means that non-relevant or spam Tweets may be included in the dataset).
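
For anyone wishing to reproduce the deduplication step, here is a minimal sketch in Python with pandas. The file name and column name (id_str for the Tweet ID) are assumptions based on a typical TAGS-style export, not the exact files I used:

    import pandas as pd

    # Load the exported archive; file and column names are assumptions
    tweets = pd.read_csv("wlic2016.csv", dtype={"id_str": str})

    # Basic automated deduplication: drop rows sharing a Tweet ID,
    # keeping the first occurrence of each
    deduped = tweets.drop_duplicates(subset="id_str", keep="first")

    deduped.to_csv("wlic2016_deduped.csv", index=False)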

What follows is a basic quantitative summary of the initial complete sample dataset (a sketch of how such counts could be derived follows the list):

  • Total Tweets: 22,540 (includes RTs)
  • First Tweet in complete sample dataset: Sunday 14/08/2016 11:29:03 EDT
  • Last Tweet in complete sample dataset: Friday 19/08/2016 04:20:43 EDT
  • Number of links: 11,676
  • Number of RTs: 13,859
  • Number of usernames: 2,811
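
For transparency, figures like the above could be approximated from the deduplicated CSV along these lines (again assuming TAGS-style column names; note the link and RT counts are naive pattern matches on the Tweet text, not Twitter metadata):

    import pandas as pd

    tweets = pd.read_csv("wlic2016_deduped.csv", dtype={"id_str": str})

    total = len(tweets)                                           # total Tweets, RTs included
    links = tweets["text"].str.contains("http", na=False).sum()   # Tweets containing a link
    rts = tweets["text"].str.startswith("RT @", na=False).sum()   # old-style RTs
    users = tweets["from_user"].str.lower().nunique()             # unique usernames

    print(total, links, rts, users)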

The Congress had activities between Friday 12 August and Friday 19 August, but sessions only between Sunday 14 August and Thursday 18 August. Ideally I would have liked to collect Tweets from the early hours of Sunday 14 August, but I started collecting late, so the earliest Tweet I captured is from 11:29:03 EDT; at least that was before the first panel sessions started. For more context regarding timings, see the Congress outline.

I refined the complete dataset to include only the days that featured panel sessions, and I have organised the data in a different sheet per day for individual analysis. I have also created a table detailing the Tweet counts per Congress session day. [Later I realised that though I had the metadata for the Columbus, Ohio time zone, I ended up organising the data into GMT/BST days. There is a five-hour difference, but the collected Tweets per day still roughly correspond to the timings of the conference. Of course many will have participated in the hashtag remotely, not present at the event, and many present will have tweeted asynchronously rather than ‘live’. I don’t think this makes much of a difference (no pun intended) to the analysis, but it’s something I was aware of and that others may or may not want to consider as a limitation.]
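
For those wanting to reproduce (or correct) the day grouping described above, a minimal sketch, assuming a created_at column in a format parseable by pandas:

    import pandas as pd

    tweets = pd.read_csv("wlic2016_deduped.csv", dtype={"id_str": str})

    # Parse timestamps as UTC; the column name and format are assumptions
    created = pd.to_datetime(tweets["created_at"], utc=True)

    # GMT/BST days (what I ended up doing) vs Congress-local EDT days
    tweets["day_gmt_bst"] = created.dt.tz_convert("Europe/London").dt.date
    tweets["day_edt"] = created.dt.tz_convert("America/New_York").dt.date

    print(tweets.groupby("day_gmt_bst").size())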

Tweets collected per day

Day Tweet count
Sunday 14 August 2016 2543
Monday 15 August 2016 6654
Tuesday 16 August 2016 4861
Wednesday 17 August 2016 4468
Thursday 18 August 2016 3801

Total Tweets in refined dataset: 22,327 Tweets.

(Always bear in mind that these figures reflect the Tweets in the collected dataset; they do not mean that this was, as a matter of fact, the total number of Tweets published with the hashtag during that period. Not only do the settings of my querying affect the results; Twitter’s Search API also has limitations and cannot be assumed to always return the same type or number of results.)

I am still in the process of analysing the dataset. There are of course multiple types of analyses one could do with this data, but bear in mind that in this case I have only focused on using text analysis to obtain the most frequent terms in the text of the Tweets tagged with #WLIC2016 that I collected.

As before, in this case I am using the Terms tool from Voyant Tools to perform a basic text analysis in order to identify the number of total words, unique word forms and most frequent terms per day; in other words, the data from each day became an individual corpus. (The complete refined dataset including all collected days could also be analysed as a single corpus for comparison.) I am gradually exporting and collecting the ‘raw’ output from the Terms tool per day, so that once I have finished applying the stop words to each corpus this output can be compared, and so that it could be reproduced with other stop word lists if desired.
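
Outside Voyant, the per-day Terms output can be approximated in a few lines of Python; a rough sketch with hypothetical file and column names, rather than my actual workflow:

    import re
    from collections import Counter

    import pandas as pd

    def top_terms(csv_path, stop_words, n=50):
        # Treat the day's Tweet texts as a single corpus
        text = " ".join(pd.read_csv(csv_path)["text"].dropna().astype(str)).lower()
        words = re.findall(r"[a-z0-9']+", text)
        counts = Counter(w for w in words if w not in stop_words)
        return counts.most_common(n)

    # stopwords_edited.txt stands in for my edited English stop word list
    stop_words = set(open("stopwords_edited.txt").read().split())
    print(top_terms("wlic2016_monday.csv", stop_words))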

As before, I am using the English stop word list that I previously edited to include Twitter-specific terms (e.g. t.co, amp, https) as well as dataset-specific terms (e.g. the Congress’ Twitter account, related hashtags, etc.), but this time I also included all 2,811 account usernames in the complete dataset so they would be excluded from the most frequent terms. These are the usernames of accounts with Tweets in the dataset; other usernames (mentioned in Tweets’ text but which did not themselves Tweet with the hashtag) were logically not filtered, so whenever they are easily identifiable I am painstakingly removing them (manually!) from the remaining list. I am sure there must be a more effective way of doing this, but I find the combination of ‘distant’ (automated) editing and ‘close’ (manual) editing interesting and fun.
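
One ‘distant’ alternative to filtering usernames through the stop word list would be stripping @-mentions (and links) from the Tweet text before counting, so that mentioned accounts never reach the term list at all; a sketch of that idea, not what I actually did:

    import re

    def strip_mentions_and_urls(text):
        # Remove @usernames, including accounts that were only mentioned
        text = re.sub(r"@\w+", " ", text)
        # Remove links, which would otherwise surface t.co fragments
        text = re.sub(r"https?://\S+", " ", text)
        return text

    print(strip_mentions_and_urls("RT @ifla: Great #WLIC2016 session! https://t.co/abc123"))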

I am using the same edited stop word list for each analysis. In this case I have also manually removed non-English terms (mostly pronouns and articles). Needless to say, I did this not because I didn’t think they were relevant (quite the opposite) but because, even though they had a presence, they were not fairly comparable to the overwhelming majority of English terms (a ranking of most frequent non-English terms would be needed). As I will also have shared the unedited, ‘raw’ top most frequent terms in the dataset, anyone wishing to look into the non-English terms could do so and run their own analyses without my own subjective stop word list and editing getting in the way. I tried to be as systematic as possible, but disambiguation would be needed (the Terms tool is case and context insensitive, so a term could have been a proper name or a username, and to be consistent I should have removed those too; again, having the raw list would allow others to correct any filtering/curation/stop word mistakes).

I am aware there are far more sophisticated methods of dealing with this data. Personally, doing this type of simple data collection and text analysis is an exercise in, and an interrogation of, data collection and analysis methods and tools as reflective practices. A hypothesis behind it is that the terms a community or discipline uses (and retweets) do say something about that community or discipline, at least for a particular moment in time and a particular place in particular settings. Perhaps it also says things about the medium used to express those terms. When ‘screwing around’ with texts it may be unavoidable to wonder what there is to it beyond ‘bean-counting’ (what’s in a word? what’s in a frequent term?), and what there is to social media and academic/professional live-tweeting that can or cannot be quantified. Doing this type of work also makes me reflect on my own limitations, the limits of text analysis tools, the appropriateness of tools, the importance of replication and reproducibility, and the need to document and to share what has been documented.

I’m also thinking about documentation and the open sharing of data outputs as messages in bottles, or, as has been said of metadata, as ‘letters to the future’. I’m aware that this may also seem like navel-gazing of little interest outside those associated with the event in question. I would say that the role of libraries in society at large is more crucial and central than many outside the library and information sector may think (but that’s a subject for another time). Perhaps one day in the future it might be useful to look back at what we were talking about in 2016 and what words we used to talk about it. (Look, we were worried about that!) Or maybe no one cares and no one will care, or by then it will be possible to retrieve anything anywhere with great degrees of relevance and precision (including critical interpretation). In the meantime, I will keep refining these lists and will share the output as soon as I can.

Next… the results!

The final, fourth part is here.

Most Frequent Terms in #WLIC2016 Tweets (part II)

IFLA World Library and Information Congress
82nd IFLA General Conference and Assembly
13–19 August 2016, Columbus, Ohio, USA. Copyright by IFLA, CC BY 4.0.


The first part of this series provides necessary context.

I now have an edited list of the top 50 most frequent terms, extracted from a cleaned dataset comprising 10,721 #WLIC2016 Tweets published by 1,760 unique users between Monday 15/08/2016 10:11:08 EDT and Wednesday 17/08/2016 07:16:35 EDT.

The analysed corpus contained the raw text of the Tweets (including RTs), comprising 185,006 total words and 12,418 unique word forms.

Stop words were applied as detailed in the first part of this series, and the resulting list (a raw list of the 300 most frequent terms) was further edited to remove personal names, personal Twitter usernames, common hashtags, etc. Some organisational Twitter usernames were not removed from the list, as an indication of their ‘centrality’ in the network based on the frequency with which they appeared in the corpus.

So here’s an edited list of the top 50 most frequent terms from the dataset described above:

Term Count
library 1379
libraries 1102
librarians 811
session 715
privacy 555
wikipedia 523
make 484
copyright 465
people 428
digital 378
access 375
use 362
public 340
data 322
need 319
iflabuild2016 308
world 308
information 298
internet 289
new 272
great 259
indigenous 255
iflatrends 240
report 202
knowledge 200
future 187
work 187
libraryfreedom 184
literacy 184
space 180
change 178
thanks 172
oclc 171
open 170
just 169
books 168
trend 165
important 162
info 162
know 162
social 161
net 159
neutrality 159
wikilibrary 158
collections 157
working 157
librarian 154
online 154
making 149
guidelines 148

Is this interesting? Is it useful? I don’t know, but I’ve enjoyed documenting it. Reflecting on the different criteria for applying stop words and for cleaning and refining terms has also been interesting.

I guess that deep down I believe it’s better to document than not to, even if we may think there should be other ways of doing it (otherwise I wouldn’t even try). Value judgements about the utility or insightfulness of specific data in specific ways are an a posteriori process.

I hope to be able to continue collecting data, and once the congress/conference ends I hope to share a dataset with the raw (unedited, unfiltered) most frequent terms in the text of the Tweets published with the event’s hashtag. Anyone else interested could then clean, curate and analyse the data in different ways (wishful thinking, but hey: it’s hope that guides us).
