#rfringe17: Top 230 Terms in Tweetage





tl;dr

Repository Fringe is a gathering for repository managers and others interested in research data repositories and publication repositories.

I collected an archive of #rfringe17, containing 1,118 Tweet IDs. I then analysed the text of the tweets with Voyant Tools to identify the most frequent terms and manually refined the results to 230 terms.

I collected an archive of #rfringe17 tweets using TAGS. The key stats from the archive:

Number of Tweets in Archive: 1,118
Number of usernames in Archive: 215
First Tweet Collected: 26/07/2017 14:58:12
Last Tweet Collected: 05/08/2017 08:00:06
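For anyone wanting to reproduce these key stats from their own TAGS export, here is a minimal sketch (the column names `from_user` and `created_at` are assumptions based on the usual TAGS sheet layout; this is not the exact script I used):

```python
import csv
from io import StringIO

def archive_stats(csv_text):
    """Summarise a TAGS-style export: tweet count, unique usernames,
    first and last collection timestamps (timestamps assumed sortable
    as strings, e.g. ISO-style)."""
    rows = list(csv.DictReader(StringIO(csv_text)))
    times = sorted(r["created_at"] for r in rows)
    return {
        "tweets": len(rows),
        "usernames": len({r["from_user"] for r in rows}),
        "first": times[0],
        "last": times[-1],
    }

# Illustrative rows, not real archive data:
sample = (
    "from_user,created_at,text\n"
    "alice,2017-07-26 14:58:12,#rfringe17 starts soon\n"
    "bob,2017-08-05 08:00:06,that's a wrap #rfringe17\n"
    "alice,2017-08-03 10:00:00,posters! #rfringe17\n"
)
print(archive_stats(sample))
```

In practice you would read the TAGS sheet exported as CSV instead of the inline sample.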

From http://www.repositoryfringe.org/:

Repository Fringe is a gathering for repository managers and others interested in research data repositories and publication repositories. Participation is a key element – the event is designed to encourage all attendees to share their repository experiences and expertise.

2017 marks the 10th Repo Fringe where we will be celebrating progress we have made over the last 10 years to share content beyond borders and debating future trends and challenges.

It took place in Edinburgh, 3–4 August 2017.

If you are not new to this blog, you will have guessed that I could not resist running the text of the collected tweets through Voyant Tools to obtain the term counts in the corpus with its Terms tool. As usual I applied the English stop-words filter, which I customised to include Twitter-specific terms (such as https, t.co, etc.) and the list of usernames.

I then manually refined the resulting data to remove smileys and any remaining usernames (some might have survived, as it is sometimes hard to disambiguate ordinary terms from usernames). I limited the results to the top 230 terms.
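A rough approximation of that term-counting step can be sketched with a plain `Counter`; the stop-word lists below are illustrative stand-ins (the real lists came from Voyant's English filter plus the TAGS username column), and the tokeniser is deliberately simple:

```python
import re
from collections import Counter

# Illustrative stop words; the real run used Voyant's full English list.
STOPWORDS = {"the", "a", "to", "of", "and", "in", "for", "is", "at"}
TWITTER_NOISE = {"https", "t.co", "rt", "amp"}
USERNAMES = {"repofringe_bot"}  # hypothetical; the real list came from TAGS

def top_terms(tweets, n=230):
    """Counter-based approximation of Voyant's Terms tool."""
    drop = STOPWORDS | TWITTER_NOISE | USERNAMES
    words = Counter()
    for text in tweets:
        for w in re.findall(r"[#@]?[\w'.-]+", text.lower()):
            w = w.lstrip("#@")
            if w and w not in drop:
                words[w] += 1
    return words.most_common(n)

print(top_terms(["Open data FTW #rfringe17 https://t.co/xyz",
                 "RT @repofringe_bot: open repositories at #rfringe17"]))
```

Voyant does considerably more (lemmas, skins, corpus tools), but the counting logic is essentially this.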

Do take the counts with a pinch of salt, as I did not clean the export from TAGS, so tweet duplicates and perhaps even some spam (who knows) might have remained.

Term Count
research 109
open 106
data 104
wikidata 75
oa 72
openscience 66
repository 63
repofringe 56
repositories 53
libraries 51
openresleeds 49
copyright 46
just 43
science 42
good 41
impact 41
thanks 41
day 39
access 38
poster 36
work 35
openaccess 34
talk 34
edinburgh 30
today 30
great 29
ucl 29
sherpa 28
read 27
want 27
event 26
project 26
really 26
time 26
cool 25
fringe 25
policy 24
metadata 23
publishers 23
publishing 23
says 23
colleague 22
policies 22
wikipedia 22
workflow 22
guide 21
millar 21
useful 21
comprehensive 20
content 20
fascinating 20
interesting 20
liveblogs 20
rdm 20
institutional 19
issue 19
it’s 19
liveblog 19
look 19
new 19
think 19
workshop 19
check 18
citizen 18
events 18
group 18
ip 18
management 18
need 18
outputs 18
presentation 18
rescue 18
session 18
trump 18
casrai 17
cycle 17
excellent 17
journal 17
lots 17
promotion 17
query 17
resource 17
uk 17
best 16
future 16
press 16
stuff 16
gallery 15
i’m 15
key 15
ref 15
showing 15
successful 15
support 15
thank 15
working 15
art 14
come 14
core 14
fun 14
miss 14
nice 14
process 14
provide 14
reminding 14
university 14
using 14
way 14
add 13
beautiful 13
demo 13
deposit 13
eprints 13
forward 13
funders 13
importance 13
keynote 13
looking 13
paper 13
phd 13
researchers 13
vote 13
e.g 12
era 12
especially 12
feedback 12
generation 12
got 12
let 12
needed 12
observation 12
recent 12
report 12
review 12
showcase 12
site2cite 12
star 12
theses 12
try 12
we’re 12
weirdness 12
advises 11
attendees 11
boat 11
broken 11
coar 11
control 11
criteria 11
exposure 11
global 11
institutions 11
like 11
model 11
prof 11
scholarly 11
survey 11
trek 11
use 11
years 11
articles 10
award 10
case 10
excited 10
exposing 10
figshare 10
gifts 10
hear 10
highlighted 10
important 10
initiative 10
integrating 10
introducing 10
live 10
opening 10
platform 10
ref2021 10
spend 10
vision 10
week 10
won 10
workshops 10
altmetric 9
colleagues 9
current 9
discussion 9
evidence 9
field 9
getting 9
i’ll 9
infrastructure 9
inspiring 9
library 9
link 9
list 9
local 9
long 9
make 9
meeting 9
peer 9
post 9
practice 9
preservation 9
problem 9
role 9
service 9
shoutout 9
shows 9
slides 9
sure 9
team 9
thought 9
touch 9
tweets 9
works 9
added 8
based 8
believe 8
better 8
change 8
conference 8
contributing 8
days 8
european 8
example 8
far 8
favourite 8
fully 8
here’s 8
image 8
included 8

Admittedly, sharing this data as an HTML table is not the best way of doing it, but hey. I have the source data if anyone is interested; Twitter's developer guidelines allow the sharing of tweet IDs. In this case the source data is composed of the dataset of 1,118 tweet ID strings (id_str).
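Sharing such a dataset amounts to writing out one id_str per line, which others can then "rehydrate" into full tweets via the API; a minimal sketch (the IDs below are made up for illustration):

```python
def write_id_list(rows, path):
    """Write one id_str per line; Twitter's developer guidelines
    permit sharing bare tweet IDs for later rehydration."""
    with open(path, "w") as f:
        for row in rows:
            f.write(row["id_str"] + "\n")

# Hypothetical IDs, not from the real archive:
write_id_list([{"id_str": "890277050283724800"},
               {"id_str": "890280000000000000"}], "rfringe17_ids.txt")
print(open("rfringe17_ids.txt").read())
```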

Maybe I missed it, but in the list above I could not find ‘bepress’ or ‘elsevier’, by the way…

#TheDataDebates: A Quick Twitter Data Summary

Screenshot of an interactive visualisation of a #TheDataDebates archive created with Martin Hawksey’s TAGSExplorer

1 October 2016 Update: I have now deposited on figshare a CSV file with timestamps, source and user_lang metadata of the archived tweets.

Priego, Ernesto (2016): #TheDataDebates Tweet Timestamps, Source, User Language. figshare. https://dx.doi.org/10.6084/m9.figshare.3976731.v1. Retrieved: 10:03, Oct 01, 2016 (GMT)

‘Social Media Data: What’s the use’ was the title of a panel discussion held at The British Library, London, on Wednesday 21 September 2016, 18:00–20:00. The official hashtag of the event was #TheDataDebates.

I made a collection of Tweets tagged with #TheDataDebates published publicly between 12/09/2016 09:06:52 and 22/09/2016 09:55:03 (BST).

Again I used Tweepy 3.5.0, a Python wrapper for the Twitter API, for the collection. Learning to mine with Python has been fun and empowering. To compare results I also used, as usual, Martin Hawksey’s TAGS, with results being equal (in both cases I only collected Tweets from accounts with at least one follower). Having the collected data already in a spreadsheet saved me time.
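The one-follower filter is simple to express in code. The sketch below keeps the Tweepy call itself in a comment (it needs an authenticated `tweepy.API` instance and network access) and shows the filter as a pure function that can be checked locally with a stub object:

```python
from types import SimpleNamespace

def keep(status):
    """Keep only tweets from accounts with at least one follower
    (the same filter applied to the TAGS collection)."""
    return status.user.followers_count >= 1

# With an authenticated tweepy.API instance (Tweepy 3.5.0), the
# collection itself would look roughly like:
#   import tweepy
#   statuses = [s for s in
#               tweepy.Cursor(api.search, q="#TheDataDebates").items()
#               if keep(s)]

# Local check with a stub status object:
stub = SimpleNamespace(user=SimpleNamespace(followers_count=0))
print(keep(stub))  # False
```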

Here’s a summary of the collection:

First Tweet in Archive 12/09/2016 09:06:52
Last Tweet in Archive 22/09/2016 09:55:03
Number of Tweets
Number of links
Number of RTs
Number of accounts
From the main archive I was able to focus on the number of Tweets per source and per user language setting.
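Both breakdowns are simple tallies over two standard TAGS columns; a minimal sketch (the row dicts below are illustrative, not real archive rows):

```python
from collections import Counter

def breakdowns(rows):
    """Tally tweets per client ('source') and per account language
    setting ('user_lang'), both standard fields in a TAGS export."""
    return (Counter(r["source"] for r in rows),
            Counter(r["user_lang"] for r in rows))

rows = [{"source": "Twitter for iPhone", "user_lang": "en"},
        {"source": "Twitter Web Client", "user_lang": "en"},
        {"source": "Twitter for iPhone", "user_lang": "fr"}]
by_source, by_lang = breakdowns(rows)
print(by_source.most_common())
print(by_lang.most_common())
```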


source Count
Twitter for iPhone
Twitter Web Client
Twitter for Android
Twitter for iPad
UK Trends
Mobile Web (M5)
Twitter for Windows Phone
Big Data news flow
Lt RTEngine

User Language Setting (user_lang)

user_lang Count Notes
(counts and language codes lost; surviving notes: for one language setting 6 of the tweets were spam, for another both were spam)
The summary above is of the raw collection, so not all the activity it reflects is either ‘human’ or relevant: some accounts tweeting have been identified as bots tweeting spam (a less human-readable hashtag could potentially have avoided such spamming, given the relatively low activity). Except where I identified spam Tweets, in this post I have not looked at the Tweets’ text data (i.e. I haven’t shared any text or content analysis here). Maybe I will if I have time in the near future. As Retweets were counted as Tweets in this archive, a more specific and precise analysis would have to filter them from the dataset.
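Filtering retweets out of such an archive is a one-liner, since in a TAGS-style text export they are recognisable by text beginning "RT @" (native-retweet metadata such as retweeted_status is not preserved in the sheet); a minimal sketch:

```python
def drop_retweets(rows):
    """Filter out retweets from TAGS-style rows by the 'RT @' prefix
    convention; quote-tweets and manual RTs mid-text are not caught."""
    return [r for r in rows if not r["text"].startswith("RT @")]

rows = [{"text": "Great panel #TheDataDebates"},
        {"text": "RT @someone: Great panel #TheDataDebates"}]
print(len(drop_retweets(rows)))  # 1
```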

I am fully aware this would be more interesting and useful if there were opportunities for others to replicate the analysis through access to the source dataset I used. There are lots of interesting analyses that could be run on, and data to focus on in, a dataset such as this. As in previous posts about other events, I am simply sharing this post right now as a quick indicative update, published only a few hours after the event concluded.

It was pointed out last night that “social media data mining is starting but still has a way to go to catch up with hard analytical methodologies.” A post like this does not claim to employ such methodologies; it simply seeks to contribute to the debate with evidence that may hopefully inspire other studies. Perhaps it’s a two-way process, and “hard analytical methodologies” (and researchers’ and users’ attitudes regarding cultural paradigms around ethics, privacy, consent, statistical significance) also have a way to go to catch up with new, pervasive forms of data creation and dissemination that perhaps require different, media-, community- and content-specific approaches to doing research.

Other Considerations [I am reusing my own text from previous posts here]

Both research and experience show that the Twitter Search API is not 100% reliable. Large Tweet volumes affect the search collection process. The API might “over-represent the more central users”, not offering “an accurate picture of peripheral activity” (González-Bailon et al., 2012). Apart from the filters and limitations already declared, it cannot be guaranteed that each and every Tweet tagged with #TheDataDebates during the indicated period was analysed. The dataset was shared for archival, comparative and indicative educational research purposes only.

Only content from public accounts, obtained from the Twitter Search API, was analysed. The source data is also publicly available to all Twitter users via the Twitter Search API, and available to anyone with an Internet connection via the Twitter and Twitter Search web clients and mobile apps, without the need for a Twitter account. These posts and the resulting dataset contain the results of analyses of Tweets that were published openly on the Web with the queried hashtag; the content of the Tweets is the responsibility of their original authors. Original Tweets are likely to be copyright of their individual authors, but please check individually. This work is shared to archive, document and encourage open educational research into scholarly activity on Twitter.

No private personal information was shared. The collection, analysis and sharing of the data has been enabled and allowed by Twitter’s Privacy Policy. The sharing of the results complies with Twitter’s Developer Rules of the Road. A hashtag is metadata users choose freely to use so their content is associated, directly linked to and categorised with the chosen hashtag.

The purpose and function of hashtags is to organise and describe information/outputs under the relevant label in order to enhance the discoverability of the labelled information/outputs (Tweets in this case). Tweets published publicly by scholars or other professionals during academic conferences or events are often publicly tagged (labelled) with a hashtag dedicated to the event in question. This practice used to be confined to a few ‘niche’ fields; it is increasingly becoming the norm rather than the exception. Though every reason for Tweeters’ use of hashtags cannot be generalised or predicted, it can be argued that scholarly Twitter users form specialised, self-selecting public professional networks that tend to observe scholarly practices and accepted modes of social and professional behaviour. In general terms, it can be argued that scholarly Twitter users willingly and consciously tag their public Tweets with a conference hashtag as a means to network and to promote, report from, reflect on, comment on and generally contribute publicly to the scholarly conversation around conferences.

As Twitter users, conference Twitter hashtag contributors have agreed to Twitter’s privacy and data-sharing policies. Professional associations such as the Modern Language Association and the American Psychological Association recognise Tweets as citeable scholarly outputs. Archiving scholarly Tweets is a means to preserve this form of rapid online scholarship that can otherwise very likely become unretrievable as time passes; Twitter’s Search API has well-known temporal limitations for retrospective historical search and collection. Beyond individual Tweets as scholarly outputs, the collective scholarly activity on Twitter around a conference, academic project or event can provide interesting insights for the contemporary history of scholarly communications. Though this work has limitations and might not be thoroughly systematic, it is hoped it can contribute to developing new insights into a discipline’s public concerns as expressed on Twitter over time.


González-Bailon, Sandra, Wang, Ning, Rivero, Alejandro, Borge-Holthoefer, Javier and Moreno, Yamir (2012) Assessing the Bias in Samples of Large Online Networks (December 4, 2012). Available at SSRN: http://dx.doi.org/10.2139/ssrn.2185134

Priego, Ernesto (2016) #WLIC2016 Most Frequent Terms Roundup. figshare. https://dx.doi.org/10.6084/m9.figshare.3749367.v2

AHRC [ahrcpress]. (2016, Sep 21). Social media data mining is starting but still has a way to go to catch up with hard analytical methodologies #TheDataDebates [Tweet]. Retrieved from https://twitter.com/ahrcpress/status/778652767636389888

Priego, Ernesto (2016) #TheDataDebates Tweet Timestamps, Source, User Language. figshare. https://dx.doi.org/10.6084/m9.figshare.3976731.v1 Retrieved: 10:03, Oct 01, 2016 (GMT)
