Libraries! Most Frequent Terms in #WLIC2016 Tweets (part IV)

IFLA World Library and Information Congress, 82nd IFLA General Conference and Assembly, 13–19 August 2016, Columbus, Ohio, USA. Copyright by IFLA, CC BY 4.0.

This is part IV. For necessary context, methodology and limitations, please see here (part 1), here (part 2) and here (part 3).

Since this post was first published and shared I may have made new edits; I often come back to posts once they have been published to revise them.

Throughout the process of performing the day-by-day text analysis I became aware of other limitations to take into account, and I have revised part 3 accordingly.

Summary

Here’s a summary of the counts of the source (unrefined) #WLIC2016 archive I collected:

Number of Links 12435
Number of RTs 14570 <-estimate based on occurrence of RT
Number of Tweets 23552
Unique Tweets 23421 <-used to monitor quality of archive
First Tweet in Archive 14/08/2016 11:29:03 EDT
Last Tweet in Archive 22/08/2016 04:20:53 EDT
In Reply Ids 270
In Reply @s 429
Number of Tweeters 3035

As previously indicated the Tweet count includes RTs. This count might require further deduplication and it might include bots’ Tweets and possibly some unrelated Tweets.
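
As an aside, here is a minimal sketch of how counts like those in the summary above could be reproduced from a TAGS-style CSV export; the file name and the column names (id_str, text, from_user) are assumptions based on typical TAGS output, not the actual workflow used.

```python
# A rough reproduction of the archive summary counts; column names are
# assumptions based on a typical TAGS export.
import pandas as pd

tweets = pd.read_csv("wlic2016_archive.csv")

n_tweets = len(tweets)
n_unique = tweets["id_str"].nunique()               # used to monitor archive quality
n_links = tweets["text"].str.contains("http", case=False).sum()
n_rts = tweets["text"].str.startswith("RT ").sum()  # estimate based on occurrence of RT
n_tweeters = tweets["from_user"].nunique()

print(f"Tweets: {n_tweets}, unique: {n_unique}, links: {n_links}, "
      f"RT estimate: {n_rts}, Tweeters: {n_tweeters}")
```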

Here's a summary of the Tweet count of the #WLIC2016 dataset I refined from the complete archive. As I explained in part 3, I organised the Tweets into conference days, from Sunday 14 to Thursday 18 August. Each day was a different corpus to analyse. I also analysed the whole set as a single corpus to ensure the totals replicated. (A sketch of this split follows; the resulting counts per day are below.)
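
A minimal sketch of that per-day split, assuming the refined archive is a CSV with time and text columns (the file name, column names and timestamp format are assumptions):

```python
# Split the refined archive into one plain-text corpus per conference day,
# plus the whole set as a single corpus. Column names are assumptions.
import pandas as pd

tweets = pd.read_csv("wlic2016_refined.csv")
tweets["time"] = pd.to_datetime(tweets["time"], dayfirst=True)

# Keep only the days that featured panel sessions (Sun 14 - Thu 18 August).
sessions = tweets[(tweets["time"] >= "2016-08-14") & (tweets["time"] < "2016-08-19")]

for day, group in sessions.groupby(sessions["time"].dt.date):
    group["text"].to_csv(f"wlic2016_{day}.txt", index=False, header=False)
    print(day, len(group))

sessions["text"].to_csv("wlic2016_sun-thu.txt", index=False, header=False)
```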

Day                        Tweet count
Sunday 14 August 2016      2543
Monday 15 August 2016      6654
Tuesday 16 August 2016     4861
Wednesday 17 August 2016   4468
Thursday 18 August 2016    3801
Sunday – Thursday          22327

The Most Frequent Terms

The text analysis involved analysing each corpus, first obtaining a 'raw' output of the 300 most frequent terms and their counts. As described in previous posts, I then applied an edited English stop words list, followed by manual editing of the top 100 most frequent terms (for the shared dataset) and of the top 50 for this post. Unlike before, in this case I removed 'barack' and 'obama' from Thursday's and Monday's corpora, and tried to remove usernames and hashtags, though it's possible that further disambiguation and refining might be needed in those top 100 and top 50.
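
For illustration, a minimal sketch of this refining step, assuming the raw Terms output was exported as a CSV with term and count columns (file and column names are assumptions); manual editing would still follow.

```python
# Apply an edited stop-word list to a raw term-frequency export and drop
# obvious usernames and hashtags before the manual editing pass.
import pandas as pd

raw = pd.read_csv("raw_top_300_terms.csv")              # columns: term, count
stopwords = set(open("edited_english_stopwords.txt").read().split())

edited = raw[~raw["term"].isin(stopwords)]
edited = edited[~edited["term"].str.startswith(("@", "#"))]  # usernames, hashtags

edited.sort_values("count", ascending=False).head(50).to_csv(
    "edited_top_50_terms.csv", index=False)
```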

The text analysis of the Sun-Thu Tweets as a single corpus gave us the following Top 50:

#WLIC2016 Sun-Thu Top 50 Most Frequent Terms (stop-words applied; edited)

Rank  Term         Count
1     libraries    2895
2     library      2779
3     librarians   1713
4     session      1467
5     access       872
6     world        832
7     public       774
8     copyright    766
9     people       757
10    need         750
11    data         746
12    make         733
13    privacy      674
14    digital      629
15    new          615
16    wikipedia    602
17    indigenous   593
18    use          574
19    information  555
20    great        539
21    knowledge    512
22    literacy     502
23    internet     481
24    work         428
25    thanks       419
26    message      416
27    future       412
28    change       379
29    social       378
30    open         369
31    just         354
32    research     353
33    know         330
34    community    323
35    important    319
36    oclc         317
37    collections  312
38    books        300
39    learn        300
40    opening      291
41    read         289
42    impact       287
43    place        282
44    good         280
45    services     277
46    national     276
47    best         272
48    latest       269
49    report       267
50    users        266

As mentioned above, I also analysed each day as a single corpus. I refined the 'raw' 300 most frequent terms per day to a top 100 after stop words and manual editing. I then laid them all out as a single table for comparison (a sketch of this step follows).
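
A minimal sketch of that layout step, assuming one edited CSV per day, ordered by frequency (file names are assumptions):

```python
# Lay the per-day top terms side by side, one column per conference day.
import pandas as pd

days = ["sun-14", "mon-15", "tue-16", "wed-17", "thu-18"]
columns = {day: pd.read_csv(f"edited_top_100_{day}.csv")["term"].head(50)
                  .reset_index(drop=True)
           for day in days}

comparison = pd.DataFrame(columns)
comparison.index += 1                                   # rank from 1, not 0
comparison.to_csv("top_50_per_day_comparison.csv", index_label="rank")
```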

#WLIC2016 Top 50 Most Frequent Terms per Day Comparison (stop-words applied; edited)

Rank Sun 14 Aug   Mon 15 Aug   Tue 16 Aug    Wed 17 Aug   Thu 18 Aug
1    libraries    library      library       libraries    libraries
2    library      libraries    privacy       library      library
3    librarians   librarians   libraries     librarians   librarians
4    session      session      librarians    indigenous   public
5    access       copyright    session       session      session
6    world        wikipedia    people        knowledge    need
7    public       digital      data          access       data
8    copyright    make         indigenous    data         impact
9    people       world        make          literacy     new
10   need         internet     access        need         digital
11   data         access       wikipedia     great        world
12   make         new          use           people       thanks
13   privacy      need         information   research     access
14   digital      use          world         public       value
15   new          public       public        new          national
16   wikipedia    future       knowledge     marketing    change
17   indigenous   people       copyright     general      privacy
18   use          message      homeless      open         great
19   information  collections  literacy      world        work
20   great        information  oclc          archives     research
21   knowledge    content      great         just         use
22   literacy     open         homelessness  national     people
23   internet     report       need          assembly     knowledge
24   work         space        freedom       place        social
25   thanks       trend        like          make         using
26   message      great        thanks        read         know
27   future       net          internet      community    make
28   change       work         info          social       services
29   social       neutrality   latest        reading      skills
30   open         making       experiencing  work         award
31   just         update       theft         information  information
32   research     books        important     use          learning
33   know         collection   just          learn        users
34   community    social       subject       share        book
35   important    design       change        matters      user
36   oclc         data         guidelines    key          best
37   collections  thanks       digital       know         collections
38   books        librarian    students      global       academic
39   learn        know         know          government   measure
40   opening      shaping      online        life         poland
41   read         google       protect       thanks       community
42   impact       change       working       important    learn
43   place        literacy     statement     development  outcomes
44   good         just         work          love         share
45   services     technology   future        impact       time
46   national     online       read          archivist    media
47   best         poster       award         good         section
48   latest       info         create        books        important
49   report       working      services      cultural     service
50   users        law          good          help         closing

I have shared on figshare a dataset containing the summaries above as well as the raw top 300 most frequent terms, both for the whole set and divided per day. The dataset also includes the top 100 most frequent terms lists per day that I manually edited after applying the edited English stop word filter.

You can download the spreadsheet from figshare:

Priego, Ernesto (2016): #WLIC2016 Most Frequent Terms Roundup. figshare.
https://dx.doi.org/10.6084/m9.figshare.3749367.v2

Please bear in mind that, as refining was done manually and the Terms tool does not always seem to apply stop words evenly, there might be errors. This is why the raw output was shared as well. This data should be taken as indicative only.

As is increasingly recommended for data sharing, the CC0 license has been applied to the resulting output in the repository. It is important, however, to bear in mind that some terms appearing in the dataset might be licensed individually and differently; copyright of the source Tweets (and sometimes of individual terms) belongs to their authors. Authorial/curatorial/collection work has been performed on the shared file as a curated dataset resulting from analysis, in order to make it available as part of the scholarly record. If this dataset is consulted, attribution is always welcome.

Ideally, for proper reproducibility and to encourage other studies, the whole archive dataset should be available. Those wishing to obtain the full Tweets should still be able to collect them themselves via text and data mining methods.

Conclusions

Indeed, for us today there is absolutely nothing surprising about the term 'libraries' being the most frequent word in Tweets coming from IFLA's World Library and Information Congress. Looking at the whole dataset, however, provides an insight into other frequent terms used by Library and Information professionals in the context of libraries. These terms might not remain frequent for long, and might not have been frequent words in the past (I can only hypothesise; having evidence would be nice).

A key hypothesis guiding this exercise has been that, by looking at the words appearing in social media outputs discussing and reporting from a professional association's major congress, we can get a rough idea of where a sector's concerns are or were.

I guess it can safely be said that words become meaningful in context. In an age in which repetition and frequency are key to public constructions of cultural relevance ('trending topics' increasingly define the news agenda, what people talk about and how they talk about it), the repetition and frequency of key terms might provide a type of meaningful evidence in itself. Evidence, however, is just the beginning: further interpretation and analysis must follow.

One cannot obtain the whole picture from decomposing a collectively, socially, publicly created textual corpus (or perhaps any corpus, unless it is a list of words from the start) into its constituent parts. It could also be said that many tools and methods often tell us more about themselves (and those using them) than about the objects of study.

So far, text analysis (Rockwell 2003) and 'distant reading' through automated methods have focused on working with books (Ramsay 2014). However, I'd like to suggest that this kind of text analysis can be another way of reading social media texts and can contribute to the assessment of their cultural relevance as living documents of a particular setting and moment in time. Who knows, they might also be telling us something about the present perception and activity of a professional field, and might help us to compare it with those in the future.

Other Considerations

Both research and experience show that the Twitter search API is not 100% reliable. Large Tweet volumes affect the search collection process. The API might "over-represent the more central users", not offering "an accurate picture of peripheral activity" (González-Bailón et al., 2012).

Apart from the filters and limitations already declared, it cannot be guaranteed that each and every Tweet tagged with #WLIC2016 during the indicated period was analysed. The dataset was shared for archival, comparative and indicative educational research purposes only.

Only content from public accounts, obtained from the Twitter Search API, was analysed.  The source data is also publicly available to all Twitter users via the Twitter Search API and available to anyone with an Internet connection via the Twitter and Twitter Search web client and mobile apps without the need of a Twitter account.

These posts and the resulting dataset contain the results of analyses of Tweets that were published openly on the Web with the queried hashtag; the content of the Tweets is responsibility of the original authors. Original Tweets are likely to be copyright their individual authors but please check individually.

This work is shared to archive, document and encourage open educational research into scholarly activity on Twitter. The resulting dataset does not contain complete Tweets nor Twitter metadata. No private personal information was shared. The collection, analysis and sharing of the data has been enabled and allowed by Twitter’s Privacy Policy. The sharing of the results complies with Twitter’s Developer Rules of the Road.

A hashtag is metadata that users freely choose so that their content is associated with, directly linked to and categorised under the chosen hashtag. The purpose and function of hashtags is to organise and describe information/outputs under the relevant label in order to enhance the discoverability of the labelled information/outputs (Tweets in this case). Tweets published publicly by scholars or other professionals during academic conferences are often publicly tagged (labelled) with a hashtag dedicated to the conference in question. This practice used to be confined to a few 'niche' fields; it is increasingly becoming the norm rather than the exception.

Though Tweeters' reasons for using hashtags cannot be generalised or predicted, it can be argued that scholarly Twitter users form specialised, self-selecting public professional networks that tend to observe scholarly practices and accepted modes of social and professional behaviour.

In general terms it can be argued that scholarly Twitter users willingly and consciously tag their public Tweets with a conference hashtag as a means to network and to promote, report from, reflect on, comment on and generally contribute publicly to the scholarly conversation around conferences. As Twitter users, conference Twitter hashtag contributors have agreed to Twitter’s Privacy and data sharing policies.

Professional associations like the Modern Language Association and the American Psychological Association recognise Tweets as citable scholarly outputs. Archiving scholarly Tweets is a means to preserve this form of rapid online scholarship that can otherwise very likely become unretrievable as time passes; Twitter's search API has well-known temporal limitations for retrospective historical search and collection.

Beyond individual Tweets as scholarly outputs, the collective scholarly activity on Twitter around a conference or academic project or event can provide interesting insights for the contemporary history of scholarly communications. Though this work has limitations and might not be thoroughly systematic, it is hoped it can contribute to developing new insights into a discipline’s public concerns as expressed on Twitter over time.

References

González-Bailón, S., Wang, N., Rivero, A., Borge-Holthoefer, J. and Moreno, Y. (2012) Assessing the Bias in Samples of Large Online Networks (December 4, 2012). Available at SSRN: http://dx.doi.org/10.2139/ssrn.2185134

Priego, Ernesto (2016) #WLIC2016 Most Frequent Terms Roundup. figshare.
https://dx.doi.org/10.6084/m9.figshare.3749367.v2

Ramsay, Stephen (2014) "The Hermeneutics of Screwing Around; or What You Do with a Million Books." In Pastplay: Teaching and Learning History with Technology, edited by Kevin Kee, 111-20. Ann Arbor: University of Michigan Press. Also available at http://quod.lib.umich.edu/d/dh/12544152.0001.001/1:5/–pastplay-teaching-and-learning-history-with-technology?g=dculture;rgn=div1;view=fulltext;xc=1

Rockwell, Geoffrey (2003) "What is Text Analysis, Really?" [PDF preprint]. Literary and Linguistic Computing, vol. 18, no. 2, pp. 209-219.

What’s in a Word? Most Frequent Terms in #WLIC2016 Tweets (part III)

IFLA World Library and Information Congress, 82nd IFLA General Conference and Assembly, 13–19 August 2016, Columbus, Ohio, USA. Copyright by IFLA, CC BY 4.0.

This is part three. For necessary context please start here (part 1) and here (part 2). The final, fourth part is here.

It’s Friday already and the sessions from IFLA’s WLIC 2016 have finished. I’d like to finish what I started and complete a roundup of my quick (but in practice not-so-quick) collection and text analysis of a sample of #WLIC2016 Tweets. My intention is to finish this with a fourth and final blog post following this one and to share a dataset on figshare as soon as possible.

As previously, I customised the spreadsheet settings to collect only Tweets from accounts with at least one follower and to reflect the Congress's location and time zone. Before exporting as CSV I did a basic automated deduplication, but I did not do any further data refining (which means that non-relevant or spam Tweets may be included in the dataset).
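
A minimal sketch of that basic automated deduplication, assuming a TAGS-style sheet with an id_str column identifying each Tweet (file and column names are assumptions):

```python
# Drop rows sharing the same Tweet id before exporting as CSV.
import pandas as pd

raw = pd.read_csv("wlic2016_raw.csv")
deduped = raw.drop_duplicates(subset="id_str")
deduped.to_csv("wlic2016_dedup.csv", index=False)
print(len(raw) - len(deduped), "duplicates removed")
```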

What follows is a basic quantitative summary of the initial complete sample dataset:

  • Total Tweets: 22,540 Tweets (includes RTs)
  • First Tweet in complete sample dataset: Sunday 14/08/2016 11:29:03 EDT
  • Last Tweet in complete sample dataset: Friday 19/08/2016 04:20:43 EDT
  • Number of links: 11,676
  • Number of RTs:    13,859
  • Number of usernames: 2,811

The Congress had activities between Friday 12 August and Friday 19 August, but sessions between Sunday 14 August and Thursday 18 August. Ideally I would have liked to collect Tweets from the early hours of Sunday 14 August, but I started collecting late, so the earliest I got to was 11:29:03 EDT; at least it was before the first panel sessions started. For more context on timings, see the Congress outline.

I refined the complete dataset to include only the days that featured panel sessions, and I have organised the data in a different sheet per day for individual analysis. I have also created a table detailing the Tweet counts per Congress sessions day. [Later I realised that though I had the metadata for the Columbus, Ohio time zone, I ended up organising the data into GMT/BST days. There is a five-hour difference, but the collected Tweets per day still roughly correspond to the timings of the conference. Of course many will have participated in the hashtag remotely (not present at the event) and many present will have tweeted asynchronously (not 'live'). I don't think this makes much of a difference (no pun intended) to the analysis, but it's something I was aware of and that others may or may not want to consider as a limitation.]
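
One way to make that time-zone caveat concrete: a minimal sketch that localises the archive's EDT timestamps and checks how many Tweets shift to a different calendar day in GMT/BST (the file name, column name and format are assumptions):

```python
# Compare the conference-time (EDT) day of each Tweet with its GMT/BST day.
import pandas as pd

tweets = pd.read_csv("wlic2016_refined.csv")
times = pd.to_datetime(tweets["time"], dayfirst=True).dt.tz_localize("America/New_York")

edt_days = times.dt.date                                  # conference time zone
bst_days = times.dt.tz_convert("Europe/London").dt.date   # how the split was done

print((edt_days != bst_days).sum(), "Tweets fall on a different day in BST")
```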

Tweets collected per day

Day                        Tweet count
Sunday 14 August 2016      2543
Monday 15 August 2016      6654
Tuesday 16 August 2016     4861
Wednesday 17 August 2016   4468
Thursday 18 August 2016    3801

Total Tweets in refined dataset: 22,327.

(Always bear in mind that these figures reflect the Tweets in the collected dataset; they do not mean that this was in fact the total number of Tweets published with the hashtag during that period. Not only do the settings of my query affect the results; Twitter's search API also has limitations and cannot be assumed to always return the same type or number of results.)

I am still in the process of analysing the dataset. There are of course multiple types of analyses one could run on this data, but bear in mind that in this case I have focused only on using text analysis to obtain the most frequent terms in the text of the Tweets tagged with #WLIC2016 that I collected.

As before, in this case I am using the Terms tool from Voyant Tools to perform a basic text analysis in order to identify the number of total words, the number of unique word forms and the most frequent terms per day; in other words, the data from each day became an individual corpus. (The complete refined dataset including all collected days could be analysed as a single corpus as well, for comparison.) I am gradually exporting and collecting the 'raw' output from the Terms tool per day, so that once I have finished applying the stop words to each corpus this output can be compared, and so that it could be reproduced with other stop word lists if desired.
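
Voyant's Terms tool did the actual work; as a rough stand-in, a minimal sketch of producing a 'raw' top 300 with the Python standard library (the file name is an assumption):

```python
# Count term frequencies in one day's corpus and print the raw top 300.
import re
from collections import Counter

text = open("wlic2016_2016-08-15.txt", encoding="utf-8").read().lower()
tokens = re.findall(r"[#@]?\w+", text)

for term, count in Counter(tokens).most_common(300):
    print(term, count)
```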

As before, I am using the English stop word list which I previously edited to include Twitter-specific terms (e.g. t.co, amp, https) as well as dataset-specific terms (e.g. the Congress's Twitter account, related hashtags, etc.), but this time I also included all 2,811 account usernames in the complete dataset so they would be excluded from the most frequent terms. These are the usernames from accounts with Tweets in the dataset; other usernames (mentioned in Tweets' text but which did not themselves Tweet with the hashtag) were logically not filtered, so whenever they are easily identifiable I am painstakingly removing them (manually!) from the remaining list. I am sure there must be a more effective way of doing this (one possibility is sketched below), but I find the combination of 'distant' (automated) editing and 'close' (manual) editing interesting and fun.
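
One such automated alternative, sketched minimally: extract every @mention from the Tweets' text and append it to the stop-word list, catching accounts that were mentioned but never tweeted with the hashtag themselves (file names are assumptions):

```python
# Harvest @mentions so mentioned-but-not-tweeting usernames get filtered too.
import re

mentions = set()
for line in open("wlic2016_sun-thu.txt", encoding="utf-8"):
    mentions.update(m.lower() for m in re.findall(r"@(\w+)", line))

with open("edited_english_stopwords.txt", "a", encoding="utf-8") as f:
    f.write("\n".join(sorted(mentions)) + "\n")
```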

I am using the same edited stop word list for each analysis. In this case I have also manually removed non-English terms (mostly pronouns and articles). Needless to say, I did this not because I didn't think they were relevant (quite the opposite) but because, even though they had a presence, they were not fairly comparable to the overwhelming majority of English terms (a ranking of most frequent non-English terms would be needed). As I will also have shared the unedited, 'raw' top most frequent terms in the dataset, anyone wishing to look into the non-English terms could do so and run their own analyses without my own subjective stop word list and editing getting in the way. I tried to be as systematic as possible, but disambiguation would be needed (the Terms tool is case and context insensitive, so a term could have been a proper name or a username, and to be consistent I should have removed those too; again, having the raw list would allow others to correct any filtering/curation/stop word mistakes).

I am aware there are far more sophisticated methods of dealing with this data. Personally, doing this type of simple data collection and text analysis is an exercise in, and an interrogation of, data collection and analysis methods and tools as reflective practices. A hypothesis behind it is that the terms a community or discipline uses (and retweets) do say something about that community or discipline, at least for a particular moment in time and a particular place in particular settings. Perhaps it also says something about the medium used to express those terms. When 'screwing around' with texts it may be unavoidable to wonder what there is to it beyond 'bean-counting' (what's in a word? what's in a frequent term?), and what there is to social media and academic/professional live-tweeting that can or cannot be quantified. Doing this type of work makes me reflect as well on my own limitations, the limits of text analysis tools, the appropriateness of tools, the importance of replication and reproducibility, and the need to document and to share what has been documented.

I'm also thinking about documentation and the open sharing of data outputs as messages in bottles, or, as has been said of metadata, as 'letters to the future'. I'm aware that this may also seem like navel-gazing of little interest outside those associated with the event in question. I would say that the role of libraries in society at large is more crucial and central than many outside the library and information sector may think (but that's a subject for another time). Perhaps one day in the future it might be useful to look back at what we were talking about in 2016 and what words we used to talk about it. (Look, we were worried about that!) Or maybe no one cares and no one will care, or by then it will be possible to retrieve anything anywhere with great degrees of relevance and precision (including critical interpretation). In the meantime, I will keep refining these lists and will share the output as soon as I can.

Next… the results!

The final, fourth part is here.

Most Frequent Terms in #WLIC2016 Tweets (part II)

IFLA World Library and Information Congress, 82nd IFLA General Conference and Assembly, 13–19 August 2016, Columbus, Ohio, USA. Copyright by IFLA, CC BY 4.0.

The first part of this series provides necessary context.

I have now an edited list of the top 50 most frequent terms extracted from a cleaned dataset comprised of 10,721 #WLIC2016 Tweets published by 1,760 unique users between Monday 15/08/2016 10:11:08 EDT and Wednesday 17/08/2016 07:16:35 EDT.

The analysed corpus contained the raw text of the Tweets (includes RTs), comprising 185,006 total words and 12,418 unique word forms.
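
Those two corpus figures are straightforward to reproduce; a minimal sketch (the file name is an assumption):

```python
# Total words and unique word forms, as reported by Voyant's summary.
import re

text = open("wlic2016_mon-wed.txt", encoding="utf-8").read().lower()
tokens = re.findall(r"\w+", text)

print("total words:", len(tokens))             # cf. 185,006
print("unique word forms:", len(set(tokens)))  # cf. 12,418
```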

Stop words were applied as detailed in the first part of this series, and the resulting list (a raw list of 300 most frequent terms) was further edited to remove personal names, personal Twitter user names, common hashtags, etc.  Some organisational Twitter user names were not removed from the list, as an indication of their ‘centrality’ in the network based on the frequency with which they appeared in the corpus.

So here’s an edited list of the top 50 most frequent terms from the dataset described above:

Term            Count
library         1379
libraries       1102
librarians      811
session         715
privacy         555
wikipedia       523
make            484
copyright       465
people          428
digital         378
access          375
use             362
public          340
data            322
need            319
iflabuild2016   308
world           308
information     298
internet        289
new             272
great           259
indigenous      255
iflatrends      240
report          202
knowledge       200
future          187
work            187
libraryfreedom  184
literacy        184
space           180
change          178
thanks          172
oclc            171
open            170
just            169
books           168
trend           165
important       162
info            162
know            162
social          161
net             159
neutrality      159
wikilibrary     158
collections     157
working         157
librarian       154
online          154
making          149
guidelines      148

Is this interesting? Is it useful? I don’t know, but I’ve enjoyed documenting it. Reflecting about different criteria to apply stop words and clean, refine terms has also been interesting.

I guess that deep down I believe it's better to document than not to, even if we may think there should be other ways of doing it (otherwise I wouldn't even try). Value judgements about the utility or insightfulness of specific data presented in specific ways are an a posteriori process.

I hope to be able to continue collecting data, and once the congress/conference ends I hope to share a dataset with the raw (unedited, unfiltered) most frequent terms in the text of Tweets published with the event's hashtag. Anyone else interested could then clean, curate and analyse the data in different ways (wishful thinking, but hey, it's hope that guides us).

What Library Folk Live Tweet About: Most Frequent Terms in #WLIC2016 Tweets

IFLA World Library and Information Congress. Logo copyright by IFLA, CC BY 4.0.

Part 2 is here, part 3 here and the final, fourth part is here.

IFLA stands for the International Federation of Library Associations and Institutions.

The IFLA World Library and Information Congress 2016 and 82nd IFLA General Conference and Assembly, 'Connections. Collaboration. Community.', is currently taking place (13–19 August 2016) at the Greater Columbus Convention Center (GCCC) in Columbus, Ohio, United States.

The official hashtag of the conference is #WLIC2016. Earlier, I shared a searchable, live archive of the hashtag here. (Page may be slow to load depending on bandwidth).

I have looked at the text from 4,945 Tweets published with #WLIC2016 from 14/08/2016 to 15/08/2016 11:16:06 (EDT, Columbus Ohio time). Only accounts with at least 1 follower were included. I collected them with Martin Hawksey’s TAGS.

According to Voyant Tools this corpus had 82,809 total words and 7,506 unique word forms.

I applied an English stop word list which I edited to include Twitter-specific terms (https, t.co, amp (&) etc.), proper names (Barack Obama, other personal usernames) and some French stop words (mainly personal pronouns). I also edited the stop word list to include some dataset-specific terms such as the conference hashtag and other common hashtags, ‘ifla’, etc. (I left others that could also be considered dataset-specific terms, such as ‘session’ though).

The result was a listing of 800 frequent terms (the least frequent terms in the list appeared 5 times). I then cleaned the data of any dataset-specific stop words that the stop word list did not filter and created an edited, ordered listing of the 50 most frequent terms. I left in organisations' Twitter user names (including @potus), as well as other terms that may not seem that meaningful on their own (but who knows, they may be).

It must be taken into account that the corpus included Retweets; each RT counted as a single Tweet, even if that meant terms were logically repeated. Term counts in the list therefore reflect the fact that the dataset contains Retweets (which implies the repetition of text).
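
A minimal sketch of quantifying that effect for a single term: count it with and without the Tweets that begin with "RT " (the file name and example term are illustrative assumptions):

```python
# Compare a term's frequency with and without Retweets.
import re

lines = open("wlic2016_initial.txt", encoding="utf-8").readlines()
originals = [l for l in lines if not l.startswith("RT ")]

def count(rows, term):
    return sum(len(re.findall(rf"\b{term}\b", row.lower())) for row in rows)

print("with RTs:", count(lines, "copyright"))
print("without RTs:", count(originals, "copyright"))
```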

If for some reason you are curious about what the most frequent words in #WLIC2016 Tweets were during this initial period (see above), here’s the top 50:

Term         Count
libraries    543
copyright    517
librarians   484
library      406
session      374
world        326
message      271
opening      249
access       226
make         204
digital      195
internet     162
future       161
information  157
new          146
use          141
people       138
president    131
potus        125
literacy     118
need         117
oclc         114
ceremony     113
dpla         109
poster       105
thanks       103
collections  102
public       100
delegates    99
cilipinfo    98
countries    95
iflatrends   95
google       93
shaping      91
work         89
drag         83
report       83
create       81
open         81
data         79
content      78
learn        78
latest       77
making       77
fight        76
ifla_arl     75
read         74
info         73
exceptions   69
great        68

So, for what it's worth, those were the 50 most frequent terms in the corpus.

I, for one, not being present at the Congress, found it interesting that 'copyright' was the second most frequent term, following 'libraries'. One also notices that, unsurprisingly, the listing of the most frequent terms includes some key terms (such as 'access', 'internet', 'digital', 'open', 'data') that have concerned Library and Information professionals of late.

Were these the terms you’d have expected to make a ‘top 50’ in almost 5,000 Tweets from this initial phase of this particular conference?

The conference hasn’t finished yet of course. But so far, for a libraries and information world congress, which terms would you say are noticeable by their absence in this list? ;-)

Part 2 is here, part 3 here and the final, fourth part is here.

‘BBCDebate’ on Twitter. A First Look into an Archive of #BBCDebate Tweets

[For the previous post in this series, click here].

The BBC Debate

The BBC's "Great Debate" was broadcast live in the UK by the BBC on Tuesday 21 June 2016 between 20:00 and 22:00 BST. It saw activity on Twitter under the #BBCDebate hashtag.

I collected some of the Tweets tagged with #BBCDebate using a Google Spreadsheet. (See the methodology section below). I have shared an anonymised dataset on figshare:

Priego, E. (2016) “The BBC’s Great Debate”: Anonymised Data from a #BBCDebate Archive. figshare. https://dx.doi.org/10.6084/m9.figshare.3457688.v1

[Note: figshare DOIs are not resolving or there are delays in resolving; it should be fixed soon…]

Archive Summary (#BBCDebate)

Number of links 16826
Number of RTs 32206 <-estimate based on occurrence of RT
Number of Tweets 38116
Unique tweets 38066 <-used to monitor quality of archive
First Tweet in Archive 14/06/2016 22:03:18 BST
Last Tweet in Archive 22/06/2016 09:12:32 BST
In Reply Ids 349
In Reply @s 456
Tweet rate (tw/min) 62 Tweets/min (from last archive 10mins)
Unique Users in archive 20,243
Tweets from StrongerIn in archive 16
Tweets from vote_leave in archive 15

The raw data was downloaded as an Excel spreadsheet file containing 38,116 Tweets (38,066 unique Tweets) publicly published with the queried hashtag (#BBCDebate) between 14/06/2016 22:03:18 and 22/06/2016 09:12:32 BST.

Due to the expected high volume of Tweets only users with at least 10 followers were included in the archive.

As indicated above, the BBC Debate was broadcast live on UK national television on Tuesday 21 June 2016 between 20:00 and 22:00 BST. This means the data collection covered the real-time broadcasting of the live debate (see the chart below).

#BBCDebate Activity in the last 3 days. Key: blue: Tweet; red: Reply.

The data collected indicated that only 12 Tweets in the whole archive contained geolocation data. A variety of user languages (user_lang) were identified.

Number of Different User Languages (user_lang)

Note this is not the language of the Tweets’ text, but the language setting in the application used to post the Tweet. In other words user_lang indicates the language the Twitter user selected from the drop-down list on their Twitter Settings page. This metadata is an indication of a user’s primary language but it might be misleading. For example, a user might select ‘es’ (Spanish) as their preferred language but compose their Tweets in English.

The following list ranks user_lang by number of Tweets in the dataset in descending order; specific counts can be obtained by looking at the shared dataset. (A sketch of how this ranking could be produced follows the list.)

user_lang
en
en-gb
fr
de
nl
es
it
ja
ru
pt
ar
sv
pl
tr
da
ca
fi
id
ko
th
el
cs
no
en-IN
he
zh-cn
hi
uk
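
For reference, a minimal sketch of producing the ranking above from the archive's user_lang column (the file name is an assumption):

```python
# Rank user_lang values by number of Tweets, in descending order.
import pandas as pd

tweets = pd.read_csv("bbcdebate_archive.csv")
print(tweets["user_lang"].value_counts())   # en, en-gb, fr, de, ...
```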

If you are interested in user_lang, GET help/languages returns the list of languages supported by Twitter along with their language codes. At the time of writing the language code may be formatted as ISO 639-1 alpha-2 (en), ISO 639-3 alpha-3 (msa), or ISO 639-1 alpha-2 combined with an ISO 3166-1 alpha-2 localisation (zh-tw).

It is interesting to note the variety of European user_lang selected by those tweeting about #BBCDebate.
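
If the localised variants get in the way of such comparisons, a minimal sketch of collapsing the codes to their base language (base_language is a hypothetical helper, not part of the dataset):

```python
# Normalise user_lang codes: en-gb and en-IN collapse to en, zh-cn to zh.
def base_language(code: str) -> str:
    return code.split("-")[0].lower()

for code in ["en", "en-gb", "en-IN", "zh-cn", "pt"]:
    print(code, "->", base_language(code))
```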

Notes on Methodology

The Tweets contained in the Archive sheet were collected using Martin Hawksey’s TAGS 6.0.

Given the relatively large volume of activity expected around #BBCDebate and the public and political nature of the hashtag, I have only shared indicative data. No full tweets nor any other associated metadata have been shared.

The dataset contains a metrics summary as well as a table with column headings labeled created_at, time, geo_coordinates (anonymised; YES indicates data was present, and the cell was left blank otherwise), user_lang and user_followers_count, corresponding to each Tweet.

Timestamps should suffice to prove the existence of the Tweets and could be useful to run analyses of activity on Twitter around a real-time media event.
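
A minimal sketch of one such analysis, binning the anonymised timestamps per minute to chart activity around the live broadcast (the file and column names are assumptions):

```python
# Tweets per minute around the 21 June 2016, 20:00-22:00 BST broadcast.
import pandas as pd

data = pd.read_csv("bbcdebate_anonymised.csv")
times = pd.to_datetime(data["time"], dayfirst=True)

per_minute = times.dt.floor("min").value_counts().sort_index()
debate = per_minute["2016-06-21 20:00":"2016-06-21 22:00"]
print(debate.max(), "Tweets in the busiest minute of the broadcast")
```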

Text analysis of the raw dataset was performed using Stéfan Sinclair’s & Geoffrey Rockwell’s Voyant Tools. I may share results eventually if I find the time.

The collection and analysis of the dataset complies with Twitter’s Developer Rules of the Road.

Some basic deduplication and refining of the collected data was performed.

As in all the previous datasets I have created and shared, it must be taken into account that this is just a sample dataset containing the Tweets published during the indicated period, not a large-scale collection of the whole output. The data is presented as-is, as a research sample and as the result of an archival task. The sample's significance is subject to interpretation.

Again, as in all the previous cases, please note that both research and experience show that the Twitter search API is not 100% reliable. Large Tweet volumes affect the search collection process. The API might "over-represent the more central users", not offering "an accurate picture of peripheral activity" (González-Bailón et al., 2012). Google Spreadsheets limits must also be taken into account. Therefore it cannot be guaranteed that the dataset contains each and every Tweet actually published with the queried Twitter hashtag during the indicated period. [González-Bailón et al. have done very interesting work on political discussions online and their work remains an inspiration.]

Only data from public accounts was included and analysed. The data was obtained from the public Twitter Search API. The analysed data is also publicly available to all Twitter users via the Twitter Search API and available to anyone with an Internet connection via the Twitter and Twitter Search web client and mobile apps without the need of a Twitter account.

Each Tweet and its contents were published openly on the Web; they were explicitly meant for public consumption and distribution and are the responsibility of their original authors. Any copyright belongs to the original authors.

No personally identifiable information (PII) nor sensitive personal information (SPI) was collected or contained in the dataset.

I have shared the dataset including the extra tables as a sample and as an act of citizen scholarship in order to archive, document and encourage open educational and historical research and analysis. It is hoped that by sharing the data someone else might be able to run different analyses and ideally discover different or more significant insights.

For the previous post in this series, click here. If you got all the way here, thank you for reading.

References
[vote_leave]. (2016) [Twitter account]. Retrieved from https://twitter.com/vote_leave. [Accessed 21 June 2016].

González-Bailón, S., Banchs, R.E. and Kaltenbrunner, A. (2012) Emotions, Public Opinion and U.S. Presidential Approval Rates: A 5 Year Analysis of Online Political Discussions. Human Communication Research 38 (2) 121-143.

González-Bailón, S. et al (2012) Assessing the Bias in Communication Networks Sampled from Twitter (December 4, 2012). DOI: http://dx.doi.org/10.2139/ssrn.2185134

Hawksey, M. (2013) What the little birdy tells me: Twitter in education. Published on November 12, 2013. Presentation given from the LSE NetworkED Seminar Series 2013 on the use of Twitter in Education. Available from http://www.slideshare.net/mhawksey/what-the-little-birdy-tells-me-twitter-in-education [Accessed 21 June 2016].

Priego, E. (2016) "Vote Leave". A Dataset of 1,100 Tweets by vote_leave with Archive Summary, Sources and Corpus Terms and Collocates Counts and Trends. figshare. DOI: https://dx.doi.org/10.6084/m9.figshare.3452834.v1

Priego, E. (2016) “Stronger In”. A Dataset of 1,005 Tweets by StrongerIn with Archive Summary, Sources and Corpus Terms and Collocates Counts and Trends. figshare.
https://dx.doi.org/10.6084/m9.figshare.3456617.v1

Priego, E. (2016) “Stronger In”: Looking Into a Sample Archive of 1,005 StrongerIn Tweets. 21 June 2016. Available from https://epriego.wordpress.com/2016/06/21/stronger-in-looking-into-a-sample-archive-of-1005-strongerin-tweets/. [Accessed 21 June 2016].

Priego, E. (2016) “The BBC’s Great Debate”: Anonymised Data from a #BBCDebate Archive. figshare. https://dx.doi.org/10.6084/m9.figshare.3457688.v1

“Stronger In”: Looking Into a Sample Archive of 1,005 StrongerIn Tweets

If you haven’t been there already, please start here. An introduction and a detailed methodological note provide context to this post.

I have now shared a spreadsheet containing an archive of 1,005 @StrongerIn Tweets publicly published by the queried account between 12/06/2016 13:34:35 and 21/06/2016 13:11:34 BST.

The spreadsheet contains four more sheets: a data summary from the archive, a table of Tweets' sources, and tables of corpus term and trend counts and collocate counts.

This will hopefully allow comparison of two similar samples from the output of two homologous Twitter accounts, officially representing the 'Leave' and 'Remain' sides of the UK EU Referendum respectively. The collected period is the same and, if desired, it is possible to edit the sets to have, for example, 1,000 Tweets each.

Following the structure of my previous post on the 'Vote Leave' dataset, here are some quick insights from the @StrongerIn account for comparison.

Archive (from:StrongerIn)

Number of links 735
Number of RTs 409 <-estimate based on occurrence of RT
Number of Tweets 1005
Unique tweets 1004 <-used to monitor quality of archive
First Tweet in Archive 12/06/2016 13:34:35 BST
Last Tweet in Archive 21/06/2016 13:11:34 BST
In Reply Ids 9
In Reply @s 0
Tweet rate (tw/min) 0.1 Tweets/min (from last archive 10mins)

Like the @vote_leave account, @StrongerIn is used mainly for broadcasting Tweets, and no @ replies to users were collected during the period represented in the dataset.

Though this dataset, collected over slightly different timings but covering the same number of days, contains 95 fewer Tweets than the Vote Leave one, it shows the account shared 235 more links than its @vote_leave counterpart.

Sources

Unlike @vote_leave, the dataset does not indicate that @StrongerIn uses Buffer or Twitter for iPhone. Instead, TweetDeck (413) and the Twitter Web Client (591) appear as the main sources. There's even an interestingly strange Tweet, linking to a StrongerIn 404 web page, published from NationBuilder.

Source              Count
NationBuilder       1
TweetDeck           413
Twitter Web Client  591
Total               1,005
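
A minimal sketch of tallying the sources from the archive's source column (the file name is an assumption; depending on the export, the field may be stored as an HTML link and need its markup stripped):

```python
# Count Tweets per publishing application.
import pandas as pd

tweets = pd.read_csv("strongerin_archive.csv")
sources = tweets["source"].str.replace(r"<[^>]+>", "", regex=True)  # strip any HTML
counts = sources.value_counts()
print(counts)                 # e.g. Twitter Web Client, TweetDeck, NationBuilder
print("Total:", counts.sum())
```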

Most Frequent Words

Removing Twitter data-specific stop words from the raw data (e.g. t.co, amp, rt), the 10 most frequent words in the corpus are:

Term        Count  Trend
eu          287    0.013906387
remain      224    0.010853765
bbcqt       216    0.01046613
europe      209    0.01012695
vote        170    0.008237232
strongerin  167    0.00809187
uk          159    0.0077042347
jobs        148    0.0071712374
leave       148    0.0071712374
eudebate    113    0.0054753367

Compare them with the 10 most frequent words in the vote_leave data. Anything interesting?

Let's compare the top 10 terms from each account side by side:

(vote_leave: top 10 terms in 1,100 Tweets over 7 days; StrongerIn: top 10 terms in 1,005 Tweets over 7 days.)

Rank  vote_leave term  Count  StrongerIn term  Count
1     voteleave        558    eu               287
2     eu               402    remain           224
3     bbcqt            398    bbcqt            216
4     gove             165    europe           209
5     takecontrol      146    vote             170
6     immigration      133    strongerin       167
7     control          95     uk               159
8     cameron          89     jobs             148
9     turkey           84     leave            148
10    uk               72     eudebate         113

The terms in red are those appearing in both datasets; the terms in blue correspond to the name of each campaign. It's interesting that though the StrongerIn account has 182 fewer mentions of 'bbcqt' (bear in mind the StrongerIn dataset has 95 fewer Tweets), 'bbcqt' remains in third place in both sets.

The differences between the rankings of mentions of each campaign's name are noticeable, as is the fact that the vote_leave campaign has the name of the Prime Minister (himself a Remain campaigner) in its top 10 (as well as that of Gove, a Leave campaigner), while StrongerIn has no politicians' names in its 10 most frequent words.

There are other potentially interesting or noticeable differences when we compare these two top 10s. Can you spot them?  Do they tell us anything or not?

Digging into data and creating datasets does not necessarily tell us new things, but it does allow us to pinpoint otherwise moving objects. We don’t need to pin butterflies to recognise they are indeed butterflies, but the intention is to create new settings for observation.

References

González-Bailón, S., Banchs, R.E. and Kaltenbrunner, A. (2012) Emotions, Public Opinion and U.S. Presidential Approval Rates: A 5 Year Analysis of Online Political Discussions. Human Communication Research 38 (2) 121-143.

González-Bailón, S. et al (2012) Assessing the Bias in Communication Networks Sampled from Twitter (December 4, 2012). DOI: http://dx.doi.org/10.2139/ssrn.2185134

Hawksey, M. (2013) What the little birdy tells me: Twitter in education. Published on November 12, 2013. Presentation given from the LSE NetworkED Seminar Series 2013 on the use of Twitter in Education. Available from http://www.slideshare.net/mhawksey/what-the-little-birdy-tells-me-twitter-in-education [Accessed 21 June 2016].

Priego, E. (2016) “Vote Leave” Looking Into a Sample Archive of 1,100 vote_leave Tweets. 21 June 2016. Available from https://epriego.wordpress.com/2016/06/21/vote-leave-looking-into-a-sample-archive-of-1100-vote_leave-tweets/. [Accessed 21 June 2016].

Priego, E. (2016) "Vote Leave". A Dataset of 1,100 Tweets by vote_leave with Archive Summary, Sources and Corpus Terms and Collocates Counts and Trends. figshare. DOI: https://dx.doi.org/10.6084/m9.figshare.3452834.v1

Priego, E. (2016) “Stronger In” A Dataset of 1,005 Tweets by StrongerIn with Archive Summary, Sources and Corpus Terms and Collocates Counts and Trends. figshare. DOI:
https://dx.doi.org/10.6084/m9.figshare.3456617.v1

[StrongerIn]. (2016). [Twitter account].Retrieved from https://twitter.com/StrongerIn. [Accessed 21 June 2016].

[vote_leave]. (2016) [Twitter account]. Retrieved from https://twitter.com/vote_leave. [Accessed 21 June 2016].

“Vote Leave”: Looking Into a Sample Archive of 1,100 vote_leave Tweets

In two days the United Kingdom will be voting in a Referendum that is very likely to change its destiny. More importantly, it is likely to change the destiny of everyone else who has a relationship with the UK.

This is a political event that is not only of national, internal or local interest, but one that is likely to have direct and immediate repercussions well beyond the UK's borders. If one has lived in one of the EU member countries recently, one does not need to be a political scientist to feel that these repercussions will not be of a merely economic nature: already, even before the vote is cast, the UK's social fabric has been undoubtedly transformed and deeply, even tragically, affected.

Needless to say, one of the arenas where political activity is taking place is the media (TV, radio) and social media. As the date to vote in person approaches, I have collected and shared a dataset of Tweets published by the official Leave campaign Twitter account, @vote_leave, between 12/06/2016 09:06:22 and 21/06/2016 09:29:29 BST. The dataset contains 1,100 Tweets.

I did a quick text analysis of the Tweets themselves to get a quick insight into the most frequent terms and collocates in the corpus, and also looked at the tweets’ sources (the services used to publish the Tweets, i.e. the Twitter Web Client, Buffer, the Twitter iPhone app).

Some quick insights from the data:

Archive Summary (from:vote_leave)

Number of links 500
Number of RTs 592 <-estimate based on occurrence of RT
Number of Tweets 1100
Unique tweets 1099 <-used to monitor quality of archive
First Tweet in Archive 12/06/2016 09:06:22 BST
Last Tweet in Archive 21/06/2016 09:29:29 BST
Tweet rate (tw/min) 0.1 Tweets/min (from last archive 10mins)
In Reply Ids 3
In Reply @s 2
@s 90
RTs 54%

It is interesting that the account mostly broadcasts and RTs Tweets, interacting only minimally with other users via reply @s, at least according to this sample dataset. (A larger dataset could corroborate whether that is a trend indicating a media/content strategy.)

Sources

The data indicates that most Tweets are published from the Twitter Web Client (496!), which I would have thought any marketing professional would find clunky if not really unfit for purpose.

Not surprisingly, however, Buffer is used (411 buffered Tweets), which indicates the Tweets are likely to have been scheduled in advance. Surprisingly for me, most of the Tweets in the dataset did not have TweetDeck as a source (only 4 according to the collected data in the given period), but it is possible that TweetDeck was used to 'buffer' the Tweets, as TweetDeck allows for Buffer integration.

Twitter for iPhone emerges as a significant source, well above TweetDeck. Personally, I find the picture of such important political campaigning being done from a mobile phone kind of scary. Influencing a nation's destiny from the train home after the pub!

Source              Count
TweetDeck           4
Buffer              411
Twitter for iPhone  189
Twitter Web Client  496
Total               1100

Most Frequent Words

I was not surprised to see that ‘immigration’ was one of the most frequent words appearing in the corpus. However it was interesting to see the centrality of the hashtag ‘bbcqt’ (BBC Question Time). Even if we take into account the specific context of the data’s time period, the prevalence of bbcqt as a term in the corpus could be potentially interpreted as an indication of the importance that television, and specifically the BBC, has had in defining voting trends and public discourse regarding the Referendum.

Removing Twitter data-specific stop words from the raw data (e.g. t.co, amp, rt), the 10 most frequent words in the corpus are:

Term         Count  Trend
voteleave    558    0.026160337
eu           402    0.018846694
bbcqt        398    0.018659165
gove         165    0.0077355835
takecontrol  146    0.0068448195
immigration  133    0.0062353495
control      95     0.004453821
cameron      89     0.0041725268
turkey       84     0.003938115
uk           72     0.0033755274

(voteleave, bbcqt, takecontrol were hashtags).
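
The Trend column appears to correspond to each term's relative frequency, i.e. its count divided by the total number of words in the corpus; a minimal arithmetic check under that assumption:

```python
# If trend = count / total_words, the corpus size falls out directly.
count = 558            # 'voteleave'
trend = 0.026160337

print(round(count / trend))   # ~21,330 total words in the corpus
```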

It is not clear how much of a social media/content strategy might be behind a Twitter account like @vote_leave, nor how many account managers are behind the tweetage. Apart from the obvious prevalence of 'immigration' as a term, it is nevertheless interesting to see that in 8 days of Tweets in the final countdown to the Referendum there was a clear interest in tapping into televised debate and influence (bbcqt), to the point that the term got such a high ranking. Bear in mind that 'voteleave' is the campaign's standard hashtag, and that 'eu' would be expected to be a very frequent word, to the point that it could be considered a stop word in the specific context of this corpus. Perhaps, for all the emphasis on social media as an autonomous medium, it is still traditional mainstream media, in this case the BBC, that has the greatest influence on public opinion?

Notes on Methodology

The Tweets contained in the Archive sheet were collected using Martin Hawksey’s TAGS 6.0.

The text analysis was performed using Stéfan Sinclair’s & Geoffrey Rockwell’s Voyant Tools.

The collection and analysis of the dataset complies with Twitter’s Developer Rules of the Road.

The data was collected as an Excel spreadsheet file containing an archive of 1,100 @vote_leave Tweets publicly published by the queried account between 12/06/2016 09:06:22 – 21/06/2016 09:29:29 BST.

I prepared a spreadsheet and added four more sheets: a data summary from the archive, a table of Tweets' sources, and tables of corpus term and trend counts and collocate counts.

It must be taken into account that this is just a sample dataset containing the Tweets published during the indicated period, not a large-scale collection of the whole output. The data is presented as-is, as a research sample and as the result of an archival task. The sample's significance is subject to interpretation.

Please note that both research and experience show that the Twitter search API is not 100% reliable. Large Tweet volumes affect the search collection process. The API might "over-represent the more central users", not offering "an accurate picture of peripheral activity" (González-Bailón et al., 2012). Therefore it cannot be guaranteed that the dataset contains each and every Tweet actually published by the queried Twitter account during the indicated period. [González-Bailón et al. have done very interesting work on political discussions online and their work remains an inspiration.]

Only content from public accounts was included and analysed. The data was obtained from the public Twitter Search API. The analysed data is also publicly available to all Twitter users via the Twitter Search API and available to anyone with an Internet connection via the Twitter and Twitter Search web client and mobile apps without the need of a Twitter account.

Each Tweet and its contents were published openly on the Web; they were explicitly meant for public consumption and distribution and are the responsibility of their original authors. Any copyright belongs to the original authors.

No personally identifiable information (PII) nor sensitive personal information (SPI) was collected or contained in the dataset.

I have shared the dataset including the extra tables as a sample and as an act of citizen scholarship in order to archive, document and encourage open educational and historical research and analysis. It is hoped that by sharing the data someone else might be able to run different analyses and ideally discover different or more significant insights.

For the next post in this series, click here.

References
[vote_leave]. (2016) [Twitter account]. Retrieved from https://twitter.com/vote_leave. [Accessed 21 June 2016].

González-Bailón, S., Banchs, R.E. and Kaltenbrunner, A. (2012) Emotions, Public Opinion and U.S. Presidential Approval Rates: A 5 Year Analysis of Online Political Discussions. Human Communication Research 38 (2) 121-143.

González-Bailón, S. et al (2012) Assessing the Bias in Communication Networks Sampled from Twitter (December 4, 2012). DOI: http://dx.doi.org/10.2139/ssrn.2185134

Hawksey, M. (2013) What the little birdy tells me: Twitter in education. Published on November 12, 2013. Presentation given from the LSE NetworkED Seminar Series 2013 on the use of Twitter in Education. Available from http://www.slideshare.net/mhawksey/what-the-little-birdy-tells-me-twitter-in-education [Accessed 21 June 2016].

Priego, E. (2016) "Vote Leave". A Dataset of 1,100 Tweets by vote_leave with Archive Summary, Sources and Corpus Terms and Collocates Counts and Trends. figshare. DOI: https://dx.doi.org/10.6084/m9.figshare.3452834.v1

Priego, E. (2016) “Stronger In”. A Dataset of 1,005 Tweets by StrongerIn with Archive Summary, Sources and Corpus Terms and Collocates Counts and Trends. figshare.
https://dx.doi.org/10.6084/m9.figshare.3456617.v1

Priego, E. (2016) “Stronger In”: Looking Into a Sample Archive of 1,005 StrongerIn Tweets. 21 June 2016. Available from https://epriego.wordpress.com/2016/06/21/stronger-in-looking-into-a-sample-archive-of-1005-strongerin-tweets/. [Accessed 21 June 2016].

Notes on Content Overload: Whose Filter Failure?

[I have updated this post after some revisions].

Cartoon by Mark Anderson © Mark Anderson. All Rights Reserved.

“Here’s what the Internet did: it introduced, for the first time, post-Gutenberg economics. The cost of producing anything by anyone has fallen through the floor. And so there’s no economic logic that says that you have to filter for quality before you publish… The filter for quality is now way downstream of the site of production.

What we’re dealing with now is not the problem of information overload, because we’re always dealing (and always have been dealing) with information overload… Thinking about information overload isn’t accurately describing the problem; thinking about filter failure is.”

-Clay Shirky, “It’s not information overload. It’s filter failure“, Web 2.0 Expo, New York, Thursday, 09/18/2008

'Content' is not what it used to be. I started blogging around 1999. I was an earlyish adopter of MySpace and then Facebook and Tumblr, but I did not get a Twitter account until 2008 and didn't start a personal Twitter account until 2009. It seems unnecessary to say, but since then blogging has been significantly superseded by social media, and user-generated content is now the default in today's mediascape. Boy, do I sound 'old'.

Times have changed significantly. We no longer need to advocate (at least not in the same way) for the need to promote and/or disseminate information online.  The relative popularity of a platform like Medium seems to demonstrate the nearly-total blurring between web publishing and social media, at least for long and, er, medium-length forms. But we don’t need to look at the most sophisticated online publishing examples to get the feeling that, if you are, say, on Twitter, everyone is now pushing content. It’s not just a buzzword and I’m not saying anything new: the multiplication of user accounts means the customisation of personal profiles which turns all users, even the least experienced and humble ones, into brands producing content as commodities. Your profile picture is your logo, your online persona is the result of a conscious or unconscious public-facing strategy. The products are not just each individual output, but your whole process of being online; the whole ongoing process. It’s outward thinking, an exercise for reaching out, publicly, to others, continuously.

In the 21st century all media means publishing, the making public of packaged information (dear reader, please be kind: I am acutely aware that we still need professional publishers in the publishing industry). All publishing means ‘social’, at least in the sense of necessitating networks (of users, of data), programming interfaces and algorithms to create, maintain and develop those networks. Like commuters in packed rush hour trains, social media users share a common space where time, space and attention are scarce. Social media users become part of the crowd as a unit, as a whole, but the crowd is composed of individuals at odds with each other, often algorithmically thrown in together, and tensions, misunderstandings arise.

If you have ever taken public transport during the morning or evening rush hour, you understand how the laws of capitalism turn space, and yourself, into commodities. Space is scarce (so are seats, table seats, power plugs, air, floorspace). You are time-poor and your time is money. You are unique, in the infinite mass ('the mass is matrix'). The commuting train (thinking of the UK here) is a type of panopticon (be ever vigilant; report anything suspicious). On the one hand it promotes solipsism (headphones, personal devices, reading material), but on the other it requires a constant peripheral awareness. Other bodies are always around you and signals are everywhere. Bodies clash with each other and it takes concentration to avoid it. Two bodies cannot occupy the same space at the same time, and yet the rules of capitalism would like to deconstruct the laws of physics and get us all in there at the same time, knowing it is not possible, creating desire and aspiration for the non-graspable; as an ecosystem it generates its own social classes, hierarchies, winners and losers, satisfaction, tension and frustration. It seems to me this is also the logic of today's social media.

Attention Economy: London Bridge Station, evening rush hour. Pic CC-BY Ernesto Priego

The binding tissue of social media communities is not necessarily commonality but competition (there is commonality, but it is conscious or unconscious competition, for attention, for presence, for space, for recognition, that drives it forward). The social media arena enables the production of content (defined as information purposefully packaged for its dissemination). This is a rhetoric I used to resist, a semantic field favoured by self-styled, opportunistic social media gurus. However, social media can only be fully, ethically theorised in practice, over time, and experience (my experience at least) shows that today’s social media has managed to transform publishing into the condition sine qua non of being online. Even so-called lurkers create content (their accounts as data points and their associated metadata). Like a car parked on a road, the lurker’s social media presence also contributes to pollution, takes up space, pays taxes, alters the configuration of the city, needs to be eventually moved around, might be eventually towed away, stays in the way of things and people, is exposed to environmental conditions, communicates things (class, taste, income), etc.

I write these paragraphs, paradoxically, as a way to frame my recent reasons for resisting being on social media as I used to be. There are other reasons apart from a perceived content overload, but this post was motivated by my experience of witnessing web publishing, and particularly Twitter microblogging, evolve (or devolve) towards pitch-perfect free-market capitalism, where becoming a commodity through the production of content as a commodity is the ontological condition.

It is true that every Twitter user experiences Twitter differently. However, recent changes in the Twitter API (including an aggressive imposition of ‘promoted’ tweets, the inclusion of gif search, allowing all users to see tweets starting with a mention, etc.) mean it is becoming very hard to filter information as before: it no longer suffices to be a good curator, because curation is not fully customisable at an individual user’s level, in terms of what content a user is exposed to and when.

Clay Shirky’s “it’s not information overload. It’s filter failure“ worked well for 2008, and it might still be insightful in 2016 if we redefine whose responsibility it is to filter and think hard about whether it is really possible to filter successfully these days. There is also, I think, a distinction to be made between ‘information’ and ‘content’: one can argue that information can exist independently from its packaging (the way it is disseminated, how it is wrapped with other data, physical or digital). Content is the paradigmatic shape in which information is transformed into a commodity, and content is composed of different bits of information. We no longer search for isolated bits of information, but for data and metadata wrapped in specific languages and interfaces (we don’t just search for a location, we search on Google Maps, expecting to find other locations apart from the one we were looking for, and information about those locations). We then share what we retrieved, which is a whole mini-package of code, with others, expecting them to have access to the same technical affordances (software, hardware, connectivity) that we do.

Google, Facebook, Twitter, Instagram and Snapchat encourage the transformation of anything into shareable content, not just from professional publishing organisations but from absolutely everybody (dear reader, please be kind: I am acutely aware of the digital divide). This is, of course, not new, and once upon a time we used to celebrate the fact that ‘the people formerly known as the audience’ (Rosen 2006) were becoming content producers as well. The levelling of the playing field, etc. I remember 2006 well: online, those were exciting yet innocent times. Ten years later, the attention economy is (or is determined to be) more than ever before the only economy, at least in the developed world. The people formerly known as the audience remain the audience, even if they become audiences by endlessly sharing content, and therefore by distracting each other’s attention. The overlords are still the overlords. Perhaps paradoxically, the only way for the regular user to produce more meaningful content (define ‘meaningful’, etc.) is to spend significant time away from the endless whirlwind of voices sharing content of all types at all times.

There has been discussion of how the “rise of social media content is overwhelming consumers“, but interestingly those reflections take for granted that social media users are de facto consumers, and suggest that it is the digital marketers (professionals employed by commercial entities), not the users, who can do something about it. Users are just the target. But don’t we as users have a responsibility too? Because being online is only possible through the creation of content and a digital footprint (even if one never posts anything, even if one only lurks on sites from a Tor browser), it seems logical that there should be a feeling of content overcrowding. Filtering the content one thinks one needs or expects to discover has become increasingly difficult, and sources are often as likely to post something really useful as something completely inane: it is far easier to filter what one posts before it is posted. Some will say posting inane content is an important requirement for the quality content to eventually get an audience at all, but for the experienced, busy user the proliferation of unfilterable chaff renders the social media experience frustrating and time-consuming. (Often chaff is in the eye of the beholder, but, one would argue, not always.)

Fear of missing out means many of us feel we need to keep an eye on social media to stay mildly aware of what’s happening in our fields and in the world, but the illusion created by what looks like everyone actively broadcasting how hard at work they are (or having fun taking planes to exotic conference destinations) can also have a paralyzing effect. Moreover, this broadcasting of information related to professional activity directly contributes to the larger market itself, promoting competition (and its anxieties). The multiplication of channels disseminating professional activity paradoxically yet successfully reinforces the perception that jobs are scarce, and the convenient delusion that some candidates will just never be good enough.

Social media is about content, and the more users there are, the more content there is. The more content there is, the harder it is to be heard, and to find and discover relevance. The algorithms will make sure you cannot avoid the content they want you to see, no matter how savvy you think you are, in order to ensure network growth and income. Like overcrowded trains in the morning rush hour, platforms such as Twitter are arguably suffering from content overcrowding, even if Twitter itself thinks it is always underachieving in terms of user base. If there is content overload, whose responsibility is it to filter, and is filtering, as we have traditionally defined it, still really possible under the current infrastructures?

If contemporary algorithms are designed to force users to see as much as possible in spite of their filtering efforts, perhaps we will (hopefully) see a growth in user self-filtering: do we as users really need to post all that? Do users find the time to ask themselves that question? Certainly this is something many if not most users already do to a certain extent. Eventually, even if everyone became more selective about what they post, wouldn’t we end up in the same overcrowded place, if the intention is for everyone everywhere to be members of the online social arena, the market in the cloud?*

Or maybe it’s a question of a transformation of our ‘modes of perception’, and even the most sophisticated information retrieval specialists will need to consciously adapt their strategies to market-driven discovery systems. At this stage I personally wonder if the only successful filtering technique would be not to be here/there at all, or at least for considerable periods.

So I’ve been quiet on the blogging front. It took me ages to gather the courage to write this text and finally post it. It goes in various directions, and it might not mean anything to anyone at all but me (deep down my suspicion is someone out there might care). Maggie Nelson writes that

‘most writers I know nurse persistent fantasies about the horrible things -or the horrible thing- that will happen to them if and when they express themselves as they desire’ (The Argonauts, 2015: 114).

For all the social media content overload I increasingly perceive, I paradoxically feel social media is also promoting self-censorship and fear. It also promotes a particular type of writing, specially crafted to maximise sharing. Devising strategies to ensure content is shared in current infrastructures can be a very good thing, as I have said throughout my career, particularly when what is needed is to communicate the value of a certain type of humanities work. But quick sharing also has a counterpart: quick reactions (which, depending on the case, are not always bad!). However, one sees plenty of quick, uncharitable reactions to unread content; unfriendly public attitudes to others’ work; virtual mobbing from people who one thinks would never do the same in a professional context like a conference or a lecture; the immediate, context-poor critique of those who dare to express themselves.

Usually it’s minorities and under-represented users who suffer the most and therefore lose ground in the battle for representation. The widespread adoption of social media in professional contexts has led to self-censorship on social media, even in the lands of the free. When self-filtering becomes self-censorship is a topic that deserves more time and thought. This tension between the need/pressure to disseminate and the need/pressure to remain silent in order to be safe is one of the tensions at the core of this new economy as a way of being with others, a kind of mal d’archive where two opposing forces are at play.

Taking the time to write this and to reflect on the reasons to publish it has made me consider both the ideas and practices that motivated it and the mechanisms and strategies for its eventual dissemination. It may be that the best filter is to take time out altogether, in order to keep perspective. Stepping away from a social media platform such as Twitter may remind us it is not an end in itself, nor a community of communities disconnected from the offline networks that sustain it. Taking this time to reflect may help us to reassess what it is that we really want to get across, when, and to whom. I suggest that this distance is healthy, even if I recognise that taking this route may mean that some people never read the content we do eventually share.


*Another important aspect of this discussion, which I did not mean to cover here, would be online harassment and bullying. Danah Boyd’s work may come in handy in this context.

A #HEFCEmetrics Twitter Archive (Friday 16 January 2015, Warwick)

HEFCE logo

The HEFCE metrics workshop: metrics and the assessment of research quality and impact in the arts and humanities took place on Friday 16 January 2015, 1030 to 1630 GMT at the Scarman Conference Centre, University of Warwick, UK.

I have uploaded a dataset of 821 Tweets tagged with #HEFCEmetrics (case insensitive):

Priego, Ernesto (2015): A #HEFCEmetrics Twitter Archive (Friday 16 January 2015, Warwick). figshare.
http://dx.doi.org/10.6084/m9.figshare.1293612

The Tweets in the dataset were publicly published and tagged with #HEFCEmetrics between 16/01/2015 00:35:08 GMT and 16/01/2015 23:19:33 GMT. The collection period corresponds to the day the workshop took place in real time.

The Tweets contained in the file were collected using Martin Hawksey’s TAGS 6.0. The file contains 2 sheets.

Only users with at least 2 followers were included in the archive. Retweets have been included. An initial automatic deduplication was performed but data might require further deduplication.
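For anyone wishing to reproduce this kind of initial refining outside the spreadsheet, here is a minimal sketch in Python with pandas; the file name and the column names (id_str, user_followers_count) are assumptions based on a typical TAGS export, not a description of the exact workflow used:

```python
import pandas as pd

# Load a TAGS archive saved as CSV (file and column names assumed).
tweets = pd.read_csv("hefcemetrics_tags_export.csv")

# Keep only Tweets from users with at least 2 followers.
tweets = tweets[tweets["user_followers_count"] >= 2]

# Initial automatic deduplication: keep one row per unique Tweet id.
tweets = tweets.drop_duplicates(subset="id_str")

print(len(tweets), "Tweets after filtering and deduplication")
tweets.to_csv("hefcemetrics_refined.csv", index=False)
```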

Please note the data in this file is likely to require further refining and even deduplication. The data is shared as is. The contents of each Tweet are the responsibility of their original authors. This dataset is shared to encourage open research into scholarly activity on Twitter. If you use or refer to this data in any way, please cite and link back using the citation information above.

For the #HEFCEmetrics Twitter archive corresponding to the one-day workshop hosted by the University of Sussex on Tuesday 7 October 2014, please go to

Priego, Ernesto (2014): A #HEFCEmetrics Twitter Archive. figshare.
http://dx.doi.org/10.6084/m9.figshare.1196029

You might also be interested in

Priego, Ernesto (2014): The Twelve Days of REF- A #REF2014 Archive. figshare.
http://dx.doi.org/10.6084/m9.figshare.1275949

#MLA15 Twitter Archive, 8-11 January 2015

130th MLA Annual Convention Vancouver, 8–11 January 2015

#MLA15 is the hashtag which corresponded to the 2015 Modern Language Association Annual Convention. The Convention was held in Vancouver from Thursday 8 to Sunday 11 January 2015.

We have uploaded a dataset as a .xlsx file including data from Tweets publicly published with #mla15:

Priego, Ernesto; Zarate, Chris (2015): #MLA15 Twitter Archive, 8-11 January 2015. figshare.
http://dx.doi.org/10.6084/m9.figshare.1293600

The dataset includes Tweets posted during the actual convention with #mla15: the set starts with a Tweet from Thursday 08/01/2015 00:02:53 Pacific Time and ends with a Tweet from Sunday 11/01/2015 23:59:58 Pacific Time.
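As a rough illustration of how such a convention-days window can be enforced programmatically, here is a sketch with pandas; the file name, column name and timestamp format are assumptions (Twitter’s classic created_at format), not a record of the actual procedure:

```python
import pandas as pd

# Load the archive (file and column names are assumptions).
tweets = pd.read_csv("mla15_archive.csv")

# Parse Twitter's classic created_at format, e.g.
# "Thu Jan 08 00:02:53 +0000 2015", then convert to Pacific Time.
tweets["created_at"] = pd.to_datetime(
    tweets["created_at"], format="%a %b %d %H:%M:%S %z %Y"
).dt.tz_convert("America/Vancouver")

# Keep only Tweets posted during the convention itself.
start = pd.Timestamp("2015-01-08 00:00:00", tz="America/Vancouver")
end = pd.Timestamp("2015-01-11 23:59:59", tz="America/Vancouver")
convention_tweets = tweets[tweets["created_at"].between(start, end)]
```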

The dataset contains a total of 23,609 Tweets. Only Tweets from users with at least two followers were collected.

A combination of Twitter Archiving Google Spreadsheets (Martin Hawksey’s TAGS 6.0, available at https://tags.hawksey.info/) was used to harvest this collection. OpenRefine (http://openrefine.org/) was used to deduplicate the data.

Please note the data in the file is likely to require further refining and even deduplication. The data is shared as is. The dataset is shared to encourage open research into scholarly activity on Twitter. If you use or refer to this data in any way please cite and link back using the citation information above.

For the #MLA14 datasets, please go to
Priego, Ernesto; Zarate, Chris (2014): #MLA14 Twitter Archive, 9-12 January 2014. figshare.
http://dx.doi.org/10.6084/m9.figshare.924801

The Twelve Days of REF: A #REF2014 Archive

Cirrus word cloud visualisation of a corpus of 23,791 #REF2014 Tweets

I have uploaded a new dataset to figshare:

Priego, Ernesto (2014): The Twelve Days of REF- A #REF2014 Archive. figshare.

http://dx.doi.org/10.6084/m9.figshare.1275949

The file contains 31,855 unique Tweets (an approximate count, pending further deduplication) published publicly and tagged with #REF2014 during a 12-day period between 08/12/2014 11:18 and 20/12/2014 10:13 GMT.

For some context and an initial partial analysis, please see my previous blog post from 18 December 2014.

As always, this dataset is shared to encourage open research into scholarly activity on Twitter. If you use or refer to this data in any way please cite and link back using the citation information above.

Happy Christmas everybody.

The REF According to Twitter: A #REF2014 Update (18/12/14 16:28 GMT)

As everyone in any way aware of UK higher education knows, the results of REF 2014 were announced in the first minute of 18 December 2014. Two main hashtags have been used to refer to it on Twitter: #REF and the more popular (“official”?) #REF2014.

There have been, of course, other variations of these hashtags, including discussion about whether to ‘hash’ the term REF at all. Here I share a quick first look at a sample corpus of texts from Tweets publicly tagged with #REF2014.

This is just a quick update of a work in progress. No qualitative conclusions are offered, and the quantitative data shared and analysed is provisional. Complete data sets will be published openly once the collection has been completed and the data has been further refined.

The Numbers

I looked at a sample corpus of 23,791 #REF2014 Tweets published by 10,654 unique users between 08/12/2014 11:18 GMT and 18/12/2014 16:32 GMT.

  • The sample corpus only included Tweets from users with a minimum of two followers.
  • The sample corpus consists of 1 document with a total of 454,425 words and 16,968 unique words.
  • The range of Tweets per user varied between 1 and 70, with the average being 2.3 Tweets per user.
  • Only 8 of the 10,654 unique users in the corpus published between 50 and 80 Tweets; 30 users published more than 30 Tweets, while 9,473 users published between 1 and 5 Tweets only.
  • 6,585 users in the corpus published one Tweet only (a sketch of how these per-user figures can be computed follows below).
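As flagged in the list above, these per-user figures can be derived in a few lines of pandas. A minimal sketch, assuming the sample corpus is a CSV with one row per Tweet and a from_user column (both assumptions):

```python
import pandas as pd

tweets = pd.read_csv("ref2014_sample.csv")  # assumed file name

# Number of Tweets published by each unique user.
per_user = tweets["from_user"].value_counts()

print("Unique users:", per_user.size)
print("Tweets per user ranged from", per_user.min(), "to", per_user.max())
print("Average Tweets per user:", round(len(tweets) / per_user.size, 1))
print("Users publishing 1-5 Tweets:", per_user.between(1, 5).sum())
print("Users publishing one Tweet only:", (per_user == 1).sum())
```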

A Quick Text Analysis

Voyant Tools was used to analyse the corpus of 23,791 Tweet texts. A customised English stop-words list was applied globally. The most frequent word was “research”, repeated 8,760 times in the corpus; it was subsequently added to the stop-word list (as was, logically, #REF2014), so neither appears in the Top 50 below.
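Voyant performs the counting in the browser, but the underlying step is simple to reproduce. A rough equivalent in Python, with a toy stop-word list standing in for the customised (and much longer) one actually used:

```python
import re
from collections import Counter

# Toy stand-in for the customised English stop-words list;
# note that 'research' and the hashtag itself were stop-listed.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
              "for", "on", "at", "rt", "research", "#ref2014"}

def most_frequent_terms(tweet_texts, n=50):
    """Count terms across a corpus of Tweet texts, minus stop words."""
    counts = Counter()
    for text in tweet_texts:
        # Lowercase, then keep word-like tokens, @mentions and #hashtags.
        tokens = re.findall(r"[@#]?\w[\w'-]*", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

# Usage: most_frequent_terms(corpus) returns pairs such as ('uk', 4605).
```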

A word cloud of the whole corpus using the Voyant Cirrus tool looked like this:

Cirrus word cloud visualisation of a corpus of 23,791 #REF2014 Tweets

#REF2014 Top 50 Most Frequent Words So Far

Word Count
uk 4605
results 4558
top 2784
impact 2091
university 1940
@timeshighered 1790
ranked 1777
world-leading 1314
excellence 1302
universities 1067
world 1040
quality 1012
internationally 933
excellent 931
overall 910
great 827
staff 827
academics 811
proud 794
congratulations 690
rated 690
power 666
@cardiffuni 653
oxford 645
leading 641
best 629
news 616
education 567
5th 561
@gdnhighered 556
@phil_baty 548
ucl 546
number 545
law 544
today 536
table 513
analysis 486
work 482
higher 470
uni 460
result 453
time 447
day 446
cambridge 430
just 428
@ref2014official 427
group 422
science 421
big 420
delighted 410

Limitations

The map is not the territory. Please note that both research and experience show that the Twitter search API isn’t 100% reliable. Large Tweet volumes affect the search collection process. The API might “over-represent the more central users”, not offering “an accurate picture of peripheral activity” (González-Bailón et al. 2012). It is not guaranteed that this file contains each and every Tweet tagged with the archived hashtag during the indicated period. Further deduplication of the dataset will be required to validate this initial look at the data; it is shared now merely as an update of a work in progress.

References

González-Bailón, Sandra; Wang, Ning; Rivero, Alejandro; Borge-Holthoefer, Javier and Moreno, Yamir, “Assessing the Bias in Samples of Large Online Networks” (December 4, 2012). Forthcoming in Social Networks. Available at SSRN: http://ssrn.com/abstract=2185134 or http://dx.doi.org/10.2139/ssrn.2185134