‘BBCDebate’ on Twitter: A First Look into an Archive of #BBCDebate Tweets

[For the previous post in this series, click here].

The BBC Debate

The BBC’s “Great Debate” was broadcast live in the UK by the BBC on Tuesday 21 June 2016 between 20:00 and 22:00 BST. It generated considerable activity on Twitter under the #BBCDebate hashtag.

I collected some of the Tweets tagged with #BBCDebate using a Google Spreadsheet (see the Notes on Methodology section below). I have shared an anonymised dataset on figshare:

Priego, E. (2016) “The BBC’s Great Debate”: Anonymised Data from a #BBCDebate Archive. figshare. https://dx.doi.org/10.6084/m9.figshare.3457688.v1

[Note: figshare DOIs are currently not resolving, or resolving with delays; this should be fixed soon.]

Archive Summary (#BBCDebate)

Number of links 16826
Number of RTs 32206 <-estimate based on occurrence of RT
Number of Tweets 38116
Unique tweets 38066 <-used to monitor quality of archive
First Tweet in Archive 14/06/2016 22:03:18 BST
Last Tweet in Archive 22/06/2016 09:12:32 BST
In Reply Ids 349
In Reply @s 456
Tweet rate (tw/min) 62 Tweets/min (from last archive 10mins)
Unique Users in archive 20,243
Tweets from StrongerIn in archive 16
Tweets from vote_leave in archive 15
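The “Number of RTs” figure above is an estimate based on the occurrence of “RT” in the Tweet text. A minimal Python sketch of that heuristic, using made-up sample Tweets rather than Tweets from the archive:

```python
# Estimate retweet counts by looking for the conventional "RT @" prefix.
# This is only a heuristic: quoted Tweets and manual retweets written
# differently will be missed or miscounted.
def count_rts(tweets):
    return sum(1 for t in tweets if t.strip().lower().startswith("rt @"))

sample = [
    "RT @StrongerIn: Britain is stronger in Europe #BBCDebate",
    "Watching the #BBCDebate live",
    "rt @vote_leave: Take back control #BBCDebate",
]
print(count_rts(sample))  # 2 of the 3 sample Tweets match the RT pattern
```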

The raw data was downloaded as an Excel spreadsheet file containing 38,116 Tweets (38,066 unique Tweets) publicly published with the queried hashtag (#BBCDebate) between 14/06/2016 22:03:18 and 22/06/2016 09:12:32 BST.

Due to the expected high volume of Tweets, only Tweets from users with at least 10 followers were included in the archive.
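A filter of this kind can be sketched in Python; the example rows are hypothetical, and the field name follows the dataset’s user_followers_count column heading:

```python
# Keep only rows whose author has at least 10 followers, mirroring the
# follower-count threshold applied at collection time.
MIN_FOLLOWERS = 10

def keep_row(row):
    return row.get("user_followers_count", 0) >= MIN_FOLLOWERS

rows = [
    {"text": "#BBCDebate tweet A", "user_followers_count": 3},
    {"text": "#BBCDebate tweet B", "user_followers_count": 250},
]
filtered = [r for r in rows if keep_row(r)]
print(len(filtered))  # only the account with 250 followers survives
```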

As indicated above, the BBC debate was broadcast live on UK national television on Tuesday 21 June 2016 between 20:00 and 22:00 BST. This means the data collection covered the live debate as it was broadcast in real time (see the chart below).

#BBCDebate Activity in the last 3 days. Key: blue: Tweet; red: Reply

The collected data indicated that only 12 Tweets in the whole archive contained geolocation data. A variety of user languages (user_lang) were identified.

Number of Different User Languages (user_lang)

Note this is not the language of a Tweet’s text, but the language setting in the application used to post the Tweet. In other words, user_lang indicates the language the Twitter user selected from the drop-down list on their Twitter Settings page. This metadata is an indication of a user’s primary language, but it can be misleading: a user might select ‘es’ (Spanish) as their preferred language yet compose their Tweets in English.

The following list ranks user_lang by number of Tweets in the dataset, in descending order. Specific counts can be obtained from the shared dataset.

user_lang
en
en-gb
fr
de
nl
es
it
ja
ru
pt
ar
sv
pl
tr
da
ca
fi
id
ko
th
el
cs
no
en-IN
he
zh-cn
hi
uk
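A ranking like the list above can be produced with a simple counter over the user_lang column; the codes and counts below are illustrative, not the dataset’s actual figures:

```python
from collections import Counter

# Rank user_lang values by number of Tweets, in descending order,
# as in the list above. The input values here are invented.
langs = ["en", "en-gb", "en", "fr", "en", "en-gb", "de"]
ranking = Counter(langs).most_common()
print(ranking)  # [('en', 3), ('en-gb', 2), ('fr', 1), ('de', 1)]
```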

If you are interested in user_lang, GET help/languages returns the list of languages supported by Twitter along with their language codes. At the time of writing a language code may be formatted as ISO 639-1 alpha-2 (en), ISO 639-3 alpha-3 (msa), or ISO 639-1 alpha-2 combined with an ISO 3166-1 alpha-2 localization (zh-tw).
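Purely as an illustration of those three code shapes (not of Twitter’s actual validation), a small structural classifier:

```python
import re

# Classify a language code by shape: two-letter ISO 639-1 ("en"),
# three-letter ISO 639-3 ("msa"), or an ISO 639-1 code with an
# ISO 3166-1 region suffix ("zh-tw", "en-IN"). This checks form only;
# it does not verify the code against Twitter's supported list.
def code_shape(code):
    if re.fullmatch(r"[a-z]{2}", code):
        return "alpha-2"
    if re.fullmatch(r"[a-z]{3}", code):
        return "alpha-3"
    if re.fullmatch(r"[a-z]{2}-[a-zA-Z]{2}", code):
        return "alpha-2 + region"
    return "unknown"

print(code_shape("en"), code_shape("msa"), code_shape("zh-tw"))
```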

It is interesting to note the variety of European user_lang settings selected by those tweeting about #BBCDebate.

Notes on Methodology

The Tweets contained in the Archive sheet were collected using Martin Hawksey’s TAGS 6.0.

Given the relatively large volume of activity expected around #BBCDebate and the public and political nature of the hashtag, I have only shared indicative data. No full Tweets or any other associated metadata have been shared.

The dataset contains a metrics summary as well as a table with column headings labelled created_at, time, geo_coordinates (anonymised: YES was indicated where data was present; the cell was left blank otherwise), user_lang and user_followers_count, corresponding to each Tweet.
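The geo_coordinates anonymisation described above (YES for any value present, blank otherwise) can be sketched as:

```python
# Anonymise the geo_coordinates column: any non-empty value becomes
# "YES", empty (or whitespace-only) cells stay blank. The coordinate
# values here are invented examples.
def anonymise_geo(value):
    return "YES" if value and value.strip() else ""

cells = ["51.5074,-0.1278", "", "  ", "48.8566,2.3522"]
print([anonymise_geo(c) for c in cells])  # ['YES', '', '', 'YES']
```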

Timestamps should suffice to prove the existence of the Tweets, and could be used to analyse activity on Twitter around a real-time media event.
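For example, binning Tweets per minute from the timestamps is enough to chart activity around a broadcast; the timestamps below are illustrative and follow the archive’s DD/MM/YYYY HH:MM:SS format:

```python
from collections import Counter
from datetime import datetime

# Count Tweets per minute from their timestamps, the basis for an
# activity chart around a live broadcast.
stamps = [
    "21/06/2016 20:00:05",
    "21/06/2016 20:00:47",
    "21/06/2016 20:01:12",
]
per_minute = Counter(
    datetime.strptime(s, "%d/%m/%Y %H:%M:%S").strftime("%H:%M") for s in stamps
)
print(per_minute["20:00"])  # 2 Tweets fall in the 20:00 minute
```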

Text analysis of the raw dataset was performed using Stéfan Sinclair’s & Geoffrey Rockwell’s Voyant Tools. I may share results eventually if I find the time.

The collection and analysis of the dataset complies with Twitter’s Developer Rules of the Road.

Some basic deduplication and refining of the collected data was performed.
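A minimal sketch of that deduplication, keeping the first occurrence of each Tweet id (id_str is the usual Twitter identifier field; the rows are illustrative):

```python
# Drop rows whose Tweet id has already been seen, keeping the first
# occurrence, so "Unique tweets" can be compared against the raw count.
def dedupe(rows):
    seen, unique = set(), []
    for row in rows:
        if row["id_str"] not in seen:
            seen.add(row["id_str"])
            unique.append(row)
    return unique

rows = [{"id_str": "1"}, {"id_str": "2"}, {"id_str": "1"}]
print(len(dedupe(rows)))  # 3 rows reduce to 2 unique Tweets
```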

As with all the previous datasets I have created and shared, it must be taken into account that this is just a sample dataset containing the Tweets published during the indicated period, not a large-scale collection of the whole output. The data is presented as is, as a research sample and as the result of an archival task. The sample’s significance is subject to interpretation.

Again, as in all the previous cases, please note that both research and experience show that the Twitter Search API is not 100% reliable. Large Tweet volumes affect the search collection process. The API might “over-represent the more central users”, not offering “an accurate picture of peripheral activity” (González-Bailón et al. 2012). Google Spreadsheets limits must also be taken into account. Therefore it cannot be guaranteed that the dataset contains each and every Tweet actually published with the queried hashtag during the indicated period. [González-Bailón et al have done very interesting work on political discussions online, and their work remains an inspiration.]

Only data from public accounts was included and analysed. The data was obtained from the public Twitter Search API. The analysed data is also publicly available to all Twitter users via the Twitter Search API, and available to anyone with an Internet connection via the Twitter and Twitter Search web clients and mobile apps, without the need for a Twitter account.

Each Tweet and its contents were published openly on the Web; they were explicitly meant for public consumption and distribution, and are the responsibility of their original authors. Any copyright remains with the original authors.

No personally identifiable information (PII) nor sensitive personal information (SPI) was collected or contained in the dataset.

I have shared the dataset including the extra tables as a sample and as an act of citizen scholarship in order to archive, document and encourage open educational and historical research and analysis. It is hoped that by sharing the data someone else might be able to run different analyses and ideally discover different or more significant insights.

For the previous post in this series, click here. If you got all the way here, thank you for reading.

References
[vote_leave]. (2016) [Twitter account]. Retrieved from https://twitter.com/vote_leave. [Accessed 21 June 2016].

González-Bailón, S., Banchs, R.E. and Kaltenbrunner, A. (2012) Emotions, Public Opinion and U.S. Presidential Approval Rates: A 5 Year Analysis of Online Political Discussions. Human Communication Research 38 (2) 121-143.

González-Bailón, S. et al (2012) Assessing the Bias in Communication Networks Sampled from Twitter (December 4, 2012). DOI: http://dx.doi.org/10.2139/ssrn.2185134

Hawksey, M. (2013) What the little birdy tells me: Twitter in education. Published on November 12, 2013. Presentation given at the LSE NetworkED Seminar Series 2013 on the use of Twitter in Education. Available from http://www.slideshare.net/mhawksey/what-the-little-birdy-tells-me-twitter-in-education [Accessed 21 June 2016].

Priego, E. (2016) “Vote Leave”. A Dataset of 1,100 Tweets by vote_leave with Archive Summary, Sources and Corpus Terms and Collocates Counts and Trends. figshare. DOI: https://dx.doi.org/10.6084/m9.figshare.3452834.v1

Priego, E. (2016) “Stronger In”. A Dataset of 1,005 Tweets by StrongerIn with Archive Summary, Sources and Corpus Terms and Collocates Counts and Trends. figshare. DOI: https://dx.doi.org/10.6084/m9.figshare.3456617.v1

Priego, E. (2016) “Stronger In”: Looking Into a Sample Archive of 1,005 StrongerIn Tweets. 21 June 2016. Available from https://epriego.wordpress.com/2016/06/21/stronger-in-looking-into-a-sample-archive-of-1005-strongerin-tweets/. [Accessed 21 June 2016].

Priego, E. (2016) “The BBC’s Great Debate”: Anonymised Data from a #BBCDebate Archive. figshare. https://dx.doi.org/10.6084/m9.figshare.3457688.v1

“Stronger In”: Looking Into a Sample Archive of 1,005 StrongerIn Tweets

If you haven’t been there already, please start here. An introduction and a detailed methodological note provide context to this post.

I have now shared a spreadsheet containing an archive of 1,005 @StrongerIn Tweets publicly published by the queried account between 12/06/2016 13:34:35 and 21/06/2016 13:11:34 BST.

The spreadsheet contains four further sheets: a data summary from the archive, a table of Tweet sources, and tables of corpus term and trend counts and collocate counts.

This will hopefully allow comparison of two similar samples from the output of two homologous Twitter accounts, officially representing the ‘Leave’ and ‘Remain’ sides of the UK EU Referendum respectively. The collection period is the same, and if desired the sets can be edited to contain, for example, 1,000 Tweets each.

Following the structure of my previous post on the ‘Vote Leave’ dataset, here are some quick insights from the @StrongerIn account for comparison.

Archive (from:StrongerIn)

Number of links 735
Number of RTs 409 <-estimate based on occurrence of RT
Number of Tweets 1005
Unique tweets 1004 <-used to monitor quality of archive
First Tweet in Archive 12/06/2016 13:34:35 BST
Last Tweet in Archive 21/06/2016 13:11:34 BST
In Reply Ids 9
In Reply @s 0
Tweet rate (tw/min) 0.1 Tweets/min (from last archive 10mins)

Like the @vote_leave account, @StrongerIn is used mainly for broadcasting Tweets; no @ Replies to users were collected during the period represented in the dataset.

Though this dataset, collected over slightly different timings but covering the same number of days, contains 95 fewer Tweets than the Vote Leave one, it indicates that the @StrongerIn account shared 235 more links than its @vote_leave counterpart.

Sources

Unlike @vote_leave, the dataset does not indicate that @StrongerIn used Buffer or Twitter for iPhone. Instead, TweetDeck (413) and the Twitter Web Client (591) appear as the main sources. There is even an interestingly strange Tweet, linking to a StrongerIn 404 web page, published from NationBuilder.

Source Count
NationBuilder 1
TweetDeck 413
Twitter Web Client 591
Total 1,005

Most Frequent Words

Removing Twitter-specific stopwords (e.g. t.co, amp, rt) from the raw data, the 10 most frequent words in the corpus are:

Term Count Trend
eu 287 0.013906387
remain 224 0.010853765
bbcqt 216 0.01046613
europe 209 0.01012695
vote 170 0.008237232
strongerin 167 0.00809187
uk 159 0.0077042347
jobs 148 0.0071712374
leave 148 0.0071712374
eudebate 113 0.0054753367

Compare them with the 10 most frequent words in the vote_leave data. Anything interesting?

Let’s compare the top 10 terms from each account side by side:

Top 10 Terms in 1,100 vote_leave Tweets over 7 days vote_leave count Top 10 Terms in 1,005 StrongerIn Tweets over 7 days StrongerIn count
voteleave 558 eu 287
eu 402 remain 224
bbcqt 398 bbcqt 216
gove 165 europe 209
takecontrol 146 vote 170
immigration 133 strongerin 167
control 95 uk 159
cameron 89 jobs 148
turkey 84 leave 148
uk 72 eudebate 113

The terms in red are those appearing in both datasets; the terms in blue correspond to the name of each campaign. It is interesting that though the StrongerIn account has 182 fewer mentions of ‘bbcqt’ (bear in mind the StrongerIn dataset has 95 fewer Tweets), ‘bbcqt’ remains in third place in both sets.

The differences between the rankings of each campaign’s name are noticeable, as is the fact that vote_leave has the name of the Prime Minister (himself a Remain campaigner) in its top 10 (as well as that of Gove, a Leave campaigner), while StrongerIn has no politicians’ names among its 10 most frequent words.

There are other potentially interesting differences between these two top 10s. Can you spot them? Do they tell us anything?

Digging into data and creating datasets does not necessarily tell us new things, but it does allow us to pinpoint otherwise moving objects. We don’t need to pin butterflies to recognise they are indeed butterflies, but the intention is to create new settings for observation.

References

González-Bailón, S., Banchs, R.E. and Kaltenbrunner, A. (2012) Emotions, Public Opinion and U.S. Presidential Approval Rates: A 5 Year Analysis of Online Political Discussions. Human Communication Research 38 (2) 121-143.

González-Bailón, S. et al (2012) Assessing the Bias in Communication Networks Sampled from Twitter (December 4, 2012). DOI: http://dx.doi.org/10.2139/ssrn.2185134

Hawksey, M. (2013) What the little birdy tells me: Twitter in education. Published on November 12, 2013. Presentation given at the LSE NetworkED Seminar Series 2013 on the use of Twitter in Education. Available from http://www.slideshare.net/mhawksey/what-the-little-birdy-tells-me-twitter-in-education [Accessed 21 June 2016].

Priego, E. (2016) “Vote Leave” Looking Into a Sample Archive of 1,100 vote_leave Tweets. 21 June 2016. Available from https://epriego.wordpress.com/2016/06/21/vote-leave-looking-into-a-sample-archive-of-1100-vote_leave-tweets/. [Accessed 21 June 2016].

Priego, E. (2016) “Vote Leave”. A Dataset of 1,100 Tweets by vote_leave with Archive Summary, Sources and Corpus Terms and Collocates Counts and Trends. figshare. DOI: https://dx.doi.org/10.6084/m9.figshare.3452834.v1

Priego, E. (2016) “Stronger In”. A Dataset of 1,005 Tweets by StrongerIn with Archive Summary, Sources and Corpus Terms and Collocates Counts and Trends. figshare. DOI: https://dx.doi.org/10.6084/m9.figshare.3456617.v1

[StrongerIn]. (2016) [Twitter account]. Retrieved from https://twitter.com/StrongerIn. [Accessed 21 June 2016].

[vote_leave]. (2016) [Twitter account]. Retrieved from https://twitter.com/vote_leave. [Accessed 21 June 2016].

“Vote Leave”: Looking Into a Sample Archive of 1,100 vote_leave Tweets

In two days the United Kingdom will vote in a Referendum that is very likely to change its destiny and, more importantly, the destiny of everyone else who has a relationship with the UK.

This is a political event not only of national or local interest, but one likely to have direct and immediate repercussions well beyond the UK’s borders. Anyone who has lived in an EU member country recently does not need to be a political scientist to sense that these repercussions will not be of a merely economic nature: already, even before the vote is cast, the UK’s social fabric has undoubtedly been deeply, even tragically affected.

Needless to say, one of the arenas where political activity is taking place is the media, both broadcast (TV, radio) and social. As the date of the vote approaches, I collected and shared a dataset of Tweets published by the official Leave campaign Twitter account, @vote_leave, between 12/06/2016 09:06:22 and 21/06/2016 09:29:29 BST. The dataset contains 1,100 Tweets.

I did a quick text analysis of the Tweets themselves to get an insight into the most frequent terms and collocates in the corpus, and also looked at the Tweets’ sources (the services used to publish them, e.g. the Twitter Web Client, Buffer, the Twitter iPhone app).

Some quick insights from the data:

Archive Summary (from:vote_leave)

Number of links 500
Number of RTs 592 <-estimate based on occurrence of RT
Number of Tweets 1100
Unique tweets 1099 <-used to monitor quality of archive
First Tweet in Archive 12/06/2016 09:06:22 BST
Last Tweet in Archive 21/06/2016 09:29:29 BST
Tweet rate (tw/min) 0.1 Tweets/min (from last archive 10mins)
In Reply Ids 3
In Reply @s 2
@s 90
RTs 54%

It is interesting that the account mostly broadcasts and RTs Tweets, but interacts only minimally with other users via @ Replies, at least according to this sample dataset. (A larger dataset could corroborate whether this is a trend indicating a media/content strategy.)

Sources

The data indicates that most Tweets were published from the Twitter Web Client (496!), which I would have thought any marketing professional would find clunky, if not unfit for purpose.

Not surprisingly, however, Buffer was used (411 buffered Tweets), which indicates those Tweets are likely to have been scheduled in advance. Surprisingly to me, most of the Tweets in the dataset did not have TweetDeck as a source (only 4 according to the collected data in the given period), but it is possible that TweetDeck was used to ‘buffer’ the Tweets, as TweetDeck allows Buffer integration.

Twitter for iPhone emerges as a significant source, well above TweetDeck. Personally, I find the picture of such important political campaigning being conducted from a mobile phone somewhat scary. Influencing a nation’s destiny from the train home after the pub!

Source Count
TweetDeck 4
Buffer 411
Twitter for iPhone 189
Twitter Web Client 496
Total 1,100

Most Frequent Words

I was not surprised to see that ‘immigration’ was one of the most frequent words appearing in the corpus. However it was interesting to see the centrality of the hashtag ‘bbcqt’ (BBC Question Time). Even if we take into account the specific context of the data’s time period, the prevalence of bbcqt as a term in the corpus could be potentially interpreted as an indication of the importance that television, and specifically the BBC, has had in defining voting trends and public discourse regarding the Referendum.

Removing Twitter-specific stopwords (e.g. t.co, amp, rt) from the raw data, the 10 most frequent words in the corpus are:

Term Count Trend
voteleave 558 0.026160337
eu 402 0.018846694
bbcqt 398 0.018659165
gove 165 0.0077355835
takecontrol 146 0.0068448195
immigration 133 0.0062353495
control 95 0.004453821
cameron 89 0.0041725268
turkey 84 0.003938115
uk 72 0.0033755274
(voteleave, bbcqt, takecontrol were hashtags).
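A sketch of the kind of frequency count behind this table: lowercase, tokenise, drop the Twitter-specific stopwords mentioned above, then rank. The sample Tweets are invented, and the stopword list is only indicative of what the analysis removed:

```python
from collections import Counter
import re

# Twitter data-specific stopwords of the kind removed from the raw data.
STOPWORDS = {"t", "co", "amp", "rt", "https"}

def top_terms(tweets, n=3):
    # Lowercase, split on alphanumeric runs, drop stopwords, then rank.
    tokens = []
    for t in tweets:
        tokens += [w for w in re.findall(r"[a-z0-9]+", t.lower())
                   if w not in STOPWORDS]
    return Counter(tokens).most_common(n)

tweets = ["RT @vote_leave: #voteleave #takecontrol",
          "Vote Leave to #takecontrol of the EU"]
print(top_terms(tweets))  # the campaign slogan terms dominate
```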

It is not clear how much of a social media/content strategy might be behind a Twitter account like @vote_leave, nor how many account managers are behind the tweetage. Apart from the obvious prevalence of ‘immigration’ as a term, it is nevertheless interesting that in just over a week of Tweets in the final countdown to the Referendum there was a clear interest in tapping into televised debate and influence (bbcqt), to the point that the term ranks so highly. Bear in mind that ‘voteleave’ is the campaign’s standard hashtag, and that ‘eu’ would be expected to be a very frequent word, to the point that it could be considered a stopword in the specific context of this corpus. Perhaps, for all the emphasis on social media as an autonomous medium, it is still traditional mainstream media, in this case the BBC, that has the greatest influence on public opinion?

Notes on Methodology

The Tweets contained in the Archive sheet were collected using Martin Hawksey’s TAGS 6.0.

The text analysis was performed using Stéfan Sinclair’s & Geoffrey Rockwell’s Voyant Tools.

The collection and analysis of the dataset complies with Twitter’s Developer Rules of the Road.

The data was collected as an Excel spreadsheet file containing an archive of 1,100 @vote_leave Tweets publicly published by the queried account between 12/06/2016 09:06:22 and 21/06/2016 09:29:29 BST.

I prepared a spreadsheet and added four more sheets: a data summary from the archive, a table of Tweet sources, and tables of corpus term and trend counts and collocate counts.

It must be taken into account that this is just a sample dataset containing the Tweets published during the indicated period, not a large-scale collection of the whole output. The data is presented as is, as a research sample and as the result of an archival task. The sample’s significance is subject to interpretation.

Please note that both research and experience show that the Twitter Search API is not 100% reliable. Large Tweet volumes affect the search collection process. The API might “over-represent the more central users”, not offering “an accurate picture of peripheral activity” (González-Bailón et al. 2012). Therefore it cannot be guaranteed that the dataset contains each and every Tweet actually published by the queried Twitter account during the indicated period. [González-Bailón et al have done very interesting work on political discussions online, and their work remains an inspiration.]

Only content from public accounts was included and analysed. The data was obtained from the public Twitter Search API. The analysed data is also publicly available to all Twitter users via the Twitter Search API, and available to anyone with an Internet connection via the Twitter and Twitter Search web clients and mobile apps, without the need for a Twitter account.

Each Tweet and its contents were published openly on the Web; they were explicitly meant for public consumption and distribution, and are the responsibility of their original authors. Any copyright remains with the original authors.

No personally identifiable information (PII) nor sensitive personal information (SPI) was collected or contained in the dataset.

I have shared the dataset including the extra tables as a sample and as an act of citizen scholarship in order to archive, document and encourage open educational and historical research and analysis. It is hoped that by sharing the data someone else might be able to run different analyses and ideally discover different or more significant insights.

For the next post in this series, click here.

References
[vote_leave]. (2016) [Twitter account]. Retrieved from https://twitter.com/vote_leave. [Accessed 21 June 2016].

González-Bailón, S., Banchs, R.E. and Kaltenbrunner, A. (2012) Emotions, Public Opinion and U.S. Presidential Approval Rates: A 5 Year Analysis of Online Political Discussions. Human Communication Research 38 (2) 121-143.

González-Bailón, S. et al (2012) Assessing the Bias in Communication Networks Sampled from Twitter (December 4, 2012). DOI: http://dx.doi.org/10.2139/ssrn.2185134

Hawksey, M. (2013) What the little birdy tells me: Twitter in education. Published on November 12, 2013. Presentation given at the LSE NetworkED Seminar Series 2013 on the use of Twitter in Education. Available from http://www.slideshare.net/mhawksey/what-the-little-birdy-tells-me-twitter-in-education [Accessed 21 June 2016].

Priego, E. (2016) “Vote Leave”. A Dataset of 1,100 Tweets by vote_leave with Archive Summary, Sources and Corpus Terms and Collocates Counts and Trends. figshare. DOI: https://dx.doi.org/10.6084/m9.figshare.3452834.v1

Priego, E. (2016) “Stronger In”. A Dataset of 1,005 Tweets by StrongerIn with Archive Summary, Sources and Corpus Terms and Collocates Counts and Trends. figshare. DOI: https://dx.doi.org/10.6084/m9.figshare.3456617.v1

Priego, E. (2016) “Stronger In”: Looking Into a Sample Archive of 1,005 StrongerIn Tweets. 21 June 2016. Available from https://epriego.wordpress.com/2016/06/21/stronger-in-looking-into-a-sample-archive-of-1005-strongerin-tweets/. [Accessed 21 June 2016].