How Data Mining Visualizes Story Lines in the Twittersphere

One curious side-effect of the work to digitize books and historical texts is the ability to search these databases for words, when they first appeared and how their frequency of use has changed over time. The Google Books n-gram corpus is a good example (an n-gram is a sequence of n words). Enter a word or phrase and it’ll show you its relative usage frequency since 1800. For example, the word “Frankenstein” first appeared in the late 1810s and has grown in popularity ever since. By contrast, the phrase “Harry Potter” appeared in the late 1990s and quickly gained popularity but never overtook Frankenstein or, for that matter, Dracula.

That may be something of a surprise given the unprecedented global popularity of J.K. Rowling’s teenage wizard. And therein lies the problem with a database founded on an old-fashioned, paper-based technology. The Google Books corpus records “Harry Potter” once for each novel, article and text in which it appears, not for the millions of times it is printed and sold. There is no way to account for this level of fame or how it leaves others in the shade.

Today that changes, thanks to the work of Thayer Alshaabi at the Computational Story Lab at the University of Vermont and a number of colleagues. This team has created a searchable database of over 100 billion tweets in more than 150 languages containing over a trillion 1-grams, 2-grams and 3-grams. That’s about 10 per cent of all Twitter messages since September 2008.

Data Visualization

The team has also developed a data visualization tool called Storywrangler that reveals the popularity of any word or phrase based on the number of times it has been tweeted and retweeted. The database shows how this popularity waxes and wanes over time. “In building Storywrangler, our primary goal has been to curate and share a rich, language-based ecology of interconnected n-gram time series derived from Twitter,” say Alshaabi and co.
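To make the idea of 1-grams, 2-grams and 3-grams concrete, here is a minimal Python sketch of how n-grams might be pulled out of tweets and counted. It is an illustration only, not the team's method: the whitespace tokenizer, the `ngrams` and `count_ngrams` helpers, the rule of weighting each tweet by one plus its retweets, and the toy tweets are all assumptions made for this example.

```python
from collections import Counter


def ngrams(text, n):
    """Return the n-grams (as space-joined strings) of a whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(tweets, n):
    """Count n-grams across tweets, weighting each tweet by 1 + its retweets.

    The weighting rule is an assumption made for illustration.
    """
    counts = Counter()
    for text, retweets in tweets:
        for gram in ngrams(text, n):
            counts[gram] += 1 + retweets
    return counts


# Invented toy data: (tweet text, number of retweets).
tweets = [
    ("Harry Potter is back", 2),
    ("reading Harry Potter again", 0),
    ("Frankenstein is back", 0),
]

two_grams = count_ngrams(tweets, 2)
print(two_grams["harry potter"])  # 4
print(two_grams["is back"])       # 4
```

Repeating a count like this for each day would give the kind of time series the article describes, in which a phrase's popularity rises and falls.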
Storywrangler immediately reveals the “story” associated with a wide range of events, individuals and phenomena. For example, it shows the annual popularity of words associated with religious festivals such as Christmas and Easter. It shows how phrases associated with new films burst into the Twittersphere and then fade away, while TV series tend to live on, at least throughout the series’ lifetime. And it reveals the emergence of politico-social movements such as Brexit, Occupy, #MeToo and Black Lives Matter.

The storylines can also be compared with other databases to provide more fine-grained insight and analysis. For example, the popularity of film titles on Twitter can be compared with the film’s takings at the box office; the emergence of words associated with disease can be compared with the number of infections recorded by official sources; and words associated with political unrest can be compared with incidents of civil disobedience. That’s useful because this kind of analysis provides a new way to study society, potentially with predictive results. Indeed, computer scientists have long suggested that social media can be used to predict the future.

Cultural Significance

These storylines have social and cultural significance too. “Our collective memory lies in our recordings, in our written texts, artworks, photographs, audio and video, and in our retellings and reinterpretations of that which becomes history,” say Alshaabi and colleagues. Now anyone can study it with Storywrangler. Try it; it’s interesting.

As for Harry Potter, Frankenstein and Dracula, the tale that Storywrangler tells is different from the one told by the Google Books n-gram corpus. On Twitter, Harry Potter is significantly more popular than his grim-faced predecessors and always has been. In 2011, Harry Potter was the 44th most popular term on Twitter while Dracula has never risen higher than 2653rd. Frankenstein’s best rank is 3560th.
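A rank such as “44th most popular term” comes from ordering every term by how often it appears. The short Python sketch below shows one plausible way to do that. The `rank_terms` helper, the alphabetical tie-breaking rule and the counts are invented for illustration and are not taken from Storywrangler.

```python
def rank_terms(counts):
    """Map each term to its rank (1 = most frequent); ties are broken alphabetically."""
    ordered = sorted(counts, key=lambda term: (-counts[term], term))
    return {term: position for position, term in enumerate(ordered, start=1)}


# Invented counts for a single day, for illustration only.
daily_counts = {
    "the": 50000,
    "harry potter": 900,
    "dracula": 40,
    "frankenstein": 25,
}

ranks = rank_terms(daily_counts)
print(ranks["harry potter"])  # 2
print(ranks["dracula"])       # 3
```

Tracking a term's rank rather than its raw count makes different days easier to compare, since the total volume of tweets changes over time.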
Of course, fame is a fickle friend and an interesting question is whether Harry Potter will fare as well as Frankenstein two hundred years after publication. Storywrangler, or its future equivalent, would certainly be able to help.

Ref: Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter. arxiv.org/abs/2007.12988

The Physics arXiv Blog