AI Analysis of Historical U.S. Newspapers Reveals New Tools For Mining the Past

In 1914, the biggest story in newspapers across the U.S. was the world war that had recently broken out in Europe, with a big question mark hanging over whether the U.S. would take part. The same story dominated U.S. newspapers in 1915, 1917 and 1918. But in 1916, another story captured the attention of the American public, one that is much less well known today.

In that year, the U.S. Army entered Mexico in pursuit of a Mexican paramilitary force that had attacked the town of Columbus in New Mexico. Its goal was to capture the force’s leader, Pancho Villa, who ultimately escaped, having led the American troops a merry dance for several months. For the newspapers, the episode provided exciting stories that knocked World War I off the front pages.

It’s easy to imagine that this kind of newspaper analysis is straightforward for scholars. After all, newspapers more than 72 years old are part of the public record and accessible via the Library of Congress. Indeed, its Chronicling America project consists of over 20 million scans of historical newspapers, some of them dating back to the 17th century, along with digital versions of the text deciphered by optical character recognition software.

Headline News

But this dataset is far from satisfactory. It turns out that optical character recognition software does not recognize newspaper layouts or distinguish page furniture such as headlines, bylines, captions, and adverts from the stories themselves. This scrambles much of the digital text, making it hard to read or to analyze with digital tools. As a result, the seemingly simple task of choosing the biggest news stories of the past is almost impossible.

At least, it was until Melissa Dell at Harvard University in Cambridge and colleagues entered the scene. This group have created a deep learning algorithm that detects the newspaper layout and recognizes the difference between types of text. It then uses optical character recognition to read the stories while clearly labelling the headlines, bylines and captions and ignoring adverts.

The result is a new dataset called American Stories that consists of over a billion news articles. These articles provide a unique window into a different age, throwing light on the nature of life across the U.S. prior to 1925 and on the lives of ancestors. “The resulting American Stories dataset could be used to achieve better understanding of historical English and historical world knowledge,” say Dell and co.

The team put the database through its paces by using it to find clusters of stories. “We show how articles can be grouped into news stories, with different articles that are part of the same unfolding news story clustering together,” say the team.

They then picked the largest cluster of stories in each year and manually read a sample of stories from each cluster to confirm the topic. That produced a list of the biggest stories for each year from 1885 to 1920, including 1916 when Pancho Villa dominated headlines.
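
For readers who want a feel for this step, here is a toy sketch of the idea in Python. It is not the team's method: it swaps in simple word overlap (Jaccard similarity) and single-linkage grouping as a stand-in for whatever model the team used, and the sample articles, the field names "year" and "text", and the 0.3 threshold are all invented for illustration.

```python
from collections import defaultdict


def tokens(text):
    """Return the set of lowercase words longer than three letters."""
    return {w for w in text.lower().split() if len(w) > 3}


def jaccard(a, b):
    """Word overlap between two sets, from 0.0 (none) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_articles(articles, threshold=0.3):
    """Group articles whose word overlap meets the threshold (single linkage)."""
    parent = list(range(len(articles)))

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    bags = [tokens(a["text"]) for a in articles]
    for i in range(len(articles)):
        for j in range(i + 1, len(articles)):
            if jaccard(bags[i], bags[j]) >= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = defaultdict(list)
    for i, article in enumerate(articles):
        clusters[find(i)].append(article)
    return list(clusters.values())


def biggest_story_by_year(articles, threshold=0.3):
    """Map each year to its largest cluster of similar articles."""
    by_year = defaultdict(list)
    for article in articles:
        by_year[article["year"]].append(article)
    return {
        year: max(cluster_articles(items, threshold), key=len)
        for year, items in by_year.items()
    }


# Invented sample data, purely for illustration.
SAMPLE = [
    {"year": 1916, "text": "army troops pursue pancho villa across mexico border"},
    {"year": 1916, "text": "pancho villa escapes army troops near mexico border"},
    {"year": 1916, "text": "wheat prices rise sharply in chicago markets"},
    {"year": 1917, "text": "congress declares war against germany"},
]

biggest = biggest_story_by_year(SAMPLE)
for year in sorted(biggest):
    print(year, len(biggest[year]), "article(s) in the largest cluster")
```

Comparing every pair of articles is fine for a handful of examples but would not scale to a dataset of this size, and the manual check the team describes, reading a sample from each cluster, would still be needed to name the topic.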

The work opens the way for new kinds of historical scholarship. Whether the subject is political dynamics or everyday life, the American Stories dataset makes it possible to ask data-driven questions about the nation's formative years. Combining modern computing with primary sources at this scale promises new understandings of the past.

Clearly the new database can shed light on historical events, social issues and cultural trends, along with the way they were viewed at the time. The team also point to more specific possibilities. For example, researchers can use the dataset to study the representation of different groups in the media over time, tracking changes in language, tone, and subject matter. The dataset can also be used to study the history of labor and the struggles of working-class people.

Antiquated Language

The authors also highlight applications the database shouldn’t be used for. They point out that the database reflects historical attitudes and biases and contains antiquated terms along with language now considered offensive. So using this database to train a generative model would raise the danger of the model taking on the same biases. “For these reasons, we recommend against the use of American Stories for training generative models,” say Dell and co.

“Rather, American Stories can be used for a wide variety of applications, ranging from elucidating social science questions to training an historically-oriented language model to exploring world and family history,” they conclude.

That’s interesting work that shows how the latest digital techniques can turn historical archives into a more accurate and reliable source of information for studying the past and understanding the present.

Ref: American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers : arxiv.org/abs/2308.12477

This story was prepared with the assistance of claude.ai