Machines Best Humans in Stanford's Grueling Reading Test

D-brief | By Carl Engelking | January 15, 2018 11:56 PM



(Credit: Shutterstock)

The ability to read and understand a passage of text underpins the pursuit of knowledge, and was once a uniquely human cognitive activity. But 2018 marks the year that, by one measure, machines surpassed humans’ reading comprehension abilities. Both Alibaba and Microsoft recently tested their respective artificial neural networks with The Stanford Question Answering Dataset (SQuAD), an arduous test of a machine’s natural language processing skills. The dataset consists of over 100,000 questions drawn from thousands of Wikipedia articles. Basically, it challenges algorithms to parse a passage of text and write answers to tricky questions.

The AIs, for example, might read a passage about geology and answer questions like “An igneous rock is a rock that crystallizes from what?” or “What changes the mineral content of a rock?” These questions go a level beyond simply scanning for basic facts: they require algorithms to process a large amount of information about context, sequences and relationships before providing an accurate answer.

The algorithm developed by Alibaba’s Institute of Data Science Technologies, SLQA+, notched a score of 82.44 on the test, just a hair better than the 82.304 scored by humans. Alibaba claims it is the first time a machine has performed better than flesh-and-blood readers on the ExactMatch metric of the Stanford test. Microsoft Research Asia also outdid humans: its R-NET+ scored 82.650. Pranav Rajpurkar, a Stanford artificial intelligence researcher and designer of the test, wrote on Twitter that the achievement is a harbinger of more good things to come for AI in 2018. (Note: The F1 metric is the balanced mean between precision and recall.)

A strong start to 2018 with the first model (SLQA+) to exceed human-level performance on @stanfordnlp SQuAD's EM metric! Next challenge: the F1 metric, where humans still lead by ~2.5 points!

— Pranav Rajpurkar (@pranavrajpurkar) January 11, 2018
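To make the two metrics concrete, here is a minimal Python sketch of how ExactMatch and F1 are typically computed for SQuAD-style answers. This is a simplified illustration, not the official Stanford evaluation script, though it follows the same idea: EM demands an identical normalized answer string, while F1 gives partial credit for token overlap.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace
    (a simplified version of SQuAD-style answer normalization)."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, answer):
    """ExactMatch: 1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(answer))

def f1_score(prediction, answer):
    """F1: harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ans_tokens = normalize(answer).split()
    common = Counter(pred_tokens) & Counter(ans_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ans_tokens)
    return 2 * precision * recall / (precision + recall)
```

For instance, answering “molten magma” when the reference answer is “magma” scores 0 on ExactMatch but gets partial F1 credit, which is why the F1 leaderboard numbers run higher than the EM ones.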

A machine that can provide useful answers to more complicated questions could be put to work in a wide variety of applications. Alibaba, for example, is already using its reading system to field customer service questions on Singles Day, China’s shopping bonanza that’s the largest in the world. “The technology underneath can be gradually applied to numerous applications such as customer service, museum tutorials and online responses to medical inquiries from patients, decreasing the need for human input in an unprecedented way,” Luo Si, chief scientist at the Alibaba institute, said in a statement.
