"Update your AGI predictions"--- Prof. Roman Yampolskiy, PhD @romanyam
January 15, 2018
(credit: Alibaba Group)
Microsoft and Alibaba have developed deep neural network models that scored higher than humans in a Stanford University reading and comprehension test, Stanford Question Answering Dataset (SQuAD).
Microsoft achieved 82.650 on the ExactMatch (EM) metric* on Jan. 3, and Alibaba Group Holding Ltd. scored 82.440 on Jan. 5. The best human score so far is 82.304.
“SQuAD is a new reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage,” according to the Stanford NLP Group. “With 100,000+ question-answer pairs on 500+ articles, SQuAD is significantly larger than previous reading comprehension datasets.”
“A strong start to 2018 with the first model (SLQA+) to exceed human-level performance on @stanfordnlp SQuAD’s EM metric!,” said Pranav Rajpurkar, a Ph.D. student in the Stanford Machine Learning Group and lead author of a paper in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing on SQuAD (available on open-access ArXiv). “Next challenge: the F1 metric*, where humans still lead by ~2.5 points!” (Alibaba’s SLQA+ scored 88.607 on the F1 metric and Microsoft’s r-net+ scored 88.493.)
However, challenging the “comprehension” description, Gary Marcus, PhD, a Professor of Psychology and Neural Science at NYU, notes in a tweet that “the SQUAD test shows that machines can highlight relevant passages in text, not that they understand those passages.”
“The Chinese e-commerce titan has joined the likes of Tencent Holdings Ltd. and Baidu Inc. in a race to develop AI that can enrich social media feeds, target ads and services or even aid in autonomous driving, Bloomberg notes. “Beijing has endorsed the technology in a national-level plan that calls for the country to become the industry leader 2030.”
*”The ExactMatch metric measures the percentage of predictions that match any one of the ground truth answers exactly. The F1 score metric measures the average overlap between the prediction and ground truth answer.” – Pranav Rajpurkar et al., ArXiv
January 16, 2018
In the past Microsoft (and others) have trumpeted their own AI achievements when their work was actually rubbish. Remember their NN that detected a person’s age, it would fail to match the age for exactly the same image simply based on it’s scale, the number of pixels and therefore the relative noise levels. Like a lot of NNs actually they have deep problems around the presence or absence of noise and the output can end up on a completely different maxima in a way that a human brain would never do.
January 16, 2018
a) I wonder who would expect machines to perform with human cognitive robustness, when there is currently no learning model with the processing power of the human brain?
b) I also ponder why people think progress is not made, just because AGi is not yet here?