Image Captioning AI More Accurate Than Humans

Microsoft has announced that, in benchmark tests, its new AI-based automatic image captioning technology scored higher than humans at describing photos and images.

Part of Azure AI

The new automatic image captioning model is available via Microsoft’s Azure Cognitive Services Computer Vision offering, which is part of Azure AI. Azure Cognitive Services provides developers with AI services and cognitive APIs to enable them to build intelligent apps without the need for machine-learning expertise.
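
As a rough illustration of what calling such a service can look like, here is a minimal Python sketch that builds a request for the Computer Vision "describe image" REST operation and picks the most confident caption from the JSON reply. The API version in the path (v3.1), the helper names, and the example endpoint and key are assumptions made for illustration, so check Microsoft's current documentation before relying on them.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the "describe" operation under API version v3.1.
# The version available to you may differ.
API_PATH = "/vision/v3.1/describe"


def build_describe_request(endpoint, key, image_url, max_candidates=1):
    """Build (but do not send) a POST request asking for image captions."""
    query = urllib.parse.urlencode({"maxCandidates": max_candidates})
    url = endpoint.rstrip("/") + API_PATH + "?" + query
    body = json.dumps({"url": image_url}).encode("utf-8")
    headers = {
        "Ocp-Apim-Subscription-Key": key,  # key of your own Azure resource
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def best_caption(response_json):
    """Return (text, confidence) for the most confident caption, or None."""
    captions = response_json.get("description", {}).get("captions", [])
    if not captions:
        return None
    top = max(captions, key=lambda c: c.get("confidence", 0.0))
    return top["text"], top.get("confidence")


def describe_image(endpoint, key, image_url):
    """Send the request and return the best caption (needs network access)."""
    request = build_describe_request(endpoint, key, image_url)
    with urllib.request.urlopen(request) as response:
        return best_caption(json.load(response))
```

In this sketch, a call such as describe_image("https://<your-resource>.cognitiveservices.azure.com", "<your-key>", "https://example.com/photo.jpg") would return a caption and a confidence value; the placeholders in angle brackets stand for your own resource details.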


The test of the new automatic image captioning software, led by Lijuan Wang, a principal research manager in Microsoft’s research lab in Redmond, involved pre-training a large AI model with a rich dataset of images paired with word tags, with each tag mapped to a specific object in an image. This ‘visual vocabulary’ approach is similar to teaching children to read with a picture book that associates single words with images, such as a picture of an apple with the word “apple” beneath it. The pre-trained model was then fine-tuned on captioned images, which is where it learned how to compose a sentence. When later given images containing novel objects, it was able to draw on its visual vocabulary to describe them.

The Result

The research paper based on this test, published online on arXiv (the preprint server hosted by Cornell University), concluded that the model could generate fluent image captions that describe novel objects and identify the locations of the objects. The paper also concluded that the machine learning model “achieved new state-of-the-art results on nocaps and surpassed the human CIDEr score.” CIDEr is a metric that scores a caption by how closely it matches a set of human-written reference captions. The result means that the model exceeded human parity on the novel object captioning at scale (nocaps) benchmark, i.e. a measure of how well a model generates captions for objects in images that were not in the dataset used to train it.
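
To give a feel for what a CIDEr-style score measures, the Python sketch below compares a candidate caption with human reference captions using TF-IDF weighted n-gram overlap, averaged over 1-grams to 4-grams. It is a simplified illustration only: it omits the stemming, length penalty and clipping used in the CIDEr-D variant, it is not the official benchmark code, and the function names and toy captions are made up for this example.

```python
import math
from collections import Counter

MAX_N = 4  # CIDEr averages over 1-grams to 4-grams


def ngram_counts(caption, n):
    """Count the n-grams in a caption (lower-cased, split on whitespace)."""
    tokens = caption.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def document_frequency(references, n):
    """For each n-gram, count the images whose references contain it."""
    df = Counter()
    for captions in references.values():
        seen = set()
        for caption in captions:
            seen.update(ngram_counts(caption, n))
        df.update(seen)
    return df


def tfidf_vector(caption, n, df, num_images):
    """Weight each n-gram by its frequency and by how rare it is overall."""
    counts = ngram_counts(caption, n)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        gram: (count / total) * math.log(num_images / max(1, df[gram]))
        for gram, count in counts.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(gram, 0.0) for gram, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cider_like_score(candidate, image_id, references):
    """Average similarity between a candidate and one image's references."""
    num_images = len(references)
    refs = references[image_id]
    total = 0.0
    for n in range(1, MAX_N + 1):
        df = document_frequency(references, n)
        cand_vec = tfidf_vector(candidate, n, df, num_images)
        sims = [cosine(cand_vec, tfidf_vector(r, n, df, num_images)) for r in refs]
        total += sum(sims) / len(refs)
    return total / MAX_N


# Toy reference captions, invented for illustration.
references = {
    "img1": ["a red apple on a table", "an apple sits on a wooden table"],
    "img2": ["a dog runs on the beach"],
    "img3": ["two people ride bicycles in a park"],
}
good = cider_like_score("a red apple on a table", "img1", references)
bad = cider_like_score("a dog runs on the beach", "img1", references)
```

With this toy data, the caption that matches the references for img1 scores higher than the unrelated one. Words that appear in the references of every image (such as "a" here) get zero weight, so distinctive words and phrases dominate the score.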

Twice As Good As Existing System

Microsoft’s Lijuan Wang has also said that the new AI-powered automatic image captioning system is twice as good as the image captioning model that has been used in Microsoft products and services since 2015.

Five Major Human Parities

Lijuan Wang describes this latest AI breakthrough in automatic captioning as part of Microsoft’s wider theme of “human parity achievement across cognitive AI systems”. According to her, in the last five years, Microsoft has “achieved five major human parities: in speech recognition, in machine translation, in conversational question answering, in machine reading comprehension, and in 2020, in spite of COVID-19, we got the image captioning human parity.”

What Does This Mean For Your Business?

Microsoft sees this as a ‘breakthrough’ that is essentially an extra technology tool to be added to its Azure platform so that developers can use it to serve a broad set of customers. As highlighted by Lijuan Wang, it also sends a message to other big tech companies that are currently expanding their use of AI/machine learning, e.g. Google and Amazon, that Microsoft is also making major strides in the kinds of technologies that can have multiple business and other applications, and that can make existing digital search and other tools more effective. Microsoft’s own Chromium-based browser, Edge, will, no doubt, be a beneficiary of this technology. This development also shows that we are now entering a stage where AI/machine learning can create tools that are at least on a par with human ability for some tasks.