Benchmarking AI in healthcare
Artificial intelligence (AI) models are becoming an indispensable part of healthcare delivery because of the ever-growing volumes of data the sector produces. The technology supports a wide range of tasks, from pattern recognition and classification of disease entities to clinical decision-making. Its purpose is to reduce the burden of repetitive tasks on staff and to provide information support. Exploration of AI applications in health has attracted considerable investment. Despite this potential, AI applications in healthcare face numerous unresolved issues, including differing regulatory requirements, systematic bias inherent to the algorithms, uncertainty about efficacy, and concerns over ethical use.
AI-enabled solutions are only as good as the humans who designed them. The performance of an algorithm depends on the quality of the data that feeds the system, the training methods, and the mechanism of learning. Biased, incomplete, or unverified datasets riddled with errors cannot produce a valid result. Similarly, model training needs to rest on sound scientific foundations, be applied consistently, and be continuously revised for validity.
In 2018, the World Health Organization (WHO) and the International Telecommunication Union (ITU) established a Focus Group on Artificial Intelligence for Health (FG-AI4H). The initiative invites all interested stakeholders to participate in developing standards for the evaluation and quality control of AI algorithms in the healthcare sector, with the aim of improving existing processes and ensuring better health outcomes.
The Focus Group believes that the field would benefit from AI methods that can be independently evaluated in a standardized manner to support global health. Benchmarking shall not require disclosure of the algorithm itself, only of the results it produces on standardized input datasets. Performance criteria that reflect the quality of mapping shall include accuracy, reproducibility, robustness, and absence of bias.
AI developers rely on public data and other data sources and train their models against a clear problem definition, with a defined benchmark that shall serve for assessment on open benchmarking platforms, e.g., crowdAI. The Focus Group shall be responsible for the creation and management of undisclosed test sets.
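The black-box evaluation described above can be illustrated with a minimal sketch. It is not FG-AI4H's actual procedure: the function names, the `groups` argument, and the use of a per-group accuracy gap as a rough indicator of bias are all assumptions made for illustration. The sketch scores a model using only its outputs on a held-back test set, and it covers accuracy, reproducibility, and a simple bias check; robustness testing is left out.

```python
from typing import Callable, Dict, Sequence

# Assumption: a submitted model is a black box that maps one input record
# to a predicted label. Its internals are never inspected.
Model = Callable[[Sequence[float]], int]


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the reference labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def benchmark(
    model: Model,
    inputs: Sequence[Sequence[float]],
    labels: Sequence[int],
    groups: Sequence[str],
) -> Dict[str, object]:
    """Score a black-box model on a test set held by the benchmark organizer.

    `groups` is a hypothetical per-record subgroup tag (for example an age
    band) used only to compare accuracy across subgroups.
    """
    # Query the black box once per record.
    preds = [model(x) for x in inputs]
    # Query it again on identical inputs as a simple reproducibility check.
    rerun = [model(x) for x in inputs]

    # Accuracy computed separately for every subgroup.
    per_group: Dict[str, float] = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        per_group[g] = accuracy(
            [preds[i] for i in idx], [labels[i] for i in idx]
        )

    return {
        "accuracy": accuracy(preds, labels),
        "reproducible": preds == rerun,
        "group_accuracy": per_group,
        # Assumed bias indicator: spread between best and worst subgroup.
        "max_group_gap": max(per_group.values()) - min(per_group.values()),
    }
```

A caller would pass any function with the `Model` signature together with the organizer's inputs, labels, and subgroup tags, and receive a dictionary of scores. Because only predictions cross the boundary, the algorithm itself stays undisclosed.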
The common domains identified by FG-AI4H include diagnostics (general and specific), natural language processing, data extraction from clinical notes, and coding of lab data. The group also envisions the establishment of registries for reporting serious adverse events linked to healthcare AI. These are exciting times, as the infrastructure is being built while already in flight. Let’s see how the initiative develops.
ITU. (2018). Focus Group on “Artificial Intelligence for Health.” Retrieved 4 March 2020, from https://www.itu.int/en/ITU-T/focusgroups/ai4h/Pages/default.aspx
Salathé, M., Wiegand, T., Wenzel, M., & Krishnamurthy, R. (2018). Focus Group on Artificial Intelligence for Health. ITU. Retrieved from https://www.itu.int/en/ITU-T/focusgroups/ai4h/Documents/FG-AI4H_Whitepaper.pdf
crowdAI. (2018). crowdAI. Retrieved 4 March 2020, from https://www.crowdai.org/