A well-annotated dataset for artificial intelligence (AI)-aided cervical cancer screening, called Deep Cervical Cytology Lesions (DCCL), has been developed through a collaboration between KingMed Diagnostics and Huawei in China. It is the largest set of cervical cytology data for developing deep learning-based screening products, and it represents a milestone and, as the authors put it, “A Benchmark for Cervical Cytology Analysis”. This work was presented at the 22nd International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2019, Shenzhen, China) and published in the International Workshop on Machine Learning in Medical Imaging.
Cervical cancer is one of the most common malignant tumors threatening women's health, especially in developing countries. It is preventable or curable if its precancerous lesions are detected early by cytological screening combined with human papillomavirus (HPV) testing. Because of a severe shortage of screening personnel in China, the morbidity and mortality of cervical cancer remain high. An AI-Aided Screening Product (AI-ASP) for cervical cancer detection could be a solution, because it helps to screen out normal cervical specimens so that cytologists can focus on diagnosing abnormal lesions.
In developing an AI-ASP for cervical screening, a large, high-quality, annotated cervical cytology dataset is an essential prerequisite for training the deep learning algorithm. The lack of such datasets has become a bottleneck in the development of AI-aided products in medicine.
DCCL comprises a total of 14,432 image blocks from 1,167 whole-slide images, making it the largest dataset for deep learning training in cervical cancer screening. These images were selected from the very large archive of cervical Pap smears stored at KingMed Diagnostics. KingMed is the largest commercial laboratory in China and was the first laboratory in China to obtain accreditation from the College of American Pathologists (CAP) and under International Organization for Standardization standard 15189 (ISO 15189). It has accumulated a total of 43.5 million cervical cytology cases over the last twenty years. In its cervical cytological screening practice, KingMed fully follows the CAP and ISO 15189 guidelines in its quality assurance and quality control program. Together, these factors ensure that the DCCL dataset draws on a high-standard source in both quantity and quality.
Figure 1 illustrates the workflow of DCCL dataset construction. Annotation was performed by eight senior cytopathologists, each with at least six years of experience signing out cytopathology cases. The cytopathologists worked in pairs: one did the labeling and the other verified it. Before annotation began, the cytopathologists were trained by Huawei AI engineers on the labeling standards, to ensure the quality and accuracy of the annotation and to minimize subjective differences among annotators. Two types of annotation were provided: slide-level annotation for normal results and cell-level annotation for abnormal results. A total of 27,972 lesion cells were labeled following the diagnostic criteria and categories of the 2014 Bethesda System (TBS) for Reporting Cervical Cytology. The annotation results were also randomly checked by a chief cytopathologist as a quality assurance step. The annotations in the DCCL dataset can therefore be regarded as high quality and reliable.
Figure 1. DCCL dataset construction workflow.
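The paired label-and-verify process with a random quality check, as described above, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling used to build DCCL: the names `CellAnnotation`, `label_cell`, `verify` and `sample_for_qa`, the annotator identifiers, and the sampling fraction are all hypothetical, and `TBS_LABELS` is an illustrative subset of TBS categories rather than the dataset's confirmed label set.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

# Illustrative subset of TBS categories (assumption: the real DCCL label set may differ).
TBS_LABELS = {"NILM", "ASC-US", "ASC-H", "LSIL", "HSIL", "SCC", "AGC"}


@dataclass
class CellAnnotation:
    """One cell-level annotation produced by a labeler/verifier pair."""
    slide_id: str
    label: str
    labeler: str
    verifier: Optional[str] = None
    verified: bool = False


def label_cell(slide_id: str, label: str, labeler: str) -> CellAnnotation:
    """First step: one cytopathologist of the pair assigns a TBS category."""
    if label not in TBS_LABELS:
        raise ValueError(f"unknown label: {label}")
    return CellAnnotation(slide_id=slide_id, label=label, labeler=labeler)


def verify(annotation: CellAnnotation, verifier: str, agrees: bool) -> CellAnnotation:
    """Second step: the other member of the pair confirms or rejects the label."""
    if verifier == annotation.labeler:
        raise ValueError("verifier must differ from labeler")
    annotation.verifier = verifier
    annotation.verified = agrees
    return annotation


def sample_for_qa(annotations: List[CellAnnotation], fraction: float,
                  seed: int = 0) -> List[CellAnnotation]:
    """Final step: draw a random subset of verified annotations for the chief's check."""
    verified = [a for a in annotations if a.verified]
    if not verified:
        return []
    k = max(1, round(len(verified) * fraction))
    return random.Random(seed).sample(verified, k)


if __name__ == "__main__":
    batch = [verify(label_cell(f"slide-{i}", "LSIL", "cyto_A"), "cyto_B", True)
             for i in range(10)]
    for a in sample_for_qa(batch, fraction=0.2):
        print(a.slide_id, a.label, a.labeler, a.verifier)
```

Keeping the labeler and verifier as separate fields makes the two-person rule checkable in code, and a seeded sampler makes the random quality check reproducible.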
Using the DCCL dataset, a deep learning model was developed that achieved a sensitivity of 61% for cytological cases signed out as negative by cytopathologists, meaning that it classified 61% of those cases as negative. Among these cases, the accuracy was greater than 99% (i.e., a false-negative rate below 1%). The model also achieved 100% sensitivity for cases signed out as abnormal by cytopathologists: no abnormal case was missed by the deep learning screening algorithm.
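To make these figures concrete, the sketch below computes the usual screening metrics from a confusion matrix. The counts are hypothetical and chosen only to reproduce the 61% and 100% values; they are not the study's data. It also assumes one particular reading of the reported numbers: "sensitivity for negative cases" is treated as specificity, and accuracy among screened-out cases as negative predictive value.

```python
def screening_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute screening metrics, treating 'abnormal' as the positive class.

    tp: abnormal cases flagged as abnormal
    fn: abnormal cases screened out as negative (missed)
    tn: negative cases screened out as negative
    fp: negative cases flagged as abnormal (sent for manual review)
    """
    if tp + fn == 0 or tn + fp == 0 or tn + fn == 0:
        raise ValueError("need abnormal cases, negative cases and screened-out cases")
    total = tp + fn + tn + fp
    return {
        # Share of abnormal cases that the model catches.
        "sensitivity": tp / (tp + fn),
        # Share of negative cases that the model correctly screens out.
        "specificity": tn / (tn + fp),
        # Share of screened-out cases that are truly negative.
        "npv": tn / (tn + fn),
        # Share of all cases that no longer need manual review.
        "screened_out_fraction": (tn + fn) / total,
    }


if __name__ == "__main__":
    # Hypothetical counts: 100 abnormal and 1,000 negative cases.
    metrics = screening_metrics(tp=100, fn=0, tn=610, fp=390)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```

Under this reading, a screening aid is useful when sensitivity and negative predictive value stay close to 100% while the screened-out fraction is large enough to reduce the cytologists' workload.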
Currently available cervical cytology datasets, including CerviSCAN, ISBI 2015 and recently published datasets [6-8], contain several hundred images and a few thousand lesion cells. In comparison, the DCCL dataset has the largest data volume, with more than ten thousand images and about 28,000 lesion cells, all from the largest CAP-accredited laboratory in China. The lesion cell types were classified following TBS criteria, and the dataset was thoroughly labeled and annotated by highly experienced cytopathologists, a very time-consuming, laborious and costly process. The dataset will be released and made publicly available for both traditional machine learning and deep learning studies. It is a very valuable resource and a boon for the development of AI-aided cervical cancer screening.