Transformers in Dermoscopic Image Classification

Dermoscopy is a non-invasive imaging technique used in dermatology to examine the features of skin lesions in detail.

Recent advances in dermoscopic image classification using artificial intelligence and deep learning have substantially improved the early diagnosis of skin cancer, including melanoma. Convolutional neural networks (CNNs), in particular, have improved the precision and effectiveness of diagnosis. More recently, vision Transformers have gained popularity for skin cancer applications. In this post, we review the use of vision Transformers in skin lesion and melanoma diagnosis using dermoscopic images.

Transformers are a family of deep learning models originally developed for natural language processing (NLP). Their design, which combines self-attention mechanisms with feed-forward neural networks, has proven adaptable to many kinds of structured data, including images. This versatility has brought them into computer vision, where they are now used for tasks such as dermoscopic image classification.

Several components account for their effectiveness in image classification. Self-attention lets the model weigh the importance of each part of the input, such as an image patch, relative to every other part. Multi-head attention runs several attention operations in parallel, so the model can assess different aspects of the image at once and recognize more complex patterns. Because attention by itself carries no innate spatial information, Transformers add positional encodings, which are essential when working with image data. Finally, feed-forward layers further process the extracted image features. Together, these components contribute to the robustness of the Transformer architecture in image-related tasks.
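To make these components concrete, here is a minimal NumPy sketch of a single Transformer-style step on an image: splitting it into patches, adding positional embeddings, applying multi-head self-attention, and passing the result through a feed-forward layer. It is an illustration under stated assumptions, not a full vision Transformer: the weights are random rather than learned, the image is random noise standing in for a dermoscopic image, layer normalization and residual connections are omitted, and ReLU replaces the activation a real model would use. All function and variable names are hypothetical.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)              # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention with several heads run in parallel."""
    n, d = tokens.shape
    d_head = d // num_heads

    def split(x):
        # (n, d) -> (heads, n, d_head)
        return x.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(tokens @ w_q), split(tokens @ w_k), split(tokens @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, n, n)
    weights = softmax(scores, axis=-1)                    # each row sums to 1
    heads = weights @ v                                   # (heads, n, d_head)
    merged = heads.transpose(1, 0, 2).reshape(n, d)       # concatenate heads
    return merged @ w_o, weights

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))          # assumption: random stand-in image
patch, d_model, num_heads = 8, 16, 4

# 1. Patch embedding plus positional embedding (random here, learned in practice).
patches = patchify(image, patch)                                   # (16, 192)
w_embed = rng.normal(scale=0.02, size=(patches.shape[1], d_model))
pos_embed = rng.normal(scale=0.02, size=(patches.shape[0], d_model))
tokens = patches @ w_embed + pos_embed                             # (16, 16)

# 2. Multi-head self-attention over all patches.
w_q, w_k, w_v, w_o = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4))
out, attn = multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads)

# 3. Position-wise feed-forward layer (ReLU used as a simplification).
w1 = rng.normal(scale=0.1, size=(d_model, 4 * d_model))
w2 = rng.normal(scale=0.1, size=(4 * d_model, d_model))
ffn_out = np.maximum(0.0, out @ w1) @ w2                           # (16, 16)

print(patches.shape, attn.shape, ffn_out.shape)
```

Each of the four heads produces its own 16 x 16 attention map, so every patch can draw on every other patch regardless of how far apart they are in the image.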

Using Transformers for dermoscopic image classification offers several benefits. First, their proficiency in capturing intricate patterns within these images can improve classification accuracy and, in turn, the precision of skin cancer diagnosis. Second, because they pick up relevant image features automatically, they speed up model building by removing the need for time-consuming manual feature engineering. Finally, pre-trained Transformer models can be fine-tuned for dermoscopic image classification, which taps into the prior knowledge these models already encode and enhances their effectiveness on this task.
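The fine-tuning idea can be sketched in its simplest form: keep the pre-trained backbone frozen and train only a new classification head on the features it produces. The NumPy example below assumes the backbone has already turned each image into a feature vector; here those vectors are synthetic clusters, not real dermoscopic features, and the two classes are placeholders. The function `train_linear_head` and all parameter values are hypothetical choices for illustration, and full fine-tuning would also update the backbone weights.

```python
import numpy as np

def softmax(z):
    """Row-wise, numerically stable softmax."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_linear_head(features, labels, num_classes, lr=0.2, epochs=200):
    """Fit a softmax classification head on frozen features with gradient descent."""
    n, d = features.shape
    w = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        probs = softmax(features @ w + b)
        grad = (probs - one_hot) / n          # gradient of mean cross-entropy w.r.t. logits
        w -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return w, b

rng = np.random.default_rng(0)
num_classes, dim, per_class = 2, 8, 50

# Assumption: synthetic vectors standing in for frozen backbone features.
centers = rng.normal(scale=3.0, size=(num_classes, dim))
labels = np.repeat(np.arange(num_classes), per_class)
features = centers[labels] + rng.normal(size=(labels.size, dim))

# Standardize each feature dimension so a fixed learning rate behaves well.
features = (features - features.mean(axis=0)) / features.std(axis=0)

w, b = train_linear_head(features, labels, num_classes)
accuracy = (np.argmax(features @ w + b, axis=1) == labels).mean()
print(f"training accuracy of the new head: {accuracy:.2f}")
```

Training only the head is cheap and needs few labeled images, which is one reason pre-trained models are attractive when dermoscopic datasets are small.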

CNNs, however, have a well-established track record in computer vision, including the analysis of skin lesions, where they are effective at tasks such as lesion segmentation and classification. Several conventional CNN-based models perform very well on skin lesion images and can adapt to small datasets. Transformers, on the other hand, excel at capturing long-range dependencies in data and can learn relevant features from images without handcrafted features, but they require a substantial amount of training data. Consequently, CNNs remain a solid choice for many image-based skin lesion analysis tasks, particularly when datasets are small. Recent research also suggests that hybrid models, which combine CNNs and Transformers, may outperform either architecture alone on currently available public datasets. Nevertheless, given the rapid growth of skin image datasets, large datasets are expected to become more common. As a result, Transformers are likely to gradually supplant conventional deep learning models in skin cancer classification in the near future.


Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need.

Khan, S., Ali, H., & Shah, Z. (2023). Identifying the role of vision transformer for skin cancer—A scoping review. Frontiers in Artificial Intelligence, 6, 1202990.

Nie, Y., Sommella, P., Carratù, M., O’Nils, M., & Lundgren, J. (2023). A Deep CNN Transformer Hybrid Model for Skin Lesion Classification of Dermoscopic Images Using Focal Loss. Diagnostics, 13(1), 72.