Applied Sciences (Switzerland), vol. 15, no. 12, 2025 (SCI-Expanded)
This study presents a novel approach to the increasingly important task of distinguishing AI-generated images from authentic photographs. The detection of such synthetic content is critical for combating deepfake misinformation and ensuring the authenticity of digital media in journalism, forensics, and online platforms. A custom-designed Vision Transformer (ViT) model, termed Patch-Based Vision Transformer for Identifying Synthetic Media (PV-ISM), is introduced. Its performance is benchmarked against established transfer learning methods using 60,000 authentic images from the CIFAKE dataset, which is derived from CIFAR-10, along with a corresponding collection of images generated using Stable Diffusion 1.4. PV-ISM incorporates patch extraction, positional encoding, and multiple transformer blocks with attention mechanisms to identify subtle artifacts in synthetic images. Following extensive hyperparameter tuning, an accuracy of 96.60% was achieved, surpassing the performance of ResNet50 transfer learning approaches (93.32%) and other comparable methods reported in the literature. The experimental results demonstrate the model's balanced classification capabilities, with strong recall and precision across both image categories. The patch-based architecture of Vision Transformers, combined with appropriate data augmentation techniques, proves particularly effective for synthetic image detection while requiring less training time than traditional transfer learning approaches.
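To make the described pipeline concrete, the sketch below shows a minimal patch-based Vision Transformer for binary real-vs-synthetic classification in PyTorch, mirroring the components named in the abstract (patch extraction, positional encoding, transformer blocks with attention, and a classification head). All hyperparameters here (patch size 4, embedding dimension 128, 6 encoder blocks, 4 attention heads) are illustrative assumptions, not the configuration reported for PV-ISM.

```python
# Illustrative sketch only: a small ViT classifier for 32x32 CIFAKE-sized images.
# Hyperparameters are assumptions and do not reproduce the paper's PV-ISM settings.
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Split a 32x32 image into non-overlapping patches and project them linearly."""

    def __init__(self, img_size=32, patch_size=4, in_channels=3, embed_dim=128):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to patch extraction + linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                      # (B, embed_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)


class ViTClassifier(nn.Module):
    """Patch embedding + positional encoding + transformer encoder + classification head."""

    def __init__(self, num_classes=2, embed_dim=128, depth=6, num_heads=4):
        super().__init__()
        self.patch_embed = PatchEmbedding(embed_dim=embed_dim)
        num_patches = self.patch_embed.num_patches

        # Learnable [CLS] token and positional encoding (one position per patch + CLS).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=embed_dim * 4,
            batch_first=True, norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)  # real vs. AI-generated

    def forward(self, x):
        tokens = self.patch_embed(x)                               # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)             # (B, 1, D)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed  # add positions
        tokens = self.encoder(tokens)
        return self.head(self.norm(tokens[:, 0]))                  # classify via CLS token


if __name__ == "__main__":
    model = ViTClassifier()
    dummy = torch.randn(8, 3, 32, 32)  # a batch of CIFAKE-sized images
    print(model(dummy).shape)          # torch.Size([8, 2])
```

Training such a model would pair it with a standard cross-entropy loss and the data augmentation mentioned in the abstract; those details, like the hyperparameters above, would need to be taken from the full paper.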