A Multimodal Pipeline for Personalized Recommendations from Greek YouTube Content: Integrating Automatic Speech Recognition and Aspect-Based Sentiment Analysis

Title	A Multimodal Pipeline for Personalized Recommendations from Greek YouTube Content: Integrating Automatic Speech Recognition and Aspect-Based Sentiment Analysis
Publication Type	Conference Paper
Year of Publication	2026
Authors	Singh M, Roumeliotis KI, Vassilakis C, Margaris D, Mpardis G, Spiliotopoulos D
Conference Name	Proceedings of the 2026 International Conference on Information, Intelligence, Systems and Applications
Publisher	IEEE Xplore
Keywords	Aspect-Based Sentiment Analysis (ABSA), Automatic Speech Recognition (ASR), Modern Greek NLP, Multimedia Data Mining, Personalized Recommender Systems, User Preference Modeling
Abstract	Social media and video-sharing platforms such as YouTube allow users to build communities around the audiovisual content they post. This content includes product reviews that help other consumers make purchasing decisions, and it can be processed with NLP and automated Aspect-Based Sentiment Analysis (ABSA) to produce input for recommender systems. However, the development of such systems for low-resource languages, such as Modern Greek, lags significantly behind. To address this gap, we introduce a complete ABSA framework that transforms Greek multimedia reviews into structured, accurate data for personalized product recommendation. The framework uses a multistage pipeline to address the complexities of spoken text, such as the highly inflected morphology of Greek and the slang used in reviews. The first stage of the pipeline transforms raw audio data into a form suitable for processing, while the second stage applies linguistic processing. Two alternatives were evaluated for the second stage: the first uses a native Greek sentiment lexicon, and the second first applies automated Greek-to-English translation, followed by English-language sentiment analysis. The efficacy of the framework was assessed in an experiment that compared the output of each pipeline variant against independent human assessments. The assessment shows that the translation-based approach achieves high accuracy and alignment with expert human raters. The work also accounts for the effect of “noise” caused by Automatic Speech Recognition (ASR) errors and studies how these errors propagate across 12 distinct product attributes. The proposed pipeline thus constitutes an effective solution for exploiting audiovisual content in recommender systems to generate personalized recommendations.
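
The lexicon-based scoring step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the aspect terms, the sentiment lexicon, the window size, and the function name score_aspects are all hypothetical, and the input is assumed to be a transcript that has already passed through ASR and, for the translation-based variant, Greek-to-English translation.

```python
from collections import defaultdict

# Hypothetical toy resources; neither table comes from the paper.
ASPECT_TERMS = {
    "battery": "battery",
    "screen": "display",
    "display": "display",
    "price": "price",
}
SENTIMENT_LEXICON = {"great": 1.0, "good": 0.5, "poor": -0.5, "terrible": -1.0}


def score_aspects(text, window=2):
    """Return a mean sentiment score per aspect mentioned in `text`.

    Each aspect mention is scored by averaging the lexicon values of the
    sentiment words found within `window` tokens on either side of it.
    Mentions with no nearby sentiment word are ignored.
    """
    tokens = [t.strip(".,!?;:").lower() for t in text.split()]
    per_aspect = defaultdict(list)
    for i, token in enumerate(tokens):
        aspect = ASPECT_TERMS.get(token)
        if aspect is None:
            continue
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        values = [SENTIMENT_LEXICON[t] for t in tokens[start:end]
                  if t in SENTIMENT_LEXICON]
        if values:
            per_aspect[aspect].append(sum(values) / len(values))
    # Average over all scored mentions of the same aspect.
    return {a: sum(v) / len(v) for a, v in per_aspect.items()}


if __name__ == "__main__":
    print(score_aspects("The battery is great but the screen is terrible."))
```

A fixed token window is only one simple way to associate sentiment words with aspects; the abstract does not state how the paper's pipeline performs this association.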