Research Article

Recommender Systems: A Data-Driven Framework for Personalized Decision Intelligence

[version 1; peer review: awaiting peer review]
PUBLISHED 18 Dec 2025

This article is included in the Software and Hardware Engineering gateway.

Abstract

Background

Recommender systems have become integral to personalizing digital experiences across domains such as e-commerce, media, and entertainment. These systems use user-item interaction data (how a user reacts to an item) to identify patterns that predict preferences and rank content. Collaborative filtering is one of the most widely used approaches, relying on similarity between users or items to generate recommendations.

Methods

This study examines collaborative filtering using similarity metrics applied to a curated IMDB movie dataset. Data were preprocessed by merging metadata and ratings, encoding categorical fields, and constructing feature vectors for each movie. Cosine similarity was the primary metric used to compute pairwise similarity between items. An item-item recommendation engine was then implemented, and its output was evaluated using an example movie, Saw (2004).

Results

The system produced coherent recommendations aligned with the genre and thematic characteristics of the input movie used, Saw (2004). The top-ranked films exhibited high cosine similarity scores, indicating strong vector space proximity and consistent user engagement patterns. Visual exploration of the data confirmed that the similarity-based approach captured meaningful behavioral relationships.

Conclusions

The findings show that a simple similarity-based collaborative filtering model can effectively identify related movies without complex model architectures. Even with lightweight feature engineering, the system generated relevant recommendations that mirror typical user preferences. This demonstrates the practicality of similarity-based methods for scalable and interpretable recommendation tasks, and highlights opportunities for future extensions using hybrid or embedding-based models.

Keywords

Recommender Systems; Collaborative Filtering; Similarity Metrics; Predictive Modeling; Behavioral Analytics; Personalization; Decision Intelligence

Introduction

Recommender systems have evolved into fundamental components of modern decision making, driving personalization and consumer engagement across multiple industries. From Amazon to Netflix, leading global enterprises rely on statistically driven recommendation algorithms to interpret customer behavior and tailor digital experiences accordingly.1 By leveraging user-item interaction data, these systems employ advanced techniques to predict future user preferences and optimize product exposure.

Empirical evidence underscores their economic significance. A McKinsey study estimates that nearly 35% of Amazon’s total sales are directly attributable to its recommendation engine.2 Amazon’s multilayered deployment of recommender algorithms, integrated across browsing, search, and checkout environments, has redefined the e-commerce experience and raised competitive barriers for market entrants.3 Firms that systematically apply data-driven recommender methodologies exhibit higher conversion efficiency, greater customer lifetime value, and stronger market differentiation.

Methods

Applications of recommender systems

Recommender systems are embedded across diverse digital ecosystems and serve as core engines for personalization and decision intelligence. They influence consumption patterns by filtering information and tailoring content according to user preferences. Key applications include:

  • E-commerce: Personalized product ranking, bundling, and cross-selling on retail platforms such as Amazon.3

  • News and Publishing: Dynamic content curation based on reading frequency, dwell time, and topical affinity.

  • Music and Podcasts: Platforms such as Spotify employ collaborative filtering and similarity models to recommend audio tracks and playlists aligned with listener preferences.

  • Video Streaming: Netflix and YouTube apply collaborative filtering to predict viewing patterns, optimize watch next queues, and enhance user retention.4

  • Social Media: Systems on platforms like Instagram and Facebook infer interest clusters, enabling targeted recommendations and advertising.

  • Travel and Hospitality: TripAdvisor and related services recommend destinations and accommodations based on spatial, behavioral, and preference proximity.

The economic impact of recommender engines is significant. For instance, Netflix offered a $1 million prize for a model achieving a 10% reduction in root-mean-squared error (RMSE) relative to its production algorithm.5 Such incentives underscore the analytical rigor and commercial value associated with advancing recommender methodologies.

Conceptual framework

Recommender systems are analytical engines that identify and suggest products or services aligned with user preferences. By analyzing interaction patterns and behavioral signals, these systems infer latent interests and generate personalized recommendations tailored to individual consumption profiles. For example, a user who consistently engages with the horror genre in film platforms may receive suggestions for additional horror titles, thereby enhancing engagement and increasing platform retention.

Operational basis

Recommender systems learn from observed interactions by modeling the relationships between users and items. These relational structures form the foundation of preference inference and prediction. Three primary relationship types drive these systems:

User-Item Relationship: User-item preference data forms the core of recommendation models. For example, a user who frequently purchases books on Amazon will receive suggestions for similar or complementary books. Likewise, a user who repeatedly purchases beauty products will be recommended related cosmetic items according to their purchase profile.

Item-Item Relationship: Item similarity is derived from co-engagement patterns or shared attributes. Consider a viewer who watches Superman. The system may recommend Aquaman due to shared characteristics within the DC Comics universe. When similarity is computed from item attributes rather than interaction history, this mechanism can also help in cold-start situations for new items, where similarity to known items accelerates exposure.

User-User Relationship: Users with similar historical patterns can guide recommendations for one another. For instance, if two readers have both engaged deeply with the Harry Potter series, and one has also read The Lord of the Rings, the system can recommend the latter title to the other user. This process is valuable in early-stage engagement when a user is exploring a new category. The conceptual structure of similarity signals in collaborative filtering is illustrated in Figure 1.


Figure 1. Conceptual relationship structure of a recommender system.

This diagram illustrates the foundational similarity relationships used in collaborative filtering, including user–user similarity, item–item similarity, and user–item interactions.

Collaborative filtering

Collaborative filtering (CF) operates on the principle that users who have exhibited similar preferences in the past are likely to share comparable interests in the future.6 CF models leverage observed interactions (such as ratings, clicks, or purchase histories) to infer latent preference structures without requiring explicit content features.

Two primary CF paradigms exist: user-user filtering and item-item filtering. In user-user CF, recommendations are derived by identifying users with similar historical engagement patterns and estimating the target user’s interest based on their neighbors’ preferences. Conversely, item-item CF examines correlations among items; a user who interacts with an item is recommended other items that exhibit high similarity to it.7

To generate meaningful recommendations, collaborative filtering relies on the computation of pairwise similarity scores. These metrics quantify how strongly two users or two items align based on observed data structures. Common similarity measures include:

  • Jaccard Similarity: Measures the ratio of shared items or interactions to the union of items across users, capturing the degree of overlap.

  • Euclidean Distance: Computes geometric distance between rating vectors, reflecting dissimilarity based on absolute deviations.

  • Cosine Similarity: Evaluates the cosine of the angle between two high-dimensional vectors, emphasizing directional alignment rather than magnitude differences. This metric is particularly effective in sparse rating matrices.

By applying these metrics, CF systems estimate preference scores and rank items to deliver personalized recommendations. The interaction between users and items within a collaborative filtering model is illustrated in Figure 2.


Figure 2. Conceptual illustration of user–user and item–item relationships in a collaborative filtering framework.

This diagram depicts how similarity is computed from user–user interactions and item–item rating patterns, forming the basis of collaborative filtering prediction.

The user-based collaborative filtering prediction is formally defined in Equation 1, which models the expected rating as a similarity-weighted adjustment of a user’s baseline preference.

(1)
$$\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v \in N(u)} \mathrm{sim}(u,v)\,(r_{v,i} - \bar{r}_v)}{\sum_{v \in N(u)} |\mathrm{sim}(u,v)|}$$

Notation summary

  • $\hat{r}_{u,i}$ - Predicted rating of user u for item i

  • $\bar{r}_u$ - Average rating of user u

  • $N(u)$ - Set of neighboring users similar to u

  • $\mathrm{sim}(u,v)$ - Similarity between users u and v

  • $r_{v,i}$ - Rating given by user v to item i

  • $\bar{r}_v$ - Average rating of user v

Interpretation

This formulation estimates the unknown rating $\hat{r}_{u,i}$ as the user’s baseline preference ($\bar{r}_u$) plus a weighted average of rating deviations from similar users, where the weights correspond to pairwise similarity scores. In other words, users with greater similarity to u exert stronger influence on the prediction. This represents the foundational model for user-based collaborative filtering. The same logic extends to item-based collaborative filtering by interchanging the user and item indices, allowing the system to infer preferences by examining relationships among items rather than users.
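To make Equation 1 concrete, the following is a minimal sketch rather than the authors' implementation. The function name `predict_user_based`, the toy rating matrix, and the similarity values are all hypothetical, and the neighborhood N(u) is assumed to be every other user who has rated the target item.

```python
import numpy as np

def predict_user_based(ratings, sim, u, i):
    """Predict user u's rating of item i following Equation 1.

    ratings: 2-D array (users x items) with np.nan for unrated entries.
    sim:     2-D array (users x users) of pairwise user similarities.
    """
    user_means = np.nanmean(ratings, axis=1)  # average rating of each user
    # Assumption: N(u) is every other user who has rated item i.
    neighbours = [v for v in range(ratings.shape[0])
                  if v != u and not np.isnan(ratings[v, i])]
    denom = sum(abs(sim[u, v]) for v in neighbours)
    if denom == 0:
        return user_means[u]  # no usable neighbours: fall back to the baseline
    num = sum(sim[u, v] * (ratings[v, i] - user_means[v]) for v in neighbours)
    return user_means[u] + num / denom

# Hypothetical toy data: 3 users, 3 items; user 0 has not rated item 2.
ratings = np.array([
    [4.0, 2.0, np.nan],
    [5.0, 3.0, 4.0],
    [2.0, 4.0, 1.0],
])
sim = np.array([
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.1],
    [0.2, 0.1, 1.0],
])
prediction = predict_user_based(ratings, sim, u=0, i=2)
print(round(prediction, 3))
```

In this toy case user 1 rated item 2 exactly at their own average, so only user 2's below-average rating pulls the prediction slightly under user 0's baseline of 3.0.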

Item-based collaborative filtering rating prediction

The corresponding item-based prediction model is expressed in Equation 2, where similarity among items guides the recommendation process.

(2)
$$\hat{r}_{u,i} = \frac{\sum_{j \in N(i)} \mathrm{sim}(i,j) \cdot r_{u,j}}{\sum_{j \in N(i)} |\mathrm{sim}(i,j)|}$$

Notation summary

  • $\hat{r}_{u,i}$ - Predicted rating/user preference score for user u on item i

  • $N(i)$ - Set of items most similar to item i

  • $\mathrm{sim}(i,j)$ - Similarity between item i and item j

  • $r_{u,j}$ - Rating/interaction score of user u for item j

Interpretation

In this formulation, recommendations are generated based on the similarity among items that the user has already interacted with. Unlike the user-based method, which compares users to one another, item-based collaborative filtering compares items using shared patterns of user engagement. This approach is computationally efficient and widely adopted in large-scale systems, such as Amazon’s item-to-item recommendation engine.
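Equation 2 can be sketched in a few lines. This is an illustrative example only: `predict_item_based`, the neighborhood size `k`, and the toy similarity matrix are assumptions introduced here, and N(i) is taken to be the k most similar items that the user has already rated.

```python
import numpy as np

def predict_item_based(user_ratings, item_sim, i, k=2):
    """Predict a user's score for item i following Equation 2.

    user_ratings: 1-D array of one user's ratings (np.nan where unrated).
    item_sim:     2-D array (items x items) of pairwise item similarities.
    k:            neighbourhood size (an illustrative choice).
    """
    rated = [j for j in range(len(user_ratings))
             if j != i and not np.isnan(user_ratings[j])]
    # Assumption: N(i) is the k rated items most similar to item i.
    neighbours = sorted(rated, key=lambda j: item_sim[i, j], reverse=True)[:k]
    denom = sum(abs(item_sim[i, j]) for j in neighbours)
    if denom == 0:
        return float("nan")  # nothing to base a prediction on
    return sum(item_sim[i, j] * user_ratings[j] for j in neighbours) / denom

# Hypothetical toy data: the user has not rated item 2.
user_ratings = np.array([5.0, 3.0, np.nan, 1.0])
item_sim = np.array([
    [1.0, 0.3, 0.9, 0.1],
    [0.3, 1.0, 0.6, 0.2],
    [0.9, 0.6, 1.0, 0.4],
    [0.1, 0.2, 0.4, 1.0],
])
score = predict_item_based(user_ratings, item_sim, i=2, k=2)
print(round(score, 2))  # (0.9*5 + 0.6*3) / (0.9 + 0.6) = 4.2
```

Because the item-item similarity matrix can be precomputed offline, only this weighted average has to be evaluated at request time, which is one reason the approach scales well.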

Similarity metrics

Jaccard similarity

Jaccard Similarity measures the degree of overlap between two users or two items based on shared interactions. It is defined as the ratio of the intersection of item sets to their union, producing values between 0 and 1. A higher score indicates greater similarity. The relationship between intersection and union in Jaccard similarity is illustrated in Figure 3.


Figure 3. Illustration of Jaccard similarity showing intersection versus union of item sets.

This diagram visualizes how Jaccard similarity is computed by comparing the overlap between two sets (intersection) relative to their combined unique elements (union). It demonstrates how shared items between Movie A and Movie B contribute to their similarity score.

The Jaccard similarity formulation is shown in Equation 3.

(3)
$$\mathrm{Jaccard}(A,B) = \frac{|A \cap B|}{|A \cup B|}$$

where:

  • $A$ - Set of items associated with user or item A

  • $B$ - Set of items associated with user or item B

  • $|A \cap B|$ - Number of items common to both sets

  • $|A \cup B|$ - Total number of unique items across both sets
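As a small illustration of Equation 3, the sketch below computes Jaccard similarity on two hypothetical sets of viewer identifiers. The function name `jaccard` and the sample sets are assumptions made for the example, as is the convention of returning 0.0 when both sets are empty.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(a & b) / len(union)

# Hypothetical viewers of two movies.
movie_a_viewers = {"u1", "u2", "u3", "u4"}
movie_b_viewers = {"u3", "u4", "u5"}

jaccard_score = jaccard(movie_a_viewers, movie_b_viewers)
print(jaccard_score)  # 2 shared viewers out of 5 unique viewers = 0.4
```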

Euclidean distance

Euclidean distance measures the geometric distance between two users or items in a multidimensional rating space. Unlike correlation-based similarity measures, Euclidean distance represents dissimilarity, where a smaller value indicates stronger similarity between two profiles. As shown in Figure 4, Euclidean distance captures the dissimilarity between items based on squared rating differences.


Figure 4. Euclidean distance representation between two items based on user interaction patterns.

This diagram illustrates how Euclidean distance quantifies dissimilarity between items by measuring squared rating or interaction differences across shared users. A larger distance indicates more divergent user engagement patterns between Movie A and Movie B.

For two items A and B, the distance is computed using Equation 4.

(4)
$$\mathrm{dist}(A,B) = \sqrt{\sum_{u=1}^{n} (r_{uA} - r_{uB})^2}$$
where:
  • $r_{uA}$ - Rating or interaction value of user u for item A

  • $r_{uB}$ - Rating or interaction value of user u for item B

  • $n$ - Number of users who interacted with either item

For interpretability, practitioners sometimes use the squared distance form (Equation 5):

(5)
$$\mathrm{dist}^2(A,B) = \sum_{u=1}^{n} (r_{uA} - r_{uB})^2$$

Euclidean distance performs effectively in low-dimensional or moderately sized datasets, particularly when only a limited number of overlapping users or items exist.
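The two forms in Equations 4 and 5 can be compared directly in code. The sketch below assumes complete rating vectors with no missing values; the function name `euclidean_distance` and the sample ratings are hypothetical.

```python
import numpy as np

def euclidean_distance(r_a, r_b):
    """Return (distance, squared distance) between two rating vectors."""
    r_a = np.asarray(r_a, dtype=float)
    r_b = np.asarray(r_b, dtype=float)
    squared = float(np.sum((r_a - r_b) ** 2))  # Equation 5
    return squared ** 0.5, squared             # Equation 4 is its square root

# Hypothetical ratings from the same four users for two movies.
movie_a = [5, 3, 4, 1]
movie_b = [4, 1, 4, 2]

dist, dist_sq = euclidean_distance(movie_a, movie_b)
print(dist_sq)          # 1 + 4 + 0 + 1 = 6.0
print(round(dist, 3))
```

Both forms rank item pairs identically, since the square root is monotonic; the squared form simply skips one operation.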

Cosine similarity

Cosine similarity measures the cosine of the angle between two vectors representing user or item profiles in a multidimensional space. It captures how closely aligned two entities are in direction, irrespective of magnitude, making it highly suitable for sparse, high-dimensional datasets such as movie ratings or user-item interaction logs. A vector-space interpretation of cosine similarity is illustrated in Figure 5, showing how the angle between two item vectors determines their similarity.


Figure 5. Cosine similarity representation showing the angle θ between two movie vectors.

This diagram visualizes cosine similarity in a vector space, illustrating how the angle between item vectors (Movie A and Movie B) determines similarity. A smaller angle indicates stronger alignment between rating patterns, while a larger angle indicates divergence.

For two items A and B, cosine similarity is defined in Equation 6.

(6)
$$\mathrm{sim}(A,B) = \cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|}$$
where A and B represent item (or user) vectors, $A \cdot B$ denotes their dot product, and $\|A\|$ and $\|B\|$ denote their vector magnitudes.

The cosine similarity score ranges between -1 and 1:

  • θ = 0° (same direction) ⇒ similarity = 1

  • θ = 90° (orthogonal) ⇒ similarity = 0

  • θ = 180° (opposite direction) ⇒ similarity = -1

The directional interpretation of cosine angles is summarized in Table 1.

Table 1. Cosine similarity interpretation.

θ       Direction     cos(θ)
0°      Same          1
90°     Orthogonal    0
180°    Opposite      -1
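The three cases in Table 1 can be checked numerically with a short sketch of Equation 6. The function name `cosine_similarity` and the two-dimensional example vectors are hypothetical, and returning 0.0 for an all-zero vector is an assumed convention rather than part of the definition.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (Equation 6)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    if norm == 0:
        return 0.0  # assumed convention: the angle is undefined for a zero vector
    return float(np.dot(a, b) / norm)

same = cosine_similarity([1, 2], [2, 4])          # same direction, different magnitude
orthogonal = cosine_similarity([1, 0], [0, 3])    # perpendicular vectors
opposite = cosine_similarity([1, 2], [-1, -2])    # opposite direction

print(round(same, 6), round(orthogonal, 6), round(opposite, 6))
```

The first pair differs in magnitude by a factor of two yet still scores approximately 1, which is the direction-over-magnitude property described above.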

Building a recommender system (IMDB dataset)

Data source

The recommender framework was developed using the IMDB Extensive Dataset available on Kaggle.8 This dataset provides comprehensive metadata such as movie titles, genres, release information, production studios, and user-generated ratings, making it suitable for collaborative filtering research.

The complete implementation, including preprocessing scripts and model code, is publicly available in the author’s Zenodo repository.13

Data preparation

Two primary data files, one containing movie attributes and another containing user ratings, were merged using the movie identifier as the key field. Missing or inconsistent observations were removed to reduce noise and minimize sparsity in the user-item rating matrix.

Categorical attributes (e.g., language, genre) were encoded using binary indicator variables. For instance, English language films were encoded as 1, and non-English films as 0; similarly, each genre category was assigned an individual binary flag. After data cleaning and dimensionality reduction, the final working dataset consisted of approximately 65,000 observations and 80 predictor variables. Trends in movie ratings and review volume over time are summarized in Figure 6, which illustrates how viewer engagement has evolved across decades. Demographic differences in genre preferences are summarized in Figure 7, which compares average movie ratings across four major age groups.


Figure 6. Average rating and number of reviews per year in the IMDB dataset.

This scatter plot shows how movie popularity (measured by number of reviews) and average ratings evolve over time. Darker points represent higher average ratings, highlighting trends in viewer engagement across different decades.


Figure 7. Average movie rating per genre across age demographics.

This figure compares how different age groups (0–18, 18–30, 30–45, and 45+) rate movies across various genres. Each panel represents one demographic segment, showing variations in genre preferences and average rating patterns across age groups.
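The merging and encoding steps described in this section might look like the following pandas sketch. It is not the authors' preprocessing script: the two toy tables, their column names (`movie_id`, `language`, `genre`, `avg_vote`, `votes`), and the comma-separated genre format are assumptions made for illustration and may not match the Kaggle schema.

```python
import pandas as pd

# Hypothetical toy tables standing in for the movie and rating files.
movies = pd.DataFrame({
    "movie_id": ["m1", "m2", "m3"],
    "title": ["Movie A", "Movie B", "Movie C"],
    "language": ["English", "French", "English"],
    "genre": ["Horror, Thriller", "Drama", None],  # m3 has a missing genre
})
ratings = pd.DataFrame({
    "movie_id": ["m1", "m2", "m3"],
    "avg_vote": [7.6, 6.1, 5.4],
    "votes": [1200, 300, 50],
})

# Merge on the movie identifier, then drop rows with missing values.
merged = movies.merge(ratings, on="movie_id", how="inner")
merged = merged.dropna().reset_index(drop=True)

# Binary language flag: 1 for English, 0 otherwise.
merged["is_english"] = (merged["language"] == "English").astype(int)

# One binary indicator column per genre.
genre_flags = merged["genre"].str.get_dummies(sep=", ")

features = pd.concat(
    [merged.drop(columns=["language", "genre"]), genre_flags], axis=1
)
print(features)
```

Here the row with the missing genre is dropped, and the remaining movies gain one 0/1 column per genre alongside the language flag.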

Model framework

The recommender model is built on collaborative filtering, using cosine similarity as the primary similarity metric. Each movie is represented as a vector in a multidimensional feature space derived from metadata and user-rating attributes.

Model workflow

  • 1. Data Integration: Merge movie metadata and user-rating tables.

  • 2. Pre-processing: Remove missing values, encode categorical variables, normalize numeric features.

  • 3. Feature Engineering: Retain key predictors such as release year, user vote counts, genre indicators, and language attributes.

  • 4. Similarity Computation: Compute cosine similarity across the item-item matrix.

  • 5. Recommendation Generation: Rank all movies by similarity and return Top-N recommendations.

System architecture

The end-to-end workflow of the recommender system is illustrated in Figure 8, showing the stages of data collection, preprocessing, model computation, and recommendation generation.


Figure 8. Recommender system workflow from data ingestion to recommendation generation.

This diagram outlines the end-to-end pipeline used in the recommender system implementation, including data collection and merging, preprocessing, construction of item and user feature matrices, cosine similarity computation, and generation of Top-N recommended items.

Implementation summary

The Python 3.10 implementation uses pandas, numpy, and scikit-learn. Full code is available in the GitHub repository.9

Execution steps

  • 1. Construct the movie-feature matrix.

  • 2. Compute the cosine similarity matrix.

  • 3. Develop a scoring function to retrieve Top-N matches.

  • 4. Validate recommendations using IMDB metadata.
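The execution steps above can be sketched end to end on a tiny example. This is a hedged illustration, not the code from the repository: the helper names `cosine_similarity_matrix` and `top_n`, the placeholder titles, and the four-column feature layout are all hypothetical, and the similarity matrix is computed with NumPy here rather than with the library calls the authors may have used.

```python
import numpy as np

def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between the rows of a feature matrix."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = features / norms
    return unit @ unit.T

def top_n(seed_title, titles, sim_matrix, n=5):
    """Return the n titles most similar to the seed, excluding the seed itself."""
    idx = titles.index(seed_title)
    order = np.argsort(-sim_matrix[idx])  # most similar first
    ranked = [(titles[j], float(sim_matrix[idx, j])) for j in order if j != idx]
    return ranked[:n]

# Step 1: hypothetical movie-feature matrix with columns
# [is_horror, is_comedy, scaled release year, scaled vote count].
titles = ["Seed horror", "Horror B", "Comedy C", "Horror D"]
features = [
    [1, 0, 0.8, 0.9],
    [1, 0, 0.9, 0.7],
    [0, 1, 0.5, 0.4],
    [1, 0, 0.2, 0.1],
]

# Step 2: cosine similarity matrix.
sim_matrix = cosine_similarity_matrix(features)

# Step 3: Top-N retrieval for a seed title.
recommendations = top_n("Seed horror", titles, sim_matrix, n=2)
for title, score in recommendations:
    print(f"{title}: {score:.3f}")
```

For a full-size feature matrix, scikit-learn's `sklearn.metrics.pairwise.cosine_similarity` computes the same matrix and also accepts sparse input.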

Illustrative output

To demonstrate system behavior, Saw (2004) was selected as the seed movie. The model returned strongly aligned horror titles such as The Silent Scream and Catacombs, indicating good thematic coherence.

Results

The recommender system was evaluated on a curated IMDB movie dataset using cosine-similarity-based collaborative filtering. The evaluation focuses on whether suggested movies align with behavioral and thematic tendencies observed in the user’s previously rated or viewed content.

Recommendation output

Based on the input movie Saw (2004), the system retrieved the top five movies exhibiting the highest cosine-similarity values. The recommendation outcomes are shown in Table 2.

Table 2. Recommendation output for seed movie “Saw (2004)”.

Seed movie    Recommended title             Cosine similarity
Saw (2004)    The Silent Scream (2005)      0.987
Saw (2004)    Catacombs (2007)              0.984
Saw (2004)    House of 9 (2005)             0.973
Saw (2004)    The Human Centipede (2009)    0.968
Saw (2004)    Hostel (2005)                 0.962

Interpretation of findings

The recommender system successfully proposed closely related horror and psychological-thriller films, such as The Silent Scream and Catacombs, for a viewer who watched Saw. This demonstrates strong genre coherence and relevance in the generated suggestions.

The recommendations exhibit high internal consistency, indicating that cosine similarity effectively captures vector-space proximity between movies with comparable thematic and stylistic characteristics. The similarity scores, each approaching 1.0, reflect minimal angular separation between vectors, implying substantial shared audience engagement patterns. This alignment between the watched movie and the recommended titles is illustrated in Figure 9.


Figure 9. Watched versus recommended movies generated by the recommender system for a sample user.

This diagram contrasts a user’s previously watched movie (“Saw”, 2004) with the system-generated recommendations. The suggested movies share strong thematic and stylistic similarities with the watched title, illustrating how cosine similarity captures genre alignment and audience-engagement patterns in the model’s output.
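To put the phrase "minimal angular separation" in concrete terms, the similarity scores reported in Table 2 can be converted back to angles with the inverse cosine. The snippet below is a small illustrative calculation using only those reported scores; the variable names are arbitrary.

```python
import math

# Cosine similarity scores as reported in Table 2.
scores = {
    "The Silent Scream (2005)": 0.987,
    "Catacombs (2007)": 0.984,
    "House of 9 (2005)": 0.973,
    "The Human Centipede (2009)": 0.968,
    "Hostel (2005)": 0.962,
}

# theta = arccos(similarity), converted from radians to degrees.
angles = {title: math.degrees(math.acos(s)) for title, s in scores.items()}

for title, angle in angles.items():
    print(f"{title}: {angle:.1f} degrees")
```

All five recommendations lie within roughly 9 to 16 degrees of the seed vector, compared with 90 degrees for unrelated (orthogonal) items.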

Practical implications

The results indicate that a collaborative-filtering approach, supported by structured metadata and a lightweight similarity metric, can approximate human perception of content relatedness.

The framework is scalable and applicable to multiple domains beyond movies, including music, e-commerce, streaming services, and digital platforms. Item metadata (such as brand, category, or stylistic attributes) can seamlessly replace movie features in domain-specific deployments.

These findings reinforce the viability of similarity-based collaborative filtering as a practical and high-interpretability recommendation strategy for industrial systems.

Discussion

The evaluation results demonstrate that cosine-similarity-based collaborative filtering can reproduce genre-consistent and thematically aligned recommendations using a relatively simple feature representation. The close alignment between the recommended titles and the seed movie Saw (2004) indicates that vector-space similarity captures meaningful behavioral patterns that extend beyond explicit metadata. This suggests that item-item proximity in rating space can implicitly encode narrative style, pacing characteristics, and audience affinity, even when these attributes are not explicitly modeled.

These findings are consistent with prior work that has shown the effectiveness of item-item collaborative filtering in sparse environments.3 Similar to observations by Bobadilla et al.,6 the model benefits from the fact that cosine similarity emphasizes directional alignment rather than absolute magnitude, making it well-suited for datasets such as IMDB where users interact with only a small fraction of available items. The high internal consistency of similarity scores supports prior evidence that neighborhood-based methods can be competitive benchmarks against more complex latent factor models when interpretability and computational efficiency are required.

The results also highlight the practical utility of similarity-based recommenders as scalable and domain-agnostic tools. Because the model relies on structural patterns in user-item data, it can be deployed in applications such as e-commerce, music streaming, news personalization, or digital media platforms with minimal architectural modifications. The workflow demonstrated here serves as a transparent baseline system that can be implemented rapidly while still offering actionable personalization insights.

Nevertheless, the study also exposes constraints associated with neighborhood-based collaborative filtering. The observed performance is influenced by sparsity in user rating behavior, and the model does not explicitly correct for individual rating bias, which can skew similarity computations. Additionally, similarity-based recommenders inherently struggle with the cold-start problem for new items or users lacking historical data. These limitations motivate further research into hybrid systems that integrate metadata-driven signals or latent embedding methods to enhance robustness.
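One common way to reduce the rating-bias effect mentioned above is to subtract each user's mean rating before computing item similarities. The sketch below illustrates that idea on a hypothetical rating matrix; it is not part of the system evaluated in this study, and the helper `cosine` and the choice to treat unrated entries as zero after centering are assumptions of the example.

```python
import numpy as np

# Hypothetical user-item ratings (rows = users, columns = items); nan = unrated.
ratings = np.array([
    [5.0, 4.0, np.nan],   # a generous rater
    [2.0, 1.0, 2.0],      # a harsh rater
    [4.0, np.nan, 5.0],
])

# Subtract each user's own mean so ratings become deviations from their baseline.
user_means = np.nanmean(ratings, axis=1, keepdims=True)
centered = np.nan_to_num(ratings - user_means, nan=0.0)  # unrated -> neutral 0

def cosine(a, b):
    """Cosine similarity with an assumed 0.0 result for zero vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Item-item similarity on mean-centered columns.
item_sim_01 = cosine(centered[:, 0], centered[:, 1])
print(centered)
print(round(item_sim_01, 3))
```

After centering, a 4 from the generous rater and a 1 from the harsh rater both register as below-average opinions, so the two users no longer appear to disagree simply because they use the rating scale differently.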

Overall, the results reaffirm that lightweight similarity-based approaches remain powerful tools for recommendation tasks, especially when transparency and operational simplicity are prioritized. The system presented here provides a strong foundation upon which more advanced or domain-specific enhancements can be developed.

Conclusions

This study presented a comprehensive overview of recommender systems, their foundational principles, and their role in modern digital ecosystems. We outlined the conceptual framework of user-item, item-item, and user-user relationships that underpin recommendation algorithms, followed by collaborative filtering fundamentals and similarity measures such as Jaccard, Euclidean, and Cosine metrics.

Using publicly available IMDB data, a case study demonstrated the practical implementation of these concepts. The resulting system successfully identified and ranked movies similar to a user-provided title, confirming the ability of a similarity-based model to replicate genre associations through vector-space analysis. The findings reinforce that even a simple similarity-driven framework can effectively model user preferences and generate contextually relevant recommendations.

Future extensions could incorporate hybrid architectures that leverage both collaborative and content-based signals, as well as temporal and contextual features to improve personalization accuracy and robustness.

Limitations and future work

While the proposed framework effectively demonstrates similarity-based collaborative filtering, several methodological limitations exist. First, the use of linear similarity assumptions may oversimplify complex nonlinear preference patterns observed in real-world behavior.10 Second, reliance on pairwise similarity metrics restricts the model’s ability to learn latent representations of users and items.11 Third, the absence of normalization for individual bias and variance may introduce skewness in affinity scoring, particularly for users with extreme rating tendencies or highly popular items. Additionally, the framework does not address the “cold start” problem associated with new users or items that lack historical interaction data.12 Future research could address these limitations by employing matrix factorization or neural embedding techniques,10 probabilistic models to capture uncertainty and behavior variability,12 and hybrid recommenders that fuse collaborative filtering with contextual and content-aware learning to improve scalability and generalizability.

Software availability

Source code available from: https://github.com/vinoalles/Recommender_System

Archived source code available from: https://doi.org/10.5281/zenodo.17822412

License: MIT License (OSI-approved)

Ethics and consent

No human subjects, private data, or biological specimens were involved.

how to cite this article
Gunasekaran V and Elango I. Recommender Systems: A Data-Driven Framework for Personalized Decision Intelligence [version 1; peer review: awaiting peer review]. F1000Research 2025, 14:1409 (https://doi.org/10.12688/f1000research.174439.1)