🎵 RateYourMusic Album Recommender System 🎵
Unstructured Data · Web Scraping · Text Analytics · Recommender Systems · NLP Similarity · Sentiment · Embeddings
Project Type: Content-Based Recommender (TF-IDF + Embeddings + Sentiment-Aware Ranking)
Tools: Python, Pandas, NLTK, Scikit-learn, VADER Sentiment, spaCy/Word2Vec Embeddings, Jupyter
Data: Scraped RateYourMusic album reviews + ratings
Outputs: Top-3 recommendations, Top-20 contenders, TF-IDF vs Embedding comparison
Project Overview
This project explores how unstructured text data, specifically user-written album reviews, can be transformed into structured, numerical representations and used to build a personalized recommendation system.
Using thousands of reviews scraped from RateYourMusic, I developed a pipeline that converts raw language into meaningful signals such as attributes, sentiment, and semantic similarity, ultimately enabling album recommendations based on a user’s desired vibe (e.g., romantic, moody, sad).
The core goal was not simply to recommend popular albums, but to recommend albums that match how a user wants to feel, using the language real listeners use to describe music.
Business Question
How can crowdsourced review text from RateYourMusic be used to recommend albums based on a listener’s desired attributes (e.g., genre + instrumentation + vibe), and how do recommendations differ when using TF-IDF versus embedding-based similarity?
Data & Problem Setup
Data Sources
I started by building a robust web scraper to collect over 8,000 album reviews from RateYourMusic.com, a large crowdsourced music database where fans share opinions and reviews across hundreds of genres.
I turned the review pages into a structured dataset containing:
- Album name
- Artist name
- Review text
- User rating (when available)
This mirrors real-world unstructured data work: the dataset had to be assembled from raw webpages before any modeling could begin.
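As a rough illustration of the scraping step, here is a minimal sketch using requests and BeautifulSoup. The URL pattern and CSS selectors (`div.review`, `a.album`, etc.) are hypothetical placeholders, not RateYourMusic's actual markup, and any real scraper should respect the site's terms and rate limits.

```python
import requests
from bs4 import BeautifulSoup

def scrape_album_reviews(url: str) -> list[dict]:
    """Pull (album, artist, review, rating) rows from one review page."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    rows = []
    for review in soup.select("div.review"):           # hypothetical selector
        rating = review.select_one("span.rating")      # rating is optional
        rows.append({
            "album": review.select_one("a.album").get_text(strip=True),
            "artist": review.select_one("a.artist").get_text(strip=True),
            "review_text": review.select_one("p.review_body").get_text(strip=True),
            "rating": rating.get_text(strip=True) if rating else None,
        })
    return rows
```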
Real-World Challenge
Crowdsourced review text varies widely:
- Inconsistent vocabulary
- Subjective tone
- Different writing styles
- Highly variable review lengths
This required transforming unstructured language into structured representations.


Data Characteristics & Modeling Implications
RateYourMusic reviews vary widely in both rating behavior and writing style. Ratings tend to cluster toward higher scores, meaning most albums are reviewed positively. Because of this, ratings alone are not enough to distinguish stylistic differences between albums, making the review text itself the most important signal for recommendations.
Review length also varies significantly. Some users write short reactions, while others provide long, detailed descriptions of sound, mood, and instrumentation. To make the data more consistent, review text was cleaned and aggregated at the album level before modeling. This reduces noise from individual writing styles and helps the system focus on shared descriptive language across multiple listeners.
Text Cleaning & Normalization (Unstructured → Structured)
To make language comparable across thousands of reviews, text was normalized using:
- lowercasing
- punctuation / non-letter removal
- contraction expansion (“don’t” → “do not”)
- English stopword removal plus music-domain filler removal (album, track, release, etc.)
- lemmatization (vocals → vocal, lyrics → lyric) to merge word variants
- unigrams + bigrams to preserve meaningful multi-word concepts (e.g., heavy_metal, post_punk)
Without normalization, the same concept gets split into multiple tokens, weakening frequency analysis and similarity scoring.
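A minimal sketch of this normalization step, assuming NLTK is installed and its stopword/WordNet data downloaded. The contraction map and domain filler set here are small illustrative subsets; the full lists used in the project are larger.

```python
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time setup: nltk.download("stopwords"); nltk.download("wordnet")
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}  # subset
DOMAIN_FILLER = {"album", "track", "release", "song", "record"}         # subset
STOPWORDS = set(stopwords.words("english")) | DOMAIN_FILLER
lemmatizer = WordNetLemmatizer()

def normalize_review(text: str) -> list[str]:
    """Lowercase, expand contractions, keep alphabetic tokens, drop stopwords, lemmatize."""
    text = text.lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    tokens = re.findall(r"[a-z]+", text)
    tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in STOPWORDS]
    # Append bigrams so multi-word concepts survive as single tokens
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

print(normalize_review("Don't miss the heavy metal guitar riffs on this album!"))
# ['heavy', 'metal', 'guitar', 'riff', 'heavy_metal', 'metal_guitar', 'guitar_riff']
```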
Attribute Discovery
Facets / Lexicons (Genre, Instrument, Mood)
One challenge in music text analysis is that raw word frequency often surfaces generic tokens (“sound,” “album,” “feel”) rather than meaningful stylistic descriptors. To create interpretable recommendations, review text was transformed into structured attributes representing how listeners describe music.
Instead of selecting features purely from statistical frequency, tokens were grouped into higher-level conceptual facets:
- Genres: rock, metal, punk, pop, ambient, electronic, …
- Instruments: guitar, drums, bass, vocals, synth, …
- Moods/Descriptors: aggressive, dark, melodic, gritty, raw, dreamy, …
This approach converts unstructured language into interpretable building blocks that align with how real listeners search for music.
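In code, the facets can be as simple as lexicon sets keyed by facet name. The vocabularies below are illustrative subsets of the full lexicons; `tag_facets` is a hypothetical helper showing how normalized tokens map onto facets.

```python
FACETS = {
    "genre": {"rock", "metal", "punk", "pop", "ambient", "electronic"},
    "instrument": {"guitar", "drum", "bass", "vocal", "synth"},
    "mood": {"aggressive", "dark", "melodic", "gritty", "raw", "dreamy"},
}

def tag_facets(tokens: list[str]) -> dict[str, set[str]]:
    """Map normalized review tokens onto the facet attributes they mention."""
    return {facet: vocab & set(tokens) for facet, vocab in FACETS.items()}

tokens = ["dark", "ambient", "synth", "texture", "dreamy"]
print(tag_facets(tokens))
# {'genre': {'ambient'}, 'instrument': {'synth'}, 'mood': {'dark', 'dreamy'}}
```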

Instrumentation and genre-related language appear most frequently in reviews, suggesting listeners often describe music through sound characteristics rather than technical production details. This validates using facet-based attribute queries, since common descriptive patterns emerge consistently across reviewers.
Lift / Co-Occurrence Check (Sanity Test)
Lift was used to verify that candidate attributes were co-mentioned more often than expected by chance, ensuring selected attributes naturally appear together in review language rather than being independently frequent.
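Concretely, lift(a, b) = P(a and b) / (P(a) · P(b)); values above 1 mean two attributes are co-mentioned more often than independence would predict. A minimal sketch on a toy album-level presence matrix (the real data covers thousands of albums):

```python
import pandas as pd

# 1 = attribute mentioned in an album's aggregated reviews (toy data)
df = pd.DataFrame({
    "guitar":    [1, 1, 0, 1, 0, 1],
    "distorted": [1, 1, 0, 0, 0, 1],
    "synth":     [0, 0, 1, 0, 1, 0],
})

def lift(a: str, b: str) -> float:
    """lift > 1: attributes co-occur more often than chance."""
    p_a, p_b = df[a].mean(), df[b].mean()
    p_ab = (df[a] & df[b]).mean()
    return p_ab / (p_a * p_b)

print(lift("guitar", "distorted"))  # 1.5 -> naturally co-mentioned
print(lift("guitar", "synth"))      # 0.0 -> never co-mentioned in the toy data
```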

Facet Co-Occurrence Validation
To ensure selected attributes reflected meaningful musical relationships rather than isolated keywords, I analyzed facet co-occurrence using lift-based normalization.
The heatmap shows how often attribute groups (genre, instrumentation, mood, etc.) appear together in review language. Strong co-occurrence patterns indicate that selected facets align with natural listening descriptors used by reviewers, validating the use of multi-attribute queries (e.g., genre + instrument + mood) in the recommendation system.
Recommendation Engine
A listener specifies 3 attributes (typically one from each facet: genre + instrument + mood). The system ranks albums by comparing the attribute query to each album’s review document.
Method 1: TF-IDF (Bag-of-Words) + Cosine Similarity
- Convert each album document into a TF-IDF vector
- Convert the user’s 3 attributes into a query vector
- Rank albums by cosine similarity(query, album)
Why it works: TF-IDF emphasizes terms that are distinctive to an album relative to the full corpus.
Strength: Best for exact intent and precise vocabulary matching.
Limitation: Can miss semantic matches when reviewers use different wording.
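A minimal sketch of this retrieval step with scikit-learn, assuming `album_docs` (one aggregated review document per album) has been prepared upstream; the documents here are toy stand-ins.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

album_docs = {
    "Album A": "dark ambient synth textures dreamy atmosphere",
    "Album B": "raw punk guitar aggressive drums gritty vocals",
    "Album C": "melodic guitar dreamy shoegaze atmospheric",
}

# Unigrams + bigrams, matching the normalization pipeline
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
doc_matrix = vectorizer.fit_transform(album_docs.values())

# The 3-attribute query is vectorized in the same TF-IDF space
query_vec = vectorizer.transform(["rock guitar dreamy"])
scores = cosine_similarity(query_vec, doc_matrix).ravel()

for album, score in sorted(zip(album_docs, scores), key=lambda x: -x[1]):
    print(f"{album}: {score:.3f}")
```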
Method 2: TF-IDF Similarity + Sentiment Evidence (Interpretability Layer)
Similarity alone doesn’t distinguish between positive and negative mentions of an attribute. To add context:
- split album text into sentences
- keep sentences that mention the chosen attributes
- score those sentences using VADER sentiment
- report positive vs. negative attribute mentions and an evidence snippet
Why it matters: prevents “false matches” where the attribute appears in a negative context and makes recommendations explainable.
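A sketch of the evidence layer using NLTK's VADER analyzer; `attribute_evidence` is a hypothetical helper, and the ±0.05 compound-score thresholds are VADER's conventional positive/negative cutoffs.

```python
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import sent_tokenize

# One-time setup: nltk.download("vader_lexicon"); nltk.download("punkt")
sia = SentimentIntensityAnalyzer()

def attribute_evidence(review_text: str, attributes: list[str]):
    """Score only the sentences that mention a queried attribute."""
    pos, neg, snippets = 0, 0, []
    for sentence in sent_tokenize(review_text):
        if any(attr in sentence.lower() for attr in attributes):
            compound = sia.polarity_scores(sentence)["compound"]
            pos += compound > 0.05
            neg += compound < -0.05
            snippets.append((compound, sentence))
    # Return counts plus the most positive sentence as the evidence snippet
    return pos, neg, max(snippets, default=(0, ""))[1]

text = "The guitar work is stunning and dreamy. Sadly the drums sound flat."
print(attribute_evidence(text, ["guitar", "dreamy"]))
```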
Method 3: Embeddings + Cosine Similarity (Semantic Retrieval)
To capture paraphrases and related concepts:
- represent album documents using dense embedding vectors
- represent the attribute query in the same embedding space
- rank albums by cosine similarity in embedding space
Why embeddings help for reviews: they capture meaning beyond exact words:
- “guitar riffs” ≈ “shredding”
- “spacey” ≈ “atmospheric”
- “dark” ≈ “moody”
This method provides more robust retrieval when vocabulary varies across reviewers; however, it can be less transparent than TF-IDF unless paired with evidence snippets.
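A minimal sketch of embedding retrieval with spaCy, assuming the `en_core_web_md` model (which ships word vectors) is installed via `python -m spacy download en_core_web_md`; the album documents are toy stand-ins.

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")

album_docs = {
    "Album A": "dark ambient synth textures and a dreamy atmosphere",
    "Album B": "raw punk guitar with aggressive drums and gritty vocals",
}

def embed(text: str) -> np.ndarray:
    """spaCy averages token vectors into a single document vector."""
    return nlp(text).vector

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

query = embed("rock guitar dreamy")
ranked = sorted(album_docs, key=lambda a: -cosine(query, embed(album_docs[a])))
print(ranked)
```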
Results (Top 3 + Top 20 Recommendations)
For each method, outputs include:
- Top 3 recommendations (final picks)
- Top 20 contenders (ranked shortlist for transparency)
Each recommendation table includes:
- similarity score (TF-IDF cosine or embedding cosine)
- sentiment context (positive/negative attribute mentions)
- average rating + number of reviews (stability/context)
- evidence snippet showing why the album matched
Why Top 20 matters: it makes the ranking interpretable and shows how close contenders compare to the final recommendations.

Model Comparison & Insights

To evaluate each approach, I queried all three recommendation systems using the same user request and compared the top three suggested albums. Each model produced noticeably different results, highlighting how retrieval strategy influences recommendations.
TF-IDF + Sentiment
The TF-IDF model relied heavily on keyword matching, which led to more genre-diverse recommendations. While technically aligned with the query vocabulary, some results (e.g., Ka and Wednesday) lacked conceptual alignment with the intended guitar-driven, atmospheric sound. This illustrates a common limitation of lexical retrieval: strong surface-level matches without deeper contextual understanding.
spaCy (Pre-trained Embeddings)
The embedding-based approach captured semantic relationships that keyword matching missed. Albums from A Winged Victory for the Sullen and other post-rock/ambient artists better reflected the intended “spacey” mood through broader conceptual similarity. Compared to TF-IDF, this model demonstrated improved understanding of vibe and stylistic meaning rather than strict word overlap.
Custom Word2Vec (Domain-Trained)
The domain-trained Word2Vec model produced the strongest results overall. By learning directly from music-review language, it generated higher-confidence similarity scores and more cohesive recommendations. Artists such as Thy Catafalque and A Winged Victory for the Sullen aligned closely with the query’s atmospheric and guitar-focused attributes, suggesting that domain-specific embeddings better capture nuanced musical descriptors.
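A sketch of the domain-training step with gensim, assuming tokenized review lists from the normalization step; the corpus below is a toy stand-in for the full 8,000+ reviews, and `doc_vector` is a hypothetical mean-pooling helper.

```python
import numpy as np
from gensim.models import Word2Vec

tokenized_reviews = [
    ["dark", "ambient", "synth", "dreamy", "texture"],
    ["raw", "punk", "guitar", "aggressive", "drum", "gritty"],
    ["melodic", "guitar", "dreamy", "atmospheric", "shoegaze"],
]

# Train a small domain model directly on music-review language
model = Word2Vec(sentences=tokenized_reviews, vector_size=100,
                 window=5, min_count=1, seed=42)

def doc_vector(tokens: list[str]) -> np.ndarray:
    """Average word vectors into a document/query vector."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0)

query = doc_vector(["guitar", "dreamy", "atmospheric"])
album = doc_vector(tokenized_reviews[2])
print(float(query @ album / (np.linalg.norm(query) * np.linalg.norm(album))))
```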
Why This Matters
This project demonstrates a realistic NLP recommender workflow:
- scrape and assemble unstructured text
- normalize and engineer linguistic features
- discover attributes directly from language
- compare exact-match and semantic retrieval
- add interpretability through sentiment and evidence
- evaluate tradeoffs between methods
The same framework generalizes to recommendation systems for books, movies, restaurants, products, and other review-driven domains.
Tools & Skills Demonstrated
- Web scraping and unstructured dataset creation
- NLP preprocessing (stopwords, lemmatization, n-grams)
- Attribute discovery and co-occurrence validation
- TF-IDF and cosine similarity retrieval
- Sentiment analysis for interpretability
- Embedding-based semantic similarity
- Transparent recommendation system design