🎵 RateYourMusic Album Recommender System 🎵
Unstructured Data · Web Scraping · Text Analytics · Recommender Systems · NLP Similarity · Sentiment · Embeddings
Project Type: Content-Based Recommender (TF-IDF + Embeddings + Sentiment-Aware Ranking)
Tools: Python, Pandas, NLTK, Scikit-learn, VADER Sentiment, spaCy/Word2Vec Embeddings, Jupyter
Data: Scraped RateYourMusic album reviews + ratings
Outputs: Top-3 recommendations, Top-20 contenders, TF-IDF vs Embedding comparison
Project Overview
This project explores how unstructured text data, specifically user-written album reviews, can be transformed into structured, numerical representations and used to build a personalized recommendation system.
Using thousands of reviews scraped from RateYourMusic, I developed a pipeline that converts raw language into meaningful signals such as attributes, sentiment, and semantic similarity, ultimately enabling album recommendations based on a user’s desired vibe (e.g., romantic, moody, sad).
The core goal was not simply to recommend popular albums, but to recommend albums that match how a user wants to feel, using the language real listeners use to describe music.
Business Question
How can crowdsourced review text from RateYourMusic be used to recommend albums based on a listener’s desired attributes (e.g., genre + instrumentation + vibe), and how do recommendations differ when using TF-IDF versus embedding-based similarity?
Data & Problem Setup
Data Sources
I started by building a robust web scraper to collect over 8,000 album reviews from RateYourMusic.com, a large crowdsourced music database where fans share opinions and reviews across hundreds of genres.
I turned the review pages into a structured dataset containing:
- Album name
- Artist name
- Review text
- User rating (when available)
This mirrors real-world unstructured data work: the dataset had to be assembled from raw webpages before any modeling could begin.
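As a rough illustration of the scraping step, here is a minimal sketch using requests and BeautifulSoup. The URL pattern and CSS selectors (`div.review`, `a.album`, etc.) are hypothetical placeholders, not RateYourMusic's actual markup, and any real scraper should respect the site's terms and rate limits.

```python
import requests
from bs4 import BeautifulSoup

def scrape_album_reviews(url: str) -> list[dict]:
    """Pull (album, artist, review, rating) rows from one review page."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    rows = []
    for review in soup.select("div.review"):           # hypothetical selector
        rating = review.select_one("span.rating")      # rating is optional
        rows.append({
            "album": review.select_one("a.album").get_text(strip=True),
            "artist": review.select_one("a.artist").get_text(strip=True),
            "review_text": review.select_one("p.review_body").get_text(strip=True),
            "rating": rating.get_text(strip=True) if rating else None,
        })
    return rows
```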
Real-World Challenge
Crowdsourced review text varies widely:
- Inconsistent vocabulary
- Subjective tone
- Different writing styles
- Highly variable review lengths
This required transforming unstructured language into structured representations.


Data Characteristics & Modeling Implications
RateYourMusic reviews vary widely in both rating behavior and writing style. Ratings tend to cluster toward higher scores, meaning most albums are reviewed positively. Because of this, ratings alone are not enough to distinguish stylistic differences between albums, making the review text itself the most important signal for recommendations.
Review length also varies significantly. Some users write short reactions, while others provide long, detailed descriptions of sound, mood, and instrumentation. To make the data more consistent, review text was cleaned and aggregated at the album level before modeling. This reduces noise from individual writing styles and helps the system focus on shared descriptive language across multiple listeners.
Text Cleaning & Normalization (Unstructured → Structured)
To make language comparable across thousands of reviews, text was normalized using:
- lowercasing
- punctuation / non-letter removal
- contraction expansion (“don’t” → “do not”)
- English stopword removal plus music-domain filler removal (album, track, release, etc.)
- lemmatization (vocals → vocal, lyrics → lyric) to merge word variants
- unigrams + bigrams to preserve meaningful multi-word concepts (e.g., heavy_metal, post_punk)
Without normalization, the same concept gets split into multiple tokens, weakening frequency analysis and similarity scoring.
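A minimal sketch of this normalization step, assuming NLTK is installed and its stopword/WordNet data downloaded. The contraction map and domain filler set here are small illustrative subsets; the full lists used in the project are larger.

```python
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time setup: nltk.download("stopwords"); nltk.download("wordnet")
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}  # subset
DOMAIN_FILLER = {"album", "track", "release", "song", "record"}         # subset
STOPWORDS = set(stopwords.words("english")) | DOMAIN_FILLER
lemmatizer = WordNetLemmatizer()

def normalize_review(text: str) -> list[str]:
    """Lowercase, expand contractions, keep alphabetic tokens, drop stopwords, lemmatize."""
    text = text.lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    tokens = re.findall(r"[a-z]+", text)
    tokens = [lemmatizer.lemmatize(t) for t in tokens if t not in STOPWORDS]
    # Append bigrams so multi-word concepts survive as single tokens
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

print(normalize_review("Don't miss the heavy metal guitar riffs on this album!"))
# ['heavy', 'metal', 'guitar', 'riff', 'heavy_metal', 'metal_guitar', 'guitar_riff']
```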
Attribute Discovery
Facets / Lexicons (Genre, Instrument, Mood)
One challenge in music text analysis is that raw word frequency often surfaces generic tokens (“sound,” “album,” “feel”) rather than meaningful stylistic descriptors. To create interpretable recommendations, review text was transformed into structured attributes representing how listeners describe music.
Instead of selecting features purely from statistical frequency, tokens were grouped into higher-level conceptual facets:
- Genres: rock, metal, punk, pop, ambient, electronic, …
- Instruments: guitar, drums, bass, vocals, synth, …
- Moods/Descriptors: aggressive, dark, melodic, gritty, raw, dreamy, …
This approach converts unstructured language into interpretable building blocks that align with how real listeners search for music.
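In code, the facets can be as simple as lexicon sets keyed by facet name. The vocabularies below are illustrative subsets of the full lexicons; `tag_facets` is a hypothetical helper showing how normalized tokens map onto facets.

```python
FACETS = {
    "genre": {"rock", "metal", "punk", "pop", "ambient", "electronic"},
    "instrument": {"guitar", "drum", "bass", "vocal", "synth"},
    "mood": {"aggressive", "dark", "melodic", "gritty", "raw", "dreamy"},
}

def tag_facets(tokens: list[str]) -> dict[str, set[str]]:
    """Map normalized review tokens onto the facet attributes they mention."""
    return {facet: vocab & set(tokens) for facet, vocab in FACETS.items()}

tokens = ["dark", "ambient", "synth", "texture", "dreamy"]
print(tag_facets(tokens))
# {'genre': {'ambient'}, 'instrument': {'synth'}, 'mood': {'dark', 'dreamy'}}
```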

Instrumentation and genre-related language appear most frequently in reviews, suggesting listeners often describe music through sound characteristics rather than technical production details. This validates using facet-based attribute queries, since common descriptive patterns emerge consistently across reviewers.
Lift / Co-Occurrence Check (Sanity Test)
Lift was used to verify that candidate attributes were co-mentioned more often than expected by chance, ensuring selected attributes naturally appear together in review language rather than being independently frequent.
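Concretely, lift(a, b) = P(a and b) / (P(a) · P(b)); values above 1 mean two attributes are co-mentioned more often than independence would predict. A minimal sketch on a toy album-level presence matrix (the real data covers thousands of albums):

```python
import pandas as pd

# 1 = attribute mentioned in an album's aggregated reviews (toy data)
df = pd.DataFrame({
    "guitar":    [1, 1, 0, 1, 0, 1],
    "distorted": [1, 1, 0, 0, 0, 1],
    "synth":     [0, 0, 1, 0, 1, 0],
})

def lift(a: str, b: str) -> float:
    """lift > 1: attributes co-occur more often than chance."""
    p_a, p_b = df[a].mean(), df[b].mean()
    p_ab = (df[a] & df[b]).mean()
    return p_ab / (p_a * p_b)

print(lift("guitar", "distorted"))  # 1.5 -> naturally co-mentioned
print(lift("guitar", "synth"))      # 0.0 -> never co-mentioned in the toy data
```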

Facet Co-Occurrence Validation
To ensure selected attributes reflected meaningful musical relationships rather than isolated keywords, I analyzed facet co-occurrence using lift-based normalization.
The heatmap shows how often attribute groups (genre, instrumentation, mood, etc.) appear together in review language. Strong co-occurrence patterns indicate that selected facets align with natural listening descriptors used by reviewers, validating the use of multi-attribute queries (e.g., genre + instrument + mood) in the recommendation system.
Recommendation Engine
A listener specifies 3 attributes (typically one from each facet: genre + instrument + mood). The system ranks albums by comparing the attribute query to each album’s review document.
Method 1: TF-IDF (Bag-of-Words) + Cosine Similarity
- Convert each album document into a TF-IDF vector
- Convert the user’s 3 attributes into a query vector
- Rank albums by cosine similarity(query, album)
Why it works: TF-IDF emphasizes terms that are distinctive to an album relative to the full corpus.
Strength: Best for exact intent and precise vocabulary matching.
Limitation: Can miss semantic matches when reviewers use different wording.
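A minimal sketch of this retrieval step with scikit-learn, assuming `album_docs` (one aggregated review document per album) has been prepared upstream; the documents here are toy stand-ins.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

album_docs = {
    "Album A": "dark ambient synth textures dreamy atmosphere",
    "Album B": "raw punk guitar aggressive drums gritty vocals",
    "Album C": "melodic guitar dreamy shoegaze atmospheric",
}

# Unigrams + bigrams, matching the normalization pipeline
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
doc_matrix = vectorizer.fit_transform(album_docs.values())

# The 3-attribute query is vectorized in the same TF-IDF space
query_vec = vectorizer.transform(["rock guitar dreamy"])
scores = cosine_similarity(query_vec, doc_matrix).ravel()

for album, score in sorted(zip(album_docs, scores), key=lambda x: -x[1]):
    print(f"{album}: {score:.3f}")
```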
Method 2: TF-IDF Similarity + Sentiment Evidence (Interpretability Layer)
Similarity alone doesn’t distinguish between positive and negative mentions of an attribute. To add context:
- split album text into sentences
- keep sentences that mention the chosen attributes
- score those sentences using VADER sentiment
- report positive vs. negative attribute mentions and an evidence snippet
Why it matters: prevents “false matches” where the attribute appears in a negative context and makes recommendations explainable.
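A sketch of the evidence layer using NLTK's VADER analyzer; `attribute_evidence` is a hypothetical helper, and the ±0.05 compound-score thresholds are VADER's conventional positive/negative cutoffs.

```python
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import sent_tokenize

# One-time setup: nltk.download("vader_lexicon"); nltk.download("punkt")
sia = SentimentIntensityAnalyzer()

def attribute_evidence(review_text: str, attributes: list[str]):
    """Score only the sentences that mention a queried attribute."""
    pos, neg, snippets = 0, 0, []
    for sentence in sent_tokenize(review_text):
        if any(attr in sentence.lower() for attr in attributes):
            compound = sia.polarity_scores(sentence)["compound"]
            pos += compound > 0.05
            neg += compound < -0.05
            snippets.append((compound, sentence))
    # Return counts plus the most positive sentence as the evidence snippet
    return pos, neg, max(snippets, default=(0, ""))[1]

text = "The guitar work is stunning and dreamy. Sadly the drums sound flat."
print(attribute_evidence(text, ["guitar", "dreamy"]))
```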
Method 3: Embeddings + Cosine Similarity (Semantic Retrieval)
To capture paraphrases and related concepts:
- represent album documents using dense embedding vectors
- represent the attribute query in the same embedding space
- rank albums by cosine similarity in embedding space
Why embeddings help for reviews: they capture meaning beyond exact words:
- “guitar riffs” ≈ “shredding”
- “spacey” ≈ “atmospheric”
- “dark” ≈ “moody”
This method provides more robust retrieval when vocabulary varies across reviewers; however, it can be less transparent than TF-IDF unless paired with evidence snippets.
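A minimal sketch of embedding retrieval with spaCy, assuming the `en_core_web_md` model (which ships word vectors) is installed via `python -m spacy download en_core_web_md`; the album documents are toy stand-ins.

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")

album_docs = {
    "Album A": "dark ambient synth textures and a dreamy atmosphere",
    "Album B": "raw punk guitar with aggressive drums and gritty vocals",
}

def embed(text: str) -> np.ndarray:
    """spaCy averages token vectors into a single document vector."""
    return nlp(text).vector

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

query = embed("rock guitar dreamy")
ranked = sorted(album_docs, key=lambda a: -cosine(query, embed(album_docs[a])))
print(ranked)
```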
Results (Top 3 + Top 20 Recommendations)
For each method, outputs include:
- Top 3 recommendations (final picks)
- Top 20 contenders (ranked shortlist for transparency)
Each recommendation table includes:
- similarity score (TF-IDF cosine or embedding cosine)
- sentiment context (positive/negative attribute mentions)
- average rating + number of reviews (stability/context)
- evidence snippet showing why the album matched
Why Top 20 matters: it makes the ranking interpretable and shows how close contenders compare to the final recommendations.

Model Comparison & Insights

To evaluate each approach, I queried all three recommendation systems using the same user request and compared the top three suggested albums. Each model produced noticeably different results, highlighting how retrieval strategy influences recommendations.
TF-IDF + Sentiment
The TF-IDF model relied heavily on keyword matching, which led to more genre-diverse recommendations. While technically aligned with the query vocabulary, some results (e.g., Ka and Wednesday) lacked conceptual alignment with the intended guitar-driven, atmospheric sound. This illustrates a common limitation of lexical retrieval: strong surface-level matches without deeper contextual understanding.
spaCy (Pre-trained Embeddings)
The embedding-based approach captured semantic relationships that keyword matching missed. Albums from A Winged Victory for the Sullen and other post-rock/ambient artists better reflected the intended “spacey” mood through broader conceptual similarity. Compared to TF-IDF, this model demonstrated improved understanding of vibe and stylistic meaning rather than strict word overlap.
Custom Word2Vec (Domain-Trained)
The domain-trained Word2Vec model produced the strongest results overall. By learning directly from music-review language, it generated higher-confidence similarity scores and more cohesive recommendations. Artists such as Thy Catafalque and A Winged Victory for the Sullen aligned closely with the query’s atmospheric and guitar-focused attributes, suggesting that domain-specific embeddings better capture nuanced musical descriptors.
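A sketch of the domain-training step with gensim, assuming tokenized review lists from the normalization step; the corpus below is a toy stand-in for the full 8,000+ reviews, and `doc_vector` is a hypothetical mean-pooling helper.

```python
import numpy as np
from gensim.models import Word2Vec

tokenized_reviews = [
    ["dark", "ambient", "synth", "dreamy", "texture"],
    ["raw", "punk", "guitar", "aggressive", "drum", "gritty"],
    ["melodic", "guitar", "dreamy", "atmospheric", "shoegaze"],
]

# Train a small domain model directly on music-review language
model = Word2Vec(sentences=tokenized_reviews, vector_size=100,
                 window=5, min_count=1, seed=42)

def doc_vector(tokens: list[str]) -> np.ndarray:
    """Average word vectors into a document/query vector."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0)

query = doc_vector(["guitar", "dreamy", "atmospheric"])
album = doc_vector(tokenized_reviews[2])
print(float(query @ album / (np.linalg.norm(query) * np.linalg.norm(album))))
```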
Why This Matters
This project demonstrates a realistic NLP recommender workflow:
- scrape and assemble unstructured text
- normalize and engineer linguistic features
- discover attributes directly from language
- compare exact-match and semantic retrieval
- add interpretability through sentiment and evidence
- evaluate tradeoffs between methods
The same framework generalizes to recommendation systems for books, movies, restaurants, products, and other review-driven domains.
Tools & Skills Demonstrated
- Web scraping and unstructured dataset creation
- NLP preprocessing (stopwords, lemmatization, n-grams)
- Attribute discovery and co-occurrence validation
- TF-IDF and cosine similarity retrieval
- Sentiment analysis for interpretability
- Embedding-based semantic similarity
- Transparent recommendation system design