import pandas as pd
import numpy as np
import os
import re
import json
import random
import math
We will evaluate our models based on R-Precision, Normalized Discounted Cumulative Gain (NDCG), and Recommended Songs Clicks. To define these metrics clearly, we use $G$ to denote the ordered ground-truth list of songs that the user would like to listen to, and $R$ to denote the ordered recommendations produced by our model. We use $\mid \cdot \mid$ to indicate the length of a list, and $R_i$ to refer to the $i$-th song in our recommendation. Furthermore, we say a song in our recommendation is relevant if it also exists in the ground-truth list, and we define $r_i = 1$ if $R_i$ is relevant and $r_i = 0$ otherwise.
R-Precision measures the overlap between the ground-truth set and our recommendation. Its value is the number of relevant songs among our model's first $\mid G \mid$ recommendations, divided by the length of the ground-truth set.
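With the notation above, this can be written as:

$$\text{R-Precision} = \frac{\sum_{i=1}^{|G|} r_i}{|G|}$$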
Normalized Discounted Cumulative Gain (NDCG) further measures the quality of ordering in our recommendation: it gives more credit when a relevant song is placed higher in the list. DCG is a score on our recommendation, and IDCG is the ideal DCG value obtained if all of our top $\mid G \mid$ recommended songs were relevant. By dividing the two, NDCG gives us a normalized score.
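In formulas, with $r_i$ as defined above (note that the first position contributes $r_1/\log_2 2 = r_1$):

$$\mathrm{DCG} = \sum_{i=1}^{|R|} \frac{r_i}{\log_2(i+1)}, \qquad \mathrm{IDCG} = \sum_{i=1}^{|G|} \frac{1}{\log_2(i+1)}, \qquad \mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}$$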
Recommended Songs Clicks is a special metric targeted at Spotify. Spotify has a feature that generates ten recommended songs per round. Recommended Songs Clicks is the minimal number of refreshes required to reach the first relevant song.
When there are more songs in $R$ than in $G$, we only consider the first $\mid G \mid$ songs in $R$. If none of the considered songs is relevant, the value of Recommended Songs Clicks is $\frac{|R|}{10}$, which is one more than the maximal number of rounds possible.
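Equivalently, if the first relevant song appears at (1-indexed) position $i$ in $R$, then:

$$\text{Clicks} = \left\lfloor \frac{i - 1}{10} \right\rfloor$$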
def R_precision(rec, Y):
    # Fraction of ground-truth songs that appear in the first |G| recommendations.
    count = 0
    for song in Y:
        if song in rec[:len(Y)]:
            count += 1
    return count / len(Y)
def NDCG(rec, Y):
    # IDCG: DCG of an ideal ranking where the top |G| slots are all relevant.
    IDCG = 0
    for i in range(len(Y)):
        if i == 0: IDCG += 1
        else: IDCG += 1 / math.log(i + 2, 2)
    # DCG: discounted gain of the actual recommendation list.
    DCG = 0
    for i in range(len(rec)):
        if i == 0 and rec[i] in Y: DCG += 1
        elif i > 0 and rec[i] in Y: DCG += 1 / math.log(i + 2, 2)
    return DCG / IDCG
def clicks(rec, Y):
    # Position of the first relevant song among the first |G| recommendations.
    found_at = -1
    find = 0
    while found_at == -1 and find < len(Y):
        if rec[find] in Y: found_at = find
        else: find += 1
    if found_at == -1:
        # No relevant song found: one more than the maximal number of rounds.
        return len(Y) // 10
    else:
        return found_at // 10
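As a quick sanity check, the three metrics can be exercised on a small hypothetical example; the song IDs and lists below are made up purely for illustration, and the computations mirror the logic of `R_precision`, `NDCG`, and `clicks` above.

```python
import math

# Hypothetical toy example: ground truth Y and a model's recommendation rec.
Y = ['a', 'b', 'c', 'd']          # |G| = 4
rec = ['x', 'b', 'a', 'y', 'c']   # relevant songs at 0-indexed positions 1, 2, 4

# R-Precision: relevant songs among the first |G| recommendations, over |G|.
r_prec = sum(1 for song in Y if song in rec[:len(Y)]) / len(Y)
print(r_prec)  # -> 0.5 ('b' and 'a' fall within rec[:4])

# NDCG: discounted gain of rec, normalized by the ideal gain over |G| slots.
DCG = sum(1 / math.log(i + 2, 2) for i, song in enumerate(rec) if song in Y)
IDCG = sum(1 / math.log(i + 2, 2) for i in range(len(Y)))
print(DCG / IDCG)  # roughly 0.59

# Clicks: round (of ten songs) containing the first relevant recommendation.
first = next((i for i, song in enumerate(rec[:len(Y)]) if song in Y), None)
n_clicks = len(Y) // 10 if first is None else first // 10
print(n_clicks)  # -> 0 (first relevant song is at position 1)
```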
def TEST_ALL(recs, Ys):
    # Evaluate all three metrics over paired recommendation / ground-truth lists.
    R_precision_scores = []
    NDCG_scores = []
    clicks_scores = []
    for i in range(len(Ys)):
        rec = recs[i]
        Y = Ys[i]
        R_precision_scores.append(R_precision(rec, Y))
        NDCG_scores.append(NDCG(rec, Y))
        clicks_scores.append(clicks(rec, Y))
    return R_precision_scores, NDCG_scores, clicks_scores
def test_recs(fn):
    with open(fn) as json_file:
        rec = json.load(json_file)
    with open('validation/val_Y.json') as json_file:
        val_Y = json.load(json_file)
    # Drop playlists for which the model produced no recommendations.
    empty = []
    for i in range(len(rec)):
        if len(rec[i]) == 0: empty.append(i)
    for i in reversed(sorted(empty)):
        del rec[i]
        del val_Y[i]
    R_precision_score, NDCG_score, clicks_score = TEST_ALL(rec, val_Y)
    score1 = np.mean(R_precision_score)
    score2 = np.mean(NDCG_score)
    score3 = np.mean(clicks_score)
    print(f'R_precision: {score1}')
    print(f'NDCG: {score2}')
    print(f'#clicks: {score3}')
    return score1, score2, score3
def test_scores(fn):
    with open(fn) as json_file:
        scores = json.load(json_file)
    with open('validation/val_Y.json') as json_file:
        val_Y = json.load(json_file)
    # Each entry maps song IDs to rank scores; the keys form the recommendation.
    rec = [list(single_score.keys()) for single_score in scores]
    empty = []
    for i in range(len(rec)):
        if len(rec[i]) == 0: empty.append(i)
    for i in reversed(sorted(empty)):
        del rec[i]
        del val_Y[i]
    R_precision_score, NDCG_score, clicks_score = TEST_ALL(rec, val_Y)
    score1 = np.mean(R_precision_score)
    score2 = np.mean(NDCG_score)
    score3 = np.mean(clicks_score)
    print(f'R_precision: {score1}')
    print(f'NDCG: {score2}')
    print(f'#clicks: {score3}')
    return score1, score2, score3
For the hybridization process, we chose 5,000 playlists as our validation set and divided each into an input part and an output part. The inputs have lengths from 0 to 150, distributed roughly evenly, and the outputs all have length 100. We feed each model the validation inputs, and each model produces 500 ordered song recommendations per playlist. We then calculate the three metrics for each model.
Because we generated our validation set from the MPD, and the MPD does not provide information on a user's preference among the songs within a single playlist, we assume that the position of a song indicates the user's preference. That is to say, we consider that users prefer songs placed nearer the front of a playlist, and we calculate NDCG based on this assumption.
Top 500 Popular Songs
R_bl, N_bl, C_bl = test_recs('validation/val_Y_top500.json')
R_precision: 0.035814
NDCG: 0.08360293878734794
#clicks: 5.0066
| | R-Precision | NDCG | Recommended Songs Clicks |
|---|---|---|---|
| Baseline - Top 500 | 0.035814 | 0.083603 | 5.0066 |
R_cf_list,N_cf_list,C_cf_list = [],[],[]
Baseline (50000 Playlists)
R, N, C = test_scores('validation/score_baseline_CF.json')
R_cf_list.append(R)
N_cf_list.append(N)
C_cf_list.append(C)
R_precision: 0.022812
NDCG: 0.06110045476489474
#clicks: 5.5702
Meta-Playlist
R, N, C = test_scores('validation/score_metaplaylist.json')
R_cf_list.append(R)
N_cf_list.append(N)
C_cf_list.append(C)
R_precision: 0.017968
NDCG: 0.053416332195018505
#clicks: 5.2572
Advanced (Filtered Songs and Playlists)
R, N, C = test_scores('validation/score_advanced_CF.json')
R_cf_list.append(R)
N_cf_list.append(N)
C_cf_list.append(C)
R_precision: 0.020694000000000004
NDCG: 0.06297452664793372
#clicks: 6.12
| | R-Precision | NDCG | Recommended Songs Clicks |
|---|---|---|---|
| Baseline CF | 0.022812 | 0.061100 | 5.5702 |
| Meta-Playlist CF | 0.017968 | 0.053416 | 5.2572 |
| Advanced CF | 0.020694 | 0.062975 | 6.1200 |
R_cb_list,N_cb_list,C_cb_list = [],[],[]
Clustering - Emotion
R, N, C = test_scores('validation/val_Y_lyric_score_c.json')
R_cb_list.append(R)
N_cb_list.append(N)
C_cb_list.append(C)
R_precision: 0.0006313815378203792
NDCG: 0.0020591344073271735
#clicks: 9.656386747239008
Clustering - Genre
R, N, C = test_scores('validation/val_Y_genre_score_c.json')
R_cb_list.append(R)
N_cb_list.append(N)
C_cb_list.append(C)
R_precision: 7.052186177715091e-05
NDCG: 0.00025659978471954096
#clicks: 9.962522667741286
Clustering - Audio Feature
R, N, C = test_scores('validation/val_Y_audio_score_c.json')
R_cb_list.append(R)
N_cb_list.append(N)
C_cb_list.append(C)
R_precision: 0.0004755188394116462
NDCG: 0.0013314987133503497
#clicks: 9.748942172073344
No Clustering - Emotion
R, N, C = test_scores('validation/val_Y_lyric_score_a.json')
R_cb_list.append(R)
N_cb_list.append(N)
C_cb_list.append(C)
R_precision: 0.0008689310272973536
NDCG: 0.002719010744180856
#clicks: 9.550531360700147
No Clustering - Genre
R, N, C = test_scores('validation/val_Y_genre_score_a.json')
R_cb_list.append(R)
N_cb_list.append(N)
C_cb_list.append(C)
R_precision: 0.00010276042716099134
NDCG: 0.0003291407777316353
#clicks: 9.948418295385855
No Clustering - Audio Feature
R, N, C = test_scores('validation/val_Y_audio_score_a.json')
R_cb_list.append(R)
N_cb_list.append(N)
C_cb_list.append(C)
R_precision: 0.0003788031432601249
NDCG: 0.001110149776504973
#clicks: 9.795688091879912
| | R-Precision | NDCG | Recommended Songs Clicks |
|---|---|---|---|
| Clustering - Emotion | 0.000631 | 0.002059 | 9.656387 |
| Clustering - Genre | 0.000071 | 0.000257 | 9.962523 |
| Clustering - Audio Feature | 0.000476 | 0.001331 | 9.748942 |
| No Clustering - Emotion | 0.000869 | 0.002719 | 9.550531 |
| No Clustering - Genre | 0.000103 | 0.000329 | 9.948418 |
| No Clustering - Audio Feature | 0.000379 | 0.001110 | 9.795688 |
From the results above, we find that the collaborative filtering models perform better than the content-based models. Therefore, in the hybridization process, we focus on combining collaborative filtering models with other models. Based on the performance of the various content-based models, we use the emotion model without clustering and the audio-feature model with clustering. Lastly, because the training dataset of the baseline collaborative filtering model may overlap slightly with the validation set, we exclude it from our final hybrid model.
Stacking with Logistic Regression CV
R_stack_list, N_stack_list, C_stack_list = [], [], []
val_Y_files = ['validation/hybridize_BL2CF2CB.json',
               'validation/hybridize_2CF2CB.json',
               'validation/hybridize_2CFs.json',
               'validation/hybridize_BLmeta.json']
for file in val_Y_files:
    R, N, C = test_recs(file)
    print()
    R_stack_list.append(R)
    N_stack_list.append(N)
    C_stack_list.append(C)
R_precision: 0.035660000000000004
NDCG: 0.08328080372461233
#clicks: 5.0124
R_precision: 0.020246
NDCG: 0.06229288799130191
#clicks: 6.1938
R_precision: 0.018852
NDCG: 0.061758758863905396
#clicks: 6.3528
R_precision: 0.035814
NDCG: 0.08360293878734794
#clicks: 5.0066
| | R-Precision | NDCG | Recommended Songs Clicks |
|---|---|---|---|
| Top500 & CF-MetaPlaylist & CF-Advanced & CB-Emotion & CB-Audio | 0.035660 | 0.083281 | 5.0124 |
| CF-MetaPlaylist & CF-Advanced & CB-Emotion & CB-Audio | 0.020246 | 0.062293 | 6.1938 |
| CF-MetaPlaylist & CF-Advanced | 0.018852 | 0.061759 | 6.3528 |
| Top500 & CF-MetaPlaylist | 0.035814 | 0.083603 | 5.0066 |
Combining with Assigned Weights
R_comb_list, N_comb_list, C_comb_list = [], [], []
val_Y_files = ['validation/combine_all_6.json',
               'validation/combine_all_10.json',
               'validation/combine_all_20.json',
               'validation/combine_BL2CF_4.json',
               'validation/combine_BL2CF_6.json',
               'validation/combine_BL2CF_10.json',
               'validation/combine_BL2CF_20.json']
for file in val_Y_files:
    R, N, C = test_recs(file)
    print()
    R_comb_list.append(R)
    N_comb_list.append(N)
    C_comb_list.append(C)
R_precision: 0.03576800000000001
NDCG: 0.08076354993251983
#clicks: 4.791
R_precision: 0.035904000000000005
NDCG: 0.08239201537590553
#clicks: 4.8556
R_precision: 0.036
NDCG: 0.08302751185815865
#clicks: 4.9238
R_precision: 0.035618000000000004
NDCG: 0.08169420284008266
#clicks: 4.7388
R_precision: 0.03574600000000001
NDCG: 0.08227936904116455
#clicks: 4.7954
R_precision: 0.035886
NDCG: 0.08284576246083752
#clicks: 4.859
R_precision: 0.035991999999999996
NDCG: 0.08301416957075745
#clicks: 4.9244
| | R-Precision | NDCG | Recommended Songs Clicks |
|---|---|---|---|
| All five models with weight(Top 500) = 6 | 0.035768 | 0.080764 | 4.7910 |
| All five models with weight(Top 500) = 10 | 0.035904 | 0.082392 | 4.8556 |
| All five models with weight(Top 500) = 20 | 0.036000 | 0.083028 | 4.9238 |
| Top 500 & CF models with weight(Top 500) = 4 | 0.035618 | 0.081694 | 4.7388 |
| Top 500 & CF models with weight(Top 500) = 6 | 0.035746 | 0.082279 | 4.7954 |
| Top 500 & CF models with weight(Top 500) = 10 | 0.035886 | 0.082846 | 4.8590 |
| Top 500 & CF models with weight(Top 500) = 20 | 0.035992 | 0.083014 | 4.9244 |
We get some unexpected results from the models. In this section, we analyze the performance of the different models and propose how to further improve our music recommendation system.
Among the three collaborative filtering models, the baseline collaborative filtering model gives the best result, which suggests that preprocessing the playlists or the tracks does not help when training collaborative filtering models. The Meta-Playlist model and the Advanced model may fail to give better results because processing the original playlists distorts the true utility matrix. To improve the performance of the collaborative filtering approach, we could train the baseline collaborative filtering model on more playlists.
When we stack a logistic regression model on top of a set of basic models, the final logistic regression model does not always give a better result than the basic models. After examining the coefficients of the logistic regression model, we find that when there is a gap between the precisions of the two basic models included in stacking, the coefficient of the weaker model is negative. This is intuitive, as the weaker model is less likely to predict the songs actually included in the real playlist. Having identified this problem with the stacking method, we instead combine the models by assigning different weights to the rank scores given by the basic models.
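The weighted-combination step can be sketched as follows. The scores and weights here are hypothetical and only illustrate the mechanism of summing weighted rank scores across models:

```python
# Hypothetical rank scores from two base models (song ID -> score).
model_scores = [
    {'s1': 0.9, 's2': 0.1, 's3': 0.5},   # e.g. a popularity-based model
    {'s1': 0.2, 's2': 0.8, 's3': 0.4},   # e.g. a collaborative filtering model
]
weights = [6, 1]  # assign a larger weight to the first model

# Combined score: weighted sum of each model's score for a song.
combined = {}
for w, scores in zip(weights, model_scores):
    for song, score in scores.items():
        combined[song] = combined.get(song, 0.0) + w * score

# Recommend songs in descending order of combined score.
ranked = sorted(combined, key=combined.get, reverse=True)
print(ranked)  # -> ['s1', 's3', 's2']
```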
After combining the basic models with different weights, we do obtain better results than the basic models alone, and our final model achieves an R-Precision of 3.6% on the validation set. It is also worth noting that our final model performs well on the clicks metric, which means that among the 500 songs it recommends for an incomplete playlist, a considerable number appear in the real playlist.