Note
All code for this article is available at
https://github.com/super-213/business_data_analysis
Download it from there if you need it.
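Inspecting the data
```python
import pandas as pd

# Load the movie-ratings spreadsheet and preview the first five rows.
df = pd.read_excel('电影推荐系统.xlsx')
df.head()
```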
| Unnamed: 0 | 电影编号 | 名称 | 类别 | 用户编号 | 评分 |
|---|---|---|---|---|---|
| 0 | 1 | 玩具总动员(1995) | 冒险\|动画\|儿童\|喜剧\|幻想 | 1 | 4.0 |
| 1 | 1 | 玩具总动员(1995) | 冒险\|动画\|儿童\|喜剧\|幻想 | 5 | 4.0 |
| 2 | 1 | 玩具总动员(1995) | 冒险\|动画\|儿童\|喜剧\|幻想 | 7 | 4.5 |
| 3 | 1 | 玩具总动员(1995) | 冒险\|动画\|儿童\|喜剧\|幻想 | 15 | 2.5 |
| 4 | 1 | 玩具总动员(1995) | 冒险\|动画\|儿童\|喜剧\|幻想 | 17 | 4.5 |

The column names are 电影编号 (movie ID), 名称 (title), 类别 (genres), 用户编号 (user ID), and 评分 (rating).
```python
# Drop the redundant 'Unnamed: 0' column, which only repeats the row index.
df = df.drop(columns=['Unnamed: 0'])
```
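Data processing
```python
# Pivot into a user-by-movie rating matrix; movies a user has not rated become NaN.
user_movie = df.pivot_table(index='用户编号', columns='名称', values='评分')
user_movie.head()
```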
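To make the shape of `user_movie` concrete, here is a minimal sketch on a toy table. The ratings and the movie names `A`, `B`, `C` are made up for illustration; only the column names match the article. Note that `pivot_table` averages duplicate (user, movie) pairs by default.

```python
import pandas as pd

# Toy long-format ratings (hypothetical values, same column names as the article).
ratings = pd.DataFrame({
    '用户编号': [1, 1, 2, 2, 3],
    '名称': ['A', 'B', 'A', 'C', 'B'],
    '评分': [4.0, 3.0, 5.0, 2.0, 4.5],
})

# One row per user, one column per movie; unrated pairs become NaN.
toy_matrix = ratings.pivot_table(index='用户编号', columns='名称', values='评分')

# Fraction of user-movie cells that hold no rating.
sparsity = toy_matrix.isna().to_numpy().mean()

print(toy_matrix)
print(f"sparsity: {sparsity:.2f}")
```

On real rating data most cells are empty, which is why the later steps have to deal with NaN explicitly.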
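Computing correlations
```python
# Ratings column of the reference movie, 阿甘正传(1994) (Forrest Gump).
FG = user_movie['阿甘正传(1994)']

# Pearson correlation of every movie's ratings with the reference movie's ratings.
corr_FG = user_movie.corrwith(FG)
similarity = pd.DataFrame(corr_FG, columns=['相关系数'])
similarity.head()
```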
| 名称 | 相关系数 |
|---|---|
| 007之黄金眼(1995) | 0.217441 |
| 100个女孩(2000) | NaN |
| 100条街道(2016) | NaN |
| 101忠狗续集:伦敦大冒险(2003) | NaN |
| 101忠狗(1961) | 0.141023 |
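The NaN entries appear because Pearson correlation is computed only over users who rated both movies. A small sketch with invented ratings (the column names `target`, `two_common`, and so on are hypothetical) shows the three cases worth knowing: one shared rater gives NaN, a constant column gives NaN, and exactly two shared raters always give +1 or -1.

```python
import warnings

import numpy as np
import pandas as pd

# Toy user-by-movie matrix; rows are users, NaN means "not rated".
toy = pd.DataFrame({
    'target':       [5.0, 3.0, 4.0, np.nan],
    'two_common':   [4.0, 2.0, np.nan, np.nan],   # two shared raters, same direction
    'two_opposite': [2.0, 4.0, np.nan, np.nan],   # two shared raters, opposite direction
    'one_common':   [np.nan, np.nan, 3.0, 5.0],   # a single shared rater
    'constant':     [4.0, 4.0, 4.0, np.nan],      # zero variance among shared raters
}, index=[1, 2, 3, 4])

with warnings.catch_warnings():
    # The degenerate cases emit RuntimeWarnings from NumPy; silence them for the demo.
    warnings.simplefilter('ignore')
    corr = toy.corrwith(toy['target'])

print(corr)
```

The two-rater case matters later: a correlation of exactly 1.0 can simply mean that only two users rated both movies.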
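Viewing related movies
```python
# Drop movies with no computable correlation and sort from most to least similar.
similar_movies = similarity.dropna().sort_values(by='相关系数', ascending=False)
# Exclude the reference movie itself.
similar_movies = similar_movies[similar_movies.index != '阿甘正传(1994)']

top10 = similar_movies.head(10)
print(top10)
```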
| 名称 | 相关系数 |
|---|---|
| Savannah Smiles(1982) | 1.0 |
| Badassdom骑士(2013) | 1.0 |
| 亲密的陌生人(有信心的时间)(2004) | 1.0 |
| 珀西杰克逊:怪兽之海(2013) | 1.0 |
| 追逐者(1994) | 1.0 |
| 地狱(2016) | 1.0 |
| 爬行之夜(1986) | 1.0 |
| 睡衣派对大屠杀,(1982) | 1.0 |
| 我男朋友的回归(1993) | 1.0 |
| 如何像英国人一样做爱(2014) | 1.0 |
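A top 10 made up entirely of 1.0 scores usually means the correlations rest on very few shared raters. One common remedy, sketched below on invented data, is to count how many users rated both movies and require a minimum. The helper `similar_items`, the `min_common` threshold, and the English column names are assumptions for illustration and are not part of the article's code.

```python
import numpy as np
import pandas as pd

def similar_items(user_item, target, min_common=3):
    """Correlate every column with `target`, keeping only well-supported pairs."""
    target_col = user_item[target]
    corr = user_item.corrwith(target_col)
    # Number of users who rated both the target and each other movie.
    n_common = user_item[target_col.notna()].count()
    result = pd.DataFrame({'similarity': corr, 'n_common': n_common})
    result = result.drop(index=target).dropna(subset=['similarity'])
    result = result[result['n_common'] >= min_common]
    return result.sort_values('similarity', ascending=False)

# Toy matrix: 'lucky_pair' shares only two raters with the target.
toy = pd.DataFrame({
    'target':         [5.0, 4.0, 3.0, 2.0, 1.0],
    'well_supported': [4.0, 5.0, 3.0, 1.0, 2.0],
    'lucky_pair':     [5.0, 4.0, np.nan, np.nan, np.nan],
})

unfiltered = similar_items(toy, 'target', min_common=1)
filtered = similar_items(toy, 'target', min_common=3)
print(unfiltered)
print(filtered)
```

Without the filter, `lucky_pair` ranks first with a perfect score from two raters; with it, only `well_supported` remains. On the article's data the call would look like `similar_items(user_movie, '阿甘正传(1994)', min_common=...)`, with the threshold chosen by experiment.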
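SVD
```python
from surprise import SVD, Dataset, Reader, accuracy
# Alias assumed from the name used below; it avoids a clash with scikit-learn's train_test_split.
from surprise.model_selection import train_test_split as surprise_train_test_split

# Tell Surprise the rating range, then load (user, item, rating) triples.
reader = Reader(rating_scale=(df['评分'].min(), df['评分'].max()))
data = Dataset.load_from_df(df[['用户编号', '名称', '评分']], reader)
trainset, testset = surprise_train_test_split(data, test_size=0.2, random_state=42)

# Fit SVD with default hyperparameters and evaluate on the held-out 20%.
svd_model = SVD()
svd_model.fit(trainset)
svd_pred = svd_model.test(testset)
svd_rmse = accuracy.rmse(svd_pred)
svd_mae = accuracy.mae(svd_pred)
```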
RMSE: 0.8793
MAE: 0.6738
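For readers unfamiliar with the two metrics, here is a minimal NumPy sketch of what RMSE and MAE measure. The functions `rmse` and `mae` and the sample ratings are illustrative stand-ins, not Surprise's implementation.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more heavily."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: the average size of a miss, in rating points."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(err)))

# Hypothetical true ratings and predictions.
true_ratings = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

toy_rmse = rmse(true_ratings, predicted)
toy_mae = mae(true_ratings, predicted)
print(f"RMSE: {toy_rmse:.4f}  MAE: {toy_mae:.4f}")
```

RMSE is never smaller than MAE, which matches the figures above. Surprise predictions expose the true rating as `r_ui` and the estimate as `est`, so the same functions could be applied to `svd_pred` as a cross-check.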
Predicting movie ratings with KNN
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Brute-force cosine KNN over user rating vectors; unrated movies count as 0.
knn = NearestNeighbors(metric='cosine', algorithm='brute')
R = user_movie.fillna(0).values
knn.fit(R)

def knn_predict(user_id, movie_name, user_movie, knn_model, k=5):
    user_vector = user_movie.loc[user_id].fillna(0).values.reshape(1, -1)
    # Ask for k+1 neighbors because the closest match is normally the user themselves.
    distances, indices = knn_model.kneighbors(user_vector, n_neighbors=k+1)
    neighbors = indices.flatten()[1:]
    ratings = []
    for n in neighbors:
        rating = user_movie.iloc[n][movie_name]
        if not np.isnan(rating):
            ratings.append(rating)
    # Fall back to the movie's overall mean if no neighbor rated it.
    if len(ratings) == 0:
        return user_movie[movie_name].mean()
    return np.mean(ratings)

pred = knn_predict(1, '阿甘正传(1994)', user_movie, knn)
print(f"KNN predicted rating: {pred:.4f}")
```
KNN predicted rating: 3.7500
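The neighbor search relies on cosine distance, which is 1 minus the cosine similarity of two rating vectors. The sketch below computes it directly in NumPy on a made-up matrix, to show what "nearest" means once missing ratings have been filled with 0. The helper `cosine_distances_to` is hypothetical and only mirrors the metric, not scikit-learn's implementation.

```python
import numpy as np

def cosine_distances_to(row, matrix):
    """Cosine distance (1 - cosine similarity) from `row` to every row of `matrix`."""
    row = np.asarray(row, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(row)
    # Rows with zero norm get similarity 0, i.e. distance 1.
    sims = np.divide(matrix @ row, norms, out=np.zeros(len(matrix)), where=norms > 0)
    return 1.0 - sims

# Toy rating matrix after fillna(0): rows are users, columns are movies.
R_toy = np.array([
    [5.0, 4.0, 0.0],   # user 0, the query
    [4.0, 5.0, 0.0],   # similar taste
    [0.0, 0.0, 5.0],   # no overlap with user 0
    [5.0, 0.0, 0.0],   # partial overlap
])

dist = cosine_distances_to(R_toy[0], R_toy)
order = np.argsort(dist)   # nearest first; position 0 is the query user itself
print(dist.round(4), order)
```

Because zeros stand in for "not rated", two users with no movies in common end up at distance 1, the same as if they disagreed completely.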
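User-user collaborative filtering
```python
user_id = 10
movie_name = '阿甘正传(1994)'

# Correlation between the target user's ratings and every other user's ratings.
user_corr = user_movie.corrwith(user_movie.loc[user_id], axis=1)

# Users who actually rated the target movie.
rated_users = user_movie[movie_name].dropna()

# Similarity-weighted average of those users' ratings.
pred = np.dot(user_corr[rated_users.index], rated_users) / user_corr[rated_users.index].sum()
print(f"Weighted-average predicted rating: {pred:.4f}")
```
Weighted-average predicted rating: nan

The prediction comes out as `nan`, most likely because `user_corr` is NaN for users who share too few rated movies with user 10, and `np.dot` propagates any NaN in its inputs.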
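One way to get a usable number out of a weighted average is to drop NaN similarities before combining. The sketch below also discards non-positive weights, which is a design choice added here and not something the article does. The helper `weighted_rating` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

def weighted_rating(similarities, ratings):
    """Similarity-weighted mean of `ratings`, ignoring NaN and non-positive weights."""
    pair = pd.DataFrame({'w': similarities, 'r': ratings}).dropna()
    pair = pair[pair['w'] > 0]
    if pair.empty:
        return float('nan')   # nothing usable to average
    return float((pair['w'] * pair['r']).sum() / pair['w'].sum())

# Hypothetical similarities to the target user and ratings of the target movie.
users = [2, 3, 4, 5]
sims = pd.Series([0.5, np.nan, -0.2, 1.0], index=users)
movie_ratings = pd.Series([4.0, 5.0, 1.0, 3.0], index=users)

naive = np.dot(sims.to_numpy(), movie_ratings.to_numpy())   # NaN poisons the dot product
safe = weighted_rating(sims, movie_ratings)
print(f"naive: {naive}, NaN-safe: {safe:.4f}")
```

In the article's terms this would be called as `weighted_rating(user_corr[rated_users.index], rated_users)`. If the target user has rated the movie themselves, it is also worth dropping their own row first, since their self-correlation of 1 would otherwise leak the answer into the prediction.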
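Item-item collaborative filtering
```python
user_id = 1
movie_name = '阿甘正传(1994)'

# Correlation between the target movie and every other movie.
movie_corr = user_movie.corrwith(user_movie[movie_name])
# Movies the target user has rated.
rated_movies = user_movie.loc[user_id].dropna()
# Similarity-weighted average of the user's own ratings.
pred = np.dot(movie_corr[rated_movies.index], rated_movies) / movie_corr[rated_movies.index].sum()
print(f"Predicted rating based on movie similarity: {pred:.4f}")
```
Predicted rating based on movie similarity: nan

This is `nan` as well. The likely cause is that `movie_corr` is NaN for any movie with too few raters in common with 阿甘正传(1994), as the correlation table earlier shows, and a single NaN among the movies user 1 rated is enough to make `np.dot` return NaN.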
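Beyond skipping NaN, item-based predictions often need to handle negative correlations, since dividing by a signed sum of weights can produce values far outside the rating scale. The sketch below uses a mean-centered weighted sum with absolute weights in the denominator. This is a common variant and not the article's formula; the helper `predict_item_based` and the toy matrix are invented for illustration.

```python
import warnings

import numpy as np
import pandas as pd

def predict_item_based(user_item, user_id, target):
    """Mean-centered, similarity-weighted prediction of `target` for `user_id`."""
    with warnings.catch_warnings():
        # Correlations over a single shared rater emit RuntimeWarnings and come back NaN.
        warnings.simplefilter('ignore')
        sims = user_item.corrwith(user_item[target]).drop(target)
    item_means = user_item.mean()
    rated = user_item.loc[user_id].drop(target).dropna()
    # Pair each similarity with how far the user's rating sits from that movie's mean.
    frame = pd.DataFrame({'w': sims, 'dev': rated - item_means}).dropna()
    denom = frame['w'].abs().sum()
    if denom == 0:
        return float(item_means[target])   # no usable neighbors: fall back to the mean
    return float(item_means[target] + (frame['w'] * frame['dev']).sum() / denom)

toy = pd.DataFrame({
    'target':       [5.0, 4.0, 2.0, 1.0, np.nan],
    'liked_alike':  [4.0, 5.0, 1.0, 2.0, 5.0],
    'opposite':     [1.0, 2.0, 4.0, 5.0, 1.0],
    'rarely_rated': [np.nan, np.nan, np.nan, 3.0, 4.0],
}, index=[1, 2, 3, 4, 5])

toy_pred = predict_item_based(toy, user_id=5, target='target')
print(f"predicted rating: {toy_pred:.4f}")
```

User 5 rated `liked_alike` above its average and `opposite` below its average, so both the positive and the negative correlation push the prediction above the target's mean. The result can still fall outside the rating range, so clipping to the scale is a reasonable last step.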