Commit db84319d by 前钰

Upload New File

parent f44a9913
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# 基于集成学习的 Amazon 用户评论质量预测\n",
"\n",
"## 案例简介\n",
"在进行线上商品挑选时,评论往往是我们十分关注的一个方面。然而目前电商网站的评论质量参差不齐,甚至有水军刷好评或者恶意差评的情况出现,严重影响了顾客的购物体验。因此,对于评论质量的预测成为电商平台越来越关注的话题,如果能自动对评论质量进行评估,就能根据预测结果避免展现低质量的评论。本案例中我们将基于集成学习的方法对Amazon现实场景中的评论质量进行预测。\n",
"\n",
"## 作业说明\n",
"本案例中需要大家完成两种集成学习算法的实现(Bagging、AdaBoost),其中基分类器要求使用 SVM 和决策树两种,因此,一共需要对比四组结果(AUC 作为评价指标):\n",
"1.Bagging + SVM\n",
"\n",
"2.Bagging + 决策树\n",
"\n",
"3.AdaBoost + SVM\n",
"\n",
"4.AdaBoost + 决策树\n",
"\n",
"注意集成学习的核心算法需要手动进行实现,基分类器可以调库。\n",
"\n",
"### 基本作业(80分)\n",
"1.根据数据格式设计特征的表示\n",
"\n",
"2.汇报不同组合下得到的 AUC\n",
"\n",
"3.结合不同集成学习算法的特点分析结果之间的差异\n",
"\n",
"(使用 sklearn 等第三方库的集成学习算法会酌情扣分)\n",
"\n",
"### 扩展作业(20分)\n",
"1.尝试其他基分类器(如 k-NN、朴素贝叶斯,神经网络)分析不同特征的影响\n",
"\n",
"2.分析集成学习算法参数的影响"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.naive_bayes import MultinomialNB, BernoulliNB, ComplementNB, GaussianNB # 导入不同类型的朴素贝叶斯分类器\n",
"from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer # 导入文本特征提取工具:词频和TF-IDF向量化器\n",
"from sklearn import preprocessing, tree, ensemble, svm, metrics, calibration # 导入预处理、决策树、集成方法、支持向量机、评价指标等模块\n",
"from sklearn.model_selection import cross_val_score, train_test_split\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.metrics import auc, accuracy_score\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.feature_extraction import text\n",
"from matplotlib import pyplot as plt\n",
"from itertools import combinations\n",
"from wordcloud import WordCloud # 导入生成词云的工具\n",
"from collections import Counter\n",
"from textblob import TextBlob # 导入文本情感分析工具\n",
"from tqdm import tqdm\n",
"import pandas as pd\n",
"import numpy as np\n",
"import random\n",
"import math\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": " reviewerID asin overall votes_up votes_all \\\ncount 57039.000000 57039.000000 57039.000000 57039.000000 57039.000000 \nmean 33359.761865 19973.170866 3.535178 12.387594 18.475850 \nstd 30016.804127 14104.410152 1.529742 45.130499 50.149683 \nmin 50.000000 0.000000 1.000000 0.000000 5.000000 \n25% 9235.000000 8218.000000 2.000000 4.000000 6.000000 \n50% 22589.000000 17635.000000 4.000000 6.000000 10.000000 \n75% 53170.000000 30875.000000 5.000000 11.000000 18.000000 \nmax 123767.000000 50051.000000 5.000000 6084.000000 6510.000000 \n\n label \ncount 57039.000000 \nmean 0.226196 \nstd 0.418371 \nmin 0.000000 \n25% 0.000000 \n50% 0.000000 \n75% 0.000000 \nmax 1.000000 ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>reviewerID</th>\n <th>asin</th>\n <th>overall</th>\n <th>votes_up</th>\n <th>votes_all</th>\n <th>label</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>count</th>\n <td>57039.000000</td>\n <td>57039.000000</td>\n <td>57039.000000</td>\n <td>57039.000000</td>\n <td>57039.000000</td>\n <td>57039.000000</td>\n </tr>\n <tr>\n <th>mean</th>\n <td>33359.761865</td>\n <td>19973.170866</td>\n <td>3.535178</td>\n <td>12.387594</td>\n <td>18.475850</td>\n <td>0.226196</td>\n </tr>\n <tr>\n <th>std</th>\n <td>30016.804127</td>\n <td>14104.410152</td>\n <td>1.529742</td>\n <td>45.130499</td>\n <td>50.149683</td>\n <td>0.418371</td>\n </tr>\n <tr>\n <th>min</th>\n <td>50.000000</td>\n <td>0.000000</td>\n <td>1.000000</td>\n <td>0.000000</td>\n <td>5.000000</td>\n <td>0.000000</td>\n </tr>\n <tr>\n <th>25%</th>\n <td>9235.000000</td>\n <td>8218.000000</td>\n <td>2.000000</td>\n <td>4.000000</td>\n <td>6.000000</td>\n <td>0.000000</td>\n </tr>\n <tr>\n <th>50%</th>\n <td>22589.000000</td>\n <td>17635.000000</td>\n <td>4.000000</td>\n <td>6.000000</td>\n <td>10.000000</td>\n <td>0.000000</td>\n </tr>\n <tr>\n <th>75%</th>\n <td>53170.000000</td>\n <td>30875.000000</td>\n <td>5.000000</td>\n <td>11.000000</td>\n <td>18.000000</td>\n <td>0.000000</td>\n </tr>\n <tr>\n <th>max</th>\n <td>123767.000000</td>\n <td>50051.000000</td>\n <td>5.000000</td>\n <td>6084.000000</td>\n <td>6510.000000</td>\n <td>1.000000</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df = pd.read_csv('train.xlsx', sep='\\t')\n",
"test_df = pd.read_csv('test.xlsx', sep='\\t',index_col=False)\n",
"train_df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"分析数据集\n",
"\n",
"reviewID是用户ID\n",
"\n",
"asin是商品ID\n",
"\n",
"reviewText是评论内容\n",
"\n",
"overall是用户对商品的打分\n",
"\n",
"votes_up是认为评论有用的点赞数\n",
"\n",
"votes_all是该评论得到的总点赞数\n",
"\n",
"label是标签"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"train_df.shape (57039, 7)\n",
"test_df.shape (22418, 5)\n"
]
}
],
"source": [
"print('train_df.shape', train_df.shape) \n",
"print('test_df.shape', test_df.shape)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11203</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11204</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11205</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11206</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11207</th>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>22416 rows × 1 columns</p>\n",
"</div>"
],
"text/plain": [
" label\n",
"0 0\n",
"1 0\n",
"2 0\n",
"3 0\n",
"4 0\n",
"... ...\n",
"11203 0\n",
"11204 0\n",
"11205 0\n",
"11206 0\n",
"11207 0\n",
"\n",
"[22416 rows x 1 columns]"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"labels=pd.read_csv(\"pre_l.csv\")\n",
"labels._append(labels)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Id</th>\n",
" <th>reviewerID</th>\n",
" <th>asin</th>\n",
" <th>reviewText</th>\n",
" <th>overall</th>\n",
" <th>label</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>82947</td>\n",
" <td>37386</td>\n",
" <td>I REALLY wanted this series but I am in SHOCK ...</td>\n",
" <td>1</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>10154</td>\n",
" <td>23543</td>\n",
" <td>I have to say that this is a work of art for m...</td>\n",
" <td>4</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>5789</td>\n",
" <td>5724</td>\n",
" <td>Alien 3 is certainly the most controversal fil...</td>\n",
" <td>3</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>9198</td>\n",
" <td>5909</td>\n",
" <td>I love this film...preachy? Well, of course i...</td>\n",
" <td>5</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>33252</td>\n",
" <td>21214</td>\n",
" <td>Even though I previously bought the Gamera Dou...</td>\n",
" <td>5</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Id reviewerID asin reviewText \\\n",
"0 0 82947 37386 I REALLY wanted this series but I am in SHOCK ... \n",
"1 1 10154 23543 I have to say that this is a work of art for m... \n",
"2 2 5789 5724 Alien 3 is certainly the most controversal fil... \n",
"3 3 9198 5909 I love this film...preachy? Well, of course i... \n",
"4 4 33252 21214 Even though I previously bought the Gamera Dou... \n",
"\n",
" overall label \n",
"0 1 0.0 \n",
"1 4 0.0 \n",
"2 3 0.0 \n",
"3 5 0.0 \n",
"4 5 0.0 "
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_df['label']=labels\n",
"test_df.head()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
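
The notebook's assignment text asks for a hand-written Bagging implementation evaluated by AUC. A minimal sketch of that idea follows, under stated assumptions: the names `auc_score`, `Stump`, `bagging_fit`, and `bagging_score` are hypothetical and do not appear in the notebook, and the one-feature threshold `Stump` only stands in for a library base classifier so the example runs without dependencies.

```python
import random


def auc_score(y_true, scores):
    """Pairwise AUC: chance that a random positive outscores a random negative.

    Ties count as 0.5. Assumes both classes (0 and 1) are present.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


class Stump:
    """Hypothetical stand-in base learner: a single-feature threshold rule."""

    def fit(self, X, y):
        best = None
        for j in range(len(X[0])):                    # every feature
            for t in sorted({row[j] for row in X}):   # every observed threshold
                for sign in (1, -1):                  # both orientations
                    pred = [1 if sign * (row[j] - t) >= 0 else 0 for row in X]
                    err = sum(p != yi for p, yi in zip(pred, y))
                    if best is None or err < best[0]:
                        best = (err, j, t, sign)
        _, self.j, self.t, self.sign = best
        return self

    def predict(self, X):
        return [1 if self.sign * (row[self.j] - self.t) >= 0 else 0 for row in X]


def bagging_fit(X, y, make_base, n_estimators=10, seed=0):
    """Train n_estimators base learners, each on a bootstrap resample."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]    # sample n rows with replacement
        models.append(make_base().fit([X[i] for i in idx], [y[i] for i in idx]))
    return models


def bagging_score(models, X):
    """Fraction of base learners voting for class 1, usable as an AUC score."""
    votes = [m.predict(X) for m in models]
    return [sum(col) / len(models) for col in zip(*votes)]


# Toy data (not from the notebook): one feature, classes split at 10.
X = [[float(i)] for i in range(20)]
y = [0] * 10 + [1] * 10
models = bagging_fit(X, y, Stump, n_estimators=10, seed=0)
scores = bagging_score(models, X)
print("toy AUC:", auc_score(y, scores))
```

Any object whose `fit(X, y)` returns itself and that exposes `predict(X)`, such as `sklearn.tree.DecisionTreeClassifier` or `sklearn.svm.SVC`, should be usable as `make_base` in place of `Stump`. Averaging the votes rather than taking a hard majority gives a graded score, which is what AUC needs.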
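
The assignment in the notebook also asks for a hand-written AdaBoost. The sketch below shows the discrete AdaBoost reweighting loop under stated assumptions: `WeightedStump`, `adaboost_fit`, and `adaboost_score` are hypothetical names that do not appear in the notebook, labels are taken as -1/+1 (a 0/1 `label` column would first be mapped with `2 * label - 1`), and the weighted stump is only a dependency-free stand-in for a library base classifier.

```python
import math


class WeightedStump:
    """Hypothetical stand-in base learner: threshold rule minimizing weighted error."""

    def fit(self, X, y, sample_weight):
        best = None
        for j in range(len(X[0])):
            for t in sorted({row[j] for row in X}):
                for sign in (1, -1):
                    # Weighted error: total weight of the misclassified rows.
                    err = sum(
                        w for row, yi, w in zip(X, y, sample_weight)
                        if (1 if sign * (row[j] - t) >= 0 else -1) != yi
                    )
                    if best is None or err < best[0]:
                        best = (err, j, t, sign)
        _, self.j, self.t, self.sign = best
        return self

    def predict(self, X):
        return [1 if self.sign * (row[self.j] - self.t) >= 0 else -1 for row in X]


def adaboost_fit(X, y, make_base, n_rounds=10):
    """Discrete AdaBoost with labels in {-1, +1}; returns [(alpha, model), ...]."""
    n = len(X)
    w = [1.0 / n] * n                      # start from uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        model = make_base().fit(X, y, sample_weight=w)
        pred = model.predict(X)
        err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
        if err >= 0.5:                     # no better than chance: stop early
            break
        err = max(err, 1e-10)              # guard against log of infinity
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, model))
        # Raise the weight of misclassified rows, lower the rest, renormalize.
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, pred)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_score(ensemble, X):
    """Alpha-weighted vote; the sign is the predicted class, the value a ranking score."""
    scores = [0.0] * len(X)
    for alpha, model in ensemble:
        for i, p in enumerate(model.predict(X)):
            scores[i] += alpha * p
    return scores


# Toy data (not from the notebook): no single threshold separates the classes.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost_fit(X, y, WeightedStump, n_rounds=3)
scores = adaboost_score(ensemble, X)
print([round(s, 3) for s in scores])
```

The loop passes weights through `fit(X, y, sample_weight=...)`, the keyword that `sklearn.tree.DecisionTreeClassifier.fit` and `sklearn.svm.SVC.fit` also accept, so those classifiers should slot in as `make_base`. The raw weighted vote can be passed to an AUC function directly, since AUC depends only on the ranking of the scores.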