liyinkai / 人工智能系统实战第三期
Commit db84319d, authored Nov 04, 2023 by 前钰
Parent: f44a9913
Showing 1 changed file with 377 additions and 0 deletions

人工智能系统实战第三期/实战代码/机器学习项目实战/集成学习与迁移学习/question.ipynb (new file, mode 100644)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
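"# Predicting Amazon Review Quality with Ensemble Learning\n",
"\n",
"## Case Overview\n",
"When choosing products online, reviews are often one of the things we pay most attention to. However, review quality on e-commerce sites varies widely, and paid posters sometimes flood products with fake praise or malicious negative reviews, which seriously harms the shopping experience. Predicting review quality has therefore become an increasingly important topic for e-commerce platforms: if review quality can be assessed automatically, low-quality reviews can be kept from being displayed based on the predictions. In this case study we use ensemble learning to predict review quality in a real-world Amazon setting.\n",
"\n",
"## Assignment\n",
"In this case study you will implement two ensemble learning algorithms (Bagging and AdaBoost), each with two base classifiers, SVM and decision tree. This gives four configurations to compare, with AUC as the evaluation metric:\n",
"1. Bagging + SVM\n",
"\n",
"2. Bagging + decision tree\n",
"\n",
"3. AdaBoost + SVM\n",
"\n",
"4. AdaBoost + decision tree\n",
"\n",
"Note that the core ensemble algorithms must be implemented by hand; the base classifiers may come from a library.\n",
"\n",
"### Basic assignment (80 points)\n",
"1. Design a feature representation based on the data format\n",
"\n",
"2. Report the AUC obtained for each combination\n",
"\n",
"3. Analyze the differences between the results in light of the characteristics of each ensemble algorithm\n",
"\n",
"(Using the ensemble learning algorithms of third-party libraries such as sklearn will cost points at the grader's discretion)\n",
"\n",
"### Extended assignment (20 points)\n",
"1. Try other base classifiers (e.g. k-NN, naive Bayes, neural networks) and analyze the effect of different features\n",
"\n",
"2. Analyze the effect of the ensemble algorithms' parameters"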
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.naive_bayes import MultinomialNB, BernoulliNB, ComplementNB, GaussianNB  # naive Bayes classifier variants\n",
"from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer  # text feature extraction: term-count and TF-IDF vectorizers\n",
"from sklearn import preprocessing, tree, ensemble, svm, metrics, calibration  # preprocessing, decision trees, ensemble methods, SVMs, metrics, calibration\n",
"from sklearn.model_selection import cross_val_score, train_test_split\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.metrics import auc, accuracy_score\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.feature_extraction import text\n",
"from matplotlib import pyplot as plt\n",
"from itertools import combinations\n",
"from wordcloud import WordCloud  # word cloud generation\n",
"from collections import Counter\n",
"from textblob import TextBlob  # text sentiment analysis\n",
"from tqdm import tqdm\n",
"import pandas as pd\n",
"import numpy as np\n",
"import random\n",
"import math\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": " reviewerID asin overall votes_up votes_all \\\ncount 57039.000000 57039.000000 57039.000000 57039.000000 57039.000000 \nmean 33359.761865 19973.170866 3.535178 12.387594 18.475850 \nstd 30016.804127 14104.410152 1.529742 45.130499 50.149683 \nmin 50.000000 0.000000 1.000000 0.000000 5.000000 \n25% 9235.000000 8218.000000 2.000000 4.000000 6.000000 \n50% 22589.000000 17635.000000 4.000000 6.000000 10.000000 \n75% 53170.000000 30875.000000 5.000000 11.000000 18.000000 \nmax 123767.000000 50051.000000 5.000000 6084.000000 6510.000000 \n\n label \ncount 57039.000000 \nmean 0.226196 \nstd 0.418371 \nmin 0.000000 \n25% 0.000000 \n50% 0.000000 \n75% 0.000000 \nmax 1.000000 ",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>reviewerID</th>\n <th>asin</th>\n <th>overall</th>\n <th>votes_up</th>\n <th>votes_all</th>\n <th>label</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>count</th>\n <td>57039.000000</td>\n <td>57039.000000</td>\n <td>57039.000000</td>\n <td>57039.000000</td>\n <td>57039.000000</td>\n <td>57039.000000</td>\n </tr>\n <tr>\n <th>mean</th>\n <td>33359.761865</td>\n <td>19973.170866</td>\n <td>3.535178</td>\n <td>12.387594</td>\n <td>18.475850</td>\n <td>0.226196</td>\n </tr>\n <tr>\n <th>std</th>\n <td>30016.804127</td>\n <td>14104.410152</td>\n <td>1.529742</td>\n <td>45.130499</td>\n <td>50.149683</td>\n <td>0.418371</td>\n </tr>\n <tr>\n <th>min</th>\n <td>50.000000</td>\n <td>0.000000</td>\n <td>1.000000</td>\n <td>0.000000</td>\n <td>5.000000</td>\n <td>0.000000</td>\n </tr>\n <tr>\n <th>25%</th>\n <td>9235.000000</td>\n <td>8218.000000</td>\n <td>2.000000</td>\n <td>4.000000</td>\n <td>6.000000</td>\n <td>0.000000</td>\n </tr>\n <tr>\n <th>50%</th>\n <td>22589.000000</td>\n <td>17635.000000</td>\n <td>4.000000</td>\n <td>6.000000</td>\n <td>10.000000</td>\n <td>0.000000</td>\n </tr>\n <tr>\n <th>75%</th>\n <td>53170.000000</td>\n <td>30875.000000</td>\n <td>5.000000</td>\n <td>11.000000</td>\n <td>18.000000</td>\n <td>0.000000</td>\n </tr>\n <tr>\n <th>max</th>\n <td>123767.000000</td>\n <td>50051.000000</td>\n <td>5.000000</td>\n <td>6084.000000</td>\n <td>6510.000000</td>\n <td>1.000000</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df = pd.read_csv('train.xlsx', sep='\\t')\n",
"test_df = pd.read_csv('test.xlsx', sep='\\t',index_col=False)\n",
"train_df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
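"Dataset fields\n",
"\n",
"reviewerID is the ID of the user who wrote the review\n",
"\n",
"asin is the product ID\n",
"\n",
"reviewText is the text of the review\n",
"\n",
"overall is the rating the user gave the product\n",
"\n",
"votes_up is the number of votes that marked the review as helpful\n",
"\n",
"votes_all is the total number of votes the review received\n",
"\n",
"label is the target label"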
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"train_df.shape (57039, 7)\n",
"test_df.shape (22418, 5)\n"
]
}
],
"source": [
"print('train_df.shape', train_df.shape) \n",
"print('test_df.shape', test_df.shape)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11203</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11204</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11205</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11206</th>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11207</th>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>22416 rows × 1 columns</p>\n",
"</div>"
],
"text/plain": [
" label\n",
"0 0\n",
"1 0\n",
"2 0\n",
"3 0\n",
"4 0\n",
"... ...\n",
"11203 0\n",
"11204 0\n",
"11205 0\n",
"11206 0\n",
"11207 0\n",
"\n",
"[22416 rows x 1 columns]"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
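"labels = pd.read_csv(\"pre_l.csv\")\n",
"# Preview the labels stacked twice; the result is not assigned, so `labels` itself is unchanged.\n",
"pd.concat([labels, labels])"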
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Id</th>\n",
" <th>reviewerID</th>\n",
" <th>asin</th>\n",
" <th>reviewText</th>\n",
" <th>overall</th>\n",
" <th>label</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>82947</td>\n",
" <td>37386</td>\n",
" <td>I REALLY wanted this series but I am in SHOCK ...</td>\n",
" <td>1</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>10154</td>\n",
" <td>23543</td>\n",
" <td>I have to say that this is a work of art for m...</td>\n",
" <td>4</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>5789</td>\n",
" <td>5724</td>\n",
" <td>Alien 3 is certainly the most controversal fil...</td>\n",
" <td>3</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>9198</td>\n",
" <td>5909</td>\n",
" <td>I love this film...preachy? Well, of course i...</td>\n",
" <td>5</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>33252</td>\n",
" <td>21214</td>\n",
" <td>Even though I previously bought the Gamera Dou...</td>\n",
" <td>5</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Id reviewerID asin reviewText \\\n",
"0 0 82947 37386 I REALLY wanted this series but I am in SHOCK ... \n",
"1 1 10154 23543 I have to say that this is a work of art for m... \n",
"2 2 5789 5724 Alien 3 is certainly the most controversal fil... \n",
"3 3 9198 5909 I love this film...preachy? Well, of course i... \n",
"4 4 33252 21214 Even though I previously bought the Gamera Dou... \n",
"\n",
" overall label \n",
"0 1 0.0 \n",
"1 4 0.0 \n",
"2 3 0.0 \n",
"3 5 0.0 \n",
"4 5 0.0 "
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_df['label']=labels\n",
"test_df.head()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
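
The notebook above stops after loading the data, so the hand-written ensemble step asked for in the assignment is not shown. The following is a minimal sketch of what Bagging and discrete AdaBoost could look like with a library base classifier. It is not the author's solution: the helper names (`bagging_fit`, `bagging_score`, `adaboost_fit`, `adaboost_score`), the synthetic data from `make_classification`, and the depth-2 decision tree are all assumptions made for illustration. To use an SVM instead, `sklearn.svm.SVC(probability=True)` would be needed for the probability averaging in `bagging_score`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def bagging_fit(base, X, y, n_estimators=10, seed=0):
    """Train n_estimators clones of `base`, each on a bootstrap resample."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
        models.append(clone(base).fit(X[idx], y[idx]))
    return models


def bagging_score(models, X):
    """Average the positive-class probabilities of all ensemble members."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)


def adaboost_fit(base, X, y, n_rounds=10):
    """Discrete AdaBoost for labels in {0, 1}; returns (models, alphas)."""
    y_pm = np.where(y == 1, 1, -1)           # map labels to {-1, +1}
    w = np.full(len(X), 1.0 / len(X))        # start from uniform weights
    models, alphas = [], []
    for _ in range(n_rounds):
        model = clone(base).fit(X, y, sample_weight=w)
        pred = np.where(model.predict(X) == 1, 1, -1)
        err = np.sum(w[pred != y_pm])        # weighted training error
        if err >= 0.5:                       # no better than chance: stop
            break
        err = max(err, 1e-10)                # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)
        models.append(model)
        alphas.append(alpha)
        w *= np.exp(-alpha * y_pm * pred)    # up-weight misclassified samples
        w /= w.sum()
    return models, alphas


def adaboost_score(models, alphas, X):
    """Weighted vote in {-1, +1} space; larger means more likely positive."""
    votes = [a * np.where(m.predict(X) == 1, 1, -1)
             for m, a in zip(models, alphas)]
    return np.sum(votes, axis=0)


# Synthetic stand-in for the review features (assumption, not the real data).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

base_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
bag = bagging_fit(base_tree, X_tr, y_tr)
boost, alphas = adaboost_fit(base_tree, X_tr, y_tr)

auc_bag = roc_auc_score(y_te, bagging_score(bag, X_te))
auc_boost = roc_auc_score(y_te, adaboost_score(boost, alphas, X_te))
print(f"Bagging AUC: {auc_bag:.3f}  AdaBoost AUC: {auc_boost:.3f}")
```

Both scoring functions return continuous scores rather than hard labels, which is what `roc_auc_score` expects. Bagging trains its members independently on resampled data, while AdaBoost trains them sequentially on reweighted data, so the two would be expected to respond differently to the strength of the base classifier.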