Commit 9f31f008 by 前钰

test1

{
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# License Plate Character Recognition with K-Nearest Neighbors"
]
},
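{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Case Overview\n",
"\n",
"Intelligent image processing has long been one of the most closely watched areas of artificial intelligence and plays an important role in bringing AI into practical use. License plate recognition is one of its earliest applications: it is now part of everyday life and can be found at parking lots and entrances everywhere. The task is usually split into two subtasks, character segmentation and character recognition. This case focuses on character recognition.\n",
"\n",
"## 2. Assignment\n",
"The dataset consists of license plate images that have already been segmented into single characters. It covers the digits 0-9, the letters A-Z (excluding O and I), and the province abbreviations, for 65 classes in total, numbered 0 to 64. The data is already split into a training set and a test set. Each folder is named after a label number, every image in a folder belongs to that class, and each image is a 20 x 20 binarized grayscale image. Use K-NN to recognize the segmented character images automatically. Third-party libraries such as scikit-learn are allowed; you do not need to implement K-NN yourself.\n",
"\n",
"### Basic tasks (80 points):\n",
"1. Data preprocessing: read the images, standardize them, represent each image as a one-dimensional vector, and keep its label. This is an essential step before training.\n",
"\n",
"2. Model training: build a K-NN model with the `KNeighborsClassifier` class from sklearn and fit it on the training set.\n",
"\n",
"3. Model evaluation: predict on the test set and compute the accuracy, for example with sklearn's `accuracy_score` function.\n",
"\n",
"4. Parameter analysis: examine how test accuracy changes as K varies. Plot the accuracy for each value of K.\n",
"\n",
"5. Effect of dataset size: analyze how the test results change as the training set size changes. Try training sets of several sizes, then record and analyze the results.\n",
"\n",
" \n",
"\n",
"### Extension tasks (20 points):\n",
"1. Distance metrics: analyze how different distance metrics in K-NN (such as Euclidean and Manhattan distance) affect model performance.\n",
"\n",
"2. Method comparison: compare uniform K-NN with weighted K-NN and analyze how the weighting affects the results. Uniform K-NN gives every neighbor the same vote, whereas weighted K-NN sets the weights by distance, so closer neighbors have a larger say.\n",
"\n",
"3. Data augmentation: license plate characters may appear under different lighting, angles, and sizes, so try augmentations such as rotation, scaling, shearing, and brightness or contrast changes to improve generalization. Libraries may be used (for example `from PIL import Image, ImageEnhance`).\n",
"\n",
"4. Class balancing: if the classes have unequal numbers of samples, K-NN performance may suffer. Try oversampling or undersampling to balance the classes. Libraries may be used (for example `from imblearn.over_sampling import SMOTE`)."
]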
},
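{
"cell_type": "markdown",
"metadata": {},
"source": [
"Tips:\n",
"\n",
"Image data is highly redundant: downsampling a 128x128 image to 64x64 usually does not hurt recognition, because each pixel is strongly correlated with its neighbors and so differs little from them. Reducing the dimensionality of the images is therefore worth considering: it speeds up training and can also improve accuracy."
]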
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.8.5 64-bit (conda)",
"name": "python385jvsc74a57bd007efdcd4b820c98a756949507a4d29d7862823915ec7477944641bea022f4f62"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
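The K-NN notebook above contrasts uniform and distance-weighted voting and asks for a comparison of distance metrics. The following is a minimal, dependency-free sketch of those two ideas, not the assignment's reference solution. The helper names (`flatten`, `distance`, `knn_predict`) and the toy data are hypothetical; for the actual assignment, sklearn's `KNeighborsClassifier` with its `n_neighbors`, `weights`, and `metric` parameters would be the more natural tool.

```python
import math
from collections import defaultdict


def flatten(image):
    """Turn a 2-D list of 0/255 pixels into a 1-D vector scaled to [0, 1]."""
    return [pixel / 255.0 for row in image for pixel in row]


def distance(a, b, metric="euclidean"):
    """Euclidean (default) or Manhattan distance between two vectors."""
    if metric == "manhattan":
        return sum(abs(x - y) for x, y in zip(a, b))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, query, k=3, weights="uniform", metric="euclidean"):
    """Predict the label of `query` by a vote among its k nearest training samples."""
    # Sort training samples by distance to the query and keep the k nearest.
    neighbours = sorted(
        (distance(x, query, metric), label) for x, label in zip(train_x, train_y)
    )[:k]
    votes = defaultdict(float)
    for dist, label in neighbours:
        # Uniform: one vote each. Distance: closer neighbours count for more.
        votes[label] += 1.0 if weights == "uniform" else 1.0 / (dist + 1e-9)
    return max(votes, key=votes.get)


# Toy data (made up for illustration): one class-0 sample close to the query,
# two class-1 samples farther away.
train_x = [[0.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
train_y = [0, 1, 1]
query = [1.0, 0.0]

uniform_label = knn_predict(train_x, train_y, query, k=3, weights="uniform")
weighted_label = knn_predict(train_x, train_y, query, k=3, weights="distance")
print(uniform_label, weighted_label)
```

On this toy data the two schemes disagree: uniform voting picks class 1 (two votes against one), while distance weighting picks class 0 because the single class-0 neighbor is much closer.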
{
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Predicting League of Legends Match Outcomes with Decision Trees"
]
},
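{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Task Introduction\n",
"\n",
"League of Legends (LoL) is a multiplayer online competitive game published by Riot Games. Each player controls a champion with unique abilities, and two teams of five, red and blue, fight to destroy the crystal in the opposing base. The crystal is protected by several defensive turrets, so a team usually has to destroy some turrets before it can destroy the crystal. Champions start out weak and earn gold and experience by killing minions, jungle monsters, and enemy champions. Experience raises champion and ability levels, while gold buys items that improve attack, defense, and other attributes. Areas with no friendly units nearby have no vision, so enemy units there cannot be seen. Both teams can place wards to watch a location, track the enemy's movements, and plan tactics.\n",
"\n",
"This dataset comes from Kaggle and contains 9,879 ranked solo/duo games from Diamond I to Master tier, so the two sides are at nearly the same skill level. Each record describes the first 10 minutes of a game, with 19 features per team and 38 features across the red and blue sides. The features include champion kills and deaths, gold, experience, levels, and so on. A game usually lasts 30 to 40 minutes, but the state of play in the first 10 minutes largely shapes the eventual outcome. As League of Legends is one of the most successful esports titles, quantifying and studying match and player data is valuable and can inform the game's future development and improvement.\n",
"\n",
"## 2. Assignment\n",
"### Basic tasks (80 points):\n",
"1. Carry out sensible feature engineering\n",
"\n",
"2. Split the data into a training set and a test set\n",
"\n",
"3. Predict the match outcome with a decision tree\n",
"\n",
"4. Compare how different feature combinations affect model performance\n",
"\n",
"5. Submit the code and an experiment report\n",
"\n",
"### Extension tasks (20 points):\n",
"1. Try implementing the details of the decision tree algorithm yourself\n",
"\n",
"2. Tune the decision tree's hyperparameters"
]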
}
],
"metadata": {
"kernelspec": {
"display_name": "pytorch_gpu",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
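The decision tree notebook above lists implementing the algorithm's details as an extension task. One such detail is choosing a split threshold by impurity. The sketch below shows Gini impurity and a best-threshold search on a single numeric feature. It is only an illustration under stated assumptions: the function names, the feature name `gold_diff`, and all values are made up, and a full tree would apply this search recursively over every feature.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split `value <= threshold`.

    If no split improves on the unsplit impurity, the threshold is None.
    """
    best = (None, gini(labels))
    candidates = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical feature: blue team's gold lead at 10 minutes; label 1 = blue wins.
gold_diff = [-1500, -800, -200, 300, 900, 2000]
blue_wins = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_threshold(gold_diff, blue_wins)
print(threshold, impurity)
```

Here the classes separate cleanly, so the search returns the midpoint between -200 and 300 with a weighted impurity of 0. Real match data will not split this neatly, which is where depth limits and other hyperparameters come in.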
{
{
"cells": [
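{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Predicting University Overall Scores with Regression Analysis\n",
"\n",
"## 1. Case Overview\n",
"\n",
"University ranking is an important problem, and also a difficult and contested one. A university's overall strength depends on many factors, including research, faculty, and students. As many as a hundred organizations worldwide now assess and rank universities by overall score, yet their results often disagree. Among them, the Center for World University Rankings (CWUR) is well known and highly influential for assessing quality of education, alumni employment, research output, and citations rather than relying on surveys and data submitted by universities.\n",
"In this project we work with the rankings that CWUR publishes for well-known universities worldwide (covering faculty, research, and more). On the one hand, we use data visualization to explore what distinguishes each university. On the other hand, we use a machine learning model (for example linear regression) to predict a university's overall score."
]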
},
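{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Assignment\n",
"Using the Kaggle dataset, fit a linear regression model that predicts a university's overall score from its individual ranking indicators. Third-party libraries such as scikit-learn are allowed; you do not need to implement linear regression yourself.\n",
"\n",
"Basic tasks (80 points):\n",
"- 1. Inspect and visualize the data to reveal its characteristics.\n",
"- 2. Randomly split the data into training and test sets at a 7:3 ratio, use RMSE (root mean squared error) as the evaluation metric, and report the linear regression model's RMSE on the test set.\n",
"- 3. Analyze the coefficients of the linear regression model.\n",
"- 4. Try other types of regression models and compare their performance.\n",
"\n",
"Advanced tasks (20 points):\n",
"- 1. Incorporate the discrete region feature into the linear regression model, then compare and analyze the results.\n",
"- 2. Use R² and VIF for model evaluation and feature selection, and check whether this improves model accuracy."
]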
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Data Preview"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>world_rank</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>institution</th>\n",
" <td>Harvard University</td>\n",
" <td>Massachusetts Institute of Technology</td>\n",
" <td>Stanford University</td>\n",
" </tr>\n",
" <tr>\n",
" <th>region</th>\n",
" <td>USA</td>\n",
" <td>USA</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>national_rank</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>quality_of_education</th>\n",
" <td>7</td>\n",
" <td>9</td>\n",
" <td>17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>citations</th>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>broad_impact</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>patents</th>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>score</th>\n",
" <td>100.0</td>\n",
" <td>91.67</td>\n",
" <td>89.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>year</th>\n",
" <td>2012</td>\n",
" <td>2012</td>\n",
" <td>2012</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>14 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 \\\n",
"world_rank 1 \n",
"institution Harvard University \n",
"region USA \n",
"national_rank 1 \n",
"quality_of_education 7 \n",
"... ... \n",
"citations 1 \n",
"broad_impact NaN \n",
"patents 5 \n",
"score 100.0 \n",
"year 2012 \n",
"\n",
" 1 \\\n",
"world_rank 2 \n",
"institution Massachusetts Institute of Technology \n",
"region USA \n",
"national_rank 2 \n",
"quality_of_education 9 \n",
"... ... \n",
"citations 4 \n",
"broad_impact NaN \n",
"patents 1 \n",
"score 91.67 \n",
"year 2012 \n",
"\n",
" 2 \n",
"world_rank 3 \n",
"institution Stanford University \n",
"region USA \n",
"national_rank 3 \n",
"quality_of_education 17 \n",
"... ... \n",
"citations 2 \n",
"broad_impact NaN \n",
"patents 15 \n",
"score 89.5 \n",
"year 2012 \n",
"\n",
"[14 rows x 3 columns]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"pd.set_option('display.max_rows', 10) # show at most 10 rows\n",
"np.set_printoptions(threshold=10)\n",
"\n",
"\n",
"data_df = pd.read_csv('./cwurData.csv') # read the csv file into a pandas DataFrame\n",
"data_df.head(3).T # take the first 3 rows and transpose them for easier viewing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"---\n",
"---\n",
"# <center>Answer Area</center>"
]
},
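{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Tips\n",
"\n",
"# 1. Visualize and analyze the data.\n",
"# 2. Pay attention to feature engineering: correlation analysis, and combining, rebuilding, or dropping features.\n",
"# 3. Pay attention to the choice of regression model, hyperparameter tuning, and model comparison."
]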
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
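The regression notebook above asks for a random 7:3 train/test split and RMSE on the test set. The sketch below walks through that evaluation loop with one feature and plain Python, so the arithmetic is visible. It is an illustration only: the helper names and the synthetic (rank, score) pairs are assumptions, and the real assignment would use the columns of `cwurData.csv` with several features, for example through sklearn's `train_test_split` and `LinearRegression`.

```python
import math
import random


def split_70_30(rows, seed=0):
    """Shuffle a copy of the rows and cut it 70% / 30%."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * 0.7))
    return shuffled[:cut], shuffled[cut:]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Synthetic data (not from the dataset): score falls by 0.5 per rank position.
rows = [(rank, 100.0 - 0.5 * rank) for rank in range(1, 21)]
train, test = split_70_30(rows)

slope, intercept = fit_line([r for r, _ in train], [s for _, s in train])
predictions = [slope * r + intercept for r, _ in test]
test_rmse = rmse([s for _, s in test], predictions)
print(len(train), len(test), round(slope, 3), round(test_rmse, 6))
```

Because the synthetic scores are exactly linear in the rank, the fitted slope is -0.5 and the test RMSE is zero up to floating-point error. On the real data the RMSE gives the typical prediction error in score points.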
{
{
"cells": [
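{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Predicting Amazon Review Quality with Ensemble Learning\n",
"\n",
"## Case Overview\n",
"When choosing products online, reviews are often one of the things we pay most attention to. However, review quality on e-commerce sites is uneven, and paid posters sometimes flood products with fake praise or malicious negative reviews, which seriously harms the shopping experience. Predicting review quality has therefore become a growing concern for e-commerce platforms: if quality can be assessed automatically, low-quality reviews can be kept from being shown. In this case we use ensemble learning to predict the quality of reviews from a real Amazon setting.\n",
"\n",
"## Assignment\n",
"In this case you implement two ensemble learning algorithms (Bagging and AdaBoost), each with two base classifiers, SVM and decision tree. This gives four sets of results to compare, with AUC as the evaluation metric:\n",
"1. Bagging + SVM\n",
"\n",
"2. Bagging + decision tree\n",
"\n",
"3. AdaBoost + SVM\n",
"\n",
"4. AdaBoost + decision tree\n",
"\n",
"Note that the core ensemble algorithms must be implemented by hand; the base classifiers may come from a library.\n",
"\n",
"### Basic assignment (80 points)\n",
"1. Design a feature representation based on the data format\n",
"\n",
"2. Report the AUC obtained with each combination\n",
"\n",
"3. Analyze the differences between the results in light of the characteristics of each ensemble algorithm\n",
"\n",
"(Using the ensemble learning algorithms from third-party libraries such as sklearn will cost points, at the grader's discretion.)\n",
"\n",
"### Extension assignment (20 points)\n",
"1. Try other base classifiers (such as k-NN, naive Bayes, or a neural network) and analyze the effect of different features\n",
"\n",
"2. Analyze the effect of the ensemble algorithms' parameters"
]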
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"# Despite the .xlsx extension, these files are parsed here as tab-separated text.\n",
"train_df = pd.read_csv('train.xlsx', sep='\\t')\n",
"test_df = pd.read_csv('test.xlsx', sep='\\t', index_col=False)\n",
"train_df.describe()"
]
},
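{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dataset fields\n",
"\n",
"reviewID: the user ID\n",
"\n",
"asin: the product ID\n",
"\n",
"reviewText: the review text\n",
"\n",
"overall: the user's rating of the product\n",
"\n",
"votes_up: the number of votes that found the review helpful\n",
"\n",
"votes_all: the total number of votes the review received\n",
"\n",
"label: the label"
]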
}
],
"metadata": {
"kernelspec": {
"display_name": "pytorch_gpu",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.11.3"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
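The review-quality notebook above requires the ensemble logic to be written by hand and scored with AUC. The sketch below shows the Bagging half of that: bootstrap resampling, one base learner per resample, and vote averaging, plus a pairwise AUC. It rests on several assumptions that are not in the assignment: a one-feature decision stump stands in for the SVM and decision tree base classifiers to keep the example dependency-free, the feature `review_length` and all values are made up, and AdaBoost is not covered.

```python
import random


def auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_stump(xs, ys):
    """Pick the threshold t minimising training error of the rule `x >= t -> positive`."""
    best_t, best_err = None, len(ys) + 1
    for t in sorted(set(xs)):
        err = sum((x >= t) != bool(y) for x, y in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t


def bagging_fit(xs, ys, n_estimators=25, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def bagging_score(stumps, x):
    """Fraction of stumps voting positive, used as the ranking score for AUC."""
    return sum(x >= t for t in stumps) / len(stumps)


# Hypothetical single feature (review length bucket) and quality labels.
review_length = [1, 2, 3, 4, 5, 6, 7, 8]
label = [0, 0, 0, 1, 0, 1, 1, 1]

stumps = bagging_fit(review_length, label)
scores = [bagging_score(stumps, x) for x in review_length]
print(round(auc(label, scores), 4))
```

Averaging the votes gives a continuous score, which AUC needs, whereas a hard majority vote would produce only 0 or 1. The AUC here is computed on the training samples for brevity; the assignment's figures should come from held-out data.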
{
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# User Churn Prediction for a Level-Based Mobile Game"
]
},
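{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Case Overview\n",
"\n",
"Mobile games dominate everyday entertainment today and have become an effective way for people to relax. In recent years many kinds of mobile games, especially casual level-based ones, have won a very broad market because they fit into fragmented spare time. New-user churn, however, is a serious problem for these games: a large share of new users give up after a short trial. If users with a high churn risk can be reached before they uninstall the game (for example with item rewards or a friendly text message), they may be won back, raising the game's activity and the company's potential revenue. Churn prediction is therefore an important and challenging problem. In this graduation project we start from unstructured log data from a real game, build a churn prediction model, and draw on what we have learned to design suitable algorithms for a practical problem."
]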
},
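{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Assignment\n",
"\n",
"* Using the real data provided (including users' play history and level features), predict whether each user in the test set is a churned user (binary classification);\n",
"* Any method is allowed; evaluation is run on Baidu Cloud, with AUC as the metric;\n",
"* Submit the code and an experiment report covering your observations and analysis of the data, the final solution, and a comparison of the approaches you tried;\n",
"* The final grade takes into account both the performance achieved and the analysis of the methods tried."
]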
},
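{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Data Overview\n",
"\n",
"This assignment uses data from a casual level-based mobile game. Users clear levels one after another, and the basic goal of each level is to reach a target within a limited number of moves. Each attempt may succeed or fail. Normally a user moves on to the next level only after completing the current one, and items or hints can be used for help during an attempt.\n",
"\n",
"For most mobile games, churn tends to happen early, so retention in the following week is a key concern for the company. The data covers the interactions from 2.1 to 2.4 of all users who registered on 2020.2.1, filtered so that every user logged in on at least two of the first four days. Churn is defined by logins in the following week (2.7-2.13): a user who did not log in during that week counts as churned.\n",
"\n",
"Unlike the structured data used in earlier assignments, this data is presented as rawer records, closer to a company's actual logs. It consists of 5 files:"
]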
},
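{
"cell_type": "markdown",
"metadata": {},
"source": [
"### train.csv\n",
"Training-set users: the user id (ids start from 1) and a label saying whether the user churned (1: churned, 0: retained). This file covers user_id 2774 to 10931.\n",
"\n",
"### test.csv\n",
"The test set contains only user ids; the task is to predict the churn probability of these users. The ids to predict are user_id 1 to 2773.\n",
"\n",
"### dev.csv\n",
"The validation set has the same format as the training set and is mainly intended for offline testing and model selection. This file covers user_id 10932 to 13589.\n",
"\n",
"### level_seq.csv\n",
"This is the core data file. It records every level each user played, one record per attempt at a level. The columns are:\n",
"\n",
"user_id: user id, matching the ids in the training, validation, and test sets\n",
"\n",
"level_id: level id\n",
"\n",
"f_success: whether the level was cleared (1: cleared, 0: failed)\n",
"\n",
"f_duration: time spent on this attempt (in seconds)\n",
"\n",
"f_reststep: ratio of remaining moves to the move limit (0 for a failed attempt)\n",
"\n",
"f_help: whether extra help such as items or hints was used (1: used, 0: not used)\n",
"\n",
"time: timestamp\n",
"\n",
"### level_meta.csv\n",
"Summary statistics for each level, which can be used to represent the level. The columns are:\n",
"\n",
"f_avg_duration: average time spent per attempt (in seconds, over both successful and failed attempts)\n",
"\n",
"f_avg_passrate: average pass rate\n",
"\n",
"f_avg_win_duration: average time spent per successful attempt (in seconds, over successful attempts only)\n",
"\n",
"f_avg_retrytimes: average number of retries (playing the same level a second time counts as the 1st retry)\n",
"\n",
"level_id: level id, matching the levels in level_seq.csv"
]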
},
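{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Tips\n",
"\n",
"A basic analysis and modeling workflow could be: first extract features for each user from their level-attempt records and other related information (such as the level statistics) → then combine them with the labels (churned or retained) to build a tabular dataset → then train and test different machine learning models (for example ensemble methods).\n",
"\n",
"If the data is too large and runs take too long, consider tuning and validating the model on a sampled, smaller training set first and then applying it to the full dataset.\n",
"\n",
"Combining several models usually improves predictive performance. For example, ensemble methods such as Bagging and AdaBoost can pool the strengths of multiple base classifiers and improve generalization.\n",
"\n",
"Open-source tools and libraries (for example scikit-learn) can also speed up development and experiments; they typically provide convenient functions for preprocessing, feature engineering, and model training."
]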
}
],
"metadata": {
"kernelspec": {
"display_name": "pytorch_gpu",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
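The churn notebook above suggests first turning each user's attempt log into a row of features and then joining those rows with the labels. The sketch below shows that aggregation step in plain Python, using the `level_seq.csv` column names from the data overview. The records, the labels, and the choice of aggregates are all made up for illustration; with pandas the same step would typically be a `groupby("user_id")` followed by aggregation.

```python
from collections import defaultdict


def user_features(records):
    """Aggregate per-attempt records into one feature dict per user."""
    grouped = defaultdict(list)
    for row in records:
        grouped[row["user_id"]].append(row)

    features = {}
    for user_id, rows in grouped.items():
        n = len(rows)
        features[user_id] = {
            "n_attempts": n,
            "success_rate": sum(r["f_success"] for r in rows) / n,
            "total_duration": sum(r["f_duration"] for r in rows),
            "help_rate": sum(r["f_help"] for r in rows) / n,
            "n_levels": len({r["level_id"] for r in rows}),
        }
    return features


# Made-up records in the shape of level_seq.csv (timestamp column omitted).
level_seq = [
    {"user_id": 1, "level_id": 1, "f_success": 1, "f_duration": 60, "f_reststep": 0.5, "f_help": 0},
    {"user_id": 1, "level_id": 2, "f_success": 0, "f_duration": 90, "f_reststep": 0.0, "f_help": 1},
    {"user_id": 1, "level_id": 2, "f_success": 1, "f_duration": 80, "f_reststep": 0.2, "f_help": 0},
    {"user_id": 2, "level_id": 1, "f_success": 0, "f_duration": 120, "f_reststep": 0.0, "f_help": 0},
]
labels = {1: 0, 2: 1}  # hypothetical: 1 = churned, 0 = retained

features = user_features(level_seq)
# One (feature dict, label) pair per user, ready to feed into a classifier.
table = [(features[user_id], labels[user_id]) for user_id in sorted(features)]
print(table[0])
```

Further columns, such as each level's `f_avg_passrate` from `level_meta.csv` or gaps between timestamps, could be merged in the same way before training.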