Commit 7bdcf60e by 前钰

Upload New File

parent 63bf3a3c
{
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 基于回归分析的大学综合得分预测\n",
"\n",
"## 1.案例简介\n",
"\n",
"大学排名的问题具有显著的重要性,同时也充满了挑战和争议。一所大学的全方位能力包括科研、师资和学生等多个因素。现在,全球有多达百家的评估机构致力于评估并排列大学的综合评分,然而,这些机构的评分结果经常存在不一致的情况。在这些机构当中,世界大学排名中心(Center for World University Rankings,简称CWUR)以其评估教育质量、校友就业、研究产出和引用,而非依赖调查和大学提交的数据的方式而著名,其影响力十分显著。\n",
"在本项目中,我们将依据CWUR提供的全球知名大学的各项排名(包括师资和科研等)来进行工作。一方面,我们将通过数据可视化来探究各个大学的独特性。另一方面,我们希望利用机器学习模型(例如线性回归)来预测大学的综合得分。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2.作业说明\n",
"我们将使用Kaggle的数据集,利用线性回归模型,依据大学各项排名的指标来预测其综合得分。可以使用 sk-learn 等第三方库,不要求自己实现线性回归.\n",
"\n",
"基础任务(80分):\n",
"- 1.观察和可视化数据,揭示数据的特性。\n",
"- 2.训练集和测试集应按照7:3的比例随机划分,采用RMSE(均方根误差)作为模型的评估标准,计算并获取测试集上的线性回归模型的RMSE值。\n",
"- 3.对线性回归模型中的系数进行分析。\n",
"- 4.尝试使用其他类型的回归模型,并比较其效果。\n",
"\n",
"进阶任务(20分):\n",
"- 1.尝试将地区的离散特征融入到线性回归模型中,然后比较并分析结果。\n",
"- 2.利用R2指标和VIF指标进行模型评价和特征筛选, 尝试是否可以增加模型精度。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. 数据展示"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>world_rank</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>institution</th>\n",
" <td>Harvard University</td>\n",
" <td>Massachusetts Institute of Technology</td>\n",
" <td>Stanford University</td>\n",
" </tr>\n",
" <tr>\n",
" <th>region</th>\n",
" <td>USA</td>\n",
" <td>USA</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>national_rank</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>quality_of_education</th>\n",
" <td>7</td>\n",
" <td>9</td>\n",
" <td>17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>citations</th>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>broad_impact</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>patents</th>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>score</th>\n",
" <td>100.0</td>\n",
" <td>91.67</td>\n",
" <td>89.5</td>\n",
" </tr>\n",
" <tr>\n",
" <th>year</th>\n",
" <td>2012</td>\n",
" <td>2012</td>\n",
" <td>2012</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>14 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 \\\n",
"world_rank 1 \n",
"institution Harvard University \n",
"region USA \n",
"national_rank 1 \n",
"quality_of_education 7 \n",
"... ... \n",
"citations 1 \n",
"broad_impact NaN \n",
"patents 5 \n",
"score 100.0 \n",
"year 2012 \n",
"\n",
" 1 \\\n",
"world_rank 2 \n",
"institution Massachusetts Institute of Technology \n",
"region USA \n",
"national_rank 2 \n",
"quality_of_education 9 \n",
"... ... \n",
"citations 4 \n",
"broad_impact NaN \n",
"patents 1 \n",
"score 91.67 \n",
"year 2012 \n",
"\n",
" 2 \n",
"world_rank 3 \n",
"institution Stanford University \n",
"region USA \n",
"national_rank 3 \n",
"quality_of_education 17 \n",
"... ... \n",
"citations 2 \n",
"broad_impact NaN \n",
"patents 15 \n",
"score 89.5 \n",
"year 2012 \n",
"\n",
"[14 rows x 3 columns]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np \n",
" \n",
"pd.set_option('display.max_rows', 10) # 设置显示最大行 \n",
"np.set_printoptions(threshold=10)\n",
"\n",
"\n",
"data_df = pd.read_csv('./cwurData.csv') # 读入 csv 文件为 pandas 的 DataFrame\n",
"data_df.head(3).T # 观察前几列并转置方便观察"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"---\n",
"---\n",
"# <center>答案区</center>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tips\n",
"\n",
"# 1. 数据可视化和分析\n",
"# 2. 注意特征工程,特征的相关性分析,特征的组合,重建和舍弃\n",
"# 3. 注意不同回归模型的选择,模型的参数调整,模型的比较"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment