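💡 Author: 韩信子@ShowMeAI
📘 Data Analysis in Practice series: https://www.showmeai.tech/tutorials/40
📘 Machine Learning in Practice series: https://www.showmeai.tech/tutorials/41
📘 This article: https://www.showmeai.tech/article-detail/400
📢 Notice: All rights reserved. To repost, contact the platform and the author and credit the source.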
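The FIFA World Cup 2022 is over, and the debate about which team would win it has a definitive answer. Congratulations to Messi! Congratulations to Argentina! Before the tournament, ShowMeAI used data science and machine learning to build a model, trained on historical data, that predicts the results of the FIFA World Cup 2022 matches. Now that the dust has settled, let's see how far the model's predictions were from what actually happened.
We visualized the model's predictions and compared them with the results summary published on the official site.
From the group stage through the quarter-finals, the model's predictions were fairly accurate. From the semi-finals on, however, accuracy collapsed: the model got none of the participating teams and none of the winners right, and it missed the champion, the runner-up, and the third-placed team.
But that is exactly the charm of football. The uncertainty inherent in competitive sport is what lets us feel more deeply what struggle, courage, heroes, and dreams mean. (What follows is the complete modeling process, written before the tournament.)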
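We start by preparing data for modeling: we need data that reflects how each team has performed. We use two FIFA-related datasets: 🏆 historical match results from 1872 to 2022 and 🏆 FIFA ranking data. Both are available on Kaggle, and also from ShowMeAI's Baidu Netdisk.
🏆 Dataset download (Baidu Netdisk): reply 『实战』 to the WeChat official account 『ShowMeAI研究中心』, or click here, to get the 『FIFA 2022数据集』 (FIFA 2022 dataset) for article [35] 基于机器学习的2022世界杯预测实战 (Predicting the 2022 World Cup with Machine Learning)
⭐ ShowMeAI official GitHub: https://github.com/ShowMeAI-Hub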
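Which features affect the outcome of a football match? It is an open question with many dimensions, from the players selected to the temperature on the pitch that day. We keep things simple and build the dataset only from the past statistics of each team involved in a match, prioritizing quantifiable statistics that are easy to collect, such as goals scored, average ranking, and points won. All of these can be derived from the two datasets above.
We also restrict the analysis to matches played from August 2018 onward (the 2022 World Cup cycle), so that the data reflects how the squads changed during the years of preparation for this tournament. The code below builds the dataset: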
import pandas as pd

# Games between national teams
df = pd.read_csv("results.csv")
df["date"] = pd.to_datetime(df["date"])
# Keep only games from the 2022 World Cup cycle
df = df[(df["date"] >= "2018-8-1")].reset_index(drop=True)
df_wc = df

# FIFA rankings, restricted to the same period
rank = pd.read_csv("fifa_ranking-2022-10-06.csv")
rank["rank_date"] = pd.to_datetime(rank["rank_date"])
rank = rank[(rank["rank_date"] >= "2018-8-1")].reset_index(drop=True)

# Align some team names with the results dataset
rank["country_full"] = rank["country_full"].str.replace("IR Iran", "Iran").str.replace("Korea Republic", "South Korea").str.replace("USA", "United States")

# One row per team per day, carrying the last published ranking forward
rank = rank.set_index(['rank_date']).groupby(['country_full'], group_keys=False).resample('D').first().ffill().reset_index()
rank_wc = rank

# Attach the home team's, then the away team's, ranking on the match date
rank_cols = ["country_full", "total_points", "previous_points", "rank", "rank_change", "rank_date"]
df_wc_ranked = df_wc.merge(rank[rank_cols], left_on=["date", "home_team"], right_on=["rank_date", "country_full"]).drop(["rank_date", "country_full"], axis=1)
df_wc_ranked = df_wc_ranked.merge(rank[rank_cols], left_on=["date", "away_team"], right_on=["rank_date", "country_full"], suffixes=("_home", "_away")).drop(["rank_date", "country_full"], axis=1)
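The resample-and-forward-fill step exists so that every match date has a ranking row to join against. As a minimal sketch of the same lookup on toy data, `pandas.merge_asof` can attach the most recent ranking published on or before each match date without materialising daily rows. This is a swapped-in technique rather than the article's method, and the frames and values below are made up for illustration.

```python
import pandas as pd

# Toy ranking snapshots (illustrative values, not real FIFA data)
rank_toy = pd.DataFrame({
    "rank_date": pd.to_datetime(["2022-01-01", "2022-01-05", "2022-01-01", "2022-01-05"]),
    "country_full": ["Brazil", "Brazil", "Qatar", "Qatar"],
    "rank": [2, 1, 50, 48],
})
games_toy = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-03", "2022-01-06"]),
    "home_team": ["Brazil", "Qatar"],
})

# merge_asof needs both frames sorted by their time key; the default
# direction="backward" picks the latest ranking on or before the match date
merged_toy = pd.merge_asof(
    games_toy.sort_values("date"),
    rank_toy.sort_values("rank_date"),
    left_on="date", right_on="rank_date",
    left_by="home_team", right_by="country_full",
)
print(merged_toy[["date", "home_team", "rank"]])
```

Whether this is preferable depends on the data: the daily expansion is simpler to reason about, while the as-of join avoids creating one row per team per day.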
The resulting dataset, df_wc_ranked, has one row per match with both teams' ranking columns attached.
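With the data ready, we move on to feature engineering, building features with predictive power from the raw data. For each team we compute the following statistics from its earlier games in the cycle, both over all past games and over the last five: goals scored, goals conceded, the FIFA rank of the opponents faced, FIFA ranking points gained, game points won (3 for a win, 1 for a draw, 0 for a loss), and game points divided by the opponent's rank. We also use the rank difference between the two teams, each team's rank change, and whether the match is a friendly.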
We chose the features above for the following reasons:
df = df_wc_ranked

def result_finder(home, away):
    # Returns [result, home points, away points]; result: 0 = home win, 1 = away win, 2 = draw
    if home > away:
        return pd.Series([0, 3, 0])
    if home < away:
        return pd.Series([1, 0, 3])
    else:
        return pd.Series([2, 1, 1])

results = df.apply(lambda x: result_finder(x["home_score"], x["away_score"]), axis=1)
df[["result", "home_team_points", "away_team_points"]] = results

df["rank_dif"] = df["rank_home"] - df["rank_away"]
df["sg"] = df["home_score"] - df["away_score"]  # goal difference
df["points_home_by_rank"] = df["home_team_points"]/df["rank_away"]
df["points_away_by_rank"] = df["away_team_points"]/df["rank_home"]

home_team = df[["date", "home_team", "home_score", "away_score", "rank_home", "rank_away","rank_change_home", "total_points_home", "result", "rank_dif", "points_home_by_rank", "home_team_points"]]
away_team = df[["date", "away_team", "away_score", "home_score", "rank_away", "rank_home","rank_change_away", "total_points_away", "result", "rank_dif", "points_away_by_rank", "away_team_points"]]

# Give both frames the same column names: own stats unprefixed, opponent stats marked "suf"
home_team.columns = [h.replace("home_", "").replace("_home", "").replace("away_", "suf_").replace("_away", "_suf") for h in home_team.columns]
away_team.columns = [a.replace("away_", "").replace("_away", "").replace("home_", "suf_").replace("_home", "_suf") for a in away_team.columns]

# DataFrame.append was removed in pandas 2.0; pd.concat keeps the same row order
team_stats = pd.concat([home_team, away_team])
team_stats_raw = team_stats.copy()

stats_val = []

for index, row in team_stats.iterrows():
    team = row["team"]
    date = row["date"]
    # Only games played strictly before this one, most recent first
    past_games = team_stats.loc[(team_stats["team"] == team) & (team_stats["date"] < date)].sort_values(by=['date'], ascending=False)
    last5 = past_games.head(5)

    goals = past_games["score"].mean()
    goals_l5 = last5["score"].mean()

    goals_suf = past_games["suf_score"].mean()
    goals_suf_l5 = last5["suf_score"].mean()

    rank = past_games["rank_suf"].mean()
    rank_l5 = last5["rank_suf"].mean()

    if len(last5) > 0:
        # FIFA ranking points gained over the period
        points = past_games["total_points"].values[0] - past_games["total_points"].values[-1]
        points_l5 = last5["total_points"].values[0] - last5["total_points"].values[-1]
    else:
        points = 0
        points_l5 = 0

    gp = past_games["team_points"].mean()
    gp_l5 = last5["team_points"].mean()

    gp_rank = past_games["points_by_rank"].mean()
    gp_rank_l5 = last5["points_by_rank"].mean()

    stats_val.append([goals, goals_l5, goals_suf, goals_suf_l5, rank, rank_l5, points, points_l5, gp, gp_l5, gp_rank, gp_rank_l5])

stats_cols = ["goals_mean", "goals_mean_l5", "goals_suf_mean", "goals_suf_mean_l5", "rank_mean", "rank_mean_l5", "points_mean", "points_mean_l5", "game_points_mean", "game_points_mean_l5", "game_points_rank_mean", "game_points_rank_mean_l5"]

stats_df = pd.DataFrame(stats_val, columns=stats_cols)

full_df = pd.concat([team_stats.reset_index(drop=True), stats_df], axis=1, ignore_index=False)

# The first half of the rows are home-team records, the second half away-team records
home_team_stats = full_df.iloc[:int(full_df.shape[0]/2),:]
away_team_stats = full_df.iloc[int(full_df.shape[0]/2):,:]

home_team_stats = home_team_stats[home_team_stats.columns[-12:]]
away_team_stats = away_team_stats[away_team_stats.columns[-12:]]

home_team_stats.columns = ['home_'+str(col) for col in home_team_stats.columns]
away_team_stats.columns = ['away_'+str(col) for col in away_team_stats.columns]

match_stats = pd.concat([home_team_stats, away_team_stats.reset_index(drop=True)], axis=1, ignore_index=False)

full_df = pd.concat([df, match_stats.reset_index(drop=True)], axis=1, ignore_index=False)

def find_friendly(x):
    if x == "Friendly":
        return 1
    else:
        return 0

full_df["is_friendly"] = full_df["tournament"].apply(lambda x: find_friendly(x))
full_df = pd.get_dummies(full_df, columns=["is_friendly"])

base_df = full_df[["date", "home_team", "away_team", "rank_home", "rank_away","home_score", "away_score","result", "rank_dif", "rank_change_home", "rank_change_away", 'home_goals_mean',
       'home_goals_mean_l5', 'home_goals_suf_mean', 'home_goals_suf_mean_l5',
       'home_rank_mean', 'home_rank_mean_l5', 'home_points_mean',
       'home_points_mean_l5', 'away_goals_mean', 'away_goals_mean_l5',
       'away_goals_suf_mean', 'away_goals_suf_mean_l5', 'away_rank_mean',
       'away_rank_mean_l5', 'away_points_mean', 'away_points_mean_l5','home_game_points_mean', 'home_game_points_mean_l5',
       'home_game_points_rank_mean', 'home_game_points_rank_mean_l5','away_game_points_mean',
       'away_game_points_mean_l5', 'away_game_points_rank_mean',
       'away_game_points_rank_mean_l5',
       'is_friendly_0', 'is_friendly_1']]

base_df.tail()
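The loop above computes, for every game, statistics over the team's strictly earlier games. A minimal sketch of that "past games only" idea on toy data, using `groupby` with `shift`, `expanding` and `rolling` instead of an explicit loop, might look like the following. It assumes each team plays at most one game per date, and the frame and column values are invented for illustration.

```python
import pandas as pd

# Toy per-team game log (illustrative values), sorted by team and date
games = pd.DataFrame({
    "team": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(["2022-01-01", "2022-02-01", "2022-03-01",
                            "2022-01-15", "2022-02-15"]),
    "score": [1, 3, 2, 0, 4],
}).sort_values(["team", "date"])

by_team = games.groupby("team")["score"]

# shift() removes the current game, so each row only sees earlier games
games["goals_mean"] = by_team.transform(lambda s: s.shift().expanding().mean())
# Same idea restricted to the five most recent earlier games
games["goals_mean_l5"] = by_team.transform(
    lambda s: s.shift().rolling(5, min_periods=1).mean()
)
print(games)
```

A team's first game has no history, so its statistics are NaN; the article handles those rows later by dropping them with `dropna()`.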
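Before modeling, we analyze the data a little. A match has three possible outcomes: win, draw, or loss. Modeling this as a three-class problem runs into serious class imbalance and makes evaluation more awkward, so we merge the outcomes into two classes: "home team wins" and "home team draws or loses".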
We analyze the distribution of each feature by outcome (home win vs. draw or loss), using violin plots.
import matplotlib.pyplot as plt
import seaborn as sns

base_df_no_fg = base_df.dropna()
df = base_df_no_fg.copy()  # work on a copy so adding a column does not trigger a chained-assignment warning

def no_draw(x):
    # Merge draws (2) into class 1: target 0 = home win, 1 = draw or away win
    if x == 2:
        return 1
    else:
        return x

df["target"] = df["result"].apply(lambda x: no_draw(x))

data1 = df[list(df.columns[8:20].values) + ["target"]]
# Standardize the features so they share one axis
scaled = (data1[:-1] - data1[:-1].mean()) / data1[:-1].std()
scaled["target"] = data1["target"]
violin1 = pd.melt(scaled,id_vars="target", var_name="features", value_name="value")

plt.figure(figsize=(15,10))
sns.violinplot(x="features", y="value", hue="target", data=violin1,split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()
data2 = df[df.columns[20:]]
scaled = (data2[:-1] - data2[:-1].mean()) / data2[:-1].std()
scaled["target"] = data2["target"]
violin2 = pd.melt(scaled,id_vars="target", var_name="features", value_name="value")

plt.figure(figsize=(15,10))
sns.violinplot(x="features", y="value", hue="target", data=violin2,split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()
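In the first group of features, rank_dif (the difference between the two teams' ranks) is the only one that visibly separates the target classes. Difference features therefore look like strong signals, so we create more of them: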
dif = df.copy()
dif.loc[:, "goals_dif"] = dif["home_goals_mean"] - dif["away_goals_mean"]
dif.loc[:, "goals_dif_l5"] = dif["home_goals_mean_l5"] - dif["away_goals_mean_l5"]
dif.loc[:, "goals_suf_dif"] = dif["home_goals_suf_mean"] - dif["away_goals_suf_mean"]
dif.loc[:, "goals_suf_dif_l5"] = dif["home_goals_suf_mean_l5"] - dif["away_goals_suf_mean_l5"]
dif.loc[:, "goals_made_suf_dif"] = dif["home_goals_mean"] - dif["away_goals_suf_mean"]
dif.loc[:, "goals_made_suf_dif_l5"] = dif["home_goals_mean_l5"] - dif["away_goals_suf_mean_l5"]
dif.loc[:, "goals_suf_made_dif"] = dif["home_goals_suf_mean"] - dif["away_goals_mean"]
dif.loc[:, "goals_suf_made_dif_l5"] = dif["home_goals_suf_mean_l5"] - dif["away_goals_mean_l5"]
We analyze the new features with violin plots again.
data_difs = dif.iloc[:, -8:]
scaled = (data_difs - data_difs.mean()) / data_difs.std()
scaled["target"] = data2["target"]
violin = pd.melt(scaled,id_vars="target", var_name="features", value_name="value")

plt.figure(figsize=(10,10))
sns.violinplot(x="features", y="value", hue="target", data=violin,split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()
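The differences in goals scored and in goals conceded separate the target classes well. The features that compare one team's goals scored with the other team's goals conceded, however, have no visible effect, so we leave them out and look at further differences.
We can also compute the difference in game points, the difference in the rank of opponents faced, and the difference in game points weighted by opponent rank. To account for the strength of the opposition, we additionally consider goals scored and goals conceded scaled by the rank of the opponents faced.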
dif.loc[:, "dif_points"] = dif["home_game_points_mean"] - dif["away_game_points_mean"]
dif.loc[:, "dif_points_l5"] = dif["home_game_points_mean_l5"] - dif["away_game_points_mean_l5"]
dif.loc[:, "dif_points_rank"] = dif["home_game_points_rank_mean"] - dif["away_game_points_rank_mean"]
dif.loc[:, "dif_points_rank_l5"] = dif["home_game_points_rank_mean_l5"] - dif["away_game_points_rank_mean_l5"]
dif.loc[:, "dif_rank_agst"] = dif["home_rank_mean"] - dif["away_rank_mean"]
dif.loc[:, "dif_rank_agst_l5"] = dif["home_rank_mean_l5"] - dif["away_rank_mean_l5"]
dif.loc[:, "goals_per_ranking_dif"] = (dif["home_goals_mean"] / dif["home_rank_mean"]) - (dif["away_goals_mean"] / dif["away_rank_mean"])
dif.loc[:, "goals_per_ranking_suf_dif"] = (dif["home_goals_suf_mean"] / dif["home_rank_mean"]) - (dif["away_goals_suf_mean"] / dif["away_rank_mean"])
dif.loc[:, "goals_per_ranking_dif_l5"] = (dif["home_goals_mean_l5"] / dif["home_rank_mean"]) - (dif["away_goals_mean_l5"] / dif["away_rank_mean"])
dif.loc[:, "goals_per_ranking_suf_dif_l5"] = (dif["home_goals_suf_mean_l5"] / dif["home_rank_mean"]) - (dif["away_goals_suf_mean_l5"] / dif["away_rank_mean"])
We analyze these features with violin plots and box plots:
data_difs = dif.iloc[:, -10:]
scaled = (data_difs - data_difs.mean()) / data_difs.std()
scaled["target"] = data2["target"]
violin = pd.melt(scaled,id_vars="target", var_name="features", value_name="value")

plt.figure(figsize=(15,10))
sns.violinplot(x="features", y="value", hue="target", data=violin,split=True, inner="quart")
plt.xticks(rotation=90)
plt.show()

plt.figure(figsize=(15,10))
sns.boxplot(x="features", y="value", hue="target", data=violin)
plt.xticks(rotation=90)
plt.show()
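The difference in game points, the difference in goals per opponent rank, and the difference in game points per opponent rank are good features. Some features, however, are very highly correlated with each other, so we examine their joint distributions with jointplot: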
sns.jointplot(data = data_difs, x = 'dif_rank_agst', y = 'dif_rank_agst_l5', kind="reg")
plt.show()

sns.jointplot(data = data_difs, x = 'goals_per_ranking_dif', y = 'goals_per_ranking_dif_l5', kind="reg")
plt.show()

sns.jointplot(data = data_difs, x = 'dif_points_rank', y = 'dif_points_rank_l5', kind="reg")
plt.show()

sns.jointplot(data = data_difs, x = 'dif_points', y = 'dif_points_l5', kind="reg")
plt.show()
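The visual check in the joint plots can be complemented with a numeric one. The sketch below lists feature pairs whose absolute Pearson correlation exceeds a threshold, using synthetic data. The helper name `correlated_pairs`, the toy columns and the 0.9 threshold are all assumptions made for illustration, not values taken from the article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
full = rng.normal(size=200)

# Synthetic features: "feat_l5" is almost a copy of "feat", "other" is unrelated
toy = pd.DataFrame({
    "feat": full,
    "feat_l5": full + rng.normal(scale=0.1, size=200),
    "other": rng.normal(size=200),
})

def correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, |r|) for column pairs with |Pearson r| above threshold."""
    corr = frame.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = float(corr.iloc[i, j])
            if r > threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return pairs

print(correlated_pairs(toy))
```

Applied to `data_difs`, a call such as `correlated_pairs(data_difs)` would flag the full-cycle and last-five versions that carry nearly the same information.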
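The correlation analysis shows that where the full-cycle and last-five versions of a feature are strongly correlated, one of the pair is enough; in those cases we keep the full-cycle version. The features we keep in the end are: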
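Rank difference (rank_dif)
Difference in goals scored (goals_dif / goals_dif_l5)
Difference in goals conceded (goals_suf_dif / goals_suf_dif_l5)
Difference in rank of opponents faced (dif_rank_agst / dif_rank_agst_l5)
Difference in goals per opponent rank (goals_per_ranking_dif)
Difference in game points per opponent rank (dif_points_rank / dif_points_rank_l5)
Whether the match is a friendly (is_friendly)
This gives us the final dataset, which contains all the features the machine learning models need.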
def create_db(df):
    columns = ["home_team", "away_team", "target", "rank_dif", "home_goals_mean", "home_rank_mean", "away_goals_mean", "away_rank_mean", "home_rank_mean_l5", "away_rank_mean_l5", "home_goals_suf_mean", "away_goals_suf_mean", "home_goals_mean_l5", "away_goals_mean_l5", "home_goals_suf_mean_l5", "away_goals_suf_mean_l5", "home_game_points_rank_mean", "home_game_points_rank_mean_l5", "away_game_points_rank_mean", "away_game_points_rank_mean_l5","is_friendly_0", "is_friendly_1"]

    base = df.loc[:, columns]
    base.loc[:, "goals_dif"] = base["home_goals_mean"] - base["away_goals_mean"]
    base.loc[:, "goals_dif_l5"] = base["home_goals_mean_l5"] - base["away_goals_mean_l5"]
    base.loc[:, "goals_suf_dif"] = base["home_goals_suf_mean"] - base["away_goals_suf_mean"]
    base.loc[:, "goals_suf_dif_l5"] = base["home_goals_suf_mean_l5"] - base["away_goals_suf_mean_l5"]
    base.loc[:, "goals_per_ranking_dif"] = (base["home_goals_mean"] / base["home_rank_mean"]) - (base["away_goals_mean"] / base["away_rank_mean"])
    base.loc[:, "dif_rank_agst"] = base["home_rank_mean"] - base["away_rank_mean"]
    base.loc[:, "dif_rank_agst_l5"] = base["home_rank_mean_l5"] - base["away_rank_mean_l5"]
    base.loc[:, "dif_points_rank"] = base["home_game_points_rank_mean"] - base["away_game_points_rank_mean"]
    base.loc[:, "dif_points_rank_l5"] = base["home_game_points_rank_mean_l5"] - base["away_game_points_rank_mean_l5"]

    model_df = base[["home_team", "away_team", "target", "rank_dif", "goals_dif", "goals_dif_l5", "goals_suf_dif", "goals_suf_dif_l5", "goals_per_ranking_dif", "dif_rank_agst", "dif_rank_agst_l5", "dif_points_rank", "dif_points_rank_l5", "is_friendly_0", "is_friendly_1"]]
    return model_df

model_db = create_db(df)
model_db
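Now we can start modeling. We train two models, Random Forest and Gradient Boosting, and compare them. For hyperparameter tuning we use scikit-learn's 📘GridSearchCV to search the parameter grid and pick the best model.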
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# Separate the target from the features (the first three columns are home_team, away_team, target)
X = model_db.iloc[:, 3:]
y = model_db[["target"]]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state=1)

gb = GradientBoostingClassifier(random_state=5)

params = {"learning_rate": [0.01, 0.1, 0.5],
          "min_samples_split": [5, 10],
          "min_samples_leaf": [3, 5],
          "max_depth":[3,5,10],
          "max_features":["sqrt"],
          "n_estimators":[100, 200]
         }

gb_cv = GridSearchCV(gb, params, cv = 3, n_jobs = -1, verbose = False)
gb_cv.fit(X_train.values, np.ravel(y_train))

# Keep the best model
gb = gb_cv.best_estimator_
We tune the random forest in the same way:
params_rf = {"max_depth": [20],
             "min_samples_split": [5, 10],
             "max_leaf_nodes": [175, 200],
             "min_samples_leaf": [5, 10],
             "n_estimators": [250],
             "max_features": ["sqrt"],
            }

rf = RandomForestClassifier(random_state=1)

rf_cv = GridSearchCV(rf, params_rf, cv = 3, n_jobs = -1, verbose = False)
rf_cv.fit(X_train.values, np.ravel(y_train))

rf = rf_cv.best_estimator_
Output:
GridSearchCV(cv=3, estimator=RandomForestClassifier(random_state=1), n_jobs=-1,
             param_grid={'max_depth': [20], 'max_features': ['sqrt'],
                         'max_leaf_nodes': [175, 200],
                         'min_samples_leaf': [5, 10],
                         'min_samples_split': [5, 10],
                         'n_estimators': [250]},
             verbose=False)
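To see what the grid search returns without the World Cup data, here is a minimal sketch on a synthetic dataset. The data from `make_classification`, the reduced grid and `scoring="roc_auc"` are assumptions for illustration; the article's calls leave `scoring` at its default, which for a classifier is accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the match features
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1)

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=5),
    {"learning_rate": [0.01, 0.1], "max_depth": [2, 3]},  # deliberately small grid
    cv=3,
    scoring="roc_auc",  # assumption: rank candidates by AUC rather than accuracy
)
search.fit(X_tr, y_tr)

print(search.best_params_)       # winning parameter combination
print(search.best_score_)        # mean cross-validated AUC of that combination
print(search.score(X_te, y_te))  # AUC of the refitted best model on held-out data
```

`best_estimator_` is the model refitted on the whole training split with `best_params_`, which is what the article keeps as `gb` and `rf`.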
We analyze the models with a confusion matrix and the ROC curve with its AUC, starting with Gradient Boosting:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

def analyze(model):
    # ROC curve on the test set
    fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_test.values)[:,1])
    plt.figure(figsize=(15,10))
    plt.plot([0, 1], [0, 1], 'k--')
    plt.plot(fpr, tpr, label="test")

    # ROC curve on the training set
    fpr_train, tpr_train, _ = roc_curve(y_train, model.predict_proba(X_train.values)[:,1])
    plt.plot(fpr_train, tpr_train, label="train")
    auc_test = roc_auc_score(y_test, model.predict_proba(X_test.values)[:,1])
    auc_train = roc_auc_score(y_train, model.predict_proba(X_train.values)[:,1])
    plt.legend()
    plt.title('AUC score is %.2f on test and %.2f on training'%(auc_test, auc_train))
    plt.show()

    # Confusion matrix on the test set
    plt.figure(figsize=(15, 10))
    cm = confusion_matrix(y_test, model.predict(X_test.values))
    sns.heatmap(cm, annot=True, fmt="d")

analyze(gb)
The same analysis for the random forest:
analyze(rf)
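The random forest performs slightly better, but it overfits somewhat. The AUC-ROC analysis of the Gradient Boosting model suggests it is the lower-risk option, so we choose it as the final model.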
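One way to put a number on "overfits somewhat" is the gap between training and test AUC. The sketch below computes it on synthetic data; the helper name `auc_gap` and the toy dataset are assumptions, and the article itself reads the gap off the plot titles produced by `analyze`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def auc_gap(model, X_train, y_train, X_test, y_test):
    """Return (train AUC, test AUC, train minus test) for a fitted classifier."""
    auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    auc_test = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return auc_train, auc_test, auc_train - auc_test

# Synthetic data standing in for the match features
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)
Xa, Xb, ya, yb = train_test_split(X_syn, y_syn, test_size=0.2, random_state=1)

forest = RandomForestClassifier(n_estimators=50, random_state=1).fit(Xa, ya)
train_auc, test_auc, gap = auc_gap(forest, Xa, ya, Xb, yb)
print("train %.3f, test %.3f, gap %.3f" % (train_auc, test_auc, gap))
```

A larger gap for one model than for another, at similar test AUC, is the kind of evidence behind preferring the model with the smaller gap.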
Next we use this model to predict the World Cup. We first use 📘pandas' read_html to fetch the participating teams and the group-stage fixtures.
from collections.abc import Iterable

dfs = pd.read_html(r"https://en.wikipedia.org/wiki/2022_FIFA_World_Cup#Teams")

# Find the positions of the tables that delimit the group stage on the page
for i in range(len(dfs)):
    df = dfs[i]
    cols = list(df.columns.values)

    if isinstance(cols[0], Iterable):
        if any("Tie-breaking criteria" in c for c in cols):
            start_pos = i+1

        if any("Match 46" in c for c in cols):
            end_pos = i+1

matches = []
groups = ["A", "B", "C", "D", "E", "F", "G", "H"]
group_count = 0

# table: group -> list of [team, points, win probabilities (used as tie-breaker)]
table = {}
table[groups[group_count]] = [[a.split(" ")[0], 0, []] for a in list(dfs[start_pos].iloc[:, 1].values)]

for i in range(start_pos+1, end_pos, 1):
    if len(dfs[i].columns) == 3:
        # A three-column table is a single match: home team, score, away team
        team_1 = dfs[i].columns.values[0]
        team_2 = dfs[i].columns.values[-1]
        matches.append((groups[group_count], team_1, team_2))
    else:
        # Any other table is the standings table of the next group
        group_count+=1
        table[groups[group_count]] = [[a, 0, []] for a in list(dfs[i].iloc[:, 1].values)]

table
matches[:10]
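Our model classifies a match as either "home team wins" or "away team wins or draw". So how do we recognize a draw? We predict each match twice: once with team A as the home side and once with team B as the home side.
If both predictions favor the same team, A or B, that team is declared the winner. If one prediction favors A and the other favors B, the match is declared a draw. The code below simulates the matches one by one and tallies the points.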
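The decision rule can be isolated in a small pure function. The sketch below mirrors the comparisons used in the simulation code, assuming each input is a pair `[P(home win), P(not home win)]` as returned by `predict_proba` for one match; the function name `decide` and the example probabilities are hypothetical.

```python
def decide(probs_g1, probs_g2):
    """Combine two predictions for the same fixture.

    probs_g1: [P(home win), P(not home win)] with team 1 as the home side
    probs_g2: [P(home win), P(not home win)] with team 2 as the home side
    Returns (outcome, team_1_prob, team_2_prob).
    """
    team_1_g1, team_2_g1 = probs_g1[0], probs_g1[1]
    team_2_g2, team_1_g2 = probs_g2[0], probs_g2[1]

    # Average each team's probability over the two orderings
    team_1_prob = (team_1_g1 + team_1_g2) / 2
    team_2_prob = (team_2_g1 + team_2_g2) / 2

    # The two predictions disagree when each one favors a different team
    disagree = (team_1_g1 > team_2_g1 and team_2_g2 > team_1_g2) or \
               (team_1_g1 < team_2_g1 and team_2_g2 < team_1_g2)
    if disagree:
        return "draw", team_1_prob, team_2_prob
    if team_1_prob > team_2_prob:
        return "team_1", team_1_prob, team_2_prob
    if team_2_prob > team_1_prob:
        return "team_2", team_1_prob, team_2_prob
    return "undecided", team_1_prob, team_2_prob  # exact tie in averaged probabilities


print(decide([0.7, 0.3], [0.4, 0.6]))  # both orderings favor team 1
print(decide([0.6, 0.4], [0.7, 0.3]))  # the home side is favored each time
```

In the second call whoever plays at home is favored, so the two predictions contradict each other and the rule reports a draw.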
from operator import itemgetter

def find_stats(team_1):
    # All of the team's games in the cycle, oldest first
    past_games = team_stats_raw[(team_stats_raw["team"] == team_1)].sort_values("date")
    last5 = team_stats_raw[(team_stats_raw["team"] == team_1)].sort_values("date").tail(5)

    team_1_rank = past_games["rank"].values[-1]
    team_1_goals = past_games.score.mean()
    team_1_goals_l5 = last5.score.mean()
    team_1_goals_suf = past_games.suf_score.mean()
    team_1_goals_suf_l5 = last5.suf_score.mean()
    team_1_rank_suf = past_games.rank_suf.mean()
    team_1_rank_suf_l5 = last5.rank_suf.mean()
    team_1_gp_rank = past_games.points_by_rank.mean()
    team_1_gp_rank_l5 = last5.points_by_rank.mean()

    return [team_1_rank, team_1_goals, team_1_goals_l5, team_1_goals_suf, team_1_goals_suf_l5, team_1_rank_suf, team_1_rank_suf_l5, team_1_gp_rank, team_1_gp_rank_l5]

def find_features(team_1, team_2):
    rank_dif = team_1[0] - team_2[0]
    goals_dif = team_1[1] - team_2[1]
    goals_dif_l5 = team_1[2] - team_2[2]
    goals_suf_dif = team_1[3] - team_2[3]
    goals_suf_dif_l5 = team_1[4] - team_2[4]
    goals_per_ranking_dif = (team_1[1]/team_1[5]) - (team_2[1]/team_2[5])
    dif_rank_agst = team_1[5] - team_2[5]
    dif_rank_agst_l5 = team_1[6] - team_2[6]
    dif_gp_rank = team_1[7] - team_2[7]
    dif_gp_rank_l5 = team_1[8] - team_2[8]

    # The trailing 1, 0 are is_friendly_0 and is_friendly_1: World Cup games are not friendlies
    return [rank_dif, goals_dif, goals_dif_l5, goals_suf_dif, goals_suf_dif_l5, goals_per_ranking_dif, dif_rank_agst, dif_rank_agst_l5, dif_gp_rank, dif_gp_rank_l5, 1, 0]

advanced_group = []
last_group = ""

# Reset points and tie-break probabilities
for k in table.keys():
    for t in table[k]:
        t[1] = 0
        t[2] = []

for teams in matches:
    draw = False
    team_1 = find_stats(teams[1])
    team_2 = find_stats(teams[2])

    # Predict the fixture twice, swapping which team is treated as the home side
    features_g1 = find_features(team_1, team_2)
    features_g2 = find_features(team_2, team_1)

    probs_g1 = gb.predict_proba([features_g1])
    probs_g2 = gb.predict_proba([features_g2])

    team_1_prob_g1 = probs_g1[0][0]
    team_1_prob_g2 = probs_g2[0][1]
    team_2_prob_g1 = probs_g1[0][1]
    team_2_prob_g2 = probs_g2[0][0]

    team_1_prob = (probs_g1[0][0] + probs_g2[0][1])/2
    team_2_prob = (probs_g2[0][0] + probs_g1[0][1])/2

    if ((team_1_prob_g1 > team_2_prob_g1) & (team_2_prob_g2 > team_1_prob_g2)) | ((team_1_prob_g1 < team_2_prob_g1) & (team_2_prob_g2 < team_1_prob_g2)):
        # The two predictions disagree: call it a draw, one point each
        draw=True
        for i in table[teams[0]]:
            if i[0] == teams[1] or i[0] == teams[2]:
                i[1] += 1

    elif team_1_prob > team_2_prob:
        winner = teams[1]
        winner_proba = team_1_prob
        for i in table[teams[0]]:
            if i[0] == teams[1]:
                i[1] += 3

    elif team_2_prob > team_1_prob:
        winner = teams[2]
        winner_proba = team_2_prob
        for i in table[teams[0]]:
            if i[0] == teams[2]:
                i[1] += 3

    # Store the per-game win probabilities, used later as the tie-breaker
    for i in table[teams[0]]:
        if i[0] == teams[1]:
            i[2].append(team_1_prob)
        if i[0] == teams[2]:
            i[2].append(team_2_prob)

    if last_group != teams[0]:
        if last_group != "":
            print("\n")
            print("Group %s advanced: "%(last_group))

            # Tie-breaker: mean win probability over the group games
            for i in table[last_group]:
                i[2] = np.mean(i[2])

            final_points = table[last_group]
            final_table = sorted(final_points, key=itemgetter(1, 2), reverse = True)
            advanced_group.append([final_table[0][0], final_table[1][0]])
            for i in final_table:
                print("%s -------- %d"%(i[0], i[1]))
        print("\n")
        print("-"*10+" Starting Analysis for Group %s "%(teams[0])+"-"*10)

    if draw == False:
        print("Group %s - %s vs. %s: Winner %s with %.2f probability"%(teams[0], teams[1], teams[2], winner, winner_proba))
    else:
        print("Group %s - %s vs. %s: Draw"%(teams[0], teams[1], teams[2]))
    last_group = teams[0]

# Close out the last group
print("\n")
print("Group %s advanced: "%(last_group))

for i in table[last_group]:
    i[2] = np.mean(i[2])

final_points = table[last_group]
final_table = sorted(final_points, key=itemgetter(1, 2), reverse = True)
advanced_group.append([final_table[0][0], final_table[1][0]])
for i in final_table:
    print("%s -------- %d"%(i[0], i[1]))
The results:
---------- Starting Analysis for Group A ----------
Group A - Qatar vs. Ecuador: Winner Ecuador with 0.62 probability
Group A - Senegal vs. Netherlands: Winner Netherlands with 0.62 probability
Group A - Qatar vs. Senegal: Winner Senegal with 0.60 probability
Group A - Netherlands vs. Ecuador: Winner Netherlands with 0.73 probability
Group A - Ecuador vs. Senegal: Draw
Group A - Netherlands vs. Qatar: Winner Netherlands with 0.78 probability

Group A advanced:
Netherlands -------- 9
Senegal -------- 4
Ecuador -------- 4
Qatar -------- 0

---------- Starting Analysis for Group B ----------
Group B - England vs. Iran: Winner England with 0.62 probability
Group B - United States vs. Wales: Draw
Group B - Wales vs. Iran: Draw
Group B - England vs. United States: Winner England with 0.61 probability
Group B - Wales vs. England: Winner England with 0.64 probability
Group B - Iran vs. United States: Winner United States with 0.58 probability

Group B advanced:
England -------- 9
United States -------- 4
Wales -------- 2
Iran -------- 1

---------- Starting Analysis for Group C ----------
Group C - Argentina vs. Saudi Arabia: Winner Argentina with 0.79 probability
Group C - Mexico vs. Poland: Draw
Group C - Poland vs. Saudi Arabia: Winner Poland with 0.70 probability
Group C - Argentina vs. Mexico: Winner Argentina with 0.67 probability
Group C - Poland vs. Argentina: Winner Argentina with 0.71 probability
Group C - Saudi Arabia vs. Mexico: Winner Mexico with 0.71 probability

Group C advanced:
Argentina -------- 9
Poland -------- 4
Mexico -------- 4
Saudi Arabia -------- 0

---------- Starting Analysis for Group D ----------
Group D - Denmark vs. Tunisia: Winner Denmark with 0.68 probability
Group D - France vs. Australia: Winner France with 0.71 probability
Group D - Tunisia vs. Australia: Draw
Group D - France vs. Denmark: Draw
Group D - Australia vs. Denmark: Winner Denmark with 0.71 probability
Group D - Tunisia vs. France: Winner France with 0.69 probability

Group D advanced:
France -------- 7
Denmark -------- 7
Tunisia -------- 1
Australia -------- 1

---------- Starting Analysis for Group E ----------
Group E - Germany vs. Japan: Winner Germany with 0.62 probability
Group E - Spain vs. Costa Rica: Winner Spain with 0.76 probability
Group E - Japan vs. Costa Rica: Winner Japan with 0.63 probability
Group E - Spain vs. Germany: Draw
Group E - Japan vs. Spain: Winner Spain with 0.67 probability
Group E - Costa Rica vs. Germany: Winner Germany with 0.65 probability

Group E advanced:
Spain -------- 7
Germany -------- 7
Japan -------- 3
Costa Rica -------- 0

---------- Starting Analysis for Group F ----------
Group F - Morocco vs. Croatia: Winner Croatia with 0.58 probability
Group F - Belgium vs. Canada: Winner Belgium with 0.75 probability
Group F - Belgium vs. Morocco: Winner Belgium with 0.67 probability
Group F - Croatia vs. Canada: Winner Croatia with 0.64 probability
Group F - Croatia vs. Belgium: Winner Belgium with 0.64 probability
Group F - Canada vs. Morocco: Draw

Group F advanced:
Belgium -------- 9
Croatia -------- 6
Morocco -------- 1
Canada -------- 1

---------- Starting Analysis for Group G ----------
Group G - Switzerland vs. Cameroon: Winner Switzerland with 0.69 probability
Group G - Brazil vs. Serbia: Winner Brazil with 0.72 probability
Group G - Cameroon vs. Serbia: Winner Serbia with 0.66 probability
Group G - Brazil vs. Switzerland: Draw
Group G - Serbia vs. Switzerland: Winner Switzerland with 0.57 probability
Group G - Cameroon vs. Brazil: Winner Brazil with 0.81 probability

Group G advanced:
Brazil -------- 7
Switzerland -------- 7
Serbia -------- 3
Cameroon -------- 0

---------- Starting Analysis for Group H ----------
Group H - Uruguay vs. South Korea: Winner Uruguay with 0.62 probability
Group H - Portugal vs. Ghana: Winner Portugal with 0.81 probability
Group H - South Korea vs. Ghana: Winner South Korea with 0.76 probability
Group H - Portugal vs. Uruguay: Winner Portugal with 0.60 probability
Group H - Ghana vs. Uruguay: Winner Uruguay with 0.77 probability
Group H - South Korea vs. Portugal: Winner Portugal with 0.67 probability

Group H advanced:
Portugal -------- 9
Uruguay -------- 6
South Korea -------- 3
Ghana -------- 0
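Several groups end with two teams level on points (for example Senegal and Ecuador on 4), and the simulation separates them by mean win probability through `sorted(..., key=itemgetter(1, 2), reverse=True)`. A minimal sketch of that ordering is shown below; the points match Group A above, but the probability values are invented for illustration.

```python
from operator import itemgetter

# [team, points, mean win probability]; probabilities are made-up tie-break values
group_a = [
    ["Qatar", 0, 0.30],
    ["Ecuador", 4, 0.44],
    ["Netherlands", 9, 0.71],
    ["Senegal", 4, 0.45],
]

# Sort by points first, then by mean win probability, both descending
standings = sorted(group_a, key=itemgetter(1, 2), reverse=True)
qualified = [row[0] for row in standings[:2]]

for team, points, prob in standings:
    print("%s -------- %d (%.2f)" % (team, points, prob))
print(qualified)
```

Note that this tie-break is a modeling shortcut; the real tournament separates teams on goal difference and other criteria that the simulation does not produce.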
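Some of the model's results are interesting, such as the draws between Brazil and Switzerland and between Denmark and France.
The knockout stage follows the same idea: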
advanced = advanced_group

playoffs = {"Round of 16": [], "Quarter-Final": [], "Semi-Final": [], "Final": []}

actual_round = ""
next_rounds = []

for p in playoffs.keys():
    if p == "Round of 16":
        # Cross the groups: winners of A, C, E, G meet the runners-up of B, D, F, H,
        # then the runners-up of A, C, E, G meet the winners of B, D, F, H
        control = []
        for a in range(0, len(advanced*2), 1):
            if a < len(advanced):
                if a % 2 == 0:
                    control.append((advanced*2)[a][0])
                else:
                    control.append((advanced*2)[a][1])
            else:
                if a % 2 == 0:
                    control.append((advanced*2)[a][1])
                else:
                    control.append((advanced*2)[a][0])

        playoffs[p] = [[control[c], control[c+1]] for c in range(0, len(control)-1, 1) if c%2 == 0]
    else:
        # Later rounds pair consecutive winners of the previous round
        playoffs[p] = [[next_rounds[c], next_rounds[c+1]] for c in range(0, len(next_rounds)-1, 1) if c%2 == 0]
        next_rounds = []

    for i in range(0, len(playoffs[p]), 1):
        game = playoffs[p][i]

        home = game[0]
        away = game[1]
        team_1 = find_stats(home)
        team_2 = find_stats(away)

        features_g1 = find_features(team_1, team_2)
        features_g2 = find_features(team_2, team_1)

        probs_g1 = gb.predict_proba([features_g1])
        probs_g2 = gb.predict_proba([features_g2])

        # No draws in the knockout stage: the higher averaged probability advances
        team_1_prob = (probs_g1[0][0] + probs_g2[0][1])/2
        team_2_prob = (probs_g2[0][0] + probs_g1[0][1])/2

        if actual_round != p:
            print("-"*10)
            print("Starting simulation of %s"%(p))
            print("-"*10)
            print("\n")

        if team_1_prob < team_2_prob:
            print("%s vs. %s: %s advances with prob %.2f"%(home, away, away, team_2_prob))
            next_rounds.append(away)
        else:
            print("%s vs. %s: %s advances with prob %.2f"%(home, away, home, team_1_prob))
            next_rounds.append(home)

        game.append([team_1_prob, team_2_prob])
        playoffs[p][i] = game
        actual_round = p
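The Round of 16 pairing built through the `control` list can be hard to read. As a minimal sketch, the same pairing can be written as a small function and checked on placeholder names; `round_of_16` and the labels such as "A1" (winner of Group A) and "B2" (runner-up of Group B) are hypothetical names introduced for illustration.

```python
def round_of_16(advanced):
    """Pair qualified teams for the Round of 16.

    advanced: list of [winner, runner_up] per group, in group order A..H.
    Returns a list of [team, team] fixtures in the order the simulation uses.
    """
    fixtures = []
    # Winners of groups A, C, E, G against runners-up of B, D, F, H
    for g in range(0, len(advanced), 2):
        fixtures.append([advanced[g][0], advanced[g + 1][1]])
    # Runners-up of groups A, C, E, G against winners of B, D, F, H
    for g in range(0, len(advanced), 2):
        fixtures.append([advanced[g][1], advanced[g + 1][0]])
    return fixtures


placeholder = [[g + "1", g + "2"] for g in "ABCDEFGH"]
for fixture in round_of_16(placeholder):
    print(fixture)
```

Consecutive fixtures in this list feed the same quarter-final, which is why the later rounds can simply pair neighbouring winners.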
The results:
----------
Starting simulation of Round of 16
----------
Netherlands vs. United States: Netherlands advances with prob 0.54
Argentina vs. Denmark: Argentina advances with prob 0.59
Spain vs. Croatia: Spain advances with prob 0.61
Brazil vs. Uruguay: Brazil advances with prob 0.64
Senegal vs. England: England advances with prob 0.64
Poland vs. France: France advances with prob 0.67
Germany vs. Belgium: Belgium advances with prob 0.53
Switzerland vs. Portugal: Portugal advances with prob 0.57
----------
Starting simulation of Quarter-Final
----------
Netherlands vs. Argentina: Netherlands advances with prob 0.51
Spain vs. Brazil: Brazil advances with prob 0.54
England vs. France: England advances with prob 0.51
Belgium vs. Portugal: Portugal advances with prob 0.52
----------
Starting simulation of Semi-Final
----------
Netherlands vs. Brazil: Brazil advances with prob 0.55
England vs. Portugal: England advances with prob 0.51
----------
Starting simulation of Final
----------
Brazil vs. England: Brazil advances with prob 0.56
We show the results as a bracket diagram.
import networkx as nx
from networkx.drawing.nx_pydot import graphviz_layout  # requires pydot and Graphviz

plt.figure(figsize=(15, 10))
# A binary tree of height 3 has 15 nodes: 8 + 4 + 2 + 1 knockout games
G = nx.balanced_tree(2, 3)

labels = []

for p in playoffs.keys():
    for game in playoffs[p]:
        label = f"{game[0]}({round(game[2][0], 2)}) \n {game[1]}({round(game[2][1], 2)})"
        labels.append(label)

labels_dict = {}
labels_rev = list(reversed(labels))

# Node 0 is the root, so the final gets the first label after reversing
for l in range(len(list(G.nodes))):
    labels_dict[l] = labels_rev[l]

pos = graphviz_layout(G, prog='twopi')
labels_pos = {n: (k[0], k[1]-0.08*k[1]) for n,k in pos.items()}
center = pd.DataFrame(pos).mean(axis=1).mean()

nx.draw(G, pos = pos, with_labels=False, node_color=range(15), edge_color="#bbf5bb", width=10, font_weight='bold',cmap=plt.cm.Greens, node_size=5000)
nx.draw_networkx_labels(G, pos = labels_pos, bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="black", lw=.5, alpha=1),
                        labels=labels_dict)

texts = ["Round \nof 16", "Quarter \n Final", "Semi \n Final", "Final\n"]
pos_y = pos[0][1] + 55
for text in reversed(texts):
    pos_x = center
    pos_y -= 75
    plt.text(pos_y, pos_x, text, fontsize = 18)

plt.axis('equal')
plt.show()
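In the simulated World Cup, the model predicts that Brazil wins the title, beating England in the final with a probability of 56%. The biggest upsets in the model's predictions are Belgium beating Germany, and England reaching the final after knocking out France in the quarter-finals. It is also interesting to see matches decided by very narrow probability margins, such as Netherlands vs. Argentina.
In this article, ShowMeAI applied machine learning to analyze and model the teams taking part in the World Cup, and to simulate and predict the results of its matches. The article covers data preprocessing, data analysis, feature engineering, model building and hyperparameter tuning, and applying the model and visualizing its results. Of course, what makes the World Cup fun is that things change in an instant on the pitch and any result can happen. Let's follow the World Cup together and enjoy every match!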