学习理论:单阶段代理损失的(H, R) – 一致界证明

1 导引

我们在上一篇博客《学习理论:预测器-拒绝器多分类弃权学习》中介绍了弃权学习[1]的基本概念和方法,其中包括了下列针对多分类问题的单阶段预测器-拒绝器弃权损失(L_{text{abst}})

[L_{text{abst}}(h, r, x, y) = underbrace{mathbb{I}_{text{h}(x) neq y}mathbb{I}_{r(x) > 0}}_{text{不弃权}} + underbrace{c(x) mathbb{I}_{r(x)leqslant 0}}_{text{弃权}} ]

其中((x, y)in mathcal{X}times mathcal{Y})(标签(mathcal{Y} = {1, cdots, n})(ngeqslant 2))),((h, r)in mathcal{H}timesmathcal{R})为预测器-拒绝器对((mathcal{H})(mathcal{R})为两个从(mathcal{X})(mathbb{R})的函数构成的函数族),(text{h}(x) = text{arg max}_{yin mathcal{Y}} {h(x)}_y)直接输出实例(x)的预测标签。为了简化讨论,在后文中我们假设(cin (0, 1))为一个常量花费函数。

(mathcal{l})为在标签(mathcal{Y})上定义的0-1多分类损失的代理损失,则我们可以在此基础上进一步定义弃权代理损失(L)

[L(h, r, x, y) = mathcal{l}(h, x, y)phi(-alpha r(x)) + psi(c) phi(beta r(x)) ]

其中(psi)是非递减函数,(phi)是非递增辅助函数(做为(z mapsto mathbb{I}_{z leqslant 0})的上界),(alpha)(beta)为正常量。下面,为了简便起见,我们主要对(phi(z) = exp(-z))进行分析,尽管相似的分析也可以应用于其它函数(phi)

在上一篇博客中,我们还提到了单阶段代理损失满足的((mathcal{H}, mathcal{R}))-一致性界:

定理 1 单阶段代理损失的((mathcal{H}, mathcal{R})) - 一致性界 假设(mathcal{H})是对称与完备的。则对(alpha=beta)(mathcal{l} = mathcal{l}_{text{mae}}),或者(mathcal{l} = mathcal{l}_{rho})(psi(z) = z),或者(mathcal{l} = mathcal{l}_{rho - text{hinge}})(psi(z) = z),有下列((mathcal{H}, mathcal{R})) - 一致性界对(hin mathcal{H}, rin mathcal{R})和任意分布成立:

[R_{L_{text{abst}}}(h, r) - R_{L_{text{abst}}}^{*}(mathcal{H}, mathcal{R}) + M_{L_{text{abst}}}(mathcal{H}, mathcal{R}) leqslant Gamma(R_L(h, r) - R_{L}^{*}(mathcal{H}, mathcal{R}) + M_{L}(mathcal{H}, mathcal{R})) ]

其中对(mathcal{l} = mathcal{l}_{text{mae}})(Gamma (z) = max{2nsqrt{z}, nz});对(mathcal{l}=mathcal{l}_{rho})(Gamma (z) = max{2sqrt{z}, z});对(mathcal{l} = mathcal{l}_{rho - text{hinge}})(Gamma (z) = max{2sqrt{nz}, z})

不过,在上一篇博客中,我们并没有展示单阶段代理损失的((mathcal{H}, mathcal{R}))-一致性界的详细证明过程,在这片文章里我们来看该如何对该定理进行证明(正好我导师也让我仔细看看这几篇论文[1][2]中相关的分析部分,并希望我掌握单阶段方法的证明技术)。

2 一些分析的预备概念

我们假设带标签样本(S=((x_1, y_1), cdots, (x_m, y_m)))独立同分布地采自(p(x, y))。则对于目标损失(L_{text{abst}})和代理损失(L)而言,可分别定义(L_{text{abst}})-期望弃权损失(R_{L}(h, r))(也即目标损失函数的泛化误差)和(L)-期望弃权代理损失(R_{L}(h, r))(也即代理损失函数的泛化误差)如下:

[R_{L_{text{abst}}}(h, r) = mathbb{E}_{p(x, y)}left[L_{text{abst}}(h, r, x, y)right], quad R_{L}(h, r) = mathbb{E}_{p(x, y)}left[L(h, r, x, y)right] ]

(R_{{L}^{*}_{text{abst}}}(mathcal{H}, mathcal{R}) = inf_{hin mathcal{H}, rin mathcal{R}}R_{L_{text{abst}}}(mathcal{H}, mathcal{R}))(R_{L}^{*}(mathcal{H}, mathcal{R}) = inf_{hin mathcal{H}, rin mathcal{R}}R_{L}(mathcal{H}, mathcal{R}))分别为(R_{L_{text{abst}}})(R_L)(mathcal{H}times mathcal{R})上的下确界。
为了进一步简化后续的分析,我们根据概率的乘法规则将(R_L(h, r))写为:

[R_{L}(h, r) = mathbb{E}_{p(x, y)}left[L(h, r, x, y)right] = mathbb{E}_{p(x)}underbrace{left[mathbb{E}_{p(ymid x)}left[L(h, r, x, y)right]right]}_{text{conditional risk }C_L} ]

我们称其中内层的条件期望项为代理损失(L)条件风险(conditional risk)(也称为代理损失(L)的pointwise风险[2]),由于在其计算过程中(y)取期望取掉了,因此该项只和(h)(r)(x)相关,因此我们将其记为(C_L(h, r, x))

[C_L(h, r, x) = mathbb{E}_{p(ymid x)}left[L(h, r, x, y)right] = sum_{yin mathcal{Y}}p(ymid x)L(h, r, x, y) ]

我们用(C^*_L(mathcal{H}, mathcal{R}, x) = inf_{hin mathcal{H}, rin mathcal{R}} C_L(h, r, x))来表示假设类最优(best-in-class)(L)的条件风险。同理,我们用(C_{L_{text{abst}}})来表示目标损失(L_{text{abst}})的条件风险,并用(C^*_{L_{text{abst}}})来表示假设类最优的(L_{text{abst}})的条件风险。

根据(R_{L}^*(h, r))(C^*_L(mathcal{H}, mathcal{R}, x)),我们可以表示出最小化能力差距(minimizability gap)

[M_L(mathcal{H}, mathcal{R}) = R_{L}^*(mathcal{H}, mathcal{R}) - mathbb{E}_{p(x)}left[C_L^*(mathcal{H}, mathcal{R}, x)right] ]

(M_{L_{text{abst}}})的表示同理。

于是,我们可以对要证明的((mathcal{H}, mathcal{R}))-一致性界进行改写:

[R_{L_{text{abst}}}(h, r) - R_{L_{text{abst}}}^{*}(mathcal{H}, mathcal{R}) + M_{L_{text{abst}}}(mathcal{H}, mathcal{R}) leqslant Gamma(R_L(h, r) - R_{L}^{*}(mathcal{H}, mathcal{R}) + M_{L}(mathcal{H}, mathcal{R}))\ Rightarrow R_{L_{text{abst}}}(h, r) - mathbb{E}_{p(x)}left[C_{L_{text{abst}}}^*(mathcal{H}, mathcal{R}, x)right] leqslant Gammaleft(R_{L}(h, r) - mathbb{E}_{p(x)}left[C_{L}^*(mathcal{H}, mathcal{R}, x)right]right) ]

其中(R_{L_{text{abst}}}(h, r))(R_L(h, r))分别为(mathbb{E}_{p(x)}left[C_{L_{text{abst}}}(h, r, x)right])(mathbb{E}_{p(x)}left[C_{L}(h, r, x)right]),于是上述不等式即为

[mathbb{E}_{p(x)}underbrace{left[C_{L_{text{abst}}}(h, r, x) - C_{L_{text{abst}}}^*(mathcal{H}, mathcal{R}, x)right]}_{Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x)} leqslant Gammaleft(mathbb{E}_{p(x)}underbrace{left[C_{L}(h, r, x) - C_{L}^*(mathcal{H}, mathcal{R}, x)right]}_{Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)}right) ]

我们将上述不等式两边的被取期望的项简记为(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x))(Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)),其中(Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x))被称为校准差距(calibration gap)。由于按定义(Gamma(cdot))是凹函数,由Jensen不等式有:

[mathbb{E}_{p(x)}left[Gammaleft(Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)right)right] leqslant Gammaleft(mathbb{E}_{p(x)}left[Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)right]right) ]

于是,若我们能证明下述不等式,则原不等式得证:

[Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x) leqslant Gamma left(Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)right) ]

我们后面将会看到,((mathcal{H}, mathcal{R}))-一致性界的证明过程中重要的一步即是证明(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x))能被(Gamma left(Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)right))界定。

3 (Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x))的表示

我们先来看(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x) = C_{L_{text{abst}}}(h, r, x) - C^*_{L_{text{abst}}}(mathcal{H}, mathcal{R}, x))如何表示。根据定义,我们有:

[begin{aligned} C_{L_{text{abst}}}(h, r, x) &= sum_{yin mathcal{Y}}p(ymid x)L_{text{abst}}(h, r, x, y) \ &= sum_{yin mathcal{Y}}p(ymid x) mathbb{I}_{text{h}(x) neq y}mathbb{I}_{r(x) > 0} + c(x) mathbb{I}_{r(x)leqslant 0} end{aligned} ]

由于是关于(y)的条件期望,上式最后一行中只需要对(mathbb{I}_{text{h}(x) neq y})进行加权求和即可。为了进一步对(C_{L_{text{abst}}}(h, r, x))进行表示,我们需要对(r(x))的正负情况进行分类讨论:

  1. (r(x) > 0):此时(C_{L_{text{abst}}}(h, r, x) = sum_{yin mathcal{Y}}p(ymid x) mathbb{I}_{text{h}(x) neq y} = 1 - p(text{h}(x)mid x))
  2. (r(x) leqslant 0):此时(C_{L_{text{abst}}}(h, r, x) = c)

接下来我们来看(C^*_{L_{text{abst}}})如何表示。我们假设拒绝函数集(mathcal{R})是完备的(也即对任意(xin mathcal{X}, {r(x): rin mathcal{R}} = mathbb{R})),那么(mathcal{R})也是弃权正规的(也即使得对任意(xin mathcal{X}),存在(r_1, r_2in mathcal{R})满足(r_1(x) > 0)(r_2(x) leqslant 0))。于是我们有

[begin{aligned} C^*_{L_{text{abst}}}(mathcal{H}, mathcal{R}, x) &= inf_{hin mathcal{H}, rin mathcal{R}}C_{L_{text{abst}}}(h, r, x)\ & = min left{min_{hin mathcal{H}}left(1 - pleft( text{h}(x)mid xright)right), cright}\ & = 1 - maxleft{max_{hin mathcal{H}}pleft(text{h}(x)mid xright), 1 - cright} end{aligned} ]

我们假设(mathcal{H})是对称的且完备的(具体定义参见博客《学习理论:预测器-拒绝器多分类弃权学习》),则我们有(left{text{h}(x): hin mathcal{H}right} = mathcal{Y}),于是

[C^*_{L_{text{abst}}}(mathcal{H}, mathcal{R}, x) = 1 - maxleft{max_{yin mathcal{Y}}pleft(ymid xright), 1 - cright} ]

为了进一步对(C^*_{L_{text{abst}}}(mathcal{H}, mathcal{R}, x))进行表示,我们需要对(max_{yin mathcal{Y}}p(ymid x))((1 - c))的大小比较情况进行分类讨论:

  1. (max_{yin mathcal{Y}}p(ymid x) > 1 - c):此时(C^*_{L_{text{abst}}}(mathcal{H}, mathcal{R}, x) = 1 - max_{yin mathcal{Y}}p(ymid x))
  2. (max_{yin mathcal{Y}}p(ymid x) leqslant 1 - c):此时(C^*_{L_{text{abst}}}(mathcal{H}, mathcal{R}, x) = c)

于是,我们有:

[begin{aligned} Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x) &= C_{L_{text{abst}}}(h, r, x) - C^*_{L_{text{abst}}}(mathcal{H}, mathcal{R}, x) \ & = left{begin{aligned} &max_{yin mathcal{Y}}p(ymid x) - p(text{h}(x)mid x)quad &text{if } max_{yin mathcal{Y}} p(ymid x) > (1 - c),r(x) > 0 \ &1 - c - p(text{h}(x)mid x) quad &text{if } max_{yin mathcal{Y}} p(ymid x) leqslant (1 - c),r(x) > 0 \ &0 quad &text{if } max_{yin mathcal{Y}} p(ymid x) leqslant (1 - c),r(x) leqslant 0 \ &max_{yin mathcal{Y}}p(ymid x) - 1 + c quad &text{if } max_{yin mathcal{Y}} p(ymid x) > (1 - c),r(x) leqslant 0 \ end{aligned}right. end{aligned} ]

4 (Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x))的表示

4.1 分类讨论的准备

接下来我们来看(Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x) = C_L(h, r, x) - C^*_L(mathcal{H}, mathcal{R}, x))如何表示。根据定义,若(alpha = beta)(phi(z) = exp(-z)),我们有:

[begin{aligned} C_L(h, r, x) &= sum_{yin mathcal{Y}}p(ymid x)L(h, r, x, y) \ &= sum_{yin mathcal{Y}}p(ymid x) mathcal{l}(h, x, y)e^{alpha r(x)} + psi(c) e^{-alpha r(x)} end{aligned} ]

由于是关于(y)的条件期望,上式最后一行中只需要对(mathcal{l}(h, x, y))进行加权求和即可。在后文中我们将会针对下列三种不同的(mathcal{l})函数以及(psi(z))的选择情况来分别对(C_L(h, r, x))进行讨论:

  1. (mathcal{l} = mathcal{l}_{text{mae}})(psi(z) = z)
  2. (mathcal{l} = mathcal{l}_{rho})(psi(z) = z)
  3. (mathcal{l} = mathcal{l}_{rho-text{hinge}})(psi(z) = nz)

这三种不同(mathcal{l})的定义参见博客《学习理论:预测器-拒绝器多分类弃权学习》),我在这里把它们的定义贴一下:

  • 平均绝对误差损失:(mathcal{l}_{text{mae}}(h, x, y) = 1 - frac{e^{{h(x)}_y}}{sum_{y^{prime}in mathcal{Y}}e^{{h(x)}_{y^{prime}}}})
  • 约束(rho)-合页损失:(mathcal{l}_{rho-text{hinge}}(h, x, y) = sum_{y^{prime}neq y}phi_{rho-text{hinge}}(-{h(x)}_{y^{prime}}), rho > 0),其中(phi_{rho-text{hinge}}(z) = max{0, 1 - frac{z}{rho}})(rho)-合页损失,且约束条件(sum_{yin mathcal{Y}}{h(x)}_y=0)
  • (rho)-间隔损失:(mathcal{l}_{rho}(h, x, y) = phi_{rho}({rho_h (x, y)})),其中(rho_{h}(x, y) = h(x)_y - max_{y^{prime} neq y}h(x)_{y^{prime}})是置信度间隔,(phi_{rho}(z) = min{max{0, 1 - frac{z}{rho}}, 1}, rho > 0)(rho)-间隔损失。

4.2 (mathcal{l} = mathcal{l}_{text{mae}})(psi(z) = z)

在这种情况下(C_L(h, r, x))可以表示为:

[begin{aligned} C_L(h, r, x) &= sum_{yin mathcal{Y}}p(ymid x) underbrace{left(1 - frac{e^{{h(x)}_y}}{sum_{y^{prime}in mathcal{Y}}e^{{h(x)}_{y^{prime}}}}right)}_{mathcal{l}_{text{mae}}}e^{alpha r(x)} + c e^{-alpha r(x)} \ &= sum_{yin mathcal{Y}}p(ymid x) left(1 - s_h(x, y)right)e^{alpha r(x)} + c e^{-alpha r(x)} end{aligned} ]

其中(s_h(x, y) = frac{e^{{h(x)}_y}}{sum_{y^{prime}in mathcal{Y}}e^{{h(x)}_{y^{prime}}}})

于是

[begin{aligned} C_L^*(mathcal{H}, mathcal{R}, x) &= inf_{hin mathcal{H}, rinmathcal{R}} left{sum_{yin mathcal{Y}}p(ymid x) left(1 - s_h(x, y)right)e^{alpha r(x)} + c e^{-alpha r(x)}right} \ &= inf_{rinmathcal{R}} left{inf_{hin mathcal{H}}left{sum_{yin mathcal{Y}}p(ymid x) left(1 - s_h(x, y)right)right}e^{alpha r(x)} + c e^{-alpha r(x)}right} end{aligned} ]

由于假设了(mathcal{H})是对称的与完备的,我们有

[begin{aligned} &inf_{hin mathcal{H}}left{sum_{yin mathcal{Y}}p(ymid x) left(1 - s_h(x, yright))right} \ &= 1 - sup_{hin mathcal{H}}sum_{yin mathcal{Y}}p(ymid x)s_h(x, y) \ &= 1 - max_{yin mathcal{Y}}p(ymid x)quad left(s_h(x, y)in (0, 1)right) end{aligned} ]

实际上,对任意(hin mathcal{H}),有:

[begin{aligned} &sum_{yin mathcal{Y}}p(ymid x) left(1 - s_h(x, y)right) - left(1 - max_{yin mathcal{Y}}pleft(ymid xright)right) \ &= max_{yin mathcal{Y}} p(ymid x) - sum_{yin mathcal{Y}}p(ymid x)s_h(x, y) \ &= max_{yin mathcal{Y}} p(ymid x) - left(pleft(text{h}(x)mid xright)s_hleft(x, text{h}(x)right) + sum_{yneq text{h}(x)}p(ymid x)s_h(x, y)right) \ &geqslant max_{yin mathcal{Y}} p(ymid x) - left(pleft(text{h}(x)mid xright)s_hleft(x, text{h}(x)right) + max_{yin mathcal{Y}}p(ymid x)left(1 - s_hleft(x, text{h}(x)right)right)right) \ &= s_hleft(x, text{h}(x)right)left(max_{yin mathcal{Y}}p(ymid x) - pleft(text{h}(x)mid xright)right) \ &geqslant frac{1}{n} left(max_{yin mathcal{Y}}p(ymid x) - pleft(text{h}(x)mid xright)right) end{aligned} ]

这个结论我们会在后面的证明中多次用到。该结论的一个推论是如果分类器(h^*)为贝叶斯最优分类器(也即(p(text{h}^*(x)mid x) = max_{yin mathcal{Y}} p(ymid x))),则(sum_{yin mathcal{Y}}p(ymid x) left(1 - s_h(x, y)right) - left(1 - max_{yin mathcal{Y}}p(ymid x)right) geqslant 0),可直观地将其理解为(mathbb{E}_{p(ymid x)}left[mathcal{l}_{text{mae}}right])更可能接近其下确界。

于是

[C_L^*(mathcal{H}, mathcal{R}, x) = inf_{rinmathcal{R}} left{left(1 - max_{yin mathcal{Y}}p(ymid x)right)e^{alpha r(x)} + c e^{-alpha r(x)}right} ]

记上式中需要求极值的部分为泛函(F(r)),则其泛函导数为

[frac{delta F}{delta r(x)} = alpha left(1 - max_{yin mathcal{Y}}p(ymid x)right)e^{alpha r(x)} - calpha e^{-alpha r(x)} ]

(frac{delta F}{delta p(x)} = 0)(对(forall xin mathcal{X})),解得(r^*(x) = -frac{1}{2alpha}log left(frac{1 - max_{yin mathcal{Y}}p(ymid x)}{c}right))。将其代入(F(r))可得:

[C_L^*(mathcal{H}, mathcal{R}, x) = 2sqrt{c(1 - max_{yin mathcal{Y}}p(ymid x))} ]

于是

[begin{aligned} Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x) &= C_L(h, r, x) - C^*_L(mathcal{H}, mathcal{R}, x) \ &= sum_{yin mathcal{Y}}p(ymid x) left(1 - s_h(x, y)right)e^{alpha r(x)} + c e^{-alpha r(x)} - 2sqrt{c(1 - max_{yin mathcal{Y}}p(ymid x))} end{aligned} ]

为了构建(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x))(Gamma left(Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)right))的不等式关系,接下来我们将会采用第3节中类似的做法,针对(max_{yin mathcal{Y}} p(ymid x))(1 - c)的大小比较情况与(r(x))的正负情况来对(Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x))进行分类讨论:

  1. (max_{yin mathcal{Y}} p(ymid x) > (1 - c))(r(x) > 0)
    此时

    [begin{aligned} Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x) &= sum_{yin mathcal{Y}}p(ymid x) left(1 - s_h(x, y)right)e^{alpha r(x)} + c e^{-alpha r(x)} - 2sqrt{cleft(1 - max_{yin mathcal{Y}}p(ymid x)right)} \ & geqslant sum_{yin mathcal{Y}}p(ymid x) left(1 - s_h(x, y)right)e^{alpha r(x)} + c e^{-alpha r(x)} - left(c + underbrace{left(1 - max_{yin mathcal{Y}}p(ymid x)right)}_{<c}right) \ & quad quad quad quad quad quad quad quad quad quad quad quad quad quad quad quad quad quad quad quad quad quad (text{AM-GM inequality}) \ &geqslant sum_{yin mathcal{Y}}p(ymid x) left(1 - s_h(x, y)right)e^{alpha r(x)} + c e^{-alpha r(x)} - ce^{-alpha r(x)} - left(1 - max_{yin mathcal{Y}}p(ymid x)right)e^{alpha r(x)} \ &geqslant sum_{yin mathcal{Y}}p(ymid x) left(1 - s_h(x, y)right) - left(1 - max_{yin mathcal{Y}}p(ymid x)right)\ &geqslant frac{1}{n} left(max_{yin mathcal{Y}}p(ymid x) - pleft(text{h}(x)mid xright)right) \ &= frac{1}{n} Delta C_{text{abst}, mathcal{H}, mathcal{R}}(h, r, x) end{aligned} ]

    (其中(text{AM-GM inequality})为算术-几何平均值不等式)
    (Gamma (z) = nz),于是(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x) leqslant Gamma left(Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)right))得证。

  2. (max_{yin mathcal{Y}} p(ymid x) leqslant (1 - c))(r(x) > 0)
    此时

    [ begin{aligned} Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x) &= sum_{yin mathcal{Y}}p(ymid x) left(1 - s_h(x, y)right)e^{alpha r(x)} + c e^{-alpha r(x)} - 2sqrt{cleft(1 - max_{yin mathcal{Y}}p(ymid x)right)} \ & geqslant underbrace{sum_{yin mathcal{Y}}p(ymid x) left(1 - s_hleft(x, yright)right)}_{geqslant c}e^{alpha r(x)} + c e^{-alpha r(x)} - 2sqrt{cleft(sum_{yin mathcal{Y}}p(ymid x)left(1 - s_h(x, y)right)right)} \ & geqslant sum_{yin mathcal{Y}}p(ymid x) left(1 - s_h(x, y)right) + c - 2sqrt{cleft(sum_{yin mathcal{Y}}p(ymid x)left(1 - s_h(x, y)right)right)} \ &= left(sqrt{sum_{yin mathcal{Y}}p(ymid x) left(1 - s_h(x, y)right)} - sqrt{c}right)^2 \ &= left(frac{sum_{yin mathcal{Y}}p(ymid x) left(1 - s_h(x, y)right) - c}{sqrt{sum_{yin mathcal{Y}}p(ymid x) left(1 - s_h(x, y)right)} + sqrt{c}}right)^2 \ &geqslant left(frac{sum_{yin mathcal{Y}}p(ymid x) left(1 - s_h(x, y)right) - left(1 - max_{yin mathcal{Y}}p(ymid x)right) + left(1 - max_{yin mathcal{Y}}p(ymid x) - cright)}{2}right)^2 \ &geqslant left(frac{frac{1}{n} left(max_{yin mathcal{Y}}p(ymid x) - pleft(text{h}(x)mid xright)right) + frac{1}{n}left(1 - max_{yin mathcal{Y}}p(ymid x) - cright)}{2}right)^2 \ &= frac{1}{4n^2}left(1 - c - pleft(text{h}(x)mid xright)right)^2 \ &= frac{Delta C_{text{abst}, mathcal{H}, mathcal{R}}(h, r, x)^2}{4n^2} end{aligned} ]

    (Gamma (z) = 2nsqrt{z}),于是(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x) leqslant Gamma left(Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)right))得证。

  3. (max_{yin mathcal{Y}} p(ymid x) leqslant (1 - c))(r(x) leqslant 0)
    由于此时(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x) = 0),因此(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x) leqslant Gammaleft(Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)right))对任意(Gamma geqslant 0)成立。

  4. (max_{yin mathcal{Y}} p(ymid x) > (1 - c))(r(x) leqslant 0)
    此时

    [ begin{aligned} Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x) &= sum_{yin mathcal{Y}}p(ymid x) left(1 - s_h(x, y)right)e^{alpha r(x)} + c e^{-alpha r(x)} - 2sqrt{cleft(1 - max_{yin mathcal{Y}}p(ymid x)right)} \ &geqslant left(1 - max_{yin mathcal{Y}}p(ymid x)right)underbrace{e^{alpha r(x)}}_{leqslant 1} + c underbrace{e^{-alpha r(x)}}_{geqslant 1} - 2sqrt{cleft(1 - max_{yin mathcal{Y}}p(ymid x)right)} \ &geqslant 1 - max_{yin mathcal{Y}}p(ymid x) + c - 2sqrt{cleft(1 - max_{yin mathcal{Y}}p(ymid x)right)} \ &= left(sqrt{1 - max_{yin mathcal{Y}}p(ymid x)} - sqrt{c}right)^2 \ &= left(frac{1 - max_{yin mathcal{Y}}p(ymid x) - c}{sqrt{1 - max_{yin mathcal{Y}}p(ymid x)} + sqrt{c}}right)^2 \ &geqslant left(frac{max_{yin mathcal{Y}}p(ymid x) - 1 + c}{2}right)^2 \ &= frac{Delta C_{text{abst}, mathcal{H}, mathcal{R}}(h, r, x)^2}{4} end{aligned} ]

    (Gamma (z) = 2sqrt{z}),于是(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x) leqslant Gamma left(Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)right))得证。

综上所述,若取(Gamma(z) = max{Gamma_1(z), Gamma_2(z), Gamma_3(z)} = max{2nsqrt{z}, nz}),则恒有(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x) leqslant Gamma left(Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)right))。于是(mathcal{l} = mathcal{l}_{text{mae}})(psi(z) = z)时单阶段代理损失的((mathcal{H}, mathcal{R}))-一致性界得证。

4.3 (mathcal{l} = mathcal{l}_{rho})(psi(z) = z)

在这种情况下(C_L(h, r, x))可以表示为:

[begin{aligned} C_L(h, r, x) &= sum_{yin mathcal{Y}}p(ymid x) underbrace{minleft{maxleft{0, 1 - frac{rho_h(x, y)}{rho}right}, 1right}}_{mathcal{l}_{rho}}e^{alpha r(x)} + c e^{-alpha r(x)} \ &= left(1 - sum_{yin mathcal{Y}} p(ymid x)maxleft{minleft{1, frac{rho_h(x, y)}{rho}right}, 0right}right)e^{alpha r(x)} + c e^{-alpha r(x)} \ &= left(1 - sum_{yin mathcal{Y}} p(ymid x)minleft{1, frac{rho_h(x, y)}{rho}right}right)e^{alpha r(x)} + c e^{-alpha r(x)} end{aligned} ]

其中(rho_h(x, y) = h(x)_y - max_{y^{prime}neq y}h(x)_{y^{prime}})为间隔。

由于假设了(mathcal{H})是对称的与完备的,我们有

[begin{aligned} &inf_{hin mathcal{H}}left{1 - sum_{yin mathcal{Y}} p (ymid x)minleft{1, frac{rho_h(x, y)}{rho}right}right} \ &= 1 - sup_{hin mathcal{H}}sum_{yin mathcal{Y}}p(ymid x)minleft{1, frac{rho_h(x, y)}{rho}right} \ &= 1 - max_{yin mathcal{Y}}pleft(ymid xright)quad (minleft{1, frac{rho_h(x, y)}{rho}right}in [0, 1]) end{aligned} ]

实际上,对任意(hin mathcal{H}),有:

[begin{aligned} &left(1 - sum_{yin mathcal{Y}} p (ymid x)minleft{1, frac{rho_h(x, y)}{rho}right}right) - left(1 - max_{yin mathcal{Y}}pleft(ymid xright)right) \ &= max_{yin mathcal{Y}} p(ymid x) - sum_{yin mathcal{Y}} p (ymid x)minleft{1, frac{rho_h(x, y)}{rho}right} \ &= max_{yin mathcal{Y}} p(ymid x) - min left{1, frac{rho_hleft(x, text{h}(x)right)}{rho}right}pleft(text{h}(x)mid xright) \ &geqslant max_{yin mathcal{Y}}pleft(ymid xright) - pleft(text{h}(x)mid xright) end{aligned} ]

和之前(mathcal{l}_{text{mae}})的证明类似,这个结论我们会在后面的证明中多次用到。

于是和之前(mathcal{l}_{text{mae}})类似,我们有

[C_L^*(mathcal{H}, mathcal{R}, x) = 2sqrt{c(1 - max_{yin mathcal{Y}}pleft(ymid xright))} ]

于是

[begin{aligned} Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x) &= C_L(h, r, x) - C^*_L(mathcal{H}, mathcal{R}, x) \ &= left(1 - sum_{yin mathcal{Y}} p (ymid x)minleft{1, frac{rho_h(x, y)}{rho}right}right)e^{alpha r(x)} + c e^{-alpha r(x)} - 2sqrt{c(1 - max_{yin mathcal{Y}}pleft(ymid xright))} end{aligned} ]

为了构建(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x))(Gamma left(Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)right))的不等式关系,接下来我们将会采用(mathcal{l}_{text{mae}})的证明中类似的做法,针对(max_{yin mathcal{Y}} p(ymid x))(1 - c)的大小比较情况与(r(x))的正负情况来对(Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x))进行分类讨论:

  1. (max_{yin mathcal{Y}} p(ymid x) > (1 - c))(r(x) > 0)
    此时

    [begin{aligned} Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x) &= left(1 - sum_{yin mathcal{Y}} p (ymid x)minleft{1, frac{rho_h(x, y)}{rho}right}right)e^{alpha r(x)} + c e^{-alpha r(x)} - 2sqrt{cleft(1 - max_{yin mathcal{Y}}pleft(ymid xright)right)} \ &geqslant frac{1}{4}left(1 - c - pleft(text{h}left(xright)mid xright)right)^2 \ &= frac{Delta C_{text{abst}, mathcal{H}, mathcal{R}}(h, r, x)^2}{4} end{aligned} ]

    (由于证明步骤与(mathcal{l}_{text{mae}})类似,这里对证明步骤进行了一些精简,下面同理)
    (Gamma_2 (z) = 2sqrt{z}),于是(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x) leqslant Gamma (Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)))得证。

  2. (max_{yin mathcal{Y}} p(ymid x) leqslant (1 - c))(r(x) > 0)
    此时

    [Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x) geqslant max_{yin mathcal{Y}}p(ymid x) - p(text{h}(x)mid x) = Delta C_{text{abst}, mathcal{H}, mathcal{R}}(h, r, x) ]

    (Gamma_1 (z) = z),于是(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x) leqslant Gamma (Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)))得证。

  3. (max_{yin mathcal{Y}} p(ymid x) leqslant (1 - c))(r(x) leqslant 0)
    由于此时(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x) = 0),因此(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x) leqslant Gamma (Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)))对任意(Gamma geqslant 0)成立。

  4. (max_{yin mathcal{Y}} p(ymid x) > (1 - c))(r(x) leqslant 0)
    此时

    [Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x) geqslant left(frac{max_{yin mathcal{Y}}p(ymid x) - 1 + c}{2}right)^2 = frac{Delta C_{text{abst}, mathcal{H}, mathcal{R}}(h, r, x)^2}{4} ]

    (Gamma_3 (z) = 2sqrt{z}),于是(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x) leqslant Gamma (Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)))得证。

综上所述,若取(Gamma(z) = max{Gamma_1(z), Gamma_2(z), Gamma_3(z)} = max{2sqrt{z}, z}),则恒有(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x) leqslant Gamma (Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)))。于是(mathcal{l} = mathcal{l}_{rho})(psi(z) = z)时单阶段代理损失的((mathcal{H}, mathcal{R}))-一致性界得证。

4.4 (mathcal{l} = mathcal{l}_{rho-text{hinge}})(psi(z) = nz)

在这种情况下(C_L(h, r, x))可以表示为:

[begin{aligned} C_L(h, r, x) &= sum_{yin mathcal{Y}}p(ymid x) underbrace{sum_{y^{prime} neq y}maxleft{0, 1 + frac{h(x)_{y^{prime}}}{rho}right}}_{mathcal{l}_{rho}-text{hinge}}e^{alpha r(x)} + nce^{-alpha r(x)} \ &= sum_{yin mathcal{Y}}left(1 - p(ymid x)right)maxleft{0, 1 + frac{h(x)_y}{rho}right}e^{alpha r(x)} + nce^{-alpha r(x)} \ end{aligned} ]

由于假设了(mathcal{H})是对称的与完备的,我们有

[begin{aligned} &inf_{hin mathcal{H}}left{sum_{yin mathcal{Y}}left(1 - p(ymid x)right)maxleft{0, 1 + frac{h(x)_y}{rho}right}right} \ &= n - sup_{hin mathcal{H}}sum_{yin mathcal{Y}}p(ymid x)maxleft{0, 1 + frac{h(x)_y}{rho}right} \ &= nleft(1 - max_{yin mathcal{Y}}pleft(ymid xright)right) end{aligned} ]

实际上,若取(h_{rho})使得(h_{rho}(x)_y = left{begin{aligned} &h(x)_yquad &text{if } ynotin left{y_{max}, text{h}(x)right} \ &-rho quad &text{if } y = text{h}(x) \ &hleft(xright)_{y_{text{max}}} + hleft(xright)_{text{h}(x)} + rho quad &text{if } y = y_{text{max}} \ end{aligned}right.)满足约束(sum_{yin mathcal{Y}}h_{rho}(ymid x)=0),其中(y_{max} = text{arg max}_{yin mathcal{Y}}p(ymid x)),则对任意(hin mathcal{H})有:

[begin{aligned} &sum_{yin mathcal{Y}}left(1 - p(ymid x)right)maxleft{0, 1 + frac{h(x)_y}{rho}right} - nleft(1 - max_{yin mathcal{Y}}pleft(ymid xright)right) \ &geqslant sum_{yin mathcal{Y}}left(1 - p(ymid x)right)minleft{n, maxleft{0, 1 + frac{h(x)_y}{rho}right}right} - nleft(1 - max_{yin mathcal{Y}}pleft(ymid xright)right) \ &geqslant sum_{yin mathcal{Y}}left(1 - p(ymid x)right)minleft{n, maxleft{0, 1 + frac{h(x)_y}{rho}right}right} \ &quad - sum_{yin mathcal{Y}}left(1 - p(ymid x)right)minleft{n, maxleft{0, 1 + frac{h_{rho}(x)_y}{rho}right}right} \ &= left(p(y_{text{max}}mid x) - p(text{h}(x)mid x)right)minleft{n, 1 + frac{h(x)_{text{h}(x)}}{rho}right} \ &geqslant max_{yin mathcal{Y}}pleft(ymid xright) - pleft(text{h}left(xright)mid xright) end{aligned} ]

和之前(mathcal{l}_{mae})(mathcal{l}_{rho})的证明类似,这个结论我们会在后面的证明中多次用到。

于是和之前(mathcal{l}_{mae})(mathcal{l}_{rho})类似,我们有

[C_L^*(mathcal{H}, mathcal{R}, x) = 2sqrt{n^2c(1 - max_{yin mathcal{Y}}pleft(ymid xright))} ]

于是

[begin{aligned} Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x) &= C_L(h, r, x) - C^*_L(mathcal{H}, mathcal{R}, x) \ &= sum_{yin mathcal{Y}}left(1 - p(ymid x)right)maxleft{0, 1 + frac{h(x)_y}{rho}right}e^{alpha r(x)} + c e^{-alpha r(x)} - 2sqrt{c(1 - max_{yin mathcal{Y}}pleft(ymid xright))} end{aligned} ]

为了构建(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x))(Gamma left(Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)right))的不等式关系,接下来我们将会采用(mathcal{l}_{text{mae}})(mathcal{l}_{rho})的证明中类似的做法,针对(max_{yin mathcal{Y}} p(ymid x))(1 - c)的大小比较情况与(r(x))的正负情况来对(Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x))进行分类讨论:

  1. (max_{yin mathcal{Y}} p(ymid x) > (1 - c))(r(x) > 0)
    此时

    [Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x) geqslant max_{yin mathcal{Y}}p(ymid x) - p(text{h}(x)mid x) = Delta C_{text{abst}, mathcal{H}, mathcal{R}}(h, r, x)^2 ]

    (Gamma_1 (z) = z),于是(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x) leqslant Gamma (Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)))得证。

  2. (max_{yin mathcal{Y}} p(ymid x) leqslant (1 - c))(r(x) > 0)
    此时

    [Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x) geqslant frac{1}{4n}left(1 - c - pleft(text{h}left(xright)mid xright)right)^2 = frac{Delta C_{text{abst}, mathcal{H}, mathcal{R}}(h, r, x)^2}{4n} ]

    (Gamma_1 (z) = 2sqrt{nz}),于是(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x) leqslant Gamma (Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)))得证。

  3. (max_{yin mathcal{Y}} p(ymid x) leqslant (1 - c))(r(x) leqslant 0)
    由于此时(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x) = 0),因此(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x) leqslant Gamma (Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)))对任意(Gamma geqslant 0)成立。

  4. (max_{yin mathcal{Y}} p(ymid x) > (1 - c))(r(x) leqslant 0)
    此时

    [Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x) geqslant nleft(frac{max_{yin mathcal{Y}}p(ymid x) - 1 + c}{2}right)^2 = frac{nDelta C_{text{abst}, mathcal{H}, mathcal{R}}(h, r, x)^2}{4} ]

    (Gamma_3 (z) = 2sqrt{z/n}),于是(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x) leqslant Gamma (Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)))得证。

综上所述,若取(Gamma(z) = max{Gamma_1(z), Gamma_2(z), Gamma_3(z)} = max{2sqrt{nz}, z}),则恒有(Delta C_{L_text{abst}, mathcal{H}, mathcal{R}}(h, r, x) leqslant Gamma (Delta C_{L, mathcal{H}, mathcal{R}}(h, r, x)))。于是(mathcal{l} = mathcal{l}_{rho-text{hinge}})(psi(z) = nz)时单阶段代理损失的((mathcal{H}, mathcal{R}))-一致性界得证。

参考

  • [1] Mao A, Mohri M, Zhong Y. Predictor-rejector multi-class abstention: Theoretical analysis and algorithms[C]//International Conference on Algorithmic Learning Theory. PMLR, 2024: 822-867.
  • [2] Ni C, Charoenphakdee N, Honda J, et al. On the calibration of multiclass classification with rejection[J]. Advances in Neural Information Processing Systems, 2019, 32.
发表评论

评论已关闭。

相关文章