模型预处理层介绍（1） – Discretization

技术分享 2年前 (2023-02-15) 0 999+

关注

预处理的作用主要在于将难以表达的string或者数组转换成模型容易训练的向量表示，其中转化过程大多是形成一张查询表用来查询。

常见的预处理方式包括：

class Discretization: Buckets data into discrete ranges.
class Hashing: Implements categorical feature hashing, also known as "hashing trick".
class IntegerLookup: Maps integers from a vocabulary to integer indices.
class Normalization: Feature-wise normalization of the data.
class StringLookup: Maps strings from a vocabulary to integer indices.

接下来，本文将介绍下这些常用的预处理方式的作用和内容

Discretization

离散化层。该层将其输入数据的每个元素放入几个连续的范围之一，并输出一个整数索引，指示每个元素位于哪个范围。这个索引也就是索引编号，通过分桶边界值判断输入的数字属于哪个分桶，以此给出桶号。

换句话说，它的作用在于将连续的数值特征转换为整数分类特征。

在2.5版本的tf当中，该层的入参只有一个分桶边界bins，用法为：

tf.keras.layers.experimental.preprocessing.Discretization(     bins, **kwargs )

在2.11版本的tf当中,Discretization层被定义为

tf.keras.layers.Discretization(     bin_boundaries=None,     num_bins=None,     epsilon=0.01,     output_mode='int',     sparse=False,     **kwargs )

其中多了一个epsilon，用来作为误差容忍度，通常是一个接近于零的小分数(例如0.01)。较大的epsilon值会增加分位数近似，从而导致更多不相等的桶，但可以提高性能和资源消耗。

另外还多了一个output_mode:

"int":直接返回离散化的bin索引。
"one_hot":将输入中的每个元素编码到与num_bins大小相同的数组中，在输入的bin索引处包含一个1。如果最后一个维度的大小为1，则对该维度进行编码。如果最后一个维度的大小不是1，则将为编码后的输出追加一个新维度。
"multi_hot":将输入中的每个样本编码到与num_bins大小相同的单个数组中，为样本中出现的每个bin索引索引包含一个1。将最后一个维度作为样本维度，如果输入形状为(…, sample_length)，输出形状将是(…, num_tokens)。
"count":作为"multi_hot"，但int数组包含bin索引在示例中出现次数的计数。

举一个现有的官方的例子：

>>> input = np.array([[-1.5, 1.0, 3.4, .5], [0.0, 3.0, 1.3, 0.0]]) >>> layer = tf.keras.layers.experimental.preprocessing.Discretization( ...          bins=[0., 1., 2.]) >>> layer(input) <tf.Tensor: shape=(2, 4), dtype=int32, numpy= array([[0, 1, 3, 1],        [0, 3, 2, 0]], dtype=int32)>

在这个例子中，传入的参数 bins=[0., 1., 2.] 代表着该层以0、1、2 作为数值边界进行分桶，所以整体的查询表大概如下所示：