6. Kennard-Stone 采样方法在 Python/C 下的高效实现 \(O(N^2)\)#

创建时间：2021-01-29

在这篇文档中，我们会非常简单地回顾 Kennard-Stone 采样方法的原理，并集中精力讨论其算法实现方式。

标题的 \(O(N^2)\) 的 \(N\) 表示的是样本数量 \(n_\mathrm{sample}\)。严格的计算复杂度是 \(O(n_\mathrm{sample}^2 n_\mathrm{feature} + n_\mathrm{sample} n_\mathrm{result})\)，其中 \(n_\mathrm{feature}\) 为特征向量长度且认为是常数值，\(n_\mathrm{result}\) 为采样数。

Kennard-Stone Algorithm 的原始文献是 Kennard, Stone [1]。

Kennard-Stone 采样的重要性在于，对于绝大多数机器学习方法而言，选择一个较好的采样方式构建训练集，会一定程度地增强泛化能力。该采样方法在小样本数据集中较常使用；大样本数据集大多采用随机分割方法。

这篇文档的特色有四处：

通过 Python 中的 numpy 函数，可以并行、低额外内存消耗、且快速地构建欧式距离 (Euclidean Distance) 矩阵；
通过活用 numpy 的函数，可以高效串行地实现 Kennard-Stone 采样方法，且相信比目前大多数开源方案快很多；
通过 C/OpenMP 与 ctype binding，可以并行且进一步压榨采样效率。
通过高效的程序实现，对于 Intel-i5 7300HQ 的 4 核个人电脑，30000 个样本 (特征向量长度为 100) 的采样时间最快仅仅需要不到 10 秒。如果有更多的内存，这几乎意味着十万 (100k) 级别的数据可以迅速以 Kennard-Stone 方法采样。

需要注意，C 语言的程序需要在 gcc 的编译器上实现。同时，计算过程采用了单浮点，因此可能会在超级大样本中产生结果的差异。

import numpy as np
import numpy.ma as ma

np.set_printoptions(precision=6, linewidth=120, suppress=True)
np.random.seed(0)

6.1. Python 实现#

6.1.1. 随机数据集#

首先我们创造一个随机数据集，该数据集包含 5000 个样本。

n_sample \(n_\mathrm{sample}\) 为样本数。现在定为 5000。
n_feature \(n_\mathrm{feature}\) 为特征数 (描述样本性质的数据)。现在定为 100。
X \(\mathbf{X}\) 为完整的数据集。它是一个 \((n_\mathrm{sample}, n_\mathrm{feature})\) 大小的矩阵。
- 单个数据写为向量 \(\boldsymbol{x}_{i}\)，矩阵元写为 \(x_{ia}\)。其中，\(i, j, k, \cdots\) 为样本角标，\(a\) 为特征角标。

n_sample  = 5000
n_feature = 100
X = np.random.randn(n_sample, n_feature)
X *= 100

由于这份文档的目标是说明代码的实现方式与效率，因此不打算具体地用小数据作结果演示，而是用使用中等的数据集。

6.1.2. 距离矩阵的生成 (欧氏距离)#

两个样本之间的欧氏距离 (Euclidean Distance) 定义为

\[ d_{ij} = \Vert \boldsymbol{x}_i - \boldsymbol{x}_j \Vert_2 = \sqrt{\sum_a (x_{ia} - x_{ja})^2} \]

因此，距离矩阵 \(\mathbf{D}\) 会是一个维度为 \((n_\mathrm{sample}, n_\mathrm{sample})\)。

但如果直接用上式计算欧式距离，计算代价会稍大。因此，一个取巧的方法是利用矩阵相乘的结果。

定义中间变量的矩阵 \(\mathbf{Y} = \mathbf{X} \mathbf{X}^\dagger\)，并定义向量 \(\boldsymbol{t} = \mathrm{diag} (\mathbf{Y})\) 即 \(\mathbf{Y}\) 的对角线，或写成

\[ y_{ij} = \sum_{a} x_{ia} x_{ja}, \quad t_i = y_{ii} \]

那么距离矩阵的平方就可以写为

\[ d_{ij}^2 = \sum_a (x_{ia} - x_{ja})^2 = \sum_a x_{ia}^2 - \sum_a x_{ia} x_{ja} + \sum_a x_{ia}^2 = t_i - 2 y_{ij} + t_j \]

这就可以很大程度上简化计算量，并且不会引入很大的内存损耗。对于 numpy，由于 Boardcasting 机制，内存消耗可以是严格的 \(n_\mathrm{sample}^2 + n_\mathrm{sample}\)。

计算复杂度是 \(O(n_\mathrm{sample}^2 n_\mathrm{feature})\)，这是因为两 \((n_\mathrm{sample}, n_\mathrm{sample})\) 维度的矩阵相乘复杂度便是如此。

现在我们考察的实现过程：

dist 距离矩阵 \(\mathbf{D}\) 或写成 \(d_{ij}\)，维度 \((n_\mathrm{sample}, n_\mathrm{sample})\)；它是这一步的输出。
t 中间变量矩阵 \(\boldsymbol{t} = \mathrm{diag} (\mathbf{Y})\) 或写成 \(t_i\)，维度 \((n_\mathrm{sample}, )\)。

def get_dist(X):
    dist = X @ X.T              # - 1
    t = dist.diagonal().copy()  # - 2
    dist *= -2                  # - 3
    dist += t[:, None]          # - 4
    dist += t[None, :]          # - 5
    return np.sqrt(dist)        # - 6

其具体算法过程是

生成 \(y_{ij} = \sum_a x_{ia} x_{ja}\)，并将 \(y_ij\) 储存在 dist 中；
生成 \(t_i = y_{ii}\)，并将 \(t_i\) 储存在 t 中；
生成 \(- 2 y_{ij}\)，覆盖在 dist 中；
生成 \(- 2 y_{ij} + t_i\)，覆盖在 dist 中；
生成 \(- 2 y_{ij} + t_i + t_j\)，覆盖在 dist 中；
生成 \(\sqrt{- 2 y_{ij} + t_i + t_j}\)，覆盖在 dist 中。

那么当前样本所对应的距离矩阵就通过下述代码生成：

%%time
dist = get_dist(X)

CPU times: user 592 ms, sys: 61.2 ms, total: 654 ms
Wall time: 223 ms

最后指出，这部分代码显然应当可通过 C 语言加速，但由于 numpy 的底层计算实现已经利用了 C 语言与经过优化的 Blas，因此我认为没有必要对距离矩阵的生成再使用 C 语言优化。

6.1.3. 默认的种子样本：距离最远的两个样本#

Kennard-Stone 采样需要基于一部分“种子样本”(Seed) 进行。令记号 \(s_i\) 代表种子样本，\(n_\mathrm{seed}\) 代表种子数量。

默认的种子是距离最远的两个样本，即 \(n_\mathrm{seed} = 2\)，

\[ (s_0, s_1) = \arg \max_{(i, j)} d_{ij} \text{, or equilvently, } d_{s_0 s_1} = d_{s_1 s_0} = \max_{(i, j)} d_{ij} \]

寻找最大的元素位置可以直接地通过 numpy.argmax 函数得到。

下面定义一些程序变量：

max_indexes：距离最远的两个样本指标 \(\arg \max_{(i, j)} d_{ij}\)，为 tuple 型变量；
max_dist：最远的距离 \(d_\mathrm{max} = \max_{(i, j)} d_{ij}\)；
seed：种子样本 \(\boldsymbol{s}\) 或写为 \(s_i\)，维度为 \((n_\mathrm{seed}, )\)；它原则上可以是任何非空样本集，但默认是 \((s_0, s_1) = \arg \max_{(i, j)} d_{ij}\)。

%%time
max_indexes = np.unravel_index(np.argmax(dist), dist.shape)
max_dist = dist[max_indexes]
seed = max_indexes
seed

CPU times: user 78.7 ms, sys: 0 ns, total: 78.7 ms
Wall time: 19.7 ms

(450, 4092)

6.1.4. Kennard-Stone 采样#

令 \(r_0, r_1, \cdots, r_{n - 1}\) 代表的是第 \(0, 1, \cdots, n-1\) 次采样的样本序号。现在需要采样第 \(n\) 个样本。

如果我们现在知道，对于任意未采样的第 \(i\) 个样本，在第 \(n\) 步时有最小距离向量

\[ {}^n m_i = \min \{ d_{r_0 i}, d_{r_1 i}, \cdots, d_{r_{n-1} i} \} \]

那么第 \(n\) 步被采样的样本应当是 (到已采样点最小距离能取到最大值的样本)

\[ r_n = \underset{i \not \in \{ r_0, r_1, \cdots, r_{n-1} \}}{\text{arg max}} \big\{ \min \{ d_{r_0 i}, d_{r_1 i}, \cdots, d_{r_{n-1} i} \} \big\} = \underset{i \not \in \{ r_0, r_1, \cdots, r_{n-1} \}}{\text{arg max}} {}^n m_i \]

需要注意，上式的 \(i\) 必须是在未采样的样本中得到的；但这个记号较为冗余，因此后文会省略。

随后需要为下一步采样过程作准备。下一步需要使用到更新后的最小距离向量

\[ {}^{n+1} m_i = \min \{ d_{r_0 i}, d_{r_1 i}, \cdots, d_{r_{n-1} i}, d_{r_n i} \} \]

这个向量的构建对于单个数据 \(i\) 来说，不需要 \(O(n)\) 的计算复杂度，因为上式等价于

\[ {}^{n+1} m_i = \min \{ {}^{n} m_i, d_{r_n i} \} \]

随后就可以得到 \(r_{n+1}\)，以此类推。这是算法可以变得较快的实际原因。

Kennard-Stone 算法的简述就是这些。对于程序作下述说明。

函数输入与输出

已经定义的量是

输入量 dist 距离矩阵 \(d_{ij}\)，维度 \((n_\mathrm{sample}, n_\mathrm{sample})\)
输入量 seed 种子样本 \(s_i\)，维度 \((n_\mathrm{seed}, )\)。

还需要定义的

输入量 n_result \(n_\mathrm{result}\) 待采样数据数量；这里我们进行 4990 个样本采样；
输出量 result \(\boldsymbol{r}\) 或 \(r_i\) 为采样结果序号，维度 \((n_\mathrm{result}, )\)，并依照采样顺序记录结果；
输出量 v_dist \(\boldsymbol{v}\) 或 \(v_i\) 为采样距离，维度 \((n_\mathrm{result}, )\)：

\[ v_{i} = \max_j {}^{i + 1} m_j = \max_j \big\{ \min \{ d_{r_0 j}, d_{r_1 j}, \cdots, d_{r_i j} \} \big\} \]

中间变量定义 (Definition: Intermediate Variables)

n_seed 种子样本的维度 \(n_\mathrm{seed}\)；
selected 储存是否已被采样的信息向量，维度 \((n_\mathrm{sample}, )\)；
min_vals 最小值向量 \({}^n \boldsymbol{m}\) 或 \({}^n m_i\)，维度 \((n_\mathrm{sample}, )\)

初始化过程 (--- Initialization ---)

首先将种子 \((s_0, s_1, \cdots, s_{n_\mathrm{seed}})\) 放到结果中，即令 \(r_0 = s_0\), \(r_1 = s_1\), \(\cdots\), \(r_{n_\mathrm{seed}} = s_{n_\mathrm{seed}}\)；
如果种子样本数量恰为 2，那么就认为选取了欧氏距离最远的两个点，并认为 \(v_0 = d_{s_0 s_1} = \max_{(i, j)} d_{ij}\)；
初始化种子样本下的最小距离向量。注意到 \(r_0 = s_0\)，因此

\[ {}^1 m_i = \min \{ d_{r_0 i} \} = d_{s_0 i} \]
确定一个值 upper_bound，使得对于任意的 \(n, i\)，都有 \({}^n m_i \leqslant \mathtt{upper\_bound}\)。这是利用了 \({}^n m_i \leqslant {}^n m_i \leqslant \max_i {}^n m_i = \mathtt{upper\_bound}\) 的性质而写的
- 这个变量存在与否不影响算法本身，它是出于 numpy.min 函数在含有 where 传参时，必须要指定一个 initial 参数而定；
对于所有未选中的样本，都使用下述方式迭代更新最小距离向量：

\[ {}^{n+1} m_i = \min \{ {}^{n} m_i, d_{r_n i} \} = \min \{ {}^{n} m_i, d_{s_n i} \} \]

注意如果 \(i\) 已经在被采样样本中，就无需作更新了；通过 selected 确定是否要更新样本 \(i\)。

循环过程 (--- Loop argmax minimum ---)

循环的目的是确定第 \(n\) n 个样本所在位置。\(n\) 从种子数 \(n_\mathrm{seed}\) 开始，到采样数 \(n_\mathrm{result}\) 结束。

获得采样的样本位置 \(r_n = \arg \max_i {}^n m_i\)，并储存在临时变量 sup_index 中；
将样本位置放到向量 \(\boldsymbol{r}\) 中；
具体的数值 \(\max_i {}^n m_i\) 放到 \(\boldsymbol{v}\) 中；
告知 selected 样本 \(i\) 已经被选中了；
更新最小距离向量：\({}^{n+1} m_i = \min \{ {}^{n} m_i, d_{r_n i} \}\)。

程序所需要的额外内存是 \(2 n_\mathrm{result} + 2 n_\mathrm{sample}\)；这相对于 \(n_\mathrm{sample}^2\) 的距离矩阵来说已经不是很重要了。

程序的计算复杂度是 \(O(n_\mathrm{sample} n_\mathrm{result})\)。这是因为需要采样 \(n_\mathrm{result}\) 个样本；每次需要在 \(n_\mathrm{sample}\) 数量的最小距离向量中寻找最大值，以及更新最小距离向量，因此复杂度就是采样数与样本数的简单乘积。

n_result = 4990
result = np.zeros(n_result, dtype=int)
v_dist = np.zeros(n_result, dtype=float)

def ks_sampling_core(dist, seed, n_result):
    # Definition: Output Variables
    result = np.zeros(n_result, dtype=int)
    v_dist = np.zeros(n_result, dtype=float)
    # Definition: Intermediate Variables
    n_seed = len(seed)
    selected = np.zeros(n_sample, dtype=bool)
    min_vals = np.zeros(n_sample, dtype=float)
    # --- Initialization ---
    result[:n_seed] = seed                   # - 1
    if n_seed == 2:
        v_dist[0] = dist[seed[0], seed[1]]   # - 2
    min_vals[:] = dist[seed[0]]              # - 3
    upper_bound = min_vals.max()             # - 4
    for n in seed:                           # - 5
        np.min(np.array([min_vals, dist[n]]), axis=0, initial=upper_bound, where=np.logical_not(selected), out=min_vals)
    # --- Loop argmax minimum ---
    for n in range(n_seed, n_result):
        sup_index = ma.array(min_vals, mask=selected).argmax()  # - 1
        result[n] = sup_index                                   # - 2
        v_dist[n - 1] = min_vals[sup_index]                     # - 3
        selected[sup_index] = True                              # - 4     # | 5
        np.min(np.array([min_vals, dist[sup_index]]), axis=0, initial=upper_bound, where=np.logical_not(selected), out=min_vals)
    return result, v_dist

执行上述代码后，就可以得到筛选出来的 4990 个样本 \(\boldsymbol{r}\) 序号，以及前 4889 个样本的采样距离 \(\boldsymbol{v}\)。

%%time
ks_sampling_core(dist, seed, n_result)

CPU times: user 628 ms, sys: 0 ns, total: 628 ms
Wall time: 610 ms

(array([ 450, 4092, 3342, ..., 1696, 4495, 4400]),
 array([1963.36697 , 1819.286499, 1770.680538, ...,  959.175021,  956.828118,    0.      ]))

6.2. C 语言代码#

在 C 语言代码中只进行默认种子筛选与 Kennard-Stone 采样过程。我们暂且相信 numpy 可以很好地处理距离矩阵计算的过程。

其编写是通过 ctype 的 Python/C binding 实现的。C 文件为 ks_cpp.c，其 Python 接口文件为 KS_Sampling.py。调用方式可以与上面相似：

from KS_Sampling import ks_sampling_core_cpp

%%time
ks_sampling_core_cpp(dist, seed, n_result)

CPU times: user 158 ms, sys: 2.97 ms, total: 161 ms
Wall time: 56.3 ms

(array([ 450, 4092, 3342, ..., 1696, 4495, 4400]),
 array([1963.366943, 1819.286499, 1770.680542, ...,  959.175049,  956.828125,    0.      ]))

如果不希望提供种子文件，且希望 C 程序自己找到两个距离最远的样本作为种子，那么通过下述方式调用：

%%time
ks_sampling_core_cpp(dist, n_result=n_result)

CPU times: user 250 ms, sys: 11.4 ms, total: 261 ms
Wall time: 65.5 ms

(array([4092,  450, 3342, ..., 1696, 4495, 4400]),
 array([1963.366943, 1819.286499, 1770.680542, ...,  959.175049,  956.828125,    0.      ]))

如果想抽取所有样本，查看样本的抽样顺序，那么只需要提供距离矩阵即可：

%%time
ks_sampling_core_cpp(dist)

CPU times: user 254 ms, sys: 8.13 ms, total: 262 ms
Wall time: 65.6 ms

(array([4092,  450, 3342, ..., 1768, 1419, 3485]),
 array([1963.366943, 1819.286499, 1770.680542, ...,  921.688293,  910.627808,    0.      ]))

在具体的 C 语言实现中，考虑了以下的优化代码因素：

尽可能使用并行。在选取最大值、更新最小距离向量时，都使用 OpenMP 并行。并行的方式基本是默认。
避免大量的指针地址计算。

6.3. 完整的 Kennard-Stone 采样函数#

在引入上面定义过的函数 get_dist, ks_sampling_core, ks_sampling_core_cpp 的情况下，可以定义下述最终的采样函数；它的必选输入是样本数据，与其他类似程序的函数签名比较接近。

def ks_sampling(X, seed=None, n_result=None, get_dist=get_dist, backend="Python"):
    """
    ks_sampling(X, seed=None, n_result=None, backend="Python")
    
    KS Sampling Program
    
    Parameters
    ----------
    
    X: np.ndarray, shape: (n_sample, n_feature)
        Original data, need to be generated by user.
        
    seed: np.ndarray or list or None, shape: (n_seed, ), optional
        Initial selected seed.
        If set as `None`, the program will find the two samples
        which have largest distance as seed.
        
    n_result: int or None, optional
        Number of samples that should be selected.
        If set as `None`, `n_sample` will be used instead, i.e.
        selectet all data.
    
    get_dist: function
        A function `get_dist(X)` that will read original data, and
        return distance.
        Default Implementation is Euclidean distance.
    
    backend: str, "Python" or "C"
        Specify Kennard-Stone sampling function backend in Python
        language or C language.
    """
    X = np.asarray(X, dtype=np.float32)
    if n_result is None:
        n_result = X.shape[0]
    dist = get_dist(X)
    if backend == "Python":
        if seed is None or len(seed) == 0:
            seed = np.unravel_index(np.argmax(dist), dist.shape)
        return ks_sampling_core(dist, seed, n_result)
    elif backend == "C":
        return ks_sampling_core_cpp(dist, seed, n_result)
    else:
        raise NotImplemented("Other backends are not implemented!")

如果希望在含有 5000 个数据的样本 X 中，依 Kennard-Stone 算法挑选 4990 个样本，可以用下述函数给出结果：

%%time
ks_sampling(X, n_result=4990)

CPU times: user 1.32 s, sys: 47.6 ms, total: 1.37 s
Wall time: 738 ms

(array([4092,  450, 3342, ..., 1696, 4495, 4400]),
 array([1963.367065, 1819.286499, 1770.680542, ...,  959.174927,  956.828064,    0.      ]))

更为快速的方法是使用 C 语言作为 Kennard-Stone 算法引擎：

%%time
ks_sampling(X, n_result=4990, backend="C")

CPU times: user 530 ms, sys: 44.5 ms, total: 574 ms
Wall time: 151 ms

(array([4092,  450, 3342, ..., 1696, 4495, 4400]),
 array([1963.367065, 1819.286499, 1770.680542, ...,  959.174927,  956.828064,    0.      ]))

如果希望前两个被采样样本序号是 (345, 456)，那么就需要需要指定种子大小 (采样顺序会严格地全部输出，但此时不一定会输出第一个样本距离)：

%%time
ks_sampling(X, seed=[345, 456], n_result=4990, backend="C")

CPU times: user 514 ms, sys: 35.7 ms, total: 550 ms
Wall time: 138 ms

(array([ 345,  456,  450, ..., 1696, 4495, 4400]),
 array([1388.464722, 1649.251587, 1633.396362, ...,  959.174927,  956.828064,    0.      ]))

最后，如果希望计算的是 L1 范数距离而非欧氏距离 (即 L2 范数)，那么就更改 get_dist 函数即可：

from sklearn.metrics import pairwise_distances

%%time
ks_sampling(X, seed=[345, 456], n_result=4990, get_dist=lambda X: pairwise_distances(X, metric="l1"), backend="C")

CPU times: user 1.28 s, sys: 36 ms, total: 1.31 s
Wall time: 1.2 s

(array([ 345,  456,  450, ...,  999, 2095, 2046]),
 array([11158.767578, 13465.605469, 13155.375   , ...,  7544.62793 ,  7521.103516,     0.      ]))

之所以上述计算过程耗时长，是因为 L1 范数距离的耗时很大。

6.4. 代码效率比较#

下面是一些代码效率的比较，以示程序的高效程度与能力范围。

这些代码是在无图形界面的 Linux 下运行，CPU 为 Intel-i5 7300HQ，内存为 16 GB，默认在可以 4 核并行时尽可能并行。

6.4.1. 距离矩阵生成 (5000×100 数据)#

我们这里比较三种做法。

get_dist 是上面的程序实现方法；
get_dist2 使用了更为直观的做法，即直接生成 \(d_{ij} = \sqrt{t_i - 2 y_{ij} + t_j}\)；
euclidean_distances 是最为常用的生成欧氏距离的库函数之一。

from sklearn.metrics import pairwise_distances

def get_dist2(X):
    Y = X @ X.T
    t = Y.diagonal()
    dist = np.sqrt(t[:, None] - 2 * Y + t[:, None])
    return dist

get_dist 的耗时最少：

%%timeit -r 7 -n 3
get_dist(X)

161 ms ± 149 µs per loop (mean ± std. dev. of 7 runs, 3 loops each)

get_dist2 作为直观解法，耗时多一些，且内存消耗更大：

%%timeit -r 7 -n 3
get_dist2(X)

224 ms ± 63.2 µs per loop (mean ± std. dev. of 7 runs, 3 loops each)

pairwise_distances 库函数相对较慢，但仍然支持并行运算，效率相对还算可观。

%%timeit -r 7 -n 3
pairwise_distances(X)

283 ms ± 5.63 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)

6.4.2. KS 采样程序 (500×10 数据)#

下面我们会考察一些具体的采样程序。我们会对网络上可以找到的开源代码做速度上的比较与测评，并简单验证我们 \(O(n_\mathrm{sample} n_\mathrm{result})\) 程序的正确性。

由于大多数开源的代码是 \(O(N^3)\) 计算量级，且编程细节存在差异，因此只能在短时间内处理非常小的样本。我们先拿 500 样本的数据做验证。

n_sample  = 500
n_feature = 10
X = np.random.randn(n_sample, n_feature)
X *= 100

下述代码是用 Python 程序，完整采样 500 个数据：

%%time
result_my_python = ks_sampling(X)

CPU times: user 74.8 ms, sys: 264 µs, total: 75.1 ms
Wall time: 25 ms

下述代码是用 C 程序，完整采样 500 个数据：

%%time
result_my_cpp = ks_sampling(X, backend="C")

CPU times: user 93.1 ms, sys: 0 ns, total: 93.1 ms
Wall time: 31 ms

通过 np.allclose 可以两种代码的验证采样顺序是否一致。上面的 C 语言代码并没有更快，是因为小样本下 C 语言代码体现不出速度优势。

np.allclose(result_my_python[0][2:], result_my_cpp[0][2:])

True

下面找了比较常见的三个开源实现的纯 Python 代码并做比对；用于比对的程序放在 KS_Sampling_Others.py 中：

https://hxhc.github.io/post/kennardstone-spxy/
https://github.com/karoka/Kennard-Stone-Algorithm/blob/master/kenStone.py
https://github.com/XiaqiongFan/PC-CCA/blob/master/PC-CCA.py

第一个程序的实现效率较快，但尚没有达到我们上文定义的 ks_sampling 函数。在大数据量下，这种差异会更明显。

from KS_Sampling_Others import ks_from_hxhc, ks_from_karoka, ks_from_XiaqiongFan

%%time
result_hxhc = ks_from_hxhc(X, test_size=0)

CPU times: user 651 ms, sys: 13.7 ms, total: 664 ms
Wall time: 275 ms

np.allclose(result_hxhc[0][2:], result_my_cpp[0][2:])

True

%%time
result_karoka = ks_from_karoka(X, n_sample)

CPU times: user 6.91 s, sys: 71.9 ms, total: 6.98 s
Wall time: 6.38 s

np.allclose(result_karoka[2:], result_my_cpp[0][2:])

True

%%time
result_XiaqiongFan = ks_from_XiaqiongFan(X, n_sample)

CPU times: user 9.26 s, sys: 114 ms, total: 9.38 s
Wall time: 9.2 s

np.allclose(result_XiaqiongFan[0][2:], result_my_cpp[0][2:])

True

6.4.3. KS 采样程序 (2000×100 数据)#

现在选用一个适中的数据大小。在当前大小下，我们的程序仍然可以快速进行计算：

n_sample  = 2000
n_feature = 100
X = np.random.randn(n_sample, n_feature)
X *= 100

%%time
result_my_python = ks_sampling(X)

CPU times: user 710 ms, sys: 15.8 ms, total: 726 ms
Wall time: 184 ms

%%time
result_my_cpp = ks_sampling(X, backend="C")

CPU times: user 96.4 ms, sys: 0 ns, total: 96.4 ms
Wall time: 24.2 ms

%%time
result_hxhc = ks_from_hxhc(X, test_size=0)

CPU times: user 19.4 s, sys: 1.1 s, total: 20.5 s
Wall time: 19.9 s

np.allclose(result_hxhc[0][2:], result_my_cpp[0][2:])

False

6.4.4. KS 采样程序 (30000×100 数据)#

在相当大的数据集下，我们的程序可以确实地在较短时间内进行采样。对于个人电脑而言，由于 30000 个数据点所需要的距离矩阵已经相当庞大 (对于单浮点的数据，它占用大约 3.4 GB)，因此可以认为是个人计算机可以处理的数据量的极限。

30000**2 * 32/8 / 1024**3

3.3527612686157227

在这个数据量下，我们的程序仍然能相当快地进行采样。对于高效且并行的 C 程序而言，其速度是非常快的；其最耗时的一步或许不是 Kennard-Stone 采样过程，而是距离矩阵的生成过程。而 Python 程序尽管较慢，但耗时仍然是可以接受的。

n_sample  = 30000
n_feature = 100
X = np.random.randn(n_sample, n_feature)
X = np.array(100 * X, dtype=np.float32)

%%timeit -r 7 -n 1
result_my_python = ks_sampling(X)

19.8 s ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit -r 7 -n 1
result_my_cpp = ks_sampling(X, backend="C")

5.21 s ± 5.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit -r 7 -n 1
dist = get_dist(X)

2.95 s ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Kennard-Stone 采样方法在 Python/C 下的高效实现 O(N^2)

文档目录

6. Kennard-Stone 采样方法在 Python/C 下的高效实现 \(O(N^2)\)#

6.1. Python 实现#

6.1.1. 随机数据集#

6.1.2. 距离矩阵的生成 (欧氏距离)#

6.1.3. 默认的种子样本：距离最远的两个样本#

6.1.4. Kennard-Stone 采样#

6.2. C 语言代码#

6.3. 完整的 Kennard-Stone 采样函数#

6.4. 代码效率比较#

6.4.1. 距离矩阵生成 (5000×100 数据)#

6.4.2. KS 采样程序 (500×10 数据)#

6.4.3. KS 采样程序 (2000×100 数据)#

6.4.4. KS 采样程序 (30000×100 数据)#