7. 有限内存的 Kennard-Stone 采样方法实现#

创建时间：2021-01-31

我们继续上一篇文档，对更大数据量，导致无法储存 \((n_\mathrm{sample}, n_\mathrm{sample})\) 距离矩阵时，可以采用的 Kennard-Stone 采样策略。

该方法的计算复杂度是 \(O(n_\mathrm{sample}^2 n_\mathrm{feature} + n_\mathrm{sample} n_\mathrm{result} n_\mathrm{feature})\)，其中 \(n_\mathrm{result}\) 为采样数，\(n_\mathrm{feature}\) 为特征向量的长度。如果将 \(n_\mathrm{feature}\) 当做常数值，那么该方法大约是代价比较大的 \(O(n_\mathrm{sample}^2)\) 方法。

Kennard-Stone Algorithm 的原始文献是 Kennard, Stone [1]。

这篇文档的特色是：

可以在任意内存大小限制下，找到两个欧氏距离 (Euclidean Distance) 最远的样本点；
上述算法可以通过 multiprocessing 实现，多核机器的效率会比先前的方法高；
通过 on-the-fly 计算距离的方式，在 Kennard-Stone 采样过程中不需要使用完整的距离矩阵；
对于 Kennard-Stone 的 C 语言实现，一样可以通过 OpenMP 实现并行化。

需要注意，内存有限的意思并非任意小的内存都可以实现算法，而是在至少能储存原始数据集 \(n_\mathrm{sample} n_\mathrm{feature}\) 大小之外，还有适量空余内存，其内存大小也是 \(O(n_\mathrm{sample})\) 量级。具体的讨论会放在后文中。

import numpy as np
from multiprocessing import Pool

np.set_printoptions(precision=6, linewidth=120, suppress=True)
np.random.seed(0)

7.1. Python 实现#

7.1.1. 随机数据集#

n_sample \(n_\mathrm{sample}\) 为样本数。现在定为 20000。
n_feature \(n_\mathrm{feature}\) 为特征数 (描述样本性质的数据)。现在定为 100。
X \(\mathbf{X}\) 或 \(x_{ia}\) 为完整数据集。

n_sample  = 20000
n_feature = 100
X = np.random.randn(n_sample, n_feature)
X *= 100

7.1.2. 找到距离最远的两个样本#

我们仍然需要生成距离矩阵，来找到距离最远的两个样本。但这个过程不一定需要储存完整的距离矩阵 \(d_{ij}\)，而是分块计算，最后统合。

我们考虑多进程分块。如果进程数量是 \(n_\mathrm{proc}\)，在每个进程中计算的距离矩阵分块是 \((n_\mathrm{batch}, n_\mathrm{batch})\)，那么为了计算距离分块需要 \(n_\mathrm{proc} n_\mathrm{batch}^2\)。同时，每个分块的最大值需要进行存储，因此需要 \(n_\mathrm{sample}^2 / n_\mathrm{batch}^2\) 的内存 (这块内存尽管可以节省，但为了程序容易编写，就保留了这部分内存)。因此，总内存需要

\[ n_\mathrm{proc} n_\mathrm{batch}^2 + n_\mathrm{sample}^2 / n_\mathrm{batch}^2 \]

根据三角不等式，最少的内存需求在 \(n_\mathrm{batch} = n_\mathrm{proc}^{-1/4} \sqrt{n_\mathrm{sample}}\) 时成立，即 \(2 \sqrt{n_\mathrm{proc}} n_\mathrm{sample}\)。

但如果单从效率上考虑，每个分块越大，那么多进程的通讯次数越少，消耗时间也就越少。这里我们选用 \(n_\mathrm{batch}\) 为 1000。

n_batch = 1000
t = np.einsum("ia, ia -> i", X, X)

def get_dist_slice(X, t, sliceA, sliceB):
    distAB = t[sliceA, None] - 2 * X[sliceA] @ X[sliceB].T + t[None, sliceB]
    if sliceA == sliceB:
        np.fill_diagonal(distAB, 0)
    return np.sqrt(distAB)

def get_slices(n_sample, n_batch):
    p = list(np.arange(0, n_sample, n_batch)) + [n_sample]
    return [slice(p[i], p[i+1]) for i in range(len(p) - 1)]

def get_maxloc_slice(slice_pair):
    dist_slice = get_dist_slice(X, t, slice_pair[0], slice_pair[1])
    max_indexes = np.unravel_index(np.argmax(dist_slice), dist_slice.shape)
    return (dist_slice[max_indexes], max_indexes[0] + slice_pair[0].start, max_indexes[1] + slice_pair[1].start)

slices = get_slices(n_sample, n_batch)
n_slices = len(slices)
slice_pairs = [(slices[i], slices[j]) for i in range(n_slices) for j in range(n_slices) if i <= j]
n_slices

最终得到的两个样本就如下所述：

%%time
with Pool(4) as p:
    maxloc_slice_list = p.map(get_maxloc_slice, slice_pairs)
max_indexes = maxloc_slice_list[np.argmax([v[0] for v in maxloc_slice_list])][1:]
max_indexes

CPU times: user 21.6 ms, sys: 8.2 ms, total: 29.8 ms
Wall time: 1.38 s

(7794, 18772)

7.1.3. Kennard-Stone 采样：Python 程序#

事实上，此处的 Kennard-Stone 采样过程与上一篇文档完全一致，只是在所有需要索引距离矩阵处都进行了现场计算的工作而已。由于现场计算距离额外地引入了 \(n_\mathrm{feature}\) 的维度，因此此步的计算复杂度是 \(O(n_\mathrm{sample} n_\mathrm{result} n_\mathrm{feature})\)；但内存复杂度没有变化，仍然是 \(O(n_\mathrm{sample})\)。

尽管算法没有发生太大变化，但距离矩阵现算导致耗时会大量增加，并且要考虑到额外增加的变量与函数通讯过程。因此它可以看做是代价较大的复杂度 \(O(n_\mathrm{sample} n_\mathrm{result})\)。如果内存充足，实际上不是那么建议使用此方法。

def ks_sampling_core_mem(X, seed, n_result):
    # Definition: Output Variables
    result = np.zeros(n_result, dtype=int)
    v_dist = np.zeros(n_result, dtype=float)
    
    # Definition: Intermediate Variables
    n_seed = len(seed)
    n_sample = X.shape[0]
    min_vals = remains = None
    
    # --- Initialization ---
    def sliced_dist(idx):
        tmp_X = X[remains] - X[idx]
        return np.sqrt(np.einsum("ia, ia -> i", tmp_X, tmp_X))

    selected = [False] * n_sample
    remains = []
    for i in range(n_sample):
        if i not in seed:
            remains.append(i)
    result[:n_seed] = seed
    if n_seed == 2:
        v_dist[0] = np.linalg.norm(X[seed[0]] - X[seed[1]])
    min_vals = sliced_dist(seed[0])
    
    for n in seed:
        np.min(np.array([min_vals, sliced_dist(n)]), axis=0, out=min_vals)
    # --- Loop argmax minimum ---
    for n in range(n_seed, n_result):
        sup_index = min_vals.argmax()
        result[n] = remains[sup_index]
        v_dist[n - 1] = min_vals[sup_index]
        remains.pop(sup_index)
        min_vals[sup_index:-1] = min_vals[sup_index+1:]
        min_vals = min_vals[:-1]
        np.min(np.array([min_vals, sliced_dist(result[n])]), axis=0, out=min_vals)
    return result, v_dist

下面给出抽样 2000 个样本时的纯 Python 程序执行过程。可以看出耗时非常明显。

%%time
ks_sampling_core_mem(X, max_indexes, 2000)

CPU times: user 11.9 s, sys: 300 ms, total: 12.2 s
Wall time: 12.2 s

(array([ 7794, 18772, 11049, ..., 14861, 17154,   733]),
 array([2004.309858, 1794.599652, 1756.059579, ..., 1283.925941, 1283.855823,    0.      ]))

7.1.4. Kennard-Stone 采样：C 程序#

对于 C 语言的程序，其调用与上一篇文档类似，但作为 seed 的关键词不再是可选参数了。对于默认采样方式而言，用户需要自行提供最远处的两个样本序号。

对所有 20000 个样本采样，可以在 10 秒以内完成。其速度显然不如全部距离矩阵元素都能立即获得的算法，但仍然是可以接受的。

from KS_Sampling import ks_sampling_mem_core_cpp

%%time
ks_sampling_mem_core_cpp(X, max_indexes, 2000)

CPU times: user 3.5 s, sys: 0 ns, total: 3.5 s
Wall time: 878 ms

(array([ 7794, 18772, 11049, ..., 14861, 17154,   733]),
 array([2004.309937, 1794.599731, 1756.059448, ..., 1283.925903, 1283.855835,    0.      ]))

%%time
ks_sampling_mem_core_cpp(X, max_indexes, 20000)

CPU times: user 20.6 s, sys: 0 ns, total: 20.6 s
Wall time: 5.14 s

(array([ 7794, 18772, 11049, ...,  2941,  7521, 17265]),
 array([2004.309937, 1794.599731, 1756.059448, ...,  881.746399,  859.787659,    0.      ]))

7.2. 实例演示：QM9 数据集 CM 特征的 Kennard-Stone 采样#

作为一个可以现实会遇到的问题，我们对化学分子中使用的 QM9 数据集 (131k 个分子) 的 CM (Coulumb Matrix) 特征进行 QM9 采样。其数据来源是下述文章的补充信息：

Faber, F. A., et al; *Lilienfeld, O. A. v. Prediction Errors of Molecular Machine Learning Models Lower than Hybrid DFT Error, J. Comput. Theory Chem. 2017, 13 (11), 5255-5264. doi: 10.1021/acs.jctc.7b00577

这里使用 Python 分块找到最远两点、C 语言执行 Kennard-Stone 算法的方式，进行样本的选取。

使用的程序与上文基本是相同的，但这些功能都已经整合到 KS_Sampling.py 文件中的函数 ks_sampling_mem 了。

与上一篇文档相似地，我们仍然用 4 核 CPU 计算。

这里所报出的警告表明存在一些分子间距离过小，导致程序产生数值误差，对非常小的负值求开方。大多数情况下，这不会导致很严重的错误。

from KS_Sampling import ks_sampling_mem
import numpy as np

np.set_printoptions(precision=6, linewidth=120, suppress=True)

n_sample = 130829
n_feature = 900

X = np.empty((n_sample, n_feature), dtype=np.float32)
with open("CM", "r") as f:
    for i in range(n_sample):
        X[i] = np.array(f.readline().split()[1:], dtype=np.float32)

%%time
QM9_CM_KS_result = ks_sampling_mem(X)

/home/a/Documents/2020-01-30-KS_Memory/KS_Sampling.py:135: RuntimeWarning: invalid value encountered in sqrt
  return np.sqrt(distAB)
/home/a/Documents/2020-01-30-KS_Memory/KS_Sampling.py:135: RuntimeWarning: invalid value encountered in sqrt
  return np.sqrt(distAB)
/home/a/Documents/2020-01-30-KS_Memory/KS_Sampling.py:135: RuntimeWarning: invalid value encountered in sqrt
  return np.sqrt(distAB)
/home/a/Documents/2020-01-30-KS_Memory/KS_Sampling.py:135: RuntimeWarning: invalid value encountered in sqrt
  return np.sqrt(distAB)

CPU times: user 2h 49min 3s, sys: 8.04 s, total: 2h 49min 11s
Wall time: 43min 45s

QM9_CM_KS_result

(array([  2858,   3284,  99137, ...,  67127,  64051, 103213]),
 array([  0.010785, 207.073349, 200.928101, ...,   0.000006,   0.000005,   0.      ]))

有限内存的 Kennard-Stone 采样方法实现

文档目录

7. 有限内存的 Kennard-Stone 采样方法实现#

7.1. Python 实现#

7.1.1. 随机数据集#

7.1.2. 找到距离最远的两个样本#

7.1.3. Kennard-Stone 采样：Python 程序#

7.1.4. Kennard-Stone 采样：C 程序#

7.2. 实例演示：QM9 数据集 CM 特征的 Kennard-Stone 采样#