《编程珠玑》第 2 章──啊哈!算法

看起来很困难的问题也可以有一个简单的、意想不到的答案。

三个问题

问题 A

给定一个最多包含 40 亿个随机排列的 32 位整数的顺序文件,找出一个不在文件的 32 位整数(在文件中至少存在一个这样的数——为什么?)。在具有足够内存的情况下,如何解决该问题?如果有几个外部的「临时」文件可用,但是仅有几百字节的内存,又该如何解决该问题?

把起始位置等于 0 和 1 的分成两组,写入两个顺序文件。第一组的范围是 [0,2312^{31}),第二组的范围是 [2312^{31},2322^{32})。如果需要找的数字小于2312^{31},那么下一步在第一组文件中找,否则就在第二组文件中寻找。假设只有 300 字节的内存,那么只能装 75 个 32 位整数(unsigned int),需要划分的次数是 32-6=24 次,这样的话每个小文件只有262^6 个整数。为了更加充分的利用内存,可以将文件分成 32/64/128 份。

问题 B

将一个 n 元一维向量向左旋转 i 个位置。例如,当 n=8 且 i=3 时,向量 abcdefgh 旋转成 defghabc。简单的代码使用一个 n 元中间向量在 n 步内完成该工作。你能否仅使用数十个额外的字节的存储空间,在正比于 n 的时间内完成向量的旋转。

把问题看成从数组 ab 变换成 ba,同时假定我们有一个将数组中特定部分的元素求逆。首先对 a 求逆得到arba^rb,然后对 b 求逆得到arbra^rb^r。最后整体求逆,得到(arbr)r\left(a^rb^r\right)^r ,此时就是 ba。

问题 C

给定一个英语字典,找出其中的所有变位词集合,例如,「pots」、「stop」和「tops」互为变位词,因为每个单词都可以通过改变其他单词中字幕的顺序来得到。

使用基于排序的标示。将单词中的字母按照字母表顺序排序,比如「deposit」的标示就是「dopiest」,这也是「dopiest」和其他任何在该类中的单词的标示。

习题

1. Consider the problem of finding all the anagrams of a given input word. How would you solve the problem given the word and the dictionary? What if you could spend some time and space to process the dictionary before answering any query?

先计算出需要查找的单词的标示,然后依次读取字典,计算每个单词的标示并进行比较。如果可以预处理,那么在预算好的结构中进行二分搜索。

2. Given a tape containg 4,300,000,000 32-bit integers, how can you find one that appears at least twice?

分成 k 段,每一段包含的数据为nk\frac{n}{k},第 i 段范围是 [i×nk\frac{i\times n}{k},(i+1)×nk\frac{\left(i+1\right)\times n}{k})。如果某段的长度超过nk\frac{n}{k} 个,则重复数据在该段内。

3. We studied two vector rotation algorithms that require subtle code; implement each as a program. How does the greatest common divisor of I and N appear in the analysis of each program?

假设数组长度为 n,每次跨越的长度是 k,开始位置为 i,第 j 次跳跃停止的条件是 (i + j * k) % n != i,也就是等价于 (j * k) % n = 0,也就是说 j * k 既是 k 的倍数,也是 n 的倍数,那么 j * k 就是 n 和 k 的最小公倍数,j * k = lcm(n, k)。如果 j 小于 n,说明跳跃的总落点数比 n 小,也就是说还没有完全遍历整个数组,还需要把 i 往后移动。一共需要移动的次数 t = n / j。把 j = lcm(n, k) / k 带入可得 t = n * k / lcm(n, k)。因为存在公式 a * b = gcd(a, b) * lcm(a, b),所以 t = gcd(n, k)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
int gcd(int m, int n) {
int t = 1;
while (t > 0) {
t = m % n;
m = n;
n = t;
}
return m;
}

void move(std::vector<int> &x, int k) {
auto n = static_cast<int>(x.size());
int t = gcd(n, k);
for (int i = 0; i < t; ++i) {
int tmp = x[i];
int j = 0;
while ((i + (j + 1) * k) % n != i) {
x[(i + j * k) % n] = x[(i + (j + 1) * k) % n];
++j;
}
x[(i + j * k) % n] = tmp;
}
}

另外一种算法不断递归,将前面的 i 个置换到后面的 i 个。算法实现如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
void swap(std::vector<int> &x, int i, int j, int k) {
for (int l = 0; l < k; ++l) {
std::swap(x[i + l], x[j + l]);
}
}

void move(std::vector<int> &x, int k) {
auto n = static_cast<int>(x.size());
if (k == 0 || k == n) {
return;
}

int i(k);
int j(n - k);
while (i != j) {
if (i < j) {
swap(x, k - i, k - i + j, i);
j -= i;
} else {
swap(x, k - i, k, j);
i -= j;
}
}
swap(x, k - i, k, i);
}

可见上面两个算法实现中的「循环不变体」很容易写错。

5. Vector rotation routines change the vector AB to BA; how would you transfrom the vector ABC to CBA? (This models the problem of swapping unequal-length blocks of memory.)

先把 bc 看成整体,那么第一次旋转就可以变成 bca。然后再对 bc 进行旋转就变成了 cba 了。

6. Bell Labs has a “user-operated directory assistancd” program that allows employees to look up a number in a company telephone directory using a standard push-button telephone. To find the number of the designer of the system, Mike Lesk, one dials the number of the service, types “LESK*M*” (that is, “5375*6*”) and the system then speaks his number. One problem that arises in this system is that different names may have the same push-button encoding; when this happens in Lesk’s system, it asks the user for more information. Given a large file of names, such as a standard metropolitan telephone directory, how would you locate these “false matches”? (When Lesk did this experiment on such a directory, he found that their incidence was just 0.2 percent.) How would you implement the routine that is given a push-button encoding of a name and returns eigher a name or an appropriate message?

在这里,名字的标示就是按键,可以先得到每个名字的标示,然后按照标示排序。查找的时候可以用二分查找。

7. In the early 1960’s Vic Vyssotsky worked with a programmer who had to transpose a 4000-by-4000 matrix stored on magnetic tape(each record had the same several-dozen-byte format). The original program his colleague suggestd would have taken fifty hours to run; how would Vyssotsky reduce the run time to half an hour?

对每个存储的数据,使用行号和列号来存储。假如存储时,是先按照行号再按照列号,那么排序的时候先按照列号排序,然后在对行号排序。最后删除列号和行号。这样做的话很适合用在存入不频繁,读取很频繁的情况,能压缩使用空间。

8. [J. Ullman] Given a set of N real numbers, a real number T, and an integer K, how quickly can you determine whether there exists a K-element subset of the set that sums to at most T?

只要 K 个最小元素的子集之和不超过 T 时,可以找到这样的子集满足条件。可以先快排,然后取前 K 个,时间复杂度 O(nlog n)。也可以用选择排序,时间复杂度 O(n)

9. Sequential search and binary search represent a tradeoff between search time and preprocessing time. How many binary searches need be performed in an N-element table to buy back the preprocessing time required to sort the table?

顺序搜索时间复杂度为 O(log n)。快速排序时间复杂度为 O(nlog n),二分搜索时间复杂度为 O(nlog n)

O(nlogn)+k×O(logn)k×O(n)O\left(n\log n\right) + k \times O\left(\log n\right) \leq k \times O\left(n \right)

得到如下结果:

kO(logn)k \geq O\left(\log n\right)

10. On the first day a researcher worked with Thomas Edison, Edison asked him to compute the volume of an empty light bulb shell. After several hours with calipers and calculus, he returned with the answer of 150 cubic centimeters. In a few seconds, Edison computed and responded “closer to 155” –– how did he do it? Given other examples of aha! insights in analog computation.

装满水,然后用量杯计算水的体积即可。