《编程珠玑》第 1 章──开篇

明确问题,一旦正确理解了问题,这场战役就成功了 90%。简单德程序通常比相同功能的复杂程序更可靠、更安全。

对话

1
2
3
4
5
6
7
8
9
A:怎么给一个磁盘文件排序
B:为什么不用系统的排序功能?
A:在一个大系统中排序,由于不明的技术原因,不能使用系统的排序
B:需要排序的内容是什么?有多少条记录?格式是什么?
A:1 千万条,每一条都是 7 位整数
B:为啥不在内存中排序呢
A:只有 1 MB 内存
B:还有其他信息么?
A:每个整数最多出现一次

将已知条件组织成一种更客观、更易用的形式。

  • 输入:一个最多包含 $n$ 个正整数的文件,每个数都小于 $n$ ,其中 $n = 10^7$ 。没有重复数字。
  • 输出:升序排序的列表。
  • 约束:1 MB 内存空间,磁盘空间充足,运行时间 10s 以内。

实现概要

采用位图数据结构。

  1. 将所有位置置为 0
  2. 依次读入整数,在对应的位置上置为 1
  3. 检验每一位,如果等于 1 ,就输出对应的整数

习题

1

If memory were not scarce, how would you implement a sort in a language with libraries for representing and sorting sets?

set 集合容器实现了红黑树的平衡二叉检索树的数据结构,它会自动调整二叉树的排列,把元素放到适当的位置。set 容器所包含的元素的值是唯一的,集合中的元素按一定的顺序排列。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
#include <iostream>
#include <set>

using namespace std;

int main() {
set<int> S;
int i;
set<int>::iterator j;
while (cin >> i) {
S.insert(i);
}
for (j = S.begin(); j != S.end(); ++j) {
cout << *j << endl;
}
return 0;
}

2

How would you implement bit vectors using bitwise logical operations (such as and, or and shift)?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
#define BITSPERWORD 32
#define SHIFT 5
#define MASK 0x1F
#define N 10000000
int a[1 + N / BITSPERWORD];

using namespace std;

void set(int i) {
a[1 >> SHIFT] |= (1 << (i & MASK));
}

void clr(int i) {
a[i >> SHIFT] &= ~(1 << (i & MASK));
}

int test(int i) {
return a[i >> SHIFT] & (1 << (i & MASK));
}

因为每个 int 有 32 位,所以数组中每个数可以来表示 32 位。a[1 >> SHIFT]i 除以 32 的除数,来确定对数组中第几个数来进行操作。(i & MASK) 表示 i 除以 32 的余数,(1 << (i & MASK)) 表示 1 向左移动「余数」位。

3

Run-time efficiency was an important part of the design goal,and the resulting program was efficient enough. Implement the bitmap sort on your system and measure its run time; how does it compare to the system sort and to the sorts in Problem 1? Assume that n is 10,000,000, and that the input file contains 1,000,000 integers.

1
2
3
4
5
6
7
8
9
10
11
for (int j = 0; j < M; ++j) {
clr(j);
}
for (int l = 0; l < M; ++l) {
set(x[l]);
}
for (int k = 0; k < M; ++k) {
if (test(x[k])) {
cout << x[k] << endl;
}
}

时间对比,bitmap 采用上述的做法,sort 是用 sort 方法对 vecotr 进行排序,set 使用 c++ 自带的 set 来进行排序。

bitmap sort set
运行时间 9711 us 9950 us 995993 us

4

If you take Problem 3 seriously, you will face the problem of generating $k$ integers less than $n$ without duplicates. The simplest approach uses the first c positive integers. This extreme data set won’t alter the run time of the bitmap method by much, but it might skew the run time of a system sort. How could you generate a file of $k$ unique random integers between 0 and $n-1$ in random order? Strive for a short program that is also efficient.

1
2
3
4
5
6
7
8
void set_x() {
for (int i = 0; i < N; ++i) {
x[i] = i;
}
for (int j = 0; j < M; ++j) {
swap(x[j], x[random() % (N - j) + j]);
}
}

5

The programmer said that he had about a megabyte of free storage, but the code we sketched uses 1.25 megabytes. He was able to scrounge the extra space without much trouble. If the megabyte had been a hard and fast boundary, what would you have recommended? What is the run time of your algorithm?

  1. 可以考虑分成两类,分别是以 0 和 1 开头的电话号码,这样的话可以将内存需求降低到 100 万 字节。
  2. 将所有数字分成两组,每组 500 万字节,排序完然后进行归并排序。

归并排序中,$k$ 趟排序可以在 $kn$ 时间复杂度、$\frac{n}{k}$ 空间复杂度内完成 $n$ 个小于 $n$ 的无重复正整数排序。

6

What would you recommend to the programmer if, instead of saying that each integer could appear at most once, he told you that each integer could appear at most ten times? How would your solution change as a function of the amount of available storage?

用 4 位、半字节的单位来统计某个整数出现的次数,然后进行第 5 题中的归并排序。

7

[R. Weil] The program as sketched has several flaws. The first is that it assumes that no integer appears twice in the input. What happens if one does show up more than once? How could the program be modified to call an error function in that case? What happens when an input integer is less than zero or greater than or equal to n? What if an input is not numeric? What should a program do under those circumstances? What other sanity checks could the program incorporate? Describe small data sets that test the program, including its proper handling of these and other ill-behaved cases.

超过一次,那么那个数字只会输出一遍,可以在 set 时候判断 test 函数是不是 true。小于 0 或者大于 N 的情况下,对数组 a 的数值设置存在问题,导致输出一些不存在的输入值,可以在 set 函数执行之前,判断输入的大小。如果不是 int 不能存入 a 数组,可以在检查输入的时候加个 continue 功能,除非遇到 EOF 才停止输入。

8

When the programmer faced the problem, all toll-free phone numbers in the United States had the 800 area code. Toll-free codes now include 800, 877 and 888, and the list is growing. How would you sort all of the toll-free numbers using only a megabyte? How can you store a set of toll-free numbers to allow very rapid lookup to determine whether a given toll-free number is available or already taken?

根据区号来把号码存入一个集合中,然后来判断。相当于根据区号分成 3 组。

9

One problem with trading more space to use less time is that initializing the space can itself take a great deal of time. Show how to circumvent this problem by designing a technique to initialize an entry of a vector to zero the first time it is accessed. Your scheme should use constant time for initialization and for each vector access, and use extra space proportional to the size of the vector. Because this method reduces initialization time by using even more space, it should be considered only when space is cheap, time is dear and the vector is sparse.

这是一个非常精巧的答案。对于一个长度非常大的数组,访问之前需要先初始化,然后对所有元素都统一初始化是不可取的。答案中引入两个数组 from 和 to 以及一个变量 top。判断 data[i] 初始化的条件是 from[i] < top && to[from[i]] == i 。top 记录了当前 data 数组中已经初始化的个数,保证了 from 数组中的数都要小于 top,但是 from 中没有初始化的数说不定也可能小于 top ,为了减小误判的概率,加入了 to 数组,to[from[i]] == i 这个条件确保了 from 数组不会被写入随机初始化数字。

对数组 data 中的数字 i 的首次访问

1
2
3
4
from[i] = top
to[top] = i
data[i] = 0
top++

10

Before the days of low-cost overnight deliveries, a store allowed customers to order items over the telephone, which they picked up a few days later. The store’s database used the customer’s telephone number as the primary key for retrieval (customers know their phone numbers and the keys are close to unique). How would you organize the store’s database to allow orders to be inserted and retrieved efficiently?

使用 $10 \times 10$ 的二维数组,将客户电话后两位作为索引。这就是经典的「用顺序搜索来解决冲突的开放散列」。

11

In the early 1980’s Lockheed engineers transmitted daily a dozen drawings from a Computer Aided Design (CAD) system in their Sunnyvale, California, plant to a test station in Santa Cruz. Although the facilities were just 25 miles apart, an automobile courier service took over an hour (due to traffic jams and mountain roads) and cost a hundred dollars per day. Propose alternative data transmission schemes and estimate their cost.

采用信鸽来传送底片。时间短,而且成本低。

12

Pioneers of human space flight soon realized the need for writing implements that work well in the extreme environment of space. A popular urban legend asserts that the United States National Aeronautics and Space Administration (NASA) solved the problem with a million dollars of research to develop a special pen. According to the legend, how did the Soviets solve the same problem?

铅笔。不愧是战斗民族!