PrimeNumberSieve

This is the standard sieve of Eratosthenes implemented in portable awk. The running time of the bare algorithm should be O(n log log n), the cost of summing n/p over the primes p up to sqrt(n); no special techniques are used, such as finding the primes up to sqrt(bound) and then sifting the rest of the range in segments for O(n) runtime, or using Pritchard's wheel techniques to reach sublinear runtime.

The main difference between sift1 and sift2 in the code below is that the former uses awk's numbered fields to store values while the latter uses an ordinary array. The point of implementing sift1 this way was to test whether there is any speed advantage in bypassing the hashing performed on every key that indexes into an array (integer keys are converted to strings and then hashed; per the spec, all array subscripts are strings, all the time, no exceptions).

Arrays, numbered fields and the substr() built-in are the only facilities in awk that offer the potential for O(1) references; of the three, only arrays and numbered fields offer O(1) assignment as well; and of those two, numbered fields ought to have a lower constant factor than arrays, since no hashing is required.
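
For concreteness, here is a minimal sketch of the three access forms (my own illustration, not part of the benchmark):

BEGIN {
	s = "abcde"
	$0 = "a b c d e"	# populate fields $1..$5
	a[3] = "c"
	print a[3]		# array reference: the key 3 is stringified and hashed
	print $3		# field reference: direct indexing, no hashing
	print substr(s, 3, 1)	# substr() reference: read-only
	a[3] = "C"; $3 = "C"	# arrays and fields both support assignment;
				# there is no way to assign "into" substr()
}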

But do numbered fields deliver the expected performance benefits?

The Code

#!/usr/bin/awk -f

function sift1(N, primes,	line, sp, n, i, j) {
	# Save the current record, then use fields $1..$N as composite flags.
	line = $0; $0 = ""
	NF = N
	n = sqrt(N)
	primes[sp = 0] = 2
	for (i = 3; i < N; i += 2)
		# An empty $i means i was never marked, i.e. i is prime.
		# The assignment inside && always records the prime; the
		# inner loop only sifts from primes <= sqrt(N).
		if (!($i) && (primes[++sp] = i) <= n)
			for (j = i * i; j < N; j += i)
				$j = "1"
	$0 = line	# restore the input record
}

function sift2(N, primes,	n, sp, nums, i, j) {
	# Same algorithm, but the composite flags live in an ordinary array.
	n = sqrt(N)
	primes[sp = 0] = 2
	for (i = 3; i < N; i += 2)
		if (!nums[i] && (primes[++sp] = i) <= n)
			for (j = i * i; j < N; j += i)
				nums[j] = 1
}

BEGIN {
	max = (ARGV[1] + 0); delete ARGV[1]	# sieve bound
	alg =  ARGV[2]     ; delete ARGV[2]	# 1 = fields, 2 = array
	if (alg == 1) {
		sift1(max, primes)
	} else if (alg == 2) {
		sift2(max, primes)
	}
	#for (i = 0; i in primes; i++) printf "%2s: %3d\n", i, primes[i]
}
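
As a quick sanity check, uncommenting the final printf loop and running against a small bound should produce something like the following (reconstructed by hand, not taken from the benchmark runs):

$ awk -f primes.awk 30 2
 0:   2
 1:   3
 2:   5
 3:   7
 4:  11
 5:  13
 6:  17
 7:  19
 8:  23
 9:  29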

Methodology

for i in awk mawk 'busybox awk'; do
  for j in 10000 20000 30000; do
    for k in 1 2; do
      echo $i $j $k; time $i -f primes.awk $j $k
    done
  done
done

for i in awk mawk 'busybox awk'; do
  echo $i; time $i -f primes.awk 1000000 1
done

Results

Note: entries marked 1 completed too fast to yield timings distinguishable from measurement error.

gawk 3.1.5

algorithm    10,000    20,000    30,000    32,767    1,000,000
sift1        1         1         1         1         1.96s, 1.96s, 1.97s
sift2        1         1         1         1         5.29s, 5.37s, 5.38s

mawk 1.3.3

mawk has a default compiled-in maximum of 32,767 fields per record, and its performance degrades dramatically as the number of field accesses grows: the sift1 timings below scale roughly quadratically with the bound, which suggests mawk's field access is not O(1) at all.

algorithm    10,000    20,000    30,000    32,767    1,000,000
sift1        5.18s     23.52s    58.68s    72.64s    err
sift2        1         1         1         1         3.74s, 3.65s, 3.59s
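
The field limit itself is easy to trip; a probe along these lines (my own, not part of the methodology above) should abort with a field-limit error on a stock build instead of printing:

echo | mawk '{ $32768 = 1; print "no limit hit" }'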

busybox awk 1.1.3

The busybox awk applet displays the same performance scaling as mawk, with a constant factor roughly three times larger. busybox awk does not appear to share mawk's limit on the number of fields, but the 1,000,000 sift1 run outlasted my patience.

algorithm    10,000    20,000    30,000    32,767    1,000,000
sift1        16.88s    73.86s    172.44s   207.67s   slow
sift2        1         1         1         1         9.54s, 9.43s, 9.52s

Conclusion

Based on very, very preliminary testing: in gawk, indexing directly into the fields of a record is far faster than indexing into an array. This is unsurprising, since all array subscripts are strings and must therefore pass through the hash function (numeric keys suffer the extra indignity of being converted to strings first, which slows things down even more). In mawk and busybox awk, however, the result is reversed: field access there clearly does not behave as O(1), and the array-based sift2 wins handily.
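
The subscripts-are-strings rule is easy to demonstrate (again, my own illustration):

BEGIN {
	a[1] = "x"
	print ("1" in a)	# prints 1: the numeric key 1 was stored as the string "1"
	b[0.1 + 0.2] = "y"	# non-integer keys go through CONVFMT (default "%.6g")
	print ("0.3" in b)	# prints 1: 0.30000000000000004 was stored as "0.3"
}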

Further testing is required.

Notes

N.B. actually using tricks like this to gain extra performance is dumb. This is purely an exercise in testing indexed data lookup facilities across multiple implementations.

These numbers are only useful for relative comparison. Tests were run under Linux kernel 2.6.17 on a 1.6GHz Pentium 4 with 512 megs of RAM and a load average of 0.00.