This is the standard sieve of Eratosthenes implemented in portable awk. The running time of the bare algorithm should be O(n log n); no special techniques are used (finding primes up to sqrt(bound) and then sifting the remaining segments for O(n) runtime; using Pritchard's wheel techniques to reach sublinear runtime).

The main difference between `sift1`

and `sift2`

in the below code is that the former uses awk's fields to store values and the latter uses a standard array. The point of implementing `sift1`

in this way was to test whether there would be any speed advantage in bypassing the hashing performed on all keys that index into arrays (integers are converted to strings and then hashed; all array keys are strings, all the time, no exceptions, as per the spec).

Arrays, numbered fields and the `substr()`

built-in are the only facilities in awk that offer the potential for O(1) references; of the three, only arrays and numbered fields offer O(1) assignment as well; and of those two, numbered fields *ought* to have a lower constant factor than arrays, since no hashing is required.

But do numbered fields deliver the expected performance benefits?

#!/usr/bin/awk -f function sift1(N, primes, line, sp, n, i, j) { line = $0; $0 = "" NF = N n = sqrt(N) primes[sp = 0] = 2 for (i = 3; i < N; i += 2) if (!($i) && (primes[++sp] = i) <= n) for (j = i * i; j < N; j += i) $j = "1" $0 = line } function sift2(N, primes, n, sp, nums, i, j) { n = sqrt(N) primes[sp = 0] = 2 for (i = 3; i < N; i += 2) if (!(nums[i]==1) && (primes[++sp] = i) <= n) for (j = i * i; j < N; j += i) nums[j] = 1 } BEGIN { max = (ARGV[1] + 0); delete ARGV[1] alg = ARGV[2] ; delete ARGV[2] if (alg == 1) { sift1(max, primes) } else if (alg == 2) { sift2(max, primes) } #for (i = 0; i in primes; i++) printf "%2s: %3d\n", i, primes[i] }

for i in awk mawk 'busybox awk'; do for j in 10000 20000 30000; do for k in 1 2; do echo $i $j $k;time $i -f primes.awk $j $k;done;done;done for i in awk mawk 'busybox awk';do echo $i;time $i -f primes.awk 1000000 1;done

**Note:** entries marked as *1* completed too fast to yield numbers large enough to be distinguishable from timing error.

algorithm | 10,000 | 20,000 | 30,000 | 32,767 | 1,000,000 |
---|---|---|---|---|---|

sift1 | 1 | 1 | 1 | 1 | 1.96s, 1.96s, 1.97s |

sift2 | 1 | 1 | 1 | 1 | 5.29s, 5.37s, 5.38s |

mawk has a default compiled-in maximum of 32,767 fields per record, and some very interesting performance characteristics when field accesses increase.

algorithm | 10,000 | 20,000 | 30,000 | 32,767 | 1,000,000 |
---|---|---|---|---|---|

sift1 | 5.18s | 23.52s | 58.68s | 72.64s | err |

sift2 | 1 | 1 | 1 | 1 | 3.74s, 3.65s, 3.59s |

The busybox awk applet displays the same performance scaling as mawk, only with a slightly different constant factor. busybox awk does not appear to have the same limitation on the number of fields as mawk, but its performance ran out the counter on my patience.

algorithm | 10,000 | 20,000 | 30,000 | 32,767 | 1,000,000 |
---|---|---|---|---|---|

sift1 | 16.88s | 73.86s | 172.44s | 207.67s | slow |

sift2 | 1 | 1 | 1 | 1 | 9.54s, 9.43s, 9.52s |

Based on very, very preliminary testing): indexing directly into the fields on a given line is far faster than indexing into an array. This is unsurprising, since all array keys must be treated as strings, and therefore must all pass through the hash function (numeric keys must suffer the extra indignity of being cast to strings, which slows things down even more).

Further testing is required.

N.B. that actually **using** tricks like this to gain extra performance is dumb. This is purely an exercise in testing indexed data lookup facilities across multiple implementations.

These numbers are only useful for relative comparison. Tests were run under Linux kernel 2.6.17 on a 1.6GHz Pentium 4 with 512 megs of RAM and a load average of 0.00.