Adler32Checksum

As documented in the Adler-32 Wikipedia article.
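
Briefly: Adler-32 keeps two running sums. A starts at 1 and adds each input byte; B starts at 0 and adds each successive value of A. Both are taken modulo 65521 (the largest prime below 2^16), and the result is packed as B*65536 + A.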

function adler32(str,	a,b,len,i,j,MOD_ADLER,MAX_BLOCK) {
	if (!(sprintf("%c", int(rand()*256)) in ord))
		for (i=0; i<256; i++) ord[sprintf("%c",i)]=i
	  # This table initialization should really be done outside of the
	  # function definition here and shared across the global namespace.
	  # Unfortunately awk features no lexical scoping and no ability to
	  # query whether a variable has been defined, so we can't check for
	  # the presence of an existing 'ord' array. If we query for some
	  # random character and it's not in the array, we reinitialize the
	  # whole thing and dump it right in the global namespace. This is
	  # ugly, but hopefully less ugly than recomputing it on every call
	  # to this function. It would be trivial to add an 'ord' argument
	  # to the call list to allow calling functions to skip the init
	  # process if they are smart, but that would then force this code
	  # to reinit on every call. I think this is a decent compromise.
	MOD_ADLER = 65521
	MAX_BLOCK = 5552
	  # These rather dramatic names are a bit silly, and it does not
	  # really make sense to have MAX_BLOCK be 5552 when almost every
	  # modern awk has numbers with at least 53 bits of precision, but
	  # if we use these names and these values we make it easy for the
	  # curious ones to search for other implementations, and also easy
	  # for other searchers to find this code. And, really, this choice
	  # of overflow point is as good as any.
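	  # (For the record, 5552 is zlib's NMAX: the largest n for which
	  # 255*n*(n+1)/2 + (n+1)*(65521-1) stays within 2^32-1, i.e. the
	  # most bytes that can be summed before an unsigned 32-bit b
	  # could overflow.)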
	a = (a?a:1); b = (b?b:0); i = 0; j = MAX_BLOCK
	  # We allow the values of a and b to be overridden by the caller
	  # so that it becomes possible to compute a "rolling" checksum: a
	  # caller can manage its own chunk tracking and call in here one
	  # piece at a time. Furthermore, since addition commutes, adler32
	  # could be run in parallel across multiple blocks... if awk could
	  # allow such a thing.
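	  # (A sketch of such chunked use appears after the BEGIN test
	  # below.)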
	len = length(str)

	while (++i <= len) {
		b += (a += ord[substr(str,i,1)])
		  # This should bottleneck the performance of the entire
		  # algorithm: since all awk arrays have string keys, every
		  # index into an array incurs the constant overhead of the
		  # string hash function. It's a good thing that our keys
		  # here are strings; if they were numbers, we would also
		  # have to convert those numbers to strings before using
		  # them as keys. Yikes! Mawk mitigates this somewhat, but
		  # gawk and busybox awk take an unpleasant performance hit
		  # in that situation.
		if (i == j) {
			j += MAX_BLOCK
			a %= MOD_ADLER
			b %= MOD_ADLER
			  # There's a case to be made that replacing these
			  # modulus operations with a test-and-subtract
			  # method could run faster, but I simply couldn't
			  # bring myself to attempt (and then benchmark!)
			  # such a silly micro-optimization.
			  #
			  # There is also a hidden assumption here that the
			  # value of i always starts off as less than j and
			  # climbs towards it. This may not be the case if
			  # this code is modified to let a caller override i
			  # in order to checksum a substring; hacker beware!
		}
	}
	a %= MOD_ADLER
	b %= MOD_ADLER
	  # The final partial block (whatever is left over since the last
	  # MAX_BLOCK boundary) still needs reducing here, or a and b can
	  # exceed 16 bits and corrupt the packed result below.
	return a + (2^16 * b)
	  # gawk supports bitwise operations that would simplify this bit
	  # of prettying up, but we're trying to be portable. Also recall
	  # that Adler32 is supposed to return its results in network
	  # order; we punt and simply return an integer, and we leave the
	  # dirty conversion work to the calling function.
	  #
	  # ...it's not laziness, it's constructive efficiency.
}
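
# The network-byte-order conversion that the comment above leaves to
# the caller might look something like the sketch below. The helper
# name adler32_to_bytes is only an illustration, not part of this
# page's interface: it splits the integer returned by adler32() into
# four byte values, most significant first, and leaves writing the raw
# bytes to the caller, since sprintf("%c", n) for n > 127 is not
# portable across awks and locales.
function adler32_to_bytes(sum, bytes,	i) {
	for (i = 4; i >= 1; i--) {
		bytes[i] = sum % 256
		sum = int(sum / 256)
	}
}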

BEGIN {printf "'%X' should match '%s'\n", adler32("Wikipedia"), "11E60398"}
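
The rolling-checksum note inside the function can be exercised along the lines of the sketch below. This is only a demonstration of one possible calling convention, not a tested interface: it splits the returned integer back into its a and b halves (which works because both halves are reduced below 2^16 before the return) and feeds them into the next call.

BEGIN {
	# Checksum "Wikipedia" in two chunks and compare against the
	# single-call result above. (Edge case: a chunk that happens to
	# leave a at exactly 0 would be mistaken for a fresh start by
	# the a?a:1 default.)
	part = adler32("Wiki")
	a = part % 65536
	b = int(part / 65536)
	printf "'%X' should also match '%s'\n", adler32("pedia", a, b), "11E60398"
}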

Some things are left to do: little systematic testing has been performed to ensure correctness (how does this behave across implementations on empty strings? Strings with international characters? Strings with binary data? Strings with embedded NUL characters?), and no benchmarking has been done to compare latency and throughput across awk implementations or across different implementation strategies.
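
A throwaway harness along these lines could be a starting point for the correctness side. The expected values are the standard Adler-32 results for these strings; the harness itself is untested and says nothing about binary safety or locale behaviour:

BEGIN {
	expect[""]          = "1"
	expect["a"]         = "620062"
	expect["abc"]       = "24D0127"
	expect["Wikipedia"] = "11E60398"
	for (s in expect) {
		got = sprintf("%X", adler32(s))
		status = (got == expect[s]) ? "ok" : "FAIL, got " got
		printf "\"%s\": want %s... %s\n", s, expect[s], status
	}
}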

This is a starting point, not a finished library.