Awk is a wonderful language! That said, there are a few annoying bits...
The Good
- well-documented semantics
- a wide variety of implementations, with remarkably good consistency between them
- terse domain-specific syntax
- rapid execution, fast startup
- awk is a simple language, and can easily be learned in its entirety. It is usually not necessary to consult the manual for anything but rarely-used odds and ends (like the specific escape sequences for strftime and [s]printf)
The Bad
- It's quite difficult to map back from the auto-split fields to positions in a given record, since awk only exposes the results of its field parsing and hides the offsets (but look at the optional fourth argument of split in gawk-devel).
- Some strings are assignable and some are not. This is probably because non-assignable strings are stored as simple {ref-string, start, length} triples that point into other strings. That representation is a net efficiency win, but it should be an implementation detail rather than a language restriction. (goedel: Please, could somebody give an example of non-assignable strings?)
- There is no portable built-in way of querying the type of a variable: strings and numbers get automatically converted from type to type. This can create mysterious problems because the conversion operation is, in fact, rather slow; and since there's no way of reliably determining the type of a variable, tracking down those performance sinks requires considerable insight.
- Array keys may only be strings. This means that even when numbers are used to index into an array, the numbers are first converted to strings and then hashed. This is a language problem, not an implementation problem.
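For the first point in this list, one portable workaround is to rediscover each field's position by searching forward through $0. The following is only a sketch: it assumes the default FS (so a field never contains blanks) and that $0 has not been rebuilt by a field assignment. The wrapper name field_offsets is hypothetical.

```shell
# field_offsets: hypothetical helper; prints "field-number:start-column:text"
# for every field of every input line read from stdin.
field_offsets() {
    awk '{
        pos = 1                      # column where the next search starts
        for (i = 1; i <= NF; i++) {
            # With the default FS a field contains no blanks, so its first
            # occurrence at or after pos is the field itself.
            start = pos + index(substr($0, pos), $i) - 1
            printf "%d:%d:%s\n", i, start, $i
            pos = start + length($i)
        }
    }'
}

printf '  foo bar  foo\n' | field_offsets
```

For the sample line this prints 1:3:foo, 2:7:bar and 3:12:foo. With a regular-expression FS or single-character separators that allow empty fields, the search would need to account for the separator text as well.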
The Ugly
- The ordering of arguments to the match(string, pattern) built-in function is jarring. The gawk manual mentions that it may be easier to remember if one thinks of it as the string ~ pattern matching operator, but it is the reverse of sub() and gsub(), which take the pattern first.
- There is no way to return an array from a function. It is possible to pass an array into a function and modify it, but this is a poor substitute.
- There is no way of declaring an array without assigning a value to a key. This isn't a semantic problem, but it makes it difficult to write out longer functions that declare and explain their arrays at the beginning. One must use comments for the purpose instead (but look at split("", a, " ")).
- Strings can be passed through multiple levels of interpretation. A string literal can contain escape sequences that must be interpreted ("A string with an embedded \t tab character \n And a newline"); if a string is used in a regular expression context, a further level of escaping must be performed ("foo\\bar" ~ "f..\\\\b.."); and when a string is used as the second argument to sub() or gsub(), things get real ugly real fast.
- Awk variables are either global or parameters to a user-defined function. Awk does not throw errors when a user function is invoked with fewer arguments than it declares, so there is a convention to list local variables in the parameter list: for example, the function definition function sift(n, i, j, nums, primes) ... would accept a single parameter and use four local variables. This limitation makes writing functions with optional arguments and local variables an unnecessarily delicate process.
- There is no way of testing whether a function has been defined or not (or rather there is, but the result is either silent success or awk exiting with an error). This makes writing libraries of code difficult without using a full awk parser and a preprocessor.
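Two of the workarounds mentioned in this list, extra parameters used as locals and an array "returned" by filling one the caller passes in, can be sketched together. The function sieve() below is a hypothetical variant written for illustration, not the sift() referred to above, and the wrapper name primes_up_to is likewise an assumption.

```shell
# primes_up_to: hypothetical wrapper; prints the primes <= $1 on one line.
primes_up_to() {
    awk -v limit="$1" '
    # sieve(n, primes) fills the caller-supplied array primes[1..count]
    # and returns count. The names after the wide gap in the parameter
    # list (i, j, count, composite) are locals by convention only.
    function sieve(n, primes,    i, j, count, composite) {
        split("", primes)            # portable way to empty an array
        count = 0
        for (i = 2; i <= n; i++) {
            if (i in composite)
                continue
            primes[++count] = i
            for (j = i * i; j <= n; j += i)
                composite[j] = 1
        }
        return count
    }
    BEGIN {
        n = sieve(limit + 0, p)      # p is filled in by sieve()
        for (k = 1; k <= n; k++)
            printf "%s%d", (k > 1 ? " " : ""), p[k]
        print ""
    }'
}

primes_up_to 30
```

The call prints 2 3 5 7 11 13 17 19 23 29. Nothing stops a caller from passing a third argument and clobbering i, which is exactly the delicacy described above.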
The Missing
- It is theoretically possible that a sufficiently smart compiler could perform enough analysis to find those cases where a simple array could be used instead of a hash table: Lua versions prior to 5.x used awk-style hash-backed tables, but the 5.x series has implemented an optimization where the table is actually a hybrid data structure, containing both a traditional hash table and an array. The array stores contiguous runs of integer keys and allows much higher performance to be achieved, at the cost of slightly higher implementation complexity (see The Evolution of Lua, section 6.2 ("Tables"), pp. 12-13; The Implementation of Lua 5.0, section 4 ("Tables"), pp. 6-8; ltable.c, the C implementation of tables in Lua 5.1).
- Most awk implementations perform very few optimizations, even very simple ones.
- There is no eval function for executing a string as a sequence of awk code. This is usually not a good idea, but is occasionally very useful. This facility could help with assembling library dependencies, writing fast virtual machines (awk is surprisingly good for writing prototype VMs and surprisingly bad at making them run fast), writing a metacircular awk system...
- Higher-order functions are entirely absent (but look at indirect function calls in gawk-devel).
- The language specification does not state a minimum required numeric precision.
- Awk arrays can only contain strings and numbers, and can only be indexed by strings. Arrays cannot contain other arrays, regular expressions, functions... (but you can use a simulation of multidimensional arrays in POSIX awk and indirect function calls in gawk-devel)
- Although it would be relatively easy to perform tail-call elimination, the language does not require this optimization. Therefore, even if particular implementations provided the facility, relying on it would not be safe. Tail-call elimination makes entire classes of algorithms easier to represent.
- Functions cannot be declared inside other functions. If they could, libraries of awk code could avoid function clashes by isolating their internals. A module system would also make this possible, but that is an even more difficult problem.
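The simulated multidimensional arrays mentioned above rely on the POSIX SUBSEP variable: a subscript list such as cell[r, c] is joined into a single string key. A minimal sketch follows, with illustrative names (subsep_demo, cell, rowsum) that are assumptions rather than anything from a real library.

```shell
# subsep_demo: hypothetical wrapper around a small SUBSEP example.
subsep_demo() {
    awk 'BEGIN {
        # cell[r, c] is shorthand for cell[r SUBSEP c]: one flat string key.
        cell[1, 1] = 10
        cell[1, 2] = 20
        cell[2, 1] = 5

        # Recover the two indices by splitting each key on SUBSEP.
        for (key in cell) {
            split(key, idx, SUBSEP)
            rowsum[idx[1]] += cell[key]
        }
        print "row 1:", rowsum[1]
        print "row 2:", rowsum[2]

        # The same element is reachable through a hand-built key.
        flat = 1 SUBSEP 2
        print ((flat in cell) ? "flat key present" : "flat key absent")
    }'
}

subsep_demo
```

This prints "row 1: 30", "row 2: 5" and "flat key present". Because the structure is still one flat table, there is no way to pass a single row to a function or to ask how many rows exist without scanning every key.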