AwkChannelWiki: XMLScraping

You cannot easily parse XML with awk.

But there are several tricks to scrape an XML file:

You can use a field separator that matches both the opening and the closing tag: </?tag>. The line will then look like

 field1 FS field2 FS field3

where the first FS is the opening tag and the second one is the closing tag. Extracting field2 is then easy:

 awk -F'</?tag>' 'NF>1{print $2}'
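As a small illustration of the field-separator trick, here is a sketch run on a made-up line; the tag name "name" and the sample data are assumptions chosen for the example, not part of any real file.

```shell
# Hypothetical input: one <name> element on a single line.
# With FS matching <name> or </name>, the element content is field 2.
out=$(printf '%s\n' '<item><name>widget</name></item>' |
  awk -F'</?name>' 'NF>1{print $2}')
printf '%s\n' "$out"
```

The NF>1 guard skips lines that do not contain the tag at all, since such lines consist of a single field.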

This can be generalized if you have more than one pair of <tag> on the same line. The contents of the tags are then the even-numbered fields, so the loop steps by two:

 awk -F'</?tag>' '{for(i=2;i<=NF;i+=2) print $i}'
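To see which fields hold the tag contents when a line has several pairs, here is a sketch on a made-up line. The sample text is an assumption; the loop visits only the even-numbered fields, because the odd-numbered ones hold whatever sits outside the tags.

```shell
# Hypothetical input with two <tag> pairs on one line.
# Fields are: a | one | b | two | c, so the contents are fields 2 and 4.
out=$(printf '%s\n' 'a<tag>one</tag>b<tag>two</tag>c' |
  awk -F'</?tag>' '{for(i=2;i<=NF;i+=2) print $i}')
printf '%s\n' "$out"
```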

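If the content of the tag spans several lines, a range pattern prints everything from the line containing <tag> through the line containing </tag>, tag lines included:

 awk '/<tag>/,/<\/tag>/'

To print only the lines in between, without the tag lines themselves, use a flag:

 awk '/<\/tag>/{f=0} f{print} /<tag>/{f=1}'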
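The difference between the range pattern and the flag version is easiest to see side by side. The following sketch runs both on the same made-up input, where the tags sit on lines of their own; the variable names and sample lines are assumptions for illustration.

```shell
# Hypothetical multi-line input, tags on their own lines.
sample=$(printf '%s\n' 'before' '<tag>' 'line 1' 'line 2' '</tag>' 'after')

# Range pattern: prints the tag lines as well as the content.
with_tags=$(printf '%s\n' "$sample" | awk '/<tag>/,/<\/tag>/')

# Flag version: prints only the lines strictly between the tags.
inner=$(printf '%s\n' "$sample" | awk '/<\/tag>/{f=0} f{print} /<tag>/{f=1}')

printf '%s\n' "$with_tags" '--' "$inner"
```

In the flag version the order of the rules matters: the flag is cleared before the print rule and set after it, so neither tag line is printed.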

The two commands above work on whole lines, so they are only suitable when the tags are on lines of their own. If the content shares a line with the opening or closing tag, you can do something like:

 awk '/<tag>/{sub(/.*<tag>/,"");f=1} /<\/tag>/{f=0;sub(/<\/tag>.*/,"");print} f{print}'
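Here is a sketch of this approach on made-up input where text precedes the opening tag and follows the closing tag. The sample lines are assumptions; the two sub() calls trim everything up to and including <tag>, and everything from </tag> onward.

```shell
# Hypothetical input: content starts and ends mid-line.
out=$(printf '%s\n' 'x <tag>first' 'middle' 'last</tag> y' |
  awk '/<tag>/{sub(/.*<tag>/,"");f=1} /<\/tag>/{f=0;sub(/<\/tag>.*/,"");print} f{print}')
printf '%s\n' "$out"
```

The same command also handles an element that opens and closes on one line, because the closing rule clears the flag before the final print rule runs.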

To extract the value of an attribute, one possible solution is to use " as the record separator. The record you want is then the one following the record that ends with the attribute name:

 awk -v RS='"' '/foo=$/{getline;print}'
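To make the record splitting concrete, here is a sketch on a made-up element; the attribute name foo and the sample line are assumptions. With " as RS, the records are `<a id=`, `1`, ` foo=`, `bar`, and the rest of the line, so the value is the record right after the one ending in foo=.

```shell
# Hypothetical input: an element with two attributes.
out=$(printf '%s\n' '<a id="1" foo="bar">text</a>' |
  awk -v RS='"' '/foo=$/{getline;print}')
printf '%s\n' "$out"
```

Note that /foo=$/ also matches any attribute whose name merely ends in foo, so a stricter pattern may be needed on real data.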

Another possibility is to use the attribute name as the FS. You are then in the same kind of situation as in the trick above for extracting the content of a tag:

 something FS value" something else FS value" something else

except that you need to get rid of everything from the closing quote onward in each field:

 awk -F'foo="' '{for (i=2;i<=NF;i++){ sub(/".*/,"",$i);print $i}}'
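Here is a sketch of the attribute-as-FS approach on a made-up line with two foo attributes; the element names and values are assumptions. Every field from the second one onward starts with an attribute value, so the loop trims each of them at its closing quote.

```shell
# Hypothetical input: two elements carrying a foo attribute.
# Fields are: '<a ' | 'one" x="1"/><b ' | 'two"/>'
out=$(printf '%s\n' '<a foo="one" x="1"/><b foo="two"/>' |
  awk -F'foo="' '{for (i=2;i<=NF;i++){ sub(/".*/,"",$i);print $i}}')
printf '%s\n' "$out"
```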

Same trick as above, but here we use > as a record separator so that we have one tag per record.

 awk -v RS=\> -F '<tag.*foo="' 'NF>1{sub(/".*/,"",$2);print $2}'
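As an illustration of the one-tag-per-record variant, here is a sketch on a small made-up document; the element names, attribute values, and layout are assumptions. Only the record holding a <tag ...> element with a foo attribute splits into more than one field, so the other foo attribute is ignored.

```shell
# Hypothetical input: foo appears on <tag> and on an unrelated element.
out=$(printf '%s\n' '<root>' '<tag id="1" foo="bar">text</tag>' '<other foo="no"/>' '</root>' |
  awk -v RS='>' -F '<tag.*foo="' 'NF>1{sub(/".*/,"",$2);print $2}')
printf '%s\n' "$out"
```

Because .* in the field separator is greedy, a tag with several attributes ending in foo=" would split at the last one, which is worth keeping in mind on real data.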