XMLScraping

You cannot easily parse XML with awk.

But there are several tricks to scrape an XML file:

Extract the content of <tag> </tag>

<tag> </tag> are on the same line

You can use a field separator (FS) that matches both the opening and the closing tag: </?tag>. The line will then look like:

 field1 FS field2 FS field3

where the first FS is the opening tag and the second one is the closing tag. Extracting field2 is then easy:

 awk -F'</?tag>' 'NF>1{print $2}'

This can be generalized if you have more than one pair of <tag> on the same line. The tag contents are then the even-numbered fields:

 awk -F'</?tag>' '{for(i=2;i<=NF;i+=2) print $i}'
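
As a rough illustration of the field-separator trick, here is a small sketch run on a made-up line. The sample data and the function name extract_inline are assumptions for the example only. Note that the loop steps by 2, because the odd-numbered fields hold the text outside the tags.

```shell
# Made-up sample line with two <tag> pairs.
line='a<tag>one</tag>b<tag>two</tag>c'

# Splitting on </?tag> gives: $1="a" $2="one" $3="b" $4="two" $5="c".
# The even-numbered fields are the tag contents.
extract_inline() {
    awk -F'</?tag>' '{ for (i = 2; i <= NF; i += 2) print $i }'
}

printf '%s\n' "$line" | extract_inline   # prints "one" then "two"
```

A line without any <tag> has a single field, so the loop never runs and nothing is printed for it.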

<tag> </tag> are on different lines

 awk '/<tag>/,/<\/tag>/'
 awk ' /<\/tag>/{f=0} f{print} /<tag>/{f=1}'
 awk '/<tag>/{sub(/.*<tag>/,"");f=1} /<\/tag>/{f=0;sub(/<\/tag>.*/,"");print} f{print}'
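
The three commands above do not print the same thing, and a side-by-side run may make the difference clearer. This is only a sketch: the sample input and the function names are made up, and the third variant is written with the sub() regex as /<\/tag>.*/ so that it parses.

```shell
# Made-up sample: the element spans several lines, with stray text around it.
sample() {
    printf '%s\n' 'x <tag>first' 'middle' 'last</tag> y'
}

# 1. Range pattern: prints whole lines, from the <tag> line to the </tag> line.
with_tag_lines() { awk '/<tag>/,/<\/tag>/'; }

# 2. Flag: prints only the lines strictly between the two tag lines.
between_tag_lines() { awk '/<\/tag>/{f=0} f{print} /<tag>/{f=1}'; }

# 3. Flag plus sub(): also keeps the text sharing a line with the tags,
#    and trims everything outside them.
content_only() {
    awk '/<tag>/{sub(/.*<tag>/,"");f=1}
         /<\/tag>/{f=0;sub(/<\/tag>.*/,"");print}
         f{print}'
}

sample | with_tag_lines      # all three lines, unchanged
sample | between_tag_lines   # middle
sample | content_only        # first, middle, last
```

Which one to pick depends on whether the tags sit on lines of their own (the first two are enough) or share a line with content (the third is needed).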

Extracting the value of the attribute foo

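If you want every foo value, regardless of the tag, you can use " as the record separator: a record ending in foo= is then immediately followed by the record holding the value.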

 awk -v RS='"' '/foo=$/{getline;print}'
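
Alternatively, you can use foo=" as the field separator. The line will then look like:

 something FS value" something else FS value" something else

so every field from the second one onward starts with a value, except that you need to get rid of the thing after the quote:

 awk -F'foo="' '{for(i=2;i<=NF;i++){sub(/".*/,"",$i);print $i}}'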
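
To see both attribute tricks at work, here is a minimal sketch on a made-up line. The sample data and the function names foo_by_record and foo_by_field are assumptions for the example.

```shell
# Made-up sample line with two foo attributes on different tags.
sample() {
    printf '%s\n' '<a foo="1" bar="x"><b foo="2"/></a>'
}

# Variant 1: records are split on double quotes, so the record that
# follows one ending in foo= is the attribute value.
foo_by_record() {
    awk -v RS='"' '/foo=$/ { getline; print }'
}

# Variant 2: fields are split on foo=", so every field from the second
# onward starts with a value; sub() strips the rest of that field.
foo_by_field() {
    awk -F'foo="' '{ for (i = 2; i <= NF; i++) { sub(/".*/, "", $i); print $i } }'
}

sample | foo_by_record   # prints 1 then 2
sample | foo_by_field    # prints 1 then 2
```

Both variants match on the text foo= only, so an attribute whose name merely ends in foo (say xfoo) would be picked up as well.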

All the foo attributes of a given tag

Same trick as above, but here we use > as a record separator so that we have one tag per record.

 awk -v RS=\> -F '<tag.*foo="' 'NF>1{sub(/".*/,"",$2);print $2}'
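
A short sketch of this last trick on made-up input follows. The sample line and the function name foo_of_tag are assumptions for the example. Only the foo values that belong to <tag> should come out, and the foo of <other> should be skipped.

```shell
# Made-up sample: foo appears on <tag> twice and on <other> once.
sample() {
    printf '%s\n' '<tag id="1" foo="yes"><other foo="no"/><tag foo="also"/>'
}

# With RS set to >, each record holds one tag. The field separator
# swallows everything from <tag up to foo=", so $2 starts with the value.
foo_of_tag() {
    awk -v RS='>' -F'<tag.*foo="' 'NF>1 { sub(/".*/, "", $2); print $2 }'
}

sample | foo_of_tag   # prints yes then also
```

Records that do not contain <tag followed by foo=" have a single field, so the NF>1 test skips them.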