[ILUG] perl file processing
Francis Daly
francisdaly at gmail.com
Wed Oct 8 19:09:41 IST 2008
2008/10/8 Marcus Furlong <furlongm at hotmail.com>:
Hi there,
> I have about 250 files to process that have the following format:
>
> https://www.cs.tcd.ie/~furlongm/weka-output.txt
A single example doesn't give a great idea of what the common format is.
Based on that file and your description, is the following correct:
Discard everything in the file until the first appearance of the
string "Stratified cross-validation". Then discard everything in the
file until the first appearance of the string "Detailed Accuracy By
Class". The next line is blank; discard it. The next line is a
series of column headings: discard them? Or note their names for later
reporting? The desired content is everything up to the next blank
line. Discard everything after that line.
Then for the "desired content" -- which is some rows of (always 6?)
columns of numbers, possibly with a heading row -- calculate and
present the arithmetic mean of each column. Zeros count as values, so
each column in your example has 11 values, rather than some having 9
or 10.
Repeat for each file.
> If anyone can help with a perl script (or the basis of one), I'd be very grateful, as my bash scripting is getting me nowhere with this one..
I'd probably use a shell loop to send each file in turn into a
"process one file" script.
That script would pass only the table you want (a paragraph delimited
by blank lines) into the do-the-sums part, which would tot things up.
The surrounding shell loop could print the filename before any output,
if that is useful.
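A minimal sketch of that outer loop, wrapped in functions so it can be
sourced. The names process_one and process_all, the single directory
and the .txt suffix are all assumptions for illustration; process_one
just wraps the sed and awk commands given further down.

```shell
#!/bin/sh
# process_one: hypothetical helper. Reads one file on stdin and prints
# the mean of each of the six columns of the table it finds.
process_one() {
    sed '1,/Stratified cross-validation/d' |
    sed '1,/Detailed Accuracy By Class/d' |
    sed -e '1,2d' -e '/^$/,$d' |
    awk '{a+=$1; b+=$2; c+=$3; d+=$4; e+=$5; f+=$6}
         END{if (NR) print a/NR, b/NR, c/NR, d/NR, e/NR, f/NR
             else print "no table found"}'
}

# process_all: hypothetical wrapper. $1 is a directory of *.txt files
# (both the layout and the suffix are assumptions).
process_all() {
    for file in "$1"/*.txt; do
        [ -f "$file" ] || continue    # the glob matched nothing
        printf '%s: ' "$file"         # filename first, then the means
        process_one < "$file"
    done
}
```

Called as process_all /path/to/results, this prints one line per file:
the filename, then the six means.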
If my description above is reasonable, then feeding the file through
something like
sed '1,/Stratified cross-validation/d' |
  sed '1,/Detailed Accuracy By Class/d' |
  sed -e '1,2d' -e '/^$/,$d'
would be one way of only getting that table you want.
Sending that through
awk '{a+=$1; b+=$2; c+=$3; d+=$4; e+=$5; f+=$6}
     END{if (NR) print a/NR, b/NR, c/NR, d/NR, e/NR, f/NR}'
(Not pretty, but effective) will spit out the arithmetic mean of the
exact numbers logged. You can tweak it to report only appropriate
significant figures, if needed.
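One way to do that tweak, as a sketch: swap print for printf with a
%g format. The three significant figures are an arbitrary choice, and
column_means is a made-up name.

```shell
# column_means: hypothetical name. Reads six whitespace-separated
# numeric columns on stdin and prints each column mean to three
# significant figures.
column_means() {
    awk '{a+=$1; b+=$2; c+=$3; d+=$4; e+=$5; f+=$6}
         END{if (NR) printf "%.3g %.3g %.3g %.3g %.3g %.3g\n", a/NR, b/NR, c/NR, d/NR, e/NR, f/NR}'
}
```

Using %.3f instead would give a fixed three decimal places rather than
three significant figures.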
The processing is fragile because of the assumptions made. But if the
file format matches, you're sorted.
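Since everything hinges on both marker strings being present, a cheap
guard before processing can catch files that do not match. A sketch
only; has_markers is a made-up name, and it does not check that the
markers appear in the expected order.

```shell
# has_markers: hypothetical check. Succeeds only if the file named in
# $1 contains both marker strings that the sed pipeline relies on.
has_markers() {
    grep -q 'Stratified cross-validation' "$1" &&
    grep -q 'Detailed Accuracy By Class' "$1"
}
```

In a loop over files, something like
has_markers "$file" || echo "skipping $file" >&2
would flag the odd ones out instead of printing misleading numbers.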
If you really want, you could run those through a2p or s2p to turn them into Perl.
Hope this pushes you in the right direction,
f