[ILUG] Remove duplicate lines from a file?
Conor Daly
conor.daly at oceanfree.net
Fri Jun 30 23:17:01 IST 2000
-----Original Message-----
From: Fergal Daly <fergal at esatclear.ie>
To: Niall O Broin <niall at magicgoeshere.com>; Conor Daly
<conor.daly at oceanfree.net>
Cc: ilug at linux.ie <ilug at linux.ie>
Date: 30 June 2000 22:53
Subject: Re: [ILUG] Remove duplicate lines from a file?
>At 16:39 30/06/00, Niall O Broin wrote:
>>
>>perl -ne 'print unless ($seen{$_}++)'
>>
>>as a pipe to do the job. There's one slight hitch - this will consume memory
>>like there's no tomorrow. If the file(s) you want to treat are somewhat
>>smaller than your free virtual memory, you'll be OK.
>
>In a similar vein
>
>perl -MMD5 -ne 'print unless $seen{MD5->hash($_)}++'
>
>should consume a lot less memory if the lines are long. Of course, if you're
>really unfortunate, 2 of your lines may hash to the same string under MD5,
>but this is highly unlikely, especially if the lines are in some kind of
>regular format. Personally I don't think I'd use this, unless I was just
>trying to get statistics on how many duplicates there are, but I thought it
>was fun.
>
>Fergal
>
I was thinking about checksumming for really big files, but I don't think
I'll need it here. I've got about 17,000 unique lines of under 100 bytes
each, for a total of about 1.5MB of unique data. That should fit into
32MB RAM with 64MB swap OK...
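
For reference, here is a minimal Python sketch of the same first-occurrence filter as the Perl one-liners quoted above. The function name `unique_lines` and the `use_digest` flag are made up for illustration. With `use_digest=True` it remembers a 16-byte MD5 digest per line instead of the line itself, which should only save memory when the lines are noticeably longer than that.

```python
import hashlib


def unique_lines(lines, use_digest=False):
    """Yield each line the first time it appears, keeping input order."""
    seen = set()
    for line in lines:
        # Hypothetical switch: store a fixed-size MD5 digest rather than
        # the whole line, trading a tiny collision risk for less memory.
        if use_digest:
            key = hashlib.md5(line.encode("utf-8")).digest()
        else:
            key = line
        if key not in seen:
            seen.add(key)
            yield line
```

Used as a pipe, something like `sys.stdout.writelines(unique_lines(sys.stdin))` would do the job. For 17,000 short lines the plain version is likely the simpler choice.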
I got the impression somewhere that the World would END before any two
unique files / strings would produce the same hash from MD5 :-)
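
To put a rough number on that impression, here is a small Python sketch of the usual birthday-style upper bound. It assumes MD5 behaves like a uniformly random 128-bit function on ordinary data, and the helper name `collision_bound` is made up for illustration.

```python
from fractions import Fraction


def collision_bound(n, bits=128):
    """Upper bound on the chance that any two of n distinct inputs
    share a hash value, assuming uniformly random `bits`-bit outputs.

    This is the union bound: (number of pairs) / (number of hash values).
    """
    pairs = n * (n - 1) // 2
    return Fraction(pairs, 2 ** bits)


# About 17,000 unique lines, as in this thread: roughly 4e-31.
print(float(collision_bound(17_000)))
```

That figure only covers accidental collisions. Deliberately constructed MD5 collisions have since become practical, so it says nothing about input chosen by someone trying to cause one.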
---
Conor Daly
Ph +353 1 8326146
conor.daly at oceanfree.net