[ILUG] Remove duplicate lines from a file?

Conor Daly conor.daly at oceanfree.net
Fri Jun 30 23:17:01 IST 2000


-----Original Message-----
From: Fergal Daly <fergal at esatclear.ie>
To: Niall O Broin <niall at magicgoeshere.com>; Conor Daly
<conor.daly at oceanfree.net>
Cc: ilug at linux.ie <ilug at linux.ie>
Date: 30 June 2000 22:53
Subject: Re: [ILUG] Remove duplicate lines from a file?


>At 16:39 30/06/00, Niall  O Broin wrote:
>>
>>perl -ne 'print unless ($seen{$_}++)'
>>
>>as a pipe to do the job. There's one slight hitch - this will consume memory
>>like there's no tomorrow. If the file(s) you want to treat are somewhat
>>smaller than your free virtual memory, you'll be OK.
>
>In a similar vein
>
>perl -MMD5 -ne 'print unless $seen{MD5->hash($_)}++'
>
>should consume lots less memory if the lines are long, of course if you're
>really unfortunate 2 of your lines may hash to the same string under MD5
>but this is highly unlikely, especially if the lines are in some kind of
>regular format. Personally I don't think I'd use this, unless I was just
>trying to get statistics on how many duplicates there are, but I thought it
>was fun,
>
>Fergal
>


I was thinking about checksumming for really big files but I don't think
I'll need it here.  I've got about 17,000 unique lines of under 100 bytes
each, for a total of about 1.5MB of unique data.  Should fit into
32MB RAM with 64MB swap OK...

I got the impression somewhere that the World would END before any two
unique files / strings would produce the same hash from MD5 :-)

---
Conor Daly

Ph   +353 1 8326146

conor.daly at oceanfree.net
------------------------------------------




