[ILUG] RAID, huge filesystems and data mining.

Olivier Tharan olive at oban.frmug.org
Wed Jul 14 09:33:25 IST 2004


* Ronan Cunniffe <rcunniff at stp.dias.ie> (20040713 14:59):
>    Prompted by the "why RAID" discussion, I want to see what
> ILUGgers think of the following data-mining challenge, and my current
> sorta idea for solving it.  It's not *my* problem, but it's an interesting
> one.
> 
>    Large (1-2TB, scaling soon x10 or thereabouts) proprietary
> (multi-owner) data corpus, made up of many (thousands at least) of
> separate datasets.
> 
>    You are holding this data, and mediating access to it for an arbitrary
> number of dataminers.  Each user has a very definite set of access
> permissions, and it's not a regular pattern (i.e. there's no easy way of
> splitting the problem).
>    A data-mining run is going to involve 0.1 to 0.5 TB.
> 
>    This is (AFAIK) going to run on Red Hat 9, or possibly Fedora or
> something more recent.

I don't know whether you want to build your solution on
direct-attached storage, or whether it has to be open-source
only.

A NetApp filer could do what you want. It serves files over NFS
(or CIFS, HTTP, etc.), and I think it could also be part of a
SAN with Fibre Channel (I am not sure about this part). When you
run out of space, you add more disks to the volume behind your
qtree[1] and there you are.

The "views" you are talking about could translate into
"snapshots": at one given time, what your users see is a frozen
filesystem where the data does not change, whereas underneath the
snapshot you can add more data or do some cleaning.
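
To make the snapshot idea concrete, here is a toy Python sketch.
It is only an in-memory model, not how a NetApp (or any real
filesystem) implements snapshots; the class name ToyVolume, its
methods, and the dataset names are all made up for illustration.

```python
from types import MappingProxyType


class ToyVolume:
    """Toy in-memory stand-in for a volume that supports snapshots."""

    def __init__(self):
        self._live = {}       # dataset name -> contents (the live view)
        self._snapshots = {}  # snapshot label -> read-only frozen view

    def write(self, name, data):
        self._live[name] = data

    def delete(self, name):
        del self._live[name]

    def live_names(self):
        return sorted(self._live)

    def snapshot(self, label):
        # Copy only the name -> data mapping; the data objects themselves
        # are shared, which is the cheap part of a snapshot.
        frozen = MappingProxyType(dict(self._live))
        self._snapshots[label] = frozen
        return frozen


vol = ToyVolume()
vol.write("survey-a", b"rows...")
view = vol.snapshot("2004-07-14")   # what the dataminers would see

# Underneath the snapshot, keep adding data and cleaning up.
vol.write("survey-b", b"more rows")
vol.delete("survey-a")

print(sorted(view))       # the frozen view still lists only survey-a
print(vol.live_names())   # the live volume now lists only survey-b
```

The frozen view is read-only (assigning into it raises
TypeError), so a mining run against it sees stable data while
the live volume keeps changing.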

I don't know if there is a Linux-based filesystem which does
snapshots; at least UFS2 on FreeBSD-CURRENT now does.

On the downside, the permissions are bound to the Unix accounts
on your systems, so for the access permissions mentioned earlier
you would have to rely on Unix groups and basic permission bits
(no ACLs).
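
As a rough sketch of what "rely on Unix groups" could look like
for an irregular permission pattern, the Python below assumes
one group per dataset and inverts a user-to-datasets table into
group memberships. The table, the "grp_" naming scheme and the
function names are hypothetical. It also flags users who would
need more than 16 groups, since the AUTH_SYS credentials used by
classic NFS carry at most 16 supplementary group IDs; with
thousands of datasets that limit may well matter.

```python
# Hypothetical access table: which datasets each dataminer may read.
ACCESS = {
    "alice": {"ds001", "ds002", "ds417"},
    "bob": {"ds002"},
    "carol": {"ds001", "ds417", "ds900"},
}

# NFS AUTH_SYS credentials carry at most 16 supplementary groups.
NFS_AUTH_SYS_GROUP_LIMIT = 16


def groups_per_dataset(access):
    """Invert user -> datasets into one group per dataset."""
    groups = {}
    for user, datasets in access.items():
        for ds in datasets:
            groups.setdefault("grp_" + ds, set()).add(user)
    return groups


def users_over_limit(access, limit=NFS_AUTH_SYS_GROUP_LIMIT):
    """Users who would need more groups than the credential can carry."""
    return sorted(u for u, ds in access.items() if len(ds) > limit)


def may_read(user, dataset, groups):
    """True if the user belongs to the (assumed) group owning the dataset."""
    return user in groups.get("grp_" + dataset, set())


groups = groups_per_dataset(ACCESS)
for name in sorted(groups):
    print(name, sorted(groups[name]))
print("over the group limit:", users_over_limit(ACCESS))
```

Each dataset directory would then be owned by its group and made
group-readable only; whether that scales depends on how many
datasets a single user needs.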

Do not think of NetApps only as very expensive file storage
cabinets, because you also get a reliable solution for the price.

[1] "quota tree", roughly equivalent to a mount point. You put
one or several qtrees on a RAID subsystem in a NetApp, which is,
by the way, some kind of RAID4++.

-- 
olive


