[ILUG] Corrupt files
frank.duignan at gmail.com
Mon Nov 6 19:50:50 GMT 2006
Any clues in this link?
On 11/6/06, Cian Davis <davisc at skynet.ie> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> I have a weird, frustrating problem and would appreciate the insights
> of anyone on this list. Please bear with me, it's a long mail but the
> problem needs to be described.
> Our research group focuses on CFD
> (http://en.wikipedia.org/wiki/Computational_fluid_dynamics for those
> unfamiliar with it).
> Most of us use software called Fluent and one person in the group uses
> CFX. All our desktop machines are Windows and we use the Windows
> version but we have a cluster of 9 Fujitsu-Siemens dual processor Xeons.
> When the cluster was initially delivered, it was running RedHat 6.
> After a few months, some of the Fluent users found that their files
> wouldn't read because they were corrupted.
> Fluent files are made up of descriptive text at the top, a binary blob
> of information in the middle, and text again at the bottom. Fluent has
> support for gzip so I told people to gzip the files and that helped
> for a while but the corruption came back. The occurrences seemed random and only
> affected about 2 out of the 5 people using Fluent on the cluster. We
> would find that the modification date on a corrupted data set would be
> the same as a backup that was working.
> The CFX user had no problem and, 2 years later, continues to have none.
> In short, I couldn't pin it down to anything but suspected that the
> versions of software offered by RedHat 6 were old and possibly dodgy.
> So about a year ago, I wiped all the machines and put Debian sarge on
> them. It's not a supported platform for either Fluent or CFX but I've
> managed to get both working from a tarball that each provide.
> It's started happening again and specifically, it's started happening
> to my files. Considering that each of these datasets generally takes
> about 12 hours to solve, it's more than a bit of a pain in the arse
> that stuff is screwing up. One of the machines faces the network and runs
> Kerberos, NIS, Nagios, NFS, DNS, Squid and ntpd. The other nodes have
> the Fluent and CFX software NFS mounted from the master node.
> Now, don't moan about this bit - it's the only way I could do it. The
> master only had 50GB of disk free. Each of the nodes had about 20GB
> free. To give everyone enough space for the thing to be useful, the
> /home of the heaviest user was put on the master node and the other
> users were given a /home on one of the nodes, which was NFS mounted to
> the master (as /home/$user). Generally, a job is set running on more
> than 1 node from the master - Fluent uses rsh to contact the other
> nodes. As far as possible, no heavy computation is done on the master.
> I don't think it's an NFS problem - the user with the home on the
> master node was the first to go tits up. I don't think it's a Debian
> problem because the same happened with RedHat. I don't think it's a
> Linux problem because no other software seems to have a problem.
> Nothing in logs or dmesg. I'm leaning towards a Fluent problem or a
> hardware problem but I can't think of any way to test this. The problem
> is sufficiently random that I can't provide good data to the software
> maker to investigate - and the fact that we're running on an
> unsupported architecture doesn't help. And also, if it's a hardware
> problem, why does it only affect files read and written with this
> software?
> So, can anyone suggest something to try or troubleshooting steps to go
> through?
> Any help much appreciated.
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.5 (MingW32)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
> -----END PGP SIGNATURE-----
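
The detail that a corrupted data set carries the same modification date as a working backup suggests the bytes are changing without a normal write. One way to test that, and to find out when it happens, might be to record checksums right after a solve finishes and compare them again before the files are read. The following is only a rough sketch, assuming a reasonably recent Python 3 on the nodes; the function names are made up for illustration and nothing here is part of Fluent or CFX.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(root):
    """Map each file under root to its size, mtime (ns) and SHA-256."""
    records = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            records[os.path.relpath(path, root)] = {
                "size": info.st_size,
                "mtime_ns": info.st_mtime_ns,
                "sha256": sha256_of(path),
            }
    return records


def silently_changed(old, new):
    """Names of files whose contents differ although the mtime did not move."""
    return sorted(
        name
        for name, before in old.items()
        if name in new
        and new[name]["sha256"] != before["sha256"]
        and new[name]["mtime_ns"] == before["mtime_ns"]
    )
```

Taking one snapshot on the node that owns the disk and another from the master over NFS, then comparing the two with silently_changed, might also help separate a disk or memory fault from an NFS one. If a checksum changes while the mtime stays put, that points away from Fluent rewriting the file and towards something underneath it.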