[Go-essp-tech] zombie processes

martin.juckes at stfc.ac.uk martin.juckes at stfc.ac.uk
Tue Dec 20 02:12:46 MST 2011


Hello Jennifer,

I have also found that files become corrupted if wget is run too often with the "-c" option, resulting in a file that is as big as or bigger than the source file but has the wrong checksum. Apparently this is not meant to happen, but it does. My solution is to remove the incoming file and start again if it has not been successfully downloaded after 3 or 4 attempts with "wget -c". (I don't know the optimal number of attempts here; it obviously varies with the network speed, the reliability of the THREDDS server at the other end, the size of the data files, and probably many other things as well -- too many variables for a systematic survey.) In some cases I have also found it necessary to impose a rate limit (e.g. "--limit-rate=400k") to stop corruption happening during automatic retries. This clearly slows down the data rate on individual threads, but increases reliability.
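
In case it's useful, here is a rough sketch of that retry-then-restart loop as a shell script. The URL and checksum are placeholders, MAX_TRIES=4 is just the value I happen to use, and md5sum is an assumption -- use whatever checksum type the data node publishes:

  #!/bin/bash
  # Sketch of the strategy above: resume with "wget -c" a few times,
  # verify the checksum after each attempt, and if it still fails,
  # delete the partial file and restart the download from byte zero.
  URL="http://example.invalid/some_dataset.nc"    # placeholder URL
  EXPECTED_MD5="0123456789abcdef0123456789abcdef" # placeholder checksum
  MAX_TRIES=4                                     # 3 or 4 works for me
  FILE=$(basename "$URL")

  for i in $(seq 1 "$MAX_TRIES"); do
      wget -c --limit-rate=400k "$URL"
      if [ "$(md5sum "$FILE" | awk '{print $1}')" = "$EXPECTED_MD5" ]; then
          echo "download OK after $i attempt(s)"
          exit 0
      fi
  done

  # Too many failed resumes: assume the partial file is corrupted,
  # remove it, and start the whole download again from scratch.
  rm -f "$FILE"
  wget --limit-rate=400k "$URL"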

Sincerely,
Martin Juckes

From: go-essp-tech-bounces at ucar.edu [mailto:go-essp-tech-bounces at ucar.edu] On Behalf Of Jennifer Adams
Sent: 20 December 2011 01:08
To: go-essp-tech at ucar.edu
Subject: [Go-essp-tech] zombie processes

Dear Colleagues,
I am doing all my wgetting on a suite of 64-bit CentOS boxes with a common Gluster filesystem. My downloads are leaving a trail of hung processes on the servers that can't be killed (we refer to them as zombies). Once too many zombies accumulate, the only remedy is a reboot, which is inconvenient for all the other users doing stuff on the servers. Some file-locking errors have shown up in the logs, which makes sense because the zombies are usually 'mv' or 'ls' commands, the wget command itself, or similar. One pattern I've seen: a file gets stuck as a partial download, the checksums don't match, so wget starts the download again, but the file can't be overwritten, and wget gets caught in an infinite loop. And you can't kill the process or delete the file in question without spawning more zombies. It also seems to happen when I poke at the 0-byte files that are left behind when a download fails.
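
For reference, here is a quick diagnostic one-liner (just a sketch, nothing ESGF-specific) that lists processes stuck in uninterruptible-sleep (D) or zombie (Z) state, which is how these hung processes show up in ps:

  # list PID, state, and command for processes in D or Z state
  ps axo pid=,stat=,comm= | awk '$2 ~ /^[DZ]/'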

My sysadmin is considering a Gluster upgrade, but that would be a big job and highly disruptive, and before I send him down that path I'd like to be sure we've ruled out other possible causes. Nobody at COLA but us CMIP5 downloaders is reporting any problems. So, has anyone else experienced these kinds of issues? Is it possible that the interaction of wget with a busy data node is the cause?

> wget --version
GNU Wget 1.11.4 Red Hat modified
> uname -a
Linux cola2.gmu.edu 2.6.18-238.12.1.el5 #1 SMP Tue May 31 13:22:04 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux


--Jennifer


--
Jennifer M. Adams
IGES/COLA
4041 Powder Mill Road, Suite 302
Calverton, MD 20705
jma at cola.iges.org





