Hello,

This is regarding running WRF on a Linux machine with Open MPI over an InfiniBand interconnect. WRF runs fine on 8 processors on a single node, but it fails when I run across nodes; below is the error from a 12-process run spanning moria08 and moria09. This may well be an Open MPI issue, but I wanted to check in this forum in case anyone has an idea about it. If anyone has faced a similar issue, please help me out.
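For reference, the job is launched with a plain mpirun over a hostfile, roughly like this (the hostfile and binary names are placeholders, not my exact paths):

    # 12 MPI ranks spread over moria08 and moria09, listed in a hostfile
    mpirun -np 12 --hostfile hosts ./wrf.exe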
Thanks in advance,
Preeti
--------------------------------------------------------------------------
WARNING: There are more than one active ports on host 'moria08', but the
default subnet GID prefix was detected on more than one of these
ports. If these ports are connected to different physical IB
networks, this configuration will fail in Open MPI. This version of
Open MPI requires that every physically separate IB subnet that is
used between connected MPI processes must have different subnet ID
values.

Please see this FAQ entry for more details:

  http://www.open-mpi.org/faq/?category=openfabrics#ofa-default-subnet-gid

NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_default_gid_prefix to 0.
--------------------------------------------------------------------------
 starting wrf task 0 of 12
 starting wrf task 2 of 12
 starting wrf task 5 of 12
 starting wrf task 6 of 12
 starting wrf task 4 of 12
 starting wrf task 1 of 12
 starting wrf task 3 of 12
 starting wrf task 7 of 12
[moria08:28420] 11 more processes have sent help message help-mpi-btl-openib.txt / default subnet prefix
[moria08:28420] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
--------------------------------------------------------------------------
The InfiniBand retry count between two MPI processes has been
exceeded. "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):

    The total number of times that the sender wishes the receiver to
    retry timeout, packet sequence, etc. errors before posting a
    completion error.

This error typically means that there is something awry within the
InfiniBand fabric itself. You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:

* btl_openib_ib_retry_count - The number of times the sender will
  attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
  to 10). The actual timeout value used is calculated as:

     4.096 microseconds * (2^btl_openib_ib_timeout)

  See the InfiniBand spec 1.2 (section 12.7.34) for more details.

Below is some information about the host that raised the error and the
peer to which it was connected:

  Local host:   moria08
  Local device: mthca0
  Peer host:    moria09

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 28423 on
node moria08 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[moria08:28420] 3 more processes have sent help message help-mpi-btl-openib.txt / pp retry exceeded
00:35:18 vss@moria08
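From the help messages above, the relevant knobs can be passed straight to mpirun. This is a sketch of what I intend to try next (again, the hostfile and binary names are placeholders):

    # Silence the default-GID-prefix warning and raise the IB ACK timeout.
    # The timeout is 4.096 us * 2^btl_openib_ib_timeout, so the default of
    # 10 gives about 4.2 ms per attempt and 14 gives about 67 ms.
    # btl_openib_ib_retry_count already defaults to its maximum of 7.
    mpirun -np 12 --hostfile hosts \
        --mca btl_openib_warn_default_gid_prefix 0 \
        --mca btl_openib_ib_timeout 14 \
        ./wrf.exe

If the retry-count error persists even with a much larger timeout, the message itself suggests the problem is in the fabric (a bad cable, HCA, or switch port on the moria08/moria09 path), which would be one for the system administrator.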