Thursday, August 18, 2011

weblogic.unicast.HttpPing in Unicast clusters

If you have the impression that your cluster members are vanishing from the cluster and have difficulty rejoining it, read this:

http://download.oracle.com/docs/cd/E21764_01/doc.1111/e15731/weblogic_server_issues.htm#CIHEAFDH

9.4.1 Threads Are Blocked on Cluster Messaging in Unicast Mode

When using Unicast mode for cluster communication, many threads are blocked on cluster messaging, which may result in cluster members having difficulty sending heartbeat messages. In this situation, some cluster members drop out from the cluster and may take some time to rejoin the cluster.

Workaround

Set the following system property to resolve this issue:

-Dweblogic.unicast.HttpPing=true
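
Since this is a JVM system property, it has to end up on the Java command line of each server in the cluster. Here is a minimal Python sketch of adding it without duplicating it. It assumes your start scripts read their JVM flags from a JAVA_OPTIONS-style string, which you should check against your own setup; the helper name with_system_property is made up for this example.

```python
import shlex

# The workaround flag quoted from the Oracle note above.
FLAG = "-Dweblogic.unicast.HttpPing=true"


def with_system_property(java_options: str, flag: str = FLAG) -> str:
    """Return java_options with `flag` set exactly once.

    Any earlier setting of the same -D key (for example ...HttpPing=false)
    is dropped so the new value is the only one left.
    """
    key = flag.split("=", 1)[0]
    kept = [
        token
        for token in shlex.split(java_options)
        if token != key and not token.startswith(key + "=")
    ]
    kept.append(flag)
    # Re-quote each token so values containing spaces survive a shell.
    return " ".join(shlex.quote(token) for token in kept)
```

You would feed it the current value of the options string and export the result before starting each managed server.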




If any member of the cluster is down, you will see:

####<21-Aug-2011 12:36:30 o'clock BST> <debug> <unicastmessaging> <pierrepc> <ms1> <Timer-2> <<anonymous>> <> <> <1313926590111> <BEA-000000> <[HttpPingRoutine] HttpPing Caught IOException: java.net.SocketException: Socket Closed>


These messages follow the ping attempts:

####<21-Aug-2011 12:39:27 o'clock BST> <Debug> <UnicastMessaging> <pierrepc> <ms1> <[ACTIVE] ExecuteThread: '2' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1313926767511> <BEA-000000> <[UnicastFragmentSocket] sending ' server ms1, id=1313926767511' to local group>
####<21-Aug-2011 12:39:27 o'clock BST> <Debug> <UnicastMessaging> <pierrepc> <ms1> <[ACTIVE] ExecuteThread: '2' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1313926767511> <BEA-000000> <[Group] [LocalGroup [[ms1]]] we are the seniormost. Send message to group>
####<21-Aug-2011 12:39:27 o'clock BST> <Debug> <UnicastMessaging> <pierrepc> <ms1> <[ACTIVE] ExecuteThread: '2' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1313926767512> <BEA-000000> <[Group] [LocalGroup [[ms1]]] we are the seniormost. Send message to group>

The SocketException will go away once you start all members of the cluster.


If you get this:


####<21-Aug-2011 13:11:59 o'clock BST> <Warning> <Socket> <pierrepc> <ms1> <ExecuteThread: '2' for queue: 'weblogic.socket.Muxer'> <<WLS Kernel>> <> <> <1313928719011> <BEA-000442> <Connection attempt was rejected because the incoming protocol is not enabled on channel "unicastChannel".>

The suggested fix is to enable the incoming protocol on that channel (unicastChannel in this case).



And if you get this:


####<21-Aug-2011 13:12:05 o'clock BST> <Debug> <UnicastMessaging> <pierrepc> <ms1> <Timer-2> <<anonymous>> <> <> <1313928725736> <BEA-000000> <[HttpPingRoutine] HttpPing Caught IOException: java.io.EOFException: Response contained no data>


Then you might try -Dhttp.keepAlive=false, as has been suggested (it didn't work for me :o( ).
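
To spot these three symptoms in a large server log, a small Python sketch like the one below may help. It assumes every record is on one line and has the same twelve `<...>` fields as the excerpts in this post; the field names, the function names parse_line and triage, and the hint wording are mine, not WebLogic's.

```python
from typing import NamedTuple, Optional


class LogRecord(NamedTuple):
    # Field names are guesses based on the excerpts in this post.
    timestamp: str
    severity: str
    subsystem: str
    machine: str
    server: str
    thread: str
    msg_id: str
    message: str


def parse_line(line: str) -> Optional[LogRecord]:
    """Split one '####<a> <b> ...' record; return None if it does not fit."""
    line = line.strip()
    if not (line.startswith("####<") and line.endswith(">")):
        return None
    # Twelve fields separated by "> <"; maxsplit keeps the message in one piece.
    f = line[len("####<"):-1].split("> <", 11)
    if len(f) != 12:
        return None
    return LogRecord(f[0], f[1], f[2], f[3], f[4], f[5], f[10], f[11])


def triage(line: str) -> Optional[str]:
    """Map a log line to the hint given in this post, or None if unrelated."""
    rec = parse_line(line)
    if rec is None:
        return None
    if rec.msg_id == "BEA-000442":
        return "enable the protocol on the channel"
    msg = rec.message.lower()
    if "httpping caught ioexception" in msg:
        if "socketexception: socket closed" in msg:
            return "a cluster member is down; start all members"
        if "eofexception: response contained no data" in msg:
            return "try -Dhttp.keepAlive=false (did not help here)"
    return None
```

Run triage over each line of the server log and print the lines for which it returns a hint; the ordinary "Send message to group" debug lines return None.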

Oracle's known and resolved issues page, http://download.oracle.com/docs/cd/E12840_01/wls/docs103/issues/known_resolved.html, mentions that in pre-10.3 versions of WebLogic the cluster gets into trouble when you suspend one of the Managed Servers. This is tracked as CR370084. Note, however, that the workaround of setting -Dweblogic.unicast.HttpPing=true is recommended only for test environments.






2 comments:

skies said...

Hi, I tried the option -Dhttp.keepAlive=false after going through the Sun networking documentation, to get rid of the IOException: java.io.EOFException, but it never worked for me, and I had to remove the members from the cluster and run it as a non-clustered unit. Were you able to find any solution for this? I might have to open a case with Oracle to see what's happening.

vernetto said...

So far we are only using a dedicated unicast channel (cluster-broadcast), and we are not using the weblogic.unicast.HttpPing option either.

Unfortunately the issue is impossible to reproduce; it occurs randomly, only every few days.

I will also open an SR for this. In the worst case you can try a Multicast cluster; I never had issues with that.