jmx events - other nodes get connected events when reconnecting to a server

Type: Bug
Status: Closed
Priority: 2 Major
Resolution: Won't Fix
Component/s:
Labels:

Assignee: interfaces
Reporter: tgautier
Created: November 29, 2007
Votes: 0
Watchers: 0
Updated: December 16, 2011
Resolved: February 20, 2009

Description

1. start server
2. start node 1
3. start node 2
4. stop server
5. start server

After step 5, node 1 and node 2 will reconnect (assuming persistence mode). Node 1 will receive a “this node connected” message for node 1, and node 2 will receive a “this node connected” message for node 2 (as expected).

The odd thing is that node 1 will ALSO receive a “node connected” message for node 2, and node 2 will receive a “node connected” message for node 1.

I am not sure this is what I would have expected. It seems difficult to program reasonable resource-management listeners around this behavior, but it’s late at night, so I could be wrong.
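To illustrate the resource-management concern, here is a minimal sketch of a listener that tolerates repeated “node connected” notifications by tracking peers in a set. It uses the standard javax.management listener interface, but the notification type strings and the convention of carrying the node id in the notification message are assumptions made for this example, not the product’s actual event format.

```java
import java.util.Set;
import java.util.TreeSet;
import javax.management.Notification;
import javax.management.NotificationListener;

/** Hypothetical listener: allocates per-peer resources at most once per peer. */
final class IdempotentPeerListener implements NotificationListener {
    // Assumed notification types, invented for this sketch.
    static final String NODE_CONNECTED = "example.node.connected";
    static final String NODE_DISCONNECTED = "example.node.disconnected";

    private final Set<String> peers = new TreeSet<>();
    private int allocations;

    @Override
    public synchronized void handleNotification(Notification n, Object handback) {
        String nodeId = n.getMessage(); // assumption: the node id travels in the message
        if (NODE_CONNECTED.equals(n.getType())) {
            // Set.add returns false for a peer we already know, so a repeated
            // "node connected" after a server restart does not allocate twice.
            if (peers.add(nodeId)) {
                allocations++;
            }
        } else if (NODE_DISCONNECTED.equals(n.getType())) {
            peers.remove(nodeId);
        }
    }

    /** Copy of the peers currently considered connected. */
    synchronized Set<String> peers() {
        return new TreeSet<>(peers);
    }

    /** Number of times a peer was seen for the first time. */
    synchronized int allocations() {
        return allocations;
    }
}
```

This only makes duplicate notifications harmless. It does not help a node that never receives a notification for some peer, which is the other half of the behavior described here.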

Comments

Taylor Gautier 2007-11-29

When we go to 3 nodes, this gets more interesting (and a little more like what I would expect):

node 1 gets node 2 and node 3 connected
node 2 gets node 3 connected
node 3 gets nothing

Steve Harris 2007-11-29

About the first thing you wrote: I think whichever one connected first will receive a node connected for the other one. So that doesn’t feel so surprising to me (not saying we can’t change it).

Taylor Gautier 2007-11-30

The behavior of the 2-node case leads Steve and me to believe there is a race condition, and it is unexpected. If this is a simple fix, let’s do it in pacheco; otherwise punt it back to DRB.

Taylor Gautier 2007-11-30

Another thing to think about is that connected and disconnected events are also sent for a server failover, yet they are used to show members permanently joining and leaving. So either we need different events to disambiguate, or the server failover shouldn’t send connected / disconnected.

Fiona OShea 2008-04-29

Not going to be done for 2.6

Taylor Gautier 2008-08-11

After talking with Walter, we determined that the following events should exist. It is not clear whether they can be mapped to the current events; working that out will be the next step.

1. Node Joined the Cluster Event -> Sent once and only once to every node in the cluster when a node joins the cluster, including the node that just joined. Note that the event must contain the node id of the joining node. In addition, the node that just joined must know its own node id before this event is sent.
2. Node Left the Cluster Event -> Sent to the remaining nodes when a node is permanently removed from the cluster. Note that this should not be sent on a temporary disconnection (e.g. when a connection is lost to a client) but only upon permanent removal.
3. This Node Connected to the Server Event -> Sent each time a connection to the server is established. Note that the first time a node joins the cluster, this event will be sent in conjunction with the Node Joined the Cluster Event.
4. This Node Disconnected from the Server Event -> Sent each time a connection to the server is lost.
5. This Node Removed from the Cluster Event -> Sent when this node is permanently removed from the cluster. Note that this can only be determined when the node attempts to connect to a server and the server rejects the client. It may not in fact be deterministic or useful, so this event is questionable and should be discussed further. This message is listed in RMP-347.

Some examples.

1. Server is started.
   -> no events sent
2. Client 1 joins the cluster.
   -> Client 1 receives: Event #1 (Node Joined the Cluster [Client 1])
   -> Client 1 receives: Event #3 (This Node Connected to Server)
3. Server fails over to Passive.
   -> Client 1 receives: Event #4 (This Node Disconnected from Server)
   -> Client 1 receives: Event #3 (This Node Connected to Server)
4. Client 2 joins the cluster.
   -> Client 2 receives: Event #1 (Node Joined the Cluster [Client 2])
   -> Client 2 receives: Event #3 (This Node Connected to Server)
   -> Client 1 receives: Event #1 (Node Joined the Cluster [Client 2])
5. Client 1 loses its connection long enough that it is determined to be out.
   -> Client 1 receives: Event #4 (This Node Disconnected from Server)
   -> Client 2 receives: Event #2 (Node Left the Cluster [Client 1])
6. Client 1 attempts to reconnect to the cluster.
   -> Client 1 receives: Event #5 (This Node Removed from the Cluster)
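
The proposed semantics can be checked against the examples above with a small in-memory model. This is only a sketch under the rules as written in this comment: the class, method, and enum names are invented for illustration and are not part of any existing API, and the order of delivery between different clients is arbitrary here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** The five proposed events, in the order they are listed above. */
enum ProposedEvent {
    NODE_JOINED, NODE_LEFT, THIS_NODE_CONNECTED, THIS_NODE_DISCONNECTED, THIS_NODE_REMOVED
}

/** Hypothetical model that records which events each client would receive. */
final class ProposedEventSimulator {
    private final Set<String> members = new LinkedHashSet<>();
    private final Map<String, List<String>> inbox = new LinkedHashMap<>();

    private void deliver(String to, ProposedEvent event, String subject) {
        String text = subject == null ? event.name() : event.name() + "[" + subject + "]";
        inbox.computeIfAbsent(to, k -> new ArrayList<>()).add(text);
    }

    /** A node joins: every member, including the joiner, hears NODE_JOINED once. */
    void join(String client) {
        members.add(client);
        deliver(client, ProposedEvent.NODE_JOINED, client);
        deliver(client, ProposedEvent.THIS_NODE_CONNECTED, null);
        for (String m : members) {
            if (!m.equals(client)) {
                deliver(m, ProposedEvent.NODE_JOINED, client);
            }
        }
    }

    /** Server failover: connection events only, no membership events. */
    void failover() {
        for (String m : members) {
            deliver(m, ProposedEvent.THIS_NODE_DISCONNECTED, null);
            deliver(m, ProposedEvent.THIS_NODE_CONNECTED, null);
        }
    }

    /** A client is gone long enough to be permanently removed. */
    void expire(String client) {
        deliver(client, ProposedEvent.THIS_NODE_DISCONNECTED, null);
        members.remove(client);
        for (String m : members) {
            deliver(m, ProposedEvent.NODE_LEFT, client);
        }
    }

    /** A removed client tries to reconnect and is rejected by the server. */
    void rejectReconnect(String client) {
        deliver(client, ProposedEvent.THIS_NODE_REMOVED, null);
    }

    List<String> received(String client) {
        return inbox.getOrDefault(client, new ArrayList<>());
    }

    public static void main(String[] args) {
        ProposedEventSimulator sim = new ProposedEventSimulator();
        sim.join("Client 1");
        sim.failover();
        sim.join("Client 2");
        sim.expire("Client 1");
        sim.rejectReconnect("Client 1");
        System.out.println("Client 1: " + sim.received("Client 1"));
        System.out.println("Client 2: " + sim.received("Client 2"));
    }
}
```

The point of the model is that a failover produces only the two connection events, so listeners can treat NODE_JOINED and NODE_LEFT as permanent membership changes.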

Walter Harley 2008-08-12

Here’s a log obtained by launching three instances of the jmx-utils sample app in three separate command shells, and then starting and stopping the server (with mvn tc:start and mvn tc:stop) in a fourth. Each instance of the sample app launches two nodes. Asterisks indicate points where the server was started and stopped.

Clearly it is pretty random which nodes get which notifications on a restart.

CLIENT 0/1
[INFO] [node1] Registering listener…
[INFO] [node0] Registering listener…
[INFO] [node1] Hit Ctrl+D or something like that to exit: registered.
[INFO] [node1] My node id is: ClientID[0]
[INFO] [node1] Current cluster members are: [‘ClientID[0]’, ‘ClientID[1]’]
[INFO] [node0] Hit Ctrl+D or something like that to exit: registered.
[INFO] [node0] My node id is: ClientID[1]
[INFO] [node0] Current cluster members are: [‘ClientID[0]’, ‘ClientID[1]’]
[INFO] [node1] Node connected: ClientID[2]
[INFO] [node0] Node connected: ClientID[2]
[INFO] [node0] Node connected: ClientID[3]
[INFO] [node1] Node connected: ClientID[3]
[INFO] [node1] Node connected: ClientID[4]
[INFO] [node0] Node connected: ClientID[4]
[INFO] [node1] Node connected: ClientID[5]
[INFO] [node0] Node connected: ClientID[5]
*****
[INFO] [node1] This node disconnected: ClientID[0]
[INFO] [node0] This node disconnected: ClientID[1]
*****
[INFO] [node0] This node connected: ClientID[1]
[INFO] [node0] Node connected: ClientID[2]
[INFO] [node0] Node connected: ClientID[4]
[INFO] [node0] Node connected: ClientID[3]
[INFO] [node1] This node connected: ClientID[0]
[INFO] [node0] Node connected: ClientID[5]
[INFO] [node0] Node connected: ClientID[0]
[INFO] [node1] Node connected: ClientID[5]
*****
[INFO] [node1] This node disconnected: ClientID[0]
[INFO] [node0] This node disconnected: ClientID[1]
*****
[INFO] [node0] This node connected: ClientID[1]
[INFO] [node0] Node connected: ClientID[2]
[INFO] [node0] Node connected: ClientID[4]
[INFO] [node0] Node connected: ClientID[3]
[INFO] [node0] Node connected: ClientID[5]
[INFO] [node1] This node connected: ClientID[0]
[INFO] [node0] Node connected: ClientID[0]
*****

CLIENT 2/3
[INFO] [node1] Registering listener…
[INFO] [node0] Registering listener…
[INFO] [node1] Hit Ctrl+D or something like that to exit: registered.
[INFO] [node1] My node id is: ClientID[2]
[INFO] [node1] Current cluster members are: [‘ClientID[0]’, ‘ClientID[1]’, ‘ClientID[2]’, ‘ClientID[3]’]
[INFO] [node0] Hit Ctrl+D or something like that to exit: registered.
[INFO] [node0] My node id is: ClientID[3]
[INFO] [node0] Current cluster members are: [‘ClientID[0]’, ‘ClientID[1]’, ‘ClientID[2]’, ‘ClientID[3]’]
[INFO] [node1] Node connected: ClientID[4]
[INFO] [node0] Node connected: ClientID[4]
[INFO] [node1] Node connected: ClientID[5]
[INFO] [node0] Node connected: ClientID[5]
*****
[INFO] [node1] This node disconnected: ClientID[2]
[INFO] [node0] This node disconnected: ClientID[3]
*****
[INFO] [node1] This node connected: ClientID[2]
[INFO] [node1] Node connected: ClientID[4]
[INFO] [node1] Node connected: ClientID[3]
[INFO] [node0] This node connected: ClientID[3]
[INFO] [node0] Node connected: ClientID[5]
[INFO] [node0] Node connected: ClientID[0]
[INFO] [node1] Node connected: ClientID[5]
[INFO] [node1] Node connected: ClientID[0]
*****
[INFO] [node1] This node disconnected: ClientID[2]
[INFO] [node0] This node disconnected: ClientID[3]
*****
[INFO] [node1] This node connected: ClientID[2]
[INFO] [node1] Node connected: ClientID[4]
[INFO] [node0] This node connected: ClientID[3]
[INFO] [node1] Node connected: ClientID[3]
[INFO] [node1] Node connected: ClientID[5]
[INFO] [node0] Node connected: ClientID[5]
[INFO] [node0] Node connected: ClientID[0]
[INFO] [node1] Node connected: ClientID[0]
*****

CLIENT 4/5
[INFO] [node0] Registering listener…
[INFO] [node1] Registering listener…
[INFO] [node0] Hit Ctrl+D or something like that to exit: registered.
[INFO] [node0] My node id is: ClientID[4]
[INFO] [node0] Current cluster members are: [‘ClientID[0]’, ‘ClientID[1]’, ‘ClientID[2]’, ‘ClientID[3]’, ‘ClientID[4]’, ‘ClientID[5]’]
[INFO] [node1] Hit Ctrl+D or something like that to exit: registered.
[INFO] [node1] My node id is: ClientID[5]
[INFO] [node1] Current cluster members are: [‘ClientID[0]’, ‘ClientID[1]’, ‘ClientID[2]’, ‘ClientID[3]’, ‘ClientID[4]’, ‘ClientID[5]’]
*****
[INFO] [node1] This node disconnected: ClientID[5]
[INFO] [node0] This node disconnected: ClientID[4]
*****
[INFO] [node0] This node connected: ClientID[4]
[INFO] [node0] Node connected: ClientID[2]
[INFO] [node1] This node connected: ClientID[5]
[INFO] [node1] Node connected: ClientID[3]
[INFO] [node1] Node connected: ClientID[0]
[INFO] [node0] Node connected: ClientID[3]
[INFO] [node0] Node connected: ClientID[5]
[INFO] [node0] Node connected: ClientID[0]
*****
[INFO] [node0] This node disconnected: ClientID[4]
[INFO] [node1] This node disconnected: ClientID[5]
*****
[INFO] [node0] This node connected: ClientID[4]
[INFO] [node0] Node connected: ClientID[3]
[INFO] [node1] This node connected: ClientID[5]
[INFO] [node0] Node connected: ClientID[5]
[INFO] [node0] Node connected: ClientID[0]
[INFO] [node1] Node connected: ClientID[0]
*****
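
To compare which “Node connected” notifications each local node received between server stops and starts, a transcript like the ones above can be split at the ***** markers and tallied. The following is a rough sketch that assumes the exact entry format shown in this log; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that tallies "Node connected" entries per log segment. */
final class RestartLogTally {
    // Matches either a ***** marker or a "Node connected" entry.
    // "This node connected" entries do not match because of the "] Node" prefix.
    private static final Pattern TOKEN = Pattern.compile(
            "\\*{5}|\\[INFO\\] \\[(node\\d+)\\] Node connected: (ClientID\\[\\d+\\])");

    /**
     * Returns one map per *****-delimited segment, in order. Each map goes from
     * the local node label (e.g. "node0") to the peers announced to that node.
     */
    static List<Map<String, List<String>>> tally(String transcript) {
        List<Map<String, List<String>>> segments = new ArrayList<>();
        Map<String, List<String>> current = new LinkedHashMap<>();
        segments.add(current);
        Matcher m = TOKEN.matcher(transcript);
        while (m.find()) {
            if (m.group(1) == null) {
                // A ***** marker: start a new segment.
                current = new LinkedHashMap<>();
                segments.add(current);
            } else {
                current.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(m.group(2));
            }
        }
        return segments;
    }
}
```

Because the matcher scans the whole string, the tally works whether the entries are separated by newlines or by spaces. Segments alternate between "server up" and "server down", so the segment following each stop/start pair holds the notifications delivered after that restart.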

Fiona OShea 2008-09-16

Where did we leave this?

Walter Harley 2008-09-16

It’s on, but not at the top of, my to-do list.

Fiona OShea 2008-12-12

Hi Sergio, I think you have been doing some work/thinking on this. Do you have any thoughts?

Sergio Bossa 2008-12-13

Hi Fiona,

I didn’t experience such a problem in my tests. I’m taking a look at the logs that Walter posted above, and there seems to be some kind of timing issue. For example, after the first server restart, ClientID[0] gets the nodeConnected event *only* from ClientID[5], but ClientID[5] gets the same event even from ClientID[3]: how is that possible? Given that ClientID[5] started after (or together with) ClientID[0], why didn’t the latter receive the ClientID[3] notification? I can’t come up with an answer, but IMHO it is a minor problem: cluster event implementors shouldn’t rely on the nodeConnected event after a server restart for executing important business logic, because it depends too much on the order in which nodes rejoin the cluster and JMX notifications get propagated.
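
One way to avoid depending on per-peer nodeConnected events after a restart is to re-read the full membership whenever this node reconnects and diff it against the last known view. The sketch below assumes some membership query is available (the sample app in Walter’s log prints “Current cluster members”, which suggests one exists); the Supplier stands in for that query and every name here is hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Supplier;

/** Hypothetical reconciler: trusts a membership snapshot, not event ordering. */
final class SnapshotReconciler {
    private final Supplier<Set<String>> currentMembers; // stand-in for a real membership query
    private Set<String> known = new TreeSet<>();

    SnapshotReconciler(Supplier<Set<String>> currentMembers) {
        this.currentMembers = currentMembers;
    }

    /**
     * Call when "this node connected" arrives. Replaces the known view with a
     * fresh snapshot and returns the members that were not known before.
     */
    synchronized Set<String> onThisNodeConnected() {
        Set<String> now = new TreeSet<>(currentMembers.get());
        Set<String> added = new TreeSet<>(now);
        added.removeAll(known);
        known = now;
        return added;
    }

    /** Copy of the last snapshot taken. */
    synchronized Set<String> known() {
        return new TreeSet<>(known);
    }
}
```

With this approach the result is the same regardless of the order in which nodes rejoin, as long as the snapshot itself is accurate at the time it is read.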

What do you think?

Cheers,

Sergio B.

Fiona OShea 2008-12-17

Thanks for your input, Sergio. We are going to review how we do this.

Fiona OShea 2009-02-20

The API is deprecated in the next release, so this bug will be obsolete.