Master database switching to read-only mode, possibly caused by slave failure

#1

We had a database failure causing a production outage. Both master and slave databases have failed, slave became unresponsive and the master switched to read-only mode. By the look of it the slave database failed first and possibly caused the master to switch to read-only mode. The sequence of events was:

13:20 - our monitoring reported GC time per minute on the SLAVE instance over 30 seconds. The GC time on the slave instance continued to grow rapidly until it reached almost 1 minute per minute, at which point the process is essentially hanging - see attached screenshot.

13:22:43 - 13:25:43 - MASTER instance stopped responding to write requests. All requests sent to the database during 3 minutes between 13:22:43 and 13:25:43 returned at 13:25:44 with Exception:

[ObjectDB 2.3.7_08] javax.persistence.PersistenceException
Connection is closed (error 526)
        at com.objectdb.jpa.EMImpl.merge(EMImpl.java:456)
        at sun.reflect.GeneratedMethodAccessor528.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.springframework.orm.jpa.SharedEntityManagerCreator$SharedEntityManagerInvocationHandler.invoke(SharedEntityManagerCreator.java:240)
..............................................................
Caused by: com.objectdb.o.UserException: Connection is closed
        at com.objectdb.o.MSG.d(MSG.java:61)
        at com.objectdb.o.NTS.v(NTS.java:204)
        at com.objectdb.o.RPT.f(RPT.java:220)
        at com.objectdb.o.RPT.Vp(RPT.java:203)
        at com.objectdb.o.WSN.h(WSN.java:930)
        at com.objectdb.o.WSN.UC(WSN.java:238)
        at com.objectdb.o.STC.m(STC.java:346)
        at com.objectdb.o.SHN.ae(SHN.java:410)
        at com.objectdb.o.SHN.K(SHN.java:139)
        at com.objectdb.o.HND.run(HND.java:133)
        at java.lang.Thread.run(Unknown Source)

The reading continued to work and there was nothing in the server DB logs.

All write requests sent to master after this time have failed with the same exception.

13:25:59 - the slave closed it's database - the following was in the logs:
[2012-04-04 19:25:59 #4 store]
Database '/opt/objectdb/db/$replication/contextspace.odb' is closed by 30752@hardhead

We did not see anything wrong in the monitoring data for the master instance. We do collect quite a lot of monitoring data, so if you would like to see anything specific, please let me know.

The load was not particularly high and there was nothing unusual otherwise.

This was a production outage, so the focus was on restoring the service and we did not take heap dumps etc. We cannot reproduce the issue.

 

The main questions we have are:

 - Is it possible for the slave instance to impact the master instance in any way? 

 - Under which circumstances would the master DB switch to read-only mode?

 

Natalia.

#2

Thank you for this report.

> Is it possible for the slave instance to impact the master instance in any way?

It shouldn't, and this is unexpected.

> Under which circumstances would the master DB switch to read-only mode?

The master server should never switch to read-only mode so this is also unexpected.

Is it possible that actually the failure started on the master? In that case, if you use a combined master/slave url requests to the master all the requests are automatically routed to the slave, and the expected effect matches your description - READ requests are executed and WRITE requests are rejected.

Maybe even the high GC on the slave is the result of all the WRITE requests that it receives (maybe the refusal to handle these WRITE requests takes too much resources, maybe because of repeating attempts?)

As you know, master failure currently requires manual intervention to get out of READ ONLY mode, so your description may match such situation. Since there are no messages in the master log, maybe this is not exactly a maser failure but a failure to access the master from the client web application (e.g. due to a networking issue).

ObjectDB Support
#3

>  Is it possible that actually the failure started on the master?

It is possible that the failure started on the master, however we don't have any indication of failure on master prior to the GC increase on the slave. We have load on the server all the time, so we would see the failure straight away. 

> In that case, if you use a combined master/slave url
> requests to the master all the requests are automatically routed to the slave, and the expected effect matches your
> description - READ requests are executed and WRITE requests are rejected.

Firstly, we do not use combined master/slave url in our client as read-only access is not sufficient for our needs. The only reason for running the slave instance is to have a live copy of the data with assured data integrity.  

Secondly, after the slave went down we tried to restart it, but got an error similar to "Failed to synchronise the database", so we stopped it and did not start again. So, it was definitely master that went into read-only mode, the slave process was not even running.

>Maybe even the high GC on the slave is the result of all the WRITE requests that it receives (maybe the refusal to handle
> these WRITE requests takes too much resources, maybe because of repeating attempts?)
As I mentioned, our clients are not configured to connect to slave at all. Also, the GC issue started before we had any indication of failure. We had successfull write requests in the logs up to 13:22:42. All our servers have synchronised time.

> As you know, master failure currently requires manual intervention to get out of READ ONLY mode, so your description
> may match such situation. Since there are no messages in the master log, maybe this is not exactly a maser failure but
> a failure to access the master from the client web application (e.g. due to a networking issue).

Any networking issue would impact write and read requests equally, so it is highly unlikely.

Could you please explain how the slave communicates with the master to obtain it's updates? It seems to be some sort of RPC. Is it pull from the slave or push from the master? Any details you can share would be helpful.

Regards,

Natalia.

 

 

#4

Thank you for the additional details, which certainly rule out some options.

The replication is implemented by a simple open socket between the master and the slave. The slave keeps listening for updates from the master and applies any update when it is received. It is hard to see how a failure of the slave can have any effect on the master and affecting clients that are not connected to the slave is even more weird.

The exception that you get doesn't indicate a READ ONLY mode but an attempt to use a connection to the server (EntityManager) that was closed. Maybe some read operations still work due to cache on the client side? i.e. is it possible that the server didn't just switch to READ ONLY mode but the connections have been totally closed?

ObjectDB Support
#5

Thank you for this important report.

Build 2.3.7_14 should fix master server failures in such cases.

ObjectDB Support

Reply