Issue #559: Blocked by find

Type: Bug ReoprtPriority: HighStatus: FixedReplies: 23
#1

I have a container object which wraps an objectdb database. There is a "putIfAbsent(String id, Object obj)" method which is synchronized and attempts to load an object by id using the EntityManager "find" method. If nothing is returned from the db the passed object is persisted to the db. I have 5 threads which use this method.

In a run today one of the threads seems to have got stuck somewhere in the find. The thread dump looks like:

Name: DEV1-WSAgent-1-Existing/3
State: RUNNABLE
Total blocked: 109  Total waited: 80

Stack trace:
com.objectdb.o.PGC.C(PGC.java:484)
   - locked java.lang.Object@881d38
com.objectdb.o.PGC.x(PGC.java:269)
com.objectdb.o.PGC.w(PGC.java:230)
com.objectdb.o.SNP.B(SNP.java:348)
com.objectdb.o.RTT.l(RTT.java:127)
com.objectdb.o.RST.l(RST.java:24)
com.objectdb.o.RTT.C(RTT.java:179)
com.objectdb.o.RST.r(RST.java:110)
com.objectdb.o.PGT.q(PGT.java:109)
com.objectdb.o.RST.A(RST.java:93)
com.objectdb.o.RTT.l(RTT.java:132)
com.objectdb.o.RST.l(RST.java:24)
com.objectdb.o.TSK.i(TSK.java:146)
com.objectdb.o.TSK.f(TSK.java:95)
com.objectdb.o.MST.aT(MST.java:582)
com.objectdb.o.MST.U2(MST.java:541)
com.objectdb.o.WRA.U2(WRA.java:260)
com.objectdb.o.LDR.w(LDR.java:381)
com.objectdb.o.LDR.v(LDR.java:292)
com.objectdb.o.LDR.s(LDR.java:210)
com.objectdb.o.OBC.aM(OBC.java:1056)
com.objectdb.o.OBC.aK(OBC.java:969)
com.objectdb.jpa.EMImpl.find(EMImpl.java:551)
   - locked com.objectdb.jdo.PMImpl@ae89a
com.objectdb.jpa.EMImpl.find(EMImpl.java:474)
rbccm.felix.objectdb.workflow.ObjectDbInstanceContainer.putIfAbsent(ObjectDbInstanceContainer.java:108)
   - locked rbccm.felix.objectdb.workflow.ObjectDbInstanceContainer@9c71a4
rbccm.digest.workflow.execution.WorkflowRunner.workflowCompleteForSomeTasks(WorkflowRunner.java:245)
rbccm.digest.workflow.execution.WorkflowRunner.workflowCompleteForSomeTasks(WorkflowRunner.java:204)
rbccm.felix.gridservice.workflow.WorkflowService.process(WorkflowService.java:125)
rbccm.felix.gridservice.workflow.WorkflowService.process(WorkflowService.java:97)
rbccm.felix.gridservice.workflow.WorkflowService.process(WorkflowService.java:75)
rbccm.felix.framework.service.ServiceRunner.run(ServiceRunner.java:259)
java.lang.Thread.run(Thread.java:662)

A dump from one of the blocked threads looks like:

Name: DEV1-WSAgent-1-Existing/4
State: BLOCKED on rbccm.felix.objectdb.workflow.ObjectDbInstanceContainer@9c71a4 owned by: DEV1-WSAgent-1-Existing/3
Total blocked: 190  Total waited: 69

Stack trace:
rbccm.felix.objectdb.workflow.ObjectDbInstanceContainer.putIfAbsent(ObjectDbInstanceContainer.java:102)
rbccm.digest.workflow.execution.WorkflowRunner.workflowCompleteForSomeTasks(WorkflowRunner.java:245)
rbccm.digest.workflow.execution.WorkflowRunner.workflowCompleteForSomeTasks(WorkflowRunner.java:204)
rbccm.felix.gridservice.workflow.WorkflowService.process(WorkflowService.java:125)
rbccm.felix.gridservice.workflow.WorkflowService.process(WorkflowService.java:97)
rbccm.felix.gridservice.workflow.WorkflowService.process(WorkflowService.java:75)
rbccm.felix.framework.service.ServiceRunner.run(ServiceRunner.java:259)
java.lang.Thread.run(Thread.java:662)

If you could advise on where the issue might be it'd be much appreciated. Please let me know if you need any further information.

#2

The stack trace is related to dilution of pages from the page cache when it becomes too loaded, but the reason for this exception is unclear.

ObjectDB Support
#3

Ok, thanks.

Another issue on a similar line. I have a simple database which contains one type of Entity. Access is through an application wrapper object with "put" and "get" methods - these are synchronised.

I'm frequently hitting a scenario where the application blocks with one thread in the wrapper and all other waiting. A dump of the thread which made it into the wrapper looks like:

Name: uatsrtlonw342-WSAgent-1-New/4
State: BLOCKED on com.objectdb.o.LKM@254e85f4 owned by: ODB-FileWriter
Total blocked: 789  Total waited: 616

Stack trace:
com.objectdb.o.MST.Va(MST.java:1251)
com.objectdb.o.WRA.Va(WRA.java:369)
com.objectdb.o.WSM.Va(WSM.java:174)
com.objectdb.o.WRA.Va(WRA.java:369)
com.objectdb.o.WSN.Va(WSN.java:653)
   - locked com.objectdb.o.WSN@95cb481
com.objectdb.o.OBM.bK(OBM.java:908)
com.objectdb.o.OBM.bI(OBM.java:813)
com.objectdb.o.OBM.bG(OBM.java:719)
com.objectdb.jpa.EMImpl.commit(EMImpl.java:276)
rbccm.felix.objectdb.workflow.ObjectDbDataContainer.save(Unknown Source)
   - locked rbccm.felix.objectdb.workflow.ObjectDbDataContainer@3b5ea7c0
rbccm.digest.workflow.serialisation.WorkflowSerialiser.read(Unknown Source)
rbccm.digest.workflow.serialisation.WorkflowSerialiser.readWorkflowInstance(Unknown Source)
rbccm.digest.workflow.serialisation.WorkflowSerialiser.fromCompressedXML(Unknown Source)
rbccm.felix.gridservice.workflow.WorkflowService.process(Unknown Source)
rbccm.felix.gridservice.workflow.WorkflowService.process(Unknown Source)
rbccm.felix.framework.service.ServiceRunner.run(Unknown Source)
java.lang.Thread.run(Thread.java:662)

I'm not sure how to read this but it looks like objectdb is blocked in the commit? Could you give me some information around what is happening at this point so I can try to avoid the block?

#4

It seems that this thread is waiting to enter a synchronized block. It may be a deadlock.

Can you print all relevant stack traces (by using CTRL-BREAK, JConsole, etc.)?

ObjectDB Support
#5

The attached file shows a dump for all the threads interacting with objectdb in the application.

#6

The attached file shows the jstack output for all threads in case I missed anything in the previous post.

The objects being stored in this db are quite varied in size from <1k to >1MB but the entity object is simple and looks like:

@Entity
public class DataHolder {
@Id private String id;
@Index private String containerId;
private String data;

    public DataHolder(String containerId, String id, String data) {
     this.containerId = containerId;
        this.id = id;
        this.data = data;
    }
   
    public String getId() { return id; }
    public String getData() { return data; }
    public String getContainerId() { return containerId; }
}
#7

The 2 relevant stack traces may indicate an infinite loop in purging pages from the cache, i.e. #1 and #3 indeed reflect the same problem.

"ODB-FileWriter" daemon prio=6 tid=0x000000000c042800 nid=0x25b0 runnable [0x000000000dd0f000]
   java.lang.Thread.State: RUNNABLE
at com.objectdb.store.file.PageCache.C(PageCache.java:454)
- locked <0x0000000780d54768> (a java.lang.Object)
at com.objectdb.store.file.PageCache.B(PageCache.java:417)
- locked <0x0000000780a0efa0> (a com.objectdb.o.PGC)
at com.objectdb.store.file.StoreFile.ah(StoreFile.java:991)
at com.objectdb.store.file.StoreFile.run(StoreFile.java:899)
- locked <0x0000000780a0efa0> (a com.objectdb.o.PGC)
- locked <0x0000000780d9df10> (a com.objectdb.o.LKM)
at java.lang.Thread.run(Thread.java:662)

   Locked ownable synchronizers:
- None

"uatsrtlonw342-WSAgent-1-New/4" daemon prio=6 tid=0x000000000bbbe800 nid=0x307c waiting for monitor entry [0x0000000010d1f000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at com.objectdb.store.MainStore.Va(MainStore.java:1251)
- waiting to lock <0x0000000780d9df10> (a com.objectdb.o.LKM)
at com.objectdb.store.wrapper.StoreWrapper.Va(StoreWrapper.java:369)
at com.objectdb.store.wrapper.SemaphoreStore.Va(SemaphoreStore.java:174)
at com.objectdb.store.wrapper.StoreWrapper.Va(StoreWrapper.java:369)
at com.objectdb.store.wrapper.StoreSniffer.Va(StoreSniffer.java:653)
- locked <0x0000000780a0edd8> (a com.objectdb.o.WSN)
at com.objectdb.pc.ObjectManager.bK(ObjectManager.java:908)
at com.objectdb.pc.ObjectManager.bI(ObjectManager.java:813)
at com.objectdb.pc.ObjectManager.bG(ObjectManager.java:719)
at com.objectdb.jpa.EMImpl.commit(EMImpl.java:276)
at rbccm.felix.objectdb.workflow.ObjectDbDataContainer.save(Unknown Source)
- locked <0x0000000780a0e958> (a rbccm.felix.objectdb.workflow.ObjectDbDataContainer)
at rbccm.digest.workflow.serialisation.WorkflowSerialiser.read(Unknown Source)
at rbccm.digest.workflow.serialisation.WorkflowSerialiser.readWorkflowInstance(Unknown Source)
at rbccm.digest.workflow.serialisation.WorkflowSerialiser.fromCompressedXML(Unknown Source)
at rbccm.felix.gridservice.workflow.WorkflowService.process(Unknown Source)
at rbccm.felix.gridservice.workflow.WorkflowService.process(Unknown Source)
at rbccm.felix.framework.service.ServiceRunner.run(Unknown Source)

I suspect that somehow a list of cached pages contains a circle since there are no other apparent reasons.

With no test case - revealing the cause of this circle (if this is the problem) may require few iterations of trying special ObjectDB builds that will log relevant findings.

ObjectDB Support
#8

If you run build 2.3.3_02 with the new temporary "objectdb.page.circle" system property set to "true" - a page circle will be detected as soon as it is created and related info will be logged.

ObjectDB Support
#9

See attached for a test which I think replicates the problem.

It is a simple persist loop which saves objects and seems to hang at around the 5000-6000 object mark with a cpu core used to maximum. I've also attached the jstack output which seems to show the same state.

I've also just noticed your response re: v. 2.3.3_02 - I will try this and get back to you asap.

#10

The test reached:

Persisting data 15000

and it is still running.

Which hardware / software are you using?

ObjectDB Support
#11

In addition, can you post your objectdb.conf file please?

ObjectDB Support
#12

Ok - it looks like the test may not consistently produce the problem. It hung on my first 2 runs but I've just tried it on 2.3.3_02 and it passed 10k persists without any trouble, so i retried 2.3.3 and it also passed 10k this time :( I'll keep working to recreate it.

My objectdb.conf is attached. It is as packaged but with processing and query cache sizes set to 0mb due to issues we've seen in the past with large objects.

My system is win xp service pack 3, jdk1.6.0_24

#13

Ok, the test is not completely worthless but it is massively inconsistent - sometimes it hangs after 1000 persists, sometimes it seems to run on past 10k persists without problems. I bumped up the object size (int payloadSize = 1000000;) and have had it hang a couple more times (see attached thread dump for example). The system property doesnt seem to have produced any extra logging though - I've checked that it is definitely set by doing a System.getProperty just before creating the EntityManagerFactory.

I'll keep trying to narrow down the problem.

#14

After about 10 attempts I gave up.

Since you got the exception without the new logging - this is probably not a page circle issue.

I just fixed a missing synchronized that may be related - please try build 2.3.3_03.

ObjectDB Support
#15

Fair enough - its not a great test...

I've gone straight to an application test with 2.3.3_03 as this shows the issue more consistently. Everything was looking ok until I hit the following exception:

2011-11-14 07:46:15,137 uatsrtlonw342-WSAgent-1-Existing/1 ERROR Workflow2Runner - Error sending callback event
rbccm.felix.framework.ApplicationException: Error flushing data container
at rbccm.felix.objectdb.workflow.ObjectDbDataContainer.removeContainerData(Unknown Source)
at rbccm.digest.workflow.manage.WorkflowManager.removeWorkflow(Unknown Source)
at rbccm.felix.gridservice.workflow.WorkflowService.workflowCallback(Unknown Source)
at rbccm.digest2.execution.Workflow2Runner.workflowCallback(Unknown Source)
at rbccm.digest2.execution.Workflow2Runner.workflowCallback(Unknown Source)
at rbccm.digest2.execution.Workflow2Runner.workflowCompleteforThisRunner(Unknown Source)
at rbccm.digest2.execution.Workflow2Runner.processGraph(Unknown Source)
at rbccm.digest2.execution.Workflow2Runner.processWorkflows(Unknown Source)
at rbccm.digest2.execution.Workflow2Runner.processWorkflow(Unknown Source)
at rbccm.digest2.execution.Workflow2Runner.workflowCompleteForSomeTasks(Unknown Source)
at rbccm.felix.gridservice.workflow.WorkflowService.process(Unknown Source)
at rbccm.felix.gridservice.workflow.WorkflowService.process(Unknown Source)
at rbccm.felix.gridservice.workflow.WorkflowService.process(Unknown Source)
at rbccm.felix.framework.service.ServiceRunner.run(Unknown Source)
at java.lang.Thread.run(Thread.java:662)
Caused by: com.objectdb.o.InternalException: Unexpected internal exception
at com.objectdb.o.JPE.h(JPE.java:163)
at com.objectdb.o.ERR.f(ERR.java:69)
at com.objectdb.o.OBC.onObjectDBError(OBC.java:1485)
at com.objectdb.jpa.JpaQuery.executeUpdate(JpaQuery.java:737)
... 15 more
Caused by: java.lang.NullPointerException
at com.objectdb.o.MST.ad(MST.java:1321)
at com.objectdb.o.MST.U9(MST.java:1230)
at com.objectdb.o.PRG.ah(PRG.java:908)
at com.objectdb.o.PRG.ad(PRG.java:607)
at com.objectdb.o.PRG.ac(PRG.java:536)
at com.objectdb.o.QRM.U4(QRM.java:259)
at com.objectdb.o.MST.U4(MST.java:947)
at com.objectdb.o.WRA.U4(WRA.java:290)
at com.objectdb.o.WSM.U4(WSM.java:113)
at com.objectdb.o.WRA.U4(WRA.java:290)
at com.objectdb.o.WSN.U4(WSN.java:420)
at com.objectdb.o.QRR.g(QRR.java:232)
at com.objectdb.o.QRR.b(QRR.java:151)
at com.objectdb.jpa.JpaQuery.executeUpdate(JpaQuery.java:722)
... 15 more

This occured when the app tried to clear data for a particular id. The code looks like:

   Query query = em.createQuery("DELETE FROM DataHolder d WHERE d.containerId = :containerId");
   query.setParameter("containerId", containerId);
   query.executeUpdate(); 
   em.getTransaction().commit();

Unfortunately this meant I could let it run for long enough to confirm the blocking issue is resolved.

#16

This new stack trace helped solving the "Negative snapshot user count" error (Issue #556).

Please try build 2.3.3_04 that should solve both errors.

ObjectDB Support
#17

Unfortunately it still looks to be hitting a deadlock in the application (see attached thread dump). I'll keep trying to find a test which replicates the problem consistently.

#18

Actually, this may just be slow performance rather than blocking.

The attached test writes data to a db. Then it starts up 2 groups of 10 threads, one group is persisting more data, the other group is deleting the first set of data via delete statements. Eventually the test seems to grid to a halt - it needs to run for about a minute to get to this state.

In the application the data sizes are larger so I guess the delete is slower and gives the appearance of being blocked.

 

#19

Using find & remove seems to hit similar problems. See attached test - you can switch between delete statements and find/remove by commenting/uncommenting the "deletethreads" blocks.

The aim of this database is to provide a data cache for the main application - access is typically a number of persists as a request enters the system, several gets as the request is processed, then a delete of all the data once the request is complete. If you have any suggestions on how this could be better structured to maximise objectdb performance it'd be appreciated.

#20

Extremely slow performance may be the result of using zero size page cache, which causes cache purge to be invoked too often. Please try build 2.3.3_05 that tries to reduce this frequency.

If it doesn't help, you may try a cache size of at least 8MB:

 

<processing cache="8mb" max-threads="10" />

Currently zero size cache is not really zero, since there is an internal minimum of 200KB.

ObjectDB Support
#21

Regarding performance:

  • Make sure you use enhancement.
  • Strings, particularly when used for IDs, are usually slower. Use numbers if possible. Even a composite primary key of 2 long values could be faster than a string (but that should be checked).
ObjectDB Support
#22

Setting the processing cache to 8mb definitely helps the load and remove method of deleting - the test now completes when using that method. Using the delete query still halts the test though.

My main app is using enhancement so we're ok on that front. String ids are part of the app (and legacy components) so are unfortunately difficult to change.

#23

Looking further into the performance of the load and remove method of deleting I'm a little confused by the effect of the processing cache. I expected that the larger I could make it, the better the performance should be.

However, using the test above and a cache of 32mb the test runs in approx 110s. Reducing the cache size to 8mb causes the test to run in approx 85s.

#24

Large cache doesn't always assure better performance. If the cache is not used (e.g. in code that only write new objects and never read) caching more data just burdens the JVM heap and the GC.

ObjectDB Support

Reply