Issue #482: Out of Memory

#1

felixobjectdb

Joined on 2011‑02‑10

This is a continuation of Issue 61. Unfortunately, it doesn't seem that the fix in 2.2.9_03 has fixed the problem.

The scenario is still the same:

I have two "producer" processes which insert messages into an objectdb "queue" database running with an objectdb server process.

I have one "consumer" process which reads messages from the queue database and inserts these into an embedded "normal" objectdb database. This is the process which has generated the OOM exception.

Each "queue" database is actually 2 databases, to separate the metadata used for searching from the data itself. So in total the "consumer" process is connected to 4 server-hosted databases and 9 embedded databases (the embedded databases are 3 "queue" databases and 3 normal databases).

I'm running with the following cache settings:

<processing cache="4mb" max-threads="10"/>
<query-cache results="4mb" programs="10"/>

The objectdb server process runs with -Xmx1g and the "consumer" runs with -Xmx2g. Message size is approx 1k and there should never be more than 2 messages being consumed at the same time.

In this case I've had 2 separate instances fail, so I have 2 heap dumps. I'm working on getting these to you - the files are large and ftp is disabled by my organisation - so please bear with me on this. In the meantime I've attached the leak suspects reports from Eclipse Memory Analyzer in the hope they may help.

Please let me know if you need any further information (except the dumps themselves!)

java_pid5468_Leak_Suspects.zip (70.5 KB)

java_pid9212_Leak_Suspects.zip (66.7 KB)

#2

support

Joined on 2010‑05‑03

The fix of issue 61 fixed an important memory leak but unfortunately it seems that it is not the only one. Hopefully the new memory leak can also be fixed after exploring the heap dumps.

It might be better to send a heap dump of your application after 2-3 days (before the OutOfMemoryError) instead of after 2-3 weeks. It may already indicate a problem, and it would be easier to transfer and diagnose.
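One way to judge when an early dump is worth taking is to log heap usage periodically and watch for a steady upward trend. The following is only a sketch, not something from this thread: the class name `HeapUsageSampler` and its methods are hypothetical, and it relies only on the standard `java.lang.management` API, which is available from Java 5.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

/** Hypothetical helper (not part of ObjectDB) that reports current heap usage. */
class HeapUsageSampler {

    /** Returns used heap as a percentage of max heap, or -1 if max is undefined. */
    static int usedPercent(long used, long max) {
        if (max <= 0) {
            return -1; // MemoryUsage.getMax() is -1 when no maximum is defined
        }
        return (int) (used * 100 / max);
    }

    /** Builds one log line describing the heap at the time of the call. */
    static String sample() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        return "heap used=" + heap.getUsed()
                + " max=" + heap.getMax()
                + " (" + usedPercent(heap.getUsed(), heap.getMax()) + "%)";
    }

    public static void main(String[] args) {
        // In a real process this would be called from a timer, e.g. once an hour.
        System.out.println(sample());
    }
}
```

If the logged percentage keeps climbing across garbage collections over a few days, that is a reasonable point at which to capture a dump.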

ObjectDB Support

#3

felixobjectdb

Joined on 2011‑02‑10

Yes - I'm working on getting into a position to be able to do that as well. The bank infrastructure makes it difficult, but so does the fact that we're using Java 1.5 on Windows, so the options for getting a dump are more limited.

I'm hoping to get a port opened up to the ftp site you sent through before - once this is done life will be much easier.

#4

thumbripper

Joined on 2011‑06‑18

The heap dumps have now been uploaded to the objectdb ftp site.

Files are:

java_pid5468.zip

java_pid9212.zip

#5

felixobjectdb

Joined on 2011‑02‑10

We've had another instance of the same problem overnight so I have a third heap dump I can send through if you think it'd be useful? It looks much the same as the others though.

In the meantime I've attached the Leak Suspects report for it.

java_pid5568_Leak_Suspects.zip (65.5 KB)

#6

support

Joined on 2010‑05‑03

In the Class Histogram page in the last zip there are 108,032 com.objectdb.o.SNP instances.

This is not normal and information about who holds most of these instances (i.e. a path to the root) may help.

ObjectDB Support

#7

felixobjectdb

Joined on 2011‑02‑10

Ok, I've uploaded the 3rd heap dump now.

I had some trouble with the connection so it's divided into a 4-part rar archive named java_pid5568.*.rar

#8

support

Joined on 2010‑05‑03

Thank you. These heap dumps were very useful.

It seems that there is a problem with some heavy objects that ObjectDB manages using reference counting. In some unknown situation the reference count drops to -1 and then the objects are pinned. Unfortunately, after carefully checking all the locations in which the counters are increased and decreased (mostly in safe try-finally blocks), I still don't know where the bug is.

Please try build 2.2.9_07. It should solve the problem by blocking an attempt to decrease the reference count below 0.

In addition, on such an attempt it writes an error message ("Negative snapshot user count") with a stack trace to the log. If you can send me the stack trace from the log, it may help in locating the reference counting bug.
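The kind of guard described above can be sketched as follows. This is a minimal illustration and not ObjectDB's actual code: the class `GuardedUserCount` and its methods are hypothetical, and only the log message text is taken from this thread. A release at count 0 is refused, and the caller's stack trace is logged so the unbalanced release can be traced.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical reference-counted resource; not an ObjectDB class. */
class GuardedUserCount {
    private static final Logger LOG =
            Logger.getLogger(GuardedUserCount.class.getName());

    private final AtomicInteger users = new AtomicInteger(0);

    /** Registers one more user and returns the new count. */
    int acquire() {
        return users.incrementAndGet();
    }

    /**
     * Releases one user. If the count is already 0 the decrement is refused
     * and a stack trace is logged. Returns true if the count was decremented.
     */
    boolean release() {
        while (true) {
            int current = users.get();
            if (current <= 0) {
                // The exception is created only to capture the caller's stack trace.
                LOG.log(Level.SEVERE, "Negative snapshot user count",
                        new IllegalStateException("release() without matching acquire()"));
                return false;
            }
            if (users.compareAndSet(current, current - 1)) {
                return true;
            }
            // Another thread changed the count in between; retry.
        }
    }

    int count() {
        return users.get();
    }
}

class GuardedUserCountDemo {
    public static void main(String[] args) {
        GuardedUserCount snapshot = new GuardedUserCount();
        snapshot.acquire();
        try {
            // ... use the resource ...
        } finally {
            snapshot.release(); // balanced release, count returns to 0
        }
        snapshot.release();     // unbalanced: refused and logged
        System.out.println(snapshot.count()); // prints 0
    }
}
```

The compare-and-set loop keeps the check and the decrement atomic, so two threads releasing at the same time cannot push the count below zero between them.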

ObjectDB Support

#9

felixobjectdb

Joined on 2011‑02‑10

Thanks for the update.

I'll let you know if/when we get the exception and/or further memory problems.

#10

felixobjectdb

Joined on 2011‑02‑10

Leading on from this I've been thinking about whether our approach to the "queue" databases is the most efficient.

Currently a queue database is represented by 2 databases under the covers: one for small metadata objects which are used for searching the queue (oldest message, highest priority, etc.) and one for the message data, which can be anything from 1k to 10mb.
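To make the split concrete, here is a small in-memory model of it. Everything in it is an assumption made for illustration: the names `MessageMeta` and `SplitQueueSketch`, the ordering rule (highest priority first, then oldest), and the use of plain Java collections in place of the two databases. It does not use any ObjectDB API.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/** Small, searchable record; stands in for an object in the metadata database. */
final class MessageMeta {
    final long id;
    final int priority;    // assumption: higher value means more urgent
    final long enqueuedAt; // sequence number, lower means older

    MessageMeta(long id, int priority, long enqueuedAt) {
        this.id = id;
        this.priority = priority;
        this.enqueuedAt = enqueuedAt;
    }
}

/** Hypothetical model of a queue split into a metadata store and a data store. */
class SplitQueueSketch {
    // Metadata "store": highest priority first, then oldest first.
    private final PriorityQueue<MessageMeta> meta =
            new PriorityQueue<MessageMeta>(11, new Comparator<MessageMeta>() {
                public int compare(MessageMeta a, MessageMeta b) {
                    if (a.priority != b.priority) {
                        return a.priority > b.priority ? -1 : 1;
                    }
                    return a.enqueuedAt < b.enqueuedAt ? -1
                            : (a.enqueuedAt > b.enqueuedAt ? 1 : 0);
                }
            });

    // Data "store": large payloads keyed by message id, touched only on consume.
    private final Map<Long, byte[]> data = new HashMap<Long, byte[]>();

    private long sequence = 0;

    /** Stores the payload and its metadata; returns the new message id. */
    long enqueue(int priority, byte[] payload) {
        long id = ++sequence;
        data.put(id, payload);
        meta.add(new MessageMeta(id, priority, id));
        return id;
    }

    /** Returns the payload of the next message, or null if the queue is empty. */
    byte[] consume() {
        MessageMeta next = meta.poll();
        return next == null ? null : data.remove(next.id);
    }
}
```

Searching only ever touches the small `MessageMeta` objects; a payload is loaded once, when its message is actually consumed.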

We've proved the benefits of keeping the metadata & data separate in past testing but would it be worth changing our setup to a single database containing 2 tables representing the metadata and data?

I assume managing one database would be more lightweight, but presumably we'd lose any benefits of caching around the metadata as soon as a large object is loaded? Are there any other considerations here? Which would be the preferred approach, or would they be roughly equivalent?

Obviously we'd test & compare both approaches so I'm not looking for a definitive answer, rather just guidance around the pros and cons. I guess this could be turned into a more generic question about the relative merits of large databases with many tables vs. many small databases?

Thanks for your help.

#11

support

Joined on 2010‑05‑03

This is an interesting question, but as you wrote, an answer will require a live test.

Another idea to check: keeping separate databases but also separating the servers, with a different configuration (page cache size, query cache size) for the metadata server and the data server.

ObjectDB Support

#12

support

Joined on 2010‑05‑03

Any news regarding this issue?

Is there a "Negative snapshot user count" error message and stack trace in the logs?

ObjectDB Support

#13

felixobjectdb

Joined on 2011‑02‑10

No update I'm afraid - everything has been working fine since the last build. Suggest you close this issue and I'll raise a new one if/when the exception appears.

#14

support

Joined on 2010‑05‑03

OK. Thanks.

ObjectDB Support

