Issue #2075: Database access error, Doctor hanging on trying to repair, production shutdown at our biggest customer

Type: Bug Report

Version: 2.7.1

Priority: Normal

Status: Closed

Replies: 15

hgzwicker

Joined on 2014‑04‑09

we have a database shutdown, logs are showing:

[2017-07-10 21:07:23 #1 store]
Database 'F:\Hummingbird\Objectdb\db\coreSystemDb.odb' is opened by 11028@wzbhb101

[2017-07-10 21:07:23 #2 server]
Server on port 3333 has started by 11028@wzbhb101

[2017-07-14 07:08:38 #3 store]
SectionClassifier: SectionClassifier{2->merger[2699]-missing:1}

[2017-07-14 07:09:08 #4 store]
SectionClassifier: SectionClassifier{2->merger[2699]-missing:1}

[2017-07-14 07:09:37 #5 store]
SectionClassifier: SectionClassifier{194359909->merger[2699]-missing:1}

[2017-07-14 07:16:18 #6 store]
SectionClassifier: SectionClassifier{2->merger[2699]-missing:1}

on running Doctor it hangs after reading in the database, just no activity for hours (0 processor usage), no error message, nothing

support

Joined on 2010‑05‑03

What version of ObjectDB are you using?

Any additional information related to this incident would help.

If you need to repair the database (i.e. it is not a test database but contains useful data) and the Doctor cannot help, you may send it to us (in a private support ticket). This could also help us fix the Doctor to handle this situation in the future.

ObjectDB Support

hgzwicker

Joined on 2014‑04‑09

we are using 2.7.1_01 in our app, the doctor is the version from March 2016. Database file is around 35 GByte, how can we handle that?

support

Joined on 2010‑05‑03

Is it the Doctor of version 2.7.0_01? If not, can you try it?

Can you provide a link (possibly in a support ticket) for downloading the compressed database file?

ObjectDB Support

hgzwicker

Joined on 2014‑04‑09

it is doctor from 2.6.7, we'll try the new version

support

Joined on 2010‑05‑03

OK. If fixing still fails please share the database and we will try to repair it as soon as possible.

We should discuss the possible causes of this crash, but the first priority should probably be to repair the database so that the system can work again.

ObjectDB Support

hgzwicker

Joined on 2014‑04‑09

doctor is running, still repairing, these are the messages posted so far (any idea already how things like these can happen? how can we go on investigating and fixing?)

...

*** m_filePos -> 3012019718
*** m_filePos -> 3012085254
*** m_filePos -> 3012150788
*** m_filePos -> 3012216324
*** m_filePos -> 3012281860
*** m_filePos -> 3012347396
*** m_filePos -> 3012412932
*** m_filePos -> 3012478468
*** m_filePos -> 3012544004
*** m_filePos -> 3012609540
*** m_filePos -> 3012675076
*** m_filePos -> 3012740612
*** m_filePos -> 3012806148
*** m_filePos -> 3012871684
*** m_filePos -> 3012937220
*** m_filePos -> 3013002756
*** m_filePos -> 3013068292
*** m_filePos -> 3013133825
*** m_filePos -> 3013199361
*** m_filePos -> 3013264897
*** m_filePos -> 3013330433
*** m_filePos -> 3013395969
*** m_filePos -> 3013461505
*** m_filePos -> 3013527038
*** m_filePos -> 3013592574
*** m_filePos -> 3013658110
*** m_filePos -> 3013723646
*** m_filePos -> 3013789182
*** m_filePos -> 3013854718
100%

Global Value Errors
-------------------
[1] Unexpected total object count: 56960598 (expected 56960597)

BTree Value Errors
------------------
[1] com.agile.hummingbird.ObjectProperty
- Unexpected object count: 49830620 (actual 49830619)

Page Content Errors
-------------------
[1] Page #16748245 has a non first section entry (at entry 1).

Index Errors
------------
[1] Index com.agile.hummingbird.ObjectProperty[doubleValue] requires rebuild.
- has 0 entries instead of 1
[2] Index com.agile.hummingbird.ObjectProperty[name] requires rebuild.
- has 0 entries instead of 1
[3] Index com.agile.hummingbird.ObjectProperty[state] requires rebuild.
- has 0 entries instead of 1
[4] Index com.agile.hummingbird.ObjectProperty[type] requires rebuild.
- has 0 entries instead of 1
[5] Index nds:com.agile.hummingbird.ObjectProperty[name,state,doubleValue] requires rebuild.
- has 0 entries instead of 1

Large Sections with Errors
--------------------------
Group #1:
Page#7880349 5:194359909 0/2 0+1954/2699

...

support

Joined on 2010‑05‑03

The main error message may possibly reflect a corruption due to a critical issue that was fixed in version 2.7.1.

You wrote that the system is using version 2.7.1_01. Could you please double check it (sometimes the old objectdb.jar file remains in the classpath and the wrong version is used). When did you switch to that version and what version was used before?
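One way to double check for a leftover jar is to list every classpath entry whose file name mentions ObjectDB. The following is only a minimal sketch: the class name `ClasspathCheck` and the file name match on "objectdb" are assumptions made for illustration, and it only inspects the `java.class.path` system property, so it will not see jars loaded by an application server or bundled inside another archive.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ClasspathCheck {

    /**
     * Returns every classpath entry whose file name contains "objectdb"
     * (case-insensitive). Directory names are ignored on purpose, so an
     * installation folder called "Objectdb" does not produce a false match.
     */
    static List<String> findObjectDbEntries(String classpath, String separator) {
        List<String> matches = new ArrayList<>();
        for (String entry : classpath.split(Pattern.quote(separator))) {
            String fileName = new File(entry).getName().toLowerCase();
            if (fileName.contains("objectdb")) {
                matches.add(entry);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> entries = findObjectDbEntries(
                System.getProperty("java.class.path"), File.pathSeparator);
        entries.forEach(System.out::println);
        if (entries.size() > 1) {
            // Two jars usually means an old version may still be picked up first.
            System.out.println("Warning: more than one ObjectDB jar on the classpath");
        }
    }
}
```

Running this inside the application's own JVM (same launch script, same classpath) is what makes the result meaningful. In a Maven build, `mvn dependency:tree` gives a complementary view of which version is resolved.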

Of course, if the corruption already happened then switching to version 2.7.1 wouldn't fix the error automatically, so running the Doctor is required. Hopefully the database can be repaired with no data loss.

The Doctor report also mentions index issues, which are probably not serious, as indexes will be rebuilt. Are these new indexes? If ObjectProperty has millions of objects then the numbers (0, 1) in these messages are unclear.

ObjectDB Support

hgzwicker

Joined on 2014‑04‑09

we are using 2.7.1_01, there have been issues with older versions with the concurrent read queries that have been solved for us with that version. We double checked the version and also use Maven

we did already run Doctor once before, right after switching to the latest version

there are no new indexes, ObjectProperty has around 50 million objects, I have no idea where these numbers (0,1) are coming from.

in a few minutes we will have the rar of the corrupt database, do you have an FTP server where we could transfer it to?

#10

support

Joined on 2010‑05‑03

What is the size of the file after compression? If you can put it on the Internet and provide a link it would be easier as arranging FTP would take some time. Did the Doctor finish?

ObjectDB Support

#11

hgzwicker

Joined on 2014‑04‑09

doctor is done :)

login to our extranet at hummingbird-systems.com using objectdb twice, there is a main menu objectdb, and the rar, download it

#12

support

Joined on 2010‑05‑03

The database file was downloaded. You may want to block access to it now.

ObjectDB Support

#13

support

Joined on 2010‑05‑03

After many hours of post mortem analysis, the cause of this database corruption is still unclear.

There is one database page with a missing entry, which is part of one large ObjectProperty object of 2699 bytes that was split across two database pages. All the other error messages in the Doctor report reflect this issue.

Build 2.7.1 includes a fix for a problem that may cause such errors, but according to your report this database was not used with earlier versions. Maybe it was built with the 2.6.7 Doctor? But even then it is unlikely that the problem could be caused by the Doctor, although strangely, it seems that the problematic database page has not been changed since it was created by the Doctor.

Any additional information may help in preventing similar issues in the future. For example, if you have the original database that was created by the Doctor in May when you switched to 2.7.1_01 it could help in checking if the issue happened then or later.

If you could run the system for a while with recording enabled, then if this happens again we may be able to reproduce the problem by running the recorded transactions.

ObjectDB Support

#14

hgzwicker

Joined on 2014‑04‑09

to clarify the history of this database:

- it was created and used with version 2.6.7

- then we updated our system to 2.7.1_01, but used, for sure our fault, doctor 2.6.7 directly after that to clean up

- then we had the crash last week

Now my question is:

the fix in 2.7.1 that you describe, is that only active if a doctor 2.7.1 was run before ? Are we after the doctor 2.7.1 repair of last week in a state where the fix is active ?

we are probably not able to use the recording as the system is permanently under heavy load. Recording affects the performance significantly, right ?

#15

support

Joined on 2010‑05‑03

> the fix in 2.7.1 that you describe, is that only active if a doctor 2.7.1 was run before ? Are we after the doctor 2.7.1 repair of last week in a state where the fix is active ?

The fix is active as soon as you use 2.7.1, but if the database is already corrupted (and this may stay hidden until your application tries to access the corrupted object) then it will not help, as it only prevents further corruption.

It does seem that the corrupted page has not been changed since its creation (every page keeps information on the last transaction in which it was modified). Is it possible that you had this corruption for 2 months, but it was found only now?

> we are probably not able to use the recording as the system is permanently under heavy load. Recording affects the performance significantly, right ?

If you write the recording to a different fast disk then the effect may not be that bad but it has to be checked.

Another thing that you can do is frequent online backups followed by a Doctor check of the backup files, at least daily, to verify that the database is healthy and find errors as soon as possible.
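The Doctor check of a backup file could be scripted along the following lines. This is a rough sketch under stated assumptions: the class name `BackupCheck` and all paths are hypothetical, the backup file is assumed to have been produced already by a separate step, and the Doctor is started in diagnosis mode with a single database file argument. Whether the Doctor's exit code reflects the errors it finds is not something this sketch relies on, so the report is written to a file for inspection.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class BackupCheck {

    /** Builds the command line for a diagnosis-only Doctor run on one file. */
    static List<String> doctorCommand(Path objectdbJar, Path backupFile) {
        return Arrays.asList(
                "java", "-cp", objectdbJar.toString(),
                "com.objectdb.Doctor", backupFile.toString());
    }

    /** Runs the Doctor, sends its output to reportFile, returns the exit code. */
    static int runDoctor(Path objectdbJar, Path backupFile, Path reportFile)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(doctorCommand(objectdbJar, backupFile));
        pb.redirectErrorStream(true);           // merge stderr into the report
        pb.redirectOutput(reportFile.toFile()); // keep the report for review
        return pb.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.out.println("usage: BackupCheck <objectdb.jar> <backup.odb> <report.txt>");
            return;
        }
        int exit = runDoctor(Paths.get(args[0]), Paths.get(args[1]), Paths.get(args[2]));
        System.out.println("Doctor finished with exit code " + exit
                + ", report written to " + args[2]);
    }
}
```

Scheduling this once a day against the newest backup, and using the Doctor from the same ObjectDB version as the running system, keeps the check off the production file and narrows down when a corruption first appeared.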

ObjectDB Support

#16

hgzwicker

Joined on 2014‑04‑09

for sure this is possible
