Consistent OD replica corruption after 1 minute
I have two identical OD servers, a master and a replica, both running 10.9.5. About a week ago, authentication to the OD failed completely and slapd -Tt showed that the database was corrupted. I fixed the problem by executing:
db_recover -cv -h /var/db/openldap/openldap-data/
and
db_recover -cv -h /var/db/openldap/authdata/
Authentication to the master resumed, and everything there is working. However, the corruption remained on the replica. I tried waiting for replication and even forcing replication, but neither fixed the problem. I eventually fixed the replica by running the same commands on it. The problem comes back, though, like clockwork, 1 minute after each repair. I can tell that the repair itself works because I have a process on another machine that queries both ODs on a schedule. Both queries succeed, and then 1 minute after the fix, the query to the replica begins to fail. Here is the error from slapd -Tt:
559abc16 bdb(dc=ohaephqxs001,dc=aepsc,dc=com): PANIC: fatal region error detected; run recovery
559abc16 bdb_db_open: database "dc=ohaephqxs001,dc=aepsc,dc=com" cannot be opened, err -30974. Restore from backup!
559abc16 backend_startup_one (type=bdb, suffix="dc=ohaephqxs001,dc=aepsc,dc=com"): bi_db_open failed! (-30974)
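In case it helps anyone line these messages up with opendirectoryd.log: as far as I know, the leading 559abc16 on each line is the time in seconds since the epoch, written in hex. Here is a minimal sketch for decoding it. The name hex_to_epoch is just a helper I made up, and the hex-epoch reading of the prefix is my assumption.

```shell
#!/bin/sh
# hex_to_epoch is a hypothetical helper, not part of any OD or OpenLDAP tool.
# Assumption: the log prefix is Unix epoch seconds written in hexadecimal.
hex_to_epoch() {
    printf '%d\n' "0x$1"
}

hex_to_epoch 559abc16    # prints 1436204054

# On OS X the decimal value can then be shown as a readable UTC time with:
#   date -u -r 1436204054
```

With the decoded time in hand, it should be easier to search the other logs for entries in the seconds just before the PANIC.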
I've tried looking at opendirectoryd.log and setting OD logging to debug, but I can't see anything in the mass of logs that points to the problem. I have no idea why the corruption is happening so regularly. Replication doesn't happen that often, does it? Any suggestions on where to look are appreciated.
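To narrow down which log entries matter, a scheduled check that prints a timestamp with every result gives an exact failure time to search around. The following is only a rough sketch of that kind of check, not my exact script. The hostnames and base DN are placeholders, check_od is a made-up name, and the LDAPSEARCH variable exists only so the command can be swapped out.

```shell
#!/bin/sh
# Sketch of a timestamped OD check. Hostnames and base DN are placeholders.
LDAPSEARCH="${LDAPSEARCH:-ldapsearch}"   # overridable, e.g. for a dry run

# check_od HOST BASEDN
# Does an anonymous base-scope search for the base DN itself (attribute
# list "1.1" asks for no attributes) and prints a timestamped OK/FAIL line.
check_od() {
    if "$LDAPSEARCH" -x -LLL -H "ldap://$1" -s base -b "$2" 1.1 >/dev/null 2>&1; then
        status=OK
    else
        status=FAIL
    fi
    echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') $1 $status"
}

# Example: poll both servers every 10 seconds.
#   while :; do
#       check_od od-master.example.com  "dc=example,dc=com"
#       check_od od-replica.example.com "dc=example,dc=com"
#       sleep 10
#   done
```

The first FAIL line for the replica marks the moment to look for in opendirectoryd.log and the system log.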
Thanks.
Xserve, OS X Mavericks (10.9.5)