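Today we had a WebSphere crash. The cause was backup software holding the IBM WebSphere MQ checkpoint file open (/var/mqm/qmgrs/<queue manager name>/amqalchk.fil).
This matters because the MQ process opens the "amqalchk.fil" file with the operating system's O_NSHARE flag. The checkpoint file is opened exclusively so that two queue managers cannot have the same checkpoint file open at once. If two queue managers opened the same checkpoint file simultaneously, the log could be corrupted, and a queue manager with a corrupted log cannot start.
The original explanation from the English documentation: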
The file system must obey requests to lock files either using O_NSHARE on the open() call or fcntl() with F_SETLK or F_SETLKW. These calls are used to protect recovery log extent files from concurrent access by two instances of the same queue manager. Without these locking semantics, there is the risk that two processes will write to the same log extent file and result in a corrupted queue manager which cannot be restarted.
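To make the locking behaviour concrete, here is a minimal sketch of the fcntl() path mentioned in the quote, written in Python. It is not MQ's actual code: the function names and the demo file are hypothetical, and O_NSHARE (an AIX open() flag) has no direct Python equivalent, so only the F_SETLK-style lock is shown. Note also that fcntl() locks are advisory on most systems, so they only stop other processes that request a lock themselves.

```python
import fcntl
import os
import tempfile


def try_exclusive_lock(path):
    """Open `path` and try to take a non-blocking exclusive lock on it.

    Returns the open file descriptor on success, or None if another
    process already holds a conflicting lock. (Hypothetical helper.)
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # lockf() wraps the fcntl() locking calls; LOCK_NB makes the
        # request fail immediately instead of waiting for the lock.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return None
    return fd


def second_process_can_lock(path):
    """Fork a child that tries to lock `path`; True if the child got it."""
    pid = os.fork()
    if pid == 0:
        # Child: report the result through the exit status, then exit
        # (exiting releases any lock the child acquired).
        fd = try_exclusive_lock(path)
        os._exit(0 if fd is not None else 1)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0
```

While one process holds the descriptor returned by try_exclusive_lock(), second_process_can_lock() on the same path returns False; after the holder closes the descriptor it returns True. A process that treats a None result as fatal and exits mirrors the fail-fast behaviour described in this post.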
How often MQ takes a checkpoint: By default a checkpoint will be taken after every 10,000 persistent MQGET and/or MQPUT operations, or every 30 minutes, if at least 100 messages have been GET and/or PUT, which ever occurs first. Additionally, a check point will also be taken when the ENDMQM, or the RCDMQMIMG commands have been performed.
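The default rule quoted above can be modelled as a small predicate. This is only an illustration of the documented thresholds, not MQ's implementation: the function and parameter names are hypothetical, a single counter stands in for both the persistent-operation count and the message count, and the checkpoints triggered by ENDMQM and RCDMQMIMG are left out.

```python
def checkpoint_due(ops_since_checkpoint, seconds_since_checkpoint,
                   max_ops=10_000, interval_seconds=30 * 60, min_ops=100):
    """Return True if the quoted default rule would trigger a checkpoint.

    All names and the single-counter simplification are assumptions made
    for illustration only.
    """
    # Condition 1: 10,000 operations since the last checkpoint.
    if ops_since_checkpoint >= max_ops:
        return True
    # Condition 2: 30 minutes elapsed AND at least 100 messages handled.
    return (seconds_since_checkpoint >= interval_seconds
            and ops_since_checkpoint >= min_ops)
```

For example, checkpoint_due(100, 1800) is True, while checkpoint_due(99, 1800) is False because the 100-message minimum is not met. On a busy queue manager the checkpoint file is therefore touched often, which makes a clash with anything else holding the file more likely.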
So when the queue manager cannot open the checkpoint file exclusively, it shuts itself down to avoid corrupting the log files.
For the full documentation, see: https://www.ibm.com/support/pages/checkpoints-websphere-mq-logging