linux-service-start-failed

记一次mariadb启动失败的问题定位。

背景

hans项目中,all-in-one的部署模式下,遇到一个问题:

mariadb.service start failed,原因是创建/var/run/mariadb/mariadb.pid失败

参考mariadb-start-failed-error可知,如何解决该问题。但是之前没定位到原因。今天研究了下原因,并记录如下。

过程

涉及到的知识如下:

  • service之间的依赖关系定义&查看
  • tmpfiles&systemd-tmpfiles-setup

首先创建pid file失败的原因本质是没有/var/run/mariadb这个dir,参考上面的解决方案,也是创建dir并chown权限为mysql用户&group。这就引出了对/var/run下面的dir是怎么创建&维护的研究。

systemd-tmpfiles

/var/run其实是个link,指向run,具体可以戳:即将消失的/var/run。其中比较重要的:

systemd-tmpfiles-setup其实是对systemd-tmpfiles的封装,会在boot的时候调用并按照tmpfiles.d目录下的配置来创建对应文件or目录。mariadb如果使用yum install,会在/usr/lib/tmpfiles.d下存在mariadb.conf,其内容有:

d /var/run/mariadb 0755 mysql mysql -

并通过验证,发现在mariadb start error的情况下,systemd-tmpfiles-setup.service也挂了。故锅就变成了为啥会影响到这个服务的启动。跟了一把/var/log/messages,从中看到如下内容

error1:

Feb  4 02:57:25 allinone-237 systemd-tmpfiles: Failed to parse ACL "d:group:adm:r-x,d:group:wheel:r-x": Invalid argument. Ignoring
Feb  4 02:57:25 allinone-237 systemd-tmpfiles: Failed to parse ACL "group:adm:r-x,group:wheel:r-x": Invalid argument. Ignoring
Feb  4 02:57:25 allinone-237 systemd-tmpfiles: Failed to parse ACL "d:group:adm:r-x,d:group:wheel:r-x": Invalid argument. Ignoring
Feb  4 02:57:25 allinone-237 systemd-tmpfiles: Failed to parse ACL "group:adm:r-x,group:wheel:r-x": Invalid argument. Ignoring

error2:

Feb  4 02:54:12 allinone-237 systemd-journald[115]: Received SIGTERM from PID 1 (systemd).
Feb  4 02:54:12 allinone-237 kernel: type=1403 audit(1486176851.284:2): policy loaded auid=4294967295 ses=4294967295
Feb  4 02:54:12 allinone-237 systemd[1]: Successfully loaded SELinux policy in 312.719ms.
Feb  4 02:54:12 allinone-237 kernel: ip_tables: (C) 2000-2006 Netfilter Core Team
Feb  4 02:54:12 allinone-237 systemd[1]: Inserted module 'ip_tables'
Feb  4 02:54:12 allinone-237 systemd[1]: Relabelled /dev and /run in 26.327ms.
Feb  4 02:54:12 allinone-237 systemd[1]: Found dependency on lvm2-monitor.service/start
Feb  4 02:54:12 allinone-237 systemd[1]: Found dependency on ics-storage-started.service/start
Feb  4 02:54:12 allinone-237 systemd[1]: Found dependency on basic.target/start
Feb  4 02:54:12 allinone-237 systemd[1]: Found dependency on sysinit.target/start
Feb  4 02:54:12 allinone-237 systemd[1]: Breaking ordering cycle by deleting job systemd-tmpfiles-setup.service/start
Feb  4 02:54:12 allinone-237 systemd[1]: Job systemd-tmpfiles-setup.service/start deleted to break ordering cycle starting with sysinit.target/start

error1通过如下验证可以看到:

[root@allinone-237 tmpfiles.d]# grep -rin "group:adm:r-x,group:wheel:r-x" /usr/lib/tmpfiles.d/
/usr/lib/tmpfiles.d/systemd.conf:29:A+ /run/log/journal/%m - - - - group:adm:r-x,group:wheel:r-x
/usr/lib/tmpfiles.d/systemd.conf:35:A+ /var/log/journal/%m - - - - group:adm:r-x,group:wheel:r-x

是解析systemd.conf中一个规则失败了,把放到一个单独的conf中,并调用systemd-tmpfiles --create xxx.conf验证并不会影响正常文件的创建。故继续验证error2,即topic:systemd如何控制服务启动的顺序咯

systemd对service启动顺序的控制

这个topic比较大,后续要补充。现在简单针对error2中log能看到的进行验证。read log发现

为了break ordering cycle(从sysinit.target开始),systemd-tmpfiles-setup.service被delete了!

比较坑,大概就是没执行吧。比对正常的centos72中的/var/log/messages,没有该部分内容的。所以下面问题转化为:

  • service的启动顺序怎么规定
  • service的依赖怎么查看

这里用到了几个命令

  • systemctl list-dependencies –before service
  • systemctl list-dependencies –after service

这个查看的其实是具体service脚本中规定的依赖规则,以lvm2-monitor为例:

[root@allinone-237 tmpfiles.d]# systemctl status lvm2-monitor
● lvm2-monitor.service - Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling
   Loaded: loaded (/usr/lib/systemd/system/lvm2-monitor.service; enabled; vendor preset: enabled)
   Active: active (exited) since Sat 2017-02-04 03:28:32 UTC; 4h 46min ago
     Docs: man:dmeventd(8)
           man:lvcreate(8)
           man:lvchange(8)
           man:vgchange(8)
 Main PID: 1037 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/lvm2-monitor.service

Feb 04 03:28:32 allinone-237 systemd[1]: Starting Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling...
Feb 04 03:28:32 allinone-237 systemd[1]: Started Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling.

能看到service的位置是/usr/lib/systemd/system/lvm2-monitor.service。vim打开可以看到

[Unit]
Description=Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling
Documentation=man:dmeventd(8) man:lvcreate(8) man:lvchange(8) man:vgchange(8)
Requires=dm-event.socket lvm2-lvmetad.socket
After=dm-event.socket dm-event.service lvm2-lvmetad.socket lvm2-activation.service lvm2-lvmetad.service
Before=local-fs.target
DefaultDependencies=no
Conflicts=shutdown.target

[Service]
Type=oneshot
Environment=LVM_SUPPRESS_LOCKING_FAILURE_MESSAGES=1
ExecStart=/usr/sbin/lvm vgchange --monitor y --ignoreskippedcluster
ExecStop=/usr/sbin/lvm vgchange --monitor n --ignoreskippedcluster
RemainAfterExit=yes

[Install]
WantedBy=sysinit.target 

其中,具体的写法可以参考CREATING AND MODIFYING SYSTEMD UNIT FILES。后续应该单独写blog测试一把,这里其实只关注unit中的before和after。这两个元素定义了service的依赖关系。log中很明显,出现了loop,systemd对这种问题的处理方式是随机搞掉一个(注1),即log中delete to break ordering cycle。那剩下就是通过systemctl list-dependencies --before,从sysinit.target开始,依次验证是否存在log中的环。验证结果是存在的,有问题的是ics-storage-started.service中对lvm2-monitor的依赖出现了问题。

未完待续

需要跟进的思路

这个说明很简单,中间仍然很多地方不明晰,记录如下,以后跟进研究:

其实上面的文档讲了好多,可以研究细看下。

另外就是:

研究中用到的文档记录

注1:是不是随机有待验证