Ceph - MDS Reporting Slow requests
Problem
The Ceph MDS reports slow requests and CephFS clients cannot access the cluster. The cluster health also shows the warning "clients failing to respond to capability release".
[root@ceph-monitor ~]# ceph -s
  cluster:
    id: xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx
    health: HEALTH_WARN
            3 clients failing to respond to capability release
            2 MDSs report slow requests
[root@ceph-monitor ~]# ceph health detail
MDS_SLOW_REQUEST 2 MDSs report slow requests
    mdsceph-mds0(mds.1): 6707 slow requests are blocked > 30 sec
    mdsceph-mds1(mds.0): 6659 slow requests are blocked > 30 sec
MDS_CLIENT_LATE_RELEASE 3 clients failing to respond to capability release
    mdsceph-mds1(mds.0): Client xxxxxxxxx:health_check failing to respond to capability releaseclient_id: 58094341
    mdsceph-mds1(mds.0): Client xxxxxxxxx:health_check failing to respond to capability releaseclient_id: 58345871
    mdsceph-mds1(mds.0): Client xxxxxxxxx:health_check failing to respond to capability releaseclient_id: 58412670
Resolution
1. Check the current status.
# ceph health detail
HEALTH_WARN 3 clients failing to respond to capability release; 2 MDSs report slow requests
MDS_CLIENT_LATE_RELEASE 3 clients failing to respond to capability release
    mdsceph-mds1(mds.0): Client 123456789:health_check failing to respond to capability releaseclient_id: 5555555
    mdsceph-mds1(mds.0): Client 123456789:health_check failing to respond to capability releaseclient_id: 5555554
    mdsceph-mds1(mds.0): Client 123456789:health_check failing to respond to capability releaseclient_id: 5555553
MDS_SLOW_REQUEST 2 MDSs report slow requests
    mdsceph-mds0(mds.1): 6707 slow requests are blocked > 30 sec
    mdsceph-mds1(mds.0): 6659 slow requests are blocked > 30 sec
# ceph -s
    health: HEALTH_WARN
            3 clients failing to respond to capability release
            2 MDSs report slow requests
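The health output above can also be summarized by a small script when many MDS daemons or clients are involved. The following is a minimal sketch, not part of Ceph: the function name summarize_health and the regular expressions are assumptions based only on the line format shown in this article, and the output format may differ in other Ceph releases.

```python
import re

# Detail lines in the output above look like:
#   mds<name>(mds.<rank>): <message>
DETAIL_RE = re.compile(r"^\s*mds(?P<name>\S+?)\(mds\.(?P<rank>\d+)\):\s*(?P<msg>.*)$")
SLOW_RE = re.compile(r"(\d+) slow requests are blocked")
CLIENT_ID_RE = re.compile(r"client_id:\s*(\d+)")


def summarize_health(text):
    """Parse `ceph health detail` text (hypothetical helper).

    Returns (slow, late):
      slow maps MDS rank -> number of blocked slow requests
      late maps MDS rank -> client ids failing to release capabilities
    """
    slow, late = {}, {}
    for line in text.splitlines():
        match = DETAIL_RE.match(line)
        if not match:
            continue  # summary lines such as "MDS_SLOW_REQUEST ..." are skipped
        rank = int(match.group("rank"))
        msg = match.group("msg")
        slow_match = SLOW_RE.search(msg)
        if slow_match:
            slow[rank] = int(slow_match.group(1))
        client_match = CLIENT_ID_RE.search(msg)
        if client_match and "capability release" in msg:
            late.setdefault(rank, []).append(int(client_match.group(1)))
    return slow, late


# Sample text copied from the `ceph health detail` output in step 1.
HEALTH_SAMPLE = """\
HEALTH_WARN 3 clients failing to respond to capability release; 2 MDSs report slow requests
MDS_CLIENT_LATE_RELEASE 3 clients failing to respond to capability release
    mdsceph-mds1(mds.0): Client 123456789:health_check failing to respond to capability releaseclient_id: 5555555
    mdsceph-mds1(mds.0): Client 123456789:health_check failing to respond to capability releaseclient_id: 5555554
    mdsceph-mds1(mds.0): Client 123456789:health_check failing to respond to capability releaseclient_id: 5555553
MDS_SLOW_REQUEST 2 MDSs report slow requests
    mdsceph-mds0(mds.1): 6707 slow requests are blocked > 30 sec
    mdsceph-mds1(mds.0): 6659 slow requests are blocked > 30 sec
"""

if __name__ == "__main__":
    slow, late = summarize_health(HEALTH_SAMPLE)
    print("slow requests per MDS rank:", slow)
    print("late-release clients per MDS rank:", late)
```

In this sample both ranks report slow requests, but only mds.0 lists clients that fail to release capabilities, which is why mds.0 is examined in the next step.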
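2. Once the affected MDS is identified from the output above (mds.0 here), dump the in-flight operations of that MDS with dump_ops_in_flight to see the stuck requests. Run the command on the host where that MDS daemon runs, because "ceph daemon" talks to the daemon's local admin socket. The dump shows that client.123456789 has a stuck operation that failed to acquire a read lock (rdlock) and is still waiting for it.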
# ceph daemon mds.0 dump_ops_in_flight
{
    "ops": [
        {
            "description": "client_request(client.123456789:12 lookup #0x1/health_check 2018-03-30 14:28:21.223159 caller_uid=1001, caller_gid=994{})",
            "initiated_at": "2018-03-30 14:28:21.223897",
            "age": 248958.400523,
            "duration": 248958.400541,
            "type_data": {
                "flag_point": "failed to rdlock, waiting",
                "reqid": "client.123456789:12",
                "op_type": "client_request",
                "client_info": {
                    "client": "client.123456789",
                    "tid": 12
                },
                "events": [
                    {
                        "time": "2018-03-30 14:28:21.223897",
                        "event": "initiated"
                    },
                    {
                        "time": "2018-03-30 14:28:21.224145",
                        "event": "failed to rdlock, waiting"
                    }
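When the dump contains thousands of operations, grouping them by client makes the offending client easier to spot. The sketch below assumes the JSON layout shown in the excerpt above (type_data as an object with flag_point and client_info); the helper name stuck_ops_by_client, the 30-second default threshold, and the trimmed sample (the excerpt closed off so that it parses) are illustrative assumptions, and other Ceph releases may lay out the dump differently.

```python
import json
from collections import defaultdict


def stuck_ops_by_client(dump, min_age=30.0):
    """Group in-flight ops older than min_age seconds by client (hypothetical helper).

    `dump` is the parsed JSON of `dump_ops_in_flight`.
    Returns {client: [(flag_point, description), ...]}.
    """
    stuck = defaultdict(list)
    for op in dump.get("ops", []):
        if op.get("age", 0.0) < min_age:
            continue  # young ops are probably still being processed normally
        data = op.get("type_data", {})
        client = data.get("client_info", {}).get("client", "unknown")
        stuck[client].append((data.get("flag_point", ""), op.get("description", "")))
    return dict(stuck)


# Trimmed version of the excerpt above, closed off so that it is valid JSON.
OPS_SAMPLE = json.loads("""
{
    "ops": [
        {
            "description": "client_request(client.123456789:12 lookup #0x1/health_check 2018-03-30 14:28:21.223159 caller_uid=1001, caller_gid=994{})",
            "initiated_at": "2018-03-30 14:28:21.223897",
            "age": 248958.400523,
            "duration": 248958.400541,
            "type_data": {
                "flag_point": "failed to rdlock, waiting",
                "reqid": "client.123456789:12",
                "op_type": "client_request",
                "client_info": {"client": "client.123456789", "tid": 12}
            }
        }
    ]
}
""")

if __name__ == "__main__":
    for client, ops in stuck_ops_by_client(OPS_SAMPLE).items():
        print(client, len(ops), "stuck op(s), first state:", ops[0][0])
```

In practice the input would be the saved output of the dump command, loaded with json.load from a file instead of the inline sample.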
3. Once the client is identified (client.123456789 in this case), list the sessions currently connected to the MDS. The session ls output shows that the misbehaving client is running ceph version 10.2.7.
# ceph daemon mds.0 session ls
{
    "id": 123456789,
    "num_leases": 0,
    "num_caps": 1,
    "state": "open",
    "replay_requests": 0,
    "completed_requests": 0,
    "reconnecting": false,
    "inst": "client.123456789 1.2.3.4:0/1111111111",
    "client_metadata": {
        "ceph_sha1": "50e863e0f4bc8f4b9e31156de690d765af245185",
        "ceph_version": "ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)",
        "entity_id": "POOL",
        "hostname": "ip-1.2.3.4",
        "mount_point": "/home/sc/mnt/ceph",
        "root": "/"
    }
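With many connected clients, the session of interest can be looked up by its id rather than read by eye. This is a minimal sketch that assumes the full session ls output is a JSON list of objects shaped like the excerpt above; find_session is a hypothetical helper name, and the sample is a shortened copy of the excerpt wrapped in a list.

```python
import json


def find_session(sessions, client_id):
    """Return a short summary of the session with the given id, or None.

    `sessions` is assumed to be the parsed JSON list from `session ls`.
    """
    for session in sessions:
        if session.get("id") == client_id:
            meta = session.get("client_metadata", {})
            return {
                "id": session["id"],
                "state": session.get("state"),
                "num_caps": session.get("num_caps"),
                "ceph_version": meta.get("ceph_version"),
                "hostname": meta.get("hostname"),
                "mount_point": meta.get("mount_point"),
            }
    return None


# Shortened copy of the excerpt above, wrapped in a list (assumption).
SESSIONS_SAMPLE = json.loads("""
[
    {
        "id": 123456789,
        "num_caps": 1,
        "state": "open",
        "inst": "client.123456789 1.2.3.4:0/1111111111",
        "client_metadata": {
            "ceph_version": "ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)",
            "hostname": "ip-1.2.3.4",
            "mount_point": "/home/sc/mnt/ceph",
            "root": "/"
        }
    }
]
""")

if __name__ == "__main__":
    print(find_session(SESSIONS_SAMPLE, 123456789))
```

The hostname and mount_point fields indicate which machine and mount to inspect before deciding whether to stop or evict the client.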
4. Stop or evict the offending client to resolve the issue and clear the stuck operations on the MDS.
# ceph tell mds.0 client evict id=123456789
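If the eviction is scripted, it helps to print the exact command for review before running it. The sketch below only wraps the command shown above; evict_command and evict_client are hypothetical helper names, and the dry-run default is a design choice of this example, not Ceph behavior.

```python
import subprocess


def evict_command(mds, client_id):
    """Build the argument list for the eviction command shown above."""
    return ["ceph", "tell", f"mds.{mds}", "client", "evict", f"id={client_id}"]


def evict_client(mds, client_id, dry_run=True):
    """Print the command, and run it only when dry_run is False."""
    cmd = evict_command(mds, client_id)
    print(" ".join(cmd))
    if not dry_run:
        # check=True raises CalledProcessError if the ceph CLI exits non-zero.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Dry run: prints the command without contacting the cluster.
    evict_client(0, 123456789)
```

After a real eviction, running ceph health detail again shows whether the slow request and late release warnings have cleared.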