Ceph MDS logs slow requests, and Ceph FS clients cannot access the cluster. The health warning "clients failing to respond to capability release" is also reported.

[root@ceph-monitor ~]# ceph -s
  cluster:
    id:     xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx
    health: HEALTH_WARN
            3 clients failing to respond to capability release
            2 MDSs report slow requests

[root@ceph-monitor ~]# ceph health detail
MDS_SLOW_REQUEST 2 MDSs report slow requests
    mdsceph-mds0(mds.1): 6707 slow requests are blocked > 30 sec
    mdsceph-mds1(mds.0): 6659 slow requests are blocked > 30 sec
MDS_CLIENT_LATE_RELEASE 3 clients failing to respond to capability release
    mdsceph-mds1(mds.0): Client xxxxxxxxx:health_check failing to respond to capability release client_id: 58094341
    mdsceph-mds1(mds.0): Client xxxxxxxxx:health_check failing to respond to capability release client_id: 58345871
    mdsceph-mds1(mds.0): Client xxxxxxxxx:health_check failing to respond to capability release client_id: 58412670
1. Check the current status.

# ceph health detail
HEALTH_WARN 3 clients failing to respond to capability release; 2 MDSs report slow requests
MDS_CLIENT_LATE_RELEASE 3 clients failing to respond to capability release
    mdsceph-mds1(mds.0): Client 123456789:health_check failing to respond to capability release client_id: 5555555
    mdsceph-mds1(mds.0): Client 123456789:health_check failing to respond to capability release client_id: 5555554
    mdsceph-mds1(mds.0): Client 123456789:health_check failing to respond to capability release client_id: 5555553
MDS_SLOW_REQUEST 2 MDSs report slow requests
    mdsceph-mds0(mds.1): 6707 slow requests are blocked > 30 sec
    mdsceph-mds1(mds.0): 6659 slow requests are blocked > 30 sec

# ceph -s
    health: HEALTH_WARN
            3 clients failing to respond to capability release
            2 MDSs report slow requests
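When many clients are listed, it can help to summarize the `ceph health detail` text per MDS. The following is a minimal sketch, not an official Ceph tool: the function name `summarize_health_detail` and both regular expressions are hypothetical, written only against the sample lines shown in this article, and may need adjusting for other Ceph releases.

```python
import re

# Sample lines copied from the `ceph health detail` output in this article.
# The second client line keeps the fused "releaseclient_id" spelling to show
# that the pattern tolerates both forms.
SAMPLE = """\
HEALTH_WARN 3 clients failing to respond to capability release; 2 MDSs report slow requests
MDS_CLIENT_LATE_RELEASE 3 clients failing to respond to capability release
    mdsceph-mds1(mds.0): Client 123456789:health_check failing to respond to capability release client_id: 5555555
    mdsceph-mds1(mds.0): Client 123456789:health_check failing to respond to capability releaseclient_id: 5555554
    mdsceph-mds1(mds.0): Client 123456789:health_check failing to respond to capability release client_id: 5555553
MDS_SLOW_REQUEST 2 MDSs report slow requests
    mdsceph-mds0(mds.1): 6707 slow requests are blocked > 30 sec
    mdsceph-mds1(mds.0): 6659 slow requests are blocked > 30 sec
"""

# Hypothetical patterns derived from the sample text above (an assumption,
# not a documented output format).
LATE_RELEASE = re.compile(
    r"\((mds\.\d+)\): Client (\S+) failing to respond to "
    r"capability release\s*client_id: (\d+)"
)
SLOW_REQUESTS = re.compile(r"\((mds\.\d+)\): (\d+) slow requests are blocked")


def summarize_health_detail(text):
    """Return (late_release, slow) dictionaries keyed by MDS rank name.

    late_release maps e.g. "mds.0" to a list of client_id integers.
    slow maps e.g. "mds.0" to the number of blocked slow requests.
    """
    late_release = {}
    slow = {}
    for line in text.splitlines():
        match = LATE_RELEASE.search(line)
        if match:
            late_release.setdefault(match.group(1), []).append(int(match.group(3)))
            continue
        match = SLOW_REQUESTS.search(line)
        if match:
            slow[match.group(1)] = int(match.group(2))
    return late_release, slow


if __name__ == "__main__":
    late, slow = summarize_health_detail(SAMPLE)
    print("late release:", late)
    print("slow requests:", slow)
```

In this sample, all three late-release clients are attached to mds.0, which is why the next step inspects that MDS.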
2. Once the affected MDS has been identified from the output above (here, mds.0), dump that MDS's in-flight operations with dump_ops_in_flight; the dump includes the stuck operations. It shows that client.123456789 has a stuck operation that failed to rdlock and is still waiting to acquire the lock.

# ceph daemon mds.0 dump_ops_in_flight
{
    "ops": [
        {
            "description": "client_request(client.123456789:12 lookup #0x1/health_check 2018-03-30 14:28:21.223159 caller_uid=1001, caller_gid=994{})",
            "initiated_at": "2018-03-30 14:28:21.223897",
            "age": 248958.400523,
            "duration": 248958.400541,
            "type_data": {
                "flag_point": "failed to rdlock, waiting",
                "reqid": "client.123456789:12",
                "op_type": "client_request",
                "client_info": {
                    "client": "client.123456789",
                    "tid": 12
                },
                "events": [
                    {
                        "time": "2018-03-30 14:28:21.223897",
                        "event": "initiated"
                    },
                    {
                        "time": "2018-03-30 14:28:21.224145",
                        "event": "failed to rdlock, waiting"
                    }
                    ...
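With thousands of blocked requests, reading the dump by eye is impractical. The sketch below filters a saved dump for long-running operations and reports the owning client and the point where each one is stuck. It is an illustration only: `stuck_ops` is a hypothetical helper, the sample JSON is a trimmed and closed-off copy of the truncated excerpt in this article, and the 30-second default simply mirrors the "blocked > 30 sec" wording of the health warning.

```python
import json

# Trimmed copy of the excerpt in this article, closed off so that it parses.
# Only fields that appear in the excerpt are used.
SAMPLE_DUMP = """
{
    "ops": [
        {
            "description": "client_request(client.123456789:12 lookup #0x1/health_check 2018-03-30 14:28:21.223159 caller_uid=1001, caller_gid=994{})",
            "initiated_at": "2018-03-30 14:28:21.223897",
            "age": 248958.400523,
            "type_data": {
                "flag_point": "failed to rdlock, waiting",
                "reqid": "client.123456789:12",
                "op_type": "client_request",
                "client_info": {"client": "client.123456789", "tid": 12}
            }
        }
    ]
}
"""


def stuck_ops(dump, min_age=30.0):
    """Return (client, flag_point, age) tuples for ops at least min_age seconds old.

    `dump` is the parsed JSON of dump_ops_in_flight. Missing fields are
    tolerated and reported as None, since this sketch only assumes the
    fields visible in the excerpt.
    """
    found = []
    for op in dump.get("ops", []):
        age = float(op.get("age", 0.0))
        if age < min_age:
            continue
        type_data = op.get("type_data", {})
        client = type_data.get("client_info", {}).get("client")
        found.append((client, type_data.get("flag_point"), age))
    return found


if __name__ == "__main__":
    for client, flag_point, age in stuck_ops(json.loads(SAMPLE_DUMP)):
        print(f"{client}: {flag_point} (age {age:.0f}s)")
```

One way to use it is to redirect the command output to a file and pass the parsed contents to `stuck_ops`; the client names it returns are the candidates to look up in the session list.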
3. Once the client has been identified (in this case, client.123456789), list the sessions currently connected to the MDS. The session ls output shows that the misbehaving client is running ceph version 10.2.7.

# ceph daemon mds.0 session ls
{
    "id": 123456789,
    "num_leases": 0,
    "num_caps": 1,
    "state": "open",
    "replay_requests": 0,
    "completed_requests": 0,
    "reconnecting": false,
    "inst": "client.123456789 1.2.3.4:0/1111111111",
    "client_metadata": {
        "ceph_sha1": "50e863e0f4bc8f4b9e31156de690d765af245185",
        "ceph_version": "ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)",
        "entity_id": "POOL",
        "hostname": "ip-1.2.3.4",
        "mount_point": "/home/sc/mnt/ceph",
        "root": "/"
    }
    ...
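On an MDS with many sessions, the entry for one client can be picked out programmatically. The following sketch assumes the full `session ls` output is a JSON array of session objects shaped like the excerpt in this article; `find_session` is a hypothetical helper, and the sample data is a trimmed copy of that excerpt wrapped in an array.

```python
import json

# Trimmed copy of the session excerpt in this article, wrapped in a JSON
# array (assumed shape of the full output).
SAMPLE_SESSIONS = """
[
    {
        "id": 123456789,
        "num_leases": 0,
        "num_caps": 1,
        "state": "open",
        "inst": "client.123456789 1.2.3.4:0/1111111111",
        "client_metadata": {
            "ceph_version": "ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)",
            "hostname": "ip-1.2.3.4",
            "mount_point": "/home/sc/mnt/ceph",
            "root": "/"
        }
    }
]
"""


def find_session(sessions, client_id):
    """Return a short summary of the session whose "id" equals client_id.

    Returns None when no session matches. Only fields that appear in the
    excerpt are read; absent fields come back as None.
    """
    for session in sessions:
        if session.get("id") != client_id:
            continue
        metadata = session.get("client_metadata", {})
        return {
            "id": session.get("id"),
            "state": session.get("state"),
            "num_caps": session.get("num_caps"),
            "hostname": metadata.get("hostname"),
            "mount_point": metadata.get("mount_point"),
            "ceph_version": metadata.get("ceph_version"),
        }
    return None


if __name__ == "__main__":
    summary = find_session(json.loads(SAMPLE_SESSIONS), 123456789)
    print(summary)
```

The hostname and mount point in the summary identify which machine to inspect before deciding whether to shut the client down or evict it.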
To resolve the issue, shut down or evict the offending client. This clears the stuck operations on the MDS.
# ceph tell mds.0 client evict id=123456789