Ceph - MDS Reporting Slow requests

Problem

[root@ceph-monitor ~]# ceph -s
  cluster:
    id:     xxxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx
    health: HEALTH_WARN
            3 clients failing to respond to capability release
            2 MDSs report slow requests
            
[root@ceph-monitor ~]# ceph health detail
MDS_SLOW_REQUEST 2 MDSs report slow requests
    mdsceph-mds0(mds.1): 6707 slow requests are blocked > 30 sec
    mdsceph-mds1(mds.0): 6659 slow requests are blocked > 30 sec

MDS_CLIENT_LATE_RELEASE 3 clients failing to respond to capability release
    mdsceph-mds1(mds.0): Client xxxxxxxxx:health_check failing to respond to capability releaseclient_id: 58094341
    mdsceph-mds1(mds.0): Client xxxxxxxxx:health_check failing to respond to capability releaseclient_id: 58345871
    mdsceph-mds1(mds.0): Client xxxxxxxxx:health_check failing to respond to capability releaseclient_id: 58412670

Solution

1. Check the current status.

# ceph health detail
HEALTH_WARN 3 clients failing to respond to capability release; 2 MDSs report slow requests
MDS_CLIENT_LATE_RELEASE 3 clients failing to respond to capability release
    mdsceph-mds1(mds.0): Client 123456789:health_check failing to respond to capability releaseclient_id: 5555555
    mdsceph-mds1(mds.0): Client 123456789:health_check failing to respond to capability releaseclient_id: 5555554
    mdsceph-mds1(mds.0): Client 123456789:health_check failing to respond to capability releaseclient_id: 5555553
MDS_SLOW_REQUEST 2 MDSs report slow requests
    mdsceph-mds0(mds.1): 6707 slow requests are blocked > 30 sec
    mdsceph-mds1(mds.0): 6659 slow requests are blocked > 30 sec

# ceph -s
    health: HEALTH_WARN
            3 clients failing to respond to capability release
            2 MDSs report slow requests
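
When many clients are listed, the health detail text can be scanned with a short script. The following is a minimal Python sketch, not part of Ceph, that pulls the MDS rank and client_id pairs out of text in the format shown above. The function name parse_health_detail and both regular expressions are assumptions made for illustration and are based only on the lines in this example; other Ceph releases may format these lines differently.

```python
import re

# Detail line for a client that is not releasing capabilities, e.g.
#   mdsceph-mds1(mds.0): Client 123456789:health_check failing to respond to capability releaseclient_id: 5555555
# "\s*" before client_id tolerates output with or without a space there.
LATE_RELEASE = re.compile(
    r"^\s*\S+\((?P<rank>mds\.\d+)\): Client \S+ "
    r"failing to respond to capability release\s*client_id: (?P<client_id>\d+)"
)

# Detail line for slow requests, e.g.
#   mdsceph-mds0(mds.1): 6707 slow requests are blocked > 30 sec
SLOW_REQUEST = re.compile(
    r"^\s*\S+\((?P<rank>mds\.\d+)\): (?P<count>\d+) slow requests are blocked"
)


def parse_health_detail(text):
    """Return (late_release, slow_requests) parsed from health detail text.

    late_release  : list of (mds_rank, client_id)
    slow_requests : list of (mds_rank, blocked_request_count)
    """
    late_release, slow_requests = [], []
    for line in text.splitlines():
        match = LATE_RELEASE.match(line)
        if match:
            late_release.append((match.group("rank"), int(match.group("client_id"))))
            continue
        match = SLOW_REQUEST.match(line)
        if match:
            slow_requests.append((match.group("rank"), int(match.group("count"))))
    return late_release, slow_requests


# Sample text copied from the output in step 1.
SAMPLE_HEALTH = """\
HEALTH_WARN 3 clients failing to respond to capability release; 2 MDSs report slow requests
MDS_CLIENT_LATE_RELEASE 3 clients failing to respond to capability release
    mdsceph-mds1(mds.0): Client 123456789:health_check failing to respond to capability releaseclient_id: 5555555
    mdsceph-mds1(mds.0): Client 123456789:health_check failing to respond to capability releaseclient_id: 5555554
    mdsceph-mds1(mds.0): Client 123456789:health_check failing to respond to capability releaseclient_id: 5555553
MDS_SLOW_REQUEST 2 MDSs report slow requests
    mdsceph-mds0(mds.1): 6707 slow requests are blocked > 30 sec
    mdsceph-mds1(mds.0): 6659 slow requests are blocked > 30 sec
"""

if __name__ == "__main__":
    late, slow = parse_health_detail(SAMPLE_HEALTH)
    for rank, client_id in late:
        print(f"{rank}: client_id {client_id} is not releasing capabilities")
    for rank, count in slow:
        print(f"{rank}: {count} slow requests")
```

In this sample all three late-release entries point at mds.0, which is why the next step inspects that MDS.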

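2. Once the affected MDS has been identified from the output above (mds.0 here), use dump_ops_in_flight to dump the operations in flight on that MDS, which include the stuck operations. The dump shows that client.123456789 has a stuck operation that failed to acquire an rdlock and is still waiting for the lock.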

# ceph daemon mds.0 dump_ops_in_flight
{
    "ops": [
        {
            "description": "client_request(client.123456789:12 lookup #0x1/health_check 2018-03-30 14:28:21.223159 caller_uid=1001, caller_gid=994{})",
            "initiated_at": "2018-03-30 14:28:21.223897",
            "age": 248958.400523,
            "duration": 248958.400541,
            "type_data": {
                "flag_point": "failed to rdlock, waiting",
                "reqid": "client.123456789:12",
                "op_type": "client_request",
                "client_info": {
                    "client": "client.123456789",
                    "tid": 12
                },
                "events": [
                    {
                        "time": "2018-03-30 14:28:21.223897",
                        "event": "initiated"
                    },
                    {
                        "time": "2018-03-30 14:28:21.224145",
                        "event": "failed to rdlock, waiting"
                    }
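
With thousands of blocked requests the dump is too long to read by eye. The sketch below is one hypothetical way to filter it in Python: it keeps operations older than a threshold and reports the request id, client, and flag_point of each. The function name stuck_ops and the min_age parameter are illustrative assumptions, the field names are taken from the dump shown above, and SAMPLE_DUMP is a shortened, bracket-completed copy of that output rather than real captured data.

```python
import json


def stuck_ops(dump, min_age=30.0):
    """Return (reqid, client, flag_point, age) for ops at least min_age seconds old.

    `dump` is the parsed JSON object printed by dump_ops_in_flight.
    Missing fields are reported as None instead of raising.
    """
    result = []
    for op in dump.get("ops", []):
        age = op.get("age", 0.0)
        if age < min_age:
            continue
        type_data = op.get("type_data", {})
        client = type_data.get("client_info", {}).get("client")
        result.append((type_data.get("reqid"), client, type_data.get("flag_point"), age))
    return result


# Shortened copy of the dump above, closed off so that it is valid JSON.
SAMPLE_DUMP = """
{
    "ops": [
        {
            "description": "client_request(client.123456789:12 lookup #0x1/health_check)",
            "initiated_at": "2018-03-30 14:28:21.223897",
            "age": 248958.400523,
            "duration": 248958.400541,
            "type_data": {
                "flag_point": "failed to rdlock, waiting",
                "reqid": "client.123456789:12",
                "op_type": "client_request",
                "client_info": {"client": "client.123456789", "tid": 12}
            }
        }
    ]
}
"""

if __name__ == "__main__":
    for reqid, client, flag_point, age in stuck_ops(json.loads(SAMPLE_DUMP)):
        print(f"{client} {reqid}: {flag_point} (age {age:.0f}s)")
```

The default of 30 seconds mirrors the "blocked > 30 sec" threshold in the health message; raising it narrows the list to the oldest operations, which are often the ones holding up the rest.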

3. Once the client has been identified (client.123456789 in this case), list the sessions currently connected to the MDS. The session ls output shows that the misbehaving client is running ceph version 10.2.7.

# ceph daemon mds.0 session ls
    {
        "id": 123456789,
        "num_leases": 0,
        "num_caps": 1,
        "state": "open",
        "replay_requests": 0,
        "completed_requests": 0,
        "reconnecting": false,
        "inst": "client.123456789 1.2.3.4:0/1111111111",
        "client_metadata": {
            "ceph_sha1": "50e863e0f4bc8f4b9e31156de690d765af245185",
            "ceph_version": "ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)",
            "entity_id": "POOL",
            "hostname": "ip-1.2.3.4",
            "mount_point": "/home/sc/mnt/ceph",
            "root": "/"
        }
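
On an MDS with many sessions, the entry for one client can be looked up by id. The following Python sketch assumes that the full session ls output parses as a JSON list of session objects shaped like the excerpt above; the excerpt is truncated, so the enclosing list in SAMPLE_SESSIONS is an assumption. The helper names client_number and find_session are hypothetical.

```python
import json


def client_number(client):
    """Accept 123456789 or 'client.123456789' and return the integer id."""
    return int(str(client).rsplit(".", 1)[-1])


def find_session(sessions, client):
    """Return a short summary of the session for `client`, or None if absent."""
    wanted = client_number(client)
    for session in sessions:
        if session.get("id") != wanted:
            continue
        meta = session.get("client_metadata", {})
        return {
            "id": wanted,
            "inst": session.get("inst"),
            "num_caps": session.get("num_caps"),
            "ceph_version": meta.get("ceph_version"),
            "hostname": meta.get("hostname"),
            "mount_point": meta.get("mount_point"),
        }
    return None


# Excerpt from above, wrapped in a list and closed off so that it is valid JSON.
SAMPLE_SESSIONS = """
[
    {
        "id": 123456789,
        "num_leases": 0,
        "num_caps": 1,
        "state": "open",
        "inst": "client.123456789 1.2.3.4:0/1111111111",
        "client_metadata": {
            "ceph_version": "ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)",
            "hostname": "ip-1.2.3.4",
            "mount_point": "/home/sc/mnt/ceph",
            "root": "/"
        }
    }
]
"""

if __name__ == "__main__":
    summary = find_session(json.loads(SAMPLE_SESSIONS), "client.123456789")
    print(summary)
```

The hostname and mount_point fields identify which machine and mount to inspect before deciding whether to unmount the client or evict it.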

Shut down or evict the offending client to resolve the problem and clear the stuck operations on the MDS.

# ceph tell mds.0 client evict id=123456789
