Resolving a cluster health problem when specific PGs are down and stuck in peering

$ ceph health detail

HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
…
pg 0.5 is down+peering
pg 1.4 is down+peering
…
osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651
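
When many PGs are affected, the relevant lines of this output can be filtered with a short script. The following is a minimal sketch, assuming the simple "pg <id> is <state>" line format shown above; the function name pgs_in_state and the sample text are illustrative, and other Ceph versions may word these lines differently.

```python
import re

# Matches lines such as "pg 0.5 is down+peering" (format assumed from the output above).
PG_LINE = re.compile(r"^pg (\S+) is (\S+)")


def pgs_in_state(health_detail, state):
    """Return the IDs of PGs whose reported state includes `state`."""
    result = []
    for line in health_detail.splitlines():
        match = PG_LINE.match(line.strip())
        # A PG state is a "+"-joined list, e.g. "down+peering".
        if match and state in match.group(2).split("+"):
            result.append(match.group(1))
    return result


# Hypothetical sample built from the lines shown in this note.
sample_health = """\
pg 0.5 is down+peering
pg 1.4 is down+peering
osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651
"""

print(pgs_in_state(sample_health, "down"))
```

In practice the text would come from the output of ceph health detail rather than from a hard-coded sample.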

To find out exactly why one of these, pg 0.5, is down, query it:

$ ceph pg 0.5 query

{
  "state": "down+peering",
  …
  "recovery_state":[
  {
    "name": "Started\/Primary\/Peering\/GetInfo",
    "enter_time": "2012-03-06 14:40:16.169679",
    "requested_info_from": []
  },
  {
    "name": "Started\/Primary\/Peering",
    "enter_time": "2012-03-06 14:40:16.169659",
    "probing_osds": [0,1],
    "blocked": "peering is blocked due to down osds",
    "down_osds_we_would_probe": [1],
    "peering_blocked_by": [
    {
      "osd": 1,
      "current_lost_at": 0,
      "comment": "starting or marking this osd lost may let us proceed"
    }]
  },
  {
    "name": "Started",
    "enter_time": "2012-03-06 14:40:16.169513"
  }]
}

The part of this output to look at is the recovery_state section.

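The recovery_state section shows that peering is blocked by a down ceph-osd daemon, specifically osd.1.
In this case, starting that ceph-osd daemon lets peering complete, and the PG recovers.
If osd.1 is down because of a failure such as a disk error, deal with that failure accordingly.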
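
The same check can be done programmatically by reading the peering_blocked_by entries from the query output. The following is a minimal sketch, assuming the output of ceph pg 0.5 query is valid JSON with the structure shown above; the function name blocking_osds and the trimmed sample are illustrative, not part of Ceph.

```python
import json


def blocking_osds(pg_query_json):
    """Return the OSD IDs listed under peering_blocked_by in any recovery_state entry."""
    query = json.loads(pg_query_json)
    osds = []
    for entry in query.get("recovery_state", []):
        # Only the blocked peering entry carries this key; others are skipped.
        for blocker in entry.get("peering_blocked_by", []):
            osds.append(blocker["osd"])
    return osds


# Hypothetical sample, trimmed from the query output shown in this note.
sample_query = """
{
  "state": "down+peering",
  "recovery_state": [
    {"name": "Started/Primary/Peering",
     "blocked": "peering is blocked due to down osds",
     "down_osds_we_would_probe": [1],
     "peering_blocked_by": [
       {"osd": 1,
        "current_lost_at": 0,
        "comment": "starting or marking this osd lost may let us proceed"}]},
    {"name": "Started"}
  ]
}
"""

print(blocking_osds(sample_query))
```

An empty result means no recovery_state entry reports a blocking OSD, so the cause of the problem lies elsewhere.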

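If osd.1 cannot be brought back (for example, because its disk has failed), mark it as lost so the cluster can proceed without it. This is risky: the cluster cannot guarantee that the remaining copies of the data are consistent and up to date. Some Ceph versions also ask for the --yes-i-really-mean-it flag to confirm this command.

$ ceph osd lost 1 # recovery will then proceed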

Reference link

https://idchowto.com/?p=34983