Defect #13005

Frequent repeated restarts

Added by 이 헌제 about 2 months ago. Updated about 2 months ago.

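Status: Resolved
Priority: Normal
Assignee:
Target version:
Start date: 2026/02/24
Due date: 2026/02/26 (48 days overdue)
% Done: 0%
Estimated time: 5:00 h
Found in version:
Fixed in version:
Difficulty: Easy
Importance:
Detection type:
Helper:
Company:
Contact:
Score: 1.25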

Description

Overview

The Probe fails, which causes frequent repeated restarts. This needs to be resolved.

[root@localhost gluesys-csi-v1.0.0]# kubectl get pods -n storage-1
NAME                                      READY   STATUS    RESTARTS         AGE
gluesys-csi-controller-57c7d4bb95-lwjrk   1/1     Running   0                16h
gluesys-csi-dhm44                         7/7     Running   22 (4h15m ago)   16h

The Probe simply dials vip:port with a 3-second timeout, and the dial fails with a timeout.

Another error observed is a deletion failure:

{"time":"2026-02-24T01:16:23.971821416Z","level":"ERROR","msg":"error during unary call","node":"localhost.localdomain","method":"/csi.v1.Controller/DeleteVolume","error":"rpc error: code = Aborted desc = Deleting LogicalVolume CRD: pvc-b312f317-93a4-4ea7-a187-78f9d3329e85"}

Could this be the cause?
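
One way to check whether DeleteVolume errors line up with Probe failures is to tally ERROR entries per gRPC method from the driver log. The sketch below assumes only the JSON fields visible in this issue (`level`, `method`); `countErrorsByMethod` is a hypothetical helper, and the two sample lines are copied from this issue.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry holds the two fields of a driver log line that the tally needs.
type logEntry struct {
	Level  string `json:"level"`
	Method string `json:"method"`
}

// Two log lines copied verbatim from this issue.
const sample = `{"time":"2026-02-24T01:16:23.971821416Z","level":"ERROR","msg":"error during unary call","node":"localhost.localdomain","method":"/csi.v1.Controller/DeleteVolume","error":"rpc error: code = Aborted desc = Deleting LogicalVolume CRD: pvc-b312f317-93a4-4ea7-a187-78f9d3329e85"}
{"time":"2026-02-24T08:39:38.376709518Z","level":"ERROR","msg":"error during unary call","node":"localhost.localdomain","method":"/csi.v1.Identity/Probe","error":"rpc error: code = Unavailable desc = Storage backend unreachable: %!w(*net.OpError=&{dial tcp <nil> 0xc000027980 0x29a1ec0})"}`

// countErrorsByMethod tallies ERROR-level entries per gRPC method.
// Lines that are not valid JSON, or that carry no method, are skipped.
func countErrorsByMethod(logText string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(logText))
	for sc.Scan() {
		var e logEntry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			continue
		}
		if e.Level == "ERROR" && e.Method != "" {
			counts[e.Method]++
		}
	}
	return counts
}

func main() {
	for method, n := range countErrorsByMethod(sample) {
		fmt.Printf("%s: %d\n", method, n)
	}
}
```

Comparing the timestamps of the two groups would then show whether the Probe failures cluster around volume deletions.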

Action #1

Updated by 이 헌제 about 2 months ago

  • Due date set to 2026/02/26
  • Status changed from New to Review
  • Assignee set to 이 헌제
  • Start date set to 2026/02/24
  • Difficulty set to Easy
Action #2

Updated by 이 헌제 about 2 months ago

  • Status changed from Review to In Progress
  • Estimated time set to 5:00 h
Action #3

Updated by 이 헌제 about 2 months ago

  • Status changed from In Progress to Resolved
Action #4

Updated by 이 헌제 about 2 months ago

  • Score changed from 0.00 to 2.75
  • driver log
{"time":"2026-02-24T08:39:37.375048679Z","level":"INFO","msg":"Start to check TCP health","node":"localhost.localdomain"}
{"time":"2026-02-24T08:39:38.376635436Z","level":"ERROR","msg":"Storage TCP health check failed","node":"localhost.localdomain","address":"192.168.39.170:80","err":"dial tcp 192.168.39.170:80: i/o timeout"}
{"time":"2026-02-24T08:39:38.376709518Z","level":"ERROR","msg":"error during unary call","node":"localhost.localdomain","method":"/csi.v1.Identity/Probe","error":"rpc error: code = Unavailable desc = Storage backend unreachable: %!w(*net.OpError=&{dial tcp <nil> 0xc000027980 0x29a1ec0})"}
{"time":"2026-02-24T08:39:39.380506506Z","level":"INFO","msg":"Start to check TCP health","node":"localhost.localdomain"}
{"time":"2026-02-24T08:39:40.375807027Z","level":"ERROR","msg":"Storage TCP health check failed","node":"localhost.localdomain","address":"192.168.39.170:80","err":"dial tcp 192.168.39.170:80: i/o timeout"}
{"time":"2026-02-24T08:39:40.375973438Z","level":"ERROR","msg":"error during unary call","node":"localhost.localdomain","method":"/csi.v1.Identity/Probe","error":"rpc error: code = Unavailable desc = Storage backend unreachable: %!w(*net.OpError=&{dial tcp <nil> 0xc00071ad50 0x29a1ec0})"}
{"time":"2026-02-24T08:39:41.377937503Z","level":"INFO","msg":"Start to check TCP health","node":"localhost.localdomain"}
{"time":"2026-02-24T08:39:42.381448707Z","level":"ERROR","msg":"Storage TCP health check failed","node":"localhost.localdomain","address":"192.168.39.170:80","err":"dial tcp 192.168.39.170:80: i/o timeout"}
{"time":"2026-02-24T08:39:42.381552952Z","level":"ERROR","msg":"error during unary call","node":"localhost.localdomain","method":"/csi.v1.Identity/Probe","error":"rpc error: code = Unavailable desc = Storage backend unreachable: %!w(*net.OpError=&{dial tcp <nil> 0xc000710210 0x29a1ec0})"}
{"time":"2026-02-24T08:39:43.377079913Z","level":"INFO","msg":"Start to check TCP health","node":"localhost.localdomain"}
{"time":"2026-02-24T08:39:43.889092154Z","level":"WARN","msg":"Volume is already deleted","node":"localhost.localdomain","volume":"pvc-809c8ae2-06d8-44d6-97ba-db384a3dba78"}
{"time":"2026-02-24T08:39:43.892596107Z","level":"INFO","msg":"Request GetCapacity","node":"localhost.localdomain","type":"thin"}
{"time":"2026-02-24T08:39:43.938825709Z","level":"WARN","msg":"Volume is already deleted","node":"localhost.localdomain","volume":"pvc-6629f7fe-5123-4168-9e41-cd5551b84774"}
{"time":"2026-02-24T08:39:44.144849916Z","level":"WARN","msg":"Volume is already deleted","node":"localhost.localdomain","volume":"pvc-f79bdd98-3ce8-4ade-b11f-cc4a1ce4e5cb"}
{"time":"2026-02-24T08:39:44.379040199Z","level":"ERROR","msg":"Storage TCP health check failed","node":"localhost.localdomain","address":"192.168.39.170:80","err":"dial tcp 192.168.39.170:80: i/o timeout"}
{"time":"2026-02-24T08:39:44.379118543Z","level":"ERROR","msg":"error during unary call","node":"localhost.localdomain","method":"/csi.v1.Identity/Probe","error":"rpc error: code = Unavailable desc = Storage backend unreachable: %!w(*net.OpError=&{dial tcp <nil> 0xc000710630 0x29a1ec0})"}
  • controller log
{"time":"2026-02-24T08:36:43.281957759Z","level":"ERROR","msg":"Already contains finalizer","component":"LogicalVolumeReconciler","volumeID":""}
{"time":"2026-02-24T08:36:43.282004253Z","level":"INFO","msg":"Start to delete volume","component":"LogicalVolumeReconciler","volumeID":"","LV":"pvc-4e44b540-fa5f-4073-9bcb-355f95e8b7bf"}
{"time":"2026-02-24T08:36:50.05246034Z","level":"INFO","msg":"Deleted Share","component":"LogicalVolumeReconciler","name":"pvc-4e44b540-fa5f-4073-9bcb-355f95e8b7bf"}
{"time":"2026-02-24T08:37:17.090216591Z","level":"INFO","msg":"Deleted LVM resource by template","component":"LogicalVolumeReconciler","name":"pvc-4e44b540-fa5f-4073-9bcb-355f95e8b7bf"}
{"time":"2026-02-24T08:37:28.881837717Z","level":"INFO","msg":"Deleted LV","component":"LogicalVolumeReconciler","name":"pvc-4e44b540-fa5f-4073-9bcb-355f95e8b7bf"}
{"time":"2026-02-24T08:37:28.88196448Z","level":"INFO","msg":"Successfully delete volume","component":"LogicalVolumeReconciler","volumeID":"","LV":"pvc-4e44b540-fa5f-4073-9bcb-355f95e8b7bf"}
{"time":"2026-02-24T08:37:28.897622425Z","level":"ERROR","msg":"Already contains finalizer","component":"LogicalVolumeReconciler","volumeID":""}
{"time":"2026-02-24T08:37:28.897657004Z","level":"INFO","msg":"Start to delete volume","component":"LogicalVolumeReconciler","volumeID":"","LV":"pvc-809c8ae2-06d8-44d6-97ba-db384a3dba78"}
{"time":"2026-02-24T08:37:35.145565892Z","level":"INFO","msg":"Deleted Share","component":"LogicalVolumeReconciler","name":"pvc-809c8ae2-06d8-44d6-97ba-db384a3dba78"}
{"time":"2026-02-24T08:37:58.461266426Z","level":"INFO","msg":"Deleted LVM resource by template","component":"LogicalVolumeReconciler","name":"pvc-809c8ae2-06d8-44d6-97ba-db384a3dba78"}
{"time":"2026-02-24T08:38:09.120999008Z","level":"INFO","msg":"Deleted LV","component":"LogicalVolumeReconciler","name":"pvc-809c8ae2-06d8-44d6-97ba-db384a3dba78"}
{"time":"2026-02-24T08:38:09.121044674Z","level":"INFO","msg":"Successfully delete volume","component":"LogicalVolumeReconciler","volumeID":"","LV":"pvc-809c8ae2-06d8-44d6-97ba-db384a3dba78"}
{"time":"2026-02-24T08:38:09.133255852Z","level":"ERROR","msg":"Already contains finalizer","component":"LogicalVolumeReconciler","volumeID":""}
{"time":"2026-02-24T08:38:09.133433853Z","level":"INFO","msg":"Start to delete volume","component":"LogicalVolumeReconciler","volumeID":"","LV":"pvc-6629f7fe-5123-4168-9e41-cd5551b84774"}
{"time":"2026-02-24T08:38:18.446242448Z","level":"INFO","msg":"Deleted Share","component":"LogicalVolumeReconciler","name":"pvc-6629f7fe-5123-4168-9e41-cd5551b84774"}
{"time":"2026-02-24T08:38:39.489277015Z","level":"INFO","msg":"Deleted LVM resource by template","component":"LogicalVolumeReconciler","name":"pvc-6629f7fe-5123-4168-9e41-cd5551b84774"}
{"time":"2026-02-24T08:38:48.886926308Z","level":"INFO","msg":"Deleted LV","component":"LogicalVolumeReconciler","name":"pvc-6629f7fe-5123-4168-9e41-cd5551b84774"}
{"time":"2026-02-24T08:38:48.886999147Z","level":"INFO","msg":"Successfully delete volume","component":"LogicalVolumeReconciler","volumeID":"","LV":"pvc-6629f7fe-5123-4168-9e41-cd5551b84774"}
{"time":"2026-02-24T08:38:48.902190697Z","level":"ERROR","msg":"Already contains finalizer","component":"LogicalVolumeReconciler","volumeID":""}
{"time":"2026-02-24T08:38:48.90222863Z","level":"INFO","msg":"Start to delete volume","component":"LogicalVolumeReconciler","volumeID":"","LV":"pvc-f79bdd98-3ce8-4ade-b11f-cc4a1ce4e5cb"}
{"time":"2026-02-24T08:38:54.095111147Z","level":"INFO","msg":"Deleted Share","component":"LogicalVolumeReconciler","name":"pvc-f79bdd98-3ce8-4ade-b11f-cc4a1ce4e5cb"}
{"time":"2026-02-24T08:39:14.572487413Z","level":"INFO","msg":"Deleted LVM resource by template","component":"LogicalVolumeReconciler","name":"pvc-f79bdd98-3ce8-4ade-b11f-cc4a1ce4e5cb"}
{"time":"2026-02-24T08:39:25.10810683Z","level":"INFO","msg":"Deleted LV","component":"LogicalVolumeReconciler","name":"pvc-f79bdd98-3ce8-4ade-b11f-cc4a1ce4e5cb"}
{"time":"2026-02-24T08:39:25.108147166Z","level":"INFO","msg":"Successfully delete volume","component":"LogicalVolumeReconciler","volumeID":"","LV":"pvc-f79bdd98-3ce8-4ade-b11f-cc4a1ce4e5cb"}
{"time":"2026-02-24T08:39:25.121220614Z","level":"ERROR","msg":"Already contains finalizer","component":"LogicalVolumeReconciler","volumeID":""}
{"time":"2026-02-24T08:39:25.121259074Z","level":"INFO","msg":"Start to delete volume","component":"LogicalVolumeReconciler","volumeID":"","LV":"pvc-1f0616e6-cbc3-4cb0-a414-53a46c58835c"}
{"time":"2026-02-24T08:39:30.152526687Z","level":"INFO","msg":"Deleted Share","component":"LogicalVolumeReconciler","name":"pvc-1f0616e6-cbc3-4cb0-a414-53a46c58835c"}
{"time":"2026-02-24T08:40:07.398247797Z","level":"INFO","msg":"Deleted LVM resource by template","component":"LogicalVolumeReconciler","name":"pvc-1f0616e6-cbc3-4cb0-a414-53a46c58835c"}
{"time":"2026-02-24T08:40:16.223360332Z","level":"INFO","msg":"Deleted LV","component":"LogicalVolumeReconciler","name":"pvc-1f0616e6-cbc3-4cb0-a414-53a46c58835c"}
{"time":"2026-02-24T08:40:16.223401845Z","level":"INFO","msg":"Successfully delete volume","component":"LogicalVolumeReconciler","volumeID":"","LV":"pvc-1f0616e6-cbc3-4cb0-a414-53a46c58835c"}
  • Restart issue reproduced
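Action #5

Updated by 이 헌제 about 2 months ago

The VIP flaps every time a ShareCtl resource is disabled or deleted.
Adjusting the score (to 500) was tried, but the problem still occurs occasionally.

Lower the score? Remove the relationship between share and VIP? A failover test is needed.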

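Action #6

Updated by 이 헌제 about 2 months ago

The health check keeps issuing requests equivalent to nc -z -v -w 3 192.168.39.170 80.
While the gms service is executing an API call, TCP handshakes requested afterwards are left waiting,
so if the API call takes longer than 3 seconds the health check is recorded as a failure.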

Action #7

Updated by 이 헌제 about 2 months ago

 Resource Group: VIP-group-1
     vip_192.168.39.170 (ocf::heartbeat:IPaddr2):   Stopped
 rsc_VG1    (ocf::heartbeat:LVM):   Started ASE333-1
 rsc_VG1_pvc-a42b00c1-fe5c-42e3-93aa-3690f976512a   (ocf::heartbeat:Filesystem):    Started ASE333-1
 share_VG1_pvc-a42b00c1-fe5c-42e3-93aa-3690f976512a (ocf::anystor-e:ShareCtl):  Stopped (disabled)

When the last remaining ShareCtl is disabled, the VIP is sometimes stopped as well.

Action #8

Updated by 이 헌제 about 2 months ago

Related logs (in chronological order)

Feb 25 13:24:49 ASE333-1 pacemaker-schedulerd[2753] (common_print)  info: share_VG1_pvc-a42b00c1-fe5c-42e3-93aa-3690f976512a    (ocf::anystor-e:ShareCtl):  Started ASE333-1

Feb 25 13:24:58 ASE333-1 pacemaker-based     [2746] (cib_perform_op)    info: ++ /cib/configuration/resources/primitive[@id='share_VG1_pvc-a42b00c1-fe5c-42e3-93aa-3690f976512a']/meta_attributes[@id='share_VG1_pvc-a42b00c1-fe5c-42e3-93aa-3690f976512a-meta_attributes']:  <nvpair id="share_VG1_pvc-a42b00c1-fe5c-42e3-93aa-3690f976512a-meta_attributes-target-role" name="target-role" value="Stopped"/>

Feb 25 13:24:58 ASE333-1 pacemaker-controld  [2754] (abort_transition_graph)    info: Transition 4561 aborted by share_VG1_pvc-a42b00c1-fe5c-42e3-93aa-3690f976512a-meta_attributes-target-role doing create target-role=Stopped: Configuration change | cib=0.2864.0

Feb 25 13:24:58 ASE333-1 pacemaker-schedulerd[2753] (common_print)  info: share_VG1_pvc-a42b00c1-fe5c-42e3-93aa-3690f976512a    (ocf::anystor-e:ShareCtl):  Started ASE333-1 (disabled)

Feb 25 13:24:58 ASE333-1 pacemaker-schedulerd[2753] (native_color)  info: Resource share_VG1_pvc-a42b00c1-fe5c-42e3-93aa-3690f976512a cannot run anywhere

Feb 25 13:24:58 ASE333-1 pacemaker-schedulerd[2753] (LogAction)     notice:  * Stop       vip_192.168.39.170                                            (             ASE333-1 )   due to unrunnable one-or-more:order_set_vip_192.168.39.170_set

Feb 25 13:24:58 ASE333-1 pacemaker-controld  [2754] (te_rsc_command)    notice: Initiating stop operation vip_192.168.39.170_stop_0 locally on ASE333-1 | action 45

Feb 25 13:24:58 ASE333-1 pacemaker-execd     [2751] (log_execute)   info: executing - rsc:vip_192.168.39.170 action:stop call_id:9826

Feb 25 13:24:58 ASE333-1 pacemaker-execd     [2751] (log_finished)  info: finished - rsc:vip_192.168.39.170 action:stop call_id=9826 pid:26821 exit-code:0 exec-time:51ms queue-time:0ms

Feb 25 13:24:58 ASE333-1 pacemaker-controld  [2754] (process_lrm_event)     notice: Result of stop operation for vip_192.168.39.170 on ASE333-1: 0 (ok) | call=9826 key=vip_192.168.39.170_stop_0 confirmed=true cib-update=19271

Cause summary

  1. target-role of the share_VG1_pvc-a42b00c1-fe5c-42e3-93aa-3690f976512a resource is set to Stopped
  2. The share resource can no longer run on any node (cannot run anywhere)
  3. Because of the order_set_vip_192.168.39.170_set constraint, the VIP is stopped along with it
  4. The VIP stop operation succeeds (exit-code:0)
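
The sequence above can be pictured with a toy model. It reads the `one-or-more:` prefix in the scheduler log as an ordering set in which the VIP needs at least one runnable share. That reading is an interpretation of the log wording, not something confirmed in this issue, and the code is not Pacemaker's scheduler logic. The names `share` and `vipMayRun` are hypothetical.

```go
package main

import "fmt"

// share is a toy stand-in for a ShareCtl resource. Enabled=false
// corresponds to target-role=Stopped.
type share struct {
	Name    string
	Enabled bool
}

// vipMayRun models a "one-or-more" ordering set: the VIP stays
// runnable while at least one share in the set is runnable.
func vipMayRun(set []share) bool {
	for _, s := range set {
		if s.Enabled {
			return true
		}
	}
	return false
}

func main() {
	set := []share{
		{Name: "share_A", Enabled: true},
		{Name: "share_B", Enabled: true},
	}
	fmt.Println("both enabled:", vipMayRun(set))

	set[0].Enabled = false
	fmt.Println("one disabled:", vipMayRun(set))

	// Disabling the last remaining share leaves nothing runnable,
	// which in this model is the point where the VIP gets stopped.
	set[1].Enabled = false
	fmt.Println("last one disabled:", vipMayRun(set))
}
```

Under this reading, the VIP stops only when the last enabled share is disabled, which matches the status output in Action #7 where a single ShareCtl was present.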
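Action #9

Updated by 이 헌제 about 2 months ago

Reopening this issue because of the additional problem described in the comments above.

Action #10

Updated by 이 헌제 about 2 months ago

  • Status changed from Resolved to In Progress
Action #11

Updated by 이 헌제 about 2 months ago

  • Status changed from In Progress to Resolved
  • Score changed from 2.75 to 1.25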