12.8 Failover and switchover scenarios

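Installing an HA tool is not the same as having it work in a real incident. This section walks through six scenarios that operators actually encounter, with the response and verification steps for each. Rehearsing all of them quarterly or semiannually is recommended.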

Scenario A: primary failure

The most common scenario: a host crash, an OS panic, a disk failure, and so on.

Automatic failover flow

    sequenceDiagram
  participant Primary
  participant Standby1
  participant Standby2
  participant DCS as DCS / Monitor

  Primary->>DCS: renew leader lock (normal)
  Note over Primary: crash
  Primary--xDCS: renewal stops
  DCS->>DCS: TTL expires
  Standby1->>DCS: register as candidate (LSN=X)
  Standby2->>DCS: register as candidate (LSN=X-100)
  DCS-->>Standby1: most advanced LSN wins → leader
  Standby1->>Standby1: pg_ctl promote
  Standby2->>Standby1: re-attach
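
The election step in the diagram, where the candidate with the most advanced LSN wins, can be sketched as follows. This is a simplified model for illustration, not the actual logic of Patroni or any other tool; `parse_lsn`, `pick_leader`, and the standby names are hypothetical.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '0/3000060' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def pick_leader(candidates: dict[str, str]) -> str:
    """Return the name of the standby with the most advanced LSN."""
    return max(candidates, key=lambda name: parse_lsn(candidates[name]))


# standby2 is 100 bytes of WAL behind standby1, as in the diagram.
standbys = {"standby1": "0/3000060", "standby2": "0/2FFFFFC"}
print(pick_leader(standbys))  # standby1
```

An LSN is a 64-bit byte position written as two hexadecimal halves, so candidates have to be compared numerically rather than as strings.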

Operator checklist

Confirm the new primary: DCS state, patronictl list, or pg_autoctl show state
Client traffic cutover: HAProxy health check or connection-string failover
WAL archive continuity: is archive_command succeeding on the new primary?
Reduced-redundancy alert: the cluster has gone from 1 primary + 2 standbys to 1 primary + 1 standby
Recovery plan for the old primary: pg_rewind or reinit

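Checking for data loss

With asynchronous replication, some transactions whose commit was already acknowledged to the client can be lost. Compare application logs with pg_stat_archiver to work out how much was lost.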
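
One way to make the application-log comparison concrete is a set difference between the transactions the application saw acknowledged and those that exist on the new primary. This is a minimal sketch under assumed inputs: the ID sets and the `missing_commits` helper are hypothetical, and in practice the IDs would come from application logs and a query against the new primary.

```python
def missing_commits(acked_ids: set[int], present_ids: set[int]) -> list[int]:
    """Return IDs the application saw committed but the new primary lacks."""
    return sorted(acked_ids - present_ids)


# Hypothetical order IDs: acknowledged in the app log vs. found after failover.
acked = {1001, 1002, 1003, 1004, 1005}
present = {1001, 1002, 1003}
print(missing_commits(acked, present))  # [1004, 1005]
```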

Scenario B: planned switchover

Used for OS upgrades, hardware replacement, and quarterly maintenance.

# Patroni
patronictl switchover app-cluster --leader node1 --candidate node2

# repmgr (run on the standby that will become the new primary)
sudo -u postgres repmgr standby switchover --siblings-follow

# pg_auto_failover
pg_autoctl perform switchover

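The HA tool swaps the leader and follower roles within a short window, typically 5 to 30 seconds. With synchronous replication, zero data loss is guaranteed.

Procedure

Confirm the application can discover the new primary (connection pooler configuration)
Run the switchover
Monitor for 5 minutes and confirm that replication has re-formed correctly
Perform the OS work on the old primary
Rejoin the old primary as a standby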

Scenario C: split-brain

Two primaries exist at the same time. It is caused by a network partition.

    flowchart TB
  subgraph A["Site A (network-isolated)"]
    P1["old primary<br/>(still believes it is primary)"]
  end
  subgraph B["Site B"]
    P2["new primary<br/>(promoted)"]
    S["standby"]
  end
  A --x B
  classDef bad fill:#fee2e2,stroke:#b91c1c,color:#7f1d1d
  classDef ok fill:#d1fae5,stroke:#047857,color:#064e3b
  class P1 bad
  class P2,S ok

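Prevention

DCS quorum (Patroni with an etcd cluster of 3 or more nodes)
Witness node (repmgr)
synchronous_commit = on plus Patroni's synchronous_node_count = N (the number of standbys that must acknowledge)
Fencing/STONITH: shut the old primary down completely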
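
The reason a quorum of three or more DCS nodes prevents split-brain is the majority rule: only the side of a partition that can reach a strict majority may hold the leader lock. The sketch below shows just that arithmetic; `has_quorum` is a hypothetical helper, not an API of etcd or Patroni.

```python
def has_quorum(reachable_members: int, cluster_size: int) -> bool:
    """True if this side of a partition sees a strict majority of DCS members."""
    return reachable_members >= cluster_size // 2 + 1


# 3-node DCS split 1 / 2: only the two-node side may hold the leader lock.
print(has_quorum(1, 3))  # False
print(has_quorum(2, 3))  # True
# 2-node DCS split 1 / 1: neither side has a majority, so no leader at all.
print(has_quorum(1, 2))  # False
```

With an even split no side reaches a majority, which is why an odd number of DCS or witness nodes is the usual choice.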

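Incident response

Stop both primaries
Analyze the old primary's data to see how far it advanced
Reconcile using the new primary as the source of truth
Reinit the old primary (rejoining it directly is risky)

Treat any occurrence as a major incident: if the HA tooling worked as designed, split-brain should almost never happen.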

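Scenario D: the old primary comes back

After an automatic failover, the old primary comes back online.

How the tools respond

Patroni + pg_rewind: rejoins automatically (with use_pg_rewind enabled) and syncs only the changed blocks, so it is fast
repmgr: repmgr node rejoin --force-rewind
pg_auto_failover: rejoins automatically or reinitializes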

pg_rewind limitations

Condition	Notes
`wal_log_hints = on` required	or data checksums enabled
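Limited divergence on the old primary after the timeline fork	pg_rewind discards the old primary's post-fork writes; if it kept accepting traffic after the promotion, check what would be lost and reinit if in doubt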
Enough WAL must be retained	available from the archive or held by a streaming replication slot
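
The configuration preconditions in the table can be checked up front before attempting a rewind. The following is a minimal sketch: `can_try_pg_rewind` and its inputs are hypothetical, and the `full_page_writes` check is an addition based on the pg_rewind documentation rather than on the table above. WAL availability is not covered and still has to be verified separately.

```python
def can_try_pg_rewind(settings: dict[str, str], data_checksums: bool) -> tuple[bool, str]:
    """Check the configuration preconditions for pg_rewind on the old primary."""
    # full_page_writes is on by default but is required as well.
    if settings.get("full_page_writes", "on") != "on":
        return False, "full_page_writes must be on"
    # Either hint-bit WAL logging or data checksums must be enabled.
    if settings.get("wal_log_hints", "off") != "on" and not data_checksums:
        return False, "need wal_log_hints = on or data checksums"
    return True, "configuration ok; WAL availability still has to be checked"


print(can_try_pg_rewind({"wal_log_hints": "on"}, data_checksums=False)[0])   # True
print(can_try_pg_rewind({"wal_log_hints": "off"}, data_checksums=False)[0])  # False
```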

If it fails

Do a full reinit with pg_basebackup. On a large cluster this takes a long time.

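Scenario E: standby falls behind / lag spike

A standby is gigabytes behind, which makes it unsuitable as a failover candidate.

Causes

A single transaction that changes a lot of data (for example, an UPDATE with no WHERE clause)
Load on the standby host (CPU, disk)
WAL generated by a large index build
Recovery conflicts with long-running queries on the standby (related to hot_standby_feedback)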

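Response

Track the lag: pg_stat_replication.replay_lag
Find the SQL responsible: pg_stat_statements on the primary
Wait briefly: if it is one big transaction, the standby catches up once it finishes
Explicitly hold off failover: Patroni maintenance mode (patronictl pause)

Setting maximum_lag_on_failover (Patroni, in bytes) excludes standbys that are too far behind from automatic promotion.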
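
The effect of a lag ceiling on candidate selection can be sketched like this. It is an illustrative model only, not Patroni's implementation; `eligible_candidates`, the standby names, and the byte positions are hypothetical, and the 1 MiB threshold is just an example value.

```python
def eligible_candidates(primary_lsn: int, replay_lsns: dict[str, int],
                        maximum_lag_on_failover: int) -> list[str]:
    """Return standbys whose replay lag in bytes is within the allowed maximum."""
    return sorted(
        name for name, lsn in replay_lsns.items()
        if primary_lsn - lsn <= maximum_lag_on_failover
    )


last_primary_lsn = 10_000_000_000           # last known primary position (bytes)
replay = {
    "standby1": 9_999_900_000,              # 100 kB behind
    "standby2": 7_000_000_000,              # 3 GB behind
}
print(eligible_candidates(last_primary_lsn, replay, 1_048_576))  # ['standby1']
```

If every standby exceeds the ceiling, the list is empty and no automatic promotion happens, so the threshold trades availability against potential data loss.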

Scenario F: a replication slot fills `pg_wal` on the primary

A standby has disconnected or a logical subscriber has stalled, so the slot keeps holding WAL.

Immediate response

-- find the slots retaining the most WAL
SELECT slot_name, active, wal_status,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
  FROM pg_replication_slots
 ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;

-- if the slot is inactive and no longer needed, drop it right away
SELECT pg_drop_replication_slot('old_inactive_slot');

In an emergency, dropping the slot and rebuilding the standby is the safest option and loses no data on the primary.

Setting max_slot_wal_keep_size prevents this kind of incident (see 12.2).
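
For alerting, the triage step above (find inactive slots that hold the most WAL) can be automated. The sketch below assumes the slot data has already been fetched, for example from a query like the one shown earlier; the `Slot` type, `drop_candidates`, the slot names, and the threshold are all hypothetical.

```python
from typing import NamedTuple


class Slot(NamedTuple):
    name: str
    active: bool
    retained_bytes: int  # WAL held back by this slot


def drop_candidates(slots: list[Slot], threshold_bytes: int) -> list[str]:
    """Inactive slots retaining more than the threshold, largest first."""
    picked = [s for s in slots if not s.active and s.retained_bytes > threshold_bytes]
    picked.sort(key=lambda s: s.retained_bytes, reverse=True)
    return [s.name for s in picked]


slots = [
    Slot("standby2_slot", active=True, retained_bytes=16 * 1024**2),
    Slot("old_inactive_slot", active=False, retained_bytes=80 * 1024**3),
    Slot("paused_subscriber", active=False, retained_bytes=5 * 1024**3),
]
print(drop_candidates(slots, 10 * 1024**3))  # ['old_inactive_slot']
```

The function only nominates slots. Whether a slot is really safe to drop remains an operator decision, since the consumer behind it has to be rebuilt or resynchronized afterwards.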

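Regular simulations

Run these on a schedule, in staging or an isolated environment:

Simulation	Frequency
Planned switchover	Monthly (fine to run in production)
Forced primary kill -9	Quarterly (staging)
Network partition (iptables)	Quarterly
Artificially induced standby lag	Quarterly
Slot runaway	Semiannually
Full DR (another region)	Semiannually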

What to measure

Time to detect the failover
Time without a primary (RTO)
Data loss (RPO)
Errors exposed to clients
Agreement with the written procedure
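
The time without a primary can be measured from the client side by probing the write endpoint at a fixed interval during the simulation and computing the longest failed stretch afterwards. This is a minimal sketch: `longest_outage` and the probe data are hypothetical, and the probes themselves (for example, a small write against the cluster endpoint every second) are assumed to be collected elsewhere.

```python
def longest_outage(probes: list[tuple[float, bool]]) -> float:
    """Longest stretch in seconds from a first failed probe to the next success.

    An outage that is still open at the end of the list is not counted.
    """
    longest = 0.0
    outage_start = None
    for ts, ok in probes:
        if not ok and outage_start is None:
            outage_start = ts                       # outage begins
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None                     # outage ends
    return longest


# (timestamp in seconds, write succeeded?) collected during a simulated failover
probes = [(0.0, True), (1.0, True), (2.0, False), (3.0, False), (14.0, True), (15.0, True)]
print(longest_outage(probes))  # 12.0
```

The resolution of the result is bounded by the probe interval, so a one-second interval cannot distinguish an 11-second outage from a 12-second one.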

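Operations checklist

Item	Frequency
Lag alert for each primary/standby pair	Real time
Alert on inactive slots in `pg_replication_slots`	Real time
Alerts from the HA tool itself (Patroni, etc.)	Real time
Witness/quorum node count	Daily
Failover simulation	Quarterly
pg_rewind and base backup verification	Quarterly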

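Operational anti-patterns

Anti-pattern	Risk
Running without failover simulations	The tooling does not work during an incident
Leaving `maintenance` mode on and forgetting to turn it off	Automatic failover does not happen
2-node synchronous replication without quorum	The primary stalls when the standby dies
No split-brain protection (no DCS or witness)	Two primaries at once
Old primary comes back and rejoins without pg_rewind or reinit	Data conflicts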

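The biggest enemy of HA is that failures are rare, so the procedures get forgotten. Whether quarterly and semiannual simulations keep the procedure in the operators' hands is the real measure of HA.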

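Summary

Six scenarios: primary failure, planned switchover, split-brain, old primary returning, lag spike, slot runaway
HA tools help a great deal, but rehearsed procedures are what make HA real
Split-brain prevention: DCS quorum, witness, synchronous quorum, STONITH
The old primary rejoins through pg_rewind or reinit
Regular simulations plus measurement to track RTO and RPO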

This concludes Part XII, Replication and High Availability. Part XIII covers performance and tuning.
