2019/11/01

Today I simulated an environment with one Graylog server and three ES nodes,
to see how to recover when one of the ES nodes fails.

First, check the current state of the ES cluster.
There are three nodes
and the status is green.

curl -XGET http://192.168.12.201:9200/_cluster/health?pretty
{
  "cluster_name" : "graylog",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 11,
  "active_shards" : 14,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
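
Since the same health check is repeated several times in this post, a small helper can pull out just the status field. This is only a sketch: the function name health_status is made up for illustration, and it assumes the ?pretty output format shown above (one field per line).

```shell
# Extract the "status" value from _cluster/health?pretty output read on stdin.
# health_status is a hypothetical helper name, not part of ES or Graylog.
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# Usage (host taken from this post):
#   curl -s 'http://192.168.12.201:9200/_cluster/health?pretty' | health_status
```

Piping the output through the helper prints a single word (green, yellow, or red), which is easier to use in a script than the full JSON.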

Next, look at the current state of all shards.

curl -XGET 192.168.12.203:9200/_cat/shards
gl-events_0        3 p STARTED 0 230b 192.168.12.202 es-node-02
gl-events_0        2 p STARTED 0 230b 192.168.12.201 es-node-1
gl-events_0        1 p STARTED 0 230b 192.168.12.203 es-node-03
gl-events_0        0 p STARTED 0 230b 192.168.12.202 es-node-02
graylog_3          2 r STARTED 1  7kb 192.168.12.203 es-node-03
graylog_3          2 p STARTED 1  7kb 192.168.12.201 es-node-1
graylog_3          1 r STARTED 1  7kb 192.168.12.202 es-node-02
graylog_3          1 p STARTED 1  7kb 192.168.12.201 es-node-1
graylog_3          0 p STARTED 1  7kb 192.168.12.202 es-node-02
graylog_3          0 r STARTED 1  7kb 192.168.12.203 es-node-03
gl-system-events_0 3 p STARTED 0 230b 192.168.12.203 es-node-03
gl-system-events_0 2 p STARTED 0 230b 192.168.12.202 es-node-02
gl-system-events_0 1 p STARTED 0 230b 192.168.12.201 es-node-1
gl-system-events_0 0 p STARTED 0 230b 192.168.12.203 es-node-03



Shut down one of the ES nodes, 192.168.12.202, to simulate a failure.

Check the overall cluster state.
The number of nodes drops to 2
and the status changes to red.

curl -XGET http://192.168.12.201:9200/_cluster/health?pretty
{
  "cluster_name" : "graylog",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 8,
  "active_shards" : 9,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 5,
  "delayed_unassigned_shards" : 5,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 64.28571428571429
}

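Now look at the shards.
The 192.168.12.202 node is gone,
and the gl-events_0 and gl-system-events_0 primary shards that lived on it are now UNASSIGNED.
Those two indices have only primary (p) shards and no replicas (r), so there is no other copy to take over.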

curl -XGET 192.168.12.203:9200/_cat/shards
gl-system-events_0 3 p STARTED    0 261b 192.168.12.203 es-node-03
gl-system-events_0 2 p UNASSIGNED                       
gl-system-events_0 1 p STARTED    0 261b 192.168.12.201 es-node-1
gl-system-events_0 0 p STARTED    0 261b 192.168.12.203 es-node-03
graylog_3          2 r STARTED    1  7kb 192.168.12.203 es-node-03
graylog_3          2 p STARTED    1  7kb 192.168.12.201 es-node-1
graylog_3          1 r STARTED    1  7kb 192.168.12.203 es-node-03
graylog_3          1 p STARTED    1  7kb 192.168.12.201 es-node-1
graylog_3          0 p STARTED    1  7kb 192.168.12.203 es-node-03
graylog_3          0 r STARTED    1  7kb 192.168.12.201 es-node-1
gl-events_0        3 p UNASSIGNED                       
gl-events_0        2 p STARTED    0 261b 192.168.12.201 es-node-1
gl-events_0        1 p STARTED    0 261b 192.168.12.203 es-node-03
gl-events_0        0 p UNASSIGNED


Take a machine, reinstall ES on it, and have it rejoin the cluster.


Check the state first.
The node count is back to 3,
but the status is still red.

curl -XGET http://192.168.12.201:9200/_cluster/health?pretty
{
  "cluster_name" : "graylog",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 8,
  "active_shards" : 11,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 3,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 78.57142857142857
}

Now look at the shards again.
They are still UNASSIGNED and have not recovered.

curl -XGET 192.168.12.203:9200/_cat/shards
gl-system-events_0 3 p STARTED    0 261b 192.168.12.203 es-node-03
gl-system-events_0 2 p UNASSIGNED                       
gl-system-events_0 1 p STARTED    0 261b 192.168.12.201 es-node-1
gl-system-events_0 0 p STARTED    0 261b 192.168.12.203 es-node-03
graylog_3          2 r STARTED    1  7kb 192.168.12.203 es-node-03
graylog_3          2 p STARTED    1  7kb 192.168.12.201 es-node-1
graylog_3          1 r STARTED    1  7kb 192.168.12.203 es-node-03
graylog_3          1 p STARTED    1  7kb 192.168.12.201 es-node-1
graylog_3          0 p STARTED    1  7kb 192.168.12.203 es-node-03
graylog_3          0 r STARTED    1  7kb 192.168.12.201 es-node-1
gl-events_0        3 p UNASSIGNED                       
gl-events_0        2 p STARTED    0 261b 192.168.12.201 es-node-1
gl-events_0        1 p STARTED    0 261b 192.168.12.203 es-node-03
gl-events_0        0 p UNASSIGNED
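
To pick out which indices are affected without reading the whole listing, the _cat/shards output can be filtered. The sketch below assumes the default column order shown above (index, shard, prirep, state, ...); the function name is made up for illustration.

```shell
# Read _cat/shards output on stdin and print each index that has at least one
# UNASSIGNED primary shard. Assumed columns: index shard prirep state ...
# indices_with_unassigned_primaries is a hypothetical helper name.
indices_with_unassigned_primaries() {
  awk '$3 == "p" && $4 == "UNASSIGNED" { print $1 }' | sort -u
}

# Usage (host taken from this post):
#   curl -s '192.168.12.203:9200/_cat/shards' | indices_with_unassigned_primaries
# Adding ?h=index,shard,prirep,state,unassigned.reason to the URL keeps the same
# first four columns and also shows why each shard is unassigned.
```

On the listing above this would print gl-events_0 and gl-system-events_0.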


From what I read, it should be possible to reroute the shards,
but in practice I ran into problems and could not get it to work.
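
For reference, here is a sketch of what such a reroute request could look like. This is not what was run in this post. It assumes an ES version that supports the allocate_empty_primary command of the _cluster/reroute API (5.0 or later), and it assumes the reinstalled node is named es-node-02. The helper name empty_primary_body is made up. Note that allocating an empty primary discards whatever data that shard held.

```shell
# Print a _cluster/reroute request body that allocates an EMPTY primary for
# one shard. Any data the shard held is lost (hence accept_data_loss).
# empty_primary_body is a hypothetical helper name.
empty_primary_body() {
  # $1 = index name, $2 = shard number, $3 = target node name
  printf '{"commands":[{"allocate_empty_primary":{"index":"%s","shard":%s,"node":"%s","accept_data_loss":true}}]}\n' "$1" "$2" "$3"
}

# Usage (assumed node name, not run in this post):
#   curl -XPOST -H 'Content-Type: application/json' \
#     'http://192.168.12.201:9200/_cluster/reroute' \
#     -d "$(empty_primary_body gl-events_0 3 es-node-02)"
```

If the reroute is rejected, GET _cluster/allocation/explain usually reports why a shard cannot be assigned.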


The approach I found to work is to stop the Graylog server first.

systemctl stop graylog-server.service

Then delete every index that has UNASSIGNED shards.

curl -XDELETE '192.168.12.201:9200/gl-system-events_0/'

curl -XDELETE '192.168.12.201:9200/gl-events_0'

After deleting them, look at the shards again.

curl -XGET 192.168.12.203:9200/_cat/shards
graylog_3 2 r STARTED 1 7kb 192.168.12.203 es-node-03
graylog_3 2 p STARTED 1 7kb 192.168.12.201 es-node-1
graylog_3 1 r STARTED 1 7kb 192.168.12.202 es-node-02
graylog_3 1 p STARTED 1 7kb 192.168.12.201 es-node-1
graylog_3 0 r STARTED 1 7kb 192.168.12.202 es-node-02
graylog_3 0 p STARTED 1 7kb 192.168.12.203 es-node-03

graylog_3 is the index that holds the original data,
and it has Index replicas configured.

Restart the Graylog server.

Graylog recreates the gl-system-events_0 and gl-events_0 indices that were just deleted.
The collected log data is stored in graylog_*, so it is not affected.

Look at the shards one more time.
Everything is back to normal.

curl -XGET 192.168.12.203:9200/_cat/shards
gl-system-events_0 3 p STARTED 0 230b 192.168.12.203 es-node-03
gl-system-events_0 2 p STARTED 0 230b 192.168.12.202 es-node-02
gl-system-events_0 1 p STARTED 0 230b 192.168.12.201 es-node-1
gl-system-events_0 0 p STARTED 0 230b 192.168.12.203 es-node-03
graylog_3          2 r STARTED 1  7kb 192.168.12.203 es-node-03
graylog_3          2 p STARTED 1  7kb 192.168.12.201 es-node-1
graylog_3          1 r STARTED 1  7kb 192.168.12.202 es-node-02
graylog_3          1 p STARTED 1  7kb 192.168.12.201 es-node-1
graylog_3          0 r STARTED 1  7kb 192.168.12.202 es-node-02
graylog_3          0 p STARTED 1  7kb 192.168.12.203 es-node-03
gl-events_0        3 p STARTED 0 230b 192.168.12.202 es-node-02
gl-events_0        2 p STARTED 0 230b 192.168.12.201 es-node-1
gl-events_0        1 p STARTED 0 230b 192.168.12.203 es-node-03
gl-events_0        0 p STARTED 0 230b 192.168.12.202 es-node-02


So remember to set Index replicas to at least 1 under Configure Index Set.
Set the number of Index shards according to the number of ES nodes you have.
If there are three ES nodes, set it to 3.
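
A quick way to spot indices that would go red on a single node failure is to list those with zero replicas. The sketch below assumes the output of _cat/indices with the columns index, pri, rep selected; the function name is made up for illustration.

```shell
# Read `_cat/indices?h=index,pri,rep` output on stdin and print every index
# whose replica count is 0. indices_without_replicas is a hypothetical name.
indices_without_replicas() {
  awk '$3 == "0" { print $1 }'
}

# Usage (host taken from this post):
#   curl -s '192.168.12.201:9200/_cat/indices?h=index,pri,rep' | indices_without_replicas
```

In the cluster from this post, gl-events_0 and gl-system-events_0 had no replica shards while graylog_3 had one replica per shard, so only the first two would be listed.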



