Keepalived

g7s@lemmy.ml · 1 year ago

Keepalived

taladar@sh.itjust.works · 1 year ago

Whatever technology you end up using, be aware that running things in an HA way increases complexity by an order of magnitude or two. For a while (and possibly even in the long term), that is very likely to cause additional downtime rather than reduce it.
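As a rough illustration of why extra moving parts can cost availability, here is a minimal Python sketch of the textbook reliability model. Components that must all be up multiply their availabilities (series). Redundant replicas, where any one suffices, combine as 1 - (1 - a)^n (parallel). The sketch assumes independent failures and perfect failover. Every figure in it (0.999 per server, 0.995 for a storage layer, 0.998 for a cluster manager) is hypothetical and chosen only for illustration.

```python
from math import prod


def series_availability(parts: list[float]) -> float:
    """Every part must be up: A = A1 * A2 * ... * An (independent failures assumed)."""
    return prod(parts)


def redundant_availability(replica: float, n: int) -> float:
    """Any one of n replicas suffices: A = 1 - (1 - a)**n (independent failures, perfect failover assumed)."""
    return 1 - (1 - replica) ** n


# Hypothetical figures, for illustration only.
single_server = 0.999
two_replicas = redundant_availability(single_server, 2)

# The HA setup also adds components that must all work: a distributed storage
# layer and a cluster manager (both availabilities are made-up assumptions).
ha_setup = series_availability([two_replicas, 0.995, 0.998])

print(f"single server:            {single_server:.4%}")
print(f"two replicas alone:       {two_replicas:.4%}")
print(f"replicas + new HA layers: {ha_setup:.4%}")
```

With these made-up numbers, the new layers' downtime more than cancels the redundancy gain. Whether that happens in practice depends on how reliable those layers turn out to be in your hands.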

Network block devices on clusters like Ceph, as well as distributed filesystems, have many failure modes on top of those of the underlying storage hardware because of their distributed nature. Clustered services are similar. You might also see new performance bottlenecks emerge (e.g. your network may be significantly slower, in both latency and throughput, than modern local SSD or NVMe storage), and services may become temporarily unavailable when failover happens too often.
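To put the throughput part in numbers, here is a small Python sketch that converts raw link speeds to MB/s and compares them with a local NVMe drive. The 3000 MB/s NVMe figure is an assumption (a round number, not a measurement). Protocol and replication overhead are ignored, and latency is not modeled at all.

```python
# Rough line-rate comparison: network link vs. local NVMe (sequential throughput only).
# Assumption: the NVMe figure is a hypothetical round number, not a measurement;
# real drives and real network stacks (TCP/IP, replication traffic) will differ.
ASSUMED_NVME_MB_PER_S = 3000.0


def link_mb_per_s(gbit_per_s: float) -> float:
    """Convert a raw link speed in Gbit/s to MB/s (1 MB = 10**6 bytes), ignoring protocol overhead."""
    return gbit_per_s * 1e9 / 8 / 1e6


for gbit in (1.0, 2.5, 10.0):
    mb = link_mb_per_s(gbit)
    share = mb / ASSUMED_NVME_MB_PER_S
    print(f"{gbit:>4} Gbit/s ~ {mb:7.1f} MB/s, {share:.0%} of the assumed NVMe rate")
```

Even before any overhead, a 2.5 Gbit/s link tops out at 312.5 MB/s, roughly a tenth of the assumed NVMe rate. Storage traffic that has to cross the network can therefore become the limiting factor.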

My advice would be to first run something like that on a dev/test system that sees real use for at least a few months. That way you learn what to do when things go wrong before you even consider using it in production.

g7s@lemmy.ml · 1 year ago

Thank you for the insight. I will think about it more and set up a test lab. We have 2.5 Gbit/s switches, so I hope the network won’t be a bottleneck.