I tried for many years to run a separate storage server and compute server, and I never got it working well. The problem came whenever I had to update the storage server or replace a part in it. First I had to stop every service that used the storage, and only then could I work on the storage server. If the replacement did not go well, the services could be down for a long time. At worst I had to give up for the day and continue the next day, leaving the services down for many hours.
Now it is no longer so important that every single server is reliable. I can move workloads to another server in the cluster and then work on the server that has the fault.
There is now much less downtime when I do updates. I can move a VM to another server before restarting a node. With a live migration, the VM only pauses for a few seconds.
I have one Windows failover cluster and one Linux Ceph cluster. I now know that reliable storage is very important. Without it, nothing else is going to be reliable.
I am cheating on the fencing. For me, automatic failover is not so important; it is enough to be able to move workloads manually. The services I run on the Linux cluster don't have any failover. I tried to use Corosync and Pacemaker, but it was difficult and I don't need it.
Sooner or later I believe I will have to move everything to the Linux cluster. It looks like Microsoft has stopped development on Hyper-V and Windows Server. They want everybody to switch to Azure, and they now have something called Azure Stack HCI. It is really nice of Microsoft to make it possible to install and run Windows Server Datacenter without activation. It is illegal, but I don't think they care about my homelab.
This story will be updated.