MySQL/MariaDB HA: Galera Cluster vs. DRBD replication
DRBD® and LINBIT® are trademarks or registered trademarks of LINBIT in Austria, the United States, and other countries.
Other names mentioned in this document may be trademarks or registered trademarks of their respective owners.
This is a commercial document from LINBIT and Adfinis SyGroup. There are distribution terms that apply to this document, for more information please visit http://links.linbit.com/t-and-c.
DRBD is a block-level replication facility in the Linux kernel that is widely used as a shared-nothing cluster building block. It has been included in vanilla kernels since 2.6.33, and most distributions ship the necessary userspace utilities. Many distributions also provide, as separate packages, newer DRBD versions than the one in their kernel package.
DRBD can replicate over multiple network protocols and (currently) in three modes, ranging from synchronous replication for local HA clusters to asynchronous replication for pushing data to a disaster recovery site.
DRBD is developed and supported world-wide by LINBIT; that includes most distributions and architectures, with a few SLA levels up to 24/7 email and phone reachability.
Galera Cluster is a synchronous multi-master database cluster that provides High-Availability by replicating transactions to all nodes in the cluster. By removing the overhead of a two-phase commit and moving to a certification-based replication mechanism, the solution allows near-linear scalability while still maintaining High-Availability and consistency.
Galera is developed by Codership and fully integrated and supported in solutions by MariaDB. Adfinis SyGroup is a MariaDB partner and provides assistance in implementing, monitoring, and maintaining MariaDB-based infrastructures.
This Tech-Guide compares two different High-Availability Solutions for MySQL databases; one is a block-device based replication solution, the other extends MariaDB internals to provide synchronous replication.
It shows a few differences between the two and discusses the advantages and disadvantages of each.
DRBD is a block-device based replication solution, i.e. it simply ensures that a range of storage blocks (a partition, hard disk, logical volume, etc.) is the same on two (or, with DRBD 9, more) nodes.
This means that it is completely independent of the application using that storage, and even the specific filesystem doesn’t matter – it works equally well with XFS, ext4, BTRFS, and so on.
DRBD is typically used via TCP/IP connections; with DRBD 9 an RDMA transport is available, too, which reduces the network latency and therefore raises the number of available IOPS quite a bit.
Galera Cluster works within the MariaDB binary. Via configuration settings the mysql binary loads the Galera shared library, which enables the network communication and replication to other mysql processes on remote nodes.
Currently Galera Cluster is only compatible with the InnoDB storage engine, because only this engine provides the required transaction support. Support for further storage engines will be possible as soon as they support transactions.
As DRBD is completely unaware of the filesystem and application stack above it, it simply replicates all writes to the remote node(s). This means application data, transaction logs, and indices as well as filesystem meta-data (e.g. the filesystem journal, inodes, directories).
Galera Cluster just sends the logical changes, i.e. the contents of the transaction packed into a Galera write-set, over the network. For an UPDATE statement involving thousands of rows, the write-set will be about the size of the updated records; there is no further overhead for indices or transaction logs.
The Galera Cluster communication can use either unicast (TCP) or multicast (UDP) connections. Multicast is especially well suited for big environments to further reduce network traffic.
With DRBD there is only one active Master for that database; so, as soon as the final disk write for the COMMIT is done, DRBD can deliver an acknowledgement to the application. Depending on the storage stack this might take less than 100 μsec.
In Galera Cluster the contents of a transaction are broadcast to every node in the cluster. As soon as the client commits the transaction on one node, this node broadcasts the write-set to the other nodes and they acknowledge the receipt. Each node then performs a certification of the write-set and commits the transaction locally. The originating node acknowledges the transaction to the client after the local certification has been successful.
Additional latency arises only during the broadcast step and equals the longest roundtrip time to any node in the cluster. For deployments within the same colocation this is normally below 400 μsec.
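The certification step can be illustrated with a small Python model. This is only a sketch under simplifying assumptions, not Galera's implementation: the names WriteSet and Certifier, the row-key strings, and the node names are invented for the example. It assumes that every node sees the write-sets in the same global order and that a write-set fails certification if one of its rows was changed by a write-set the transaction had not yet seen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteSet:
    """Toy write-set (hypothetical structure): the rows a transaction changed."""
    origin: str        # node the client committed on
    last_seen: int     # global seqno the transaction had seen when it ran
    keys: frozenset    # identifiers of the modified rows


class Certifier:
    """Every node runs the same deterministic test on the same ordered stream."""

    def __init__(self):
        self.seqno = 0           # last global sequence number processed
        self.last_writer = {}    # row key -> seqno of the last certified change

    def certify(self, ws):
        self.seqno += 1
        # Conflict: a row was changed by a write-set this transaction never saw.
        if any(self.last_writer.get(k, 0) > ws.last_seen for k in ws.keys):
            return False         # the transaction is rolled back
        for k in ws.keys:
            self.last_writer[k] = self.seqno
        return True


node = Certifier()
a = WriteSet("node1", last_seen=0, keys=frozenset({"t1:42"}))
b = WriteSet("node2", last_seen=0, keys=frozenset({"t1:42"}))  # same row as a
c = WriteSet("node3", last_seen=0, keys=frozenset({"t1:7"}))   # different row
results = [node.certify(ws) for ws in (a, b, c)]
print(results)
```

Because the test depends only on the ordered stream of write-sets, each node reaches the same verdict without a further network round trip; in this model the second of two concurrent changes to the same row is the one that gets rejected.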
DRBD supports synchronous and asynchronous replication; the latter is useful for disaster-recovery across long distances. In that case there’s a separate product (the DRBD Proxy), which supports compression of the replication data stream, and so reduces the amount of bandwidth required.
Galera Cluster can only be used synchronously, but standard MariaDB asynchronous replication slaves can be attached to each Galera Cluster node. As there is an additional latency associated with each commit, there is a limit to the number of transactions that can be processed in WAN deployments. A rule of thumb for the maximum number of transactions is 1/RTT trx/s.
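To get a feeling for these numbers, the following Python sketch applies the two statements from above: the added commit latency is taken as the longest round trip to any peer, and the throughput limit as 1/RTT. The node names and round-trip times are made-up values for illustration, not measurements.

```python
def added_commit_latency_us(peer_rtts_us):
    """Added latency per commit: the longest round trip to any peer (in μsec)."""
    return max(peer_rtts_us.values())


def max_trx_per_second(rtt_us):
    """Rule of thumb from the text: at most 1/RTT transactions per second."""
    return 1_000_000 / rtt_us


# Hypothetical round-trip times in microseconds.
lan = {"node2": 250, "node3": 400}
wan = {"node2": 250, "dr-site": 50_000}

for name, rtts in (("LAN", lan), ("WAN", wan)):
    latency = added_commit_latency_us(rtts)
    limit = max_trx_per_second(latency)
    print(f"{name}: +{latency} μsec per commit, about {limit:.0f} trx/s")
```

With these assumed values, a 400 μsec round trip allows about 2,500 trx/s, while a single remote node with a 50 msec round trip lowers the limit to about 20 trx/s.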
DRBD is typically used in an Active/Passive setup, i.e. each DRBD resource is only active on one node. That means that only one node will have access to the filesystem storing a database; this node will have to do all the statement parsing, data fetching, deciding, and writing.
Galera Cluster is a pure Multi-Master solution – each node will provide its full resources. The only performance impact is caused by broadcasting the transaction to all nodes. Each node can be used for read-only queries, so read performance scales linearly. Because of optimistic locking some degree of write scalability can be achieved, but this depends on the application structure and at best you’ll be able to increase write performance by about 15%.
In an HA environment you have to plan ahead for problems, too.
If the active node in a DRBD environment goes down (for whatever reason), the Cluster stack (typically Pacemaker with Heartbeat or Corosync) has to detect the problem and switch the services over to another node. In the worst case this will entail a filesystem check, a database recovery, and then the time required to get the caches hot again.
In Galera Cluster, when a single node goes down, the remaining nodes in the cluster continue working without interruption. A client currently connected to the failing node would retry the connection via a load balancer, but other than that would not notice any interruption. When the crashed node gets started up again, it might need a filesystem check and database recovery too, and would not be available for load balancing purposes during that time.
After a crash, the crashed node has to make sure that it gets the latest data.
DRBD, being in the block device layer, keeps a bitmap of dirty blocks, and will simply copy them over as soon as the DRBD connection is established again after the crash. This copying is done in on-disk order, and performance is only limited by the storage and network hardware. With a 10 GBit network and FusionIO cards you should be able to drive 1.2 GByte/sec.
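What such a retry looks like from the client side can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a real load balancer: connect_with_failover, fake_connect, and the node names db1 to db3 are hypothetical, and the connect callable stands in for whatever database driver is actually used.

```python
def connect_with_failover(nodes, connect):
    """Try each node in turn and return (node, session) for the first success."""
    failed = []
    for node in nodes:
        try:
            return node, connect(node)
        except ConnectionError:
            failed.append(node)   # remember the node and try the next one
    raise ConnectionError(f"no cluster node reachable, tried: {failed}")


def fake_connect(node):
    """Stand-in for a real driver call; pretends that db1 has crashed."""
    if node == "db1":
        raise ConnectionError("db1 is down")
    return f"session@{node}"


chosen, session = connect_with_failover(["db1", "db2", "db3"], fake_connect)
print(chosen, session)
```

Since every node accepts reads and writes, the order of the node list only decides which node is tried first.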
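The idea of a dirty-block bitmap can be shown with a toy model in Python. It is only a sketch of the principle and not DRBD's data structure: the class DirtyBitmap is invented for this example, it keeps one flag per block in a bytearray, and the two Python lists stand in for the block devices on both nodes.

```python
class DirtyBitmap:
    """Toy stand-in for a dirty-block bitmap: one flag per block."""

    def __init__(self, n_blocks):
        self.flags = bytearray(n_blocks)

    def mark(self, block):
        self.flags[block] = 1

    def resync(self, source, target):
        """Copy dirty blocks in ascending (on-disk) order; return their numbers."""
        copied = []
        for block, dirty in enumerate(self.flags):
            if dirty:
                target[block] = source[block]
                self.flags[block] = 0
                copied.append(block)
        return copied


primary = [b"old"] * 8
secondary = list(primary)
bitmap = DirtyBitmap(len(primary))

# The peer is down: writes reach only the primary and are flagged as dirty.
for block in (5, 2, 5):
    primary[block] = b"new"
    bitmap.mark(block)

# The connection is back: only the flagged blocks are copied over.
copied = bitmap.resync(primary, secondary)
print(copied, secondary == primary)
```

A block that was written several times while the peer was away is still copied only once, and the order of the copy follows the block numbers rather than the order of the writes.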
Galera Cluster has two ways of updating data on a joining node. If the node was already a member of the cluster before and only left the cluster for a short period of time it tries to perform an Incremental State Transfer (IST) by pulling changes from the write-set cache of another node in the cluster.
If no other node in the cluster can provide the required changes for performing an IST from its write-set cache, or a new node joins the cluster, a Snapshot State Transfer (SST) will be performed. This means that all the data of the database will be transferred to the joining node. Galera selects a so-called Donor node which is going to be the source of the transfer. As being a Donor can have a serious impact on performance, Donor nodes are often excluded from load balancing to ensure consistent read and write performance within the cluster.
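The choice between the two transfer types can be modelled roughly as follows. This Python sketch rests on assumptions: it describes each write-set cache only by the first and last sequence number it still holds, the function pick_transfer and the node names are invented, and the donor chosen for an SST is arbitrary here. Further checks that a real cluster performs are left out.

```python
def pick_transfer(joiner_seqno, donor_caches):
    """Return (transfer type, donor) for a joining node.

    joiner_seqno: last sequence number the joiner has applied, or None for a
                  new node without any data.
    donor_caches: donor name -> (first, last) seqno in its write-set cache.
    """
    if joiner_seqno is not None:
        needed = joiner_seqno + 1      # first write-set the joiner is missing
        for name, (first, last) in sorted(donor_caches.items()):
            if first <= needed <= last + 1:
                return ("IST", name)   # the cache still covers the gap
    # Nobody can serve the gap, or the node is new: full copy (arbitrary donor).
    return ("SST", sorted(donor_caches)[0])


caches = {"node2": (900, 1500), "node3": (1200, 1500)}
short_outage = pick_transfer(1000, caches)
long_outage = pick_transfer(500, caches)
new_node = pick_transfer(None, caches)
print(short_outage, long_outage, new_node)
```

In this model, a larger write-set cache on the donors means that a node can stay away longer and still rejoin with an IST instead of a full SST.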
Here is a concluding table, based on the discussion above.
| Topic | DRBD | Galera Cluster | Advantage |
|---|---|---|---|
| Network traffic | all changed disk blocks | only transaction contents | Galera |
| Latency | μsec to msec, depends on storage system | msec, because of Userspace/Kernel transitions | DRBD |
| Replication | synchronous or asynchronous (disaster-recovery) | synchronous or asynchronous | DRBD/Galera |
| Load Balancing | Block level data can be read from other nodes | full Multi-Master | Galera |
| Failover | via Cluster stack; seconds to minutes of downtime | other nodes continue without interruption | Galera |
| Resynchronization | Only changed disk blocks, block-device order | IST/SST | – |
The DRBD Project Page. Located at http://drbd.linbit.org, with lots of information, including a Users’ Guide that weighs in (at the last count) at 172 pages in PDF format – one of the most extensive project documentations in the Open Source world!
The LINBIT Home Page. Starting at http://www.linbit.com, this answers all questions about paid support from the developers. An overview of supported platforms, SLAs, and prices is at http://www.linbit.com/en/products-and-services/drbd-support/pricing?id=358
The Adfinis SyGroup Home Page. “We are the Linux Engineers” is the proud slogan on https://adfinis.com. With more than 15 years of Linux experience, Adfinis SyGroup is the first address in Switzerland.
The MariaDB Home Page. The MariaDB Corporation https://mariadb.com is the main driving force behind the development of MariaDB and provides support and assistance for MariaDB products together with its partners.
RAID controller with BBU, FusionIO cards, some SSDs.
Although having an Active/Active cluster by spreading the active resources across the nodes is recommended.
Splitting one database doesn’t work, but you can run multiple mysql processes in a DRBD cluster, each with its own database and separate service-IP-address.
E.g., database hot spots.
All these can be mitigated by a good storage system. One of our customers could change this time from approx. 45 minutes to 30 seconds by switching the storage from hard disks to FusionIO cards.
For example, because of a reboot or upgrade and subsequent restart of MariaDB.