Sunday, November 28, 2010

A Clustered Samba Fileserver

A few weeks ago, I started this project. Between a lack of time, the many mistakes I noticed along the way and a move from Ubuntu 10.10/VirtualBox to Fedora 14/kvm+qemu, I put it aside for a while.

Time to get back to it!

The test setup is made of:

  • 2 virtual machines running on kvm+qemu
  • Fedora 14

1. The need for a clustered CIFS Fileserver


Sad to say, but Windows file sharing is probably the most widely used way to share files over a network. Consumer-grade NASes seldom offer anything else, and a lot of clients can only use NetBIOS/CIFS to access a network share.

But NetBIOS/CIFS also comes at a price: a certain amount of CPU and memory per client. That in itself is not an issue, but in environments where hundreds or thousands of clients access a share simultaneously, it can lead to serious performance degradation.

At least two options are offered:
  1. Buy and install a beefier file server
  2. Add multiple nodes accessing the same filesystems, each presenting the same share

Solution 1 has an obvious limit: as the number of simultaneous clients grows again, another beefier server will be needed, and another ... until the hardware cost prohibits any further upgrade.

Solution 2, on the other hand, while technically more complex, can scale much further. Of course, at a certain point, the limiting factor will most likely be the IO/s on the backend storage.

This solution comes in several variants: using a single file server as the backend shared amongst all the nodes, using centralized backend storage (iSCSI, Fibre Channel, ...) shared amongst all nodes, or using local storage replicated to the other nodes.

All have advantages and disadvantages. In this document, I'll look at the third option, as it's the most likely choice for SMEs, although certain vendors, such as Promise or Compellent, have really cheap SANs.

I also find this solution more scalable: if you need more space - or performance - you can add a disk in each node, replicate it and add it to the Volume Group you share.


2. Installing and configuring DRBD

As mentioned, we need a mechanism to replicate a local block device (i.e., a partition or a physical device) to an identical device in another computer.

DRBD does exactly that. In other words, DRBD makes sure that every write request issued to the local block device is mirrored to a block device located on another computer, based on its configuration.

In itself, DRBD is not a cluster product (I repeat: DRBD is not a cluster product!) but it's a piece in the puzzle.

See DRBD's home page [1] for an illustration of how it works. It also has some very fine documentation.

Most distributions have a package available to install DRBD. Also, DRBD has been in the mainline kernel since 2.6.33, so recent kernels already ship what is needed to support it.

DRBD works by creating resources, which specify which local device is to be used, which peer(s) take part in the replication and which device to use on the peer. Once this is done, the resource can be activated and will start to work as a local device.


Initially, this worked by having a node as the primary, and the other as the secondary. However, it's now possible to have both nodes be primary, meaning they are both RW. This has a serious implication in terms of sensitivity to split-brain situation: if a write is issued on both nodes while the network connectivity is lost, which one should be propagated? Both? What if they both concern the same location on the physical device?

In an active-active scenario, we don't have much choice: both nodes have to be primary.

In my setup, the resource is r0 and the local device to back it is vda3. Here is my resource file.

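resource r0 {
  # One "on" section per node; the name must match the node's hostname.
  on linux-nas01 {
    device    /dev/drbd1;          # replicated device seen by the system
    disk      /dev/vda3;           # local backing partition
    address   192.168.0.18:7789;   # replication address and port
    meta-disk internal;            # metadata kept on the backing device
  }
  on linux-nas02 {
    device    /dev/drbd1;
    disk      /dev/vda3;
    address   192.168.0.19:7789;
    meta-disk internal;
  }
  startup {
    become-primary-on both;        # promote both nodes at startup
  }
  net {
    allow-two-primaries;           # required for active-active
    # Automatic split-brain recovery policies:
    after-sb-0pri discard-zero-changes;   # no primary: keep the side that has changes
    after-sb-1pri discard-secondary;      # one primary: drop the secondary's changes
    after-sb-2pri disconnect;             # two primaries: no automatic recovery
  }
}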

The startup and net directives were taken from DRBD's website. The documentation has sections specific to Red Hat clusters, which I encourage you to read [2].

On Fedora, an important step is to either open the ports configured in the resources on the firewall, or to disable the firewall. Since these are test hosts, I opted for the latter.

service iptables stop
chkconfig iptables off
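
If you would rather keep the firewall running, the other option is to open the replication port to the peer only. The following is a minimal sketch, not a tested recipe for these hosts: drbd_fw_rules is a hypothetical helper that only prints the iptables commands so they can be reviewed first, using the address and port from the r0 resource.

```shell
#!/bin/sh
# Hypothetical helper (not part of DRBD or Fedora): print, without running
# them, the commands that would let one peer reach the DRBD port.
drbd_fw_rules() {
    peer_ip=$1   # address of the other node
    port=$2      # port from the "address" line of the resource
    echo "iptables -I INPUT -p tcp -s $peer_ip --dport $port -j ACCEPT"
    echo "service iptables save"
}

# On linux-nas01, allow linux-nas02 (values taken from the r0 resource).
drbd_fw_rules 192.168.0.19 7789
```

Review the output, then run it as root on each node, each time with the other node's address.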

And make sure DRBD is on, and will start at boot time.

chkconfig drbd on
service drbd start
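
In my case, I had to take an extra step, as DRBD wouldn't activate the resource: neither peer was formally the primary and neither had UpToDate status. The second command below forces one node to become primary and to overwrite its peer, so run it on one node only.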

drbdadm up r0
drbdadm -- --overwrite-data-of-peer primary all

After that, vda3 started replicating from one node to the other. The command "drbd-overview" will indicate that a sync is in progress. Let it complete.
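
To check from a script that the sync is over, one option is to look at the disk states in /proc/drbd. This is a minimal sketch, assuming the DRBD 8.3 status layout, where each device line starts with its minor number and carries a ds:local/peer field; drbd_in_sync is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given DRBD minor reports both
# disks as UpToDate. Assumes the /proc/drbd layout of DRBD 8.3.
drbd_in_sync() {
    status_file=${1:-/proc/drbd}   # file to read, /proc/drbd by default
    minor=${2:-1}                  # 1 because the resource uses /dev/drbd1
    grep -E "^ *${minor}: " "$status_file" | grep -q 'ds:UpToDate/UpToDate'
}

# Usage, on either node:
#   drbd_in_sync && echo "r0 is in sync"
```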

3. Installation of the cluster and tools

Next: we will take care of the cluster components on both nodes.

Fedora uses the Red Hat Cluster Suite, composed of various pieces, such as cman, corosync and others.

Important note: if you don't want to edit a few configuration files, make sure that the node names you give match the hostnames AND the DNS entries.

First, let's create the cluster, named NASCluster.

ccs_tool create NASCluster

And let's create a fence mechanism [3]. This guarantees that if a node misbehaves, it will be prevented from being part of the cluster or from accessing cluster resources. Note that fence_manual relies on an operator to confirm the fence by hand, so it only makes sense on a test setup like this one.

ccs_tool addfence linux-nas01 fence_manual
ccs_tool addfence linux-nas02 fence_manual
 
And let's add our nodes to the cluster.

ccs_tool addnode linux-nas01 -n 1 -v 1 -f linux-nas01
ccs_tool addnode linux-nas02 -n 2 -v 1 -f linux-nas02

Last, let's check that everything is as we expect it to be.

ccs_tool lsnode
ccs_tool lsfence

Replicate /etc/cluster/cluster.conf to the other node, e.g. using scp.

And start cman on both nodes.

service cman start

Congratulations. At this point, the commands cman_tool status and clustat should report some information showing that your cluster is up.

4. Configuring clvmd, cluster LVM daemon

This one is a sneaky b....d. It took me almost a day to figure out what was wrong.

Every time I started clvmd, cman would terminate, and I could neither restart it nor unfence the node.

It appears that clvmd's default mechanism to ensure cluster integrity is "auto", which tries to detect whether corosync, cman or another tool should be used. I assume that my corosync configuration was incomplete. But it's possible to force cman to be used.

In /etc/sysconfig/clvmd - you may have to create the file - place the following line.

CLVMDOPTS="-I cman"

Next, edit /etc/lvm/lvm.conf and change the locking_type to 3, which selects LVM's built-in clustered locking. This is a recommendation from DRBD's documentation.

Also, make sure LVM knows it's a clustered implementation.

lvmconf --enable-cluster
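
A quick way to confirm the result on each node is to read the value back from lvm.conf. This is a minimal sketch: lvm_locking_type is a hypothetical helper, and it assumes the setting is written as "locking_type = N" on a line of its own, as in the stock file.

```shell
#!/bin/sh
# Hypothetical helper: print the active locking_type from an lvm.conf file.
# Commented-out lines are skipped because they start with '#'.
lvm_locking_type() {
    conf=${1:-/etc/lvm/lvm.conf}
    sed -n 's/^[[:space:]]*locking_type[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$conf"
}

# Usage, on each node:
#   [ "$(lvm_locking_type)" = "3" ] || echo "clustered locking is not enabled"
```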

You should now be all set to start clvmd.

service clvmd start

If everything goes OK, clvmd should return, saying that nothing was found. This is normal.

Let's finish the cluster install by starting rgmanager.

service rgmanager start

Don't forget to do the clvmd configuration and to start the daemons on both nodes.

5. Creation of the LVM configuration

Our goal, after all, is to make the device available on the cluster. So, let's do it after a quick recap of what LVM is.

Logical Volume Management allows one to group different "physical volumes" into volume groups, in which logical volumes can be carved out as needed. A physical volume can be a physical device such as a hard drive, a logical device such as a partition or an iSCSI/FC target, or even a virtual device such as a DRBD device.

A big advantage is the ability to quickly add space to a logical volume by giving it more extents, and growing the filesystem.


If down the road you notice that your /var partition is running out of space, either you have some free extents in the Volume Group that you can give to that Logical Volume, or you can add a new disk, add it as a Physical Volume to the Volume Group and allocate the newly available extents to the Logical Volume.

In our case, the DRBD resource, accessible as /dev/drbd1, will be treated as a physical volume and added to a volume group called drbd_vg, from which 900MB will be allocated to the Logical Volume drbd_lv.

pvcreate /dev/drbd1
vgcreate drbd_vg /dev/drbd1
lvcreate -n drbd_lv -L 900M /dev/drbd_vg

If everything went right, you can issue pvscan, vgscan and lvscan on the second node, and they should list the volumes you just created on the first node. It may be necessary to refresh the clvmd service.


Side note: an issue I got with the default start order ...

At one point, I stopped both nodes and restarted them, only to discover (horrified) that the actual partition, vda3, was then used rather than the DRBD device. The reason is simple: the lvm2-monitor service starts, by default, before the drbd service.

I still have to go through the documentation, but as my setup didn't use LVM for anything other than the clustered file system, I got away with making sure drbd started before lvm2-monitor. HOWEVER ... lvm2-monitor also starts at run-level 2, which drbd is not supposed to. So I disabled lvm2-monitor in run-level 2.
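
To see which of the two services starts first in a given run-level, you can compare the numbers in their SysV start links. This is a minimal sketch, assuming the usual SNNname links in the run-level directories; start_priority is a hypothetical helper, and the chkconfig line in the comments is one way to do the run-level 2 change described above.

```shell
#!/bin/sh
# Hypothetical helper: print the SysV start priority (the NN in SNNname)
# of a service in a given run-level directory.
start_priority() {
    rcdir=$1   # e.g. /etc/rc3.d
    svc=$2     # e.g. drbd or lvm2-monitor
    ls "$rcdir" | sed -n "s/^S\([0-9][0-9]\)${svc}\$/\1/p"
}

# Usage: the service with the lower number starts first.
#   start_priority /etc/rc3.d lvm2-monitor
#   start_priority /etc/rc3.d drbd
# Disabling lvm2-monitor in run-level 2:
#   chkconfig --level 2 lvm2-monitor off
```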


6. Creation of GFS2 filesystem

The end is close. Let's now create a GFS2 file system. GFS (Global File System) is a clustered file system developed by Red Hat. In short, clustered file systems have the mechanisms to ensure the integrity (no two nodes should write to the same place, and no node should consider some space free while another node is writing to it) and the consistency of the file system. This seems kind of obvious but, trust me, the details are really gory.

The file system is created using the well-known command mkfs:

mkfs -t gfs2 -p lock_dlm -j 2 -t NASCluster:opt /dev/drbd_vg/drbd_lv
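
Pay special attention to the second '-t' parameter. The first one tells mkfs which file system type to create; the second one is handed to the GFS2-specific tool and specifies the lock table to use, which should be named <clustername>:<fsname>. If the part before the colon doesn't match your cluster name, you won't be able to mount the file system.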
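
Since a mismatch only shows up at mount time, it can be worth checking the name before running mkfs. This is a minimal sketch: locktable_matches is a hypothetical helper that compares the part before the colon with the cluster name given earlier to ccs_tool create.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the cluster part of a GFS2 lock table
# name (<clustername>:<fsname>) equals the expected cluster name.
locktable_matches() {
    table=$1     # e.g. NASCluster:opt
    cluster=$2   # e.g. NASCluster
    [ "${table%%:*}" = "$cluster" ]
}

locktable_matches NASCluster:opt NASCluster && echo "lock table OK"
```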

If everything went right, mount the file system on both nodes.

mount -t gfs2 /dev/drbd_vg/drbd_lv /opt

Try creating a file on node 2; you should see it on node 1.

The GFS2 service depends on an entry present in fstab. When you created the file system, a UUID was displayed. Use it to add a line in /etc/fstab:

UUID=cdf6fd4a-4cb2-7883-e207-5477e44d688e /opt              gfs2      defaults      0 0

This will mount my file system, with type gfs2, under /opt, with the default options, no dump and no fsck needed.
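
To make the six fields of that fstab line explicit, here is a minimal sketch that splits an entry and labels each field; describe_fstab_entry is a hypothetical helper and the labels are mine, not fstab terminology.

```shell
#!/bin/sh
# Hypothetical helper: split an fstab entry on whitespace and print each
# of its six fields with a label.
describe_fstab_entry() {
    read -r spec mountpoint fstype options dump pass <<EOF
$1
EOF
    printf 'device=%s\nmountpoint=%s\ntype=%s\noptions=%s\ndump=%s\nfsck_pass=%s\n' \
        "$spec" "$mountpoint" "$fstype" "$options" "$dump" "$pass"
}

describe_fstab_entry 'UUID=cdf6fd4a-4cb2-7883-e207-5477e44d688e /opt              gfs2      defaults      0 0'
```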

And the last step: let's start the gfs2 service.

service gfs2 start

7. SAMBA

In /etc/samba/smb.conf, we have only to present the mount point as a share, and restart the daemon on both nodes.

[opt]
    comment=Test Cluster
    path=/opt
    public=yes
    writable=yes
    printable=no

You may have to adjust a few other options, such as authentication or workgroup name.

After the restart, you should be able to access the same files whether you connect to the first or the second node.

Side note: samba and clustering.

There is however a catch: Samba is not meant to be a clustered application.

When a file on a Samba share is accessed, a lock is created and stored in a local database, a tdb file. This file is local to each node and not shared, which means that a node has absolutely no idea of which locks the other nodes hold.

There are a few options to do a clustered install of the samba services, presented in [4].

8. Accessing the resource through a unique name

And the last piece. If we were to ask every user to choose between node 1 and node 2, they would probably either complain or all use the same node.

A small trick is needed to make sure the load is spread across both nodes.

The easiest way is to publish multiple A records in your DNS.

cluster IN A ip1
cluster IN A ip2
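
With more than two nodes, the records can be generated rather than typed. This is a minimal sketch: round_robin_records is a hypothetical helper, and the addresses are the ones from the r0 resource, on the assumption that clients reach the nodes on those same addresses.

```shell
#!/bin/sh
# Hypothetical helper: print one A record per address for the same name,
# in the zone-file form used above.
round_robin_records() {
    name=$1
    shift
    for ip in "$@"; do
        printf '%s IN A %s\n' "$name" "$ip"
    done
}

round_robin_records cluster 192.168.0.18 192.168.0.19
```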

Other ways are possible, such as a home-made script that returns the list of currently active nodes minus the ones that are already too loaded, or one that reports only the least loaded node, and so on.
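
As an illustration of the "least loaded" idea, here is a minimal sketch of the selection step only. It assumes some other mechanism has already collected one "node load" pair per line; least_loaded_node is a hypothetical helper and the load figures in the example are made up.

```shell
#!/bin/sh
# Hypothetical helper: read "node load" lines on stdin and print the name
# of the node with the lowest load.
least_loaded_node() {
    awk 'NR == 1 || $2 + 0 < min { min = $2 + 0; node = $1 }
         END { if (NR) print node }'
}

# Made-up sample values, for illustration only.
printf '%s\n' 'linux-nas01 0.80' 'linux-nas02 0.35' | least_loaded_node
```

The printed name could then be turned into the single A record handed out for the cluster name.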






Bibliography

[1] DRBD's home page
[2] DRBD's documentation on Redhat Clusters
[3] Explanation on cluster fencing
[4] Clustered samba


Thanks

Special thanks to the LINBIT team, to both the Fedora and Red Hat teams and to everyone involved in Linux and clustering.

As usual, drop me a line if you have any questions or comments.