
In this post, I'm going to take a look at a redundant Ceph storage cluster running on CentOS 7.  Although not completely accurate, I sort of think of Ceph as a redundant iSCSI option or even a 'poor man's' SAN because it can present block-level storage to clients.  Ceph is an extremely powerful enterprise-class clustering platform designed for performance, reliability, and scalability.

Ceph can provide object storage, a POSIX-compliant file system, or block-level storage utilizing RADOS block devices (RBD).  As mentioned, I'm going to be working with block-level storage.  The Ceph file system looks very promising, but as of this writing it appears to be a bit questionable as to whether or not it is ready for a production environment.  It does appear to be actively developed, so hopefully that will soon change.

As with all my posts, this is not intended to be a guideline for setting up a production environment.  Instead, it is intended to introduce you to Ceph and walk you through the setup of your first Ceph cluster.  I should also note that Ceph can be set up either in an automated fashion using ceph-deploy, which is mostly what I will be using, or manually.  Even with that in mind, there are several ways to get things done, so make sure to take time to read the documentation as Ceph has FAR MORE features and functions than I am touching on here.

As with any new technology, it is probably a good idea to start off with some definitions:

Object Storage Device/Daemon (ceph-osd) - The object storage device is a physical or logical storage unit.  It is also used to refer to the software daemon that interfaces with the physical or logical storage.  For redundancy, you should have an odd number of OSDs and more than one.

Monitor Daemon (ceph-mon) - The monitor daemon is used to monitor the Ceph storage for important information such as cluster membership, configuration, and state.  The monitor also maintains a master copy of the cluster map.  For redundancy, you should have an odd number of monitors and more than one.  Ceph doesn't recommend running these on the same node as an OSD, but I will do so for our demonstration.  Running them on the same node can cause performance issues under certain conditions on a production cluster.

Metadata Server Daemon (ceph-mds) - This daemon manages the file system namespace and coordinates access to the shared cluster.  Ceph block devices such as I will be setting up below don't use the metadata server daemon, but I thought I would mention it anyway.  For redundancy, you should have an odd number of metadata daemons and more than one.

Pools - Pools or storage pools are logical groups that create a mechanism to break up and manage storage.  They can provide resilience, placement groups, CRUSH rules, snapshots, and ownership.  The three default pools created are data, metadata, and rbd.

Placement Groups - Used to aggregate objects within a pool.

Block Device - Any device that allows reading or writing data blocks.  Think "hard drive" and you probably get the general idea, although in reality it could be multiple drives or portions of drives.  This is a general term rather than being specific to Ceph.

rbd - This is a utility for manipulating RADOS block device (RBD) images.  It is used by the Linux rbd driver and the rbd storage driver for Qemu/KVM.  We will use this command on our CentOS Linux client.

Journal - Ceph uses a special file system as a journal.  It writes data to the journal before writing it to the OSDs for both speed and consistency.  In a production environment, it is recommended to put your filestores on slower drives and use a separate, smaller, fast drive (such as an SSD) for the journal.

CRUSH - An algorithm that determines how to store and retrieve data by computing data storage locations.  CRUSH makes use of a map of your cluster which contains a list of OSDs, a list of 'buckets' for aggregating the devices into physical locations, and a list of rules that tell CRUSH how it should replicate data in a cluster's pools.  When you deploy Ceph with ceph-deploy, a default CRUSH map is generated.  The default CRUSH map is fine for a Ceph sandbox environment but should probably be tweaked for a production deployment.
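
If you're curious what the generated CRUSH map actually contains, you can dump and decompile it once the cluster is up and running.  This is just an optional peek under the hood; the file names here are arbitrary:
# ceph osd getcrushmap -o crushmap.bin
# crushtool -d crushmap.bin -o crushmap.txt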

RADOS - Reliable Autonomic Distributed Object Store, an open-source object storage service that is an integral part of Ceph.

For this demonstration, I will be creating a 3-node cluster made up of virtual machines as follows:
- ceph1.theharrishome.lan - 192.168.1.185
- ceph2.theharrishome.lan - 192.168.1.186
- ceph3.theharrishome.lan - 192.168.1.187

All of these systems were fully updated CentOS 7 minimal installs and SELinux was disabled.  The NTP daemon was set up using the following commands:
# yum -y install ntp
# systemctl enable ntpd
# ntpdate pool.ntp.org
# systemctl start ntpd
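
If you want a quick sanity check that the NTP daemon is actually talking to its peers (time sync matters to the Ceph monitors), ntpq can show you:
# ntpq -p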

We need to add a few rules to our firewall as follows:
# firewall-cmd --permanent --add-port=6789/tcp
# firewall-cmd --permanent --add-port=6800-7300/tcp
# firewall-cmd --reload
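
To double-check that the ports are now open, you can list them:
# firewall-cmd --list-ports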

I made the following addition to /etc/hosts so I could be assured of name resolution:
192.168.1.185 ceph1.theharrishome.lan ceph1
192.168.1.186 ceph2.theharrishome.lan ceph2
192.168.1.187 ceph3.theharrishome.lan ceph3

Ceph has the ability to configure the nodes from an 'admin node' using a tool called 'ceph-deploy'.  To do so, we need to set up a user on each node.  I did that as follows on all three nodes:
# useradd ceph

Then set the password:
# passwd ceph

We need to allow the user to have sudo rights.  We also need to disable 'requiretty' for the user.  We can do both by issuing the visudo command ('# visudo').  Find the line that contains 'Defaults requiretty' and change it to 'Defaults:ceph !requiretty'.  Then we add 'ceph ALL = (root) NOPASSWD:ALL' to the bottom of the file and save it.
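
When you are done, the relevant entries in the sudoers file should look roughly like this (exact placement within the file isn't important):

Defaults:ceph !requiretty
ceph ALL = (root) NOPASSWD:ALL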

For this setup, I will use node1 as the admin node upon which I will run ceph-deploy.  We need to change to the ceph user on that node:
# su - ceph

We need to set up SSH keys so that node1 can SSH to node2 and node3 without the need to enter a password.  Actually you don't have to but it is much easier.  From node1:
# ssh-keygen
Just hit enter to accept the default file location (but do make note of it), and hit enter again to skip setting a passphrase.

Then, to copy the contents of the newly created public key (for example /home/ceph/.ssh/id_rsa.pub) to each node, we will use the ssh-copy-id command as follows:
# ssh-copy-id ceph@ceph2
# ssh-copy-id ceph@ceph3

Now test to verify you can SSH from node1 to node2 and node3 as follows:
# ssh ceph@ceph2
# ssh ceph@ceph3

Let's modify our ~/.ssh/config so that ceph-deploy logs in as the ceph user we just created when it connects to each node:

Host ceph1
    Hostname ceph1
    User ceph
Host ceph2
    Hostname ceph2
    User ceph
Host ceph3
    Hostname ceph3
    User ceph

Make sure the permissions are set correctly on the new config file:
# chmod 644 ~/.ssh/config

Now to install some software.  We need to add the Ceph repo to yum with the following command, run on the admin node only (node1):
# sudo rpm -Uvh http://download.ceph.com/rpm/rhel7/noarch/ceph-release-1-1.el7.noarch.rpm 
This repository provides the LTS or Long Term Support version.  As of this writing, that is the 'Hammer' release.  You can find out more about Ceph versions at http://docs.ceph.com/docs/master/releases/.

From the admin-node (node1 for this example), install ceph-deploy with the following command and note the use of 'sudo' since we are still logged in as the ceph user we created earlier:
# sudo yum -y install ceph-deploy

Now let's deploy Ceph to all three of our nodes.  We do not ever run 'ceph-deploy' with sudo:
# ceph-deploy install ceph1 ceph2 ceph3
You will probably see a lot of information scroll by which is to be expected.  Here is a screen shot of the end of the command output:

Give the ceph group ownership of the /etc/ceph directory where the config files will be stored:
# sudo chgrp ceph /etc/ceph

Let's make sure any new files added to this directory keep the same group access by setting the setgid bit, and also allow the ceph group to have write access:
# sudo chmod g+ws /etc/ceph

Change to that directory:
# cd /etc/ceph

Again, from the admin node, let's create our new cluster with a monitor on each node:
# ceph-deploy new ceph1 ceph2 ceph3
Upon successful completion of this command, you should see ceph.conf, ceph.mon.keyring, and ceph.log in the /etc/ceph directory.

Now we need to make a change to /etc/ceph/ceph.conf on the admin node.  Open the file in your favorite text editor and add the following two lines to the bottom, making sure to save it when done:

[mon]
mon clock drift allowed = .2500

I have found this setting necessary, as sometimes the default setting is just too sensitive to time differences between nodes, even when they are all synced to the same time server.  This setting allows the time to be off by up to a quarter of a second between nodes.

Now, to create the initial monitors from the admin node, we run the following:
# ceph-deploy --overwrite-conf mon create-initial
When this command has successfully completed, you will see the additional files ceph.bootstrap-mds.keyring, ceph.bootstrap-rgw.keyring, ceph.bootstrap-osd.keyring, and ceph.client.admin.keyring in the /etc/ceph directory.
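
Before moving on, you can optionally confirm that the monitors have formed a quorum:
# sudo ceph mon stat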

We are ready to deal with storage.  Prepare your drive by partitioning it as you see fit.  For this, I will be using the first partition on vdb (or /dev/vdb1), which is 5 GB on each node.  BEWARE that the following command will wipe your data on the chosen partition!  I might also point out a nice tool that you may not be aware of for wiping a partition, and that is 'wipefs'.  It is probably already on your system, so you can use it by issuing a command such as '# wipefs -a /dev/vdb1', but be aware you won't get a warning prompt and it will remove your data!  Note also that you should still be in the /etc/ceph directory and you should still be acting as the ceph user.
# ceph-deploy osd prepare ceph1:vdb1 ceph2:vdb1 ceph3:vdb1
You will likely see some warnings in the output of this command for each server because the partition is already present.  That is OK so long as it finishes successfully on each node with output such as follows:

And now to activate the OSDs:
# ceph-deploy osd activate ceph1:vdb1
# ceph-deploy osd activate ceph2:vdb1
# ceph-deploy osd activate ceph3:vdb1
You may notice above that I chose not to activate all nodes with a single command.  The reason for this is that you will most likely need to repeat each of the above commands a second time.  Simply hitting the up arrow and enter to run the exact same command will generally succeed.  The reason for this has to do with the fact that I am not specifying a separate partition for the journal.  For production, you should separate the OSD and the journal, with the journal being on an SSD drive for maximum performance.
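
For reference, ceph-deploy accepts an optional journal device in the form host:data:journal, so if you did have a separate SSD partition available (vdc1 here is purely hypothetical), the prepare step would look something like this:
# ceph-deploy osd prepare ceph1:vdb1:vdc1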

Here is what the error looks like on the first run:

And the very next run where it is successful:

Our new cluster should now begin to synchronize.  You can use two commands to find out information on the status:
# sudo ceph health
# sudo ceph status

Here is a screenshot of each:

You can see by the output that our new cluster doesn't look happy quite yet.  The goal is for the health to be 'HEALTH_OK' and 'active+clean'.  As you can see, that is not the case.  I did this demonstration using the long term support 'Hammer' release of Ceph, at least it is as of this writing.  In that version, the default weight for the OSDs in the CRUSH map is calculated based on terabytes of storage, with 1 terabyte yielding a weight of 1, half a terabyte yielding a weight of .5, and so on.  Because my storage is only 5 GB, it rounds that to a weight of 0.  A weight of 0 means the OSD will never be used for storage and therefore the cluster will never come online.  CRUSH weight is one way of determining where Ceph stores data.  In the newer Ceph release 'Infernalis', this calculation has changed, so adjusting the weight is not necessary.  If you see a status of 'undersized+degraded+peered', you may need to adjust the CRUSH weight.  To give all OSDs on all nodes the same weight (321):
# sudo ceph osd crush reweight-subtree default 321
# sudo ceph osd crush reweight-subtree ceph1 321
# sudo ceph osd crush reweight-subtree ceph2 321
# sudo ceph osd crush reweight-subtree ceph3 321
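
You can confirm the new weights took effect by looking at the OSD tree:
# sudo ceph osd tree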

Your cluster should begin to synchronize and eventually you should get a state of 'HEALTH_OK' and 'active+clean'.  Do not proceed until you reach this status.  Google is your friend here.  Here is what it should now look like:

Instead of using the default storage pool, let's create our own.  To do so, we need to calculate the number of placement groups to use.  In my humble opinion, this is one of the most difficult tasks in this whole process, and it doesn't appear to be an exact science.  As a general rule, the number of placement groups is calculated as follows:

Total Placement Groups = (Number of OSD * 100)/ Pool Size

Pool size is the number of replicas for replicated pools.  We use 2 for this to make sure any data is stored at least twice.  The answer to the equation is then rounded up to the nearest power of 2.

So our equation looks like this:
(3 x 100) / 2 = 150
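
If you want to sanity-check the arithmetic from the shell, the same calculation for 3 OSDs and a pool size of 2 looks like this:
# echo $(( (3 * 100) / 2 ))
150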

If you round that up to the nearest power of 2, you get 256.  So why do they recommend setting the number of placement groups to 128 for 5 or fewer OSDs?  Well, it's apparently not an exact science.  A lot has been written about placement groups.  Here are a couple of good references for more information:
http://docs.ceph.com/docs/master/rados/operations/placement-groups/
http://ceph.com/pgcalc/

Here is a quick 'power of 2' table to get you started:

2^2 = 4
2^3 = 8
2^4 = 16
2^5 = 32
2^6 = 64
2^7 = 128
2^8 = 256
2^9 = 512
2^10 = 1024

We will create a new pool named 'data-pool-1' with the recommended number of 128 placement groups:

# sudo ceph osd pool create data-pool-1 128
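
Since the calculation above assumed a pool size (replica count) of 2, you may also want to set that explicitly on the new pool and then verify the placement group count:
# sudo ceph osd pool set data-pool-1 size 2
# sudo ceph osd pool get data-pool-1 pg_num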

You are now ready to use your new cluster.  Before doing so, a word of caution.  Don't attempt to mount your new cluster from one of the server nodes, especially an OSD node.  There have been reports of this causing corruption!  Instead, use a completely separate client.  If you don't have one available, you may want to consider spinning one up by using VirtualBox.

You can use the ceph-deploy tool to install the necessary packages to a client.  For a client, I am using a fully updated (as of this writing) CentOS 7 minimal install.  To start off, we need to install the Ceph repository just as we did previously for the nodes:
rpm -Uvh http://download.ceph.com/rpm/rhel7/noarch/ceph-release-1-1.el7.noarch.rpm

And install the needed packages:
# yum -y install ceph ceph-radosgw

Now we need the ceph.client.admin.keyring on our new client.  Again, we could use ceph-deploy for this, but I'm doing this step manually as I think it helps to understand what's going on.  So from the admin node, copy /etc/ceph/ceph.conf and /etc/ceph/ceph.client.admin.keyring to the client's /etc/ceph directory using something like this (substitute the user and hostname of your client):
# sudo scp /etc/ceph/ceph.client.admin.keyring /etc/ceph/ceph.conf <user>@<client>:/etc/ceph/

The rest of the commands will be done on the client.  To create a 5 GB block device image named 'ceph-data-1' in the 'data-pool-1' pool we created earlier on the cluster:
# rbd create ceph-data-1 --size 5120 --pool data-pool-1

Now we need to map the new image to a block device on the client:
# rbd map ceph-data-1 --pool data-pool-1
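
You can see which device the image was mapped to (it should show up as something like /dev/rbd0) with:
# rbd showmapped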

Next, we format the device as normal:
# mkfs.ext4 /dev/rbd0

And finally, mount it as usual:
# mount /dev/rbd0 /mnt

Now you are ready to start kicking the tires on your new cluster!  Here are some additional commands, along with those used above, that you may find useful to help get you started:

Remove from nodes:
# ceph-deploy purge <ceph1 ceph2 ceph3>

Purge only data from nodes:
# ceph-deploy purgedata <ceph1 ceph2 ceph3>

Forget keys:
# ceph-deploy forgetkeys

List a cluster's pools:
# ceph osd lspools

Show a pool's utilization (may take a minute to run):
# rados df

Unmap a block device:
# rbd unmap <ceph-data-1>

Remove a block device:
# rbd rm <ceph-data-1>

List the disks on a node:
# ceph-deploy disk list <ceph2>

Check the cluster's usage stats among pools:
# ceph df

For stats on placement groups:
# ceph pg dump

Your cluster is self-repairing, but you can tell a specific OSD to attempt a repair:
# ceph osd repair <osd-id>

View the CRUSH map:
# ceph osd tree

As you have probably noticed by now, Ceph uses keyrings to store one or more authentication keys.  You can list them as follows:
# ceph auth list

Get the number of placement groups in a pool:
# ceph osd pool get <name> pg_num

I have only touched on a fraction of the features and options available with Ceph.  It is an extremely powerful, tunable, and scalable system.  Before moving forward with a production Ceph cluster, do take time to read the documentation and set up some Ceph clusters in a lab environment to beat up on.  I hope you have learned something useful by reading this, and thank you again.

- Kyle H.