chris blogs

22jan2018 · Anatomy of a Ceph meltdown

Last week, the server farm of our LMU Student Council had a major downtime over almost five days. As part of the administrator team there, I’d like to publish this post mortem to share our experiences and lessons learned to avoid situations like this in the future.

First and foremost, a downtime spanning multiple days is completely unacceptable for a central service like this (and I really wish there had been a way to fix this more quickly), but the nature of the issue made it really hard to find another solution or workaround. In theory it would have been possible to set up an emergency system restored from backups, but this would have blocked hardware that we need to ensure regular operation later. Also, setting things up from scratch is likely to introduce new issues, and our resources were tied up in the recovery. (Please remember that we are all unpaid volunteers who have our own studies and/or day jobs, and no one had more experience with Ceph than what you get from reading the manual.)

A quick word on our setup: We have three file servers with 12TB of storage each; each provides three Ceph OSDs, a monitor, and an MDS (to provide CephFS to a shell server and the office machines). Connected to these are two virtualization hosts that run 24 virtual machines in total under QEMU/KVM. The file servers and virtualization hosts run Gentoo, most VMs run Debian, and a few run Windows. The setup is very redundant: Ceph guarantees that any single file server can drop out without problems, and if one virtualization host goes down, we can start all machines on the other host (even if main memory gets a bit tight then).

Unfortunately, Ceph itself is a single point of failure: when Ceph goes down, no virtual machine works.

What follows is a log of the events:

2018-01-15: At night, trying to debug an issue related to CephFS, an administrator had to restart an MDS, which failed. Then they tried to restart an OSD, which failed too. This caused the Ceph cluster to start rebalancing. I was not involved yet; as far as I know no further action was taken.

2018-01-16: Trying to restart the OSD again, we noticed that ceph-osd crashed immediately. It turned out that all three systems had been updated a few times without restarting the OSD. No OSD could start anymore. We kept the last two OSD running (this turned out to be a mistake). The file servers, running Gentoo, also had a profile update done by another administrator. We came to the conclusion that we needed to rebuild world to get into a consistent state.

Ceph and glibc were built without debugging symbols, so all information we had came from the ceph-osd output of backtrace(3), which pointed to the functions parse_network and find_ip_in_subnet_list. These functions are run very early by Ceph during configuration file parsing. I looked into the code, and it was quite simple, and only used std::string and std::list, two interfaces that changed in the recent libstdc++ ABI change.

My working idea behind the bug was now that the libstdc++ ABI change between GCC 4.2 and GCC 6.2 triggered this.

After emerge world, which took several hours, all software was built on the new libstdc++ ABI.

ceph-osd still crashed.

Another theory was that tcmalloc was at fault, but a Ceph without tcmalloc failed as well.

We decided to build a debugging version of Ceph to inspect the issue deeper. Compiling Ceph on Gentoo failed twice: (1) Building Ceph failed due to Ceph trying to run git, which triggered a sandbox exception since we have a /.git directory in the root folder. This could be worked around by setting GIT_CEILING_DIRECTORIES. (2) Building Ceph with debugging symbols took more than 32 GB of disk space, so we had to create space for that at first.
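The effect of GIT_CEILING_DIRECTORIES can be checked in isolation. The following is a minimal sketch, not the exact workaround used during the build: the helper name git_finds_repo is made up, and the value we actually exported is not recorded here. Git stops its upward search for a .git directory once it reaches a directory listed in that variable, so a stray repository further up is no longer found.

```shell
# git_finds_repo DIR [CEILING]: succeed if git discovers a repository when
# started in DIR, optionally with GIT_CEILING_DIRECTORIES set to CEILING.
# Hypothetical helper for illustration only.
git_finds_repo() {
  dir=$1
  ceiling=${2-}
  (
    if [ -n "$ceiling" ]; then
      # Git will not look in CEILING or any directory above it.
      export GIT_CEILING_DIRECTORIES="$ceiling"
    fi
    cd "$dir" && git rev-parse --git-dir >/dev/null 2>&1
  )
}
```

With a repository in some directory top and a working directory top/a/b, git_finds_repo top/a/b succeeds, while git_finds_repo top/a/b top/a fails because the search is cut off before it reaches top.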

2018-01-17: Debugging of Ceph intensified. It turned out the call to parse_network triggered a data corruption in a std::list<std::string>, which caused the destructor of this data structure to segfault. Tracking down the exact place where this corruption happened turned out to be hard: printing STL data structures is provided by gdb, but to create watchpoints on certain addresses you need to reverse-engineer the actual memory layouts. (For a short time, we assumed the switch to short string optimization was at fault, but spelling out the IPv6 address didn’t help.) Finally I managed to set a watchpoint, and it turned out inet_pton(3) triggered an overflow, which resulted in corruption of the next variable on stack, the list mentioned above. Googling some more turned up Ceph Bug #19371, which tells us that Ceph tried to parse an IPv6 address into a struct sockaddr, which only has space for an IPv4 address! This explained the data corruption. A fix was published in Ceph 10.2.8. We still ran Ceph 10.2.3, the version marked stable in Gentoo. (Up to this, we thought the quite old version of Ceph was not at fault, since it ran well before!)

We decided to update to Ceph 10.2.10.

The OSD still crashed, but for a different reason: first, the Gentoo init.d scripts were broken; second, Ceph now expects to run as a user ceph (it ran as root before). We started ceph-osd as root again.

The OSDs started fine, so all OSD were restarted now. The MDS reported degradation and the storage itself was degraded a lot (this means the redundancy requirement was not met) and unbalanced.

Ceph started recovery, but for reasons still unknown the OSDs began to crash often and to consume vast amounts of RAM (3-5x as much as usual). This first drove the system into swapping, and then Ceph started to disconnect OSDs because they were too slow to respond, which slowed down recovery even further.

We assume this is Ceph Bug #21761.

We reduced osd_map_cache trying to lower RAM usage, but we are not sure this had any effect.

We started adding more swap, this time on SSD which were meant to serve as Ceph cache usually. This made the situation a bit better, the OSD started to crash later, and had better responsiveness.

2018-01-18: Ceph recovery was still slow, so we looked for more information. MDS was still degraded, we did not know how to fix this. Reading the mailing list we learned to set noout (we knew that) and nodown, to force disable dropping out of the cluster. We also learned to set noup to let the OSD deal with the backlog, since the osdmap epochs were seriously out of sync (up to 10000). After setting noup and letting the OSD churn (this took several hours at high CPU load), the MDS was not degraded anymore! The system continued to balance and started backfilling.
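For reference, these flags are toggled with ceph osd set and ceph osd unset. The snippet below is a small sketch of how one might wrap that; the function name cluster_flags and the CEPH override variable are assumptions made for illustration, not something we had in place at the time.

```shell
# CEPH may be overridden, e.g. CEPH=echo for a dry run that only prints.
CEPH=${CEPH:-ceph}

# cluster_flags set|unset FLAG...: apply the action to each flag in turn
# and stop at the first failure. Hypothetical helper.
cluster_flags() {
  action=$1
  shift
  for flag in "$@"; do
    "$CEPH" osd "$action" "$flag" || return 1
  done
}
```

Usage would look like cluster_flags set noout nodown noup while the OSDs catch up, followed by cluster_flags unset noup once they have worked through the backlog. The flags stay in effect until unset, so they should be cleared again after recovery.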

At some point we took individual OSDs (backed by XFS) down to chown their storage to ceph:ceph, which took several hours each.

OSD RAM usage normalized.

2018-01-19: Backfilling progressed slowly, so we increased osd_max_backfills and osd_recovery_threads. We set noscrub and nodeep-scrub to reduce non-recovery I/O. At some point later that night, the system went from HEALTH_ERR back to HEALTH_WARN!
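Settings like the backfill limit can be changed on running OSDs with ceph tell and injectargs. This is a hedged sketch only: the helper name set_max_backfills and the CEPH override are invented for illustration, the option is spelled osd_max_backfills in the Ceph documentation, and the value we actually used is not recorded here.

```shell
# CEPH may be overridden, e.g. CEPH=echo for a dry run that only prints.
CEPH=${CEPH:-ceph}

# set_max_backfills N: ask every running OSD to allow N concurrent backfills.
# Hypothetical helper; injected values apply to the running daemons only.
set_max_backfills() {
  case $1 in
    ''|*[!0-9]*)
      echo "usage: set_max_backfills N" >&2
      return 2
      ;;
  esac
  "$CEPH" tell 'osd.*' injectargs "--osd-max-backfills $1"
}
```

Since injected values are not written to ceph.conf, a setting that turns out to help would also have to go into the configuration file to survive an OSD restart.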

2018-01-20: The OSD went all back to active+clean. Two things were stopping us from HEALTH_OK: we needed to set require_jewel_osds and sortbitwise. Setting both was unproblematic and worked fine.

We started to bring up first virtual machines again. This caused some minor fallout:

  • The LDAP server started fine, but did not bring up its IPv6 route (a Debian issue we hit before), so the mail server could not identify accounts. This was fixed quickly.
  • The mailing list server received a few mails to bigger mailing lists, and started to send them out all at once, which caused us to exceed quota at our upstream SMTP server (and the quota was too low, as it turned out later). This meant we had a backlog of over 5000 messages for several hours.

At the end of the day, all systems were operational again.

There is no evidence that data was lost during the downtime. It is possible that inbound mail was bounced at the gateway, and thus not delivered, but in this case the sender was notified of this fact. All other inbound mail was delivered when the mail server came back up.

Lessons learned:

  • If we notice something is going wrong with Ceph, we will not hesitate to shut down the cluster prematurely. It’s better to have 30 min downtime once, than a mess of this scale.
  • We should not update Ceph on all machines at once. After updating Ceph (or other critical parts of the system), we will check all services restart fine.
  • We will build glibc with debugging symbols. (I think this would have pointed me to inet_pton quicker and saved a few hours of debugging.)
  • We will track Ceph releases more closely, and generally trust upstream releases (I don’t know why Gentoo does not stabilize newer releases of Ceph, they fix significant bugs).

    (At some point I had proposed to run the OSD in a Debian chroot, but stretch contains Ceph 10.2.5 which was affected by the same bugs.)

  • We need to find a solution to fix the Debian IPv6 issue, which bit us a bit too often.

NP: Light Bearer—Aggressor & Usurper

24dec2017 · Merry Christmas!

Jingle all the way
Comic © Wondermark

Frohe Weihnachten, ein schönes Fest, und einen guten Rutsch ins neue Jahr!

Merry Christmas and a Happy New Year!

NP: EA80—Nr. 1

27feb2017 · A time-proven zsh prompt

I’ve been using the shell prompt below since 2013 and have only slightly tweaked it over time. The most significant change was probably displaying the Git branch.

The basic idea of my prompt is to not show redundant or obvious information. This allows the prompt to be short, yet useful.

By default, the prompt displays the hostname, shortened directory, and a % to signify a zsh. The hostname is bold to make it stand out when you are scrolling, and the sigil is colored to mark the beginning of the command. It looks like this:

juno ~% ./mycommand -x

Long directory names are truncated in the middle:

juno /tmp/dirwithare…gname%

In rare cases, only showing two levels of hierarchy may be confusing, so you can set $NDIRS to something higher, e.g. 4:

juno deeply/nested/dir/structure%

When the previous command failed, the prompt also displays the exit status of the previous command:

juno 42? ~%

When there are background jobs running, the prompt shows how many there are:

juno 1& ~%

Note how the status and job display use the associated ASCII symbols.

When we are in a Git repository, the current branch is displayed inline as part of the base directory (when possible), or as a prefix, together with the repo name. By design, in the most common cases this keeps the prompt very short:

juno prj/rack@master%
juno rack@master/doc%
juno rack@master doc/Rack%

When the prompt detects a SSH session, the prompt sigil is doubled, so we are a bit more careful there:

hecate prj/lr%%

When the shell runs as root, the sigil is red (I don’t usually run zsh as root):

juno /etc#

That’s it, essentially. Apart from the Git integration, it’s really straight-forward. Not visible above is trick 4 to simplify pasting of old lines, and how it updates the title of terminal emulators to hostname: dir respectively hostname: current-command (which needs quite complicated quoting).

The whole thing is defined in the PROMPT section of my .zshrc.

NP: Light Bearer—Aggressor & Usurper

02jan2017 · zz: a smart and efficient directory changer

A nice feature I’ve become used to in the last year is a so-called “smart directory changer” that keeps track of the directories you change into, and then lets you jump to popular ones quickly, using fragments of the path to find the right location.

There is quite some prior art in this, such as autojump, fasd or z, but I could not resist building my own implementation of it, optimized for zsh.

As far as I can see, my zz directory changer is the only one with a “pay-as-you-go” performance impact, i.e., not every directory change is slowed down, but only every use of the smart matching functionality.

The idea is pretty easy: we add a chpwd hook to zsh to keep track of directory changes, and log for each change a line looking like “0 $epochtime 1 $path” into a file ~/.zz. This is an operation with effectively constant cost on a Unix system.

chpwd_zz() {
  print -P '0\t%D{%s}\t1\t%~' >>~/.zz
}
chpwd_functions=( ${(kM)functions:#chpwd?*} )
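The record format is simple enough to produce from any POSIX shell, which can be handy for seeding or testing the file by hand. This is a sketch under assumptions, not part of zz: the name zz_log and its file argument are made up, and it writes the path as given instead of the ~-abbreviated form that %~ prints, so entries created this way are tallied separately from the ~/… ones.

```shell
# zz_log FILE [DIR]: append one "score, time, count, path" record to FILE,
# with the fields separated by tabs. DIR defaults to the current directory.
# Hypothetical helper for illustration.
zz_log() {
  # date +%s (seconds since the epoch) is not in POSIX but widely supported.
  printf '0\t%s\t1\t%s\n' "$(date +%s)" "${2:-$PWD}" >>"$1"
}
```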

The actual jumping function is called zz:

zz() {

How does the matching work? It’s an adaptation of the z algorithm: The lines of ~/.zz are tallied by directory and last-used time stamp, so for example the lines

0 1483225200 1 ~/src
0 1483225201 1 ~/tmp
0 1483225202 1 ~/src
0 1483225203 1 ~/tmp
0 1483225204 1 ~/src

would turn into

6 1483225204 3 ~/src
4 1483225203 2 ~/tmp

Also, the initial number, the effective score of the directory, is computed: We take the relative age of the directory (that is, seconds since we went there), and boost or dampen the results: the frequency is multiplied by 4 for directories not older than 1 hour, doubled for directories we went into today, halved for directories we went into this week, and divided by 4 otherwise.

  awk -v ${(%):-now=%D{%s}} <~/.zz '
    function r(t,f) {
      age = now - t
      return (age<3600) ? f*4 : (age<86400) ? f*2 : (age<604800) ? f/2 : f/4
    }
    { f[$4]+=$3; if ($2>l[$4]) l[$4]=$2 }
    END { for(i in f) printf("%d\t%d\t%d\t%s\n",r(l[i],f[i]),l[i],f[i],i) }' |

By design, this tallied file can be appended again with new lines originating from chpwd, and recomputed whenever needed.

The output of this tally is then sorted by age, truncated to 9000 lines, then sorted by score. (My ~/.zz is only 350 lines, however.)

      sort -k2 -n -r | sed 9000q | sort -n -r -o ~/.zz
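To try the tally outside of zsh, here is a standalone sketch of the same scoring in POSIX sh and awk. It is an illustration rather than the code zz runs: the function name zz_tally is made up, the current time is passed in as an argument so that results are reproducible, and the truncation to 9000 lines is left out.

```shell
# zz_tally NOW: read "score<TAB>time<TAB>count<TAB>path" records on stdin
# and print one scored line per path, highest score first.
# Hypothetical helper; like the original, it assumes paths without spaces.
zz_tally() {
  awk -v now="$1" '
    # Boost recent directories, dampen old ones (age is in seconds).
    function score(t, n,    age) {
      age = now - t
      return (age < 3600) ? n * 4 : ((age < 86400) ? n * 2 : ((age < 604800) ? n / 2 : n / 4))
    }
    # Sum the counts per path and remember the most recent time stamp.
    { cnt[$4] += $3; if ($2 > last[$4]) last[$4] = $2 }
    END {
      for (p in cnt)
        printf("%d\t%d\t%d\t%s\n", score(last[p], cnt[p]), last[p], cnt[p], p)
    }
  ' | sort -n -r
}
```

Feeding it the five example lines from above with NOW set to two hours after the last visit (1483232404) gives a score of 6 for ~/src and 4 for ~/tmp, matching the tallied example.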

With this precomputed tally (which is generated in linear time), finding the best match is easy. It is the first string that matches all arguments:

  if (( $# )); then
    local p=$(awk 'NR != FNR { exit }  # exit after first file argument
                   { for (i = 3; i < ARGC; i++) if ($4 !~ ARGV[i]) next
                     print $4; exit }' ~/.zz ~/.zz "$@")

If nothing was found, we bail with exit code 1. If zz is used interactively, it changes into the best match, else the best match is just printed. This allows using things like cp foo.mkv $(zz mov).

    [[ $p ]] || return 1
    local op=print
    [[ -t 1 ]] && op=cd
    if [[ -d ${~p} ]]; then
      $op ${~p}

If we found a directory that doesn’t exist anymore, we clean up the ~/.zz file, and try it all over.

    else
      # clean nonexisting paths and retry
      while read -r line; do
        [[ -d ${~${line#*$'\t'*$'\t'*$'\t'}} ]] && print -r $line
      done <~/.zz | sort -n -r -o ~/.zz
      zz "$@"
    fi

With no arguments, zz simply prints the top ten directories.

  else
    sed 10q ~/.zz
  fi
}

I actually shortcut zz to z and add a leading space to not store z calls into history:

alias z=' zz'

The full code (possibly updated) can be found as usual in my .zshrc.

I use lots of shell hacks, but zz definitely is among my most successful ones.

NP: Leonard Cohen—Leaving The Table

24dec2016 · Merry Christmas!

Comic © Liz Climo

Frohe Weihnachten, ein schönes Fest, und einen guten Rutsch ins neue Jahr wünscht euch
Christian Neukirchen

Merry Christmas and a Happy New Year!

NP: Against Me!—Haunting, Haunted, Haunts

15jan2016 · Dear Github

These kinds of posts seem popular these days.

My top six of features GitHub is missing:

  1. Searching for text in commit messages. Fixed 2017-01-04. About 2/3 of the repos I clone, I solely clone to run git log --grep.
  2. Searching in the wiki. Fixed 2016-08-08.
  3. Archive tarballs with submodule checkouts included; else submodule usage is totally pointless.
  4. Marking issues private to committers. Useful both for embargoed security issues and to keep out an angry mob.
  5. Being able to disable pull requests. For projects that use GitHub mainly as a mirror.
  6. IPv6 support. It’s 2016, damnit.


NP: Revolte Springen—Hinter den Barrikaden

24dec2015 · Merry Christmas!

Consumers' crèche

Frohe Weihnachten, ein schönes Fest, und einen guten Rutsch ins neue Jahr wünscht euch Christian Neukirchen

Merry Christmas and a Happy New Year!

NP: Elende Bande—Uns das Leben

19feb2015 · Six hacks for less(1)

Recently I got around to configuring less, and I collected these few tricks:

  1. Sometimes I look at lists with less, and then do things step-by-step, keeping the current action at the top of the page. This works nicely until you end up at the last page of the file, and then can’t scroll down. You lose track of where you are at and get confused.

    It would be much nicer scrolling down, and filling up the buffer with ~ after the end of file, just as if you had searched in the pager.

    Actually, with ESC-SPC, you can move a full page down, filling up the buffer with ~. Toying around a bit, you’ll find out that you can override the “page length” with a prefix, i.e. 1 ESC-SPC will move down one line only!

    However, this is still inconvenient to type all the time, thus let’s define a keybinding. For this, create a file ~/.lesskey where we will put the key definitions. This file then will be compiled using lesskey(1) and generate a binary configuration file ~/.less. (I guess you can be lucky that m4 is not involved in this mess…)

    One problem is actually binding the key. You can easily bind the cursor down key (\kd) to forw-screen-force, but how do you pass 1? The canonical hack is to use the noaction action, which will behave just like you’ve typed the keys after it. Thus, we write:

    #command
    \kd noaction 1\e\40
    j noaction 1\e\40

    (By the way, that #command comment is important to tell lesskey you are defining key commands.)

    Finally, scrolling bliss!

    Actually, scrap that.

    The badly underdocumented key J (and K) will scroll how I want, but you only read about that in the example inside lesskey(1). Therefore, we can just do:

    \kd forw-line-force
    j forw-line-force

    These keybindings are there since at least 1997 and I’ve never found them before…

  2. While we are redefining keys, I’ve always found it a bit clumsy to read multiple files, having to type :n and :p. Using [ and ] is much more convenient (at least on a US keyboard), and by default these keys do things of questionable utility.

    [ prev-file
    ] next-file

  3. Did you ever wish to give feedback from less? Like have a script output some info, and you decide how to go on? Since less usually exits with status 0, I thought this was tricky to do, but the quit action can actually return an arbitrary exit code, encoded as a character.

    I bound Q and :cq (like in vim) to exit with status 1:

    Q quit \1
    :cq quit \1

    Now you can do stuff like look at all files and have them deleted when you press Q instead of q to exit:

    for f in *; do less $f || rm $f; done

  4. I use less a lot to look at patches, git log output, and occasionally mailboxes. The D command as defined below will move to the next line starting with diff or commit or From␣.

    D noaction j/\^diff|commit|From \n\eu

    It will also “type” ESC-u to hide the highlighting. Now I can simply press D to jump to the next chunk of interest.

  5. To return to where you started from after a search or going to the end of file, type ''. Typing '' again will go back, so this is also nice to toggle between two search results.

  6. Back in the old days of X11R2(?) there was a tool called xless, which was exactly that: a pager like less that ran in its own X11 window. It’s quite useful. We can recreate this by combining a X11 terminal emulator and plain less with a small zsh snippet:

    xless() {
      {
        exec {stdin}<&0 {stderr}>&2
        exec urxvt -e sh -c "less ${(j: :)${(qq)@}} </dev/fd/$stdin 2>/dev/fd/$stderr"
      } &!
    }

    Watch the trick how we pass the stdin/stderr file descriptors and the file arguments!

    Now you can just run command-spitting-out-loads | xless and the output will be shown in a new terminal and not lock your shell.
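The exit-status trick from hack 3 can be wrapped into a small function. The following is a sketch under assumptions: the name review_and_prune and the REVIEW_PAGER override are made up for illustration, and it only does something useful when a key in ~/.lesskey is bound to quit with a non-zero status, as shown above.

```shell
# review_and_prune FILE...: page through each file and delete it when the
# pager exits with a non-zero status (e.g. Q bound to "quit \1").
# REVIEW_PAGER is a hypothetical override, handy for trying this out safely.
review_and_prune() {
  for f in "$@"; do
    "${REVIEW_PAGER:-less}" -- "$f" || rm -- "$f"
  done
}
```

Quoting "$f" and passing -- keeps file names with spaces or leading dashes intact in shells other than zsh. Since files are removed without confirmation, a dry run with REVIEW_PAGER=true (which never deletes) is a reasonable first test.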

    NP: Feine Sahne Fischfilet—Dreieinhalb Meter Lichtgestalt

17feb2015 · 10 fancy zsh tricks you may not know...

Wow, almost two years have passed since the latest installment of our favorite clickbait zsh tricks series.

  1. When editing long lines in the zle line editor, sometimes you want to move “by physical line”, that is, to the character in the terminal line below (like gj and gk in vim).

    We can fake that feature by finding out the terminal width and moving charwise:

    _physical_up_line()   { zle backward-char -n $COLUMNS }
    _physical_down_line() { zle forward-char  -n $COLUMNS }
    zle -N physical-up-line _physical_up_line
    zle -N physical-down-line _physical_down_line
    bindkey "\e\e[A" physical-up-line
    bindkey "\e\e[B" physical-down-line

    Now, ESC-up and ESC-down will move by physical line.

  2. Sometimes it’s nice to do things in random order. Many tools such as image viewers, music or media players have a “shuffle” mode, but when they don’t, you can help yourself with this small trick:


    Just append ($SHUF) to any glob, and get the matches shuffled:

    % touch a b c d
    % echo *($SHUF)
    d c a b
    % echo *($SHUF)
    c a d b

    Note that this shuffle is slightly biased, but it should not matter in practice. In doubt, use shuf or sort -R or something else…

  3. Are you getting sick of typing cd ../../.. all the time? Why not type up 3?

    up() {
      local op=print
      [[ -t 1 ]] && op=cd
      case "$1" in
        '') up 1;;
        -*|+*) $op ~$1;;
        <->) $op $(printf '../%.0s' {1..$1});;
        *) local -a seg; seg=(${(s:/:)PWD%/*})
           local n=${(j:/:)seg[1,(I)$1*]}
           if [[ -n $n ]]; then
             $op /$n
           else
             print -u2 up: could not find prefix $1 in $PWD
             return 1
           fi
           ;;
      esac
    }

    With this helper function, you can do a lot more actually: Say you are in ~/src/zsh/Src/Builtins and want to go to ~/src/zsh. Just say up zsh. Or even just up z.

    And as a bonus, if you capture the output of up, it will print the directory you want, and not change to it. So you can do:

    mv foo.c $(up zsh)

  4. Previous tricks (#6/#7) introduced the dirstack and how to navigate it. But why type cd -<TAB> and figure out the directory you want to go to when you simply can type cd ~[zsh] and go to the first directory in the dirstack matching zsh? For this, we define the zsh dynamic directory function:

    _mydirstack() {
      local -a lines list
      for d in $dirstack; do
        lines+="$(($#lines+1)) -- $d"
        list+="$#lines"
      done
      _wanted -V directory-stack expl 'directory stack' \
        compadd "$@" -ld lines -S']/' -Q -a list
    }
    zsh_directory_name() {
      case $1 in
        c) _mydirstack;;
        n) case $2 in
             <0-9>) reply=($dirstack[$2]);;
             *) reply=($dirstack[(r)*$2*]);;
           esac;;
        d) false;;
      esac
    }

    The first function is just the completion, so cd ~[<TAB> will work as well.

  5. Did you ever want to move a file with spaces in the name, and mixed up argument order?

    % mv last-will.tex My\ Last\ Will.rtf

    Pressing ESC-t (transpose-words) between the file names will do the wrong thing by default:

    % mv My last-will.tex\ Last\ Will.rtf

    Luckily, we can teach transpose-words to understand shell syntax:

    autoload -Uz transpose-words-match
    zstyle ':zle:transpose-words' word-style shell
    zle -N transpose-words transpose-words-match


    % mv My\ Last\ Will.rtf last-will.tex

  6. If you are an avid Emacs user like me, you’ll find this function useful. It enters the directory the currently active Emacs file resides in:

    cde() {
      cd ${(Q)~$(emacsclient -e '(with-current-buffer
                                   (window-buffer (selected-window))
                                   default-directory) ')}
    }

    You need the emacs-server functionality enabled for this to work.

  7. I’m working on many different systems and try to keep a portable .zshrc between those. One problem used to be setting $PATH portably, because there is quite some difference among systems. I now let zsh figure out what belongs to $PATH:

    export PATH
    path=( ${(u)^path:A}(N-/) )

    The last line will normalize all paths, and remove duplicates and nonexisting directories. Also, notice how I pick up the latest Ruby version to find the Gem bin dir by sorting them numerically.

  8. One of the hardest things is to set the xterm title “correctly”, because most people do it wrong in some way, and then it will break when you have literal tabs or percent signs or tildes in your command line. Here is what I currently use:

    case "$TERM" in
      xterm*)  # assumed pattern; list the terminals that understand the title escape
        precmd() {  print -Pn "\e]0;%m: %~\a" }
        preexec() { print -n "\e]0;$HOST: ${(q)1//(#m)[$'\000-\037\177-']/${(q)MATCH}}\a" }
        ;;
    esac

  9. For a cheap, but secure password generator, you can use this:

    zpass() {
      LC_ALL=C tr -dc '0-9A-Za-z_@#%*,.:?!~' < /dev/urandom | head -c${1:-10}
    }

  10. Sometimes it’s interesting to find a file residing in some directory “above” (e.g. Makefile, .git and similar). We can glob these by repeating ../ using the #-operator (You have EXTENDED_GLOB enabled, right?). This will result in all matches, so let’s first sort them by directory depth:

    % pwd
    % print -l (../)

    Now we can pick the first one, and also make the file name absolute:

    % print (../)[1]:A) 

    I knew the #-operator, but it never occurred to me to use it this way before.
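A note on the printf '../%.0s' {1..$1} idiom used by up in trick 3: printf repeats its format string once per argument, and %.0s prints zero characters of each argument, so only the literal ../ is emitted every time. The brace range {1..$1} is zsh syntax; the sketch below shows the same idea in plain POSIX sh. The function name dotdots is made up for illustration.

```shell
# dotdots N: print "../" repeated N times (or "." for N = 0).
# Hypothetical helper showing the printf repetition idiom.
dotdots() {
  n=$1
  set --                      # reuse the positional parameters as a counter
  while [ "$n" -gt 0 ]; do
    set -- "$@" x             # one dummy argument per level
    n=$((n - 1))
  done
  if [ $# -eq 0 ]; then
    echo .
  else
    # The format is applied once per argument; %.0s swallows the argument.
    printf '../%.0s' "$@"
    echo
  fi
}
```

For example, dotdots 3 prints ../../../, which can be passed to cd or used as a path prefix.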

    Until next time!

    NP: Pierced Arrows—On Our Way

24dec2014 · Merry Christmas!

Consumers' crèche

Frohe Weihnachten, ein schönes Fest, und einen guten Rutsch ins neue Jahr wünscht euch Christian Neukirchen

Merry Christmas and a Happy New Year!

Bitte lesen: Liebeserklärung an die Vielfalt - eine Weihnachtsbotschaft.

NP: Against Me!—Holy Shit

Copyright © 2004–2016