Archive for System Administration

Using Cucumber For Testing Operations Tickets

We system administrators handle many tasks each day for our customers. These tasks usually come in as tickets, asking us to build a new system or to change an existing system. The "standard" workflow for this is: Receive ticket, do ticket, lob ticket back over to customer asking them to test. While we're good at our jobs, we're not infallible. Every so often, the customer lobs the ticket right back saying "It doesn't work!" We then spend more time readdressing the ticket and usually find that it doesn't work because we omitted a step in our haste. Chagrined, we fix our work, test it, and then tell our customer that it should work now.

I would like to show you a better way. I've written before about how I've been using Cucumber for testing tickets. Today, I'd like to show you how you can too.

Intended audience

This is primarily aimed at system administrators but should benefit anyone who has operations duties. You should be familiar with programming or scripting. Specific familiarity with Ruby will be helpful but is not required.

Installing Cucumber

Cucumber is distributed as a Ruby gem. This means you'll need Ruby. On some platforms, you can install a package for Cucumber that will also install its prerequisites (for example, rubygem-cucumber on Fedora or cucumber on Debian). On others, you will need to install Ruby and rubygems support (on RHEL/CentOS, this would be the ruby and rubygems packages) and then install Cucumber and its prerequisites with this command: gem install cucumber

Getting started

To start work with Cucumber, you will need a directory structure like this:

features
|- steps
`- support

  • Tests for tickets go in the features directory. These files have the .feature extension. Each logical piece of work (e.g. a ticket, service, or system) will have its own file. (Cucumber has its origins in software testing so it calls these "features," corresponding to features in software being built.)

  • Step definitions (discussed below) go in the steps directory. These files have the .rb extension. Step definitions should be grouped logically in files so you can easily find them later.

  • Support files, e.g. functions or classes that are used in the step definitions, go in the support directory. These files also have the .rb extension.

To tell Cucumber you're using Ruby, it's customary to create a file named env.rb in the support directory. (You can use other languages, e.g. Python, with Cucumber. I will provide an example of this in the future.)
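
The env.rb file can be completely empty; its presence is enough. Since env.rb is loaded before your step definitions, it is also a convenient place for requires that several step files share. A minimal sketch (entirely optional; a blank file works fine for the walkthrough below):

# features/support/env.rb
# An empty file is enough.  Anything required here is loaded before the
# step definitions, so shared libraries can be required once instead of
# in every steps file.
require 'rubygems'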

Anatomy of a Cucumber feature

Here is an example ticket written as a Cucumber feature:

Feature: New A record for QUUX.EXAMPLE.COM (20130708001)
  Joe from Marketing wants us to add a host record
  for quux.example.com so they can interact with a mockup
  for a new website.
 
  Scenario: A record for QUUX.EXAMPLE.COM
    Given my DNS servers are:
      |172.22.0.23|
      |172.22.0.24|
      |172.22.0.25|
    When I query my DNS servers for the A record for "quux.example.com"
    Then I should see that the A record response is "192.168.205.203"

Each feature has a title that should reflect the purpose of the ticket. I like to include the ticket number in the title so I can easily refer back to the ticket. (I also include the date in the filename so I can easily find the feature or figure out when I did the work.)

Each feature may also have a description. The description should reflect the reason behind what you're testing for. It's optional; I tend to leave it out, but you may find it useful if you share your features with your co-workers.

Each feature has one or more scenarios. These are the actual tests for the feature. How many scenarios you use will depend on personal judgment and the exact request you're handling. I tend to use a scenario for each discrete element of the ticket. For example, for an email account, I would have one scenario for logging in and one for any specific properties for the account.

Each scenario is composed of steps. A step is a discrete element of the scenario. A step usually starts with one of three keywords: Given, When, or Then. These keywords roughly equate to separate stages in the test:

  • Given steps are for items that are already in place that impact the test. For example, if your test relies on specific pieces of infrastructure or on a specific state in your environment, these would be Givens.
  • When steps represent actions you would take to verify the ticket. These might include running commands on a system or going to a web page.
  • Then steps represent the outcome of the actions taken in the When steps.

(Whens and Thens are generally written as first person sentences to reflect that these are actions you would take or outcomes you would see.)

Step definitions

Once you've specified the feature, its scenarios, and its scenarios' steps, you then need to tell Cucumber how to execute the steps. You do this via step definitions. As noted above, step definitions are stored in Ruby files in the steps subdirectory.

A step definition will look like this:

Given(/^a step definition$/) do
  # code goes here
end

A step definition starts with one of the three keywords, Given, When, or Then. This is followed by a regular expression that matches the step for this definition. Following this is the block of code to run for the step.

You can use grouping in the regular expression to capture any data embedded in the step. Any captured groups are then passed along to the block. For example, given this step definition:

When(/^I query my DNS servers for the A record for "(.*?)"$/) do |host|
  # code goes here
end

If a scenario has the step "When I query my DNS servers for the A record for "quux.example.com"", it would execute using the above step definition with the host variable set to "quux.example.com".

Walkthrough

Let's see this in action. To follow along locally, you'll need to have Vagrant installed. To get started, clone this git repo and follow the instructions in the README.md file.

The environment consists of four systems: a system to run Cucumber on (tester), a master DNS server (dns_master), and two slave DNS servers (dns_slave1 and dns_slave2). The two slave DNS servers transfer zones from the master DNS server.

First, we'll need to set up the directories. Connect to tester using SSH (vagrant ssh tester). Once there, create the features directory and then the steps and support directories underneath it.

In the support directory, create a blank file named env.rb. This tells Cucumber that we will be using Ruby for step definitions.

Now we're ready to work through a scenario. Let's say we receive the ticket below:

From: Joe in Marketing
Subject: Need a new hostname
 
Our web design firm has created a mockup of our new website.  We'd like
to test the mockup with our marketing tools.  Would you please set up
QUUX.EXAMPLE.COM to point to our server at 192.168.205.203?

From this, we need to figure out an appropriate test for it. We want to make sure that quux.example.com resolves to 192.168.205.203. Fortunately, we already have a test for this: the example feature from above. Let's use that.

In the features directory, create a file named 20130708_example.feature. Copy the example feature from above and paste it into that file.

Our next goal is to get the test to fail. We need to prove that the test fails without the change in place. Otherwise, how do we know that our work actually causes the test to pass?

At the shell prompt, run this command while in the features directory:

cucumber ../features/20130708_example.feature

You should see output like:

[vagrant@tester features]$ cucumber ../features/20130708_example.feature
Feature: New A record for QUUX.EXAMPLE.COM (20130708001)
  Joe from Marketing wants us to add a host record
  for quux.example.com so they can interact with a mockup
  for a new website.
 
  Scenario: A record for QUUX.EXAMPLE.COM                               # ../features/20130708_example.feature:6
    Given my DNS servers are:                                           # ../features/20130708_example.feature:7
      | 172.22.0.23 |
      | 172.22.0.24 |
      | 172.22.0.25 |
    When I query my DNS servers for the A record for "quux.example.com" # ../features/20130708_example.feature:11
    Then I should see that the A record response is "192.168.205.203"   # ../features/20130708_example.feature:12
 
1 scenario (1 undefined)
3 steps (3 undefined)
0m0.002s
 
You can implement step definitions for undefined steps with these snippets:
 
Given(/^my DNS servers are:$/) do |table|
  # table is a Cucumber::Ast::Table
  pending # express the regexp above with the code you wish you had
end
 
When(/^I query my DNS servers for the A record for "(.*?)"$/) do |arg1|
  pending # express the regexp above with the code you wish you had
end
 
Then(/^I should see that the A record response is "(.*?)"$/) do |arg1|
  pending # express the regexp above with the code you wish you had
end

As you can see in the output above, Cucumber doesn't know how to run our test because there are no step definitions. We need to fix this before we can get the test to fail.

Cucumber helpfully provides skeletal step definitions. Let's copy these and then paste them into a new file in the steps directory named dns_steps.rb.

Let's rerun the Cucumber command. You should see output like:

[vagrant@tester features]$ cucumber ../features/20130708_example.feature
Feature: New A record for QUUX.EXAMPLE.COM (20130708001)
  Joe from Marketing wants us to add a host record
  for quux.example.com so they can interact with a mockup
  for a new website.
 
  Scenario: A record for QUUX.EXAMPLE.COM                               # features/20130708_example.feature:6
    Given my DNS servers are:                                           # features/steps/dns_steps.rb:1
      | 172.22.0.23 |
      | 172.22.0.24 |
      | 172.22.0.25 |
      TODO (Cucumber::Pending)
      ./features/steps/dns_steps.rb:3:in `/^my DNS servers are:$/'
      features/20130708_example.feature:7:in `Given my DNS servers are:'
    When I query my DNS servers for the A record for "quux.example.com" # features/steps/dns_steps.rb:6
    Then I should see that the A record response is "192.168.205.203"   # features/steps/dns_steps.rb:10
 
1 scenario (1 pending)
3 steps (2 skipped, 1 pending)
0m0.004s

We see that Cucumber now knows how to run the steps and that the first one is in the state "pending." In order to get this step to pass, we'll need to define it. Let's go ahead and do that.

Recall that our first step is:

Given my DNS servers are:
  |172.22.0.23|
  |172.22.0.24|
  |172.22.0.25|

We use this step to define the DNS servers we want to deal with.

Open dns_steps.rb. Find the skeleton for the first step:

Given(/^my DNS servers are:$/) do |table|
  # table is a Cucumber::Ast::Table
  pending # express the regexp above with the code you wish you had
end

Change that to:

Given(/^my DNS servers are:$/) do |servers|
  @nameservers = servers.raw.map {|row| row.first}
end

The new code for the step definition takes the IPs in the table and stores them as an array in the instance variable @nameservers. Since instance variables are shared between steps within a scenario, this lets us use the IPs in later steps.

(I haven't discussed tables since they're a more advanced topic. I will talk about them in a future post. The curious can look at Cucumber's documentation.)
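
For a quick taste, here is a sketch of how a table comes through to a step definition. The step text is made up purely for illustration; raw is a standard Cucumber table method:

Given(/^these IP addresses:$/) do |table|
  # For a table like:
  #   | 172.22.0.23 |
  #   | 172.22.0.24 |
  table.raw          # => [["172.22.0.23"], ["172.22.0.24"]]  (rows as arrays of cells)
  table.raw.flatten  # => ["172.22.0.23", "172.22.0.24"]
end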

Save the file and rerun the feature. You should see output like:

[vagrant@tester features]$ cucumber ../features/20130708_example.feature
Feature: New A record for QUUX.EXAMPLE.COM (20130708001)
  Joe from Marketing wants us to add a host record
  for quux.example.com so they can interact with a mockup
  for a new website.
 
  Scenario: A record for QUUX.EXAMPLE.COM                               # ../features/20130708_example.feature:6
    Given my DNS servers are:                                           # steps/dns_steps.rb:1
      | 172.22.0.23 |
      | 172.22.0.24 |
      | 172.22.0.25 |
    When I query my DNS servers for the A record for "quux.example.com" # steps/dns_steps.rb:5
      TODO (Cucumber::Pending)
      ./steps/dns_steps.rb:6:in `/^I query my DNS servers for the A record for "(.*?)"$/'
      ../features/20130708_example.feature:11:in `When I query my DNS servers for the A record for "quux.example.com"'
    Then I should see that the A record response is "192.168.205.203"   # steps/dns_steps.rb:9
 
1 scenario (1 pending)
3 steps (1 skipped, 1 pending, 1 passed)
0m0.003s

Our first step passes! Now Cucumber is saying that the second step is pending. Let's define that step now.

Recall that our second step is:

When I query my DNS servers for the A record for "quux.example.com"

Reopen dns_steps.rb. Find the skeleton for the second step:

When(/^I query my DNS servers for the A record for "(.*?)"$/) do |arg1|
  pending # express the regexp above with the code you wish you had
end

Change that to:

When(/^I query my DNS servers for the A record for "(.*?)"$/) do |host|
  responses = {}
  @nameservers.each do |server|
    resolver = Net::DNS::Resolver.new(:nameservers => server, :recursive => false, :udp_timeout => 15)
    responses[server] = resolver.query(host, Net::DNS::A)
  end
 
  @nameservers.each do |server|
    next if @nameservers.first == server
 
    responses[@nameservers.first].answer.map {|r| r.to_s}.sort.should \
      eq(responses[server].answer.map {|r| r.to_s}.sort),
      "DNS responses from #{@nameservers.first} and #{server} don't match!"
  end
 
  @response = responses[@nameservers.first]
  @host     = host
end

This code does four things:

  1. It queries each of the DNS servers specified in @nameservers for the A record(s) for the name stored in the host variable (for this example, "quux.example.com").
  2. It then verifies that the responses from all of the DNS servers are identical. If they're not, it fails the step with a message noting which servers' responses don't match.
  3. It then stores the response of the first DNS server in the @response class variable.
  4. It stores the host variable in the @host class variable so we can use it in error messages in later steps.

Let's look closer at the second block. Setting aside the sorting and the failure message, the comparison in the middle boils down to this (where ns1 and ns2 stand for the answer lists from two of the servers):

ns1.should eq(ns2)

This is an example of an RSpec expectation. We use expectations to verify conditions that should be true (the should expectation) or that should not be true (the should_not expectation). If the expectation is not correct (e.g. the condition is false but it's expected to be true), it will cause the step to fail.

(RSpec is another BDD testing framework. For now, all you need to know is that Cucumber uses RSpec's expectations and matchers. This is worth knowing if you need to do something complex but isn't as important for most tasks. Thoughtbot has a PDF of common RSpec matchers.)
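
As a quick, hedged sample of the style, here are a few expectations using matchers I commonly reach for; the values are made up:

require 'rubygems'
require 'rspec/expectations'
include RSpec::Matchers   # inside Cucumber step definitions the matchers are already available

"192.168.205.203".should eq("192.168.205.203")    # equality (==)
"a".should_not eq("b")
[1, 2, 3].should include(2)                       # membership
[].should be_empty                                # predicate matcher (calls empty?)
nil.should be_nil
"quux.example.com".should match(/example\.com$/)  # regular expression match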

Why do we test validity in a When step? Here, we're conducting a sanity check on the DNS server responses. If the responses aren't the same, there's an issue with the DNS servers and any later tests we conduct will be invalid. Since we want to get feedback as soon as possible, we want to fail as soon as possible and so we do a sanity check here rather than in a later step.

Add the following to the top of dns_steps.rb:

require 'rubygems'
require 'net/dns'

Save the file and rerun the Cucumber command. You should see output like:

[vagrant@tester features]$ cucumber ../features/20130708_example.feature
Feature: New A record for QUUX.EXAMPLE.COM (20130708001)
  Joe from Marketing wants us to add a host record
  for quux.example.com so they can interact with a mockup
  for a new website.
 
  Scenario: A record for QUUX.EXAMPLE.COM                               # ../features/20130708_example.feature:6
    Given my DNS servers are:                                           # steps/dns_steps.rb:4
      | 172.22.0.23 |
      | 172.22.0.24 |
      | 172.22.0.25 |
    When I query my DNS servers for the A record for "quux.example.com" # steps/dns_steps.rb:8
    Then I should see that the A record response is "192.168.205.203"   # steps/dns_steps.rb:26
      TODO (Cucumber::Pending)
      ./steps/dns_steps.rb:27:in `/^I should see that the A record response is "(.*?)"$/'
      ../features/20130708_example.feature:12:in `Then I should see that the A record response is "192.168.205.203"'
 
1 scenario (1 pending)
3 steps (1 pending, 2 passed)
0m2.703s

Now the second step passes! On to the third step. Recall that our third step is:

Then I should see that the A record response is "192.168.205.203"

Reopen dns_steps.rb. Find the skeleton for the third step:

Then(/^I should see that the A record response is "(.*?)"$/) do |arg1|
  pending # express the regexp above with the code you wish you had
end

Change that to:

Then(/^I should see that the A record response is "(.*?)"$/) do |ip|
  @response.should_not be_nil, "There is no DNS response.  Did you run a query?"
  @response.answer.should_not be_empty, "There are no records in the response."
  @response.answer.length.should eq(1),
    "There should only be one record.  Instead, there are #{@response.answer.length} records."
  @response.answer.first.address.should eq(ip),
    "The current A record is #{@response.answer.first.address}, not #{ip}."
end

This code performs the following checks:

  1. It verifies that @response is defined. If it's not, something undesirable has happened (for example, there is a bug in a When step or an appropriate When step was not used) and the remaining checks would fail anyway.
  2. It verifies that @response's answer field is not empty. The query method used in the previous step sets answer to an array of resource records. If the array is empty, there were no records and we should fail the step now since there isn't an A record pointing to the IP.
  3. It verifies that @response's answer field has only one element. If there are multiple records, there will be multiple elements. Since we only want a single record, this should also fail.
  4. It verifies that the first (and only) element in @response's answer field is an A record pointing to the desired IP.

Note that each different condition has its own failure message. This helps us determine exactly where the step failed.

Save the file and rerun the Cucumber command. You should see output like:

[vagrant@tester features]$ cucumber ../features/20130708_example.feature
Feature: New A record for QUUX.EXAMPLE.COM (20130708001)
  Joe from Marketing wants us to add a host record
  for quux.example.com so they can interact with a mockup
  for a new website.
 
  Scenario: A record for QUUX.EXAMPLE.COM                               # ../features/20130708_example.feature:6
    Given my DNS servers are:                                           # steps/dns_steps.rb:4
      | 172.22.0.23 |
      | 172.22.0.24 |
      | 172.22.0.25 |
    When I query my DNS servers for the A record for "quux.example.com" # steps/dns_steps.rb:8
    Then I should see that the A record response is "192.168.205.203"   # steps/dns_steps.rb:27
      There are no records in the response. (RSpec::Expectations::ExpectationNotMetError)
      ./steps/dns_steps.rb:29:in `/^I should see that the A record response is "(.*?)"$/'
      ../features/20130708_example.feature:12:in `Then I should see that the A record response is "192.168.205.203"'
 
Failing Scenarios:
cucumber ../features/20130708_example.feature:6 # Scenario: A record for QUUX.EXAMPLE.COM
 
1 scenario (1 failed)
3 steps (1 failed, 2 passed)
0m0.388s

As you can see, our test has failed because the third step failed. That step failed because there are no A records for quux.example.com. Our feature has failed in the expected manner. Success! (Of sorts, anyway.) We can now implement the change needed for the ticket.

In another window, connect to dns_master using SSH (vagrant ssh dns_master). Go to /home/vagrant/named and edit the file example.com. Add the following A record at the end of the file:

quux     IN      A   192.168.205.203

This is important: Do not increment the serial number for now. Our zone file should now look like this (your serial number may differ):

$TTL 300
example.com.  IN  SOA   ns1.example.com. hostmaster.example.com. (
                      1375624788 ; Serial
                      3h ; Refresh
                      15 ; Retry
                      1w ; Expire
                      300 ; Minimum
                      )
              IN  A     192.168.205.199
              IN  NS    ns1.example.com.
              IN  NS    ns2.example.com.
              IN  TXT   "U3VuIEF1ZyAwNCAxMDoxOTowOSAtMDQwMCAyMDEz"
; Hosts
www           IN  CNAME example.com.
foo           IN  A     192.168.205.198
ns1           IN  A     172.22.0.24
ns2           IN  A     172.22.0.25
quux          IN  A     192.168.205.203

Save the file and close your editor. At the command prompt, run this command: sudo rndc reload

The change made to the zone file will now be loaded by the master DNS server. However, since we did not increment the serial number, the change will not be picked up by the slave DNS servers. (This is probably the most common error I make when updating zone files.)

Go back to the first window and rerun the Cucumber command. You should see output like:

[vagrant@tester features]$ cucumber ../features/20130708_example.feature
Feature: New A record for QUUX.EXAMPLE.COM (20130708001)
  Joe from Marketing wants us to add a host record
  for quux.example.com so they can interact with a mockup
  for a new website.
 
  Scenario: A record for QUUX.EXAMPLE.COM                               # ../features/20130708_example.feature:6
    Given my DNS servers are:                                           # steps/dns_steps.rb:4
      | 172.22.0.23 |
      | 172.22.0.24 |
      | 172.22.0.25 |
    When I query my DNS servers for the A record for "quux.example.com" # steps/dns_steps.rb:8
      DNS responses from 172.22.0.23 and 172.22.0.24 don't match! (RSpec::Expectations::ExpectationNotMetError)
      ./steps/dns_steps.rb:18
      ./steps/dns_steps.rb:15:in `each'
      ./steps/dns_steps.rb:15:in `/^I query my DNS servers for the A record for "(.*?)"$/'
      ../features/20130708_example.feature:11:in `When I query my DNS servers for the A record for "quux.example.com"'
    Then I should see that the A record response is "192.168.205.203"   # steps/dns_steps.rb:27
 
Failing Scenarios:
cucumber ../features/20130708_example.feature:6 # Scenario: A record for QUUX.EXAMPLE.COM
 
1 scenario (1 failed)
3 steps (1 failed, 1 skipped, 1 passed)
0m0.449s

As you can see, the feature has failed because the DNS server responses don't match.

Go back to the second window and edit the zone file again. This time, increment the serial number. The zone file should now look like this:

$TTL 300
example.com.  IN  SOA   ns1.example.com. hostmaster.example.com. (
                      1375624789 ; Serial
                      3h ; Refresh
                      15 ; Retry
                      1w ; Expire
                      300 ; Minimum
                      )
              IN  A     192.168.205.199
              IN  NS    ns1.example.com.
              IN  NS    ns2.example.com.
              IN  TXT   "U3VuIEF1ZyAwNCAxMDoxOTowOSAtMDQwMCAyMDEz"
; Hosts
www           IN  CNAME example.com.
foo           IN  A     192.168.205.198
ns1           IN  A     172.22.0.24
ns2           IN  A     172.22.0.25
quux          IN  A     192.168.205.203

Save the file and exit your editor. Run this command to reload the zone file: sudo rndc reload

Since we incremented the serial number this time, the changes to the zone file are pushed out to the slave DNS servers.

Go back to the first window and rerun the Cucumber command. You should see output like:

[vagrant@tester features]$ cucumber ../features/20130708_example.feature
Feature: New A record for QUUX.EXAMPLE.COM (20130708001)
  Joe from Marketing wants us to add a host record
  for quux.example.com so they can interact with a mockup
  for a new website.
 
  Scenario: A record for QUUX.EXAMPLE.COM                               # ../features/20130708_example.feature:6
    Given my DNS servers are:                                           # steps/dns_steps.rb:4
      | 172.22.0.23 |
      | 172.22.0.24 |
      | 172.22.0.25 |
    When I query my DNS servers for the A record for "quux.example.com" # steps/dns_steps.rb:8
    Then I should see that the A record response is "192.168.205.203"   # steps/dns_steps.rb:27
 
1 scenario (1 passed)
3 steps (3 passed)
0m0.395s

As you can see, the test passed. This means we're done! Now we can tell Joe that we've added the A record for him and we can move on to the next ticket.

Wrap-up

You should now have an idea of how to use Cucumber with tickets (or other tasks you have to do). If there are things about Cucumber that you need more information on, please see the Cucumber documentation. If something about this process confuses you, leave a comment and I'll try to help.

To learn more

There's a fair bit I haven't covered here. I'll get to some of it in future blog posts.

If you want to see other examples, the features directory in the git repository for the demo environment contains the Cucumber features I used when building it and making sure it worked as expected.

If you're impatient and want to learn more now, you can read the Cucumber documentation and examples. The Pragmatic Bookshelf has published books on both Cucumber (The Cucumber Book) and RSpec (The RSpec Book).

More on acceptance testing

A couple days ago, I wrote about writing tests for tickets. (Didn’t read that post? Go on and read it. All done? Good.) I figured I should say more about the theory behind the practice.

Theory

The tests I write for tickets are based on the concept of “acceptance tests” from agile development. Acceptance tests are usually written by customers and developers together to determine when a given feature is done. A feature cannot be considered finished (and should not be released to the customer) until all of the acceptance tests for the feature pass.

Acceptance tests are written using the language of the business domain and are understandable by both the customers and the developers. They often follow the pattern of “Given/When/Then”, i.e. “Given something is true, When I do something, Then I should see something.” Implementation-specific language is not used in these tests.

My testing process is based on the practice of “acceptance test-driven development” (ATDD). Under ATDD, acceptance tests are written before any code is written. Once the test is verified to fail, the developer writes the code. (By verifying that the test failed first, you ensure that you know that the code you wrote caused the test to pass.) When the developer believes they are done or wants to check their work so far, they run the tests. Once, and only once, the tests pass, they can consider their work done.

Sysadmin reality

As a system administrator, I do not have the luxury to work out these tests directly with my customer. I have to rely on what they have written in their ticket (or said over the phone) to figure out what they want. Sometimes, I have to ask for clarification to make sure I understand their request well enough to write the tests.

Since I have to work out on my own what my customer wants, the tests I write are not guaranteed to be as effective at establishing when I am truly done. After all, my understanding of what they want may be incomplete and, therefore, while my tests say I am done with the request, I’m not really done: I have not done everything my customer wanted.

One benefit of having to devise these tests myself is that I can use implementation-specific language. This lets me simplify the implementations of the tests, although maybe not the tests themselves. For example, I can write implementation-specific steps for getting the list of email accounts from mailservers (e.g. “When I get the list of accounts in Postfix”, “When I get the list of accounts in Exchange”, etc.) without having to write conditional and abstraction logic within the test implementation.
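
As a rough sketch of what such implementation-specific steps might look like: the file path and the list_exchange_accounts helper below are hypothetical, not real tooling.

When(/^I get the list of accounts in Postfix$/) do
  # Assumption: virtual mailboxes are listed one per line in this file.
  @accounts = File.readlines('/etc/postfix/vmailbox').map {|line| line.split.first}
end

When(/^I get the list of accounts in Exchange$/) do
  # Assumption: a helper script wraps the Exchange management tooling.
  @accounts = `list_exchange_accounts`.split("\n")
end

Later Then steps can then assert against @accounts without caring which mail system the list came from.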

To learn more

ATDD is covered in detail in ATDD By Example by Markus Gärtner. In addition to discussing the idea behind ATDD, he provides some coverage of two tools, Cucumber and FitNesse. The Cucumber Book by Matt Wynne and Aslak Hellesøy covers Cucumber in more detail.

The second part of The RSpec Book by David Chelimsky et al covers behavior-driven development (BDD) which is similar to and incorporates aspects of ATDD. (One of the foundations of BDD is “acceptance test-driven planning.”) I do not believe my testing methodology follows BDD since I am not always testing behavior. (See my comments about indirect tests in my last post.)

Finally, in chapter seven of The Clean Coder, Robert Martin discusses acceptance testing and the importance of establishing a correct definition of done.

Tickets and testing

Some time ago, I received a ticket from a customer to change an A record in a zone, a quick, easy thing to do. I made the change and told them that it was done. A few days later, I got a followup ticket saying “This doesn’t appear to be done.” I checked and, sure enough, the change didn’t appear when I queried the DNS servers. Apparently, in my haste, I had forgotten to do one simple but necessary step: increment the serial number in the zone file. Oops. Chagrined, I incremented the serial number, reloaded the zone file, verified that the change did indeed appear, and apologized profusely to the customer.

If only I had tested the change before I decided I was done with the work, I could have found my error before my customer did.

Mistakes like this don’t happen often, but they do happen. I know I have made many errors like this one throughout my time as a system administrator.

One Friday about a year ago, I handled a particularly hairy ticket. I spent the entire weekend half-expecting, and dreading, that I would get a phone call from the person on call telling me that I had screwed up. That call never came so I guess I did it right. On the following Monday, I decided to avoid weekends like that one and adopted a new policy: test the work I do before saying that I’m done. Before doing something, I would set up a test for the desired outcome, verify that the test fails, do the work, and then verify that the test passes.

An Example

To give you an idea of how this works, let’s do an example based on the scenario I mentioned earlier. Since I use Cucumber for my tests, I write a feature that looks like:

Feature: Change A record for foo.example.com
  Scenario:
    When I query for the A record for "foo.example.com"
    Then I should get a response with the A record "10.1.1.2"

I then run this feature and verify that it fails:
Cucumber says it failed!

I make the change on the DNS server and run the test again:
Cucumber says it failed again!

What? It failed? Oh, I didn’t increment the serial number again. Let’s fix that and:
Cucumber says it passed!

It passed! That means I’m done and can tell that to my customer, confident that an error hasn’t slipped through.

What I’ve learned

Having done this for about a year, I’ve learned several things.

  1. There is a cost to testing. If I already have step definitions set up for my test, it only adds a few minutes. If I have to write new step definitions, it takes a lot longer. (Some step definitions, particularly those that have to scrape websites, have taken hours to write.) Even if it only takes a few minutes to set up and run the test, that’s still a few minutes.
  2. Some things cannot or should not be tested directly. In these cases, indirect tests are needed. Instead of verifying the desired behavior, you might need to test the configuration that specifies the desired behavior. Care must be taken when using indirect tests since they are not as reliable as direct tests.

    For example, most tasks involving email delivery or routing should not be tested directly, either because emails sent for testing would be seen as unwanted noise by the customer or because you do not have the login credentials for the email account. If your customer wants email sent to one address forwarded to another, you can instead test the mailserver configuration to verify that the forwarding is set up (see the sketch after this list).

  3. Make sure that the test checks everything that is being done. (This is especially true for indirect tests.) If your test only checks part of the functionality, your test may pass but you’re not actually done. For example, if you’re supposed to set up a website that pulls data from a database, make sure you test for content that’s in the database. If you only test for the presence of a string from static content (say, from the template for the site), your test will pass even though you haven’t set up the database privileges correctly.
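
To make item 2 above concrete, here is a hedged sketch of an indirect step for the email-forwarding case. The /etc/aliases path and its format are assumptions about a typical Postfix/Sendmail-style setup, not a recipe for any particular environment.

Then(/^mail for "(.*?)" should be forwarded to "(.*?)"$/) do |account, destination|
  # Indirect test: inspect the configuration that implements the forwarding
  # instead of sending a test message.  Assumes an /etc/aliases-style file
  # with lines such as "joe: someone@example.org".
  aliases = File.read('/etc/aliases')
  aliases.should match(/^#{Regexp.escape(account)}:\s*#{Regexp.escape(destination)}\s*$/),
    "No alias forwarding #{account} to #{destination} was found."
end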

I believe in testing. Why should you?

Every error has costs. The obvious cost is time: It takes time to fix an error. This time may exceed the original time it took to do the work.

The more important cost is in customer respect and goodwill. Every error that your customers see takes away from the respect and goodwill you’ve built up. Sadly, it’s a lot easier to lose respect than to gain it. If your customers see enough errors from you that you deplete that respect, there will be consequences. Even if they don’t quit your services entirely, they are sure to tell their friends and colleagues that you’re incompetent, untrustworthy, and, perhaps most damning, unprofessional.

“But I don’t have time to test!” you might say. That’s precisely when you should be testing! If you’re doing everything at a breakneck pace, you’re sure to make mistakes. When you’re stressed and overloaded is precisely when you should slow down and do things carefully. Every error you catch now saves you from paying the cost of that error later.

How has this worked out over the past year?

Over the past year, I have tested most of the work I have done. Various reasons have prevented me from testing all of it.

I feel like my customer-visible error rate has decreased significantly. (I don’t have accurate statistics so I don’t know for sure.) Of all of the mistakes that were reported by my customers in the past year, only one has been for work I had tested and that was because the test was not complete. All other reported errors were for untested work. Said another way: My customer-visible error rate for properly tested work is 0%. I think that’s a pretty good track record.

Closing thoughts

Creating tests to check your work will help you decrease your customer-visible error rate. Tests won't help you make fewer errors, but they will help keep your customers from seeing them.

I recommend trying it out. Write a test for a particular work item, do it, make your test pass. Now do the same for your next work item. And then the one after that. And so on.

A thought on using RSpec for behavior-driven system administration

For lack of a better term, behavior-driven system administration is the application of BDD principles to system administration, primarily building and maintaining systems.

The RSpec Book by David Chelimsky et al presents BDD as using the tools Cucumber and RSpec. Cucumber is used to describe and test the feature to be implemented and then RSpec is used to test the implementation while it is being built.

In Test-Driven Infrastructure with Chef, Stephen Nelson-Smith states that RSpec is not used in his example in the book because "there's no point in unit testing a declarative system." On one hand, I can agree with the statement. Somewhere, I read software developers saying that you should test your application but not worry about testing the platform.1

On the other hand, I disagree. I think RSpec can be useful when testing a declarative system but my rationale has less to do with testing than with auditing.2 I see Cucumber and RSpec tests as filling two different roles: Cucumber verifies the system's behavior. RSpec verifies the system's state.3 This also lets me use the Cucumber features to document the system's behavior and RSpec scenarios to document the system's state. (As pointed out in one of the BoF sessions at LISA 2011, Puppet manifests don't themselves work as documentation of a system.)
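
To make the distinction concrete, here is a rough sketch of the kind of state check I have in mind, written as plain RSpec; the package name, config path, and puppet master name are assumptions for illustration, not a description of any real environment.

require 'rubygems'
require 'rspec'

# State, not behavior: is the system configured the way its documentation says?
describe "the Puppet agent" do
  it "is installed" do
    system("rpm -q puppet > /dev/null 2>&1").should be_true
  end

  it "points at the expected puppet master" do
    # Assumption: the server setting lives in /etc/puppet/puppet.conf.
    File.read('/etc/puppet/puppet.conf').should match(/^\s*server\s*=\s*puppet\.example\.com\s*$/)
  end
end

You would run this with the rspec command, while a companion Cucumber feature exercises the behavior itself.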

I am currently working on building a new system where I hope to test this. I have RSpec scripts written for making sure Puppet is set up and correctly configured but I haven't completed the Cucumber feature for Puppet to verify its behavior, so I can't show how this works in practice. When I have a full example, including a working Cucumber feature and Puppet config, I'll make another post and walk through it. (Or if it doesn't work out, I'll point that out too.)

  • 1. I don't like this amorphous "somewhere" but I can't remember or find where I read this. If I find it, I'll add a reference.
  • 2. Although, yes, I do appreciate knowing whether or not the change I have made to Puppet's manifests was the correct change to make. This may be something that goes away as I get more experience with Puppet.
  • 3. This may be a misuse of RSpec since it's also intended to verify behavior rather than state. I use RSpec since I'm more familiar with it than Test::Unit or other testing methods.

Another way to build RPMs with Mock

Thursday night, I wrote about building packages with Mock. After working on copying the built packages into my local repository, I've decided there's a better way.

The old way does this to build the RPMs:

mock -r epel-6-x86_64 rebuild kernel-2.6.32.46-1.el6.oberon.src.rpm
mock -r epel-6-i386 rebuild kernel-2.6.32.46-1.el6.oberon.src.rpm
mock -r epel-6-x86_64 rebuild kernel-2.6.32.46-1.el6.oberon.src.rpm --arch=noarch --no-clean

And then this to copy the files into the repository:

cp /var/lib/mock/epel-6-x86_64/result/*.noarch.rpm $REPOSITORY/x86_64/
cp /var/lib/mock/epel-6-x86_64/result/*.noarch.rpm $REPOSITORY/i386/
cp /var/lib/mock/epel-6-x86_64/result/*.x86_64.rpm $REPOSITORY/x86_64/
cp /var/lib/mock/epel-6-i386/result/*.{i386,i686}.rpm $REPOSITORY/i386/

The new way, instead, does this to build the RPMs:

mock -r epel-6-x86_64 rebuild kernel-2.6.32.46-1.el6.oberon.src.rpm
mock -r epel-6-x86_64 rebuild kernel-2.6.32.46-1.el6.oberon.src.rpm --arch=noarch --no-clean
mock -r epel-6-i386 rebuild kernel-2.6.32.46-1.el6.oberon.src.rpm
mock -r epel-6-i386 rebuild kernel-2.6.32.46-1.el6.oberon.src.rpm --arch=noarch --no-clean

And then this to copy the files into the repository:

cp /var/lib/mock/epel-6-x86_64/result/*.{noarch,x86_64}.rpm $REPOSITORY/x86_64/
cp /var/lib/mock/epel-6-i386/result/*.{noarch,i386,i686}.rpm $REPOSITORY/i386/

This builds the noarch packages in both the 32-bit and 64-bit environments. It's a tradeoff between time and having the deployment step make more sense.
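
If you end up running this sequence often, a small wrapper can keep the build and copy steps together. Here is a rough Ruby sketch that strings together only the commands shown above; treating REPOSITORY as an environment variable is my assumption, not part of Mock.

#!/usr/bin/env ruby
require 'fileutils'

# Rough sketch: rebuild a source RPM in both build environments and copy
# the results into a local repository, mirroring the commands above.
srpm       = ARGV.fetch(0)
repository = ENV.fetch('REPOSITORY')

builds = {
  'epel-6-x86_64' => { :dest => 'x86_64', :arches => %w[noarch x86_64] },
  'epel-6-i386'   => { :dest => 'i386',   :arches => %w[noarch i386 i686] },
}

builds.each do |config, info|
  system('mock', '-r', config, 'rebuild', srpm) or abort "#{config} build failed"
  system('mock', '-r', config, 'rebuild', srpm, '--arch=noarch', '--no-clean') or
    abort "#{config} noarch build failed"

  info[:arches].each do |arch|
    Dir.glob("/var/lib/mock/#{config}/result/*.#{arch}.rpm").each do |rpm|
      FileUtils.cp(rpm, File.join(repository, info[:dest]))
    end
  end
end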

Building RPMs with Mock

I haven't been very happy with my way to build packages. I've been looking for a better system for managing it.

Through conversation in #lopsa, I found my way to Mock, a tool that builds packages in chroot environments.

I've been testing it on a VM. So far, it looks promising.

To use Mock, you first need to add your user to the mock group:

sudo /usr/sbin/usermod -a -G mock $user

After that, Mock is simple to use. If you have a source RPM, building a package is as easy as:

mock -r $configuration rebuild $srpm

So, for example, to build my patched kernel RPM:

mock -r epel-6-x86_64 rebuild kernel-2.6.32.46-1.el6.oberon.src.rpm

Build environment configurations are defined in /etc/mock. If you're building RPMs for use with RHEL or CentOS, one of the preexisting epel configurations should suffice. There are configurations for Fedora as well. You can also define your own configurations.

On 64-bit systems, you can use the 32-bit configurations to build 32-bit packages. If you want to specify a different architecture, you can use the --arch argument. For example, to build all binary RPMs for my patched kernel:

mock -r epel-6-x86_64 rebuild kernel-2.6.32.46-1.el6.oberon.src.rpm
mock -r epel-6-i386 rebuild kernel-2.6.32.46-1.el6.oberon.src.rpm
mock -r epel-6-x86_64 rebuild kernel-2.6.32.46-1.el6.oberon.src.rpm --arch=noarch --no-clean

The first command builds the 64-bit packages. The second command builds the 32-bit packages. The third command builds the additional packages that don't have an architecture, e.g. kernel-doc-2.6.32.46-1.el6.oberon.noarch.rpm. The --no-clean argument tells mock not to clean the build environment first. Without this, the third command will remove the packages generated by the first command.

When the commands are done running, the 64-bit and noarch RPMs can be found in the directory /var/lib/mock/epel-6-x86_64/result/ and the 32-bit RPMs can be found in the directory /var/lib/mock/epel-6-i386/result/.

I haven't tried using Mock in my Makefile but that's next on the list. I also need to simplify my builds so they don't rebuild all RPMs. Since the kernel RPMs take about six hours to build (for both 32-bit and 64-bit) on my VM, this makes builds almost prohibitively long.

I have also thought about building a system that uses Mock, some message queueing system, and some cloud interface to spin up EC2 instances (or the like) for builds. However, that seems pretty close to Koji so I should probably look into that further first.

Cobbler kickstart URL

This is mostly for my reference since I keep forgetting this. As per cobbler's documentation:

The kickstart URL for a system is:
http://$server/cblr/svc/op/ks/system/$name_of_system

Desired characteristics of a server

I've been working on a body of characteristics that a server should ideally have. So far, I have:

Documented
A server should be documented. The documentation should provide enough information to anyone who has to service the server, including allowing them to rebuild the server if necessary. The documentation should also satisfy any auditing requirements of the organization.
Verifiable
A server should have a given known state, described in the documentation, and it should be possible to programmatically determine whether or not the system is in that state.
Secure
A server should be protected against any attacks made by unauthorized entities that would disrupt the server or provide the attacker with information they are not authorized to have.
Monitored
A server should be monitored to ensure that it is in its documented state. Any deviation from that state, either through an uncoordinated change made by operations or through actions of an unauthorized party, should be detected automatically and the operations staff should be notified accordingly.
Backed up
A server should be backed up. Data that may have been lost due to error or tampering should be recoverable within the documented parameters.
Replaceable
A server should be replaceable. If the server fails or significantly goes out of its documented state, any technician should be able to provision and install a replacement while the faulty system can be diagnosed, inspected, and, if possible, corrected outside of the production environment.
Manageable
A server should be manageable and serviceable by any technician authorized to work on it.
Measured
A server should have its statistics measured and recorded on a regular basis. This data should be usable for planning activities and to monitor trends on the server.
Functional
A server should fulfill its documented role.

Some of these are admittedly characteristics of the server's environment and its technicians rather than of the server itself.

If you think I've overlooked anything, please let me know.

Edit: I added the last two originally as comments. I've added them to the actual post to make it easier for anyone who visits in the future.

Adventures in building a patched kernel for CentOS 6

Recently, I've been trying to build a kernel for CentOS 6 that includes the grsecurity security patch. This was complicated by my desire to build the new kernel as an RPM using the CentOS 6 kernel RPM's spec file.

Why grsecurity?

grsecurity is a kernel patch that adds additional security options and protections. The non-RBAC components of grsecurity will work with SELinux which I will use for now.1

Why use the spec file and not make rpm?

Well, first: Because make rpm didn't work. That was disappointing.

In CentOS, the kernel is split into separate packages, e.g. kernel, kernel-headers, and kernel-devel. In order to preserve these, I would have to reverse engineer the spec file, and it just seems simpler to use what's there. The spec file also makes sure that the new kernel is added to the bootloader's configuration file.

Why use an RPM at all?

Building packages from source doesn't scale.2

One of my current projects is to build a fully functional environment where no packages are installed from source.3 In fact, none of the systems are allowed to have compilers. This is why I've worked on creating my own yum repository.

The process

After downloading the source file for Linux 2.6.32.46 and the grsecurity patch and verifying both4, I changed the spec file to use these. Due to an issue with %setup in the spec file that I never fully figured out, I decided to apply the patch manually through the ApplyPatch function defined in the %prep block. I then set about building the kernel, initially using rpmbuild -ba SPEC/kernel.spec, and ran into problems.

The complications

I ran into six problems during the process of building my RPM. These were:

  1. Fixing an error during the %prep phase
  2. Getting my configuration changes to persist
  3. Signing the modules
  4. Turning off the kABI checker
  5. Setting the kernel version correctly
  6. The system halted on boot


Fixing an error during the %prep phase

During the %prep phase, the spec file tries to verify that all of the configuration options known by the kernel are covered in the configuration files. This relies on the nonint_oldconfig target defined in scripts/kconfig/Makefile in the RPM. However, this option does not exist in 2.6.32.46. Extracting the patch for this (and for the related code in scripts/kconfig/conf.c) from the tarball provided with the RPM, adding it to the spec file, and applying it allowed me to proceed... but not very far. If nonint_oldconfig finds any kernel options that are not covered, like the options added by the grsecurity patch, it returns an error and the build halts.

Getting my configuration changes to persist

The SRPM includes generic and architecture-specific configuration files. These are merged together during the %prep phase. If the configuration changes are not present in these files, they are wiped during the phase.

I decided to keep the configuration files intact and instead placed my overriding kernel options in a separate file. I then modified Makefile.config to check for my files and merge those as well if present.

Signing the modules

CentOS 6 inherits the upstream vendor's patches. One such patch is used to sign the modules with a key so that the file integrity of the modules can be verified later. Any module that fails verification is not loaded. There is a kernel configuration option that requires all modules to be signed. As the spec file is supposed to be used with the distribution's patched kernel source, it verifies that the modules are signed. I could have removed this feature from the spec file but I decided this feature was a good one.

This is not part of the stock 2.6.32.46 kernel and, in fact, does not appear to be included in 3.0.4 either. To use this feature, the code changes had to be isolated from the patched distribution kernel and a new patch had to be made. After creating a patch from changes made to 54 source files, I added it to the spec file and built signed modules.

Turning off the kABI checker

The CentOS kernel, like the upstream vendor's kernel, is expected to fulfill a given interface contract when built. Any change that would severely break this contract is not allowed to be made to the kernel.5 As a result, the spec file verifies this contract when the kernel is built.

The standard way to do this is to pass this option to rpmbuild when building the kernel: --without kabichk

As I didn't feel like remembering to specify that every time I build the kernel, I changed the spec file to set the with_kabichk variable to 0. I could have left the check in place by copying Module.symvers to the appropriate kABI file in the SOURCES directory and changing the spec file appropriately.

Setting the kernel version correctly

Unfortunately, this took me a while to figure out. When I built the kernel, the version was being set incorrectly, which resulted in the system not being able to find its modules on boot.

What I figured out in the end was that, for kernel 2.6.32.46, base_sublevel should be set to 32 and stable_update should be set to 46.

grsecurity also adds a file called localversion-grsec that contains the text -grsec. I wrote a patch to remove the file, added it to the spec file, built it, installed it, and...

The system halted on boot

When a CentOS 6 system boots normally, it uses something called dracut. At some point, dracut tries to mount a partition inside a chroot'd environment. As the kernel is configured to prevent this (through the grsecurity patch), the system would then halt on boot.

As I didn't want to disable this feature permanently and I wanted everything else to work from boot, I wrote a small patch to disable the feature on boot. Adding kernel.grsecurity.chroot_deny_mount = 1 to /etc/sysctl.conf then enabled this feature later in the boot process.

And now it works

After building the RPMs, I installed the kernel and kernel-firmware packages on a test VM, rebooted it, and it came up using the new kernel.

I did find that a lot of messages were being written to the console. Setting kernel.printk to "6 4 1 7" via sysctl corrected this.

What next?

I want to work on backporting some changes from the CentOS 6 kernel to the package I have built. I know that the CentOS 6 kernel contains some features from 2.6.33, e.g. recvmmsg, and some enhancements to KVM. Unfortunately, since the source tarball in the RPM has already been patched, I have to tease out what those patches are. Once I have those patches, I should then check to see if there are any grsecurity changes that need to be backported. Since there is no readily apparent repository for the grsecurity patch, this requires looking at the test patch (currently for the 3.0.4 kernel) and comparing manually.

So where can I get this?

Once I have some things worked out, I will post the RPMs and the SRPM somewhere that it can be downloaded. For now, you'll just have to be patient.

  • 1. I have grsecurity installed on another server and I've never gotten around to configuring RBAC. Since a lot of work has gone into the CentOS 6 SELinux policies, I'll just rely on those.
  • 2. I've said this before. Some day I'll write a longer post on it.
  • 3. I'll try to write about this in the near future.
  • 4. You do this whenever you download a source file, right?
  • 5. This is why CentOS 5 does not support IPv6 connection tracking without building a custom kernel.

Writing install triggers for cobbler

Cobbler has the ability to run triggers at specific times. Two of those times include before installing a new machine ("pre-install triggers") and after installing a new machine ("post-install triggers").

"Old-style" triggers involve running executable binaries or scripts (e.g. shell scripts) in specific locations. Pre-install triggers are placed in the directory /var/lib/cobbler/triggers/install/pre/ and post-install triggers are placed in the directory /var/lib/cobbler/triggers/install/post/. The trigger will be passed three pieces of information: the object type, e.g. "system," the name of the object, and the object's IP. (A comment in the run_install_triggers method in remote.py says this passes the name, MAC, and IP but this does not appear to match the name of the variables.) If the trigger requires more information, it will need to pull it from elsewhere or parse cobbler's output. For all but simple tasks, this is probably not a convenient way to go.

Note: There is a bug in cobbler 2.0.3.1 which prevents running "old-style" triggers. See ticket #530 for more information and a possible fix.
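
Bug aside, an old-style trigger can be as small as the following hedged Ruby sketch (any executable language works; the log path is an arbitrary example):

#!/usr/bin/env ruby
# Sketch of an "old-style" post-install trigger.  Cobbler runs every
# executable in /var/lib/cobbler/triggers/install/post/ and passes it the
# object type, the object name, and the IP address.
objtype, name, ip = ARGV[0], ARGV[1], ARGV[2]

File.open('/var/log/cobbler-installs.log', 'a') do |log|
  log.puts "#{Time.now}: installed #{objtype} #{name} (#{ip})"
end

exit 0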

"New-style" triggers are written in Python as modules. They reside in cobbler's module directory. (On my system, this is /usr/lib/python2.4/site-packages/cobbler/modules/.) Each module is required to define at least the functions register and run.

The register function takes no arguments and returns a string corresponding to the directory the module would reside in if it were an "old-style" trigger. For a pre-install trigger, it would be:

def register():
    return "/var/lib/cobbler/triggers/install/pre/*"

For a post-install trigger, it would be:

def register():
    return "/var/lib/cobbler/triggers/install/post/*"

The run function is where the actual code for the trigger should reside. It takes three arguments: A cobbler API reference, an array containing arguments, and a logger reference. The argument array contains the same three values as for "old-style" triggers, i.e. the object type, the name, and the IP address. The logger reference may be set to None and the code should handle that. (In cobbler 2.0.3.1, this will be set to None. This may be fixed when the issue for "old-style" install triggers is.)

For an example of a run function, let's look at one I wrote (based on the trigger in install_post_report.py that is included with cobbler) to automatically sign Puppet certificates:

def run(api, args, logger):

This starts the method. Note the signature.

    settings = api.settings()

    if not str(settings.sign_puppet_certs_automatically).lower() in [ "1", "yes", "y", "true"]:
        return 0

This retrieves the settings from /etc/cobbler/settings. To control the trigger, I added another option there named sign_puppet_certs_automatically. If this value either does not exist or is not set to one of the values indicating it is enabled, the trigger returns a success code (since it's not supposed to run, it shouldn't return a failure code) and exits.

I also added another option to the cobbler settings called puppetca_path which contains the path to the puppetca command.

    objtype = args[0] # "target" or "profile"
    name    = args[1] # name of target or profile

    if objtype != "system":
        return 0

This retrieves the object type and name from the argument array. If the object type is not a system, it returns a success code and exits.

    system = api.find_system(name)
    system = utils.blender(api, False, system)

    hostname = system[ "hostname" ]

This finds the system in the cobbler API and then flattens it to a dictionary. I'm pretty sure this could be improved upon.

    puppetca_path = settings.puppetca_path
    cmd = [puppetca_path, '--sign', hostname]

This retrieves the path for puppetca and sets up the command to be run to sign the certificate.

    rc = 0
    try:
        rc = utils.subprocess_call(logger, cmd, shell=False)
    except:
        if logger is not None:
            logger.warning("failed to execute %s", puppetca_path)

    if rc != 0:
        if logger is not None:
            logger.warning("signing of puppet cert for %s failed", name)

    return 0

This runs the command and logs a warning if either the command fails to be executed or does not succeed. Finally, at the end, it returns a success code.

According to the cobbler documentation, the return code of post-install triggers is ignored, so there's no reason to return anything other than a success value. Pre-install triggers apparently can halt the process if they return a non-zero value.

Note: The above code will not run correctly if logger is set to None. This is because utils.subprocess_call tries to call logger without verifying that it is not None and throws an exception. To use this with cobbler 2.0.3.1, you must either change the call to utils.run_triggers in remote.py's run_install_triggers method or change utils.subprocess_call to properly check for logger being set to None.

Also note: Since the original code is under the GPL, the code above is also under the GPL.
