Archive for No name

First Experiences with BDD, Cucumber, And RSpec

I've restarted the project I have been working on. The prototype code worked well enough but I don't think the code is in good shape. I didn't add any testing to the prototype and this started to cause significant problems.

I don't know when or where I first heard about RSpec or behavior-driven development. However, when I found out that a book on RSpec, the aptly titled The RSpec Book, was being published, I pre-ordered it. I have been reading the beta PDF and decided to apply it to the rewrite.

Like any new programming methodology, it takes a little bit of getting used to. It took me a few hours last night to implement a single action on a controller but I'm certain this will improve over time. (I also find that switching between the PDF and TextMate on my laptop costs time. This is definitely a case where a second monitor would be useful.)

I find BDD interesting because it forces a different way of programming than I'm used to. I spend a lot of time working on the model and seem to get around to the controllers and views near the end (if ever). Instead, when writing the feature for Cucumber, I have to establish what behavior should occur for the user and then make sure that the controller and view get written for that behavior. I obviously lack the experience to make an informed decision on whether or not BDD is a better software development methodology than TDD or other testing-enhanced processes. (Any process that involves testing is, in my opinion, significantly better than any that does not.)

One omission I found in the book (or maybe I haven't looked in the right place) is that there seems to be no discussion about spec'ing routes. RSpec does have methods for it. In discussion on the rspec-users mailing list, David Chelimsky provided a link to some examples so some documentation does exist. (I've left a note in the book errata so we'll see what happens.)

An issue I found working with RSpec is that ruby script/spec spec/ works but rake rspec does not. This seems to be because the rake task is not Rails-aware while script/spec is. I haven't done much more research into it though. Since the first command works, I'll continue to use it.

On Choosing a License

I have a web application I want to build. Other sites have an application like this built already but they don't quite do everything I want to do. I have started doing some exploratory work on it. (And I've found it's not quite as easy as it looks. This post by Benjamin Pollack comes to mind.)

I want to release the end result as open source so other people can use it. This way, those who wish to use it for themselves have the option to do so. However, before doing so, I have to pick a license.

Why Not the (A/L)GPL By Default?

The GPL is perhaps the most ubiquitous of the open source licenses. Various projects, from Linux to MySQL to Drupal, have benefited from the license and from the openness it fosters (and enforces). It also has its detractors, usually business entities but sometimes individual developers, who complain about the openness required.

Zed Shaw sparked a new set of exchanges in the BSD-vs.-GPL holy war two weeks ago with his post "Why I (A/L)GPL". This caused a lot of discussion. For example, Kumar McMillan responded with a post detailing reasons not to license code with the GPL.

Separately, Jacob Kaplan-Moss posted a set of twenty questions for the GPL. James Bennett has other questions and concerns in his post "When Licenses Attack". These posts point out a distinct lack of clarity with regard to what the GPL allows or disallows with dynamic languages. For example, the GPL explicitly mentions linking but does a Python import or a Ruby require constitute linking?

The GPL has a known loophole for web applications or other network services. The loophole is not a horrible idea, as mentioned by Dries Buytaert and Ted Haeger.

The presence of the loophole, however, concerns some people. The solution is the AGPL, which forces service providers et al. to provide a means to get the source code for a hosted network service, web application, etc. Some people like this. For example, Alberto García Hierro, formerly of byNotes, chose AGPL for his code.

However, even the AGPL has issues. The AGPL is technically incompatible with the GPL. There was some objection to some of the wording within the Debian community. On the forums for the Frog CMS, a web developer stated an issue with Frog's use of the AGPL and how he felt it would impact his client sites. Ted Haeger asks if the AGPL is too radioactive. And even Alberto García Hierro mentions issues with the AGPL. Both of them wonder if there's a need for an LGPL-like version of the AGPL.

What About the Apache License?

Kumar McMillan suggests the Apache License as an alternative to the GPL. It is, indeed, an attractive alternative. Based on the license itself and part of chapter 10 of Van Lindberg's Intellectual Property and Open Source, it looks like a license well suited to a lot of projects that want to avoid the GPL. The clauses about patents and trademarks may not be useful for a small-time developer but the clause about contributions could certainly help to avoid headaches.

While there has been some discussion on how the BSD and MIT licenses interact with the GPL, I have found little documentation on how Apache-licensed code could be integrated into MIT- or BSD-licensed projects. It looks like preserving the license and attributions is necessary. I could see this being messy. The only concrete information I've found is that OpenBSD specifically forbids inclusion of source code licensed under version 2 of the Apache License.

So Choosing a License

For my particular application, the factors that impact a license decision are:

  • Platform: The language or framework the application is built on plays a part in determining the license. Any development using either that would be distributed must be done with a license that is compatible with that language or framework.

    An issue arises when dealing with frameworks that include parts of themselves in the final application. For example, some of the generators in Ruby on Rails could be claimed to work this way. This means that the output is considered a derivative work since it includes part of the original. (This is part of why GNU bison has a license exemption in its output files.) This further requires the use of a compatible license. (I am also not sure if it is possible to have two sections of a file under different licenses.)

    My current reasoning: The current draft of code is built on Rails which is licensed under the MIT/X11 license. Since the MIT license is one of the most permissive licenses, this does not restrict the license choice.

    The copyright status of output files from the Rails generators concerns me. Including parts of Rails within the application obviously makes it a derivative work of Rails. However, I do not know at what point, if any, the copyright for those sections of Rails would transfer to me or if those sections would always be copyrighted by the Rails development team and therefore would always fall under the MIT license and, therefore, always need the MIT license included.

    This concern alone makes the MIT license a strong candidate.

    Were I using Django, the BSD license would likely be a strong candidate for the same reasons.

  • Reusing code: As mentioned, this is not the first time someone has tried to do what I'm doing. There exist open source projects that at least somewhat overlap with what I'm doing. For the purpose of this discussion, we'll say that one is released under the three-clause BSD license and one is released under the GPL.

    The BSD license has few restrictions on what can be done with the source code. As long as attribution is given and the terms of the license are mentioned, source code can be copied outright. Any derivative works, e.g. translations, modifications, etc., can be used or even relicensed as long as the original attribution and licensing is given.

    The GPL has significant restrictions on what can be done with the source code. While I can do almost anything I want with source code licensed under the BSD license (aside from stripping attributions and the original license), I can only include GPL source code in other GPL'd works. Derivative works also have to be licensed under the GPL.

    So if I use a BSD or MIT license, I can only use the BSD-licensed project for a reference. This is also true for the AGPL since, as mentioned earlier, it is technically incompatible with the GPL. I cannot use the GPL'd project as a reference. Only if I use the GPL can I use that project as a reference.

    (This is technically not completely true. The GPL only applies to copyrighted material. According to section 102b of the US Copyright Law, copyright protection does not apply to ideas, procedures, or processes. It would therefore be theoretically possible to use the GPL'd project as a reference to find out how it does something. However, due to the high risk of cross-contamination, i.e. the likelihood that the reimplementation of a process or procedure would resemble a derivative work rather than a separate one, it is probably safer to not look at all.)

    Down the line, it is also possible that someone might want to use my code. If I release the code under the GPL, they cannot use it unless they themselves are using the GPL. The same is true of the AGPL. If I use a permissive license, there are no limitations.

    My current reasoning: Losing access to the GPL'd project is not a significant concern. It would probably speed development, at least some, but I would probably learn better if I implemented it myself from the beginning.

    I don't expect to do anything significant in coding this application. Anything I come up with could be easily developed by someone else given enough time. Requiring the use of a reciprocal license then just gets in the way.

  • Business model: This is often a sticky point for choosing licenses. A lot of the time, it comes down to two questions: "Do I want to have the option to make money off of this?" and "Do I care if other people make money off of this?"

    If there is a strong desire to prevent other people from making money off of the project, the GPL is a strong candidate. Since the source for the software must always be distributed with the binaries, it is unlikely that someone else could build a business model around direct sales of the software. There is no way to prevent another person from building a business model around offering support or other services based on the software. For example, since Drupal is released under the GPL, it is exceedingly difficult to build a business around selling the software. However, Acquia has a business model built around providing services for Drupal.

    Releasing software under the GPL does not prevent the copyright holder from making money off of it. While the value of paying for the software is lessened since a free version is available, there is nothing that prevents the copyright holder from providing the software under a commercial license. (As far as I know, no copyright license can prevent the copyright holder from relicensing the software.) MySQL AB saw some success with releasing a commercial version of MySQL.

    The BSD and MIT licenses place few restrictions on what someone else can do with the software. While they do not prevent the copyright holder from making money on direct sales of the software, there is nothing to prevent another person from doing the same.

    I believe that it is unrealistic to make money with direct sales of a web application, especially one built on an open source framework. (There is a market for it obviously, given the existence of the ionCube PHP Encoder and Zend Guard.) Most of the money to be made with a web application is going to be found with services built around a specific application, e.g. hosting or local customization.

    The only way to escape the local customization and hosting loophole if you want to avoid others making money from the application would be to use the AGPL. This forces anyone who modifies the source code and deploys the resulting work to provide a download link (or other means of distribution). Using any other license does not allow the developer to get access to any downstream modifications unless volunteered by the people who make them.

    My current reasoning: Since this is a web application and built on Ruby on Rails, I don't think there's any concern about the source code being used for monetary gain. (Were I using Java for this, where the application could be distributed solely in binary format, I would probably prefer a license that enforced source code distribution.)

    I have no concerns about someone building a service around hosting the application. (I would consider doing this myself but I simply do not have the time.) My main concern is that someone would build a business model around a modified copy of the application and not send those changes upstream. This could only be mitigated through using the AGPL and the person doing this being honest enough to follow the terms of the license. However, given the issues surrounding the AGPL, I doubt that this would be a positive tradeoff.

So, given the above, the MIT license sounds like a strong candidate. I am still thinking it over but this is how I'm currently leaning.

Miscellanea from Rails-land

I have been experimenting with a Ruby on Rails project recently. Part of the project requires importing large (almost 150 MB, almost one million lines) text files of a given format into a normalized database. I have been working with this part of the project first because I want to find out how much space each log file will take up in the database.

Here are some of my notes and observations:

  • The original draft of the code parsed each line and then called save on the object for that line. Unfortunately, this script took six hours to run using script/runner, and I really need this process to take around one minute.

    ActiveRecord::Extensions (found via this post on the Accelerate HR blog) provides a method to import a large number of records at once. When configuring it to write 1,000 records at a time to the database, the SQLite version took about four and a half hours. Using "chunk" sizes of 10,000 took over nine hours before I stopped it manually because that caused the script to start swapping to disk. (Servers with only 512 MB of memory are no longer as useful as they used to be.)

    Switching to MySQL and using greater normalization results in faster run time with a chunk size of 1,000. However, even then, the import script takes about three hours to run.

  • Mixins can be used to share behavior between related models without repeating yourself. Jamis Buck and DHH call these "concerns". The use of mixins in this way does clean up the models considerably since there's a lot of similar code. However, I do have models that look like:
    class Model
      include Mixin1
      include Mixin2
    end

  • At least on my test server (running CentOS 5.3), the version of Ruby packaged with the OS is broken when using large amounts of memory. Something causes a bit to be unset which yields strange errors. Sometimes it's just a segmentation fault. Sometimes ActiveRecord complains that the object has no "pime" field (when it should have been "time") or other such aberrations. And then there was the time the SQLite driver complained that "INSART" was not a valid SQL command.

    These issues do not manifest under the version of Ruby Enterprise Edition installed so this suggests that the RPM ruby is broken.

  • I installed Ruby Enterprise Edition because of the suggested gains in performance and I plan to eventually run the completed application through passenger. However, I wonder how the performance of REE compares to Ruby 1.9.1 for this application.
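The chunked import from the first item above can be sketched in plain Ruby. This is only an illustration under assumptions: LogEntry, parse_line, each_chunk, and import are hypothetical names, the line format is made up, and the database write is replaced by a stand-in so that the sketch runs without Rails or ActiveRecord::Extensions.

```ruby
# Sketch of chunked importing. LogEntry, parse_line, each_chunk, and import
# are hypothetical names; the real script would hand each chunk to a
# bulk-import call instead of just counting records.
LogEntry = Struct.new(:time, :message)

# Hypothetical parser: assumes lines of the form "TIME MESSAGE".
def parse_line(line)
  time, message = line.chomp.split(' ', 2)
  LogEntry.new(time, message)
end

# Yields parsed records one chunk at a time, so only chunk_size
# parsed records are held in memory at once.
def each_chunk(lines, chunk_size = 1_000)
  lines.each_slice(chunk_size) do |slice|
    yield slice.map { |line| parse_line(line) }
  end
end

# Stand-in for the database write: returns the size of each chunk.
def import(lines, chunk_size = 1_000)
  chunk_sizes = []
  each_chunk(lines, chunk_size) do |records|
    chunk_sizes << records.size # a real script would bulk-insert records here
  end
  chunk_sizes
end
```

Because each_slice works on any Enumerable, passing File.foreach(path) instead of an array should let the file be read lazily rather than loaded whole, which matters on a server with little memory.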

Using include_path in PHP

Common convention states that include files for a PHP web application should be kept in a given place. This allows the developer to easily find them.

This becomes a headache though when a site is deep and has many subdirectories. You then start seeing code like:

require_once( '../../../../../../includes/config.php' );

This creates a headache for developers since they have to remember where a page is in respect to the includes directory. If the page is moved to a different subdirectory, the path in the require_once (or, if you prefer, include, include_once, or require) statement needs to be changed.

A simple workaround for this is to add the root directory of the site (or whichever directory is just above the includes directory) to PHP's include path.

Let's assume a directory structure like:

  `- www/
      |- logs/
      `- web/
         |- includes/
         |   `- config.php
         |- test/
         |   |- test/
         |   |   `- test/
         |   |       `- test2.php
         |   `- test.php
         `- index.php

In order to include includes/config.php without modifying the include path, each page would use a different path:

  • index.php:
    require_once( 'includes/config.php' );
  • test/test.php:
    require_once( '../includes/config.php' );
  • test/test/test/test2.php:
    require_once( '../../../includes/config.php' );

Now, if /var/www/sites/ is added to PHP's include path, all three scripts can load config.php with the statement:

require_once( 'includes/config.php' );

This clearly reduces the hassle of maintaining and referencing the single directory for includes.

To add the /var/www/sites/ directory to the include path, it must be appended to the include_path configuration setting. To do this for PHP running under Apache using mod_php, the following line can be added in a .htaccess file or directly to the site's VirtualHost block:

php_value include_path '.:/usr/local/lib/php:/var/www/sites/'

(This assumes that PEAR modules are stored under /usr/local/lib/php.)

If this is the only site on the server, the include_path setting can be edited in php.ini. If using a CGI-based PHP, you may be able to create a local php.ini for this same end. Consult the PHP documentation on runtime configuration for more information.

Using Git For Websites

I've written before about using subversion and Piston for hosting this particular site.

The problem with using subversion for websites is that it is not available when working offline (assuming that your repository does not reside on your laptop). While this is not normally an issue for sites based around a CMS, e.g. WordPress or Drupal, there are times that having access to version control would be useful. Examples include working on a new Drupal theme or doing old-fashioned HTML web development.

The solution to this problem is to use a distributed version control system. In a distributed version control system, a local repository is maintained on the development machine which is then periodically synchronized with the master repository. One such system is git. Git is used by the Linux kernel developers and many other projects besides. It also has the feature of being able to interact with subversion which is useful for me since I'm not ready to phase out my subversion repository.

At a glance, git appears to be superior for websites. Aside from the advantage of being distributed, it also stores its metadata in a subdirectory of the top-level directory called .git. Compare this against subversion which stores its metadata in a subdirectory called .svn within each directory and subdirectory of the repository. This means that if you have directories that are owned by the webserver (this happens on websites sometimes), you have to get them chown'd before you can add them to a subversion repository. With git, this isn't an issue. (Git will not preserve the file ownership but this isn't usually a significant issue.)

There are a few differences between subversion and git. First, git does not use a central repository like subversion. A common structure in subversion would be to have a central repository and then each project would be a subdirectory. In git, each project has its own repository. While you can check out specific directories from subversion, you can only check out the entire repository for git.

While subversion allows checking in empty directories, git does not. This can be an issue for some applications. The Git FAQ suggests, as a workaround, adding a file called .gitignore to each otherwise empty directory. Git will then check in the .gitignore file.

Git does not appear to support anything like svn:externals. However, this does not appear to be a significant issue. Version 2 of Piston supports git (as well as subversion) so it can be used to fulfill the same purpose. (Correction: As Jakub Narębski points out in the comments, git's submodule mechanism is a lot like svn:externals.)

Using git is mostly the same as with subversion. If you are familiar with subversion, you should pick up on the commands relatively quickly. (And if you're completely new to version control, I suggest getting Pragmatic Version Control Using Git.) The one major catch is: Always remember to push commits to the central repository with git push origin master. If you do not do this, the local commits will never reach the repository. This is an issue if you expect to redeploy the site from that repository in the future.

Finally, when using Git and Piston, you will need to add these directives to block access to the metadata:

RedirectMatch 404 /\.git(/|$)
RedirectMatch 404 /\.piston\.yml

If still checking out from a CVS repository, e.g. from Drupal, you will still need to include this as well:

RedirectMatch 404 /CVS(/|$)

Project Euler as a Means of Learning

About four months ago, I wrote about Project Euler. Back then, I posted that I would do the problems in Ruby to try to hone my skills there. Since then, I've mostly done Ruby but I have also done some solutions in Haskell for speed reasons, namely prime number generation. (I should really revisit how I do that in Ruby...)

Project Euler is a good way to introduce some basic concepts. Each problem is best solved using a particular subset of the language. They are probably better exercises for those who like puzzles than the exercises normally taught in beginner books or in first semester programming courses. However, I see two problems with using Project Euler as a long-term means of learning a programming language.

First, Project Euler's focus is not on learning a given language or teaching about a given set of language features. Project Euler's focus is on the mathematical problems. More time in solutions, especially in later problems, is spent on figuring out the algorithm or method for solving the problem rather than on how to write that algorithm in a given programming language. Some language features or methodologies are never addressed because they never come up in the process of solving the problems.

Second, the scope of Project Euler problems is relatively small. The only focus within a given problem is answering that problem. Each solution amounts to a one-time script with components you might reuse later. As a result, Project Euler is insufficient for learning how to develop applications in a given language. (It may, however, have use in learning how to develop a library or module since some algorithms or components are used repeatedly.) There is not sufficient scope to investigate using it to develop an interactive application.

So I think that Project Euler works out well when first starting. However, once familiar with the basic concepts, supplementing or replacing Project Euler with another method of learning, e.g. building an application, is needed to ensure that there is further learning.

This is not to say that Project Euler should be completely abandoned at that point. It just ceases to be useful for learning about applying the programming language by itself. If you happen to enjoy the puzzles (I know I do), feel free to continue to do them.

GnuPG keys on USB

This is a reasonably simple process. Most of the process can be found in this Enigmail forum discussion.

  1. Move the GnuPG keys to a USB drive. (For the purpose of this discussion, I will assume that the USB drive is X: and the directory on the drive is .gnupg.)
  2. On the computer (not on the USB drive), change gpg.conf to include these directives:
    keyring X:\.gnupg\pubring.gpg
    primary-keyring X:\.gnupg\pubring.gpg
    secret-keyring X:\.gnupg\secring.gpg
    trustdb-name X:\.gnupg\trustdb.gpg

    Under Mac OS X, assuming a volume name of USB drive, you would add:

    keyring /Volumes/USB drive/.gnupg/pubring.gpg
    primary-keyring /Volumes/USB drive/.gnupg/pubring.gpg
    secret-keyring /Volumes/USB drive/.gnupg/secring.gpg
    trustdb-name /Volumes/USB drive/.gnupg/trustdb.gpg

    For Linux, it should be the same as for OS X but /Volumes/USB drive would be replaced by the mount point used for the drive.

  3. And that's it.

If you want to use an encrypted partition or filestore, e.g. through TrueCrypt, the above instructions are still valid. However, you would point it to wherever you have TrueCrypt mount the encrypted partition or filestore.

My PGP key

I finally went through the process of setting up a PGP key. The fingerprint is:

9A86 1FA4 DADE 9C93 F2B0  7C23 38E9 ECDE D61A 0437

You can retrieve the key from a public keyserver or you can download it here.

(There is another PGP key with a fingerprint ending in FCD4 761B but I prefer that it is not used.)

The PGP key was created following the steps used by Ana Guerrero in her blog post.

To make use of PGP, I have set up pinepgp for PINE and Enigmail for Thunderbird. (I'm trying to move away from PINE because it's not working out particularly well but I still have a lot of mail and other data stored in it.) I haven't set up PGP on my iBook yet but that's one of my projects for the next few days. I also want to move the PGP keys to a USB drive for security but I haven't started on that process yet either. I will include more information on both when I have them set up.

Infix, Prefix, Postfix, Oh My

Any professionally taught programmer eventually has to learn binary trees. Some high school students are exposed early as part of the AP Computer Science AB exam but most have to learn it as part of a data structures course in college.

There are three ways to read a binary tree:

  • Prefix: Root node, then left child, then right child
  • Infix: Left child, then root node, then right child
  • Postfix: Left child, then right child, then root node

Take, for example, this really simple binary tree:

      +
     / \
    2   3

The ways to read this are:

  • Prefix: + 2 3
  • Infix: 2 + 3
  • Postfix: 2 3 +
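The three readings can be sketched in a few lines of Ruby. This is a minimal illustration, not a full tree library: Node and the method names prefix, infix, and postfix are hypothetical names chosen to match the list above, and the tree built at the end is the "+" node with children 2 and 3.

```ruby
# A minimal binary tree node; a leaf simply has nil children.
Node = Struct.new(:value, :left, :right)

# Prefix: root node, then left child, then right child.
def prefix(node)
  return [] if node.nil?
  [node.value] + prefix(node.left) + prefix(node.right)
end

# Infix: left child, then root node, then right child.
def infix(node)
  return [] if node.nil?
  infix(node.left) + [node.value] + infix(node.right)
end

# Postfix: left child, then right child, then root node.
def postfix(node)
  return [] if node.nil?
  postfix(node.left) + postfix(node.right) + [node.value]
end

tree = Node.new('+', Node.new(2), Node.new(3))
puts prefix(tree).join(' ')  # + 2 3
puts infix(tree).join(' ')   # 2 + 3
puts postfix(tree).join(' ') # 2 3 +
```

The only difference between the three methods is where the root's value is placed relative to the two recursive calls.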

The infix reading of this tree resembles (and, in fact, is) the standard way we write and interpret simple mathematical equations. "Two plus three equals..." (As an aside, all simple mathematical equations can be expressed as a binary tree. I'm not happy with the tools I have available to render trees right now so I will leave this as an exercise for you, the reader.)

The postfix reading should be familiar to anyone who owns a Hewlett-Packard graphing calculator. This form of representing mathematical equations is most commonly referred to as Reverse Polish notation. Postfix ordering of mathematical expressions is commonly used for implementing stack-based calculators, usually in assignments for a programming class.
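As a rough sketch of the kind of stack-based calculator mentioned above, here is a small postfix evaluator in Ruby. The method name evaluate_postfix is hypothetical, tokens are assumed to be separated by whitespace, and only the four basic binary operators are handled.

```ruby
# Evaluates a whitespace-separated postfix (Reverse Polish) expression.
# Operands are pushed on a stack; each operator pops two operands and
# pushes the result. Only + - * / are supported in this sketch.
def evaluate_postfix(expression)
  stack = []
  expression.split.each do |token|
    if %w[+ - * /].include?(token)
      right = stack.pop
      left = stack.pop
      raise ArgumentError, "too few operands for #{token}" if left.nil? || right.nil?
      stack.push(left.public_send(token, right))
    else
      stack.push(Float(token)) # raises ArgumentError for non-numeric tokens
    end
  end
  raise ArgumentError, 'malformed expression' unless stack.size == 1
  stack.first
end

puts evaluate_postfix('2 3 +') # 5.0
```

Operands are converted with Float so that division behaves like a calculator rather than truncating like integer division.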

The prefix reading resembles the standard way we use constructs in programming languages. If we had to represent "2 + 3" using a function, we would write something like plus( 2, 3 ). This is most clearly shown with LISP's construct ( + 2 3 ). Haskell's backtick syntax for using an ordinary function as an infix operator, e.g. `div`, has a side effect of reminding programmers that most functions are prefix-oriented.

So why discuss reading binary trees anyway? In a classroom, teaching the student how to read a binary tree leads to the student being able to program a way to read a binary tree which will then lead to other things. Here, I discuss reading them as background for the next post which may itself lead to other things.

Edit: (11 May 2009) Unfortunately, in the process of writing the next post, I realized that the entire premise of the post was invalid. So the next post will probably not have anything to do with trees.

Verizon Data Breach Investigations Report 2009

Via Dr. Anton Chuvakin's post, I found the 2009 Verizon Business Data Breach Investigations Report. The document is a fascinating read to get an idea of the state of things last year. Chances are that this year will build on last year.

The targeting of financial institutions by organized crime entities is surprising to me and yet not surprising. Based on the behavior of would-be fraud perpetrators I have observed, there are a lot of credit card numbers out in the wild with little rhyme or reason to them. Compromising a financial institution or a merchant account provider, e.g. Heartland, would be an easy way to get credit card numbers into the open. However, the interest in account numbers and PINs is more disturbing and provides less recourse for the victim. This trend can be seen even in the US where ATM skimmers are becoming more prevalent.

The major thing I come away from the report with is: The big fish have a lot more to be worried about than in the past. But that doesn't mean the small fish are safe in the water.