On Choosing a License

I have a web application I want to build. Other sites have an application like this built already but they don't quite do everything I want to do. I have started doing some exploratory work on it. (And I've found it's not quite as easy as it looks. This post by Benjamin Pollack comes to mind.)

I want to release the end result as open source so other people can use it. This way, those who wish to use it for themselves have the option to do so. However, before doing so, I have to pick a license.

Why Not the (A/L)GPL By Default?

The GPL is perhaps the most ubiquitous of the open source licenses. Various projects, from Linux to MySQL to Drupal, have benefited from the license and from the openness it fosters (and enforces). It also has its detractors who complain about the openness required, usually from business entities but sometimes from individual developers.

Zed Shaw sparked a new set of exchanges in the BSD-vs.-GPL holy war two weeks ago with his post "Why I (A/L)GPL". This caused a lot of discussion. For example, Kumar McMillan responded with a post detailing reasons not to license code with the GPL.

Separately, Jacob Kaplan-Moss posted a set of twenty questions for the GPL. James Bennett has other questions and concerns, in his post "When Licenses Attack". These posts point out a distinct lack of clarity with regards to what the GPL allows or disallows with dynamic languages. For example, the GPL explicitly mentions linking but does a Python include or a Ruby require constitute linking?

The GPL has a known loophole for web applications or other network services. The loophole is not a horrible idea, as mentioned by Dries Buytaert and Ted Haeger.

The presence of the loophole, however, concerns some people. The solution is the AGPL which forces service providers et al to provide a means to get the source code for a hosted network service, web application, etc. Some people like this. For example, Alberto García Hierro, formerly of byNotes, chose AGPL for his code.

However, even the AGPL has issues. The AGPL is technically incompatible with the GPL. There was some objection to some of the wording within the Debian community. On the forums for the Frog CMS, a web developer stated an issue with Frog's use of the AGPL and how he felt it would impact his client sites. Ted Haeger asks if the AGPL is too radioactive. And even Alberto García Hierro mentions issues with the AGPL. Both of them wonder if there's need for a LGPL-like version of the AGPL.

What About the Apache License?

Kumar McMillan suggests the Apache License as an alternative to the GPL. It is, indeed, an attractive alternative. Based on the license itself and part of chapter 10 of Van Lindberg's Intellectual Property and Open Source, it looks like a license well suited to a lot of projects who want to avoid the GPL. The clauses about patents and trademarks may not be useful for a small-time developer but the clause about contributions could certainly help to avoid headaches.

While there has been some discussion on how the BSD and MIT licenses interact with the GPL, I have found little documentation on how Apache-licensed code could be integrated into MIT- or BSD-licensed projects. It looks like preserving the license and attributions are necessary. I could see this being messy. The only concrete information I've found is that OpenBSD specifically forbids inclusion of source code licensed under version 2 of the Apache License.

So Choosing a License

For my particular application, the factors that impact a license decision are:

  • Platform: The language or framework the application is built on plays a part in determining the license. Any development using either that would be distributed must be done with a license that is compatible with that language or framework.

    An issue arises when dealing with frameworks that include parts of themselves in the final application. For example, some of the generators in Ruby on Rails could be claimed to work this way. This means that the output is considered a derivative work since it includes part of the original. (This is part of why GNU bison has a license exemption in its output files.) This further requires the use of a compatible license. (I am also not sure if it is possible to have two sections of a file under different licenses.)

    My current reasoning: The current draft of code is built on Rails which is licensed under the MIT/X11 license. Since the MIT license is one of the most permissive licenses, this does not restrict the license choice.

    The copyright status of output files from the Rails generators concerns me. Including parts of Rails within the application obviously makes it a derivative work of Rails. However, I do not know at what point, if any, the copyright for those sections of Rails would transfer to me or if those sections would always be copyrighted by the Rails development team and therefore would always fall under the MIT license and, therefore, always need the MIT license included.

    This concern alone makes the MIT license a strong candidate.

    Were I using Django, the BSD license would likely be a strong candidate for the same reasons.

  • Reusing code: As mentioned, this is not the first time someone has tried to do what I'm doing. There exist open source projects that at least somewhat overlap with what I'm doing. For the purpose of this discussion, we'll say that one is released under the three paragraph BSD license and one is released under the GPL.

    The BSD license has few restrictions on what can be done with the source code. As long as attribution is given and the terms of the license are mentioned, source code can be copied outright. Any derivative works, e.g. translations, modifications, etc., can be used or even relicensed as long as the original attribution and licensing is given.

    The GPL has significant restrictions on what can be done with the source code. While I can do almost anything I want with source code licensed under the BSD license (aside from strip attributions and the original license), I can only include GPL source code in other GPL'd works. Derivative works also have to licensed under the GPL.

    So if I use a BSD or MIT license, I can only use the BSD-licensed project for a reference. This is also true for the AGPL since, as mentioned earlier, it is technically incompatible with the GPL. I cannot use the GPL'd project as a reference. Only if I use the GPL can I use that project as a license.

    (This is technically not completely true. The GPL only applies to copyrighted material. According to section 102b of the US Copyright Law, copyright protection does not apply to ideas, procedures, or processes. It would therefore be theoretically possible to use the GPL'd project as a reference to find out how it does something. However, due to the high risk of cross-contamination, i.e. the likelihood that the reimplementation of a process or procedure would resemble a derivative work rather than a separate one, it is probably safer to not look at all.)

    Down the line, it is also possible that someone might want to use my code. If I release the code under the GPL, they cannot use it unless they themselves are using the GPL. The same is true of the AGPL. If I use a permissive license, there are no limitations.

    My current reasoning: Losing access to the GPL'd project is not a significant concern. It would probably speed development, at least some, but I would probably learn better if I implemented it myself from the beginning.

    I don't expect to do anything significant in coding this application. Anything I come up with could be easily developed by someone else given enough time. Requiring the use of a reciprocal license then just gets in the way.

  • Business model: This is often a sticky point for choosing licenses. A lot of the time, it comes down to two questions: "Do I want to have the option to make money off of this?" and "Do I care if other people make money off of this?"

    If there is a strong desire to prevent other people from making money off of the project, the GPL is a strong candidate. Since the source for the software must always be distributed with the binaries, it is unlikely that someone else could build a business model around direct sales of the software. There is no way to prevent another person from building a business model around offering support or other services based on the software. For example, since Drupal is released under the GPL, it is exceedingly difficult to build a business around selling the software. However, Acquia has a business model built around providing services for Drupal.

    Releasing software under the GPL does not prevent the copyright holder from making money off of it. While the value of paying for the software is lessened since a free version is available, there is nothing that prevents the copyright holder from providing the software under a commercial license. (As far as I know, no copyright license can prevent the copyright holder from relicensing the software.) MySQL AB saw some success with releasing a commercial version of MySQL.

    The BSD and MIT licenses place few restrictions on what someone else can do with the software. While they do not prevent the copyright holder from making money on direct sales of the software, there is nothing to prevent another person from doing the same.

    I believe that it is unrealistic to make money with a direct sales of a web application, especially one built on an open source framework. (There is a market for it obviously, given the existence of the ionCube PHP Encoder and Zend Guard.) Most of the money to be made with a web application is going to be found with services built around a specific application, e.g. hosting or local customization.

    The only way to escape the local customization and hosting loophole if you want to avoid others making money from the application would be to use the AGPL. This forces anyone who modifies the source code and deploys the reuslting work to provide a download link (or other means of distribution). Using any other license does not allow the developer to get access to any downstream modifications unless volunteered by the people who make them.

    My current reasoning:Since this is a web application and built on Ruby on Rails, I don't think there's any concern about the source code being used for monetary gain. (Were I using Java for this, when the application could be distributed solely in binary format, I probably prefer a license that enforced source code distribution.)

    I have no concerns about someone building a service around hosting the application. (I would consider doing this myself but I simply do not have the time.) My main concern is that someone would build a business model around a modified copy of the application and not send those changes upstream. This could only be mitigated through using the AGPL and the person doing this being honest enough to follow the terms of the license. However, given the issues surrounding the AGPL, I doubt that this would be a positive tradeoff.

So, given the above, the MIT license sounds like a strong candidate. I am still thinking it over but this is how I'm currently leaning.

Miscellanea from Rails-land

I have been experimenting with a Ruby on Rails project recently. Part of the project requires importing large (almost 150 MB, almost one million lines) text files of a given format into a normalized database. I have been working with this part of the project first because I want to find out how much space each log file will take up in the database.

Here's some of my notes and observations:

  • The original draft of the code parsed the line and then called save on the object for that line. Unfortunately, this script took six hours to run using script/runner. Unfortunately, I really need this process to take around one minute rather than six hours.

    ActiveRecord::Extensions (found via this post on the Accelerate HR blog) provides a method to import a large number of records at once. When configuring it to write 1,000 records at a time to the database, the SQLite version took about four and a half hours. Using "chunk" sizes of 10,000 took over nine hours before I stopped it manually because that caused the script to start swapping to disk. (Servers with only 512 MB of memory are no longer as useful as they used to be.)

    Switching to MySQL and using greater normalization results in faster run time with a chunk size of 1,000. However, even then, the import script takes about three hours to run.

  • Mixins can be used to share behavior between related models without repeating yourself. Jamis Buck and DHH call these "concerns". The use of mixins in this way does clean up the models considerably since there's a lot of similar code. However, I do have models that look like:
    class Model
      include Mixin1
      include Mixin2
  • At least on my test server (running CentOS 5.3), the version of Ruby packaged with the OS is broken when using large amounts of memory. Something causes a bit to be unset which yields strange errors. Sometimes it's just a segmentation fault. Sometimes ActiveRecord complains that the object has no "pime" field (when it should have been "time") or other such aberrations. And then there was when the SQLite driver complained that "INSART" was not a valid SQL command.

    These issues do not manifest under the version of Ruby Enterprise Edition installed so this suggests that the RPM ruby is broken.

  • I installed Ruby Enterprise Edition because of the suggested gains in performance and I plan to eventually run the completed application through passenger. However, I wonder how the performance of REE compares to Ruby 1.9.1 for this application.

Using include_path in PHP

Common convention states that include files for a PHP web application should be kept in a given place. This allows the developer to easily find them.

This becomes a headache though when a site is deep and has many subdirectories. You then start seeing code like:

require_once( '../../../../../../includes/config.php' );

This creates a headache for developers since they have to remember where a page is in respect to the includes directory. If the page is moved to a different subdirectory, the path in the require_once (of, if you prefer, include, require, or require_once) statement needs to be changed.

A simple workaround for this is to add the root directory of the site (or whichever directory is just above the includes directory) to PHP's include path.

Let's assume a directory structure like:

  `- www/
      |- logs/
      `- web/
         |- includes/
         |   `- config.php
         |- test/
         |   |- test/
         |   |   `- test /
         |   |       `- test2.php
         |   `- test.php
         `- index.php

In order to include includes/config.php without modifying the include path, each page would use a different path:

  • index.php:
    require_once( 'includes/config.php' );
  • test/test.php:
    require_once( '../includes/config.php' );
  • test/test/test/test2.php:
    require_once( '../../../includes/config.php' );

Now, if /var/www/sites/example.com/www/web is added to PHP's include path, all three scripts can load config.php with the statement:

require_once( 'includes/config.php' );

This clearly reduces the hassle of maintaining and referencing the single directory for includes.

To add the /var/www/sites/example.com/www/web directory to the include path, it must be appended to the include_path configuration setting. To do this for PHP running under Apache using mod_php, the following line can be added in a .htaccess file or directly to the site's VirtualHost paragraph:

php_value include_path '.:/usr/local/lib/php:/var/www/sites/example.com/www/web'

(This assumes that PEAR modules are stored under /usr/local/lib/php.)

If this is the only site on the server, the include_path setting can be edited in php.ini. If using a CGI-based PHP, you may be able to create a local php.ini for this same end. Consult the PHP documentation on runtime configuration for more information.

Using Git For Websites

I've written before about using subversion and Piston for hosting this particular site.

The problem with using subversion for websites is that it is not available when working offline. (Assuming that your repository does not reside on your laptop.) While this is not normally an issue for sites based around a CMS, e.g WordPress or Drupal, there are times that having access to version control would be useful. Examples include working on a new Drupal theme or doing old-fashioned HTML web development.

The solution to this problem is to use a distributed version control system. In a distributed control system, a local repository is maintained on the development machine which is then periodically synchronized with the master repository. One such system is git. Git is used by the Linux kernel developers and many other projects besides. It also has the feature of being able to interact with subversion which is useful for me since I'm not ready to phase out my subversion repository.

At a glance, git appears to be superior for websites. Aside from the advantage of being distributed, it also stores its metadata in a subdirectory of the top-level directory called .git. Compare this against subversion which stores its metadata in a subdirectory called .svn within each directory and subdirectory of the repository. This means that if you have directories that are owned by the webserver (this happens on websites sometimes), you have to get them chown'd before you can add them to a subversion repository. With git, this isn't an issue. (Git will not preserve the file ownership but this isn't usually a significant issue.)

There are a few differences between subversion and git. First, git does not use a central repository like subversion. A common structure in subversion would be to have a central repository and then each project would be a subdirectory. In git, each project has its own repository. While you can check out specific directories from subversion, you can only check out the entire repository for git.

While subversion allows checking in empty directories, git does not. This can be an issue for some applications. The Git FAQ suggests, as a workaround, adding a fiele called .gitignore to each otherwise empty directory. Git will then check in the .gitignore file.

Git does not appear to support anything like svn:externals. However, this does not appear to be a significant issue. Version 2 of Piston supports git (as well as subversion) so it can be used to fulfill the same purpose. (Correction: As Jakub Narębski points out in the comments, git's submodule mechanism is a lot like svn:externals.)

Using git is mostly the same as with subversion. If you are familiar with subversion, you should pick up on the commands relatively quickly. (And if you're completely new to version control, I suggest getting Pragmatic Version Control Usng Git.) The one major catch is: Always remember to push commits to the central repository with svn push origin master. If you do not do this, the local commits will never reach the repository. This is an issue if you expect to redeploy the site from that repository in the future.

Finally, when using Git and Piston, you will need to add these directives to block access to the metadata:

RedirectMatch 404 /\\.git(/|$)
RedirectMatch 404 /\\.piston.yml

If still checking out from a CVS repository, e.g. from Drupal, you will still need to include this as well:

RedirectMatch 404 /CVS(/|$)

Project Euler as a Means of Learning

About four months ago, I wrote about Project Euler. Back then, I posted that I would do the problems in Ruby to try to hone my skills there. Since then, I've mostly done Ruby but I have also done some solutions in Haskell for speed reasons, namely prime number generation. (I should really revisit how I do that in Ruby...)

Project Euler is a good way to introduce some basic concepts. Each problem is best solved using a given set of the language. They are probably better exercises for those who like puzzles than the exercises normally taught in beginner books or in first semester programming courses. However, I see two problems with using Project Euler as a long-term means of learning a programming language.

First, Project Euler's focus is not on learning a given language or teaching about a given set of language features. Project Euler's focus is on the mathematical problems. More time in solutions, especially in later problems, is spent on figuring out the algorithm or method for solving the problem rather than on how to write that algorithm in a given programming langauge. Some language features or methodologies are never addressed because they never come up in the process of solving the problems.

Second, the scope of Project Euler problems is relatively small. The only focus within a given problem is answering that problem. Each solution amounts to a one-time script with components you might reuse later. As a result, Project Euler is insufficient for learning how to develop applications in a given language. (It may, however, have use in learning how to develop a library or module since some algorithms or components are used repeatedly.) There is not sufficient scope to investigate using it to develop an interactive application.

So I think that Project Euler works out well when first starting. However, once familiar with the basic concepts, supplementing or replacing Project Euler with another method of learning, e.g. building an application, is needed to ensure that there is further learning.

This is not to say that Project Euler should be completely abandoned at that point. It just ceases to be useful for learning about applying the programming language by itself. If you happen to enjoy the puzzles (I know I do), feel free to continue to do them.


Subscribe to Ithiriel RSS