You are here

July 2009

Miscellanea from Rails-land

I have been experimenting with a Ruby on Rails project recently. Part of the project requires importing large (almost 150 MB, almost one million lines) text files of a given format into a normalized database. I have been working with this part of the project first because I want to find out how much space each log file will take up in the database.

Here's some of my notes and observations:

  • The original draft of the code parsed the line and then called save on the object for that line. Unfortunately, this script took six hours to run using script/runner. Unfortunately, I really need this process to take around one minute rather than six hours.

    ActiveRecord::Extensions (found via this post on the Accelerate HR blog) provides a method to import a large number of records at once. When configuring it to write 1,000 records at a time to the database, the SQLite version took about four and a half hours. Using "chunk" sizes of 10,000 took over nine hours before I stopped it manually because that caused the script to start swapping to disk. (Servers with only 512 MB of memory are no longer as useful as they used to be.)

    Switching to MySQL and using greater normalization results in faster run time with a chunk size of 1,000. However, even then, the import script takes about three hours to run.

  • Mixins can be used to share behavior between related models without repeating yourself. Jamis Buck and DHH call these "concerns". The use of mixins in this way does clean up the models considerably since there's a lot of similar code. However, I do have models that look like:
    class Model
      include Mixin1
      include Mixin2
    end
  • At least on my test server (running CentOS 5.3), the version of Ruby packaged with the OS is broken when using large amounts of memory. Something causes a bit to be unset which yields strange errors. Sometimes it's just a segmentation fault. Sometimes ActiveRecord complains that the object has no "pime" field (when it should have been "time") or other such aberrations. And then there was when the SQLite driver complained that "INSART" was not a valid SQL command.

    These issues do not manifest under the version of Ruby Enterprise Edition installed so this suggests that the RPM ruby is broken.

  • I installed Ruby Enterprise Edition because of the suggested gains in performance and I plan to eventually run the completed application through passenger. However, I wonder how the performance of REE compares to Ruby 1.9.1 for this application.

On Choosing a License

I have a web application I want to build. Other sites have an application like this built already but they don't quite do everything I want to do. I have started doing some exploratory work on it. (And I've found it's not quite as easy as it looks. This post by Benjamin Pollack comes to mind.)

I want to release the end result as open source so other people can use it. This way, those who wish to use it for themselves have the option to do so. However, before doing so, I have to pick a license.

Why Not the (A/L)GPL By Default?

The GPL is perhaps the most ubiquitous of the open source licenses. Various projects, from Linux to MySQL to Drupal, have benefited from the license and from the openness it fosters (and enforces). It also has its detractors who complain about the openness required, usually from business entities but sometimes from individual developers.

Zed Shaw sparked a new set of exchanges in the BSD-vs.-GPL holy war two weeks ago with his post "Why I (A/L)GPL". This caused a lot of discussion. For example, Kumar McMillan responded with a post detailing reasons not to license code with the GPL.

Separately, Jacob Kaplan-Moss posted a set of twenty questions for the GPL. James Bennett has other questions and concerns, in his post "When Licenses Attack". These posts point out a distinct lack of clarity with regards to what the GPL allows or disallows with dynamic languages. For example, the GPL explicitly mentions linking but does a Python include or a Ruby require constitute linking?

The GPL has a known loophole for web applications or other network services. The loophole is not a horrible idea, as mentioned by Dries Buytaert and Ted Haeger.

The presence of the loophole, however, concerns some people. The solution is the AGPL which forces service providers et al to provide a means to get the source code for a hosted network service, web application, etc. Some people like this. For example, Alberto García Hierro, formerly of byNotes, chose AGPL for his code.

However, even the AGPL has issues. The AGPL is technically incompatible with the GPL. There was some objection to some of the wording within the Debian community. On the forums for the Frog CMS, a web developer stated an issue with Frog's use of the AGPL and how he felt it would impact his client sites. Ted Haeger asks if the AGPL is too radioactive. And even Alberto García Hierro mentions issues with the AGPL. Both of them wonder if there's need for a LGPL-like version of the AGPL.

What About the Apache License?

Kumar McMillan suggests the Apache License as an alternative to the GPL. It is, indeed, an attractive alternative. Based on the license itself and part of chapter 10 of Van Lindberg's Intellectual Property and Open Source, it looks like a license well suited to a lot of projects who want to avoid the GPL. The clauses about patents and trademarks may not be useful for a small-time developer but the clause about contributions could certainly help to avoid headaches.

While there has been some discussion on how the BSD and MIT licenses interact with the GPL, I have found little documentation on how Apache-licensed code could be integrated into MIT- or BSD-licensed projects. It looks like preserving the license and attributions are necessary. I could see this being messy. The only concrete information I've found is that OpenBSD specifically forbids inclusion of source code licensed under version 2 of the Apache License.

So Choosing a License

For my particular application, the factors that impact a license decision are:

  • Platform: The language or framework the application is built on plays a part in determining the license. Any development using either that would be distributed must be done with a license that is compatible with that language or framework.

    An issue arises when dealing with frameworks that include parts of themselves in the final application. For example, some of the generators in Ruby on Rails could be claimed to work this way. This means that the output is considered a derivative work since it includes part of the original. (This is part of why GNU bison has a license exemption in its output files.) This further requires the use of a compatible license. (I am also not sure if it is possible to have two sections of a file under different licenses.)

    My current reasoning: The current draft of code is built on Rails which is licensed under the MIT/X11 license. Since the MIT license is one of the most permissive licenses, this does not restrict the license choice.

    The copyright status of output files from the Rails generators concerns me. Including parts of Rails within the application obviously makes it a derivative work of Rails. However, I do not know at what point, if any, the copyright for those sections of Rails would transfer to me or if those sections would always be copyrighted by the Rails development team and therefore would always fall under the MIT license and, therefore, always need the MIT license included.

    This concern alone makes the MIT license a strong candidate.

    Were I using Django, the BSD license would likely be a strong candidate for the same reasons.

  • Reusing code: As mentioned, this is not the first time someone has tried to do what I'm doing. There exist open source projects that at least somewhat overlap with what I'm doing. For the purpose of this discussion, we'll say that one is released under the three paragraph BSD license and one is released under the GPL.

    The BSD license has few restrictions on what can be done with the source code. As long as attribution is given and the terms of the license are mentioned, source code can be copied outright. Any derivative works, e.g. translations, modifications, etc., can be used or even relicensed as long as the original attribution and licensing is given.

    The GPL has significant restrictions on what can be done with the source code. While I can do almost anything I want with source code licensed under the BSD license (aside from strip attributions and the original license), I can only include GPL source code in other GPL'd works. Derivative works also have to licensed under the GPL.

    So if I use a BSD or MIT license, I can only use the BSD-licensed project for a reference. This is also true for the AGPL since, as mentioned earlier, it is technically incompatible with the GPL. I cannot use the GPL'd project as a reference. Only if I use the GPL can I use that project as a license.

    (This is technically not completely true. The GPL only applies to copyrighted material. According to section 102b of the US Copyright Law, copyright protection does not apply to ideas, procedures, or processes. It would therefore be theoretically possible to use the GPL'd project as a reference to find out how it does something. However, due to the high risk of cross-contamination, i.e. the likelihood that the reimplementation of a process or procedure would resemble a derivative work rather than a separate one, it is probably safer to not look at all.)

    Down the line, it is also possible that someone might want to use my code. If I release the code under the GPL, they cannot use it unless they themselves are using the GPL. The same is true of the AGPL. If I use a permissive license, there are no limitations.

    My current reasoning: Losing access to the GPL'd project is not a significant concern. It would probably speed development, at least some, but I would probably learn better if I implemented it myself from the beginning.

    I don't expect to do anything significant in coding this application. Anything I come up with could be easily developed by someone else given enough time. Requiring the use of a reciprocal license then just gets in the way.

  • Business model: This is often a sticky point for choosing licenses. A lot of the time, it comes down to two questions: "Do I want to have the option to make money off of this?" and "Do I care if other people make money off of this?"

    If there is a strong desire to prevent other people from making money off of the project, the GPL is a strong candidate. Since the source for the software must always be distributed with the binaries, it is unlikely that someone else could build a business model around direct sales of the software. There is no way to prevent another person from building a business model around offering support or other services based on the software. For example, since Drupal is released under the GPL, it is exceedingly difficult to build a business around selling the software. However, Acquia has a business model built around providing services for Drupal.

    Releasing software under the GPL does not prevent the copyright holder from making money off of it. While the value of paying for the software is lessened since a free version is available, there is nothing that prevents the copyright holder from providing the software under a commercial license. (As far as I know, no copyright license can prevent the copyright holder from relicensing the software.) MySQL AB saw some success with releasing a commercial version of MySQL.

    The BSD and MIT licenses place few restrictions on what someone else can do with the software. While they do not prevent the copyright holder from making money on direct sales of the software, there is nothing to prevent another person from doing the same.

    I believe that it is unrealistic to make money with a direct sales of a web application, especially one built on an open source framework. (There is a market for it obviously, given the existence of the ionCube PHP Encoder and Zend Guard.) Most of the money to be made with a web application is going to be found with services built around a specific application, e.g. hosting or local customization.

    The only way to escape the local customization and hosting loophole if you want to avoid others making money from the application would be to use the AGPL. This forces anyone who modifies the source code and deploys the reuslting work to provide a download link (or other means of distribution). Using any other license does not allow the developer to get access to any downstream modifications unless volunteered by the people who make them.

    My current reasoning:Since this is a web application and built on Ruby on Rails, I don't think there's any concern about the source code being used for monetary gain. (Were I using Java for this, when the application could be distributed solely in binary format, I probably prefer a license that enforced source code distribution.)

    I have no concerns about someone building a service around hosting the application. (I would consider doing this myself but I simply do not have the time.) My main concern is that someone would build a business model around a modified copy of the application and not send those changes upstream. This could only be mitigated through using the AGPL and the person doing this being honest enough to follow the terms of the license. However, given the issues surrounding the AGPL, I doubt that this would be a positive tradeoff.

So, given the above, the MIT license sounds like a strong candidate. I am still thinking it over but this is how I'm currently leaning.