You are here

Archive for Ruby

Array#compact vs. checking for nil manually

In the prototype code for a project, I wrote something like:

  1. items << item if item

As is, it's not an ugly line. However, it's inside a loop that creates an object based on each line in a text file. So, to provide context:

  1. items = []
  2.  
  3. File.open( filename, 'r' ) do |file|
  4.   file.each_lines do |line|
  5.     item = build_item( line.strip )
  6.     items << item if item
  7.   end
  8. end

So if the line contains one million lines, it runs that statement one million times. (As to why I'm testing, build_item may return nil if the line matches a given set of criteria.)

Array#compact returns a copy of the array with all nil elements removed from the array. (Array#compact! removes all nil elements from the current array.) Removing the if condition from the loop and adding items.compact! after the File.open block should do the same job as what I originally wrote.

To test, I wrote this program:

  1. require 'benchmark'
  2.  
  3. test_array1 = Array.new
  4.  
  5. 10_000_000.times do
  6.   test_array1 << "Foo"
  7.   test_array1 << nil
  8. end
  9.  
  10. test_array2 = Array.new( test_array1 )
  11. test_array3 = Array.new( test_array1 )
  12.  
  13. new_array = Array.new
  14.  
  15. Benchmark.bm do |x|
  16.   x.report( '#compact' ) { test_array1.compact }
  17.   x.report( '#compact!' ) { test_array2.compact! }
  18.   x.report( '#delete' ) { test_array3.delete( nil ) }
  19.   x.report( '#each' ) do
  20.     test_array1.each do |item|
  21.       new_array << item if item
  22.     end
  23.   end
  24. end

Running this under Ruby 1.9.1 (p243) returned these results:

      user     system      total        real
#compact  0.240000   0.100000   0.340000 (  0.336496)
#compact!  1.100000   0.120000   1.220000 (  1.218401)
#delete  3.140000   0.070000   3.210000 (  3.213278)
#each  3.130000   0.050000   3.180000 (  4.720402)

(For the curious, twenty runs of the same script returned similar numbers.)

The numbers indicate that Array#compact is significantly faster than the original code. Whether or not this optimization is needed is another matter entirely. Even for an array of twenty million elements, the difference in time is about four seconds. I suppose the difference between the two should come down to which is easier to read and understand.

I also decided to test Array#delete since items.delete( nil ) should be equivalent to item.compact!. Array#compact!, somewhat unsurprisingly, is faster although only by about two seconds. I'm not sure why Array#compact! should be almost a second slower than Array#compact.

First Experiences with BDD, Cucumber, And RSpec

I've restarted the project I have been working on. The prototype code worked well enough but I don't think the code is in good shape. I didn't add any testing to the prototype and this started to cause significant problems.

I don't know when or where I first heard about RSpec or behavior-driven development. However, when I found out that a book on RSpec, the aptly titled The RSpec Book, was being published, I pre-ordered it. I have been reading the beta PDF and decided to apply it to the rewrite.

Like any new programming methodology, it takes a little bit of getting used to. It took me a few hours last night to implement a single action on a controller but I'm certain this will improve over time. (I find also that switching between the PDF and TextMate on my laptop costs time too. This is definitely a time for which having a second monitor would be useful.)

I find BDD interesting because it forces a different way of programming than I'm used to. I spend a lot of time working on the model and seem to get around to the controllers and views near the end (if ever). Instead, when writing the feature for cucumber, I have to establish what behavior should occur for the user and then make sure that the controller and view gets written for that behavior. I obviously lack the experience to make an informed decision on whether or not BDD is a better software development methodology than TDD or other testing-enhanced processes. (Any process that involves testing is, in my opinion, significantly better than any that does not.)

One omission I found in the book (or maybe I haven't looked in the right place) is that there seems to be no discussion about spec'ing routes. RSpec does have methods for it. In discussion on the rspec-users mailing list, David Chelimsky provided a link to some examples so some documentation does exist. (I've left a note in the book errata so we'll see what happens.)

An issue I found working with RSpec is that ruby script/spec spec/ works but rake rspec does not. This seems to be because the rake task is not Rails-aware while script/spec is. I haven't done much more research into it though. Since the first command works, I'll continue to use it.

Miscellanea from Rails-land

I have been experimenting with a Ruby on Rails project recently. Part of the project requires importing large (almost 150 MB, almost one million lines) text files of a given format into a normalized database. I have been working with this part of the project first because I want to find out how much space each log file will take up in the database.

Here's some of my notes and observations:

  • The original draft of the code parsed the line and then called save on the object for that line. Unfortunately, this script took six hours to run using script/runner. Unfortunately, I really need this process to take around one minute rather than six hours.

    ActiveRecord::Extensions (found via this post on the Accelerate HR blog) provides a method to import a large number of records at once. When configuring it to write 1,000 records at a time to the database, the SQLite version took about four and a half hours. Using "chunk" sizes of 10,000 took over nine hours before I stopped it manually because that caused the script to start swapping to disk. (Servers with only 512 MB of memory are no longer as useful as they used to be.)

    Switching to MySQL and using greater normalization results in faster run time with a chunk size of 1,000. However, even then, the import script takes about three hours to run.

  • Mixins can be used to share behavior between related models without repeating yourself. Jamis Buck and DHH call these "concerns". The use of mixins in this way does clean up the models considerably since there's a lot of similar code. However, I do have models that look like:
    class Model
      include Mixin1
      include Mixin2
    end
  • At least on my test server (running CentOS 5.3), the version of Ruby packaged with the OS is broken when using large amounts of memory. Something causes a bit to be unset which yields strange errors. Sometimes it's just a segmentation fault. Sometimes ActiveRecord complains that the object has no "pime" field (when it should have been "time") or other such aberrations. And then there was when the SQLite driver complained that "INSART" was not a valid SQL command.

    These issues do not manifest under the version of Ruby Enterprise Edition installed so this suggests that the RPM ruby is broken.

  • I installed Ruby Enterprise Edition because of the suggested gains in performance and I plan to eventually run the completed application through passenger. However, I wonder how the performance of REE compares to Ruby 1.9.1 for this application.

Frustration Of My Own Making

I've spent the last two days trying to recompile gcc on my G4 iBook. It hasn't been pleasant and it hasn't worked. And that might be because I didn't do enough research.

In the process of doing Project Euler problems, I reached one where I needed to solve a system of linear equations. For those of you who don't remember what this looks like, it's something like:
$4x+3y = 10$
$7x-2y = 3$

While this is a reasonably simple example and can be solved by hand, they can get significantly more difficult. The standard way for solving problems in this fashion involves using matrices. (If you took a linear algebra class, this should look familiar.)

Coding a naive way to do this is not difficult. Cramer's Rule provides a way to find each value by using the determinant of matrices. (Apparently LU decomposition is faster but it looks harder to implement. I'll go with naive first and then refactor later if I need to.) The determinant of arbitrary-size square matrices can be expressed through the Laplace expansion until eventually 2x2 (or, I suppose, 3x3) matrices are found that can have their determinant determined mathematically.

In consulting Skiena's The Algorithm Design Manual, I found references to LAPACK, a linear algebra package for FORTRAN. This is not useful to me since I'm using Ruby. But if such a package exists for FORTRAN, surely one exists for Ruby.

Linalg is a linear algebra package for Ruby. Or, more precisely, it's a Ruby wrapper around LAPACK. As a result, it relies on embedding a FORTRAN compiler into the package. And this is where the pain starts.

OS X comes with LAPACK already. However, it does not come with FORTRAN support. In order to get FORTRAN support, you need to install software or recompile gcc. Failing to do a proper search and not finding this resource, I decided to compile a new gcc.

gcc 4.3.3 relies on gmp and mpfr. gmp 4.2.1, which, if you don't have it installed, gcc tells you to get from ftp://gcc.gnu.org/pub/gcc/infrastructure/, does not work on OS X. It builds and installs fine and then yields an obscure linker bug when you try to build mpfr or any other software that tries to include gmp. After an hour of fighting with this, I found out that gmp 4.2.1 was not the newest version but that 4.2.4 was. I had no issues with gmp 4.2.4.

Then there has been little love compiling gcc. At some point, gcc creates a spec file which is then used to pass the option -macosx-version-min which causes the compiler to fail. Editing the spec file and removing that lets it continue on before the next pass with a recreated spec file. Fixing that seems to correct the issue again. And then, eventually, there's a point in the build where it just fails. However, this may have been left over from something else so I tore everything down and restarted the process.

It looks like the issue with the spec file could also be of my own making. If I understand this post correctly, the issue might be that I don't have a new enough version of the developer tools installed. This may be possible since I think the version I have is the one that came with OS X 10.4 (and is, therefore, at least three years old).

With the information I've found this morning, I wonder if I could restart the process and have it run smoother. (Although the installable gfortran may be the smoothest option.) I just wish I knew days ago what I know now.

Project Euler

I discovered Project Euler yesterday via a post on the Ruby-Talk mailing list. If you're the sort of person who enjoys puzzles and doesn't mind math, I suggest that you try out the problems for yourself.

I've been working through the problems slowly and in numerical order. While most of my solutions have been "brute force," it has given me an opportunity to learn how to do things in Ruby.

Why slowly? Because I haven't had a lot of time.

Why in numerical order? Later problems build on earlier ones. As I reach a problem that requires a given function or routine, I can go "Hey, I just did that in problem #x" and reuse it.

Why Ruby? Ruby was my "language to learn" last year but I neglected it so I'm hoping I can get more done before I work on this year's language (which will probably be Erlang or Scheme).

A lot of what I've learned to do is quick one-liners with arrays, much like what Drew Olson posted almost two years ago. (I love reinventing the wheel, don't you?) And then there have been neat numerical tricks.

For example, to find out if a number is prime with a somewhat naive algorithm:

  1. class Fixnum
  2.   def prime?
  3.     return false if self < 2
  4.     return false if self == 2
  5.     return false if self % 2 == 0
  6.  
  7.     i = 3
  8.     bound = Math.sqrt( self )
  9.     while i <= bound and i < self
  10.       return false if self % i == 0
  11.       i += 2
  12.     end
  13.  
  14.     true
  15.   end
  16. end

Monkeypatching this into Fixnum is probably not the greatest idea to some people but it allows use of syntax like 31.prime? (useful for testing) and ( n + 1 ).prime?, which is theoretically useful. For example, to find if the Mersenne number for a given power of two is prime, you could write: ( 2 ** n - 1 ).prime?

And, most importantly, I find n.prime? to read better than prime?( n ). Readability always trumps cleverness. Luckily, here, it's both readable and clever.

If you want to learn how to use a language better, I suggest you use it for the Project Euler problems. After a while, you'll find ways to do things you've probably not thought of before (or, possibly, ever needed to do before). And for Ruby fans, you can find a use for Array#each_cons.

XML Parser: Rendering the Item in Ruby

The first language to be used for XML parsing is Ruby. Ruby appears to have three main XML libraries: REXML, Hpricot, and libxml-ruby. As time allows, each will be tested.

Before starting the testing, several things need to be done. First, there needs to be a representation of the Item class in Ruby. From the pseudocode initially shown in the problem statement, I derive the Ruby representation of the Item class and its helper classes.

Since no limitations have been set for accessing or modifying the object's data, all fields are implemented with attr_accessor. Unfortunately, since attr_accessor does not have the ability to set default values, the variable initialization needs to be done in the constructor.

One notable difference between the pseudocode and the Ruby version is the casing. While the pseudocode uses "camel case" variable names as standard for Java, the Ruby version uses "underscore" variable names as per the conventions of the community. This creates small differences like critRating vs. crit_rating. There is an ongoing debate as to which is easier to read.

This represents the bare minimum thus far needed to implement the Item object. When discussing testing, other methods will be added to the Item class and helper classes.