the informal ramblings of a formal language researcher

Thursday, August 27, 2009

from 90sec to 9sec.

I have been reading Knuth's pre-fascicles from Chapter 7 of TAOCP.

Last night when I transcribed one of the algorithms (7.2.1.1-W) to Scheme code, I got caught up in Knuth's claim that the iterative process described should take "at most a few seconds", yet my first version was taking 90 seconds to run in Larceny. So I started the human iterative process of revising the code, inspecting the compiler output, and doing more runs to figure out where the time was going.

At the end of it, I got the running time down to 9 seconds (from 90; that's a 10x speed-up!) You can shave off a few more seconds by switching to unsafe mode via compiler switches, but it seems like the really important thing was understanding which operations were going to turn into native machine instructions, which ones would turn into millicode invocations in the C runtime, and which ones would turn into Scheme procedure invocations.

(I think 9 qualifies as "a few seconds", especially for a dynamic language like Scheme.)

So here, for your viewing pleasure, is both the first and last versions of the code, along with the log in the comments of the changes I made and what effect they seemed to have.


;; ----------------------------------------------------------------------
;; What is largest set of five letter words connected via single letter
;; swaps? To find out, we use specialized representation of sgb-words,
;; where each word is mapped to an index in a bit-table with 2^25 entries
;; (I would represent each bit by a byte for ease of coding, but Larceny
;; cannot construct objects >= 16 MB, so my hand is forced.)

(define bytevector-bit-ref
(lambda (bv k)
(let ((q (quotient k 8))
(r (remainder k 8)))
(fxarithmetic-shift-right
(fxand (bytevector-u8-ref bv q)
(fxarithmetic-shift-left 1 r)) r))))

(define bytevector-bit-ref?
(lambda (bv k)
(not (zero? (bytevector-bit-ref bv k)))))

(define bytevector-bit-set!
(lambda (bv k bit)
(let ((q (quotient k 8))
(r (remainder k 8)))
(bytevector-u8-set!
bv q
(fxior (fxand (bytevector-u8-ref bv q)
(fxxor #b11111111 (fxarithmetic-shift-left 1 r)))
(fxarithmetic-shift-left bit r))))))

(define (word->index w)
(let ((char->num (lambda (c) (- (char->integer c) (char->integer #\a))))
(offsets (map (lambda (e) (expt 2 e)) '(20 15 10 5 0))))
(apply + (map * (map char->num (string->list w))
offsets))))

(define (index->word i)
(let ((num->char (lambda (n) (integer->char (+ n (char->integer #\a)))))
(offsets (map (lambda (e) (expt 2 e)) '(20 15 10 5 0))))
(list->string (map num->char (map (lambda (o) (remainder (quotient i o) (expt 2 5)))
offsets)))))

(define (read-sgb-words file)
(call-with-input-file file
(lambda (p)
(let loop ((words '()))
(let ((x (read-line p)))
(cond
((eof-object? x)
(reverse words))
(else
(loop (cons x words)))))))))

(define (make-sgb-words-table words)
(let ((table (make-bytevector (quotient (+ (expt 2 25) 7) 8))))
(for-each (lambda (x) (bytevector-bit-set! table (word->index x) 1))
words)
table))

(define words (read-sgb-words "sgb-words.txt"))
(define t (make-sgb-words-table words))

;; This procedure takes 96sec on Chimaera.
;; Knuth says this process should finish in "at most a few seconds"; see below.
(define (find-max-intermix t words)
(let ((A-best '())
(l-best 0)
(wiz (map word->index words))
(two^25 (expt 2 25))
(two^20 (expt 2 20))
(two^15 (expt 2 15))
(two^10 (expt 2 10))
(two^5 (expt 2 5)))
(let loop1 ((w1 wiz))
(cond
((not (null? w1))
(let loop2 ((w2 (cdr w1)))
(cond
((null? w2)
(loop1 (cdr w1)))
(else
(let* ((w (car w1))
(w* (car w2))
(z (fxxor w w*))
(m (+ two^20 two^15 two^10 two^5 1))
(m* (* two^5 m)))
(cond
((not (zero? (fxand (fxxor (fx- z m) z m) m*)))
(loop2 (cdr w2)))
(else
(let ((m (vector (fxand z (fx- two^5 1)) ; m_0
(fxand z (fx- two^10 two^5)) ; m_1
(fxand z (fx- two^15 two^10)) ; m_2
(fxand z (fx- two^20 two^15)) ; m_3
(fxand z (fx- two^25 two^20)) ; m_4
)))
(let ((l 1) (A (list w)))
(let* ((sw (lambda (j)
(set! w (fxxor w (vector-ref m j)))
(cond ((bytevector-bit-ref? t w)
(set! l (+ l 1))
(set! A (cons w A))))))
(swap-0 (lambda () (sw 0)))
(swap-1 (lambda () (swap-0) (sw 1) (swap-0)))
(swap-2 (lambda () (swap-1) (sw 2) (swap-1)))
(swap-3 (lambda () (swap-2) (sw 3) (swap-2)))
(swap-4 (lambda () (swap-3) (sw 4) (swap-3))))
(swap-4)
(cond ((> l l-best)
(display A) (newline)
(set! l-best l)
(set! A-best A))))))
(loop2 (cdr w2)))))))))))))


;; Knuth says this process should finish in "at most a few seconds" but its
;; taking Chimaera 96.4sec; lets see if allowing more inlining helps matters...
;; - Using internal define for bytevector-bit-ref and bytevector-bit-ref?
;; shaved off 6 seconds to 90.1sec
;; - Using macros to define { sw, swap-0, swap-1, swap-2, swap-3, swap-4 }
;; brings time down to 77.7sec
;; - Using macros to define bytevector-bit-ref and bytevector-bit-ref?
;; did not improve the time at all... (but I did not do it properly the first time!)
;; - Lifting the definitions of m and m* did not improve the time.
;; - Replacing the encoding of m_i via a 5-elem vector with explicit
;; variables { m_0 .. m_4 } may have shaved a fraction of a second off
;; the time...
;; - Fixing the macro for bytevector-bit-ref and bytevector-bit-ref? did not help.
;; - Replacing fxarithmetic-shift-left and fxarithmetic-shift-right with fxlsh and fxrsh
;; was huge; brought time down to 40.5sec
;; - Changing encoding of 2^k to thunks rather than let-bound constants did not help anything.
;; - Changing encoding of 2^k to macros rather than thunks did not help anything [[ but I may have been timing the wrong code here ]]
;; - Changing encoding of m and m' to macros rather than let-bound constants did not help anything.
;;
;; [[ note on last three items: it looks strongly like Twobit is
;; reintroducing the let-bindings for temporaries rather than
;; inline their values in the code, which is interesting.]]
;; [[ (are fxlsh and friends not interpreted at compile time?) ]]

;; - Manually constant folding fxlsh and friends got down to 39.8sec (maybe not statistically sig delta).
;; - Adding check for bytevector? of t at outset got down to 39.1sec (again maybe not stat sig on own).
;; - Switching to multi-valued fxdiv-and-mod was a terrible idea (x2.5 slowdown)
;; - And even using fxdiv fxmod separately was a terrible idea (x2 slowdown)

;; [[ We apparently do not have any compiler optimizations for fxquotient and similar; hmm. ]]

;; - Changing the code to use fxrhsl and fxand instead of quotient and remainder
;; brings running time down to 9sec (from between 37 and 39 seconds; another huge leap!).

(define (find-max-intermix.v16 t words)

(let*-syntax ((bytevector-bit-ref
(syntax-rules ()
((_ bv k)
(let ((q (fxrshl k 3))
(r (fxand k #b111)))
(fxrsha
(fxand (bytevector-u8-ref bv q)
(fxlsh 1 r)) r)))))

(bytevector-bit-ref?
(syntax-rules ()
((_ bv k)
(not (zero? (bytevector-bit-ref bv k))))))

(two^25 (syntax-rules () ((_) 33554432)))
(two^20 (syntax-rules () ((_) 1048576)))
(two^15 (syntax-rules () ((_) 32768)))
(two^10 (syntax-rules () ((_) 1024)))
(two^5 (syntax-rules () ((_) 32)))
(m (syntax-rules () ((_) 1082401 ))) ;; (+ (two^20) (two^15) (two^10) (two^5) 1)
(m* (syntax-rules () ((_) 34636832 ))) ;; (* (two^5) (m)))))
)

(let* ((A-best '())
(l-best 0)
(wiz (map word->index words)))
(cond
((not (bytevector? t)) (error))
(else
(let loop1 ((w1 wiz))
(cond
((not (null? w1))
(let loop2 ((w2 (cdr w1)))
(cond
((null? w2)
(loop1 (cdr w1)))
(else
(let* ((w (car w1))
(w* (car w2))
(z (fxxor w w*)))
(cond
((not (fxzero? (fxand (fxxor (fx- z (m)) z (m)) (m*))))
(loop2 (cdr w2)))
(else
(let ((m_0 (fxand z (fx- (two^5) 1)))
(m_1 (fxand z (fx- (two^10) (two^5))))
(m_2 (fxand z (fx- (two^15) (two^10))))
(m_3 (fxand z (fx- (two^20) (two^15))))
(m_4 (fxand z (fx- (two^25) (two^20)))))
(let ((l 1) (A (list w)))
(let*-syntax ((upd (syntax-rules ()
((upd) (cond ((bytevector-bit-ref? t w)
(set! l (+ l 1))
(set! A (cons w A)))))))
(sw (syntax-rules ()
((sw 0) (begin (set! w (fxxor w m_0)) (upd)))
((sw 1) (begin (set! w (fxxor w m_1)) (upd)))
((sw 2) (begin (set! w (fxxor w m_2)) (upd)))
((sw 3) (begin (set! w (fxxor w m_3)) (upd)))
((sw 4) (begin (set! w (fxxor w m_4)) (upd)))))
(swap-0 (syntax-rules ()
((_) (sw 0))))
(swap-1 (syntax-rules ()
((_) (begin (swap-0) (sw 1) (swap-0)))))
(swap-2 (syntax-rules ()
((_) (begin (swap-1) (sw 2) (swap-1)))))
(swap-3 (syntax-rules ()
((_) (begin (swap-2) (sw 3) (swap-2)))))
(swap-4 (syntax-rules ()
((_) (begin (swap-3) (sw 4) (swap-3))))))
(begin
(swap-4)
(cond ((> l l-best)
(display A) (newline)
(set! l-best l)
(set! A-best A)))))))
(loop2 (cdr w2))))))))))))))))


UPDATE: I realized it might be worthwhile to identify how much speed-up would have resulted if I had just applied the most important steps to the original version at the outset.

The answer is that just:

  1. Turning the bit-referencing operations into internal defines within the procedure, and

  2. Using the fxlsh, fxrshl, and fxand operations for the definitions of the bit-referencing operations


brings the running time down to 18 seconds. So somewhere around 70 seconds of the time was spent on inefficiencies due to using general purpose arithmetic operations when fixnum operations (that the compiler specializes) would probably suffice for most people; no need to use macros to force an inlined swap(4) operation (which is what Knuth recommends in his text).

Here is the 18 second version. Note that its code bytevector after being compiled by Twobit is 4.0 kilobytes, while the more heavily optimized find-max-intermix.v16 is 17.8 kilobytes; that is a 4x space regression for a 2x speed improvement.

(define (find-max-intermix t words)

(define bytevector-bit-ref
(lambda (bv k)
(let ((q (fxrshl k 3))
(r (fxand k #b111)))
(fxrshl
(fxand (bytevector-u8-ref bv q)
(fxlsh 1 r)) r))))

(define bytevector-bit-ref?
(lambda (bv k)
(not (zero? (bytevector-bit-ref bv k)))))

(let ((A-best '())
(l-best 0)
(wiz (map word->index words))
(two^25 (expt 2 25))
(two^20 (expt 2 20))
(two^15 (expt 2 15))
(two^10 (expt 2 10))
(two^5 (expt 2 5)))
(let loop1 ((w1 wiz))
(cond
((not (null? w1))
(let loop2 ((w2 (cdr w1)))
(cond
((null? w2)
(loop1 (cdr w1)))
(else
(let* ((w (car w1))
(w* (car w2))
(z (fxxor w w*))
(m (+ two^20 two^15 two^10 two^5 1))
(m* (* two^5 m)))
(cond
((not (zero? (fxand (fxxor (fx- z m) z m) m*)))
(loop2 (cdr w2)))
(else
(let ((m (vector (fxand z (fx- two^5 1)) ; m_0
(fxand z (fx- two^10 two^5)) ; m_1
(fxand z (fx- two^15 two^10)) ; m_2
(fxand z (fx- two^20 two^15)) ; m_3
(fxand z (fx- two^25 two^20)) ; m_4
)))
(let ((l 1) (A (list w)))
(let* ((sw (lambda (j)
(set! w (fxxor w (vector-ref m j)))
(cond ((bytevector-bit-ref? t w)
(set! l (+ l 1))
(set! A (cons w A))))))
(swap-0 (lambda () (sw 0)))
(swap-1 (lambda () (swap-0) (sw 1) (swap-0)))
(swap-2 (lambda () (swap-1) (sw 2) (swap-1)))
(swap-3 (lambda () (swap-2) (sw 3) (swap-2)))
(swap-4 (lambda () (swap-3) (sw 4) (swap-3))))
(swap-4)
(cond ((> l l-best)
(display A) (newline)
(set! l-best l)
(set! A-best A))))))
(loop2 (cdr w2)))))))))))))

mismatch twixt formals and invocation forms

As a follow up to my yet-another-generator macro post:

yesterday at lunch I was trying to describe my problems generalizing my generator macro to handle arbitrary lambda formals lists. Ryan was the one who summed the problem up best; I think he said "there is a asymmetry between the format of the formals list and the format of invocation"

The problem in essence is that there are two distinct cases that are easy.
If you want to match a proper list of formals and generate a corresponding invocation, you do something like:

(define-syntax adder
(syntax-rules ()
((_ (args ...) k ...)
(lambda (args ...) (+ k ... args ...)))))


And, if you want to do variable arity, you can do it by matching an identifier representing the whole list of arguments

(define-syntax adder
(syntax-rules ()
((_ argl k ...)
(lambda argl (apply + k ... argl)))))


But notice the difference between the bottom and the top. If some users of the adder macro want to be able to specify that they are generating a procedure of arity 2, with appropriate error detection, they can use the first but not the second. And of course, the first does not do variable arity procedures.

To deal with this problem in the context of my generator macro, I introduced a helper macro that builds up an application expression based on the form of the formals list.


(define-syntax build-apply
(syntax-rules ()
((build-apply (rator rands ...) (args ...)) ;; usual case
(rator rands ... args ...))
((build-apply (rator rands ...) (x . y)) ;; improper list, inductive case
(build-apply (rator rands ... x) y))
((build-apply (rator rands ...) rest-args) ;; improper list, base case
(apply rator rands ... rest-args))))


The order of all three clauses above is very significant.

With that in place, here is the new form of my generator macro(s):

;; (generator ARGS BODY ...) => procedure
;; in whose body yield is bound to coroutine with caller
(define-syntax generator
(transformer
(lambda (exp ren cmp)
`(,(ren 'generator-via) yield ,(cadr exp) ,@(cddr exp)))))

(define-syntax generator-via
(syntax-rules ()
((generator-via yield-id args body ...)
(let ((get-next #f))
(lambda args
(call-with-current-continuation
(lambda (exit)
(cond
(get-next (build-apply (get-next exit) args))
(else (let ((yield-id
(lambda results
(call-with-current-continuation
(lambda (next)
(set! get-next (lambda (new-exit . args)
(set! exit new-exit)
(build-apply (next) args)))
(apply exit results))))))
(call-with-values (lambda () body ...)
;; below could also use build-apply with
;; args, but then the "done-value"(s)
;; returned from a generator would be forced
;; to have value-count = length args
(lambda argl
(set! get-next (lambda (new-exit . args)
(apply new-exit argl)))
(apply exit argl)))))))))))))

Tuesday, August 25, 2009

yet another yield Scheme macro

I have seen yield done plenty of times before, but I managed to reinvent it again tonight.

A lot of the time when I've played with call/cc, I have approached it trying to recreate a form in a statement-oriented language, and so I would end up not caring about the argument I would pass to the invocation of the captured continuation (because all I cared about was reestablishing the control flow of the prior context; the prior context itself would just discard any value I passed to it). I think the usual interpretation of yield is an instance of this in statement oriented languages.

However, when I was playing with some code tonight, I saw that dummy value and said, "ugly! why is that there?" And I decided to see what would happen if I got rid of it.

This macro is the result. (Obviously you want a bit of sugar around it to non-hygenically bind yield, although there are cases where it is nice to bind a different name like visit instead of yield, depending on the domain.)


(define-syntax generator-via
(syntax-rules ()
;; puzzle 4 U: what role(s) does arg-list serve? (all occurrences matter)
((generator-via yield-id arg-list body ...)
(let ((yield-id #f) (get-next #f))
(lambda arg-list
(call-with-current-continuation
(lambda (exit)
(cond (get-next (get-next exit . arg-list))
(else (set! yield-id (lambda results
(call-with-current-continuation
(lambda (next)
(set! get-next
(lambda (new-exit . arg-list)
(set! exit new-exit)
(next . arg-list)))
(apply exit results)))))
(call-with-values (lambda () body ...)
;; puzzle 4 U: why below eta-expansion required?
(lambda args (apply exit args))))))))))))


Its got some fun behaviors. Consider:


> (define grows
(generator-via yield (x)
(let loop ((i 0))
(loop (+ i x (yield i))))))

> (grows 1)
0

> (grows 0)
1

> (grows 0)
2

> (grows 0)
3

> (grows 1)
5

> (grows 0)
6



(yes I just realized I could/should have let-bound the yield-id itself. The other set! invocations are believed to be necessary, barring tricks mixing letrec and call/cc.)

Wednesday, July 29, 2009

(no) interpreters on iphone

I'm collecting blog posts and web pages discussing clause 3.3.2 of the iPhone SDK EULA and semi-related restrictions on iPhone apps.

mobilephonedevelopment article

rhomobile's interpretation of no interpreters rule

story of how a Basic interpreter got downgraded to a calculator (and how to bring the interpreter back if you're willing to jailbreak).

PhoneGap Discussion of arbitrariness of rejection

MacNN coverage of the PhoneGap issue.

Commodore 64 emulator rejection article

FrotZ got through the filter; I asked the developer about that in this discussion

Tuesday, July 21, 2009

git-svn argh!

CCIS decommissioned a number of its Solaris machines recently.

I was using one of them as the point of access for the Larceny SVN repository, via the svn+ssh protocol.

More recently I have been playing with git-svn, using that as my way to interact with the repository. I made a local git-svn clone of the Larceny SVN repository, and then I would clone the clone and make branches in the cloned clones as part of my experimentation with new ideas. Fun fun fun.

Except that CCIS decommissioned a number of its Solaris machines recently, and I was using one of them as the point of access, which mean that the repository url looked something like svn+ssh://zulu.ccs.neu.edu/path-to-repos/, and git-svn (somewhat reasonably) assumes that these URLS would remain valid forever, and puts a copy of that URL into every log message of the clone.

Now, its easy to update those log messages; that's what git-filter-branch is for. (See also this related blog entry.)

But what's not so easy is dealing with the sudden split in the tree structure of the development on my cloned clones; all of my local deltas seem to point to the old changesets with the old bad URL. And "git rebase" does not seem to be magically able to handle this case.

Thus: git-svn argh!

If the SVN folks though that a "svn switch --relocate" command was important, maybe the git-svn developer(s) should also consider their design decisions...

Monday, January 12, 2009

Zecreasing Zerived Zvariables

This is a great little story. And probably something I could explain to my Mom.

http://www.freedom-to-tinker.com/blog/felten/debugging-zune-blackout

(plus I've had this obsession with date-oriented code lately...)

(p.s. can one even pronounce "Zvariables"?)

Friday, January 09, 2009

5 or 6 forced restarts later...

I wasted some time this evening trying to use gprof to profile Larceny.

Eventually I discovered that the -pg option simply does not work with gcc on Intel Macs. (Why on earth doesn't the man page for gprof say something about this?)

Apple instead recommends that one use Shark or Saturn, both provided when you install the CHUD Tools (Computer Hardware Understanding Developer Tools)

So I did that. I downloaded a copy of the CHUD Tools (and as far as I can remember, I did it by searching on ADC for them, and got a copy of chud.dmg). I was a little surprised by the file times (they said something like 2007) but I pressed on, eager to start seeing some profiler output, especially since all of my searches to read about Shark sounded very positive about it.

And then when I attempted to compile with -finstrument-functions, so I could use Saturn in the same way that I might have used gprof, I got a link error:

Undefined symbols:
"___cyg_profile_func_exit", referenced from:
_consolemsg in larceny.o
_consolemsg in larceny.o
...
"___cyg_profile_func_enter", referenced from:
_panic_abort in larceny.o
_panic_exit in larceny.o
...


I abandoned further attempts to use Saturn pretty quickly.

So then I tried Shark. And that was, if anything, worse. Because every time I tried to stop a profiling session, it would restart my machine!

Eventually I discovered (though not easily) this discussion post.

Hmm. CHUD 4.4.4 panics. I look at my version. Yep, its 4.4.4.

So I did a more careful search for a more recent version of CHUD (4.6.1, specifically). We'll see how that works out for me in a bit.

(I'm mostly posting this so that other people who run into a similar problem know that they need to get a newer version of CHUD. Maybe later I'll try to retrace my steps that led me to thinking that CHUD 4.4.4 was the version I needed to use...)

  • I do not know who linked to it, but some page (that I believe was on the Apple site) sent me to the follow ftp link, which has the version of CHUD that is incompatible with newer versions of OS X: ftp://ftp.apple.com/developer/Tool_Chest/Testing_-_Debugging/Performance_tools/CHUD_4.4.4.dmg

Followers