the informal ramblings of a formal language researcher

Thursday, August 27, 2009

from 90sec to 9sec.

I have been reading Knuth's pre-fascicles from Chapter 7 of TAOCP.

Last night when I transcribed one of the algorithms (7.2.1.1-W) to Scheme code, I got caught up in Knuth's claim that the iterative process described should take "at most a few seconds", yet my first version was taking 90 seconds to run in Larceny. So I started the human iterative process of revising the code, inspecting the compiler output, and doing more runs to figure out where the time was going.

At the end of it, I got the running time down to 9 seconds (from 90; that's a 10x speed-up!) You can shave off a few more seconds by switching to unsafe mode via compiler switches, but it seems like the really important thing was understanding which operations were going to turn into native machine instructions, which ones would turn into millicode invocations in the C runtime, and which ones would turn into Scheme procedure invocations.

(I think 9 qualifies as "a few seconds", especially for a dynamic language like Scheme.)

So here, for your viewing pleasure, is both the first and last versions of the code, along with the log in the comments of the changes I made and what effect they seemed to have.


;; ----------------------------------------------------------------------
;; What is largest set of five letter words connected via single letter
;; swaps? To find out, we use specialized representation of sgb-words,
;; where each word is mapped to an index in a bit-table with 2^25 entries
;; (I would represent each bit by a byte for ease of coding, but Larceny
;; cannot construct objects >= 16 MB, so my hand is forced.)

(define bytevector-bit-ref
(lambda (bv k)
(let ((q (quotient k 8))
(r (remainder k 8)))
(fxarithmetic-shift-right
(fxand (bytevector-u8-ref bv q)
(fxarithmetic-shift-left 1 r)) r))))

(define bytevector-bit-ref?
(lambda (bv k)
(not (zero? (bytevector-bit-ref bv k)))))

(define bytevector-bit-set!
(lambda (bv k bit)
(let ((q (quotient k 8))
(r (remainder k 8)))
(bytevector-u8-set!
bv q
(fxior (fxand (bytevector-u8-ref bv q)
(fxxor #b11111111 (fxarithmetic-shift-left 1 r)))
(fxarithmetic-shift-left bit r))))))

(define (word->index w)
(let ((char->num (lambda (c) (- (char->integer c) (char->integer #\a))))
(offsets (map (lambda (e) (expt 2 e)) '(20 15 10 5 0))))
(apply + (map * (map char->num (string->list w))
offsets))))

(define (index->word i)
(let ((num->char (lambda (n) (integer->char (+ n (char->integer #\a)))))
(offsets (map (lambda (e) (expt 2 e)) '(20 15 10 5 0))))
(list->string (map num->char (map (lambda (o) (remainder (quotient i o) (expt 2 5)))
offsets)))))

(define (read-sgb-words file)
(call-with-input-file file
(lambda (p)
(let loop ((words '()))
(let ((x (read-line p)))
(cond
((eof-object? x)
(reverse words))
(else
(loop (cons x words)))))))))

(define (make-sgb-words-table words)
(let ((table (make-bytevector (quotient (+ (expt 2 25) 7) 8))))
(for-each (lambda (x) (bytevector-bit-set! table (word->index x) 1))
words)
table))

(define words (read-sgb-words "sgb-words.txt"))
(define t (make-sgb-words-table words))

;; This procedure takes 96sec on Chimaera.
;; Knuth says this process should finish in "at most a few seconds"; see below.
(define (find-max-intermix t words)
(let ((A-best '())
(l-best 0)
(wiz (map word->index words))
(two^25 (expt 2 25))
(two^20 (expt 2 20))
(two^15 (expt 2 15))
(two^10 (expt 2 10))
(two^5 (expt 2 5)))
(let loop1 ((w1 wiz))
(cond
((not (null? w1))
(let loop2 ((w2 (cdr w1)))
(cond
((null? w2)
(loop1 (cdr w1)))
(else
(let* ((w (car w1))
(w* (car w2))
(z (fxxor w w*))
(m (+ two^20 two^15 two^10 two^5 1))
(m* (* two^5 m)))
(cond
((not (zero? (fxand (fxxor (fx- z m) z m) m*)))
(loop2 (cdr w2)))
(else
(let ((m (vector (fxand z (fx- two^5 1)) ; m_0
(fxand z (fx- two^10 two^5)) ; m_1
(fxand z (fx- two^15 two^10)) ; m_2
(fxand z (fx- two^20 two^15)) ; m_3
(fxand z (fx- two^25 two^20)) ; m_4
)))
(let ((l 1) (A (list w)))
(let* ((sw (lambda (j)
(set! w (fxxor w (vector-ref m j)))
(cond ((bytevector-bit-ref? t w)
(set! l (+ l 1))
(set! A (cons w A))))))
(swap-0 (lambda () (sw 0)))
(swap-1 (lambda () (swap-0) (sw 1) (swap-0)))
(swap-2 (lambda () (swap-1) (sw 2) (swap-1)))
(swap-3 (lambda () (swap-2) (sw 3) (swap-2)))
(swap-4 (lambda () (swap-3) (sw 4) (swap-3))))
(swap-4)
(cond ((> l l-best)
(display A) (newline)
(set! l-best l)
(set! A-best A))))))
(loop2 (cdr w2)))))))))))))


;; Knuth says this process should finish in "at most a few seconds" but its
;; taking Chimaera 96.4sec; lets see if allowing more inlining helps matters...
;; - Using internal define for bytevector-bit-ref and bytevector-bit-ref?
;; shaved off 6 seconds to 90.1sec
;; - Using macros to define { sw, swap-0, swap-1, swap-2, swap-3, swap-4 }
;; brings time down to 77.7sec
;; - Using macros to define bytevector-bit-ref and bytevector-bit-ref?
;; did not improve the time at all... (but I did not do it properly the first time!)
;; - Lifting the definitions of m and m* did not improve the time.
;; - Replacing the encoding of m_i via a 5-elem vector with explicit
;; variables { m_0 .. m_4 } may have shaved a fraction of a second off
;; the time...
;; - Fixing the macro for bytevector-bit-ref and bytevector-bit-ref? did not help.
;; - Replacing fxarithmetic-shift-left and fxarithmetic-shift-right with fxlsh and fxrsh
;; was huge; brought time down to 40.5sec
;; - Changing encoding of 2^k to thunks rather than let-bound constants did not help anything.
;; - Changing encoding of 2^k to macros rather than thunks did not help anything [[ but I may have been timing the wrong code here ]]
;; - Changing encoding of m and m' to macros rather than let-bound constants did not help anything.
;;
;; [[ note on last three items: it looks strongly like Twobit is
;; reintroducing the let-bindings for temporaries rather than
;; inline their values in the code, which is interesting.]]
;; [[ (are fxlsh and friends not interpreted at compile time?) ]]

;; - Manually constant folding fxlsh and friends got down to 39.8sec (maybe not statistically sig delta).
;; - Adding check for bytevector? of t at outset got down to 39.1sec (again maybe not stat sig on own).
;; - Switching to multi-valued fxdiv-and-mod was a terrible idea (x2.5 slowdown)
;; - And even using fxdiv fxmod separately was a terrible idea (x2 slowdown)

;; [[ We apparently do not have any compiler optimizations for fxquotient and similar; hmm. ]]

;; - Changing the code to use fxrhsl and fxand instead of quotient and remainder
;; brings running time down to 9sec (from between 37 and 39 seconds; another huge leap!).

(define (find-max-intermix.v16 t words)

(let*-syntax ((bytevector-bit-ref
(syntax-rules ()
((_ bv k)
(let ((q (fxrshl k 3))
(r (fxand k #b111)))
(fxrsha
(fxand (bytevector-u8-ref bv q)
(fxlsh 1 r)) r)))))

(bytevector-bit-ref?
(syntax-rules ()
((_ bv k)
(not (zero? (bytevector-bit-ref bv k))))))

(two^25 (syntax-rules () ((_) 33554432)))
(two^20 (syntax-rules () ((_) 1048576)))
(two^15 (syntax-rules () ((_) 32768)))
(two^10 (syntax-rules () ((_) 1024)))
(two^5 (syntax-rules () ((_) 32)))
(m (syntax-rules () ((_) 1082401 ))) ;; (+ (two^20) (two^15) (two^10) (two^5) 1)
(m* (syntax-rules () ((_) 34636832 ))) ;; (* (two^5) (m)))))
)

(let* ((A-best '())
(l-best 0)
(wiz (map word->index words)))
(cond
((not (bytevector? t)) (error))
(else
(let loop1 ((w1 wiz))
(cond
((not (null? w1))
(let loop2 ((w2 (cdr w1)))
(cond
((null? w2)
(loop1 (cdr w1)))
(else
(let* ((w (car w1))
(w* (car w2))
(z (fxxor w w*)))
(cond
((not (fxzero? (fxand (fxxor (fx- z (m)) z (m)) (m*))))
(loop2 (cdr w2)))
(else
(let ((m_0 (fxand z (fx- (two^5) 1)))
(m_1 (fxand z (fx- (two^10) (two^5))))
(m_2 (fxand z (fx- (two^15) (two^10))))
(m_3 (fxand z (fx- (two^20) (two^15))))
(m_4 (fxand z (fx- (two^25) (two^20)))))
(let ((l 1) (A (list w)))
(let*-syntax ((upd (syntax-rules ()
((upd) (cond ((bytevector-bit-ref? t w)
(set! l (+ l 1))
(set! A (cons w A)))))))
(sw (syntax-rules ()
((sw 0) (begin (set! w (fxxor w m_0)) (upd)))
((sw 1) (begin (set! w (fxxor w m_1)) (upd)))
((sw 2) (begin (set! w (fxxor w m_2)) (upd)))
((sw 3) (begin (set! w (fxxor w m_3)) (upd)))
((sw 4) (begin (set! w (fxxor w m_4)) (upd)))))
(swap-0 (syntax-rules ()
((_) (sw 0))))
(swap-1 (syntax-rules ()
((_) (begin (swap-0) (sw 1) (swap-0)))))
(swap-2 (syntax-rules ()
((_) (begin (swap-1) (sw 2) (swap-1)))))
(swap-3 (syntax-rules ()
((_) (begin (swap-2) (sw 3) (swap-2)))))
(swap-4 (syntax-rules ()
((_) (begin (swap-3) (sw 4) (swap-3))))))
(begin
(swap-4)
(cond ((> l l-best)
(display A) (newline)
(set! l-best l)
(set! A-best A)))))))
(loop2 (cdr w2))))))))))))))))


UPDATE: I realized it might be worthwhile to identify how much speed-up would have resulted if I had just applied the most important steps to the original version at the outset.

The answer is that just:

  1. Turning the bit-referencing operations into internal defines within the procedure, and

  2. Using the fxlsh, fxrshl, and fxand operations for the definitions of the bit-referencing operations


brings the running time down to 18 seconds. So somewhere around 70 seconds of the time was spent on inefficiencies due to using general purpose arithmetic operations when fixnum operations (that the compiler specializes) would probably suffice for most people; no need to use macros to force an inlined swap(4) operation (which is what Knuth recommends in his text).

Here is the 18 second version. Note that its code bytevector after being compiled by Twobit is 4.0 kilobytes, while the more heavily optimized find-max-intermix.v16 is 17.8 kilobytes; that is a 4x space regression for a 2x speed improvement.

(define (find-max-intermix t words)

(define bytevector-bit-ref
(lambda (bv k)
(let ((q (fxrshl k 3))
(r (fxand k #b111)))
(fxrshl
(fxand (bytevector-u8-ref bv q)
(fxlsh 1 r)) r))))

(define bytevector-bit-ref?
(lambda (bv k)
(not (zero? (bytevector-bit-ref bv k)))))

(let ((A-best '())
(l-best 0)
(wiz (map word->index words))
(two^25 (expt 2 25))
(two^20 (expt 2 20))
(two^15 (expt 2 15))
(two^10 (expt 2 10))
(two^5 (expt 2 5)))
(let loop1 ((w1 wiz))
(cond
((not (null? w1))
(let loop2 ((w2 (cdr w1)))
(cond
((null? w2)
(loop1 (cdr w1)))
(else
(let* ((w (car w1))
(w* (car w2))
(z (fxxor w w*))
(m (+ two^20 two^15 two^10 two^5 1))
(m* (* two^5 m)))
(cond
((not (zero? (fxand (fxxor (fx- z m) z m) m*)))
(loop2 (cdr w2)))
(else
(let ((m (vector (fxand z (fx- two^5 1)) ; m_0
(fxand z (fx- two^10 two^5)) ; m_1
(fxand z (fx- two^15 two^10)) ; m_2
(fxand z (fx- two^20 two^15)) ; m_3
(fxand z (fx- two^25 two^20)) ; m_4
)))
(let ((l 1) (A (list w)))
(let* ((sw (lambda (j)
(set! w (fxxor w (vector-ref m j)))
(cond ((bytevector-bit-ref? t w)
(set! l (+ l 1))
(set! A (cons w A))))))
(swap-0 (lambda () (sw 0)))
(swap-1 (lambda () (swap-0) (sw 1) (swap-0)))
(swap-2 (lambda () (swap-1) (sw 2) (swap-1)))
(swap-3 (lambda () (swap-2) (sw 3) (swap-2)))
(swap-4 (lambda () (swap-3) (sw 4) (swap-3))))
(swap-4)
(cond ((> l l-best)
(display A) (newline)
(set! l-best l)
(set! A-best A))))))
(loop2 (cdr w2)))))))))))))

No comments:

Followers