== First reviewer:

> However, I think that a major (and perhaps fatal) weakness of the
> paper (for the ICFP audience) is that there isn't a take home message
> that applies outside the context of GHC.

We respectfully disagree.  Our results have immediate relevance for
anyone building a compiler for a lazy functional language.  Firstly,
we give quantitative evidence that adding extra tests to avoid
indirect jumps is a net win (Section 5).  Secondly, we present a
completely new idea, that of encoding the constructor tag in the
pointer, in a way that works *even in a lazy language* (Section 6).
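
To make the second point concrete, here is a minimal sketch of the
general idea, not the GHC implementation.  It assumes closures are
8-byte aligned, so the low 3 bits of a pointer are free; tag 0 means
"not known to be evaluated", and a non-zero tag means "evaluated, with
constructor number tag-1".  The names tag_pointer, untag and case_on,
and the dictionary standing in for the heap, are hypothetical.

```python
TAG_BITS = 3                     # assumption: 8-byte-aligned closures
TAG_MASK = (1 << TAG_BITS) - 1   # low bits available for the tag


def tag_pointer(addr, con_index):
    """Store a 0-based constructor number in the low bits of addr."""
    assert addr & TAG_MASK == 0, "closure addresses must be aligned"
    assert 0 <= con_index < TAG_MASK, "constructor does not fit in the tag"
    return addr | (con_index + 1)  # tag 0 is reserved for 'unknown'


def untag(ptr):
    """Recover the real closure address."""
    return ptr & ~TAG_MASK


def case_on(ptr, heap):
    """Scrutinise ptr; return (constructor number, closure was entered)."""
    tag = ptr & TAG_MASK
    if tag != 0:
        return tag - 1, False      # fast path: no memory access, no jump
    closure = heap[untag(ptr)]     # slow path: possibly a thunk, so enter it
    return closure(), True


# Toy heap: one closure that evaluates to constructor number 1.
heap = {0x1000: lambda: 1}
print(case_on(0x1000, heap))                  # (1, True)
print(case_on(tag_pointer(0x1000, 1), heap))  # (1, False)
```

The scheme stays safe under laziness in this sketch because an
untagged pointer is always handled by the slow path: a pointer to a
thunk simply carries tag 0.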

We believe these results will hold for any implementation; of course
we can only substantiate this claim for the particular implementation
on which we did our experiments (but it's a strength of the paper that
we do demonstrate end-to-end performance improvements on a mature
compiler).  The referee is right to point out that we don't stress the
general applicability of the ideas sufficiently.

> * It occurs to me that one other scheme might be worth considering.
>  Rather than tagging the pointer to the closure, one might tag the
>  pointer to the info table. 

This scheme is described in (Hammond, 1993), and we discuss it in the
related work section.  We did not implement it, though.

> * In Section 5.1., the authors conjecture that the Xeon processor's
>  longer pipeline makes branch mispredictions more costly. ...

Good point - this is a conjecture in the paper, and we should be
clearer about that.  It is not crucial to the results of the paper.

For what it's worth, the reasoning goes like this: we established via
measurements that most of the speedup on the Opteron machine was due
to reduction in branch mispredictions.  Hence, if the speedup on Xeon
is greater, it is likely (though not certain, of course) that this is
due to a greater penalty for branch misprediction.  What we know of
the Intel architecture (a long pipeline) also supports this, and
(Fog, 2006) states that a branch misprediction on this architecture
is "rarely less than 24 clock cycles", compared to 10-12 cycles on
the AMD.
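
The arithmetic behind this reasoning can be sketched as follows.  The
penalties are the figures quoted from (Fog, 2006); the number of
avoided mispredictions is a hypothetical placeholder, and the sketch
assumes (which we have not measured) that both machines avoid the same
number of mispredictions.

```python
def cycles_saved(avoided_mispredictions, penalty_cycles):
    """Cycles recovered if each avoided misprediction costs penalty_cycles."""
    return avoided_mispredictions * penalty_cycles


AVOIDED = 1_000_000        # hypothetical count, for illustration only
XEON_PENALTY = 24          # "rarely less than 24 clock cycles" (lower bound)
OPTERON_PENALTY = (10, 12) # quoted range for the AMD

xeon = cycles_saved(AVOIDED, XEON_PENALTY)
opteron_lo = cycles_saved(AVOIDED, OPTERON_PENALTY[0])
opteron_hi = cycles_saved(AVOIDED, OPTERON_PENALTY[1])

# Ratio of cycles saved on the Xeon to cycles saved on the Opteron.
print(xeon / opteron_hi, xeon / opteron_lo)  # 2.0 2.4
```

Under these assumptions the same optimisation recovers at least 2 to
2.4 times as many cycles on the Xeon, which is consistent with the
conjecture but does not prove it.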

== Second reviewer:

We haven't systematically analysed the cause of the slowdown, but we
think we know which factors cause it, and we can elaborate if necessary.
When a program has a low hit rate (few case expressions scrutinise
already-evaluated data), the extra tests are just wasted effort.  Also,
with pointer tagging, the GC has to do more work to propagate the tags.
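
A simple cost model illustrates the first factor.  It is a sketch
only: the cycle costs are hypothetical, the function names are ours,
and the model ignores the extra GC work mentioned above.

```python
def net_saving(hit_rate, test_cost, enter_cost):
    """Expected cycles saved per case expression by testing before entering.

    Every case pays test_cost; only hits (already-evaluated data) avoid
    the enter_cost of the indirect jump and return.
    """
    return hit_rate * enter_cost - test_cost


def break_even_hit_rate(test_cost, enter_cost):
    """Hit rate below which the extra test is a net loss."""
    return test_cost / enter_cost


# Hypothetical costs: 1 cycle for the test, 10 cycles to enter and return.
TEST_COST, ENTER_COST = 1, 10
print(break_even_hit_rate(TEST_COST, ENTER_COST))
print(net_saving(0.05, TEST_COST, ENTER_COST) < 0)  # low hit rate: a loss
print(net_saving(0.90, TEST_COST, ENTER_COST) > 0)  # high hit rate: a win
```

In this model a program whose hit rate falls below test_cost/enter_cost
slows down, which matches the slowdowns we see on a few benchmarks.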

== Third reviewer:

On CPU counter measurements: we used Opteron counters only, because of
resource limitations. We expect that Xeon counters would yield similar
results (except for the higher branch-misprediction penalty; see
above).

On pointer tagging for functions: dynamic measurements (see Section
6.3) already show that 3 bits are enough to cover 99% of the cases.