== To first reviewer:

> However, I think that a major (and perhaps fatal) weakness of the
> paper (for the ICFP audience) is that there isn't a take home message
> that applies outside the context of GHC.

We respectfully disagree. Our results have immediate relevance for anyone building a compiler for a lazy functional language. Firstly, we give quantitative evidence that adding extra tests to avoid indirect jumps is a net win (Section 5). Secondly, we present a completely new idea, that of encoding the constructor tag in the pointer, in a way that works *even in a lazy language* (Section 6). We believe these results will hold for any implementation; of course we can only substantiate this claim for the particular implementation on which we did our experiments (but it's a strength of the paper that we do demonstrate end-to-end performance improvements on a mature compiler). The referee is right to point out that we don't stress the general applicability of the ideas sufficiently.

> * It occurs to me that one other scheme might be worth considering.
> Rather than tagging the pointer to the closure, one might tag the
> pointer to the info table.

This scheme is described in (Hammond,1993) and we discuss it in the related work section. We didn't implement it, though.

> * In Section 5.1., the authors conjecture that the Xeon processor's
> longer pipeline makes branch mispredictions more costly. ...

Good point - this is a conjecture in the paper, and we should be clearer about that. It is not crucial to the results of the paper. For what it's worth, the reasoning goes like this: we established via measurements that most of the speedup on the Opteron machine was due to reduction in branch mispredictions. Hence, if the speedup on Xeon is greater, it is likely (though not certain, of course) that this is due to a greater penalty for branch misprediction.
Also, what we know of Intel's architecture (long pipeline) backs this up, and reference (Fog,2006) states that a branch misprediction on this architecture is "rarely less than 24 clock cycles", compared to 10-12 cycles on the AMD.

== To second reviewer:

We haven't systematically analysed the cause of the slowdown, but we think we know which factors cause it, and we can elaborate if necessary. When a program has a low hit rate (case-on-evaluated-data), the extra tests are just wasted effort. Also, with pointer tagging, the GC has to do more work to propagate the tags around.

== To third reviewer:

On CPU counter measurements: we used Opteron counters only, because of resource limitations. We expect that Xeon counters would yield similar results (except for the higher branch-misprediction penalty; see above).

On pointer tagging for functions: dynamic measurements (see Section 6.3) already show that 3 bits are enough to cover 99% of the cases.
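
For concreteness, the general idea of carrying a small tag in the low bits of an aligned pointer can be sketched as follows. This is an illustrative sketch only, not the code used in our implementation: the names `tag_closure`, `get_tag` and `untag_closure` are hypothetical, and the sketch assumes 8-byte-aligned closures (so that 3 low bits are free) and a convention in which tag 0 means "nothing known".

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: closures are 8-byte aligned, so the low 3 bits of a
   closure address are always zero and are free to carry a tag. */
#define TAG_BITS 3u
#define TAG_MASK (((uintptr_t)1 << TAG_BITS) - 1)

/* Hypothetical convention for this sketch: tag 0 means "nothing known";
   tags 1..7 carry a small piece of information about the closure. */

/* Store a tag in the low bits of an aligned closure pointer. */
static inline void *tag_closure(void *p, unsigned tag) {
    assert(((uintptr_t)p & TAG_MASK) == 0); /* pointer must be aligned */
    assert(tag <= TAG_MASK);                /* tag must fit in 3 bits  */
    return (void *)((uintptr_t)p | tag);
}

/* Read the tag without touching the memory the pointer refers to. */
static inline unsigned get_tag(const void *p) {
    return (unsigned)((uintptr_t)p & TAG_MASK);
}

/* Clear the tag bits to recover the real address before dereferencing. */
static inline void *untag_closure(void *p) {
    return (void *)((uintptr_t)p & ~TAG_MASK);
}
```

The point of the sketch is that `get_tag` inspects only the pointer value, so a test on the tag needs no memory access and no indirect jump; only `untag_closure` is needed before the closure itself is read.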