Introduction

The issue of records in Haskell has been discussed at length (https://ghc.haskell.org/trac/ghc/wiki/Records). Extensible records, of which there are many implementations that work with an unmodified ghc, are often less convenient to use than ordinary records; between HList, vinyl and CTRex, the important features are covered.

One reason for this is that the labels for the records are inconvenient. If you have to write (Label :: Label "longDescriptiveName") instead of just longDescriptiveName, that discourages use of those records in practice.

An existing approach is to generate values which have an appropriate type for use as a label. In other words, a top-level definition can be made like longDescriptiveName = Label :: Label "longDescriptiveName". One drawback is that the label takes up namespace: when a longDescriptiveName is imported from two places, one needs to be hidden for the program to compile. Furthermore, depending on template-haskell is bad for portability, although the zeroTH preprocessor is applicable here.
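To make that approach concrete, here is a minimal self-contained sketch. The Label type below is a stand-in for the one the record libraries export, and labelName is only there so the example has something observable:

```haskell
{-# LANGUAGE DataKinds, KindSignatures, ScopedTypeVariables #-}
import GHC.TypeLits (Symbol, KnownSymbol, symbolVal)
import Data.Proxy (Proxy (..))

-- simplified stand-in for the Label type HList/CTRex-style libraries provide
data Label (sym :: Symbol) = Label

-- recover the string from the type-level symbol
labelName :: forall sym. KnownSymbol sym => Label sym -> String
labelName _ = symbolVal (Proxy :: Proxy sym)

-- the kind of top-level definition the generation approach produces
longDescriptiveName :: Label "longDescriptiveName"
longDescriptiveName = Label

main :: IO ()
main = putStrLn (labelName longDescriptiveName)
```

One such definition is needed per field name, which is exactly where the namespace clashes come from.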

Adding more syntax to the language could help. One inspiration is Template Haskell, which adds 'ident standing for the Name of the variable ident in scope. Using the same leading single quote ' is an option, but below a leading ` is stolen instead. There is a similar proposal using the "heavier" prefix # here. While this overlaps slightly with `infix` operators conceptually, there is no ambiguity, because labels have no trailing `. For example, a "pathological" input would be:

`name `F.fun` `ab.c.`d
`xy.z`w

Not every leading ` gets expanded. The preprocessor should produce something like:

(hLens' (Label :: Label "name"))
    `F.fun`
    (hLens' (Label :: Label "ab")).c.(hLens' (Label :: Label "d"))

(hLens' (Label :: Label "xy")).z(hLens' (Label :: Label "w"))

The contents of String or Character literals, such as "`" or '"' should not interfere.

Alternative syntax ('ident)

It would also be possible to overload the leading ' to be either a Name or a Label. This means desugaring 'ident to fromPrime (Label :: Label "ident") 'ident, where fromPrime is defined below:

class FromPrime sym a where
    fromPrime :: Label (sym::Symbol) -> Name -> a

instance FromPrime sym Name where
    fromPrime _ n = n

instance (Labelable p f s t a b, 
            lens ~ p (a -> f b) (Record s -> f (Record t))) =>
        FromPrime s lens where
    fromPrime label _ = hLens' label

The second instance is written FromPrime s lens instead of FromPrime s (p (a -> f b) (Record s -> f (Record t))), so that the instance will be selected without it being known that there is a Record type involved. This decision means that you get the hLens' version whenever the result type is not required to be Name.

One benefit is that parsers like haskell-src-exts need fewer modifications, since the leading quote is already valid syntax when TemplateHaskell is enabled. On the other hand, the second FromPrime instance means that only one Record type is favored. Perhaps a workable solution is to have ghc export a data family Record (a :: k).
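The return-type-driven instance selection can be illustrated in isolation with a simplified, self-contained analogue. Here Name is a hypothetical stand-in for Template Haskell's Name type, and a String stands in for the lens; the point is only that the result type alone decides which instance fires:

```haskell
{-# LANGUAGE FlexibleInstances #-}

-- hypothetical stand-in for Template Haskell's Name
newtype Name = Name String deriving (Eq, Show)

-- the result type decides which instance is used, just as 'ident
-- would yield either a Name or a lens depending on context
class FromPrime a where
    fromPrime :: String -> Name -> a

-- when a Name is wanted, pass the Name through
instance FromPrime Name where
    fromPrime _ n = n

-- when the "lens" (here just its printed form) is wanted,
-- build it from the label string
instance FromPrime String where
    fromPrime sym _ = "(hLens' (Label :: Label \"" ++ sym ++ "\"))"

main :: IO ()
main = do
    print    (fromPrime "ident" (Name "ident") :: Name)
    putStrLn (fromPrime "ident" (Name "ident") :: String)
```

The real second instance achieves the same effect with an equality constraint (lens ~ ...), which lets the instance match before the lens shape is known.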

Utility definitions

This document is written with knitr, which doesn't support calling ghc for compiled code out of the box, but this feature can be added:

library(knitr)
library(reshape)
## Loading required package: plyr
## 
## Attaching package: 'reshape'
## 
## The following objects are masked from 'package:plyr':
## 
##     rename, round_any
opts_chunk$set(cache = T, tidy = F, fig.height = 4, fig.width = 7)
# program to highlight
cat(file='HL.hs', 'import Text.Blaze.Html.Renderer.String
import Text.Highlighting.Kate
import System.Environment

main = do
    [inf,outf] <- getArgs
    writeFile outf . renderHtml .
        formatHtmlBlock defaultFormatOpts .
        highlightAs "hs" =<< readFile inf')
system2('ghc', args='HL.hs')

knit_engines$set(ghc = function(options) {
    filename <- paste0(options$label, '.hs')
    writeLines(options$code, filename)
    ec <- system2('ghc', args=c('-O2', filename))
    f <- tempfile()
    system2('./HL', args=c(filename, f))
    ff <- readLines(con=f)
    unlink(f)
    paste(ff, collapse='\n')
})

Preprocessor Definitions

Five different approaches to writing the preprocessor are tried. Some methods are slow or slightly incorrect. PP3 is the fastest; PP4 and PP5 are comparable. All three of those options run in linear time.

PP and PP2 fail on larger inputs. Furthermore, they do not account for character literals such as '"', so they do not transform the `z in the following input correctly:

x = '"'
y = `z

PP (Brex)

The first preprocessor uses the rex package, which in turn calls PCRE. The .* pattern match that grabs the rest of the file (parameter c in the transform function) seems to take time proportional to the size of the input, which leads to quadratic time overall.

By default PCRE runs without the dotall option. That option allows the second case of the transform function to properly handle strings which are split over lines, which happens with string literals such as:

x = "a really long comment here\
    \ and the rest of it on another line"
The Brex module defines a quasiquoter with the appropriate options set:
module  Brex where

import Text.Regex.PCRE.Rex
import Text.Regex.PCRE.Light.Base

brex = rexWithConf defaultRexConf{ rexByteString = True, rexPCREOpts = [ dotall ]}
Now use the quasiquoter defined above:
{-# LANGUAGE OverloadedStrings, QuasiQuotes, ViewPatterns #-}
module Main where
import Brex
import System.Environment
import Data.Monoid
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Lazy.UTF8 as L

main = operate =<< getArgs

operate [originalFileName, inputFile, outputFile] = do
    input <- B.readFile inputFile
    let linePragma = "{-# LINE 1 \"" <> L.fromString originalFileName <> "\" #-}\n"
    L.writeFile outputFile (linePragma <> L.fromStrict (transform input))
operate _ = do
    n <- getProgName
    error ("add this to hs files:   {-# OPTIONS_GHC -F -pgmF "++ n ++ " #-}")

transform :: B.ByteString -> B.ByteString
transform [brex|^(?{a}[^"`]*)(?{b}`([A-Z]([A-Za-z]*)\.)*[A-Za-z]*`)(?{transform -> c}.*)$|] = a <> b <> c
transform [brex|^(?{a}[^"`]*)(?{b}"(\\"|[^"])*")(?{transform -> c}.*)|] = a <> b <> c
transform [brex|^(?{a}[^"`]*)`(?{makeLabel -> b}[a-zA-Z]*)(?{transform -> c}.*)$|] = a <> b <> c
transform x = x

makeLabel b = "(hLens' (Label :: Label \""<>b<>"\"))"

PP2 (Peggy)

This preprocessor uses a parsing expression grammar implemented in the package peggy. The grammar specified here is probably easier to understand than the definition of transform in the previous option.
{-# LANGUAGE FlexibleContexts, QuasiQuotes #-}
import Text.Peggy
import Data.Monoid
import System.Environment

[peggy|
file :: String
  = str* { concat $1 }

str :: String
  = '\"' charLit* '\"' { '"' : concat $1 ++ "\"" }
  / '`' qual '`'  { '`' : $1 ++ "`" }
  / '`' qual { "(hLens' (Label :: Label \""++ $1 ++ "\"))" }
  / . { [$1] }

charLit :: String
  = '\\' '\"' { "\\\"" }
  / ![\"] . { [$1] }

qual :: String
  = ([A-Z]alpha* '.')* alpha+ { concatMap (\(a,b) -> a:b ++ ".") $1 ++ $2 }

alpha :: Char
 = [A-Z] { $1 }
 / [a-z] { $1 }
 / '_'   { '_' }
 / '\''  { '\'' }
 |]

main = operate =<< getArgs

operate [originalFileName, inputFile, outputFile] = do
    Right output <- parseFile file inputFile
    let linePragma = "{-# LINE 1 \"" <> originalFileName <> "\" #-}\n"
    writeFile outputFile (linePragma <> output)
A re-write of the above using "difference lists"; see for example [dlist](http://hackage.haskell.org/package/dlist), though in this case no newtype is used. This avoids the problem caused by operations like ++ being associated badly, which can lead to O(n^2) time. But this change does not fix the problem, as the benchmarks below show.
{-# LANGUAGE FlexibleContexts, QuasiQuotes #-}
import Text.Peggy
import Data.Monoid
import System.Environment

[peggy|
file :: String -> String
  = str* { foldr (.) id $1 }

str :: String -> String
  = '\"' charLit* '\"' { ('"' :) . foldr (.) id $1 . ('"':) }
  / '`' qual '`'  { ('`':) . $1 . ('`':) }
  / '`' qual { ("(hLens' (Label :: Label \""++) .  $1 . ("\"))"++) }
  / . { ($1 :) }

charLit :: String -> String
  = '\\' '\"' { ("\\\""++) }
  / ![\"] . { ($1:) }

qual :: String -> String
  = ([A-Z]alpha* '.')* alpha+ { foldr (\(a,as) b -> (a :) . (as++) . ('.':) . b) ($2++) $1 }

alpha :: Char
 = [A-Z] { $1 }
 / [a-z] { $1 }
 / '_'   { '_' }
 / '\''  { '\'' }
 |]

main = operate =<< getArgs

operate [originalFileName, inputFile, outputFile] = do
    Right output <- parseFile file inputFile
    let linePragma rest = "{-# LINE 1 \"" <> originalFileName <> "\" #-}\n" <> rest
    writeFile outputFile (linePragma . output $ "")
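The difference-list idea can be seen in isolation: a string is represented as "the function that prepends it", so concatenation becomes function composition, which is O(1) regardless of how it associates. A minimal sketch (the dlist package wraps this in a newtype):

```haskell
-- a string represented as the function that prepends it
type DString = String -> String

fromString :: String -> DString
fromString = (++)

toString :: DString -> String
toString d = d ""

-- composition appends in O(1); association no longer matters
main :: IO ()
main = putStrLn (toString (fromString "foo" . fromString "." . fromString "bar"))
```

This is why the semantic actions above all have the shape (x :) . rest rather than x : rest.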

Common string-based

The module PPMain below collects definitions shared by the remaining preprocessors; the only difference between those methods is the ?takeQual implicit parameter. The INLINE pragmas may or may not have been necessary (or helpful).
{-# LANGUAGE ImplicitParams, ViewPatterns, NoMonomorphismRestriction #-}
module PPMain where
import Data.Monoid
import System.Environment
import Data.List

{-# INLINE mkMain #-}
mkMain = operate =<< getArgs

{-# INLINE operate #-}
operate [originalFileName, inputFile, outputFile] = do
    input <- readFile inputFile
    let linePragma = "{-# LINE 1 \"" <> originalFileName <> "\" #-}\n"
    writeFile outputFile (linePragma <> s input)

{-# INLINE s #-}
s (stripPrefix "'\"'" -> Just xs) = "'\"'" ++ s xs
s (stripPrefix "'`'"  -> Just xs) = "'`'"  ++ s xs
s ('"': xs) = '"' : t xs
s ('`':  (?takeQual -> (a,xs))) = a ++ s xs
s (x:xs) = x : s xs
s [] = []

{-# INLINE t #-}
-- | inside string
t (stripPrefix "\\\"" -> Just xs) = "\\\"" ++ t xs
t ('"' : xs) = '"' : s xs
t (x:xs) = x : t xs

{-# INLINE addLabel #-}
addLabel xu = "(hLens' (Label :: Label \""++xu++"\"))"
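As a self-contained illustration of how the ?takeQual hook plugs into s, here is the scanner specialized with a deliberately naive takeQual that takes a plain run of letters (no module qualification, and the string/character-literal cases are dropped for brevity):

```haskell
{-# LANGUAGE ImplicitParams, ViewPatterns #-}
import Data.Char (isAlpha)

addLabel :: String -> String
addLabel xu = "(hLens' (Label :: Label \"" ++ xu ++ "\"))"

-- same shape as s in PPMain, minus the literal-handling cases
s :: (?takeQual :: String -> (String, String)) => String -> String
s ('`' : (?takeQual -> (a, xs))) = a ++ s xs
s (x:xs) = x : s xs
s [] = []

-- a naive hook: the label is just the leading run of letters
naiveTakeQual :: String -> (String, String)
naiveTakeQual str = case span isAlpha str of (x, rest) -> (addLabel x, rest)

main :: IO ()
main = let ?takeQual = naiveTakeQual
       in putStrLn (s "f `name x")
```

PP3, PP4 and PP5 below differ only in how cleverly their takeQual handles qualified names and trailing backticks.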

PP3 (list processing)

The definition of takeQual here is a bit confusing to say the least.
{-# LANGUAGE ImplicitParams, ViewPatterns #-}
module Main where
import PPMain
import Data.List
import Data.Char

main = let ?takeQual = takeQual [] in mkMain

takeQual accum (span isAlpha -> (x @ ((isUpper -> True) : _),'.':ys@(y:_)))
    | isUpper y = takeQual (x:accum) ys
    | isLower y,
      (a,b) <- span isAlpha ys,
      '`':b <- b = ('`' : intercalate "." (reverse ((a ++ "`"):x:accum)), b)
    | (a,b) <- span isAlpha ys = (addLabel (intercalate "." (reverse (a:x:accum))), b)
takeQual accum (span isAlpha -> (x,'`':xs)) = ('`': intercalate "." (reverse ((x ++ "`") : accum)), xs)
takeQual accum (span isAlpha -> (x, xs)) = (addLabel (intercalate "." (reverse (x:accum))), xs)

PP4 (Peggy)

The parsing expression grammar from PP2 is used to replace the takeQual function. Qualified identifiers end rather soon, since a space or parenthesis is never far away. So the inefficiency seen in PP2 doesn't come into play, because takeQual is called with strings that are at most one line full of characters (80-ish).
{-# LANGUAGE ImplicitParams, FlexibleContexts, QuasiQuotes, ViewPatterns #-}
module Main where
import PPMain
import Data.Char
import Text.Peggy

[peggy|
qual :: String
  = ([A-Z]alpha* '.')* alpha+ { concatMap (\(a,b) -> a:b ++ ".") $1 ++ $2 }

alpha :: Char
 = [A-Z] { $1 }
 / [a-z] { $1 }
 / '_'   { '_' }
 / '\''  { '\'' }
 |]

main = let ?takeQual = takeQual in mkMain

takeQual (span (\e -> isAlpha e || elem e "._") -> (a, rest))
    | Right a' <- parseString qual "<takeQual>" a,
      aTail <- zipLeft a a' = case aTail ++ rest of
            '`' : rest -> ('`':a' ++ "`", rest)
            rest -> (addLabel a', rest)

-- | Defined as below, except more lazy / possibly faster
-- 
-- prop> \(xs::[()]) (ys::[()]) -> zipLeft xs ys == drop (length ys) xs
zipLeft (_:xs) (_:ys) = zipLeft xs ys
zipLeft xs _ = xs
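The prop> comment above can also be checked concretely without QuickCheck; a minimal sketch that compares zipLeft against the drop-based specification on small inputs:

```haskell
import Control.Monad (unless)

-- same definition as in the PP4 module above
zipLeft :: [a] -> [b] -> [a]
zipLeft (_:xs) (_:ys) = zipLeft xs ys
zipLeft xs _ = xs

main :: IO ()
main = do
    let agrees xs ys = zipLeft xs ys == drop (length ys) xs
    unless (and [ agrees [1..n] [1..m] | n <- [0..5::Int], m <- [0..5::Int] ])
           (error "zipLeft disagrees with drop (length ys)")
    putStrLn "ok"
```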

PP5 (regex-applicative)

The same kind of grammar used above can also be described using regex-applicative. Furthermore, applicative-quoters is used to spread the definition out over a couple lines. This is perhaps the clearest version.
{-# LANGUAGE FlexibleContexts, QuasiQuotes, ViewPatterns, ImplicitParams #-}
module Main where
import PPMain
import Data.Char
import Text.Regex.Applicative
import Control.Applicative.QQ.ADo

main = let ?takeQual = takeQual in mkMain

-- "ModuleName."
modNameDot = [ado|
    m <- psym isUpper
    odName <- many (psym isAlpha)
    dot <- sym '.'
    m:odName ++ [dot] |]

-- "M.Od.Ule.Name.something"
qualIdent = [ado|
    modNames <- many modNameDot
    end <- some (psym isAlpha)
    concat (modNames ++ [end]) |]

takeQual x = case findLongestPrefix qualIdent x of
    Just (a , '`' : rest) -> ('`' : a ++ "`", rest)
    Just (a, rest) -> (addLabel a,rest)
    Nothing -> ("``", x) -- unlikely

Benchmarking

First generate larger input files by repeating the contents of dopt.hs multiple times. The preprocessors are not affected by the repeated module Main where and import ... lines that result. That program is just an example of the `ident notation.

Not all of the preprocessors are run with the largest input file, since that would take too much time.

library(plyr);

mkN <- function(n, file0='dopt.hs')
    system(paste('cat', paste(rep(file0, n), collapse=' '), ' > ', 'd.hs'))

d <- rbind(expand.grid(n= 1:100,rep=1:2, prog = c('PP3', 'PP4', 'PP5')),
          expand.grid(n= 1:8 , rep=1:2, prog = c('PP2','PP2b')),
          expand.grid(n= 1:30, rep=1:2, prog = 'PP'))
d <- d[ sample.int(nrow(d)), ]
d <- adply(d, 1, splat(function(n, prog, ...) {
        mkN(n)
        as.data.frame(as.list(system.time(
            system(paste0('./',prog, ' d.hs d.hs tmp.hs')))))
}))
opts_chunk$set(fig.height=6,fig.width=12,dev='svg')
library(ggplot2)
d2 <- melt(d, id.vars=c('n', 'prog', 'rep'))
plot(ggplot(d2, aes(n, value, col=variable))
     + facet_grid(~prog)
     + geom_point(size=0.5)
     + geom_smooth(method='loess'))
timings for all options. n is the number of repetitions of the dopt file. PP2 and PP take time quadratic in the size of the input (n^2).

d2 <- subset(d2, prog != 'PP2' & prog != 'PP' & prog != 'PP2b')
d2.bands <- ddply(d2, c('variable', 'prog'), function(x) {
        m <- predict(loess(value ~ n, x), se=T)
        cbind(x,
        data.frame(y=m$fit, y.max=m$fit + m$se*2, y.min=m$fit - m$se*2))
     })
library(directlabels)
plot(ggplot(d2.bands, aes(n, y=y, ymax=y.max, ymin=y.min,
                          col=variable,
                          group=interaction(variable, prog)))
     + geom_point(size=0.5)
     + geom_smooth(stat='identity')
     + ylab('seconds')
     + geom_dl(data=within(subset(d2.bands, rep==1),
                           lab <- sapply(strsplit(as.character(prog), 'PP'), function(x) x[2])),
               aes(n, y, label=lab,
                   col=variable), method='last.bumpup')
     )
Bands for point-wise 95% confidence intervals for the last 3 methods (numbered), and the 5 different timings (color)

pts <- dlply(d2, c('variable'), function(x) {
    m <- lm(value ~ prog:n, x)
    ms <- sapply(seq(from=5, to=80, by=5), function(nn) {
           m2 <- list(confint(update(m, data=subset(x, n>nn))))
           names(m2) <- nn
           m2 })
    c(list('0'=confint(m)), ms)
})
pts <- cast(melt(pts),  L1+L2+X1 ~ X2)
names(pts)[4:5] <- c('ymin','ymax')
pts$y <- with(pts, (ymin+ymax)/2)
plot(ggplot(within(subset(pts,X1 != '(Intercept)'),
                   levels(X1) <- substr(levels(X1), 7,7)),
            aes(as.numeric(L2), y=y, ymin=ymin, ymax=ymax,
                     label=X1, col=X1, group=interaction(X1,L1))) +
     facet_wrap(~L1, scales='free_y') +
     xlab('n restricted to values higher than') +
     ylab('slope (seconds / n)') +
     geom_dl(method='first.bumpup') +
     scale_color_discrete(guide='none') +
     geom_smooth(stat='identity'))
Since the previous graph has straight lines, the slopes of those lines are plotted here. The x-axis is the minimum value of n included in the linear regression. Increasing that threshold n has little effect besides increasing the pointwise 95% confidence intervals for the slopes.

Statistics

Here's a regression that confirms the linear term n is good enough. Some exp(n) terms are still "worth keeping" in the model, but this does not imply that performance will degrade exponentially: consider that some terms have negative coefficients (exp(n):progPP3), which would imply that at a high enough n the runtime drops to 0 again.

library(stargazer)
stargazer(
step(lm(value ~ (n + I(n^2) + exp(n)):prog,
        within(subset(d2, variable == 'user.child'),{
            n <- scale(n)
            value <- scale(value)
        })), trace=0), type='text')
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                value           
## -----------------------------------------------
## n:progPP3                    1.142***          
##                               (0.049)          
##                                                
## n:progPP4                    0.844***          
##                               (0.049)          
##                                                
## n:progPP5                    0.908***          
##                               (0.049)          
##                                                
## I(n2):progPP3                0.150***          
##                               (0.026)          
##                                                
## I(n2):progPP4                -0.089***         
##                               (0.026)          
##                                                
## I(n2):progPP5                 -0.034           
##                               (0.026)          
##                                                
## exp(n):progPP3               -0.227***         
##                               (0.036)          
##                                                
## exp(n):progPP4               0.173***          
##                               (0.036)          
##                                                
## exp(n):progPP5                0.071**          
##                               (0.036)          
##                                                
## Constant                      -0.018           
##                               (0.034)          
##                                                
## -----------------------------------------------
## Observations                    600            
## R2                             0.983           
## Adjusted R2                    0.983           
## Residual Std. Error      0.131 (df = 590)      
## F Statistic         3,833.000*** (df = 9; 590) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01