Friday, April 04, 2014

Haskell error reporting with locations

Error reporting in GHC is not always the nicest. For example, I often develop code by using undefined as a placeholder for code I've not written yet. Here's a simple example:
import System.Environment
  main = do
    args <- getargs
    if null args then
      undefined
     else
      undefined
Running this looks like this:
$ ./H
  H: Prelude.undefined
Which undefined caused that? Looking at the error message we have no idea. Wouldn't it be nice with some location information?

We can actually get location information by using Control.Exception.assert:

import Control.Exception(assert)
  import System.Environment

  main = do
    args <- getargs
    if null args then
      assert False undefined
     else
      assert False undefined
Now running it is much more informative:
$ ./H
  H: H.hs:7:9-14: Assertion failed
How is assert able to report the location? If we dig deep enough we discover that it's because the ghc compiler contains a special hack to recognize this function and give it location information.

A generalized hack

In a Haskell compiler that I've implemented I've taken this compiler hack and extended it so it can be used for any function.  It comes in two parts, location information and location transparent definitions.

__LOCATION__

The __LOCATION__ identifier is always defined and utterly magical. Its value is a string that describes the location of that very identifier. This is the very opposite of a referentially transparent name. In fact it's value varies with where it is placed in the code! So it's definitely not for purists. But I'm a practical man, so I sometimes have resort of the ugliness of reality. And in reality we want to report locations in errors.

Enough philosophy, here's an example:

main = do
    print __LOCATION__
    print   __LOCATION__
And running it prints:
"test/Test.hs:2:11"
  "test/Test.hs:3:13"
And to illustrate the impurity:
main = do
    let loc = __LOCATION__
    print loc
    print loc
And running this:
"test/Test.mu:2:15"
  "test/Test.mu:2:15"

Location transparency

The __LOCATION__ identifier gives the location of itself. This is of little use on its own. Imagine the definition we could give for undefined. Somewhere in the Prelude module it could say something like
undefined = error ("undefined: " ++ __LOCATION__)
But if we use this all that it will tell us is where the definition of undefined is, not where it was used.

To get the point of use instead of the definition I've introduced location transparent definitions. In a location transparent definition the __LOCATION__ identifier will not refer to its own position, but to the position of the reference to the definition. Location transparency is introduced with a pragma.

{-# LOCATIONTRANSPARENT undefined #-}
  undefined = error ("undefined: " ++ __LOCATION__)
With this definition our initial example looks like this when we run it:
undefined: test/H.hs:6:9
In fact, the real definition of undefined doesn't look like that. The __LOCATION__ identifier is only used in the definition of error, so it looks something like this:
{-# LOCATIONTRANSPARENT error #-}
  error :: String -> a
  error s = throw (ErrorCall (__LOCATION__ ++ ": " ++ s))

  {-# LOCATIONTRANSPARENT undefined #-}
  undefined = error "undefined"
Since both error and undefined are transparent any use of undefined will be reported with the location of the use.

Furthermore, we can make a few more functions location transparent, e.g.,

{-# LOCATIONTRANSPARENT head #-}
  head :: [a] -> a
  head [] = error "Empty list"
  head (x:xs) = x
A simple example:
main = putStr (head [])
Which will print:
test/Head.hs:1:16: Empty list
which is the location where head was called.

Implementation

There are different ways to implement this feature, and I'm going to sketch two of them.

First: Every function that has the LOCATIONTRANSPARENT pragma will be inlined at the point of use, and the __LOCATION__ identifier in the inlined code will be updated to reflect the call site. The definitions must be processed in a bottom-up fashion for this to work. It's fairly simple to implement, but will cause some code bloat due to inlining.

Second: Every function that has LOCATIONTRANSPARENT pragma will be rewritten (by the compiler) to have an extra location argument, and each use of this function will be rewritten to pass in the current location. For example (using $$f for the location version of f):

main = putStr ($$head __LOCATION__ [])

  $$head __LOCATION__ [] = $$error __LOCATION__ "Empty list"
  $$head __LOCATION__ (x:xs) = x
  $$error __LOCATION__ s = throw (ErrorCall (__LOCATION__ ++ ": " ++ s))
This should be fairly straightforward to implement, but I've not tried it. (It's somewhat like dynamic binding, so maybe ghc could reuse that mechanism for locations.)

And, of course, the global __LOCATION__ identifier has to be recognized by the compiler and replaced by a string that is its location.

Conclusion

I implemented the __LOCATION__ hack quite a while ago, and I like the much improved reporting of error locations. I hope someone will add it to ghc as well.

3 comments:

  1. If you extend the LOCATIONTRANSPARENT concept all the way up to main() what you wind up with is basically a stack trace, like you'd find in imperative languages.

    I've long said that the main drawback of laziness isn't space predictability or performance overheads, but rather the lack of automatic stack traces -- even in production code, with little or no overhead, like Java has.

    There have been a lot of proposals for adding stack traces to lazy languages, but IMHO all of them are either unpredictable (they nondeterministically truncate the reported stack trace in some way) or else have serious memory consumption problems that prevent their use in production code (in which case they're really more of a debugger than a stack trace).

    Carry on the good fight!

    - a

    ReplyDelete
  2. I don't understand how the LOCATIONTRANSPARENT mechanism can assigne blame correctly. For example:

    test1 = bar 0
    test2 = bar 1

    {-# LOCATIONTRANSPARENT bar #-}
    bar 0 = error "invalid zero"
    bar x = head []

    If I understand the proposed extensions correctly, in this example, evaluating test1 leads to an error message that blames test1 for using zero. I think this is correct.

    And evaluating test2 leads to an error message that blames test2 for using an empty list. This is clearly wrong, because test2 doesn't even use an empty list. Instead, the blame should be assigned to the second equation of bar. So I think this is wrong.

    If we drop the LOCATIONTRANSPARENT pragma, the error location for test2 is better, but the error location for test1 is worse.

    How can we assign blame correctly for both test1 and test2?

    ReplyDelete