3. Lexical form

3.1 Overview

Although Perfect programs are normally represented using an extended character set in an integrated development environment, they can be represented in the standard ASCII character set or in Unicode. This chapter describes how the character set is used to express the various tokens of the language.

3.2 Character set

The character set used by Perfect comprises the letters A through Z and a through z, the digits 0 through 9 and the following special characters:

  . , ; : ? ! ' " ` + - * / % & | ( ) { } [ ] < > ~ # ^ _ = \ @

[SC] Other printable characters supported by the underlying character set are legal only in comments and in character and string literals. Nonprintable characters other than those characters or character combinations used to represent space, newline and horizontal tab are illegal in Perfect text, except that a nonprintable character or character combination representing end-of-file may be present at the very end of the program if allowed by the underlying file system and character representation.

When printing or displaying Perfect text, it is recommended that tab stops be assumed every 4 space-character widths from the left-hand margin.

[Note: the character "$" is the only printable 7-bit character in the ASCII set that is not used.]
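
The note above can be checked mechanically. The following Python sketch (Python is used purely for illustration, and the names SPECIALS, PERFECT_CHARS and PRINTABLE_ASCII are assumptions of the sketch, not part of Perfect) subtracts the letters, digits and special characters listed in this section from the printable 7-bit ASCII characters other than space.

```python
import string

# Special characters copied from the list in section 3.2.
SPECIALS = set(".,;:?!'\"`+-*/%&|(){}[]<>~#^_=\\@")

# Letters, digits and specials together form the Perfect character set.
PERFECT_CHARS = set(string.ascii_letters) | set(string.digits) | SPECIALS

# Printable 7-bit ASCII characters other than space: codes 33 to 126.
PRINTABLE_ASCII = {chr(code) for code in range(33, 127)}

unused = PRINTABLE_ASCII - PERFECT_CHARS
print(sorted(unused))  # prints ['$']
```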

3.3 Comments

Comments are introduced by two adjacent forward slash characters ("//") and terminate at the end of the line. The last line in a file is always considered as having an end, even if there is no end-of-line marker before the end of the file.
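
As an illustration only, a tool that removes comments might behave like the Python sketch below. The function names are hypothetical. The sketch assumes that "//" inside a character or string literal does not start a comment (see section 3.9) and that a backslash inside a literal always escapes the character after it.

```python
def strip_comment(line: str) -> str:
    """Return `line` without its trailing // comment, if it has one."""
    quote = None  # quote character of the literal we are inside, if any
    i = 0
    while i < len(line):
        ch = line[i]
        if quote is not None:
            if ch == "\\":
                i += 2  # skip the backslash and the character it escapes
                continue
            if ch == quote:
                quote = None  # end of the literal
        elif ch in ('"', "`"):
            quote = ch  # start of a string or character literal
        elif line.startswith("//", i):
            return line[:i]  # the comment runs to the end of the line
        i += 1
    return line


def strip_comments(text: str) -> list[str]:
    # splitlines() yields the last line even when the file has no final
    # end-of-line marker, in keeping with the rule stated above.
    return [strip_comment(line) for line in text.splitlines()]


print(strip_comments('total // running sum\n"a//b" // not inside the string'))
# prints ['total ', '"a//b" ']
```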

3.4 White space

Comments, and newline, space and tab characters (other than those within comments, and space characters within string and character literals) are collectively known as whitespace. Multiple adjacent whitespace elements are equivalent to a single whitespace.

Whitespace may occur between any two program tokens but not within an identifier, literal, reserved word or multi-character token. Whitespace may, however, appear between two tokens that construct a new operator from an existing one according to the rules of the language. Whitespace must occur between a pair of adjacent tokens if the beginning of the second would otherwise be a legal continuation of the first (e.g. between a reserved word and an identifier).

3.5 Multi-character tokens

The multi-character tokens of the language are:

  <=  >=  <<  >>  <<=  >>=  <==  ==>  <==>  ||  ~~  ^=  :-  ::  ->  <-  <->  ++  --  **  ##  ..  ...

Where an input character sequence can be interpreted in more than one way, the lexical analyser picks the longest leading sub-sequence that forms a token, then applies the same rule to the remainder of the input sequence. For example, "=>>" would be interpreted as "=>" followed by ">", not as "=" followed by ">>", even if the former interpretation gave rise to a parsing or other error message and the latter did not.
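
The longest-match rule can be sketched in Python as follows. This is a minimal illustration, not the actual lexical analyser: the names MULTI_CHAR_TOKENS and split_operators are hypothetical, whitespace and non-operator tokens are ignored, and the table contains only the tokens in the list above, so it does not model any token absent from that list.

```python
# Multi-character tokens copied from the list in section 3.5.
MULTI_CHAR_TOKENS = [
    "<=", ">=", "<<", ">>", "<<=", ">>=", "<==", "==>", "<==>", "||",
    "~~", "^=", ":-", "::", "->", "<-", "<->", "++", "--", "**", "##",
    "..", "...",
]

# Try longer tokens first so that the longest leading match always wins.
_BY_LENGTH = sorted(MULTI_CHAR_TOKENS, key=len, reverse=True)


def split_operators(text: str) -> list[str]:
    """Split a run of operator characters using the longest-match rule."""
    tokens = []
    i = 0
    while i < len(text):
        for candidate in _BY_LENGTH:
            if text.startswith(candidate, i):
                tokens.append(candidate)
                i += len(candidate)
                break
        else:
            # No multi-character token starts here: emit one character.
            tokens.append(text[i])
            i += 1
    return tokens


print(split_operators("<==>"))  # prints ['<==>']
print(split_operators("...."))  # prints ['...', '.']
print(split_operators("--->"))  # prints ['--', '->']
```

Sorting the table by length is a simple way to get longest-match behaviour; a production lexer would more likely use a state machine or a trie.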

3.6 Reserved words

The reserved words of the language are:

abstract    absurd    after    any    anything    as    assert    associative    axiom    bag    begin    bool    build    byte    catch    change    char    class    commutative    confined    const    decrease    deferred    define    done    early    end    enum    exists   external    false    fi    final    float    for    forall    from    function    ghost    goto    has    heap    highest    if    idempotent    identity    implements    import    in    inherits    int    interface    internal    invariant    is    it    keep    let    like    limited    loop    lowest    map    name    nonmember    null    of    on    opaque    operator    out    over    pair    par    pass    post    pragma    pre    proof    property    public    rank    real    redefine    ref    repeated    require    result    satisfy    schema    selector    self    seq    set    storable    super    supports    tag    that    then    those    throw    total    trace    triple    true    try    until    value    var    via    void    when    within    yield

Reserved words are case-sensitive. Note that catch, float, implements, supports, throw, trace and try are not used at present, but are reserved for future use.

The following words are not reserved but are names for built-in global methods and are therefore best avoided:

debugHalt    debugPrint    flatten    interleave    loadObject    max    min    storeObject    swap

Similarly, the following words are names for built-in classes and are also best avoided:

ByteData    ByteStream    CharDecoder    CharEncoder    CharEncoderDecoder    Comparator    DebugType    Environment    FileAttribute    FileError    FileHandle    FileMode     FileModeType    FilePath    FileRef    FileStats    FileStream    GuardedObject    InputStream    nat    OsInfo    OsType    OutputStream    ReverseComparator    SerialError    SerialErrorType    SimpleComparator    Socket    SocketError    SocketMode    StandardInputStream    StandardOutputStream    Storable    StreamBase    StreamHeap    string    Time

3.7 Identifiers

Identifiers comprise a letter or underscore character optionally followed by any number of characters each of which is a letter, digit or underscore character, provided only that the resulting string is not a reserved word. There is no limit on the length of an identifier. All characters in an identifier are significant. The case of letters is significant.
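
The identifier rule maps directly onto a regular expression. The Python sketch below is illustrative only: is_identifier is a hypothetical helper, and RESERVED_SUBSET holds just a few of the reserved words from section 3.6 to keep the example short.

```python
import re

# A letter or underscore, then any number of letters, digits or underscores.
IDENTIFIER_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Only a small subset of the reserved words in section 3.6, for brevity.
RESERVED_SUBSET = {"class", "function", "if", "then", "var", "self"}


def is_identifier(word: str) -> bool:
    """True if `word` has identifier form and is not a (listed) reserved word."""
    return IDENTIFIER_RE.fullmatch(word) is not None and word not in RESERVED_SUBSET


print(is_identifier("_count2"))  # prints True
print(is_identifier("2nd"))      # prints False: starts with a digit
print(is_identifier("class"))    # prints False: reserved word
print(is_identifier("Class"))    # prints True: case is significant
```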

3.8 Character literals

Literals of type char are written as the desired character between opening-single-quote symbols thus:

`a`

The backslash character has a special meaning within character literals in that the backslash character and one or more of the characters following it are replaced by a single character, as follows:

\a      alert (bell)
\b      backspace
\f      form feed
\n      line feed
\r      carriage return
\t      horizontal tab
\v      vertical tab
\\      \
\`      `
\"      "
\(ddd)  the character represented by the integer literal ddd

In the case of the form `\(ddd)`, ddd is any integer literal such that the resulting integer is within the range appropriate to the character set in use. There must be no whitespace between the brackets and the integer literal.

The use of any other character following the initial backslash is illegal. The amount of storage associated with each character and the character set supported are implementation dependent (a typical implementation might offer a choice of ASCII or Unicode).

[SC] Exactly one printable character, space character or backslash combination equivalent to one character must appear between the quotes. If the `\(ddd)` form is used then the integer literal ddd must be in the range of the supported character set.
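
A decoder for character literals could be sketched in Python as below. Several points are assumptions of this sketch rather than statements about any Perfect implementation: the character set is taken to be Unicode, an unescaped opening-single-quote character is rejected as the literal's content, underscores inside the integer literal are not handled, and the names SIMPLE_ESCAPES, ESCAPE_INT_RE and decode_char_literal are hypothetical.

```python
import re

# Single-character escapes from the table in section 3.8.
SIMPLE_ESCAPES = {
    "a": "\a", "b": "\b", "f": "\f", "n": "\n", "r": "\r",
    "t": "\t", "v": "\v", "\\": "\\", "`": "`", '"': '"',
}

# \(ddd) with a hexadecimal, binary or decimal literal and no whitespace.
ESCAPE_INT_RE = re.compile(r"\\\((0[xX][0-9A-Fa-f]+|0[bB][01]+|[0-9]+)\)")


def _int_value(text: str) -> int:
    lowered = text.lower()
    if lowered.startswith("0x"):
        return int(lowered, 16)
    if lowered.startswith("0b"):
        return int(lowered, 2)
    return int(lowered, 10)


def decode_char_literal(literal: str) -> str:
    """Decode a character literal, including its enclosing quote symbols."""
    if len(literal) < 3 or literal[0] != "`" or literal[-1] != "`":
        raise ValueError("not a character literal")
    body = literal[1:-1]
    if len(body) == 1 and body not in ("\\", "`") and body.isprintable():
        return body  # one printable character or a space
    if len(body) == 2 and body[0] == "\\" and body[1] in SIMPLE_ESCAPES:
        return SIMPLE_ESCAPES[body[1]]
    match = ESCAPE_INT_RE.fullmatch(body)
    if match:
        # chr() raises ValueError if the value is outside the Unicode range.
        return chr(_int_value(match.group(1)))
    raise ValueError("illegal character literal")


print(repr(decode_char_literal("`a`")))        # prints 'a'
print(repr(decode_char_literal(r"`\n`")))      # prints '\n'
print(repr(decode_char_literal(r"`\(0x41)`"))) # prints 'A'
```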

3.9 String literals

Literals of type "sequence of characters" (seq of char) are written as a sequence of characters enclosed in double quotes. The backslash character has the same special meaning as it does in character literals and every backslash sequence gives rise to a single character in the string. The closing quote must be on the same line as the opening quote.

Within a character or string literal, nonprintable characters (including newline and tab characters) are not permitted, and comments are not recognised.

[SC] The sequence between the quotes must comprise only printable characters, space characters and valid backslash sequences.

3.10 Integer literals

An integer literal is written as a sequence of decimal digits, or as a sequence of hexadecimal digits (0-9 and A-F or a-f) preceded by 0X or 0x, or as a sequence of binary digits preceded by 0B or 0b. There is no fixed limit on the size of integer literals; however, if the compiler or target uses bounded integers, an error message will be generated for any integer literal that cannot be represented. The case of any letter forming part of an integer literal is not significant.

Underscore characters may be inserted within the sequence of digits (but not at the start or the end) to improve readability.
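
The rules of this section can be sketched as a Python recogniser. Two interpretations here are assumptions of the sketch: for hexadecimal and binary literals, "the sequence of digits" is taken to begin after the 0x or 0b prefix, and runs of several adjacent underscores are accepted. The names INT_LITERAL_RE and parse_int_literal are hypothetical.

```python
import re

# Digits may be separated by underscores, but not begin or end with one.
INT_LITERAL_RE = re.compile(
    r"0[xX](?P<hex>[0-9A-Fa-f]+(?:_+[0-9A-Fa-f]+)*)"
    r"|0[bB](?P<bin>[01]+(?:_+[01]+)*)"
    r"|(?P<dec>[0-9]+(?:_+[0-9]+)*)"
)


def parse_int_literal(text: str) -> int:
    """Return the value of an integer literal, or raise ValueError."""
    match = INT_LITERAL_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not an integer literal: {text!r}")
    for group, base in (("hex", 16), ("bin", 2), ("dec", 10)):
        digits = match.group(group)
        if digits is not None:
            # Python integers are unbounded, so no size limit applies here.
            return int(digits.replace("_", ""), base)
    raise AssertionError("unreachable: one group always matches")


print(parse_int_literal("0xFF"))       # prints 255
print(parse_int_literal("0B1010"))     # prints 10
print(parse_int_literal("1_000_000"))  # prints 1000000
```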

3.11 Real literals

Real literals are written in the form s.s or ses or s.ses where s is any sequence of decimal digits and e is the letter e or the letter E optionally followed by a minus sign. The digit string following e or E is interpreted as a decimal exponent. Whitespace is not permitted within a real literal. Each digit string s may contain embedded underscore characters to improve readability.
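
The three forms can be captured by one regular expression, sketched here in Python. The sketch assumes that "embedded" means an underscore may not begin or end a digit string, and it rejects a plus sign after e or E because the text above mentions only a minus sign. The names REAL_LITERAL_RE and is_real_literal are hypothetical.

```python
import re

# A digit string s: decimal digits with optional embedded underscores.
_S = r"[0-9]+(?:_+[0-9]+)*"

# s, then an optional fraction ".s", then an optional exponent "e[-]s".
REAL_LITERAL_RE = re.compile(
    rf"{_S}(?P<fraction>\.{_S})?(?P<exponent>[eE]-?{_S})?"
)


def is_real_literal(text: str) -> bool:
    """True if `text` has one of the forms s.s, ses or s.ses."""
    match = REAL_LITERAL_RE.fullmatch(text)
    if match is None:
        return False
    # With neither a fraction nor an exponent, the text is an integer literal.
    return match.group("fraction") is not None or match.group("exponent") is not None


print(is_real_literal("1.5"))       # prints True
print(is_real_literal("6.02E-23"))  # prints True
print(is_real_literal("42"))        # prints False
print(is_real_literal("1 .5"))      # prints False: whitespace is not permitted
```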

 

Perfect Language Reference Manual, Version 5.0, September 2011.
© 2001-2011 Escher Technologies Limited. All rights reserved.