Brass 2 source syntax - preliminary document

Text between -> and <- is an aside.

This document is preliminary and is subject to change.

==============================================================================================================

Notes
    - When I talk about char/string, they're .NET's types (so 16-bit UTF-16 code units).

Stuff open to discussion
    - String handling...
        - When you have multiple characters, how is that handled? "ld" -> "l" * 256 + "d"? If so, how do we
          differentiate between big/little endian? [Possible overridable property in assembler plugin]?
        - Do strings return a byte array or a char array? (In other words - how is encoding handled?)
            - Prefix? UTF8"Something" = UTF-8 string? Looks a bit messy.

New stuff:
    - Merged string tokens into constant tokens.
    - Difference between "executing" an expression and just taking the result.
    - Info on what makes a numeric constant or a label constant.
    - How commands are identified (assembler commands and directive commands).

==============================================================================================================


Basic syntax:

==============================================================================================================

GENERAL

1. Label, module, or directive names are *not* case sensitive. They never will be, and there will never be a
switch for this. Case sensitivity only serves to confuse matters, in my opinion.

==============================================================================================================

TOKENS

At the atomic level, a line of source is made up of tokens.
A token is represented by a series of characters that the parser attaches some significance to.
There are a variety of different sorts of token:

----

1. Operator token
This is a mathematical operator. Once swallowed by the parser the string representation of the operator is
ignored - the token is given a value from an enumeration.
The currently supported operators are:
{*}, {/}, {+}, {-}, {%}, {|}, {&}, {^}, {>}, {<}, {!}, {~}, {?}, {:}, {>>}, {<<}, {==}, {!=}, {>=}, {<=},
{**}, {&&}, {||}, {=}, {+=}, {-=}, {*=}, {/=}, {%=}, {&=}, {|=}, {^=}, {<<=}, {>>=}, {++}, {--}
Most of them should be recognisable from their C counterparts.

{?} and {:} are ternary conditional operators -> current note: ternary operators not yet supported <-
{**} is a power operator (2**8 returns 256).
See the section below on expressions for more information about the mathematics involved.

2. Punctuation token
These represent an item of punctuation. Recognised items of punctuation are:
{,} (Comma), {[} (OpenBracket), {]} (CloseBracket), {(} (OpenParenthesis),
{)} (CloseParenthesis), {\} (LineBreak)

3. Comment token
An entire comment is wrapped up into a single token. C-style (/* */) and assembly-style (;) will both be
supported. For example, {/* Comment */} or {; Comment}

4. Constant token
A constant token tries to represent a value of some kind, such as a numeric constant {164}, a string
constant {"A"} or a label {draw_screen}.
All constants provide properties to get or set a value (double precision). However, not all subclasses of
the constant token will let you get/set a value. Here are the constant token subclasses:

--- Numeric constant
    Stores a number. You can get a value from this, but not set one. Some examples: {123}, {$FF}, {.2d}.
    Numeric constants can have a prefix xor a suffix to denote a base. You may not have both.

                                        +------+--------+--------+
                                        | Base | Prefix | Suffix |
                                        +------+--------+--------+
                                        |    2 |   %    |   b    |
                                        |    8 |   @    |   o    |
                                        |   10 |        |   d    |
                                        |   16 |   $    |   h    |
                                        +------+--------+--------+

    All suffixes can be in lower or upper case. As base 10 has no prefix, a number without prefix or suffix 
    is assumed to be a decimal constant.
    Once the base of the number has been established, and the prefix/suffix removed, the rest of the token
    is examined. Depending on which base it is, the range of valid characters varies:

      +------+---------------------------------------------+
      | Base | Valid Characters                            |
      +------+---------------------------------------------+
      |    2 | 0 1                                         |
      |    8 | 0 1 2 3 4 5 6 7                             |
      |   10 | 0 1 2 3 4 5 6 7 8 9 .                       |
      |   16 | 0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F |
      +------+---------------------------------------------+

    Base 10 is the only one that supports the use of a radix point. The radix point is a full stop ('.'),
    regardless of operating system locale.

    If there are any invalid characters inside the numeric constant, or there is a problem with the prefix
    or suffix (for example, $FFh is an invalid number), the token is assumed to NOT be a numeric constant.
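A minimal sketch of the rules above, in Python rather than the .NET code Brass 2 itself uses. The function
name is made up for illustration. It makes two assumptions that the text does not settle: once a prefix is
found no suffix is looked for (so $1B is hexadecimal 1B, and $FFh fails on the 'h'), and a decimal constant
may contain at most one radix point.

```python
PREFIXES = {"%": 2, "@": 8, "$": 16}
SUFFIXES = {"b": 2, "o": 8, "d": 10, "h": 16}
VALID = {
    2: "01",
    8: "01234567",
    10: "0123456789.",
    16: "0123456789abcdefABCDEF",
}


def parse_numeric_constant(token):
    """Return the token's value as a float, or None if it is not numeric."""
    if not token:
        return None
    if token[0] in PREFIXES:
        # Assumption: a prefix rules out a suffix, so the rest is all digits.
        base, digits = PREFIXES[token[0]], token[1:]
    elif token[-1].lower() in SUFFIXES:
        base, digits = SUFFIXES[token[-1].lower()], token[:-1]
    else:
        base, digits = 10, token
    if not digits or any(c not in VALID[base] for c in digits):
        return None
    if base == 10:
        # Assumption: at most one radix point, and at least one digit.
        if digits.count(".") > 1 or digits == ".":
            return None
        return float(digits)
    return float(int(digits, base))
```

Returning None rather than raising an error matches the rule that a malformed token is simply assumed NOT to
be a numeric constant, so it can go on to be tried as a label.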

--- String constant
    Starts and ends with either " or '. You can get a value from this, but not set one. Examples: {"!"},
    {'hello'}.
    If a string constant is created using "double quotes" instead of 'single quotes', each character is
    translated using a user-defined character mapping table.
    Internally, characters are represented using Unicode.
    You can use \ as an escape character.

    ->  ToDo: insert information about escape characters 
        Also: Expand characters to 16-bit? ({W"Yo"} = four bytes with 'W'?)
              What value is returned? "A" is easy, "Hello you silly person" would be what? Some people use
              this, for example ld h,"H" \ ld l,"i" is performed with ld hl,"Hi" <-
              
--- Label constant
    A label constant is a token that has a name that is a valid label name. It might not actually correspond
    to an actual label; this cannot be determined until later on.
    Hence, you can set a value on this token, at which point it is either written to the label it refers to
    OR a new label is created to accommodate it.
    You can also try to get a value, but this might fail if the label you are trying to access doesn't exist.
    An example of a label constant that wouldn't be tied to an actual label is an assembly mnemonic.
    The label constant can be mutated to its correct type at a later point by an assembler or directive
    plugin.
    
    There is a special label constant, $, which will ALWAYS point to the current instruction pointer.
    
    If you put a colon on the END of the name, it means you are referring
    to the label's internal value. If you put a colon at the BEGINNING of the name, it is assumed that you are
    referring to the label's page number.
    If you omit the colon completely it is assumed that you are referring to the value.

    Hence:
    $  = 1 ~ sets value of current instruction pointer to 1
    $: = 2 ~ sets value of current instruction pointer to 2
    :$ = 3 ~ sets page of current instruction pointer to 3
    
    Label constants can be made up of any character with the exception of:
        - any whitespace.
        - any operator token.
        - any numeric constant prefix (%, $, @).
        - any punctuation token.
        - full stop (.).
        - colon (:).
    
    Labels may also not start with an Arabic numeral (0, 1, 2, 3, 4, 5, 6, 7, 8 or 9).
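The naming and colon rules can be sketched like this (Python, for illustration only; the function names and
the "value"/"page" strings are my own inventions). It assumes "any operator token" means any character that
appears in an operator, and it treats $ as the one exception to the prefix rule.

```python
OPERATOR_CHARS = set("*/+-%|&^><!~?:=")
PUNCTUATION_CHARS = set(",[]()\\")
FORBIDDEN = OPERATOR_CHARS | PUNCTUATION_CHARS | set("%$@.:")


def is_label_name(name):
    """Check a bare name (no colons) against the label naming rules."""
    if name == "$":
        return True  # special label: the current instruction pointer
    if not name or name[0] in "0123456789":
        return False
    return not any(c.isspace() or c in FORBIDDEN for c in name)


def parse_label_reference(token):
    """Split a token into (name, field), or return None if it is not a label."""
    if token.startswith(":"):
        name, field = token[1:], "page"    # leading colon: page number
    elif token.endswith(":"):
        name, field = token[:-1], "value"  # trailing colon: value
    else:
        name, field = token, "value"       # no colon: value
    return (name, field) if is_label_name(name) else None
```

So {$} and {$:} both come out as ("$", "value"), whilst {:$} comes out as ("$", "page").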

--- Directive constant
    These do not fit the rules of a numeric or a label constant.
    They do, however, start with a # or a . and are followed by a string that would be valid as a label.
    You cannot get or set a numeric value from them.
    

All tokens know their index in the original string and their basic string representation.

Operator tokens also know their "parenthesis" index - how many opened parentheses are between them and the
start of the source line. For example:
 
 .  . .  .  .   .  .  .
3+(2+4/(4-3)+((5*4)/2)-1)
 0  1 1  2  1   3  2  1  <- each operator token's parenthesis index.

This is used by the expression parser.
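A rough sketch of how the parenthesis index could be worked out (Python, illustration only; the function name
is hypothetical). It scans the raw string and greedily matches the longest operator first. It assumes there
are no strings, comments or %/$ numeric prefixes on the line, which a real tokeniser would have to deal with
first.

```python
OPERATORS = sorted(
    ["*", "/", "+", "-", "%", "|", "&", "^", ">", "<", "!", "~", "?", ":",
     ">>", "<<", "==", "!=", ">=", "<=", "**", "&&", "||", "=", "+=", "-=",
     "*=", "/=", "%=", "&=", "|=", "^=", "<<=", ">>=", "++", "--"],
    key=len, reverse=True)  # longest first, so "<<=" wins over "<<" and "<"


def parenthesis_indices(source):
    """Return (operator, string index, parenthesis index) for each operator."""
    found, depth, i = [], 0, 0
    while i < len(source):
        if source[i] == "(":
            depth += 1
        elif source[i] == ")":
            depth -= 1
        else:
            match = next((op for op in OPERATORS if source.startswith(op, i)), None)
            if match:
                found.append((match, i, depth))
                i += len(match)
                continue
        i += 1
    return found
```

Feeding it the example line above gives the parenthesis indices 0, 1, 1, 2, 1, 3, 2, 1, in that order.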

==============================================================================================================

EXPRESSIONS

Some tokens can be grouped together to form entire expressions. An expression of more than one token will be
a mathematical expression. An expression with one token might or might not be.

A single-token expression could be any token, so they are not especially interesting. If you had the following
sequence of tokens, though:

ld a, (50 * -5) / $10 ; Load something into the accumulator

...you would end up with these expressions.

1. Constant    {ld}
2. Constant    {a}
3. Punctuation {Comma}
4. Punctuation {OpenParenthesis}
   Constant    {50}
   Operator    {Multiplication}
   Operator    {UnarySubtraction}
   Constant    {5}
   Punctuation {CloseParenthesis}
   Operator    {Division}
   Constant    {$10}
5. Comment     {; Load something into the accumulator}

4 is the one of interest, being made up of multiple tokens.
Expressions can be evaluated. The process for this is rather involved:

1. Cycle through the sequence of tokens. Create a list of all operator tokens, and strip out punctuation.

2. Sort the revised list of operators (OperatorToken : IComparable). The sort takes into consideration these
   factors, in order:
   - Are the parenthesis indices different? If so, compare them and return the result.
   - Are the operators different? If so, compare them and return the result.
   - Compare the tokens' indices within the original string and return the result. For unary operators, the
     further right it is the higher its precedence. For binary operators, the further left it is the
     higher its precedence.
     
     -~1   <- the ~ has precedence over the -
     1+2-3 <- the + has precedence over the -
     
     See table below for operator order of precedence.
        
3. Go through the sorted list of operators. If it's binary:

       {4} {+} {5}
        |___ ___|
            |______ Grab the outer tokens and perform the operation on them.
            
       {4} {9} {5}
            |______ Replace the operator with the result
            
        .  {9}  .
        |_______|__ Delete the two constants.
		
   If it's unary:

        {-} {1}
             |___ Grab the rightmost token and perform the operation on it.

         .  {-1}
              |__ Replace the token with the result and delete the operator.

4. Return the only remaining token's value.
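The four steps can be sketched as follows. This is Python for illustration, not the OperatorToken :
IComparable code itself, and the Op class and function names are made up. It takes a ready-made token list
(numbers, operator strings and parentheses) and only covers a subset of the operators: unary + - ~, then
**, * /, + -, << >>. Assignment is left out because it needs a label table.

```python
from dataclasses import dataclass

UNARY = {
    "+": lambda a: +a,
    "-": lambda a: -a,
    "~": lambda a: float(~int(a)),
}
# symbol -> (precedence level, operation); a lower level binds tighter.
BINARY = {
    "**": (1, lambda a, b: a ** b),
    "*": (2, lambda a, b: a * b),
    "/": (2, lambda a, b: a / b),
    "+": (3, lambda a, b: a + b),
    "-": (3, lambda a, b: a - b),
    "<<": (4, lambda a, b: float(int(a) << int(b))),
    ">>": (4, lambda a, b: float(int(a) >> int(b))),
}


@dataclass(eq=False)
class Op:
    symbol: str
    depth: int   # parenthesis index
    index: int   # position in the original token list
    unary: bool

    def sort_key(self):
        level = 0 if self.unary else BINARY[self.symbol][0]
        # Unary: further right goes first. Binary: further left goes first.
        position = -self.index if self.unary else self.index
        return (-self.depth, level, position)


def evaluate(tokens):
    # Step 1: record the operators and strip out the punctuation.
    items, depth = [], 0
    for index, tok in enumerate(tokens):
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
        elif isinstance(tok, str):
            prev = tokens[index - 1] if index else "("
            unary = isinstance(prev, str) and prev != ")"
            items.append(Op(tok, depth, index, unary))
        else:
            items.append(float(tok))
    # Step 2: sort the operators. Step 3: reduce them one at a time.
    for op in sorted((i for i in items if isinstance(i, Op)), key=Op.sort_key):
        at = next(i for i, item in enumerate(items) if item is op)
        if op.unary:
            items[at:at + 2] = [UNARY[op.symbol](items[at + 1])]
        else:
            operation = BINARY[op.symbol][1]
            items[at - 1:at + 2] = [operation(items[at - 1], items[at + 1])]
    # Step 4: exactly one value should be left.
    (result,) = items
    return result
```

With the earlier {(50 * -5) / $10} expression passed in as ["(", 50, "*", "-", 5, ")", "/", 16], this
returns -15.625.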

Order of precedence:

                +-----------------+-----------------------------------+-------------------+
             -> | Unary           | + - ! ~ ++ --                     | Higher precedence |
                + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
                | Power           | **                                |                   |
                + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
                | Multiplicative  | * / %                             |                   |
                + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
                | Additive        | + -                               |                   |
                + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
                | Shift           | << >>                             |                   |
                + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
                | Relational      | < > <= >=                         |                   |
                + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
                | Equality        | == !=                             |                   |
                + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
                | Logical         | &                                 |                   |
                |                 | ^                                 |                   |
                |                 | |                                 |                   |
                + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
                | Conditional     | &&                                |                   |
                |                 | ||                                |                   |
                + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
             -> | Assignment      | = += -= *= /= %= &= |= ^= <<= >>= | Lower precedence  |
                +-----------------+-----------------------------------+-------------------+
                    
Note that both unary and assignment operators are regarded as being right-associative; that is, they are
evaluated from right-to-left as opposed to from left-to-right when compared to operators of the same
precedence and at the same parenthesis level.

Note that ++ and -- are both prefix unary operators. That is to say, ++x works, but x++ doesn't.

For a constant token to be able to be evaluated it needs to contain either;
    - a constant numeric value, or;
    - a label's name.

The parser will try to evaluate it as a numeric constant before trying to look it up as a label.

The assignment operators will try and write back the result of the operation to the argument to the left of
the operator. For example (running Brass 2 in interactive mode):

    > $ = 10
    = 10
    > $ *= 2
    = 20
    > $ *= 2
    = 40
    > $ *= 2
    = 80
    > $ += $
    = 160
    > $ = 1.1
    = 1.1
    > $ *= $
    = 1.21
    > $ *= $
    = 1.4641
    > $ *= $
    = 2.14358881
    > $ *= $
    = 4.59497298635722
    > 4 = 3
    E Cannot assign to constants
    
(Here $ represents the label that is used to store the current instruction pointer).
 
Unlike Brass 1, where '=' was treated as a 'magic' directive (aliased to .equ), the new Brass 2 expression
parser handles assignment operators natively.
 
 -> Shouldn't be a problem: Brass 1 didn't accept = as an operator at all as far as I can tell, so potential
    pitfalls with code such as ".if x=1" shouldn't crop up. <-

You can do things like this:

    > $: = 10
    = 10
    > :$ = 2
    = 2
    > :$ *= $:
    = 20
    > :$
    = 20
    > $:
    = 10
    
If an assignment operator cannot write back to the argument to the left of it AND that argument is visibly not
a numeric constant AND the token name before the operator is a valid name, a new label is created with the
result of the assignment. Otherwise, an error is displayed.
 
    > x = y
    E Invalid constant/label name 'y'
    > y = 5
    = 5
    > x = y += y
    = 10
    > x
    = 10
    
As well as evaluating an expression to find the result, there is also functionality to execute an expression.
An executed expression MUST make at least one assignment.
When you execute an expression, if there is only one token and it's a label token, it is assigned the latest
version of the instruction counter.

Evaluated:

1+1 [2]
x   [error: label 'x' not found]
x=3 [3]
x   [3]
y   [error: label 'y' not found]

Executed:

1+1 [Error: no assignment made]
x   [value of instruction counter]
x=3 [3]
x   [Error: duplicate label]
y   [value of instruction counter]

Evaluated expressions would be used as arguments to assembly instructions (eg {ld a,48+3}).
Executed expressions would be used to create new labels (eg {spawn_enemy:})
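To make the evaluate/execute distinction concrete, here is a small Python sketch. Everything in it is
hypothetical (the Labels class, the LabelError exception and the toy evaluator, which only understands
"name = ..." and "+"); the point is just that executing counts assignments and treats a lone label specially.

```python
class LabelError(Exception):
    pass


class Labels:
    def __init__(self, counter=0):
        self.values = {}
        self.counter = counter  # stands in for the instruction counter
        self.assignments = 0

    def get(self, name):
        if name not in self.values:
            raise LabelError(f"label '{name}' not found")
        return self.values[name]

    def set(self, name, value):
        self.values[name] = value
        self.assignments += 1
        return value


def toy_evaluate(labels, tokens):
    """Stand-in evaluator: handles 'name = ...' and '+' only."""
    if len(tokens) > 2 and tokens[1] == "=":
        return labels.set(tokens[0], toy_evaluate(labels, tokens[2:]))
    terms = [t for t in tokens if t != "+"]
    return sum(labels.get(t) if isinstance(t, str) else t for t in terms)


def execute(labels, tokens):
    """Evaluate, but insist that at least one assignment is made."""
    if len(tokens) == 1 and isinstance(tokens[0], str):
        # A lone label is given the current instruction counter.
        name = tokens[0]
        if name in labels.values:
            raise LabelError(f"duplicate label '{name}'")
        return labels.set(name, labels.counter)
    before = labels.assignments
    result = toy_evaluate(labels, tokens)
    if labels.assignments == before:
        raise LabelError("no assignment made")
    return result
```

Running the two columns above through toy_evaluate and execute respectively gives the same results and the
same errors, in the same order.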

==============================================================================================================

COMMANDS

A line of source code is made up of commands. There are a number of different types of command. For example:

.align 256 \ Function: ld a, 10 /* Load 10 into the accumulator */
1111111111   222222222 33333333 4444444444444444444444444444444444

1: Directive command.
Directives are made up of a constant token starting with either a '.' or '#' character. There is NO difference
in behaviour; all directives can be invoked using either variant. Picking just one or the other would break
existing code. Mixing the two can be used for clarity in code, if you so wish:

#if condition1
.if condition2
.else
.endif
#else
.if condition2
.else
.endif
#endif

Brass 1 used . for all directives, but offered # aliases for a few TASM ones for backwards compatibility. I
think the best action is just to allow both. Comments would be useful.

2: Expression command.
This is just an expression. As seen above, if the expression only consists of a label name then that label
will be assigned the current value of the instruction counter.
If no assignments were made by the expression in this context, an error is displayed (to stop duplicate
label names).

3: Assembler command.
The exact internal syntax depends on the currently loaded assembler plugin.

4: Comment command.
This just contains a comment.

To identify and group commands from expressions, the process is this:

1. Set all comment expressions as comment commands. These are easy.
2. Group all remaining expression groups by the {\} LineBreak punctuation character.
3. Detect whether it's a directive or an assembler command. For directives, is the first token a constant that
   starts with a '.' or a '#'? If so, look up the directive from the current set of loaded plugins and see if
   there is a match. For assembler commands, pass them to the assembler plugin and see if it can make head or
   tail of them.
4. If we still don't know what it is, remove the first expression from the sequence - chances are it's a 
   label. Try and evaluate that, and also try and work out whether what follows it is a directive or assembler 
   command using the methods outlined in 3.

For example, we can do this with our above example, and get this:

1. {.align} {256}
2. {Function:} {ld} {a} {,} {10}

All I've done so far is split it by the {\} token and removed the comment.

Now, we can tell 1 is a directive by seeing that the first token is a constant token and starts with a .
A directive's arguments are assumed to start from that token and go on until the end of the command (so
finished with an end of line, a {\} token or a comment).

We can check if it's a valid directive or not by comparing the name against a hash table containing a list of 
the loaded directive plugins. As for how the directive is actually run, that's -> to be continued <-

To identify an assembler command, it is passed to the assembler plugin. The assembler plugin exposes this 
method:

    public override bool TryMatchSource(
        Parser.ExpressionGroup[] source,
        out Assembler.Instruction instruction,
        out int[] evaluatedIndices, 
        out int size) {
        
    }

First up, we try to pass in the entire thing. As you can see, we have a surplus {Function:} at the start of it,
and so the assembler plugin will not be able to match it (unless it's programmed to respond to the {Function:}
mnemonic). It'll return false. The exact inner workings of the assembler plugin are up to the plugin author.

Myself, I check if the first item is a constant. If not, return false. I then have a big switch on the number
of ExpressionGroups passed in (eg {CCF} is source.Length == 1, {LD}{A}{,}{10+3} would be source.Length == 4 
and so on). Under each of those I switch on the string value of the first token .ToUpper(), and so on and so
forth. The code for this is automatically generated by Brass 1.

If the assembler thinks something has matched, it'll respond "true" and output a unique identifier 
(instruction) relating to the matched instruction, which indices it would like to have evaluated (for
{res}{0}{,}{(ix+5)} it'd want indices 1 and 3) and how big it is (in bytes).

-> I think I might end up passing source by reference, so in the case of the above the assembler could remove
the IX token and so get the correct value evaluated - sounds sensible? <-

In this case, however, it won't match, so the command will be cleft in twain:

2.a. {Function:}
2.b. {ld} {a} {,} {10}

(well, the first expression is removed) and we try again. This time we *execute* the first part, and find that
that works and creates a new label, Function, with the value of $. We then pass {ld} {a} {,} {10} into the
assembler again and this time it matches successfully.
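The identification process can be sketched like this (Python, illustration only). Tokens are plain strings
here, comments are assumed to have been removed already, and try_match is a hypothetical callable standing
in for the plugin's TryMatchSource that just answers true or false. The function names are my own.

```python
def split_commands(tokens):
    """Group tokens into commands, splitting on the LineBreak token."""
    commands, current = [], []
    for tok in tokens:
        if tok == "\\":
            commands.append(current)
            current = []
        else:
            current.append(tok)
    commands.append(current)
    return [c for c in commands if c]


def is_directive(tokens, directives):
    """True if the first token is '.name' or '#name' for a loaded directive."""
    first = tokens[0]
    return first[0] in ".#" and first[1:].lower() in directives


def identify(tokens, directives, try_match):
    """Return (label, kind, tokens); label is None if nothing was stripped."""
    # First try the whole command, then try again without its first token.
    for label, rest in ((None, tokens), (tokens[0], tokens[1:])):
        if rest and is_directive(rest, directives):
            return label, "directive", rest
        if rest and try_match(rest):
            return label, "assembler", rest
    return None, "expression", tokens
```

When a label comes back alongside the command, the caller would *execute* it as an expression (creating the
label) before dealing with the rest.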

This is all, as you can guess, part of pass 1. On the second pass, all that is done is that we rattle through
the cached list of commands and re-execute them. In the case of the assembler, we call a different method,

    public override byte[] AssembleSource(
        Assembler.Instruction instruction,
        long[] evaluated) {

    }

...passing back the instruction ID it so thoughtfully handed us last time and the results of the expressions
it asked to be evaluated (so for our {res}{0}{,}{(ix+5)} demo it would be long[] { 0, 5 }). It would then
return an array of bytes that would be written to the current page by the compiler.
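A toy version of the two passes, in Python rather than as a real plugin. The names are hypothetical and the
"plugin" only knows one Z80 instruction, {ld a,n} (encoded as 3E n). The evaluate argument is a stand-in for
the expression evaluator, and label handling and the instruction pointer are left out entirely.

```python
LD_A_N = 1  # hypothetical instruction ID chosen by this toy plugin


def try_match_source(source):
    """Pass 1: return (instruction, evaluated_indices, size) or None."""
    if len(source) == 4 and [s.lower() for s in source[:3]] == ["ld", "a", ","]:
        return LD_A_N, [3], 2
    return None


def assemble_source(instruction, evaluated):
    """Pass 2: turn the cached instruction ID and values into bytes."""
    if instruction == LD_A_N:
        return bytes([0x3E, evaluated[0] & 0xFF])  # Z80: LD A,n
    raise ValueError("unknown instruction")


def assemble(commands, evaluate):
    cache, output = [], bytearray()
    # Pass 1: match each command and cache what the plugin asked for.
    for source in commands:
        matched = try_match_source(source)
        if matched is None:
            raise ValueError(f"cannot match {source}")
        instruction, indices, size = matched
        cache.append((instruction, [source[i] for i in indices], size))
    # Pass 2: evaluate the requested expressions and collect the bytes.
    for instruction, expressions, size in cache:
        values = [evaluate(e) for e in expressions]
        code = assemble_source(instruction, values)
        assert len(code) == size
        output += code
    return bytes(output)
```

The size handed back in pass 1 is what would let the instruction pointer advance before any values are
known; here it is only checked against what pass 2 actually produces.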

-- still more to come! --