Brass 2 source syntax - preliminary document

Stuff between -> <- are asides. This document is preliminary and is subject to change.

==============================================================================================================

Notes
 - When I talk about char/string, they're .NET's types (so 16-bit UCS-2).

Stuff open to discussion
 - String handling...
   - When you have multiple characters, how is that handled? "ld" -> "l" * 256 + "d"?
     If so, how do we differentiate between big/little endian? [Possible overridable property in assembler plugin]?
   - Do strings return a byte array or a char array? (In other words - how is encoding handled?)
     - Prefix? UTF8"Something" = UTF-8 string? Looks a bit messy.

New stuff:
 - Merged string tokens into constant tokens.
 - Difference between "executing" an expression and just taking the result.
 - Info on what makes a numeric constant or a label constant.
 - How commands are identified (assembler commands and directive commands).

==============================================================================================================

Basic syntax:

==============================================================================================================

GENERAL

1. Label, module, or directive names are *not* case sensitive. They will never be, there will never be a switch for this.
   Case sensitivity only serves to confuse matters, in my opinion.

==============================================================================================================

TOKENS

At the atomic level, a line of source is made up of tokens. A token is represented by a series of characters that the parser attaches some significance to. There are a variety of different sorts of token:

----

1. Operator token

This is a mathematical operator. Once swallowed by the parser the string representation of the operator is ignored - the token is given a value from an enumeration.
The currently supported operators are:

{*}, {/}, {+}, {-}, {%}, {|}, {&}, {^}, {>}, {<}, {!}, {~}, {?}, {:}, {>>}, {<<}, {==}, {!=}, {>=}, {<=}, {**}, {&&}, {||}, {=}, {+=}, {-=}, {*=}, {/=}, {%=}, {&=}, {|=}, {^=}, {<<=}, {>>=}, {++}, {--}

Most of them should be recognisable from their C counterparts.
{?} and {:} are ternary conditional operators -> current note: ternary operators not yet supported <-
{**} is a power operator (2**8 returns 256).
See the section below on expressions for more information about the mathematics involved.

2. Punctuation token

These represent an item of punctuation. Recognised items of punctuation are:

{,} (Comma), {[} (OpenBracket), {]} (CloseBracket), {(} (OpenParenthesis), {)} (CloseParenthesis), {\} (LineBreak)

3. Comment token

An entire comment is wrapped up into a single token. C-style (/* */) and assembly-style (;) will both be supported.
For example, {/* Comment */} or {; Comment}

4. Constant token

A constant token tries to represent a value of some kind, such as a numeric constant {164}, a string constant {"A"} or a label {draw_screen}.
All constants provide properties to get or set a value (double precision). However, not all subclasses of the constant token will let you get/set a value.
Here are the constant token subclasses:

---

Numeric constant

Stores a number. You can get a value from this, but not set one. Some examples: {123}, {$FF}, {.2d}.
Numeric constants can have a prefix xor a suffix to denote a base. You may not have both.

+------+--------+--------+
| Base | Prefix | Suffix |
+------+--------+--------+
| 2    | %      | b      |
| 8    | @      | o      |
| 10   |        | d      |
| 16   | $      | h      |
+------+--------+--------+

All suffixes can be in lower or upper case. As base 10 has no prefix, a number without prefix or suffix is assumed to be a decimal constant.
Once the base of the number has been established, and the prefix/suffix removed, the rest of the token is examined.
Depending on which base it is, the range of valid characters varies:

+------+---------------------------------------------+
| Base | Valid Characters                            |
+------+---------------------------------------------+
| 2    | 0 1                                         |
| 8    | 0 1 2 3 4 5 6 7                             |
| 10   | 0 1 2 3 4 5 6 7 8 9 .                       |
| 16   | 0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F |
+------+---------------------------------------------+

Base 10 is the only one that supports the use of a radix point. The radix point is a full stop ('.'), regardless of operating system locale.
If there are any invalid characters inside the numeric constant, or there is a problem with the prefix or suffix (for example, $FFh is an invalid number), the token is assumed to NOT be a numeric constant.

---

String constant

Starts and ends with either " or '. You can get a value from this, but not set one. Examples: {"!"}, {'hello'}.
If a string constant is created using "double quotes" instead of 'single quotes', each character is translated using a user-defined character mapping table. Internally, characters are represented using Unicode.
You can use \ as an escape character.

-> ToDo: insert information about escape characters
   Also: Expand characters to 16-bit? ({W"Yo"} = four bytes with 'W'?)
   What value is returned? "A" is easy, "Hello you silly person" would be what?
   Some people use this, for example ld h,"H" \ ld l,"i" is performed with ld hl,"Hi" <-

---

Label constant

A label constant is a token that has a name that is a valid label name. It might not actually correspond to an actual label; this cannot be determined until later on.
Hence, you can set a value on this token, at which point it is either written to the label it refers to OR a new label is created to accommodate it. You can also try to get a value, but this might fail if the label you are trying to access doesn't exist.
An example of a label constant that wouldn't be tied to an actual label is an assembly mnemonic.
The label constant can be mutated to its correct type at a later point by an assembler or directive plugin.
There is a special label constant, $, which will ALWAYS point to the current instruction pointer.
If you put a colon on the END of the name, it means you are referring to the label's internal value. If you put a colon at the BEGINNING of the name, it is assumed that you are referring to the label's page number. If you omit the colon completely it is assumed that you are referring to the value. Hence:

$  = 1  ~ sets value of current instruction pointer to 1
$: = 2  ~ sets value of current instruction pointer to 2
:$ = 3  ~ sets page of current instruction pointer to 3

Label constants can be made up of any character with the exception of:
 - any whitespace.
 - any operator token.
 - any numeric constant prefix (%, $, @).
 - any punctuation token.
 - full stop (.).
 - colon (:).
Labels may also not start with an arabic numeral (0, 1, 2, 3, 4, 5, 6, 7, 8 or 9).

---

Directive constant

These do not fit the rules of a numeric or a label constant. They do, however, start with a # or a . and are followed by a string that would be valid as a label. You cannot get or set a numeric value from them.

All tokens know their index in the original string and their basic string representation.
Operator tokens also know their "parenthesis" index - how many opened parentheses are between them and the start of the source line. For example:

 .  . .  .  .   .  .  .
3+(2+4/(4-3)+((5*4)/2)-1)
 0  1 1  2  1   3  2  1   <- each operator token's parenthesis index.

This is used by the expression parser.

==============================================================================================================

EXPRESSIONS

Some tokens can be grouped together to form entire expressions. An expression of more than one token will be a mathematical expression. An expression with one token might or might not be. A single-token expression could be any token, so they are not especially interesting.
If you had the following sequence of tokens, though:

ld a, (50 * -5) / $10 ; Load something into the accumulator

...you would end up with these expressions.

1. Constant {ld}
2. Constant {a}
3. Punctuation {Comma}
4. Punctuation {OpenParenthesis}
   Constant {50}
   Operator {Multiplication}
   Operator {UnarySubtraction}
   Constant {5}
   Punctuation {CloseParenthesis}
   Operator {Division}
   Constant {$10}
5. Comment {; Load something into the accumulator}

4 is the one of interest, being made up of multiple tokens.

Expressions can be evaluated. The process for this is rather involved:

1. Cycle through the sequence of tokens. Create a list of all operator tokens, and strip out punctuation.

2. Sort the revised list of operators (OperatorToken : IComparable). The sort takes into consideration these factors, in order:
   - Are the parenthesis indices different? If so, compare them and return the result.
   - Are the operators different? If so, compare them and return the result.
   - Compare the token's index within the original string and return the result.
     For unary operators, the further right it is the higher precedence it is.
     For binary operators, the further left it is the higher precedence.
       -~1    <- the ~ has precedence over the -
       1+2-3  <- the + has precedence over the -
   See table below for operator order of precedence.

3. Go through the sorted list of operators.
   If it's binary:
     {4} {+} {5}
      |___ ___|
          |______ Grab the outer tokens and perform the operation on them.
     {4} {9} {5}
          |______ Replace the operator with the result.
      .  {9}  .
      |_______|__ Delete the two constants.
   If it's unary:
     {-} {1}
          |___ Grab the rightmost token and perform the operation on it.
      .  {-1}
      |__ Replace the token with the result and delete the operator.

4. Return the only remaining token's value.

Order of precedence:

   +-----------------+-----------------------------------+-------------------+
-> | Unary           | + - ! ~ ++ --                     | Higher precedence |
   + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
   | Power           | **                                |                   |
   + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
   | Multiplicative  | * / %                             |                   |
   + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
   | Additive        | + -                               |                   |
   + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
   | Shift           | << >>                             |                   |
   + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
   | Relational      | < > <= >=                         |                   |
   + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
   | Equality        | == !=                             |                   |
   + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
   | Logical         | &                                 |                   |
   |                 | ^                                 |                   |
   |                 | |                                 |                   |
   + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
   | Conditional     | &&                                |                   |
   |                 | ||                                |                   |
   + - - - - - - - - + - - - - - - - - - - - - - - - - - +                   |
-> | Assignment      | = += -= *= /= %= &= |= ^= <<= >>= | Lower precedence  |
   +-----------------+-----------------------------------+-------------------+

Note that both unary and assignment operators are regarded as being right-associative; that is, they are evaluated from right-to-left as opposed to from left-to-right when compared to operators of the same precedence and at the same parenthesis level.

Note that ++ and -- are both unary (prefix) operators. That is to say, ++x works, but x++ doesn't.

For a constant token to be able to be evaluated it needs to contain either:
 - a constant numeric value, or;
 - a label's name.
The parser will try to evaluate it as a numeric constant before trying to look it up as a label.

The assignment operators will try to write back the result of the operation to the argument to the left of the operator. For example (running Brass 2 in interactive mode):

> $ = 10
= 10
> $ *= 2
= 20
> $ *= 2
= 40
> $ *= 2
= 80
> $ += $
= 160
> $ = 1.1
= 1.1
> $ *= $
= 1.21
> $ *= $
= 1.4641
> $ *= $
= 2.14358881
> $ *= $
= 4.59497298635722
> 4 = 3
E Cannot assign to constants

(Here $ represents the label that is used to store the current instruction pointer).
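The sort-then-reduce evaluation procedure described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Brass 2 implementation: it handles decimal constants and a subset of the operators, leaves out labels and assignment, and every name in it (evaluate, TOKEN_RE and so on) is made up for the sketch.

```python
import re

# Precedence ranks (lower rank is applied first); a subset of the table above.
UNARY_RANK = 0
BINARY_RANK = {"**": 1, "*": 2, "/": 2, "%": 2, "+": 3, "-": 3,
               "<<": 4, ">>": 4, "&": 7, "^": 8, "|": 9}
BINARY_FUNC = {
    "**": lambda a, b: a ** b, "*": lambda a, b: a * b,
    "/": lambda a, b: a / b, "%": lambda a, b: a % b,
    "+": lambda a, b: a + b, "-": lambda a, b: a - b,
    "<<": lambda a, b: float(int(a) << int(b)),
    ">>": lambda a, b: float(int(a) >> int(b)),
    "&": lambda a, b: float(int(a) & int(b)),
    "^": lambda a, b: float(int(a) ^ int(b)),
    "|": lambda a, b: float(int(a) | int(b)),
}
UNARY_FUNC = {
    "+": lambda a: a, "-": lambda a: -a,
    "~": lambda a: float(~int(a)), "!": lambda a: 0.0 if a else 1.0,
}
# Hypothetical tokeniser: decimal numbers, operators and parentheses only.
TOKEN_RE = re.compile(r"\s*(\d+\.?\d*|\.\d+|\*\*|<<|>>|[-+*/%&^|~!()])")


def evaluate(source):
    # Step 1: collect tokens, note each operator's parenthesis index,
    # and strip out the punctuation (the parentheses themselves).
    slots, depth, pos = [], 0, 0
    source = source.rstrip()
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError("unexpected character at index %d" % pos)
        text, pos = match.group(1), match.end()
        if text == "(":
            depth += 1
        elif text == ")":
            depth -= 1
        elif text[0].isdigit() or text[0] == ".":
            slots.append(float(text))
        else:
            # An operator with no value directly before it is unary.
            unary = not slots or isinstance(slots[-1], dict)
            if text not in (UNARY_FUNC if unary else BINARY_FUNC):
                raise ValueError("unsupported operator " + text)
            rank = UNARY_RANK if unary else BINARY_RANK[text]
            slots.append({"op": text, "unary": unary, "depth": depth,
                          "rank": rank, "index": len(slots)})

    # Step 2: sort by parenthesis index (deepest first), then precedence,
    # then position (rightmost first for unary, leftmost first for binary).
    operators = [s for s in slots if isinstance(s, dict)]
    operators.sort(key=lambda o: (-o["depth"], o["rank"],
                                  -o["index"] if o["unary"] else o["index"]))

    def neighbour(start, step):
        # Nearest slot that has not been deleted yet.
        i = start + step
        while 0 <= i < len(slots) and slots[i] is None:
            i += step
        if not 0 <= i < len(slots) or isinstance(slots[i], dict):
            raise ValueError("missing operand")
        return i

    # Step 3: replace each operator with its result and delete the operands.
    for op in operators:
        i = op["index"]
        right = neighbour(i, +1)
        if op["unary"]:
            slots[i] = UNARY_FUNC[op["op"]](slots[right])
        else:
            left = neighbour(i, -1)
            slots[i] = BINARY_FUNC[op["op"]](slots[left], slots[right])
            slots[left] = None
        slots[right] = None

    # Step 4: only one value should remain.
    remaining = [s for s in slots if s is not None]
    if len(remaining) != 1:
        raise ValueError("malformed expression")
    return remaining[0]
```

Calling evaluate("(50 * -5) / 16") walks the same path as expression 4 in the earlier example (with $10 written in decimal): the unary minus is applied first, then the multiplication inside the parentheses, then the division.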
Unlike Brass 1, where '=' was treated as a 'magic' directive (aliased to .equ), the new Brass 2 expression parser handles assignment operators natively.

-> Shouldn't be a problem: Brass 1 didn't accept = as an operator at all as far as I can tell, so potential pitfalls with code such as ".if x=1" shouldn't crop up. <-

You can do things like this:

> $: = 10
= 10
> :$ = 2
= 2
> :$ *= $:
= 20
> :$
= 20
> $:
= 10

If an assignment operator cannot write back to the argument to the left of it AND that argument is visibly not a numeric constant AND the token name before the operator is a valid name, a new label is created with the result of the assignment. Otherwise, an error is displayed.

> x = y
E Invalid constant/label name 'y'
> y = 5
= 5
> x = y += y
= 10
> x
= 10

As well as evaluating an expression to find the result, there is also functionality to execute an expression. An executed expression MUST make at least one assignment. When you execute an expression, if there is only one token and it's a label token, it is assigned the latest version of the instruction counter.

Evaluated:
  1+1   [2]
  x     [error: label 'x' not found]
  x=3   [3]
  x     [3]
  y     [error: label 'y' not found]

Executed:
  1+1   [Error: no assignment made]
  x     [value of instruction counter]
  x=3   [3]
  x     [Error: duplicate label]
  y     [value of instruction counter]

Evaluated expressions would be used as arguments to assembly instructions (eg {ld a,48+3}). Executed expressions would be used to create new labels (eg {spawn_enemy:})

==============================================================================================================

COMMANDS

A line of source code is made up of commands. There are a number of different types of command. For example:

.align 256 \ Function ld a, 10 /* Load 10 into the accumulator */
1111111111   22222222 33333333 4444444444444444444444444444444444

1: Directive command. Directives are made up of a constant token starting with either a '.' or '#' character.
There is NO difference in behaviour; all directives can be invoked using '.' or '#' variants. Picking one or the other would break existing code. Can be used for clarity in code, if you so wish:

#if condition1
    .if condition2
    .else
    .endif
#else
    .if condition2
    .else
    .endif
#endif

Brass 1 used . for all directives, but offered # aliases for a few TASM ones for backwards compatibility. I think the best action is just to allow both. Comments would be useful.

2: Expression command. This is just an expression. As seen above, if the expression only consists of a label name then it will be assigned the current value of the instruction counter. If no assignments were made by the expression in this context, an error is displayed (to stop duplicate label names).

3: Assembler command. The exact internal syntax depends on the currently loaded assembler plugin.

4: Comment command. This just contains a comment.

To identify and group commands from expressions, the process is this:

1. Set all comment expressions as comment commands. These are easy.
2. Group all remaining expression groups by the {\} LineBreak punctuation character.
3. Detect whether it's a directive or an assembler command. For directives, is the first token a constant, and does it start with a '.' or '#'? If so, look up the directive from the current set of loaded plugins and see if there is a match. For assembler commands, pass them to the assembler plugin and see if it can make head or tail of them.
4. If we still don't know what it is, remove the first expression from the sequence - chances are it's a label. Try to execute that, and also try to work out whether what follows it is a directive or assembler command using the methods outlined in 3.

For example, we can do this with our above example, and get this:

1. {.align} {256}
2. {Function:} {ld} {a} {,} {10}

All I've done so far is split it by the {\} token and removed the comment.
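As a rough illustration of steps 1 to 3 up to this point, here is a minimal Python sketch that sets comments aside, splits the remaining tokens on the {\} token, and checks whether a group looks like a known directive. It assumes tokens arrive as plain strings; the function names and the set of loaded directive names are hypothetical and are not part of Brass 2.

```python
def is_comment(token):
    # Assembly-style (;) and C-style (/* */) comments arrive as single tokens.
    return token.startswith(";") or token.startswith("/*")


def split_commands(tokens):
    """Return (groups, comments): token groups split on the line break token,
    plus the comment tokens that were set aside as comment commands."""
    comments = [t for t in tokens if is_comment(t)]
    groups, current = [], []
    for token in tokens:
        if is_comment(token):
            continue
        if token == "\\":
            # LineBreak punctuation ends the current group (empty groups are
            # dropped here, which is an assumption of this sketch).
            if current:
                groups.append(current)
            current = []
        else:
            current.append(token)
    if current:
        groups.append(current)
    return groups, comments


def is_directive(group, loaded_directives):
    # Directive names are not case sensitive; '.' and '#' are interchangeable.
    first = group[0]
    return first[:1] in (".", "#") and first[1:].lower() in loaded_directives


tokens = [".align", "256", "\\", "Function:", "ld", "a", ",", "10",
          "/* Load 10 into the accumulator */"]
groups, comments = split_commands(tokens)
```

With the example line, groups holds the two commands listed above and comments holds the single comment token; only the first group passes is_directive when "align" is among the loaded directive names.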
Now, we can tell 1 is a directive by seeing that the first token is a constant token and starts with a '.'. A directive's arguments are assumed to start from that token and go on until the end of the command (so finished with an end of line, a {\} token or a comment). We can check if it's a valid directive or not by comparing the name against a hash table containing a list of the loaded directive plugins. As for how the directive is actually run, that's -> to be continued <-

To identify an assembler command, it is passed to the assembler plugin. The assembler plugin exposes this method:

    public override bool TryMatchSource(
        Parser.ExpressionGroup[] source,
        out Assembler.Instruction instruction,
        out int[] evaluatedIndices,
        out int size) {
    }

First up, we try to pass in the entire thing. As you can see, we have a surplus {Function:} at the start of it, and so the assembler plugin will not be able to match it (unless it's programmed to respond to the {Function:} mnemonic). It'll return false.

The exact inner workings of the assembler plugin are up to the plugin author. Myself, I check if the first item is a constant. If not, return false. I then have a big switch on the number of ExpressionGroups passed in (eg {CCF} is source.Length == 1, {LD}{A}{,}{10+3} would be source.Length == 4 and so on). Under each of those I switch on the string value of the first token .ToUpper(), and so on and so forth. The code for this is automatically generated by Brass 1.

If the assembler thinks something has matched, it'll respond "true" and output a unique identifier (instruction) relating to the matched instruction, which indices it would like to have evaluated (for {res}{0}{,}{(ix+5)} it'd want indices 1 and 3) and how big it is (in bytes).

-> I think I might end up passing source by reference, so in the case of the above the assembler could remove the IX token and so get the correct value evaluated - sounds sensible? <-

In this case, however, it won't match.
In this case, the command will be cleft in twain:

2.a. {Function:}
2.b. {ld} {a} {,} {10}

(well, the first expression is removed) and we try again. This time we *execute* the first part, and find that that works and creates a new label, Function, with the value of $. We then pass {ld} {a} {,} {10} into the assembler again and this time it matches successfully.

This is all, as you can guess, part of pass 1. On the second pass, all that is done is that we rattle through the cached list of commands and re-execute them. In the case of the assembler, we call a different method,

    public override byte[] AssembleSource(
        Assembler.Instruction instruction,
        long[] evaluated) {
    }

...passing back the instruction ID it so thoughtfully handed us last time and the results of the expressions it asked to be evaluated (so for our {res}{0}{,}{(ix+5)} demo it would be long[] { 0, 5 }). It would then return an array of bytes that would be written to the current page by the compiler.

-- still more to come! --