KatScript grammar, parser and generator¶
Source code: script
KatScript is an application-specific, declarative language used to build Finesse
models as an alternative interface to the Python API. The KatScript parser converts
KatScript from text files or the user’s shell into Finesse objects like models,
components and detectors. The KatScript generator does the opposite, taking a
Model
and unparsing it back to KatScript.
Collections of tokens within KatScript represent elements such as mirror
,
detectors such as pd
, functions such as modes
, analyses
such as xaxis
, or comments and whitespace. Commands themselves may also
contain expressions, and expressions may contain symbols. The parser has to handle all
of these types of syntax.
Parsing KatScript is more complicated than many other typical small application-specific languages due to the nature of the script and the way in which it is typically used. KatScript is a declarative language, meaning that the order in which commands are defined should not influence the results; yet, because it has to create objects in Python - an imperative language - the parser must figure out the correct order in which to create commands and expressions that might rely on those defined elsewhere. Another issue to handle arises from the requirement to be able to regenerate KatScript from a parsed model in a similar form to the original script that built the model. This need arises from the way that Finesse is typically used when building “base” scripts to represent gravitational wave detectors. The process of constructing such scripts involves optimising particular aspects of the script sequentially, where the outputs from analyses are used to update the script before another aspect is optimised. The final script is then the form that is shared with others who can then use the optimised script to study a particular behaviour of the interferometer at the operating point.
This page describes the formal KatScript grammar, and how the parsing and unparsing toolchains work on a high level.
Note
While the process of converting some script into machine code is typically called
“parsing”, parsing is strictly just one part of the toolchain. As is typical, the
Finesse “parser” system is actually made up of a “tokeniser”, a “parser” and a
“compiler”. Similarly, “unparsing” collectively refers to the process of converting
a Model
to KatScript, containing an “uncompiler” (itself containing an
“unbuilder” and “unfiller”), an “unparser”, and an “untokenizer”.
Grammar¶
Source code: script/parser.py
KatScript is a declarative grammar supporting statements and arguments. The following
sections contain a close approximation of the grammar rules in EBNF notation. When run, the
parser starts at the start
rule and recursively reduces tokens
until finished, backtracking if necessary.
Note
The declarations below are not an exact definition of KatScript; the real grammar is more complicated due to the way in which whitespace and errors are handled. It is sufficient however for the purpose of outlining the key rules.
Script¶
Scripts contain zero or more script_line
productions followed by
an ENDMARKER
token.
start ::= (script_line
+)?ENDMARKER
script_line ::= (statement
)?NEWLINE
Statements¶
Statements are either element-like (space delimited) or function-like (comma delimited). Both have types but only elements have names.
statement ::=element
|function
Elements¶
Element statements are space-delimited and may not span multiple lines. They take the
form myelement myname 1 2 3 d=4 e=5
.
element ::=NAME
empty
+NAME
(empty
+element_params
)? element_params ::=element_value_list
empty
+element_key_value_list
|element_value_list
|element_key_value_list
element_value_list ::=positional_value
(empty
+element_value_list
)? element_key_value_list ::=key_value
(empty
+element_key_value_list
)?
Functions¶
Function statements are comma-delimited, enclosed in brackets and may span multiple
lines. They take the form myfunction(1, 2, 3, d=4, e=5)
.
function ::=NAME
"("function_params
? ")" function_params ::=function_value_list
empty
* ","empty
*function_key_value_list
(empty
* ',')? |function_value_list
(empty
* ",")? |function_key_value_list
(empty
* ",")? function_value_list ::=positional_value
empty
* (","empty
*function_value_list
)* function_key_value_list ::=key_value
empty
* (","empty
*function_key_value_list
)?
Parameters¶
Parameters can be positional (simply a value) or keyword (a key, an equals sign, and a value).
positional_value ::=value
!"=" value ::=expr
|function
|array
|NONE
|BOOLEAN
|STRING
key_value ::=NAME
"="value
Expressions¶
Expressions can be unary (e.g. -1
) or binary (e.g. 1+2
). Standard operator
precedence is baked in to the
rules by the way in which they are separated and referenced by each other.
Note
In practice, there are additional aspects to these rules to match
empty
productions between terms of the expressions if it is in
a context that allows it (e.g. a parenthesised expression or
function
).
expr ::= (expr
("+" | "-"))?expr1
expr1 ::= (expr1
("*" | "/" | "//"))?expr2
expr2 ::= ("+" | "-")expr2
|expr3
expr3 ::=expr4
("**"expr2
)? expr4 ::= "("expr
")" |function
|array
|NUMBER
|NAME
|REFERENCE
Arrays¶
Arrays are comma-delimited lists
of values
or arrays
contained within brackets ([...]
).
Numerical arrays are created with type numpy.ndarray
when a standard
array
is contained within an expression.
array ::= "["empty
*array_values
?empty
* "]" array_values ::=array_value
(empty
* ","empty
*array_values
)* (empty
* ",")? array_value ::=expr
|array
|NONE
|NUMBER
|NAME
|BOOLEAN
|REFERENCE
|STRING
Empty¶
Emptiness can be represented by whitespace, comments or newlines. Emptiness is sometimes
important (e.g. between parameters of an element
), sometimes not
(e.g. between parameters of a function
).
empty ::=WHITESPACE
|COMMENT
|NEWLINE
Tokens¶
Tokens are matched using regular expressions, as shown below. The matching is pretty
simple except for numbers
. Numbers may be integer, float,
imaginary or scientific, and may contain SI prefices such as k. Scientific forms
cannot also use SI prefices. Number tokens are not tokenised with signs. Such unary
operations are matched as expressions and eventually converted into functions in a later
stage. Similarly, complex numbers are actually represented by two expressions (the real
and imaginary parts) and a binary operator (+
or -
), built by the parser into
functions. The number matching regular expression is particularly huge; see
finesse.script.tokens.NUMBER
for the full form.
Note
The ENDMARKER
entry below is a meta token which has no
representation in string form. It is used by the tokeniser to indicate the end of
the current file.
NAME ::= [a-zA-Z_][a-zA-Z0-9_.]* NUMBER ::= (see text) STRING ::= '[^\n'\\]*(?:\\.[^\n'\\]*)*'|"[^\n"\\]*(?:\\.[^\n"\\]*)*" BOOLEAN ::= true|True|false|False NONE ::= none|None WHITESPACE ::= [ \f\t]+ COMMENT ::= #[^\r\n]* NEWLINE ::= \r?\n+ ENDMARKER ::= (see note)
Parsing¶
Parsing starts with the tokeniser, which converts raw characters into particular token
types such as NAME
, STRING
,
NUMBER
, COMMENT
, etc.
Because it must be possible to regenerate KatScript including comments and whitespace from a parsed Finesse model, the parsing process produces a concrete syntax tree rather than an abstract syntax tree as is more typical. The tokeniser therefore does not ignore any token types, and the parser produces productions containing whitespace and comment tokens.
In the documentation below, tokens of a particular type are referred to using uppercase,
such as NUMBER
to distinguish them from parsed productions such as
expr
(expression).
Tokeniser¶
Source code: script/tokenizer.py
The available KatScript token types are defined in src/finesse/script/tokens.py
using regular expressions. Text may match multiple token types, but the order in which
they are defined in the overall regular expression that is built from the individual
token expressions in src/finesse/script/tokens.py
determines the matching priority.
Parser¶
Source code: script/parser.py
The parser reads the token stream until exhausted and attempts to reduce them to productions, which are containers representing particular rules in KatScript grammar. The KatScript parser implementation is a so-called packrat parser, using recursive descent to perform unlimited lookahead, in contrast to e.g. LALR(1) parsers that can only look at the next character to decide which production to use to reduce the current tokens. This algorithm uses potentially unlimited memory but is guaranteed to run in linear time due to the use of caching. In practice memory use is limited by the input script size and KatScript grammar, which are both typically small with respect to available memory on modern computers. Packrat parsers also provide other advantages over LALR(1) such as the ability to handle left-recursive rules.
The first parser rule searches for all tokens followed by an
ENDMARKER
. The ENDMARKER
is a special token
yielded by the tokeniser to indicate the end of the file, and there should only be one,
which indicates the parsing should stop. Within the tokens it sees, the parser then
looks for lines (groups of tokens ending with a NEWLINE
)
containing statement
productions, which can themselves be either
element
or function
productions that accept
one or more arguments. These productions accept the same types of arguments but differ
in the way they require them to be delimited: in the case of
element
arguments, the delimiters are spaces (on a single line),
whereas for function arguments the delimiters are commas (and whitespace and newlines
are optional).
Arguments can be numbers, null, booleans, expressions, references and strings. Expressions can be arbitrarily nested and take more care to parse in a way that obeys operator precedence; multiple rules are therefore required.
Containers
Some grammar rules in the parser have equivalent containers, which are objects that hold the tokens that represent the
relevant production. Rules without equivalent containers usually provide data to a
higher order rule that has a container. These containers are primarily used by the
compiler but also for error handling, so they inherit from TokenContainer
which defines some useful attributes and properties for figuring out where in the input
the production appears. There are also ArgumentContainer
and
ExtraTokenContainer
mixins which containers can inherit to conveniently add
support for additional types of data. The attributes these mixins define are also
recognised and supported by the compiler, so container-specific rules in the compiler
are mostly not needed.
Arrays
Arrays are sequences of numbers, expressions, strings, or other arrays. This data
structure is provided to allow such sequences to be passed to elements and commands.
There can be some special behaviour depending on the location in which the array
appears. Stand-alone array arguments are passed as lists
, but arrays
embedded inside expressions are converted to ndarray
so as to support
element-wise operations such as 2 * [1, 2]
. This context-specific behaviour is
employed so that lists of strings can be passed to e.g. plot commands in KatScript
without having these converted to ndarray
, which are most suited to
containing numbers.
Note that in order to support the default list
type behaviour for arrays not
used in expressions, the array
rule has to come before
expr
in rules that can match both productions (otherwise all
arrays regardless of contents would be converted to ndarray
).
Error handling
Source code: script/exceptions.py
Error handling in the parser is not “free”. Due to the nature of the recursive descent algorithm, if the syntax is incorrect the parser will back-track to the last successful production and try the next rule. If all available alternative rules are exhausted, only then would a syntax error normally be raised. This normally leads to unhelpful error messages, without any indication as to the syntax that actually caused the error, because the parser back-tracked all the way to the first rule. It is therefore also important to build a number of special rules into the grammar that match common error conditions, which can then trigger more useful error messages.
Handled errors usually raise a KatScriptError
, which uses the provided error
item (a token or container) and the original script to figure out where to determine
where in the script the error(s) occurred. Depending on how Finesse is being run,
these error locations may be shown with markers or in a special colour. Here is an
example of an error message:
from finesse.script import parse
parse("laser l1 P=1 freq=1M") # "freq" is not a valid argument.
KatDirectiveBuildError: (use finesse.tb() to see the full traceback)
line 1: 'laser' got an unexpected keyword argument 'freq'
-->1: laser l1 P=1 freq=1M
^^^^
Syntax: l name P=1 f=0 phase=0 signals_only=false
Compiler¶
Source code: script/compiler.py
This section frequently shows the results of building KatScript into graphs, so we’ll define a convenience function to let us see them:
import matplotlib.pyplot as plt
from finesse.script.compiler import KatCompiler
from finesse.plotting import plot_graph
def plot_and_show(script, **kwargs):
"""Compile script and show the syntax graph."""
compiler = KatCompiler()
compiler.compile(script, **kwargs)
# Plot the graph, colouring nodes by their type.
plot_graph(
compiler.graph,
layout="planar",
node_labels=False, # Reduce clutter.
node_attrs=True,
edge_attrs=True,
attr_font_size=12,
edge_font_size=8,
ax=plt.figure(figsize=(12, 16)).gca(),
node_color_key=lambda node, data: data["type"]
)
The compiler’s job is to take the parsed productions and turn them into a model
and elements
containing the relevant parameter values
and references. There are three sub-steps to compilation: filling, resolving and
building. These all involve writing to or reading from a graph that gets constructed
to represent the KatScript’s syntax.
Filling
The filler matches different types of production yielded from the parsing step and
creates corresponding nodes in the graph. Productions that support arguments have these
arguments recursively filled as well, and edges of type ARGUMENT
are drawn to each
from the owning node.
As mentioned in the introduction, a concrete syntax tree is constructed because
Finesse cannot ignore (and therefore must parse) whitespace, comments and other
meta-tokens so that it can still reconstruct something resembling the original kat
script later. In practice, the tokens stored in the container parsed from a particular
part of the KatScript are stored as node attributes. All containers have a primary
token, which is stored as the field token
. Named containers (elements and functions)
also store a name_token
field, and extra characters such as the =
of a keyword
parameter are stored in extra_tokens
.
After filling, the graph is strictly a tree because references between
parameters of different elements have not yet been resolved. Everything descends from
the ROOT
node, which represents the model:
# Filled tree for a model containing a simple reference.
plot_and_show(
"""
l l1 P=(1/2)*l2.P
l l2 P=1
""",
resolve=False,
build=False,
)
Resolving
The resolver takes the graph and draws edges between dependencies created by references,
i.e. arguments containing tokens of type NAME
or
REFERENCE
. It also performs some sanity checks on the graph, such
as to check for invalid components, duplicate commands, and cycles between references
(i.e. that which would be created by e.g. m m1 R=m2.R T=0
and m m2 R=m1.R T=0
).
After resolving, the graph looks something like this:
# Resolved graph for a model containing a simple reference.
plot_and_show(
"""
l l1 P=(1/2)*l2.P
l l2 P=1
""",
resolve=True,
build=False,
)
Note that the l2.P node is now connected to l2
by a DEPENDENCY
edge. The
l2.P
node is the first statement (l l1 P=0.5*l2.P
)’s first argument (P
)’s
second argument. As a reference, this parameter depends on l2
existing by the time
it is being compiled, so an edge is drawn to the target to represen this dependency. We
don’t draw an edge to the target parameter node (P
) itself because the target
parameter is created at the same time as its owning element anyway (this however doesn’t
give us a way to handle references between parameters of the same element, but we’ll get
to that in the build step).
The resolver also figures out the correct order in which the elements must be built
given the dependencies created by references. If you look closely at the ARGUMENT
edges from the root node, you will see that the order
attribute has changed since
the filling step. This is because l1
depends on l2
(via its P=0.5*l2.P
argument) even though l1
was defined earlier in the script.
Building
The builder starts at the root of the resolved graph and recursively follows each
argument in order, building everything into relevant objects. It’s in this step that the
model is created (if one wasn’t provided), each element node is built into a relevant
ModelElement
. Arguments are passed to the ModelElement
constructors.
Statement-level functions behave similarly, with the arguments being passed to a
function instead of a model element.
The graph is used to determine whether a given expression can be eagerly evaluated.
Eager evaluation is possible for any expression node that represents a closed branch of
the graph, i.e. one without nested references. Such expressions are evaluated at compile
time, and only the resulting value is passed to the Python object or function. Compound
expressions containing references have only the closed sub-branches eagerly evaluated.
As an example, take the graph shown in the previous section. The expression for the
P
parameter, (1/2)*l2.P
, contains two top level clauses: (1/2)
and l2.P
.
The latter is a reference and must be left unevaluated at parse time, but the former can
be evaluated immediately. The builder therefore converts this expression to 0.5*l2.P
before it gets passed to Laser.__init__()
.
The final step in building is handling self-references, which are parameters that
reference another parameter in the same component. Such references create cycles in the
graph during the resolving step, but these are intentionally ignored at that time. These
references cannot be resolved in a single pass because element parameters are only
created when the element itself is created, via its __init__
method, and that method
requires initial values for the parameters. Instead, parameters that contain
self-references are built with an initial value set to a Resolving
symbol,
which, if evaluated (e.g. by code running in an element’s __init__
), will throw an
error. Once a component gets built, any parameters with Resolving
symbols then
are fully resolved by calling the relevant build routine once again. This time, because
the target parameters now exist (because the element was since created), it is possible
to make real references to them using Finesse symbols.
The conversion between parsed, resolved arguments and Finesse objects is handled by
KatSpec
, which contains a series of BaseAdapter
instances
representing the available statements. This way, only the language constructs need to be
hard-coded into the parser stack and the available commands, elements and analyses can
be changed, e.g. to register a custom component.
Continuing with the example used so far, shown below is the result of the build step. Since this is also the last step in the parsing process, this is also what gets returned to the user.
from finesse.script.compiler import KatCompiler
compiler = KatCompiler()
model = compiler.compile(
"""
l l1 P=(1/2)*l2.P
l l2 P=1
"""
)
The Model
contains the relevant items:
print(model.info())
<finesse.model.Model object at 0x7f0c181b0040>
1 optical mode
==============
- [0 0]
2 components
============
- Laser('l2', signals_only=False, f=0.0, phase=0.0, P=1.0)
- Laser('l1', signals_only=False, f=0.0, phase=0.0, P=(0.5*l2.P))
0 detectors
===========
0 cavities
==========
0 locking loops
===============
Unparsing¶
Source code: script/generator.py
The job of the unparser is to transform a Finesse Model
and other objects
into KatScript. Currently this is an experimental feature, though it is already quite
powerful - for example, combined with the legacy parser it is capable of converting
Finesse 2 style KatScript to that of Finesse 3.
The unparser can be used on models originally built with KatScript or via the Python API; it does not matter. The original KatScript form is currently not retained during model compilation, so unparsing does not retain the original form: script order may be different, comments are lost and whitespace is normalised. Future improvements to the unparser are planned to allow the original parsed form of a model to be retained as much as practical (see #200).
The unparser is a toolchain containing an uncompiler (itself containing an unbuilder and unfiller), an unparser, and an untokenizer. The majority of the computation takes place in the uncompiler stage, with unparsing and untokenizing relatively trivial operations. These are explained in the sections below.
Uncompiler¶
The uncompiler transforms a Model
or other Finesse object into KatScript by
first building a KatGraph
and then tokenising it. The graph filling is
performed by the unbuilder and the tokenising is performed by the unfiller.
Unbuilder
The graph is built using a functional programming dynamic dispatch paradigm, applying
polymorphic behaviour depending on type (using functools.singledispatch()
). The
dispatcher is called recursively, with the code for each matched type extracting and
dispatching contained objects, adding nodes for each intermediate step, until types that
have direct KatToken
analogues are reached. Edges are drawn connecting related
nodes, such as the name, parentheses, delimiters and parameters of functions
.
For example, a model gets built by passing it to the dispatcher. The dispatcher matches
and runs the code for Model
, which extracts its elements, commands and
analysis, further calling the dispatcher on each. In the case of the model’s elements,
the dispatcher matches and runs the code for ModelElement
, which extracts,
among other things, the model parameters
and once more passes
these to the dispatcher. Parameters containing terminal types such as numbers are again
dispatched, with the dispatcher this time matching code that creates a
KatNumberToken
. The graph filled by this sequence is a tree (unlike the graph
built by the parser, where references create edges between branches), with each branch
(representing elements, commands and analyses and their constituents therein) ultimately
terminating with nodes corresponding to tokens
.
Unfiller
The graph created by the unbuilder is converted into a series of production
objects by using the type attribute defined for each node to
determine which production constructor to call. Connected nodes are themselves
recursively converted to appropriate productions.
The productions are added to a KatScript
object. This is then passed to the
unparser.
Unparser¶
The unparser receives a KatScript
object, loops through its productions,
extracts its tokens in order and calls the untokenizer.
Untokenizer¶
The untokenizer takes KatToken
objects as input and converts them to their
corresponding KatScript string forms. In practice, the token’s raw value is simply
extracted and returned.