The KatScript parser¶
The KatScript parser converts KatScript from text files or the user’s shell into Finesse objects like models, components and detectors. It is a large and somewhat complicated subpackage of Finesse. This section outlines its main functions and organisation.
Starting from a text file or string, the parser has to create corresponding
finesse and Python objects.
KatScript is used to define the items for a single model. The commands within the script
can represent elements such as
mirror, detectors such as
pd, functions such as
modes, analyses such as
xaxis, or comments and whitespace. Commands themselves may also contain
expressions, and expressions may contain symbols. The parser has to handle all of these
types of syntax.
Parsing KatScript is more complicated than many other typical small application-specific languages due to the nature of the script and the way in which it is typically used. KatScript is a declarative language, meaning that the order in which commands are defined should not influence the results; yet, because it has to create objects in Python - an imperative language - the parser must figure out the correct order in which to create commands and expressions that might rely on those defined elsewhere. Another issue to handle arises from the requirement to be able to regenerate KatScript from a parsed model in a similar form to the original script that built the model. This need arises from the way that Finesse is typically used when building “base” scripts to represent gravitational wave detectors. The process of constructing such scripts involves optimising particular aspects of the script sequentially, where the outputs from analyses are used to update the script before another aspect is optimised. The final script is then the form that is shared with others who can then use the optimised script to study a particular behaviour of the interferometer at the operating point.
KatScript is a declarative grammar supporting statements and arguments. The following
sections contain a close approximation of the grammar rules in EBNF notation. When run, the
parser starts at the
start rule and recursively reduces tokens
until finished, backtracking if necessary.
The declarations below are not an exact definition of KatScript; the real grammar is more complicated due to the way in which whitespace and errors are handled. It is sufficient however for the purpose of outlining the key rules.
Scripts contain zero or more
script_line productions followed by
start ::= (
ENDMARKERscript_line ::= (
Statements are either element-like (space delimited) or function-like (comma delimited). Both have types but only elements have names.
Element statements are space-delimited and may not span multiple lines. They take the
myelement myname 1 2 3 d=4 e=5.
element_params)? element_params ::=
element_value_list)? element_key_value_list ::=
Function statements are comma-delimited, enclosed in brackets and may span multiple
lines. They take the form
myfunction(1, 2, 3, d=4, e=5).
function_params? ")" function_params ::=
empty* ',')? |
empty* ",")? |
empty* ",")? function_value_list ::=
function_value_list)* function_key_value_list ::=
Parameters can be positional (simply a value) or keyword (a key, an equals sign, and a value).
value!"=" value ::=
Expressions can be unary (e.g.
-1) or binary (e.g.
1+2). Standard operator
precedence is baked in to the
rules by the way in which they are separated and referenced by each other.
In practice, there are additional aspects to these rules to match
empty productions between terms of the expressions if it is in
a context that allows it (e.g. a parenthesised expression or
expr ::= (
expr("+" | "-"))?
expr1expr1 ::= (
expr1("*" | "/" | "//"))?
expr2expr2 ::= ("+" | "-")
expr2)? expr4 ::= "("
Arrays are comma-delimited
arrays contained within brackets (
Numerical arrays are created with type
numpy.ndarray when a standard
array is contained within an expression.
array ::= "["
empty* "]" array_values ::=
empty* ",")? array_value ::=
Emptiness can be represented by whitespace, comments or newlines. Emptiness is sometimes
important (e.g. between parameters of an
element), sometimes not
(e.g. between parameters of a
Tokens are matched using regular expressions, as shown below. The matching is pretty
simple except for
numbers. Numbers may be integer, float,
imaginary or scientific, and may contain SI prefices such as k. Scientific forms
cannot also use SI prefices. Number tokens are not tokenised with signs. Such unary
operations are matched as expressions and eventually converted into functions in a later
stage. Similarly, complex numbers are actually represented by two expressions (the real
and imaginary parts) and a binary operator (
-), built by the parser into
functions. The number matching regular expression is particularly huge; see
finesse.script.tokens.NUMBER for the full form.
ENDMARKER entry below is a meta token which has no
representation in string form. It is used by the tokeniser to indicate the end of
the current file.
NAME ::= [a-zA-Z_][a-zA-Z0-9_.]* REFERENCE ::= &[a-zA-Z_][a-zA-Z0-9_.]* NUMBER ::= (see text) STRING ::= '[^\n'\\]*(?:\\.[^\n'\\]*)*'|"[^\n"\\]*(?:\\.[^\n"\\]*)*" BOOLEAN ::= true|True|false|False NONE ::= none|None WHITESPACE ::= [ \f\t]+ COMMENT ::= #[^\r\n]* NEWLINE ::= \r?\n+ ENDMARKER ::= (see note)
The parsing toolchain¶
While the process of converting some script into machine code is typically called “parsing”, parsing is strictly just one part of the toolchain. A typical “parser” system is actually made up of a tokeniser, a parser and a compiler. The Finesse parser is no different.
Parsing starts with the tokeniser, which converts raw characters into particular token
types such as
Because it must be possible to regenerate KatScript including comments and whitespace from a parsed Finesse model, the parsing process produces a concrete syntax tree rather than an abstract syntax tree as is more typical. The tokeniser therefore does not ignore any token types, and the parser produces productions containing whitespace and comment tokens.
In the documentation below, tokens of a particular type are referred to using uppercase,
NUMBER to distinguish them from parsed productions such as
The available KatScript token types are defined in
using regular expressions. Text may match multiple token types, but the order in which
they are defined in the overall regular expression that is built from the individual
token expressions in
src/finesse/script/tokens.py determines the matching priority.
The parser reads the token stream until exhausted and attempts to reduce them to productions, which are containers representing particular rules in KatScript grammar. The KatScript parser implementation is a so-called packrat parser, using recursive descent to perform unlimited lookahead, in contrast to e.g. LALR(1) parsers that can only look at the next character to decide which production to use to reduce the current tokens. This algorithm uses potentially unlimited memory but is guaranteed to run in linear time due to the use of caching. In practice memory use is limited by the input script size and KatScript grammar, which are both typically small with respect to available memory on modern computers. Packrat parsers also provide other advantages over LALR(1) such as the ability to handle left-recursive rules.
The first parser rule searches for all tokens followed by an
ENDMARKER is a special token
yielded by the tokeniser to indicate the end of the file, and there should only be one,
which indicates the parsing should stop. Within the tokens it sees, the parser then
looks for lines (groups of tokens ending with a
statement productions, which can themselves be either
function productions that accept
one or more arguments. These productions accept the same types of arguments but differ
in the way they require them to be delimited: in the case of
element arguments, the delimiters are spaces (on a single line),
whereas for function arguments the delimiters are commas (and whitespace and newlines
Arguments can be numbers, null, booleans, expressions, references and strings. Expressions can be arbitrarily nested and take more care to parse in a way that obeys operator precedence; multiple rules are therefore required.
Some grammar rules in the parser have equivalent containers, which are objects that
hold the tokens that represent the relevant production. Rules without equivalent
containers usually provide data to a higher order rule that has a container. These
containers are primarily used by the compiler but also for error handling, so they
TokenContainer which defines some useful attributes and
properties for figuring out where in the input the production appears. There are also
ExtraTokenContainer mixins which containers
can inherit to conveniently add support for additional types of data. The attributes
these mixins define are also recognised and supported by the compiler, so
container-specific rules in the compiler are mostly not needed.
Arrays are sequences of numbers, expressions, strings, or other arrays. This data
structure is provided to allow such sequences to be passed to elements and commands.
There can be some special behaviour depending on the location in which the array
appears. Stand-alone array arguments are passed as
lists, but arrays
embedded inside expressions are converted to
ndarray so as to support
element-wise operations such as
2 * [1, 2]. This context-specific behaviour is
employed so that lists of strings can be passed to e.g. plot commands in KatScript
without having these converted to
ndarray, which are most suited to
Note that in order to support the default
list type behaviour for arrays not
used in expressions, the
array rule has to come before
expr in rules that can match both productions (otherwise all
arrays regardless of contents would be converted to
Error handling in the parser is not “free”. Due to the nature of the recursive descent algorithm, if the syntax is incorrect the parser will back-track to the last successful production and try the next rule. If all available alternative rules are exhausted, only then would a syntax error normally be raised. This normally leads to unhelpful error messages, without any indication as to the syntax that actually caused the error, because the parser back-tracked all the way to the first rule. It is therefore also important to build a number of special rules into the grammar that match common error conditions, which can then trigger more useful error messages.
Handled errors usually raise a
KatScriptError, which uses the provided error
item (a token or container) and the original script to figure out where to determine
where in the script the error(s) occurred. Depending on how Finesse is being run,
these error locations may be shown with markers or in a special colour. Here is an
example of an error message:
from finesse.script import parse parse("laser l1 P=1 freq=1M") # "freq" is not a valid argument.
KatDirectiveCompilationError: (use finesse.tb() to see the full traceback) line 1: 'laser' got an unexpected keyword argument 'freq' (expected syntax: laser name P=1 f=0 phase=0) -->1: laser l1 P=1 freq=1M ^^^^
This section frequently shows the results of building KatScript into graphs, so we’ll define a convenience function to let us see them:
import matplotlib.pyplot as plt from finesse.script.compiler import KatCompiler from finesse.plotting import plot_graph def plot_and_show(script, **kwargs): """Compile script and show the syntax graph.""" compiler = KatCompiler() compiler.compile(script, **kwargs) # Plot the graph, colouring nodes by their type. plot_graph( compiler.graph, layout="planar", node_labels=False, # Reduce clutter. node_attrs=True, edge_attrs=True, attr_font_size=12, edge_font_size=8, ax=plt.figure(figsize=(12, 16)).gca(), node_color_key=lambda node, data: data["type"] )
The compiler’s job is to take the parsed productions and turn them into a
elements containing the relevant parameter values
and references. There are three sub-steps to compilation: filling, resolving and
building. These all involve writing to or reading from a graph that gets constructed
to represent the KatScript’s syntax.
The filler matches different types of production yielded from the parsing step and
creates corresponding nodes in the graph. Productions that support arguments have these
arguments recursively filled as well, and edges of type
ARGUMENT are drawn to each
from the owning node.
As mentioned in the introduction, a concrete syntax tree is constructed because
Finesse cannot ignore (and therefore must parse) whitespace, comments and other
meta-tokens so that it can still reconstruct something resembling the original kat
script later. In practice, the tokens stored in the container parsed from a particular
part of the KatScript are stored as node attributes. All containers have a primary
token, which is stored as the field
token. Named containers (elements and functions)
also store a
name_token field, and extra characters such as the
= of a keyword
parameter are stored in
After filling, the graph is strictly a tree because references between
parameters of different elements have not yet been resolved. Everything descends from
ROOT node, which represents the model:
# Filled tree for a model containing a simple reference. plot_and_show( """ l l1 P=(1/2)*&l2.P l l2 P=1 """, resolve=False, build=False, )
The resolver takes the graph and draws edges between dependencies created by references,
i.e. arguments containing tokens of type
REFERENCE. It also performs some sanity checks on the graph, such
as to check for invalid components, duplicate commands, and cycles between references
(i.e. that which would be created by e.g.
m m1 R=&m2.R T=0 and
m m2 R=&m1.R
After resolving, the graph looks something like this:
# Resolved graph for a model containing a simple reference. plot_and_show( """ l l1 P=(1/2)*&l2.P l l2 P=1 """, resolve=True, build=False, )
Note that the &l2.P node is now connected to
l2 by a
DEPENDENCY edge. The
&l2.P node is the first statement (
l l1 P=0.5*&l2.P)’s first argument (
second argument. As a reference, this parameter depends on
l2 existing by the time
it is being compiled, so an edge is drawn to the target to represen this dependency. We
don’t draw an edge to the target parameter node (
P) itself because the target
parameter is created at the same time as its owning element anyway (this however doesn’t
give us a way to handle references between parameters of the same element, but we’ll get
to that in the build step).
The resolver also figures out the correct order in which the elements must be built
given the dependencies created by references. If you look closely at the
edges from the root node, you will see that the
order attribute has changed since
the filling step. This is because
l1 depends on
l2 (via its
argument) even though
l1 was defined earlier in the script.
The builder starts at the root of the resolved graph and recursively follows each
argument in order, building everything into relevant objects. It’s in this step that the
model is created (if one wasn’t provided), each element node is built into a relevant
ModelElement. Arguments are passed to the
Statement-level functions behave similarly, with the arguments being passed to a
function instead of a model element.
The graph is used to determine whether a given expression can be eagerly evaluated.
Eager evaluation is possible for any expression node that represents a closed branch of
the graph, i.e. one without nested references. Such expressions are evaluated at compile
time, and only the resulting value is passed to the Python object or function. Compound
expressions containing references have only the closed sub-branches eagerly evaluated.
As an example, take the graph shown in the previous section. The expression for the
(1/2)*&l2.P, contains two top level clauses:
&l2.P. The latter is a reference and must be left unevaluated at parse time, but the
former can be evaluated immediately. The builder therefore converts this expression to
0.5*&l2.P before it gets passed to
The final step in building is handling self-references, which are parameters that
reference another parameter in the same component. Such references create cycles in the
graph during the resolving step, but these are intentionally ignored at that time. These
references cannot be resolved in a single pass because element parameters are only
created when the element itself is created, via its
__init__ method, and that method
requires initial values for the parameters. Instead, parameters that contain
self-references are built with an initial value set to a
which, if evaluated (e.g. by code running in an element’s
__init__), will throw an
error. Once a component gets built, any parameters with
Resolving symbols then
are fully resolved by calling the relevant build routine once again. This time, because
the target parameters now exist (because the element was since created), it is possible
to make real references to them using Finesse symbols.
The conversion between parsed, resolved arguments and Finesse objects is handled by
KatSpec, which contains a series of
representing the available statements. This way, only the language constructs need to be
hard-coded into the parser stack and the available commands, elements and analyses can
be changed, e.g. to register a custom component.
Continuing with the example used so far, shown below is the result of the build step. Since this is also the last step in the parsing process, this is also what gets returned to the user.
from finesse.script.compiler import KatCompiler compiler = KatCompiler() model = compiler.compile( """ l l1 P=(1/2)*&l2.P l l2 P=1 """ )
Model contains the relevant items:
<finesse.model.Model object at 0x7f3086150d30> 1 optical mode ============== - [0 0] 2 components ============ - Laser('l2', f=0.0, phase=0.0, P=1.0) - Laser('l1', f=0.0, phase=0.0, P=0.5*l2.P) 0 detectors =========== 0 cavities ========== 0 locking loops ===============