A deep dive into Excel syntax to decipher the semantics of formulas

How to use XLtoy to parse and extract formula semantics from an excel workbook, and analyze its topology with better instruments.

6 min readMar 31, 2021

Every spreadsheet has its own syntax, and excel is not far behind. So implementing a model in a spreadsheet must consider two main aspects: syntax and topology. My purpose is to break down these, and of course others, aspects to get a complete picture, and implement it in XLtoy package. Today we analyze how graph theory can be effectively applied in an excel workbook to solve problems that cannot be solved with other approaches.

Spreadsheet syntax is heavily influenced from many aspects: the context of use, how symbols are identified, how refers to the layout and order of the function and its arguments. Probably, the main aspect, is what is needed to identify the cells that contain the necessary values. This point is crucial and it gives a model written in Excel (aka xmodel) a taste that sounds like a positional syntax.

The ‘position’ aspect, or the topology, of a not-spreadsheet model, is a little bit different, because, usually input data are expressed by a label, we want an easy way to manage how data, symbols, and functions are identified, so our need must be fulfilled in the syntax. This aspect is amplified using time series data and in the forecasting models, because we must identify inputs and a concept of time, is very common to find notations like :

We can call it relative syntax, following the reading, we will discover how the semantics of the model can help us in the management of complex cases.

Let’s take a look at these concepts in practice:

Working on a formula (in red), it needs 2 inputs, but without knowing, if the model evolves vertically or horizontally, we cannot understand from the syntax if the cell in blue indicates the previous time, and the one in green the simultaneous time or vice versa. Now let’s go a step further and analyze how relationships in these kinds of models can be managed. We will introduce the idea of the topology of a model.

The Excel graph

When you change the value of a cell, Excel automagically updates all descendant cells… This is awesome, but how it works under the hood?

To consistently compute formulas, it must resolve a topological problem, you can see it like a graph in which nodes are the cells of the spreadsheet and the edges are the relationship between symbols used by formulas. Using this data structure, excel knows who depends on who, and with this information, it can solve all formulas in a precise order called topological order.

Excel offers a tool to trace predecessors and successors, but, this can track only a node at a time. In the left image, it was activated on only 3 formulas. It is clear that with increasing complexity it soon becomes useless, so we would not count on it.

The XLtoy graph

As explained in the last article, XLTtoy couple each variable to his label so keeping again our simple example it can be better explained in a form in which topology is represented by the syntax. The formula I11=I10*H11 can be better represented by:

ENDO[T]=ENDO[T-1]*EXO[T]

T identifies the time step, in this case, a row of the sheet. This idea is crucial to transform a positional syntax into a relative syntax, which is much more effective in describing the semantic of formulas in the model. This is very powerful, In fact, we can summarize all the lines in the model, in a single cycle in which each formula is expressed in its relative form. So, in this example, the ENDO variable can be simplyfied like the following:

for T in range(to_simulate):
   ENDO[T]=ENDO[T-1]*EXO[T]

When the parser is activated, XLtoy internally fills a graph using this information and uses it to solve all topological problems. This is the internal representation of this xmodel.

The showcase

For this exercise, I’ve used a simple real estate model (a special thank’s to Andrea Monaco for his help in providing me with the idea and the model implementation) for forecasting the price of a property over time. Some quantitative parameters establish the starting points, the model projects its value in the future in a quarterly calendar. This xmodel is available in the package for download.

All described functions of XLtoy are, available via the command line API, I avoid the description of the command because it is already in the previous one. We need only an option to obtain the entire topology in a graph format, but for clarity, we add another option to obtain the relative syntax in the standard output:

$ xltoy parse re_model.xlsx --yaml --gml_graph re_model.gmlFoglio1:
  change_rate: =(index[T]-maintenance_cost[T])
  index: 0.01
  maintenance_cost: =0.1318181818181818/10*time_Y[T]
  price: =price[T-1]*(1+change_rate[T])
  time_Y: 0.25

yaml that shows the parsed model, this is optional but useful for showing the output.
gml_graph <fname> that produces a file named re_model.gml in standard GML format

Now the graph is available for a deep analysis of the relationship, for a first touch in with the GML file we can start a jupyter notebook and add few lines of code:

Adding some formatting options we can reach all information contained in the graph as shown below. This graph has an enormous informative charge because it shows a mnemonic label for each formula, in yellow, and each edges the dependency time, e.g. price depends on itself on the previous time and the change_rate at the same time.

Conclusion

XLtoy can parse an excel model and revamp it in a more readable syntax and make it available via a graph, so we can adopt the most suitable tools, to do a fine-grained introspection, and solve problems related to the interrelation between variables like:

Find all parents of one or more formulas, the predecessors.
Find all affected variables, when a single value of a cell is changed, the successors.
Find a topological resolution of the model, based on the interrelations.
Find all useless formulas, this happens when we must do a reduction of the outcome in complex models.

When the models contain thousands of variables a graph approach saves a lot of time and allows you to keep complex models under control.

Another nonmarginal aspect is that this approach is totally generalizable, and allows us to rewrite the entire model in an arbitrary language. I mean, if we know how to reach data, how to handle functions and relations between formulas, we are able to rewrite all logic using a different syntax.
I’m working on a version able to write python grammar capable of being executed from a python interpreter. I’m searching for help to write the js version too.

Stay tuned!