How to efficiently find differences on BIG excel worksheet

Photo by Mika Baumeister on Unsplash

Many times i found myself in front of big datasets stored in xlsx format. To be honest, open format is not designed to store lots of data, old binary file xls (Excel Binary Format) was more performing, from this point of view.

Anyway, handling these kinds of files is a very tedious process, because a lot of the time is spent to open, compare, manage changes, and in this use case, to find differences between 2 updated versions of the same data source.

A friend of mine, weekly downloads some new meteorological data. These data are produced in some manner and stored on a single xlsx file, which has grown a lot in few months of work. Each time there are hundreds of new rows and even, some updates in previous data. Using Microsoft Excel is not so easy to seek and find these differences, so an out of box tools can help to save time and money.

XLtoy is a python library that simplifies many operations on xlsx files. Shortly it’s focalized on maintenance phases of a model like change management, topology analysis, data lineage, model auditability, and so on, what we can call model-ops phase. There are a couple of interesting ideas that allow us to interpret an Excel workbook with the aim of efficiently extracting data and formulas.

When we use excel, most of the cells are occupied by the development of our analyzes, so only a subset of cells contains the outcome. XLtoy is focalized only on this subset. With a small intervention on the sheet itself, it is therefore possible to define which cells to focus the parser on, and rewrite them in a more efficient format. This crucial phase is required in order to define data working areas. In few words using the name manager we associate each range with a logic name, and with this, we help the parser to understand what to do with the contained cells.

For a deep dive please refer to the online documentation.

For this exercise, we handle only data (formulas and other stuff will be discussed in the next articles) and we use the ability to “collect” and store them in a more efficient format; with this to automate operations like diff between two similar workbooks.

I cannot use original files, so the exercise was done using a data source contains >3M cells from https://data.world a free data sources provider.

All operation was done using command line, in terms of automation processes this is a big improvement.

  • Make some changes, all kinds of changes, inside the data working area, like add and remove cells or entire rows are allowed. Save this new version too as Air_Quality2.xlsx
  • Use xltoy diff to find differences.
    $> xltoy diff Air_Quality.json Air_Quality2.xlsx --timeit -vvv

For this exercise, i’ve changed only 2 cells and the outcome was:

change:
Air_Quality:
H146162: Wood -> Iron
N218636: 0 -> 1

It means, that difference of kind change was found in sheet Air_Quality. Two cells are involved: H146162 and N218636 the values are also reported: Wood was changed in Iron ad a 0 become a 1.

In this simple example, starting from an xlsx file, we have touched some interesting features of the XLtoy package like, limit parsing to a working area in order to speed up operations, the capability to take a snapshot and store it in a better format, JSON in this case, and finally, the diff command to show the differences. Diff command show output in YAML format by default, other options are available in future versions. With XLtoy we have done the first step versus an automated and cross-platform solution.

About performance, it took:

  • about 7 secs. to open the file in Excel
  • 35 sec. to do the snapshot (collect + store) in JSON format
  • 1.5 sec. to load JSON file
  • diff costs less than 1sec.

Pragmatic programmer | Data Engineer | Python architect