Consistency Testing for Legacy Applications

Let’s say you have inherited a vast and messy legacy project. You are asked to upgrade some core libraries to the latest versions and implement a handful of new features. You diligently go through what documentation exists and interview the product owner and users. The application still works, but the original developers have all gone on to greener pastures. After looking at the source it becomes obvious that a project that started with clear development patterns and principles has devolved into an entropic, convoluted mess. The unit test coverage for this application is sparse and the existing unit tests are either dubious or broken. The product owner certifies the product by manually testing it. This is a problem. How can you repeatedly demonstrate that your non-functional changes will not break the product? Many things can be missed with manual testing, and it is a time-consuming process.


It’s very hard to prove that a system still works correctly by writing tests against what you know the system will do (positive tests) or what you can predict a system will do given some unorthodox input (negative tests). Only very high code coverage will instill confidence; it’s unreasonable to create this unit test suite before starting on the actual work at hand. You may not have the time to learn all the ins and outs of the code. When adding positive tests you will very quickly veer into the Rumsfeldian “unknown unknowns”: you can only test for the things you know or can reasonably expect. If the input into your system is large and varied, automated tests asserting system behavior are likely to miss edge cases.


However, what you really need to demonstrate is not that the application behaves correctly. You need to demonstrate that the application behaves consistently after you have applied your changes. Really, you need to show that the application is functionally no worse off now than it was before. Behavior, whether correct or incorrect, should not be altered unintentionally after you make your changes. If there were bugs in the system before, it’s reasonable to expect that the bugs are still there after the system modification. You may never fully understand the system’s complicated use cases and business rules. When testing for consistent behavior, you don’t really need to.


You need to decide what represents application state, the transformation action, and the measurable output. Consider a feed ingestion system. The state is the database, the transformation is the act of loading a feed file, and the measurable output can again be the database, transformed according to the ingestion rules. To do a consistency test, perform the following steps:

  1. Start with a known baseline database. Save it off as a snapshot or an SQL dump. Let’s call it D.

  2. Run a known feed file against the unmodified legacy ingestion code.

  3. Save off the modified database state. Let’s call it D-Legacy.

  4. Now restore the database to the original state D.

  5. Swap in your modified, upgraded application code.

  6. Run the same feed file again.

  7. Save off the database state to D-New.

  8. Compare D-Legacy to D-New. If your changes are consistent, then D-Legacy = D-New.
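
The steps above can be sketched as a small shell driver. This is only an illustration under stated assumptions: the hook names restore_baseline, run_ingest, and dump_state are hypothetical and belong to no existing tool. You would define them yourself so that they wrap whatever restores, loads, and dumps state in your own system.

```shell
#!/bin/sh
# consistency_check: run the same input through two code versions and
# compare the resulting state.
# The caller must define three hooks first (hypothetical names):
#   restore_baseline      reset the state to the baseline D   (steps 1 and 4)
#   run_ingest VERSION    load the feed with "legacy" or "new" code (steps 2, 5, 6)
#   dump_state FILE       write the current state to FILE     (steps 3 and 7)
consistency_check() {
    workdir=$(mktemp -d) || return 2

    # Steps 1-3: baseline, legacy run, save D-Legacy.
    restore_baseline && run_ingest legacy && dump_state "$workdir/D_legacy.txt" || return 2
    # Steps 4-7: baseline again, modified run, save D-New.
    restore_baseline && run_ingest new && dump_state "$workdir/D_new.txt" || return 2

    # Step 8: identical dumps mean the change is consistent.
    if diff "$workdir/D_legacy.txt" "$workdir/D_new.txt"; then
        echo "consistent"
        rm -rf "$workdir"
        return 0
    fi
    echo "inconsistent: dumps kept in $workdir" >&2
    return 1
}
```

With the hooks defined, calling consistency_check prints "consistent" and returns 0 when the two dumps match. Otherwise it prints the diff, returns 1, and leaves both dumps in a temporary directory for inspection.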


Although a number of tools exist for diffing databases (for instance, the Percona Toolkit), it’s always possible to fall back on tried-and-true shell scripting. For instance:

$ # Dump every table of each database to a text file
$ echo "show tables;" | mysql -N -uuser -p D_legacy | awk '{print "select * from " $0 ";"}' | mysql -uuser -p D_legacy > /tmp/D_legacy.txt
$ echo "show tables;" | mysql -N -uuser -p D_new | awk '{print "select * from " $0 ";"}' | mysql -uuser -p D_new > /tmp/D_new.txt
$ # Look at the diff file
$ diff /tmp/D_legacy.txt /tmp/D_new.txt
$ # Look at all records that are different, assuming the row sort order is the same in both files
$ comm -3 /tmp/D_legacy.txt /tmp/D_new.txt

Naturally, the code above would need to be refined, and we can be more targeted about what we diff. It’s fairer to say that D-Legacy should be kind-of-equal to D-New. For instance, it makes sense to ignore timestamp fields and dataless keys, whose values can legitimately differ from run to run (especially if this is a multi-threaded feed ingestion system). The `comm` tool only works when the two input files are sorted the same way, so pipe each query result through `sort` first:

$ echo "select field1, field2, field3, …, fieldX from tableZ;" | mysql -uuser -p D_legacy | sort > /tmp/D_legacy_tableZ.txt
$ echo "select field1, field2, field3, …, fieldX from tableZ;" | mysql -uuser -p D_new | sort > /tmp/D_new_tableZ.txt
$ comm -3 /tmp/D_legacy_tableZ.txt /tmp/D_new_tableZ.txt
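
If the dumps already exist, the volatile columns can also be dropped after the fact instead of in the query. The function below is a minimal sketch of that idea, not a finished tool. The name diff_ignoring_columns is made up for illustration, and it assumes tab-separated dumps, which is what the mysql client emits when its output is piped.

```shell
#!/bin/sh
# diff_ignoring_columns KEEP LEGACY_DUMP NEW_DUMP   (hypothetical helper)
# KEEP is a cut(1) field list such as "1,3-5" naming the tab-separated
# columns to compare. Timestamps and generated keys are ignored simply
# by leaving their column numbers out of the list.
diff_ignoring_columns() {
    keep=$1
    legacy=$2
    new=$3
    tmp=$(mktemp -d) || return 2

    # LC_ALL=C makes sort and comm agree on a byte-wise ordering.
    cut -f "$keep" "$legacy" | LC_ALL=C sort > "$tmp/legacy"
    cut -f "$keep" "$new"    | LC_ALL=C sort > "$tmp/new"

    # comm -3 hides lines common to both files. Lines only in the legacy
    # dump print flush left; lines only in the new dump print after a tab.
    LC_ALL=C comm -3 "$tmp/legacy" "$tmp/new"
    rm -rf "$tmp"
}
```

Called as diff_ignoring_columns 1,2 /tmp/D_legacy_tableZ.txt /tmp/D_new_tableZ.txt, it prints nothing when the kept columns agree in both dumps, regardless of row order.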

The same approach can be extended to other types of systems. A web application that generates a large report after some user input can just as easily be tested for consistency. The user interactions can be scripted via Selenium, and the final report, be it an HTML document, a CSV, or a PDF, can likewise be compared. The tools can vary, but the approach stays the same. Even if this testing is not automated, and the expected differences are not programmatically accounted for, a single run can go a long way toward showing how much impact your changes have on the functional behavior of the system.


A final note: consistency testing is best left to legacy systems. It’s useful for building a certain level of confidence in the maintenance work you do. This should not be used early in the product development cycle. When starting up a new project, it’s best to stick to positive and negative assertion tests: unit, integration, QA, and UAT. A new project is by its very nature inconsistent.