Anevis Solutions’ internally developed PDF document comparison software provides our customers with an additional quality check. This article explains in detail how the comparison software works.
The software goes through several steps to detect optical or textual differences and to rate the magnitude of those differences.
Are the documents optically identical?
Because PDF is a page- and vector-based file format, checking this requires the documents to be converted page by page into a pixel-based form. The resolution must at least match what the human eye can resolve. Once a document is rasterized, two pages can be compared pixel by pixel. The duration of this process depends on the chosen resolution. Even at a resolution as high as described above, the comparison finishes in an acceptable time, around 1.5 seconds per page on a low-end computer.
This check has proved to be the most effective part of the automated comparison process: if it succeeds, the documents are optically equal and no further manual checks are needed. It thus saves staff in the quality assurance department a lot of effort.
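The pixel-by-pixel check can be illustrated with a minimal Python sketch. It assumes the pages have already been rasterized to 8-bit grayscale by some PDF rendering library, which is not shown here. The `RasterPage` type and all function names are hypothetical and chosen only for illustration; they are not taken from the actual software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RasterPage:
    """A rasterized page (hypothetical type): 8-bit grayscale, row by row."""
    width: int
    height: int
    pixels: bytes  # expected length: width * height


def pages_optically_equal(a: RasterPage, b: RasterPage) -> bool:
    """True only if both pages have the same size and identical pixels."""
    if (a.width, a.height) != (b.width, b.height):
        return False
    return a.pixels == b.pixels


def count_differing_pixels(a: RasterPage, b: RasterPage) -> int:
    """Count the pixel positions whose values differ."""
    if (a.width, a.height) != (b.width, b.height):
        raise ValueError("pages must have the same dimensions")
    return sum(1 for x, y in zip(a.pixels, b.pixels) if x != y)


def documents_optically_equal(doc_a, doc_b) -> bool:
    """Equal if the page counts match and every page pair is identical."""
    if len(doc_a) != len(doc_b):
        return False
    return all(pages_optically_equal(p, q) for p, q in zip(doc_a, doc_b))
```

Both documents have to be rasterized at the same resolution for this comparison to be meaningful, since a different resolution already changes the page dimensions in pixels.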
If the documents are not optically identical, is the text identical?
In this case, the text encoded in the PDF documents must be extracted and compared for equality page by page.
If the text is equal, the user of the comparison software can, for example, review the effects of a font change in a PDF document. Combined with the generated visualisation of the optical differences, this feature is useful for comparing various typefaces for the same document.
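A page-based text comparison could look roughly like the following sketch. It assumes the text of each page has already been extracted into one string per page by a PDF library, which is not shown. The function names are hypothetical, and the optional whitespace normalization is an assumption of this example, not a documented behaviour of the software.

```python
from itertools import zip_longest


def normalize(text: str) -> str:
    """Collapse runs of whitespace so that reflowed lines compare equal."""
    return " ".join(text.split())


def pages_with_text_differences(pages_a, pages_b, ignore_whitespace=True):
    """Return the 1-based numbers of pages whose extracted text differs.

    pages_a and pages_b are lists of strings, one string per page.
    A page that exists in only one document is reported as different.
    """
    differing = []
    pairs = zip_longest(pages_a, pages_b)
    for number, (text_a, text_b) in enumerate(pairs, start=1):
        if text_a is None or text_b is None:
            differing.append(number)
            continue
        if ignore_whitespace:
            text_a, text_b = normalize(text_a), normalize(text_b)
        if text_a != text_b:
            differing.append(number)
    return differing


def text_equal(pages_a, pages_b, ignore_whitespace=True) -> bool:
    """True if no page shows a text difference."""
    return not pages_with_text_differences(pages_a, pages_b, ignore_whitespace)
```

Ignoring whitespace is a design choice that suits the font-change scenario: a new typeface may move line breaks without changing a single word.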
If there is a difference between the PDF documents, how large is it?
The most difficult question to answer is how large the optical change is. The estimation algorithm must therefore come as close as possible to the human perception of documents: shifted lines or charts, for example, have to be rated as striking differences. Contrast, on the other hand, does not play an important role because PDF documents normally have a uniform contrast. This is especially true for factsheets, which contain a lot of text and charts, usually on a white background.
To fulfill these requirements, various image comparison algorithms were tested and evaluated. An algorithm that rates the structural similarity of images performed best for this specific use case.
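One well-known measure of this kind is the structural similarity (SSIM) index. Whether the software uses exactly this formulation is not stated here, so the following is only a simplified sketch: it evaluates the SSIM formula once over the whole page, whereas common implementations average it over small local windows. The function name is hypothetical, and the constants 0.01 and 0.03 are the values conventionally used with SSIM.

```python
from statistics import fmean


def global_ssim(x, y, dynamic_range=255.0):
    """Single-window SSIM of two equally sized grayscale pixel sequences.

    Returns 1.0 for identical input; lower values mean less similarity.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    # Small stabilizing constants, conventional choices for SSIM.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2

    mu_x = fmean(x)
    mu_y = fmean(y)
    var_x = fmean((p - mu_x) ** 2 for p in x)
    var_y = fmean((q - mu_y) ** 2 for q in y)
    cov_xy = fmean((p - mu_x) * (q - mu_y) for p, q in zip(x, y))

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

The inputs can be any sequences of pixel values, for example the `bytes` of two rasterized pages. Because the formula compares means, variances and the covariance of the pixel values, a shifted line or chart lowers the covariance term and therefore the score.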
As a further optimisation of the comparison process, this rating of differences allows the user of the software to focus the search on the pages with the largest differences.
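How such a rating can guide the review is shown in the short sketch below. It assumes one similarity score per page, where 1.0 means identical and lower values mean larger differences. The function name and the example scores are hypothetical.

```python
def rank_pages_by_difference(similarities):
    """Return (page_number, similarity) pairs, most different page first.

    similarities holds one score per page, in page order. Page numbers
    are 1-based. Pages with equal scores keep their original order.
    """
    numbered = enumerate(similarities, start=1)
    return sorted(numbered, key=lambda item: item[1])


# Hypothetical scores for a three-page document.
ranking = rank_pages_by_difference([1.0, 0.42, 0.97])
print(ranking)  # [(2, 0.42), (3, 0.97), (1, 1.0)]
```

A reviewer would start with the first entry of the ranking and could stop as soon as the remaining pages score close to 1.0.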
If you want to start reading the series from the beginning you can find the first part of the series here.