
Accuracy Reports: Matching Explained

Inner workings of our accuracy reports, including matching logic, metrics, and algorithms used.


What is an Accuracy Report?

Accuracy reports help you measure how well Veryfi models perform at extracting information from your documents. You can create and view different reports in the "Analytics" section to track performance across various initiatives, vendors, or fields, model over model.

The goal of this article is to provide a detailed explanation of the inner workings of the accuracy reports tool, not to elaborate on how to create reports or how to use the tool more generally.

👨🏻‍🔧 For instructions on how to use the tool and create reports, please see this article and the accompanying YouTube video.

What is Ground Truth and why is it important?

Ground truth is the correct, manually reviewed data that Veryfi uses as a benchmark to measure accuracy. To measure accuracy, you first need to know what the correct output is, so we call a set of documents that has been manually reviewed for the fields of interest the "ground truth".

Think of it as your "gold standard" data:

  • It's created by human reviewers who carefully check and verify the correct values

  • You can tag these documents (data set) for easy filtering and navigation in the future

  • Only evaluate fields and documents you've verified.

Why is it important to evaluate only fields and documents you have verified?

Let's say you have an invoice where you've carefully checked and confirmed that the vendor name is "Acme Corp." However, you haven't yet reviewed whether the line items on that invoice are correct.

In this case:

✅ DO: Use this invoice to check how well Veryfi extracts vendor names

❌ DON'T: Use this same invoice to check line item accuracy (don't add line items to the list of fields to track accuracy for)

Why? Because while you know for certain the vendor name is correct (you checked it!), you haven't verified if the line items are correct. Using unverified data as "ground truth" could give you misleading accuracy results.

A practical example:

  1. You have an invoice from Staples

  2. You've manually verified:

    • The vendor name is correct

    • The invoice date is correct

  3. But you haven't checked:

    • The line items

    • The tax amounts

  4. Therefore, only use this invoice to measure accuracy for vendor name and date extraction - not for line items or tax amounts.

Think of it like grading a test - you can only grade the questions you have the correct answers for. If you don't have the answer key for certain questions, you can't accurately score those parts.

How do you read reports?

Each report shows:

  • The fields being measured (those you included in the report)

  • Number of extractions

  • Model version used and Date of analysis

  • F1 score for each field

You can click "Show Detail" to see side-by-side comparisons of what the model extracted versus the ground truth.

🧑🏻‍🔬 The expected number of extractions is based on the ground truth, and the score shown for each field is the F1 score. F1 is a metric commonly used for summarizing the performance of AI models since it takes into account both precision and recall; it is explained in further detail below.

How do we measure accuracy?

At Veryfi we use several sophisticated methods to ensure fair and practical accuracy measurements, including but not limited to fuzzy matching, the F1 score, Levenshtein distance, and the Hunt–Szymanski algorithm.

A. Fuzzy matching (document level)

"Fuzzy matching" is used for certain fields where small differences don't impact usability of a result in most cases, the document-level fields that don’t require an exact match are the following: addresses, phone numbers, and names.

Addresses:

Pre-processing steps:

  1. Remove all instances of "USA" and "United States" from the end of the address

  2. Remove all dashes and last 4 digits from zip codes (e.g. "12345-6789" becomes "12345")

  3. Replace full state names with abbreviations

  4. Replace address terms with abbreviations (e.g. "st" for "street")

  5. Remove common variations like "P.O. Box"

  6. Remove punctuation and spaces from the address
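
To make the steps above concrete, here is a hypothetical, simplified sketch of this normalization in Python. The real pre-processing uses fuller state-name and street-term abbreviation tables; the two small dictionaries below are only illustrative.

```python
import re

# Illustrative subsets only; the real tables cover all states and street terms.
STATE_ABBREVIATIONS = {"california": "ca", "new york": "ny"}
STREET_ABBREVIATIONS = {"street": "st", "boulevard": "blvd"}

def normalize_address(address: str) -> str:
    value = address.lower()
    value = re.sub(r"\b(usa|united states)\s*$", "", value)    # step 1: drop country suffix
    value = re.sub(r"\b(\d{5})-\d{4}\b", r"\1", value)         # step 2: trim ZIP+4 to 5 digits
    for full, abbreviation in {**STATE_ABBREVIATIONS, **STREET_ABBREVIATIONS}.items():
        value = value.replace(full, abbreviation)              # steps 3-4: abbreviate
    value = value.replace("p.o. box", "")                      # step 5: drop "P.O. Box"
    return re.sub(r"[^a-z0-9]", "", value)                     # step 6: drop punctuation and spaces

# normalize_address("5482 Wilshire Boulevard, Los Angeles, California 90036-1234, USA")
# -> "5482wilshireblvdlosangelesca90036"
```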

Matching 

In short: Consider addresses matching if they're at least 85% similar
🧑🏻‍🏫 Explained: We calculate the Levenshtein distance between the ground truth and extracted values and divide it by the average length of the two strings; it is considered a match if this value is equal to or below 0.15, which corresponds to an 85% similarity.
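
As an illustration, here is a minimal sketch of this rule, assuming both values have already been pre-processed as described above. The levenshtein helper is a plain dynamic-programming implementation; a step-by-step worked example of the distance itself appears later in this article.

```python
def levenshtein(a: str, b: str) -> int:
    # Minimum number of single-character insertions, deletions, and substitutions.
    previous_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current_row = [i]
        for j, cb in enumerate(b, start=1):
            current_row.append(min(
                previous_row[j - 1] + (ca != cb),  # substitution (free when equal)
                previous_row[j] + 1,               # deletion
                current_row[j - 1] + 1,            # insertion
            ))
        previous_row = current_row
    return previous_row[-1]

def addresses_match(ground_truth: str, extracted: str, threshold: float = 0.15) -> bool:
    average_length = (len(ground_truth) + len(extracted)) / 2
    if average_length == 0:
        return True  # both strings empty after pre-processing
    return levenshtein(ground_truth, extracted) / average_length <= threshold
```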

Phone numbers:

Pre-processing steps:

  1. We focus on the actual digits

  2. For numbers with 8+ digits, we match the last 8 digits

  3. For shorter numbers, all digits must match

Matching

If the ground truth number has at least 8 digits, at least the last 8 digits must match. If the phone number is less than 8 digits, all digits must match.
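
A minimal sketch of this rule, assuming the values arrive as raw strings:

```python
import re

def phones_match(ground_truth: str, extracted: str) -> bool:
    gt_digits = re.sub(r"\D", "", ground_truth)   # keep only the digits
    ex_digits = re.sub(r"\D", "", extracted)
    if len(gt_digits) >= 8:
        # Long numbers: only the last 8 digits have to agree.
        return gt_digits[-8:] == ex_digits[-8:]
    # Short numbers: every digit has to agree.
    return gt_digits == ex_digits

# phones_match("+1 (415) 555-0132", "415-555-0132")  -> True (last 8 digits agree)
```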

Names:

This applies to the fields "vendor.name", "bill_to.name", "ship_to.name", and "vendor.raw_name".

Pre-processing steps:

  1. We remove common business prefixes (e.g. "the", "sarl")

  2. Remove common suffixes (e.g. "inc", "llc")

  3. Remove all special characters and spaces, but keep multilingual characters (e.g. "é", Chinese characters)

Matching

In short: Consider names matching if they're at least 85% similar
🧑🏻‍🏫 Explained: We calculate the Levenshtein distance between the ground truth and extracted values and divide it by the average length of the two strings; it is considered a match if this value is equal to or below 0.15, which corresponds to an 85% similarity. Additionally, if both strings start with the same characters and the overlapping characters are at least 50% of the length of the longer of the two names, it is a match.
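
The additional prefix rule can be sketched as follows; the 85% Levenshtein rule is the same as the one shown for addresses, so only the prefix check is illustrated here. Both names are assumed to be already pre-processed, and the example strings are purely illustrative.

```python
def names_prefix_match(ground_truth: str, extracted: str) -> bool:
    # Count how many leading characters the two names share.
    overlap = 0
    for gt_char, ex_char in zip(ground_truth, extracted):
        if gt_char != ex_char:
            break
        overlap += 1
    longer = max(len(ground_truth), len(extracted))
    # It is a match when the shared prefix covers at least half of the longer name.
    return longer > 0 and overlap >= 0.5 * longer

# names_prefix_match("acmecorporationwest", "acmecorporation")  -> True
```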

Dates:

Date fields in our JSON response are datetime objects, but only the date portion is used for matching; the time is ignored.
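
A minimal sketch of the date comparison, assuming both values are already parsed datetime objects:

```python
from datetime import datetime

def dates_match(ground_truth: datetime, extracted: datetime) -> bool:
    # Only the calendar date is compared; the time portion is ignored.
    return ground_truth.date() == extracted.date()

# dates_match(datetime(2024, 3, 1, 9, 30), datetime(2024, 3, 1, 17, 45))  -> True
```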

B. Fuzzy matching (array-like fields) / Special Handling for Line Items

line_items and tax_lines can appear more than once in a document and must be matched up before they can be compared; the matching algorithm is explained below. All fields of array-like objects use exact matching, except for the line item description.

Line_items.description 

Pre-processing:

1. Remove all special characters and spaces, but keep multilingual characters (e.g. "é", Chinese characters)

Matching 

In short: Descriptions are considered matching if they're at least 90% similar

🧑🏻‍🏫 Explained: We calculate the Levenshtein distance between the ground truth and extracted values and divide it by the average length of the two strings; it is considered a match if this value is equal to or below 0.1, which corresponds to a 90% similarity. Additionally, if both strings start with the same characters and the overlapping characters are at least 50% of the length of the longer of the two strings, it is a match.

Array field matching

Sometimes line_items can be skipped or broken down, so it is necessary to match each line_item in the extraction to one in the ground truth. We use a single field for this matching: the first field from this list that is present in the report: total, description, full_description, price. If no field from this list is in the report, the first field added to the report is used (a small sketch of this choice follows). The algorithm used for matching is Hunt-Szymanski.
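
A hypothetical sketch of that key-field choice; the function and constant names below are illustrative, not Veryfi's actual API.

```python
PREFERRED_KEYS = ["total", "description", "full_description", "price"]

def pick_matching_key(report_fields: list[str]) -> str:
    for key in PREFERRED_KEYS:
        if key in report_fields:
            return key
    # None of the preferred keys is in the report: fall back to the first field added.
    return report_fields[0]

# pick_matching_key(["description", "quantity", "price"])  -> "description"
```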

Hunt-Szymanski algorithm:

Given two sequences, the algorithm identifies the longest common subsequence between them, allowing for gaps in either one. As an example:

Let's say you have two sentences:

  1. This is a good sample sequence for the example

  2. There is a good sample sequence used for an example

Two words must match exactly in order to be paired; the resulting subsequence would be:

is a good sample sequence for example

Obtained from:

This  is a good sample sequence      for the example

There is a good sample sequence used for an  example

In the case of line item totals, let's say that the ground truth has five line items with the respective totals 1.0, 2.0, 3.0, 5.0, 6.0, and the model prediction has six line items: 1.0, 1.0, 2.0, 4.0, 5.0, 6.0. The resulting match would be:

1.0 2.0 5.0 6.0

Coming from:

1.0     2.0 3.0 5.0 6.0

1.0 1.0 2.0 4.0 5.0 6.0

The line items in the report would be:

| Index | Ground truth total | Extracted total |
| --- | --- | --- |
| 1 | 1.0 | 1.0 |
| 2 | null | 1.0 |
| 3 | 2.0 | 2.0 |
| 4 | 3.0 | 4.0 |
| 5 | 5.0 | 5.0 |
| 6 | 6.0 | 6.0 |
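
For illustration, here is a plain dynamic-programming longest-common-subsequence alignment. It is not the Hunt–Szymanski algorithm itself, which is an optimization of the same idea for long sequences, but it recovers the same set of matched totals on the example above.

```python
def lcs_align(ground_truth, extracted):
    """Return index pairs (i, j) of one longest common subsequence."""
    n, m = len(ground_truth), len(extracted)
    # dp[i][j] = length of the LCS of ground_truth[:i] and extracted[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ground_truth[i - 1] == extracted[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Walk back through the table to recover one optimal set of matched pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if ground_truth[i - 1] == extracted[j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))

ground_truth_totals = [1.0, 2.0, 3.0, 5.0, 6.0]
extracted_totals = [1.0, 1.0, 2.0, 4.0, 5.0, 6.0]
print(lcs_align(ground_truth_totals, extracted_totals))
# [(0, 1), (1, 2), (3, 4), (4, 5)]: the matched totals are 1.0, 2.0, 5.0, and 6.0;
# one extracted 1.0, the ground truth 3.0, and the extracted 4.0 stay unmatched.
```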

Levenshtein distance:

This distance is defined as the minimum number of edits necessary to transform one string of text into another, where edits are insertions, deletions, and substitutions of a single character. Let's say you have two line item descriptions:

  1. Starbucks large ch@ramel vanilla coldmacchiato

  2. $tarbuckslarge caramel vanilla cold macchiato

The result does not depend on which string we start from, so let's take the second string and transform it into the first one:

First, we do an edit on the first character, transforming the $ to an S, and get:

Second string: Starbuckslarge caramel vanilla cold macchiato

Original string: Starbucks large ch@ramel vanilla coldmacchiato

Then, we make an insertion of a space between Starbucks and large

Second string: Starbucks large caramel vanilla cold macchiato

Original string: Starbucks large ch@ramel vanilla coldmacchiato

Then, we insert an h in caramel

Second string: Starbucks large charamel vanilla cold macchiato

Original string: Starbucks large ch@ramel vanilla coldmacchiato

Then, edit the first a in charamel for an @

Second string: Starbucks large ch@ramel vanilla cold macchiato

Original string: Starbucks large ch@ramel vanilla coldmacchiato

Finally, delete the space between cold and macchiato

Second string: Starbucks large ch@ramel vanilla coldmacchiato

Original string: Starbucks large ch@ramel vanilla coldmacchiato

We have made a total of five edits to go from one string to the other, so the Levenshtein distance between the two strings is 5. To turn this into the metric we commonly use, we divide by the average length of the two original strings: the first string has 46 characters and the second has 45, so 5 edits over 45.5 is approximately 0.11. Since line item descriptions use a threshold of 0.1 (10%), these two descriptions would not quite be considered a match; they would, however, fall within the 0.15 threshold used for addresses and names.
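
The walkthrough can be reproduced with the same dynamic-programming helper sketched in the Addresses section:

```python
def levenshtein(a: str, b: str) -> int:
    previous_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current_row = [i]
        for j, cb in enumerate(b, start=1):
            current_row.append(min(previous_row[j - 1] + (ca != cb),  # substitution
                                   previous_row[j] + 1,               # deletion
                                   current_row[j - 1] + 1))           # insertion
        previous_row = current_row
    return previous_row[-1]

first = "Starbucks large ch@ramel vanilla coldmacchiato"
second = "$tarbuckslarge caramel vanilla cold macchiato"
distance = levenshtein(first, second)                 # 5 edits, as in the walkthrough
ratio = distance / ((len(first) + len(second)) / 2)   # 5 / 45.5 ≈ 0.11
```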

What does the F1 Score mean?

F1 score is a metric commonly used to describe the performance of AI models, especially useful when dealing with unbalanced datasets, where the likelihood of one result is significantly different from the others. Let's begin with the definitions that we have for our use case:

True Positive: occurs when a ground truth value is not empty and matches the extracted value.

False Positive: occurs when a non-empty value is extracted and it does not match the ground truth value (including when the ground truth is empty).

Recall is computed as:

Number of True Positives / Number of cases where the ground truth value is not empty (not null)

Precision is computed as:

Number of True Positives / (Number of True Positives + Number of False Positives)

F1 is computed as:

2 * Precision * Recall / (Precision + Recall)
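
A minimal sketch of these three formulas, assuming the per-field match results have already been tallied into the three counts below:

```python
def f1_report(true_positives: int, false_positives: int, ground_truth_not_empty: int) -> dict:
    recall = true_positives / ground_truth_not_empty if ground_truth_not_empty else 0.0
    predicted = true_positives + false_positives
    precision = true_positives / predicted if predicted else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# f1_report(80, 10, 100) -> precision ≈ 0.889, recall 0.80, F1 ≈ 0.842
```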

Sample categorization of matching types:

| Ground truth value | Extracted value | Is match? | Match type |
| --- | --- | --- | --- |
| 5482 Wilshire Blvd, Box 1589, Los Angeles, CA 90036, USA | 5482 Wilshire Blvd, Box 1589, Los Angeles, CA 90036 | Yes | True Positive (ground truth was not empty) |
| 5482 Wilshire Blvd, Box 1589, Los Angeles, CA 90036, USA | 5482 Wilshire Blvd, Los Angeles, CA 90036 | No | False Positive (ground truth was not empty) |
| null | 5482 Wilshire Blvd, Los Angeles, CA 90036 | No | False Positive |
| 5482 Wilshire Blvd, Box 1589, Los Angeles, CA 90036, USA | null | No | Ground truth was not empty |
| null | null | Yes | Not used in the calculation of F1, Precision, or Recall |

Notes:

  • The values of the objects generated during the array field matching step are all treated as null.

  • For array-like fields, larger documents (with a high number of array values) have a higher impact on the metrics.

  • null-to-null matches are shown in gray in the extraction tab.

Common Questions

Q: What happens if a field is empty in both the extraction and ground truth?

A: These are not counted in accuracy calculations since there's nothing to compare.

Q: How do you handle international characters and special symbols?

A: Our system preserves international characters (like "é") and handles special characters intelligently based on the field type.

Q: Can I create different reports for different document types or issues?

A: Yes! You can create separate reports for different document types, vendors, or fields to track performance across various scenarios.


Best Practices for Accuracy Reports

Setting Up Your Ground Truth

  1. Create your ground truth data directly in Veryfi Hub using the New Document Details view

    • This ensures consistency and proper formatting

    • Makes it easier to track and manage verified data

  2. Tag documents systematically/consistently

    • Use clear, consistent tags for your ground truth documents

    • Example tags: "verified_vendor_name", "verified_line_items", "2024_Q1_review"

    • This makes filtering and organizing documents much simpler

  3. Verify before measuring

    • Always review and approve your ground truth data before using it in reports

    • Double-check that the verified values are correct

Creating Effective Reports

  1. Give your reports clear names

    • Use descriptive names that indicate purpose and scope

    • Example: "Q1-2024_Vendor-Name-Accuracy" or "Healthcare-Invoice-Line-Items"

  2. Add detailed descriptions

    • Include the scope of what's being measured

    • Note any special conditions or filters applied

    • Document why this report was created

Maintaining Quality

  1. Focus on verified fields only

    • Only measure accuracy for fields you've manually reviewed

  2. Track multiple aspects

    • Create separate reports for different document types/vendors/fields

    • Track accuracy changes over time

Need Help?

If you have specific questions about your Accuracy Reports or need help interpreting the results, please reach out to your Technical Account Manager for detailed guidance or send an email to [email protected]
