What is an Accuracy Report?
Accuracy reports help you measure how well Veryfi models are performing in extracting information from your documents. You can create and view different reports in the "Analytics" section to track performance across various initiatives, vendors, or fields, and to compare one model version against another.
This article provides a detailed explanation of the inner workings of the accuracy reports tool; it does not cover how to create reports or how to use the tool more generally.
What is Ground Truth and why is it important?
Ground truth is the correct, manually reviewed data that Veryfi uses as a benchmark to measure accuracy. To measure accuracy you must first establish what the correct output is, so a set of documents that has been manually reviewed for the fields of interest is what we call the "ground truth".
Think of it as your "gold standard" data:
It's created by human reviewers who carefully check and verify the correct values
You can tag these documents (the data set) for easy filtering and navigation in the future
Only evaluate fields and documents you've verified.
Why is it important to only evaluate fields and documents you have verified?
Let's say you have an invoice where you've carefully checked and confirmed that the vendor name is "Acme Corp." However, you haven't yet reviewed whether the line items on that invoice are correct.
In this case:
✅ DO: Use this invoice to check how well Veryfi extracts vendor names
❌ DON'T: Use this same invoice to check line item accuracy (don't include line items in the list of fields to track accuracy for)
Why? Because while you know for certain the vendor name is correct (you checked it!), you haven't verified if the line items are correct. Using unverified data as "ground truth" could give you misleading accuracy results.
A practical example:
You have an invoice from Staples
You've manually verified:
The vendor name is correct
The invoice date is correct
But you haven't checked:
The line items
The tax amounts
Therefore, only use this invoice to measure accuracy for vendor name and date extraction - not for line items or tax amounts.
Think of it like grading a test - you can only grade the questions you have the correct answers for. If you don't have the answer key for certain questions, you can't accurately score those parts.
How to read reports?
Each report shows:
The fields being measured (those you included in the report)
Number of extractions
Model version used and Date of analysis
F1 score for each field
You can click "Show Detail" to see side-by-side comparisons of what the model extracted versus the ground truth.
🧑🏻🔬 The expected number of extractions is based on the ground truth, and the score shown for each field is the F1 score. F1 is a metric commonly used to summarize the performance of AI models because it takes both precision and recall into account; it is explained in further detail below.
How do we measure accuracy?
At Veryfi we use several methods to ensure fair and practical accuracy measurements, including but not limited to fuzzy matching, the F1 score, Levenshtein distance, and the Hunt-Szymanski algorithm.
A. Fuzzy matching (document level)
"Fuzzy matching" is used for certain fields where small differences don't impact usability of a result in most cases, the document-level fields that don’t require an exact match are the following: addresses, phone numbers, and names.
Addresses:
Pre-processing steps:
Remove all instances of "USA" and "United States" from the end of the address
Remove all dashes and last 4 digits from zip codes (e.g. "12345-6789" becomes "12345")
Replace full state names with abbreviations
Replace address terms with abbreviations (e.g. "st" for "street")
Remove common variations like "P.O. Box"
Remove punctuation and spaces from the address
Matching
In short: addresses are considered matching if they're at least 85% similar
🧑🏻🏫 Explained: We calculate the Levenshtein distance between the ground truth and extracted values and divide it by the average length of the two strings. It is considered a match if this value is at or below 0.15, i.e. at least 85% similarity.
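As an illustration of the address rules above, here is a minimal Python sketch (not Veryfi's actual implementation) of the pre-processing steps and the 0.15 Levenshtein-ratio check. The function names are illustrative, and only a handful of states and street terms are included for brevity.

```python
import re

# Illustrative lookup tables; the real lists cover all states and many more street terms.
STATE_ABBREVIATIONS = {"california": "ca", "new york": "ny", "texas": "tx"}
ADDRESS_TERMS = {"street": "st", "boulevard": "blvd", "avenue": "ave", "suite": "ste"}


def normalize_address(address: str) -> str:
    """Apply the address pre-processing steps before fuzzy matching."""
    addr = address.lower().strip()
    addr = re.sub(r"[,\s]*\b(usa|united states)\b\.?$", "", addr)  # drop country from the end
    addr = re.sub(r"\b(\d{5})-\d{4}\b", r"\1", addr)               # "12345-6789" -> "12345"
    for full, abbr in {**STATE_ABBREVIATIONS, **ADDRESS_TERMS}.items():
        addr = re.sub(rf"\b{full}\b", abbr, addr)
    addr = addr.replace("p.o. box", "").replace("po box", "")
    return re.sub(r"[^a-z0-9]", "", addr)                          # strip punctuation and spaces


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def addresses_match(ground_truth: str, extracted: str, threshold: float = 0.15) -> bool:
    """Match if Levenshtein distance divided by the average string length is <= threshold."""
    gt, ex = normalize_address(ground_truth), normalize_address(extracted)
    avg_len = (len(gt) + len(ex)) / 2
    return avg_len > 0 and levenshtein(gt, ex) / avg_len <= threshold


print(addresses_match("5482 Wilshire Boulevard, Los Angeles, California 90036-1234, USA",
                      "5482 Wilshire Blvd Los Angeles CA 90036"))  # True
```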
Phone numbers:
Pre-processing steps:
We focus on the actual digits
For numbers with 8+ digits, we match the last 8 digits
For shorter numbers, all digits must match
Matching
If the ground truth number has at least 8 digits, at least the last 8 digits must match. If the phone number is less than 8 digits, all digits must match.
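A minimal sketch of the phone number rule above; the function name is illustrative and this is not Veryfi's actual code.

```python
import re

def phones_match(ground_truth: str, extracted: str) -> bool:
    """Compare phone numbers using digits only; long numbers match on their last 8 digits."""
    gt_digits = re.sub(r"\D", "", ground_truth)
    ex_digits = re.sub(r"\D", "", extracted)
    if len(gt_digits) >= 8:
        return gt_digits[-8:] == ex_digits[-8:]
    return gt_digits == ex_digits

print(phones_match("+1 (415) 555-0123", "415-555-0123"))  # True: last 8 digits agree
print(phones_match("555-0123", "555-0124"))               # False: short numbers must match exactly
```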
Names:
This applies to the fields "vendor.name", "bill_to.name", "ship_to.name", and "vendor.raw_name".
Pre-processing steps:
We remove common business prefixes (e.g. "the", "sarl")
Remove common suffixes (e.g. "inc", "llc")
Remove all special characters and spaces, but keep multilingual characters (e.g. "é", Chinese characters)
Matching
In short: names are considered matching if they're at least 85% similar
🧑🏻🏫 Explained: We calculate the Levenshtein distance between the ground truth and extracted values and divide it by the average length of the two strings. It is considered a match if this value is at or below 0.15, i.e. at least 85% similarity. Additionally, if both strings start with the same characters and the overlapping characters cover at least 50% of the length of the longer of the two names, it is a match.
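The 50% prefix-overlap rule can be sketched as follows. This is not Veryfi's actual code: the prefix and suffix lists are illustrative subsets, and in practice the 85% Levenshtein check shown in the address example would be applied alongside this rule.

```python
import re

# Illustrative subsets; the real lists of business prefixes and suffixes are longer.
PREFIXES = ("the", "sarl")
SUFFIXES = ("inc", "llc", "ltd", "corp")


def normalize_name(name: str) -> str:
    """Drop business prefixes/suffixes, then strip spaces and special characters.
    \w keeps accented and CJK characters, so multilingual names survive."""
    words = name.lower().split()
    while words and words[0].strip(".,") in PREFIXES:
        words.pop(0)
    while words and words[-1].strip(".,") in SUFFIXES:
        words.pop()
    return re.sub(r"[^\w]|_", "", "".join(words))


def prefix_overlap_match(ground_truth: str, extracted: str) -> bool:
    """Secondary rule: a match if the shared leading characters cover
    at least half of the longer name."""
    gt, ex = normalize_name(ground_truth), normalize_name(extracted)
    common = 0
    for a, b in zip(gt, ex):
        if a != b:
            break
        common += 1
    longer = max(len(gt), len(ex))
    return longer > 0 and common >= 0.5 * longer


print(prefix_overlap_match("The Acme Corporation Inc.", "Acme Corporation"))  # True
```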
Dates:
Date fields in our JSON response are datetime objects, but only the date portion is used for matching; the time is ignored.
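For example, under this rule two datetimes that differ only in the time of day still match. A minimal illustration (not Veryfi's code):

```python
from datetime import datetime

def dates_match(ground_truth: datetime, extracted: datetime) -> bool:
    """Only the calendar date is compared; the time portion is ignored."""
    return ground_truth.date() == extracted.date()

print(dates_match(datetime(2024, 3, 15, 0, 0),
                  datetime(2024, 3, 15, 14, 30)))  # True: same date, different times
```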
B. Fuzzy matching (array-like fields) / Special Handling for Line Items
line_items and tax_lines can appear more than once in a document, so the individual items must be matched up before they can be compared; the matching algorithm is explained below. All fields of array-like objects use exact matching, except for the line item description.
Line_items.description
Pre-processing:
Remove all special characters and spaces, but keep multilingual characters (e.g. "é", Chinese characters)
Matching
In short: Descriptions are considered matching if they're at least 90% similar
🧑🏻🏫 Explained: We calculate the Levenshtein distance between the ground truth and extracted values and divide it by the average length of the two strings. It is considered a match if this value is at or below 0.1, i.e. at least 90% similarity. Additionally, if both strings start with the same characters and the overlapping characters cover at least 50% of the length of the longer of the two strings, it is a match.
Array field matching
Sometimes line_items can be skipped or broken down, so each line_item in the extraction must be matched to a line_item in the ground truth. We use a single field for this matching: the first one from this list that is present in the report: total, description, full_description, price. If no field from this list is in the report, the first field added to the report is used. The algorithm used for matching is Hunt-Szymanski.
Hunt-Szymanski algorithm:
Given two sentences, the algorithm identifies the longest common subsequence between them, allowing for gaps between matched words. As an example:
Let's say you have two sentences:
This is a good sample sequence for the example
There is a good sample sequence used for an example
An exact match between two words is required for them to count as matching; the resulting subsequence would be:
is a good sample sequence for example
Obtained from:
This is a good sample sequence for the example
There is a good sample sequence used for an example
In the case of line item totals, let's say the ground truth has five line items with the totals 1.0, 2.0, 3.0, 5.0, 6.0, and the model prediction has six line items with the totals 1.0, 1.0, 2.0, 4.0, 5.0, 6.0. The resulting match would be 1.0, 2.0, 5.0, 6.0, obtained from:
Ground truth: 1.0, 2.0, 3.0, 5.0, 6.0
Prediction: 1.0, 1.0, 2.0, 4.0, 5.0, 6.0
The line items in the report would be:
Index | Ground truth total | Extracted total |
1 | 1.0 | 1.0 |
2 | null | 1.0 |
3 | 2.0 | 2.0 |
4 | 3.0 | 4.0 |
5 | 5.0 | 5.0 |
6 | 6.0 | 6.0 |
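To make the alignment concrete, the sketch below reproduces the table above. It uses a plain dynamic-programming longest common subsequence for clarity rather than Hunt-Szymanski itself (which computes the same subsequence more efficiently), and the pairing of unmatched items into shared rows is an assumption based on how the example rows are shown; all names are illustrative.

```python
from itertools import zip_longest

def align_line_items(ground_truth, extracted):
    """Align two lists of totals on their longest common subsequence (LCS).
    Unmatched items between two matches are paired up row by row, with None
    (shown as null in the report) filling any leftover positions."""
    n, m = len(ground_truth), len(extracted)
    # dp[i][j] = length of the LCS of ground_truth[i:] and extracted[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if ground_truth[i] == extracted[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    rows, i, j = [], 0, 0
    gap_gt, gap_ex = [], []

    def flush():
        # Pair up the unmatched items collected since the last match.
        rows.extend(zip_longest(gap_gt, gap_ex))
        gap_gt.clear(); gap_ex.clear()

    while i < n and j < m:
        if ground_truth[i] == extracted[j]:
            flush()
            rows.append((ground_truth[i], extracted[j])); i += 1; j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            gap_gt.append(ground_truth[i]); i += 1
        else:
            gap_ex.append(extracted[j]); j += 1
    gap_gt.extend(ground_truth[i:]); gap_ex.extend(extracted[j:])
    flush()
    return rows

gt = [1.0, 2.0, 3.0, 5.0, 6.0]           # ground truth totals
pred = [1.0, 1.0, 2.0, 4.0, 5.0, 6.0]    # extracted totals
for index, (g, e) in enumerate(align_line_items(gt, pred), 1):
    print(index, g, e)   # reproduces the table: row 2 -> (None, 1.0), row 4 -> (3.0, 4.0)
```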
Levenshtein distance:
This distance is defined as the minimum number of edits necessary to transform one string of text into another, where an edit is an insertion, deletion, or substitution of a single character. Let's say you have two line item descriptions:
Starbucks large ch@ramel vanilla coldmacchiato
$tarbuckslarge caramel vanilla cold macchiato
The result is the same regardless of which string we start with, so let's take the second string and transform it into the first one:
First, we edit the first character, turning the $ into an S, and get:
Second string: Starbuckslarge caramel vanilla cold macchiato
Original string: Starbucks large ch@ramel vanilla coldmacchiato
Then, we insert a space between Starbucks and large:
Second string: Starbucks large caramel vanilla cold macchiato
Original string: Starbucks large ch@ramel vanilla coldmacchiato
Then, we insert an h in caramel:
Second string: Starbucks large charamel vanilla cold macchiato
Original string: Starbucks large ch@ramel vanilla coldmacchiato
Then, we replace the first a in charamel with an @:
Second string: Starbucks large ch@ramel vanilla cold macchiato
Original string: Starbucks large ch@ramel vanilla coldmacchiato
Finally, we delete the space between cold and macchiato:
Second string: Starbucks large ch@ramel vanilla coldmacchiato
Original string: Starbucks large ch@ramel vanilla coldmacchiato
We have made a total of five edits to go from one string to the other, so the Levenshtein distance between the two strings is 5. To turn this into the metric we commonly use, we divide by the average length of the two original strings: the first string is 46 characters long and the second is 45, so 5 edits over 45.5 is approximately 0.11. Since the description threshold is 0.1 (10%), these two strings would narrowly fail to match, although they would match under the 0.15 (15%) threshold used for names and addresses.
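You can verify the arithmetic with a short Python sketch (an illustrative implementation, not Veryfi's code):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

gt = "Starbucks large ch@ramel vanilla coldmacchiato"
extracted = "$tarbuckslarge caramel vanilla cold macchiato"
distance = levenshtein(gt, extracted)
ratio = distance / ((len(gt) + len(extracted)) / 2)
print(distance)           # 5 edits
print(round(ratio, 2))    # ~0.11, just above the 0.1 description threshold
```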
What does the F1 Score mean?
The F1 score is a metric commonly used to describe the performance of AI models. It is especially useful when dealing with unbalanced datasets, where one outcome is much more likely than the others. Let's begin with the definitions we use for our case:
True Positive: occurs when a ground truth value is not empty and matches the extracted value.
False Positive: occurs when a ground truth value is not empty and does not match the extracted value.
Recall is computed as: Number of True Positives / Number of cases where the ground truth value is neither empty nor null
Precision is computed as: Number of True Positives / (Number of True Positives + Number of False Positives)
F1 is computed as: 2 * Precision * Recall / (Precision + Recall)
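A small sketch of how these formulas combine, following the definitions above; the row format and function name are illustrative, not part of the Veryfi API.

```python
def f1_report(rows):
    """Compute precision, recall, and F1 from (ground_truth, extracted, is_match) rows,
    following the definitions above. Rows where the ground truth is null are not counted."""
    true_positives = sum(1 for gt, ex, match in rows if gt is not None and match)
    false_positives = sum(1 for gt, ex, match in rows if gt is not None and not match)
    gt_not_empty = sum(1 for gt, _, _ in rows if gt is not None)
    recall = true_positives / gt_not_empty if gt_not_empty else 0.0
    denom = true_positives + false_positives
    precision = true_positives / denom if denom else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: 8 fields with a non-empty ground truth value, 6 extracted correctly,
# plus one null-to-null case that is ignored.
rows = [("a", "a", True)] * 6 + [("b", "c", False)] * 2 + [(None, None, True)]
print(f1_report(rows))  # precision 0.75, recall 0.75, F1 0.75
```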
Sample categorization of matching types:
GroundTruth Value | Extracted Value | Is match? | Match type |
5482 Wilshire Blvd Box 1589 Los Angeles, CA 90036 USA | 5482 Wilshire Blvd Box 1589 Los Angeles, CA 90036 | Yes | |
5482 Wilshire Blvd Box 1589 Los Angeles, CA 90036 USA | 5482 Wilshire Blvd Los Angeles, CA 90036 | No | |
null | 5482 Wilshire Blvd Los Angeles, CA 90036 | No | |
5482 Wilshire Blvd Box 1589 Los Angeles, CA 90036 USA | null | No | |
null | null | Yes | |
The values of any objects generated during the array field matching step (e.g. for unmatched line items) are all treated as null
For array-like fields, larger documents (with a high number of array values) will have a higher impact on the metrics.
Null-to-null matches are shown in gray in the extraction tab
Common Questions
Q: What happens if a field is empty in both the extraction and ground truth?
A: These are not counted in accuracy calculations since there's nothing to compare.
Q: How do you handle international characters and special symbols?
A: Our system preserves international characters (like "é") and handles special characters intelligently based on the field type.
Q: Can I create different reports for different document types or issues?
A: Yes! You can create separate reports for different document types, vendors, or fields to track performance across various scenarios.
Best Practices for Accuracy Reports
Setting Up Your Ground Truth
Create your ground truth data directly in Veryfi Hub using the New Document Details view
This ensures consistency and proper formatting
Makes it easier to track and manage verified data
Tag documents systematically and consistently
Use clear, consistent tags for your ground truth documents
Example tags: "verified_vendor_name", "verified_line_items", "2024_Q1_review"
This makes filtering and organizing documents much simpler
Verify before measuring
Always review and approve your ground truth data before using it in reports
Double-check that the verified values are correct
Creating Effective Reports
Give your reports clear names
Use descriptive names that indicate purpose and scope
Example: "Q1-2024_Vendor-Name-Accuracy" or "Healthcare-Invoice-Line-Items"
Add detailed descriptions
Include the scope of what's being measured
Note any special conditions or filters applied
Document why this report was created
Maintaining Quality
Focus on verified fields only
Only measure accuracy for fields you've manually reviewed
Track multiple aspects
Create separate reports for different document types/vendors/fields
Track accuracy changes over time
Need Help?
If you have specific questions about your Accuracy Reports or need help interpreting the results, please reach out to your Technical Account Manager for detailed guidance or send an email to [email protected]