Factors that affect Data Extraction
In today's data-driven world, the efficiency and quality of data extraction are crucial. However, achieving accurate extraction can be challenging, because many factors influence the quality of the data obtained from a document.
Let's explore the intricacies of data extraction from receipts and invoices and the factors that can affect its accuracy: from the clarity of printed text to the layout of the document, and from the quality of scanning equipment to the request parameters used in the API calls that submit a document. We will delve into the key elements that can make or break accuracy. The information below can also serve as a self-debugging guide for users who have issues with data extraction.
For clarity, let's break the factors down into two groups: indirect and direct.
Indirect Factors
Document and Image Quality
Indirect causes are usually associated with document or image quality. Checking image or scan quality is the first step in debugging and in understanding the root cause of extraction errors.
Poor document quality can result in:
Misread Characters: Blurry or distorted images can cause OCR to misinterpret characters, leading to incorrect text extraction. This can result in wrong numbers, dates, or words being extracted from the document, e.g., OCR commonly reads 7 as 1 or 5 as 3.
Incomplete Extraction: If parts of the text or information are too faint or unclear due to poor quality, OCR might miss those portions altogether. This can lead to incomplete or missing data in the final output, e.g., OCR reads a total of 533 as 53 because the last digit is unclear in the image.
Data Corruption: Poor image quality can introduce artifacts or noise in the image, which can be misinterpreted by OCR as valid characters or symbols. This can corrupt the extracted data.
Invalid Entries: OCR might extract gibberish or meaningless text from poor-quality images, especially if the image contains smudges, stains, or illegible ink. This can result in invalid or nonsensical data entries, e.g., bleed-through text interfering with a line item description.
Samples for Bleed through text:
Misplaced Information: Blurriness, skew, or distortion can cause OCR to misplace extracted information. For example, a blurred decimal point might lead to incorrect amounts being extracted, as OCR might either miss it completely or read a dot as a comma.
Loss of Context: Poor-quality images might lack the visual context required for accurate interpretation. This can affect the correct identification and extraction of important details such as transaction dates, vendor names, or item descriptions.
False Positives and Negatives: Poor-quality images can lead to false positives (incorrectly identified information) and false negatives (missed information). This can have cascading effects on downstream processes that rely on accurate data.
Veryfi document requirements
For .pdf documents, 300 DPI is optimal. The optimal image resolution depends on the size of the print in the image. Generally, at least 1000px on the smaller dimension is recommended to avoid blur.
For images, the image size should be at least 100x100 pixels.
Please note that file compression tools reduce the image size and can thus affect data extraction accuracy.
Read more about File Requirements
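The size requirements above can be checked on your side before a file is submitted. The snippet below is a minimal sketch, not part of any Veryfi SDK: the function name and constants are hypothetical, and it assumes you already know the image's pixel width and height.

```python
MIN_SIDE_PX = 100                  # hard minimum from the requirements above (100x100)
RECOMMENDED_SHORT_SIDE_PX = 1000   # recommended size of the smaller dimension


def check_image_size(width: int, height: int) -> list[str]:
    """Return human-readable warnings for an image of the given pixel size."""
    warnings = []
    short_side = min(width, height)
    if short_side < MIN_SIDE_PX:
        warnings.append(
            f"Image is {width}x{height}; the minimum is {MIN_SIDE_PX}x{MIN_SIDE_PX}."
        )
    elif short_side < RECOMMENDED_SHORT_SIDE_PX:
        warnings.append(
            f"Smaller dimension is {short_side}px; "
            f"{RECOMMENDED_SHORT_SIDE_PX}px or more is recommended to avoid blur."
        )
    return warnings


print(check_image_size(600, 800))
```

An empty list means the image meets both the minimum and the recommended size.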
Signals Veryfi returns
In the JSON response from Veryfi, there are specific indicators that offer insights into the document's quality and whether you can fully trust the extraction results:
Out-of-the-box signals
Blur Detection
One of the indicators is is_blurry. If is_blurry is set to true, it suggests that the document may be blurry. In such cases, it's best to exercise caution, since the extraction results could be less reliable and may require additional validation.
meta.ocr_score can serve as a signal of image quality and overall readability of a document. This is a composite score that combines two aspects of OCR text quality: the average ocr_score of all extracted fields and the average ocr_score of all ocr_text.
Build your own document trust score
Confidence Details
a) ocr_score (per field), which is part of the confidence details for important fields. If the ocr_score is lower than 0.8, it's a signal that the extraction results may not be as stable. This lower score indicates that you should carefully review and validate the extracted data for accuracy.
b) score (per field), a confidence score that represents the confidence of mapping an extracted value to a particular field in the JSON.
Image Size and Resolution
Consider the image size in terms of width and height. If either the width or the height is equal to or less than 500 pixels, it's a factor to take into account. The relevant fields are meta.pages_height and meta.pages_width.
These indicators serve as valuable tools for assessing document quality and the reliability of extracted data, helping you make informed decisions about the usability of the extraction results.
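As one building block for your own trust score, the per-field ocr_score check described above could be sketched as follows. This is an illustration under assumptions: the helper name is hypothetical, and it assumes that fields returned with confidence details look like {"value": ..., "ocr_score": ..., "score": ...}. Check the shape of your actual JSON response before relying on it.

```python
OCR_SCORE_THRESHOLD = 0.8  # per-field threshold mentioned above


def low_confidence_fields(document: dict, threshold: float = OCR_SCORE_THRESHOLD) -> dict:
    """Map field name -> ocr_score for every field scoring below the threshold.

    Assumed field shape: {"value": ..., "ocr_score": 0.97, "score": 0.99}.
    Anything that does not match this shape is skipped.
    """
    flagged = {}
    for name, field in document.items():
        if not isinstance(field, dict) or "value" not in field:
            continue
        ocr_score = field.get("ocr_score")
        if isinstance(ocr_score, (int, float)) and ocr_score < threshold:
            flagged[name] = ocr_score
    return flagged


# Hypothetical response fragment, for illustration only.
sample = {
    "total": {"value": 53.0, "ocr_score": 0.61, "score": 0.90},
    "date": {"value": "2023-09-01", "ocr_score": 0.98, "score": 0.99},
    "meta": {"ocr_score": 0.95},
}
print(low_confidence_fields(sample))  # {'total': 0.61}
```

Fields returned by this helper are the ones to route to manual review first.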
☎️ Read More: Confidence Details and Blur Detection and Image score
How to self-debug
The first thing to do when you review a document that has an error in data extraction is to ensure the image/document quality is suitable for accurate data extraction.
Examine the document quality, resolution, and size for compatibility with the requirements
Check for any warping, skew, crumpling, blur, or bleed-through text that might hinder accurate data extraction
Check ocr_text to verify that the OCRed values are correct
Look at meta.ocr_score; if it is lower than 0.92, the OCR text might be corrupted
Look at the JSON signals Veryfi returns
What to do
If the underlying issue affecting data extraction is indeed related to image quality and the deskewing algorithms used by Veryfi, there are regrettably few actions we can take. Instead, we recommend the following steps:
Educate Your End Users
Emphasize the significance of providing high-quality scans and documents to your end users. Share Veryfi's best practices for capturing documents.
Here are a few tips on how to achieve great image quality
Here are some best practices for improving the quality of captured receipts to ensure better extraction results.
Steady Camera Technique and Adequate Lighting:
When capturing a receipt, it's important to hold your camera (or phone) steady as you press the button to take the picture. This stability prevents motion-induced blur, which often occurs when taking photos, especially in low-light conditions when the camera automatically slows the shutter speed to compensate for poor lighting.
Background Choice and Avoiding Glare and Shadows:
To enhance receipt detection, consider using a simple or black background to place the receipt on before capturing it. A clean background helps the system identify and crop the receipt accurately, reducing background noise and improving the accuracy of the OCR (Optical Character Recognition) process. Also, position the receipt in a way that minimizes glare or shadows, as these can obscure text and make it challenging for OCR to accurately recognize characters.
Wrinkle Reduction and Use the Right Camera Settings:
Before capturing receipts, flatten any wrinkled or crumpled ones to create a smoother surface, which makes the receipt more readable for OCR, ensuring that characters and data are accurately extracted. Additionally, familiarize yourself with the camera settings on your device, including focus and exposure settings. Adjust these settings as needed to ensure optimal image quality.
Proper Angles, Capture Area, and Capture the Entire Receipt:
Ensure that the receipt is captured from angles within the designated capture area. Make sure no important data or information is left outside the capture area during the photo-taking process. This practice helps the system recognize and process all relevant details on the receipt. Also, ensure that the entire receipt, including all edges, is captured within the frame. This prevents the loss of critical information and ensures complete data extraction.
By following these best practices, you can significantly improve the quality of captured receipts, resulting in more accurate data extraction and better results.
Implement Quality Alerts
Set up logic on your side that detects and flags poor-quality submissions based on the signals returned by Veryfi.
This proactive approach can help prevent inaccurate data extraction and allow for necessary corrective actions to be taken. By following these recommendations, you can enhance the quality of data submissions, mitigate accuracy issues, and ultimately optimize the performance of your data extraction processes.
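A quality alert of this kind could look like the sketch below. It uses the signals and thresholds mentioned in this article (is_blurry, meta.ocr_score below 0.92, width or height of 500 pixels or less). The function names and alert labels are hypothetical, and the code accepts each signal either as a single value or as a per-page list, because the exact response shape is an assumption here.

```python
def _as_list(value):
    """Treat a scalar or a per-page list uniformly; None becomes an empty list."""
    if value is None:
        return []
    return list(value) if isinstance(value, (list, tuple)) else [value]


def quality_alerts(response: dict, min_ocr_score: float = 0.92, min_side_px: int = 500) -> list[str]:
    """Return alert labels for a response that looks like a poor-quality submission."""
    alerts = []
    if any(_as_list(response.get("is_blurry"))):
        alerts.append("blurry")
    meta = response.get("meta") or {}
    ocr_score = meta.get("ocr_score")
    if ocr_score is not None and ocr_score < min_ocr_score:
        alerts.append("low_ocr_score")
    # Assumed field names, taken from this article: meta.pages_height / meta.pages_width.
    sides = _as_list(meta.get("pages_height")) + _as_list(meta.get("pages_width"))
    if any(side <= min_side_px for side in sides):
        alerts.append("small_image")
    return alerts


poor = {"is_blurry": True, "meta": {"ocr_score": 0.85, "pages_height": 480, "pages_width": 1200}}
print(quality_alerts(poor))  # ['blurry', 'low_ocr_score', 'small_image']
```

A non-empty result could trigger a request to the end user to recapture the document before the extracted data is used downstream.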
📲 Consider using Veryfi Lens for Mobile or Browser
To ensure the highest quality of document capture, consider implementing Veryfi Lens, available for both mobile and browser platforms, which has embedded quality checks and tools for capturing higher-quality receipts.
Learn more: Lens for Mobile & Lens for Browser.
Indirect: Request Parameters and Account Settings
Some parameters and account settings on the user side control how the data extraction process is handled.
Request Parameters
Request parameters are data sent along with the request to the Veryfi API. These parameters provide additional information to Veryfi, allowing it to understand and process the request correctly. Request parameters are typically included in the URL or the body of an HTTP request.
boost_mode
For example, if line_item_quantity is not extracted and you can rule out image quality as the cause, check whether you are using boost mode to speed up processing. Some fields are enriched on the Veryfi side on top of data extraction, and boost_mode, if added to the call, skips some enrichment and calculation steps.
max_pages_to_process
Controls how many pages of the document Veryfi should read and extract. The default limit is 15 pages. A common case is that a user sets this parameter to 1 or 2 pages, which means that all later pages are ignored; if important information is on the third page, it is expected not to be extracted.
compute
Veryfi uses enrichments on several fields to provide high extraction coverage when the data is not present in the document or not extracted from it. For example, Veryfi will calculate line item quantity when line item price and line item total are present but line item quantity is not extracted via the Machine Learning (ML) model. In some cases, you may not want Veryfi to make the calculations and prefer to control the enrichment behavior in your code.
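If you prefer to handle this enrichment in your own code, the quantity calculation described above is a simple division of line item total by line item price. The sketch below is an illustration only: the function name is hypothetical, and it does not claim to reproduce Veryfi's own calculation or rounding.

```python
from typing import Optional


def derive_quantity(price: Optional[float], total: Optional[float]) -> Optional[float]:
    """Derive a line item quantity as total / price, or None if it cannot be derived."""
    if price is None or total is None or price == 0:
        return None
    return round(total / price, 3)


print(derive_quantity(2.5, 10.0))  # 4.0
```

Returning None rather than a guess keeps a missing quantity visibly missing, so your code can decide how to handle it.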
Read more about Boost mode, Max pages to process, and Compute flag.
You can also find a list of all request parameters in API POST schema.
Settings and fields that you may request Veryfi to add to your account:
Enable Barcode Detection "barcodes" Explained
Disable Rotation & Cropping if you already send cropped documents
Turn off "Personalization", which skips the account context settings during data extraction.
User Account Settings
User account settings, such as country, currency, categories list, vendor list, can indeed affect the data extraction process in various ways. User account settings are part of the so-called mask model which could provide context to the data extraction process.
Currency / Country / Region
Each account has options to a) set a default currency and enable or disable auto-currency detection, and b) set the country and region.
By default, if auto-currency is not enabled for your user account, Veryfi will overwrite the extracted currency with the default one. Also, if Veryfi cannot identify the currency from the document, user account context such as currency and country will be used as a last resort. Profile settings link.
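The currency rule above can be restated as a small decision function, which may help when you compare an unexpected currency_code against your account settings. This is only a restatement of the rule as described in this article, not Veryfi's implementation; the function and argument names are hypothetical.

```python
from typing import Optional


def resolve_currency(extracted: Optional[str], default_currency: str, auto_detect: bool) -> str:
    """Illustrate which currency ends up in the result under the rule described above."""
    if not auto_detect:
        # Auto-currency disabled: the account default overwrites the extracted value.
        return default_currency
    if extracted:
        # Auto-currency enabled and a currency was found on the document.
        return extracted
    # Nothing found on the document: account context is used as a last resort.
    return default_currency


print(resolve_currency("EUR", "USD", auto_detect=False))  # USD
print(resolve_currency("EUR", "USD", auto_detect=True))   # EUR
```

If the first case matches what you see, enabling auto-currency detection in the account profile is the setting to test.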
Categories
Veryfi OCR API utilizes smart categorization at both the document and line item levels.
All Veryfi user accounts are created with a standard list of Expense Categories (COA) that users can customize. Users can either modify the existing list of categories in their user account or send along a list of categories for the Veryfi model to pick the best choice from. Read more about Categories
Vendors
Every new vendor Veryfi extracts is recorded in the Vendors section of your Veryfi account. If you struggle with vendor extraction, you can add a new vendor via the web portal; the next time the model is hesitant about vendor extraction, your vendor list will be used for matching as a last resort. Learn more about how to add a new Vendor in the web portal.
How to self-debug
If image/document quality is not the root cause, please:
Investigate if there were any request parameters added to a POST request that could affect extraction results.
Investigate if there are any account settings that could affect/overwrite the extraction results.
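The first check can be partly automated by scanning the request payload for the parameters discussed earlier. The sketch below is hypothetical helper code: it assumes boost_mode and compute are sent as booleans and max_pages_to_process as an integer, so confirm the value types against the API POST schema.

```python
DEFAULT_MAX_PAGES = 15  # default page limit stated in this article


def audit_request_parameters(payload: dict) -> list[str]:
    """Return notes about request parameters that may explain missing or changed data."""
    notes = []
    if payload.get("boost_mode"):
        notes.append("boost_mode is on: some enrichment and calculation steps are skipped.")
    max_pages = payload.get("max_pages_to_process")
    if max_pages is not None and max_pages < DEFAULT_MAX_PAGES:
        notes.append(f"max_pages_to_process={max_pages}: later pages are ignored.")
    if payload.get("compute") is False:
        notes.append("compute is off: calculated values such as line item quantity are not filled in.")
    return notes


# Hypothetical payload, for illustration only.
payload = {"file_name": "invoice.pdf", "boost_mode": True, "max_pages_to_process": 2, "compute": False}
for note in audit_request_parameters(payload):
    print(note)
```

An empty result means none of these three parameters deviates from its default, so account settings are the next place to look.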
What to do
Test with adjusted request parameters
Test with adjusted account settings for currency/country/vendor
For more convenience, you might be eligible for a Sandbox account. Read more about Dev profiles and additional pairs of API Keys.
Direct Cause
Direct Causes are usually tied to data representation and supported/unsupported document type.
Direct: Document type
This covers whether the document type is supported, as well as field maturity and language support.
Veryfi has at least 7 public Data Extraction APIs:
Receipts/Invoices API
Bank Statements API
Bank Checks API
W-2 API
W-9 API
Business Cards API
Contracts API
While this is clear for W-2, W-9, Bank Statements, Contracts, and Bank Checks, as they have a more or less conventional format and expected context, the Receipts/Invoices API supports a great variety of document types.
There is a myriad of documents out there; for some, support is more confident, and for others, less confident.
For example, high confidence could be expected for
Invoices like: Standard Invoices, Service Invoices, Credit Invoices, and Commercial Invoices; Purchase order numbers, Blanket Purchase Orders, Contract Purchase Orders
Receipts like: Retail Point of Sale Receipts, Travel Expense Receipts, Invoice Receipts
Slightly less confidence for
Invoices: Proforma Invoices, Recurring Invoices, Debit Invoices, Interim Invoices, Freight Invoices, and Expense Reports; Purchase orders: Advance Purchase Order, Single-use Purchase Order, Recurring Purchase Order, Drop Ship Purchase Order; Bills: Power, Electricity, Garbage, Insurance, and Property Tax Bills
Receipts like: Rental receipts, Utility Bill receipts, Service Receipts, Medical expense receipts, Donation receipts, Online Order confirmations, Tickets or Reservation receipts.
Minimal support could be for:
Invoices like: Consolidated Invoices and E-invoices, which can come in various formats such as XML or EDI; Bills of Lading; Packing Lists
Note that the document types and confidence levels above are assessed against the 170+ fields Veryfi extracts for the Receipts/Invoices API. Depending on your specific case, documents with minimal support could reach 100% accuracy on a simple set of fields, and vice versa.
Direct: Data representation
The model is confident in extraction results for frequent vendors that represent particular domains and industries. For low-volume cases, extraction results may not be stable until confidence is gained by providing more data for similar cases. Each domain, industry, or vendor may have its own unique challenges, formats, and terminology.
Confidence in Frequent Vendors:
When it comes to vendors that are frequently encountered within a particular domain or industry, the model tends to exhibit a higher level of confidence in its extraction results.
This confidence is a result of the model being extensively trained and fine-tuned on data related to these commonly encountered vendors. It has learned the nuances, patterns, and typical data structures associated with these vendors.
Challenges of Low-Volume Cases:
In contrast, when dealing with low-volume cases or vendors that are less commonly encountered, the model's confidence in extraction results may not be as stable.
Low-volume cases present unique challenges because the model may not have had the opportunity to encounter and learn from a sufficient amount of data related to these vendors. Consequently, it may struggle to accurately interpret and extract data from such cases.
Gaining Confidence Through Data:
The model's ability to improve its confidence in low-volume cases primarily relies on the availability of more data for similar cases. As it encounters and processes more instances of low-volume vendors, it can learn from the patterns and variations present in the data.
Over time, with exposure to a broader dataset, the model becomes more adept at handling these low-volume cases, leading to increased confidence in the extraction results.
Unique Challenges, Formats, and Terminology:
It's important to recognize that each domain, industry, and vendor may introduce its own set of unique challenges, document formats, and industry-specific terminology.
For instance, an invoice from a healthcare provider may have different data structures and terminologies compared to an invoice from a retail supplier. The model needs to adapt to these variations to ensure accurate extraction.
Continuous Learning and Adaptation:
To address the diversity of challenges posed by different domains, industries, and vendors, the model often undergoes continuous learning and adaptation.
This learning process involves updating the model with new data and refining its algorithms to handle a broader range of cases effectively.
The model's confidence in data extraction results is closely tied to the frequency of encounters with specific vendors, domains, and industries. While it may exhibit high confidence for common cases, it may require more exposure to low-volume or unique cases to achieve stability and accuracy.
What to do
Feedback loop
It is also possible that the model has a vast amount of data representation for a particular case but needs some fine-tuning training on how to extract the data accurately. This is where model training can be beneficial. Based on the volume and information provided, we can customize the training to improve accuracy based on your project requirements. In the case of low-volume vendors, there may be ample feedback available to improve extraction accuracy. However, for frequent vendors, there might be a lack of specific feedback in this particular field, causing the model to be unaware of the correct extraction method.
During model training sessions, all the changes made by users to processed documents are incorporated. This means that any corrections, updates, or feedback provided by users directly contribute to the refinement and enhancement of the model.
Whenever you encounter any inaccuracies or errors in the extracted data, you can correct them using the PUT, POST, and DELETE operations provided by Veryfi's API. By making these corrections, you provide valuable feedback that helps train and refine the Machine Learning models.
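As an illustration of sending a correction, the sketch below builds (but does not send) a PUT request with Python's standard library. The base URL, the header names, and the "apikey username:key" format are assumptions written from memory of Veryfi's public API, and the function name is hypothetical; confirm all of them against the API reference before use.

```python
import json
import urllib.request

# Assumption: endpoint and header names; verify against the Veryfi API reference.
BASE_URL = "https://api.veryfi.com/api/v8/partner/documents"


def build_correction_request(
    document_id: int, corrections: dict, client_id: str, username: str, api_key: str
) -> urllib.request.Request:
    """Build a PUT request that updates fields on an already processed document."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{document_id}",
        data=json.dumps(corrections).encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "CLIENT-ID": client_id,
            "AUTHORIZATION": f"apikey {username}:{api_key}",
        },
    )


# Correct a total that OCR read as 53 instead of 533 (placeholder credentials).
request = build_correction_request(123, {"total": 533.0}, "my-client-id", "my-username", "my-api-key")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it; it is left unsent here so the example can run without credentials.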
Read more about Model Training at Veryfi. If you continue to encounter issues with the new model failing to read specific parts of your documents, we are here to assist you.
How to report Data Extraction issues
Please reach out to our support team at support@veryfi.com. Provide us with comprehensive details and context regarding the problem.
Collect examples of the issue (Document ID and/or source document file)
Provide expected results
Provide background on Severity and Priority
Estimated impact on a specific field across multiple vendors
Estimated impact for a specific document/vendor
Attach the original file and JSON response, and specifically highlight the exact parameter that Veryfi is unable to extract. Our data team will review your case.
By collaborating and exchanging data from both ends, we can expedite the process of enhancing accuracy.
Our goal is to ensure that Veryfi consistently delivers accurate results, and we appreciate your proactive engagement in helping us achieve this objective.
Please reach out to support@veryfi.com if you have any questions.