Custom Text Field Extraction using Regex

Pattern Matching Data Extraction OCR Text

Veryfi’s Data Transformation service now allows you to run Regex to customize the results of Veryfi’s real-time AI Data Extraction.

How this works

The Veryfi platform is made up of three parts that work together like a Swiss watch.

  1. Document Capture collects invoices, receipts and bills at the point of engagement. For example, Veryfi Email Engine consumes POS digital receipts at the point of issue, and Veryfi Lens consumes paper documents at the point of engagement using your mobile app.

  2. Data Extraction then analyzes these unstructured documents and returns structured data as JSON. This JSON (also known as Level 3 data) can then be used to automate expense management, bill pay, market research into consumer spending behavior, and bookkeeping, especially construction job costing.

  3. Data Transformation is where we take it up a notch and allow you to run custom conditions over the extracted data to get further value out of it. You can think of it as the long tail of data extraction.

In this post we will focus on Data Transformation and the new Regex builder.

Data Transformation

Veryfi API Portal Menu

You can access Veryfi Data Transformations within the Veryfi API Portal by opening up the Data Transformation dropdown (as shown left) and selecting “Rules”.

What is Regex

A regular expression (also referred to as regex or regexp) is a sequence of characters that specifies a search pattern. In formal language theory, such patterns describe regular languages. In short, a regex pattern is matched against a target string.

You come across Regex nearly every day. Websites and online services use it to validate email addresses on signup and login forms, check password strength, validate addresses and phone numbers, and much more.

If you are a developer, crafting these Regex should be simple for you. There are many online reference guides and cheat sheets on Regex, so we won’t repeat them here and will instead focus on how to execute Regex in Veryfi.

You can validate the Regex you write using a tool like https://regex101.com/
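A pattern can also be sanity-checked locally before it goes into a Rule. The following is a minimal sketch using Python’s built-in re module. The pattern, the helper name find_matches and the sample strings are hypothetical, and Veryfi’s Regex flavor may differ from Python’s, so treat this as a quick check rather than a guarantee of how the Rule will behave.

```python
import re

# Hypothetical pattern: a 6-digit number that stands on its own
PATTERN = r"\b\d{6}\b"


def find_matches(pattern: str, text: str) -> list:
    """Return every non-overlapping match of pattern in text."""
    compiled = re.compile(pattern)  # raises re.error if the pattern is invalid
    return compiled.findall(text)


# Hypothetical sample lines, similar to what OCR text might contain
samples = [
    "Store 0042 Ref 686125 Total 12.50",  # contains one 6-digit number
    "No reference on this line",          # contains none
]

for sample in samples:
    print(sample, "->", find_matches(PATTERN, sample))
```

Running a pattern against both a line that should match and a line that should not is a cheap way to catch patterns that are too greedy before they are applied to every document.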

Regex in Rules

Let’s start by adding a new Rule under Data Transformation. Make sure you are here https://hub.veryfi.com/rules/

From the Rules page, press the blue button called “+ Add a Rule”. This opens up a modal where you can add a Rule. From the “Condition” drop down select “Document > OCR Text Contains” and then under “Filter” add your Regex.

You are basically telling the system to run the Regex over the OCR Text, which Veryfi also returns in the Data Extraction response.

The OCR Text this Rule runs over comes from the JSON response of each document’s data extraction. When viewing the JSON for any document you extract, you will find a key named “ocr_text”; your Regex is executed against its value.

...
"invoice_number": "",
"line_items": [
{ ... }
],
"ocr_text": "Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World. Hello World.",
"payment_display_name": "Visa",
"payment_terms": "",
...
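To see what the Rule has to work with, you can pull the “ocr_text” value out of a response yourself and try a pattern against it. The sketch below assumes a trimmed-down, hypothetical response body that reuses only the keys shown above; the OCR text content and the “Job Code:” label are invented for illustration.

```python
import json
import re

# Hypothetical, trimmed-down response that reuses the keys shown above
response_body = """
{
  "invoice_number": "",
  "line_items": [],
  "ocr_text": "ACME Hardware\\nJob Code: J-1042\\nTotal 58.20",
  "payment_display_name": "Visa",
  "payment_terms": ""
}
"""

document = json.loads(response_body)

# The Regex runs over this single string, not over the whole JSON
ocr_text = document.get("ocr_text", "")

# Hypothetical pattern: capture the value that follows a "Job Code:" label
match = re.search(r"Job Code:\s*(\S+)", ocr_text)
job_code = match.group(1) if match else None

print(job_code)
```

Because the pattern only sees the raw OCR string, labels and line breaks in that string are what you anchor on, not the names of the structured JSON fields.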


What can I do with this

The sky’s the limit.

Here are a few use cases:

  • Construction Bookkeeping is the most complex task because extracting line items from invoices, bills and receipts is a must, not an option, for job costing. Then there are the handwritten job codes and other project-specific notes that need to be pulled out for job costing. It’s tedious, repetitive and time-draining. On average it takes 25 minutes per invoice. Veryfi OCR API gets you 95% of the way there, including line items, in seconds, and project specifics (the remaining 5%) can now be handled by Regex.

  • Market Research is another interesting area. It aims to understand consumer spending behavior by analyzing the products purchased (SKU line items) on CPG/FMCG receipts. Consumer packaged goods (CPG) are products that people frequently use and replenish. These items are sometimes called fast-moving consumer goods (FMCG) because of how quickly they sell at Safeway and other retail chains.

    Apart from the usual product SKU line items, market research firms will also want to capture from receipts the shopper’s Loyalty ID or even the EAN128 barcode, like those featured on Walmart receipts. All great cases for Regex. It’s that long tail opportunity which you can now customize and control.

How does this look in practice

A customer asked Veryfi support how to extract a business-specific value (686125) from each of their receipts and put it into the Notes field. Veryfi support advised the customer to add the following Rule with the Regex shown in the Filter field, and then under Action(s) to select the Field to apply the Data Transformation to and what to put in it, {match}.

Receipt where Regex needs to be applied
Setup Veryfi Rules using Regex

Then, when the customer ran the same receipt through Veryfi’s OCR API, the value they wanted was extracted and put into the Notes field.
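The effect of such a Rule can be approximated locally. This is only a rough sketch, not Veryfi’s implementation: the function apply_rule, the sample receipt text and the 6-digit pattern are all hypothetical, and it assumes that {match} stands for the full text matched by the Regex.

```python
import re


def apply_rule(document: dict, pattern: str, target_field: str) -> dict:
    """Local stand-in for an 'OCR Text Contains' Rule (hypothetical helper).

    If the pattern matches the document's ocr_text, return a copy of the
    document with the matched text written to target_field. Otherwise
    return the document unchanged.
    """
    match = re.search(pattern, document.get("ocr_text", ""))
    if match is None:
        return document  # condition not met, nothing to transform
    updated = dict(document)
    updated[target_field] = match.group(0)  # assumed meaning of {match}
    return updated


# Hypothetical receipt text containing the business-specific value
receipt = {"ocr_text": "CORNER MARKET\nREF 686125\nTOTAL 23.10", "notes": ""}

result = apply_rule(receipt, r"\b\d{6}\b", "notes")
print(result["notes"])
```

A receipt whose OCR text contains no match is returned as-is, which mirrors the idea that the Action(s) only apply when the Condition is met.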

Feedback is always welcome, so please let us know what we could do better by emailing us at support@veryfi.com
