This is the first blog post of my behind-the-scenes series on the SAP AI Business Services portfolio. Each blog post will focus on one of the services and answer the following questions:
- What can you do with it?
- What happens behind the scenes?
- How does retraining work?
- What else is possible? – Think outside the box!
Let’s start with the Document Information Extraction service (DOX). The time and effort spent on manually extracting information from the billions of documents that float around in the business world seems almost absurd in today’s age of automation, especially considering all the advances in AI. Extracting text and certain entities from documents sounds easy compared to walking robots, talking phones, and face recognition, but it is not as trivial as it seems. Humans take many different factors into consideration when looking at a document: the layout, the content, the language, the structure of the document, the currency, or even the channel through which it was received. Most AI solutions only look at the text of the document, and some also try to incorporate the layout. With Document Information Extraction, SAP provides a service that automatically extracts information such as invoice numbers, line items, and header fields from documents by taking all of these factors into account.
What can you do with it?
DOX consists of different machine learning models that work together to solve tasks ranging from OCR (optical character recognition) to bounding box regression, semantic segmentation, and the mapping of metadata. Out of the box, DOX can extract information from business cards, invoices, payment advice documents, and purchase orders. Depending on the document type, DOX can extract around 100 fields, for example Unit Price, Document Date, Currency Code, Payment Amount, Tax ID, Street Name, and Due Date. You can also create templates and use them to extract the fields you need from practically any other type of document. Supported file types are PDF, JPEG, PNG, TIFF, the PDF/XML hybrid formats (Factur-X and ZUGFeRD standards), and Excel.
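Interaction with DOX happens through a REST API. Below is a minimal Python sketch of how submitting an invoice and fetching the result could look; the endpoint path, option names, and status values follow the public API documentation, but the service URL and OAuth token come from your own service key, so treat this as an illustration rather than a reference client.

import json
import time
import requests

# Assumptions: SERVICE_URL and ACCESS_TOKEN come from your own DOX service key /
# OAuth client. Endpoint path and option names follow the public API docs and
# may differ for your instance -- this is a sketch, not a reference client.
SERVICE_URL = "https://<your-dox-instance>/document-information-extraction/v1"
ACCESS_TOKEN = "<oauth-token>"
headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

options = {
    "clientId": "default",
    "documentType": "invoice",
    "extraction": {
        "headerFields": ["documentNumber", "grossAmount", "currencyCode"],
        "lineItemFields": ["description", "quantity", "netAmount"],
    },
}

# Submit the document: one multipart request with the file and the options JSON.
with open("invoice.pdf", "rb") as f:
    response = requests.post(
        f"{SERVICE_URL}/document/jobs",
        headers=headers,
        files={"file": ("invoice.pdf", f, "application/pdf")},
        data={"options": json.dumps(options)},
    )
response.raise_for_status()
job_id = response.json()["id"]

# Poll until the extraction job has finished, then print the extracted fields.
while True:
    result = requests.get(f"{SERVICE_URL}/document/jobs/{job_id}", headers=headers).json()
    if result.get("status") in ("DONE", "FAILED"):
        break
    time.sleep(2)

print(json.dumps(result.get("extraction", {}), indent=2))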
What happens behind the scenes?
Depending on the document type and the field that needs to be extracted, different models and algorithms work together. Generally, this is how it works:
Figure 1: Document Information Extraction first looks for barcodes or QR codes in the document (Step 1), then applies OCR (Step 2), looks for keywords (Step 3), applies DocReader (Step 4) and Chargrid (Step 5), and finally performs data enrichment with given metadata (Step 6).
Step 1:
In the first step, DOX scans the document for codes. If a QR code or a barcode is detected within the document, DOX extracts the information from the code. This information takes priority over values that may be detected in later steps.
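DOX's internal code detection is not public, but the idea is easy to illustrate with the open-source pyzbar library: decode any barcodes or QR codes on the page and keep their values as priority results that would override OCR-based extractions later.

# Illustration only: this is not DOX's implementation, just the general idea
# using the open-source pyzbar library.
from PIL import Image
from pyzbar.pyzbar import decode

page = Image.open("invoice_page1.png")
codes = decode(page)  # one entry per detected barcode or QR code

priority_values = {}
for code in codes:
    text = code.data.decode("utf-8")
    print(f"{code.type} at {code.rect}: {text}")
    # A QR code on an invoice often carries key fields directly (for example an
    # EPC/GiroCode payment string); values parsed here would win over later steps.
    priority_values[code.type] = text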
Step 2:
After that, DOX applies SAP’s internal OCR algorithm. You can access the result of the OCR step via the corresponding DOX API endpoint (Figure 2). DOX then searches for certain keywords, for example the invoice number. Furthermore, the Business Entity Recognition service is used to detect addresses.
Figure 2: OCR Endpoint of the Swagger UI of the Document Information Extraction Service.
Step 3:
In the next step, DOX uses the DocReader algorithm to extract more values. This algorithm especially focuses on the header fields of the document.
Step 4:
The last step of the information extraction itself is done by Chargrid, an algorithm based on a fully convolutional neural network that takes layout information as well as OCR results into account to detect fields in the document. It is used especially for line items.
OCR
SAP’s internal OCR algorithm is a line-based approach. First, a neural net similar to the encoder of the Chargrid algorithm (explained further down) detects the lines of the document. A transformer decoder then extracts the text from each line.
First the lines of the document are detected.
{
"results": {
"1": [
{
"word_boxes": [
{
"bbox": [
[
141,
103
],
[
263,
262
]
],
"content": "�"
},
{
"bbox": [
[
398,
99
],
[
894,
260
]
],
"content": "Sights"
}
],
"bbox": [
[
141,
99
],
[
894,
262
]
]
},
{
"word_boxes": [
{
"bbox": [
[
62,
322
],
[
699,
473
]
],
"content": "Downtown"
},
{
"bbox": [
[
777,
303
],
[
1219,
447
]
],
"content": "Toronto"
},
{
"bbox": [
[
1290,
297
],
[
1386,
429
]
],
"content": "is"
},
{
"bbox": [
[
1458,
265
],
[
1606,
416
]
],
"content": "an"
},
{
"bbox": [
[
1687,
280
],
[
2494,
476
]
],
"content": "easy-to-navigate"
}
],
"bbox": [
[
65,
306
],
[
2490,
484
]
]
},
{
"word_boxes": [
{
"bbox": [
[
58,
502
],
[
333,
634
]
],
"content": "grid,"
},
{
"bbox": [
[
415,
481
],
[
903,
621
]
],
"content": "bounded"
},
{
"bbox": [
[
989,
473
],
[
1115,
600
]
],
"content": "by"
},
{
"bbox": [
[
1185,
468
],
[
1251,
593
]
],
"content": "a"
},
{
"bbox": [
[
1331,
441
],
[
2044,
610
]
],
"content": "hodgepodge"
},
{
"bbox": [
[
2116,
468
],
[
2219,
616
]
],
"content": "of"
},
{
"bbox": [
[
2281,
474
],
[
2494,
626
]
],
"content": "bohe"
}
],
"bbox": [
[
59,
449
],
[
2492,
648
]
]
},
{
"word_boxes": [
{
"bbox": [
[
66,
629
],
[
397,
812
]
],
"content": "mian,"
},
{
"bbox": [
[
486,
624
],
[
834,
806
]
],
"content": "ethnic"
},
{
"bbox": [
[
914,
620
],
[
1116,
801
]
],
"content": "and"
},
{
"bbox": [
[
1206,
613
],
[
1644,
797
]
],
"content": "historic"
},
{
"bbox": [
[
1722,
603
],
[
2484,
790
]
],
"content": "neighborhoods"
}
],
"bbox": [
[
66,
603
],
[
2484,
812
]
]
},
{
"word_boxes": [
{
"bbox": [
[
76,
797
],
[
420,
959
]
],
"content": "Yonge"
},
{
"bbox": [
[
492,
794
],
[
623,
953
]
],
"content": "St,"
},
{
"bbox": [
[
694,
791
],
[
872,
950
]
],
"content": "the"
},
{
"bbox": [
[
955,
784
],
[
1337,
947
]
],
"content": "world's"
},
{
"bbox": [
[
1404,
777
],
[
1863,
940
]
],
"content": "longest,"
},
{
"bbox": [
[
1936,
770
],
[
2337,
933
]
],
"content": "dissects"
},
{
"bbox": [
[
2384,
768
],
[
2502,
926
]
],
"content": "the"
}
],
"bbox": [
[
76,
768
],
[
2502,
959
]
]
},
{
"word_boxes": [
{
"bbox": [
[
76,
973
],
[
310,
1113
]
],
"content": "city:"
},
{
"bbox": [
[
384,
969
],
[
583,
1108
]
],
"content": "any"
},
{
"bbox": [
[
643,
958
],
[
1232,
1104
]
],
"content": "downtown"
},
{
"bbox": [
[
1294,
951
],
[
1626,
1093
]
],
"content": "street"
},
{
"bbox": [
[
1686,
946
],
[
1939,
1086
]
],
"content": "with"
},
{
"bbox": [
[
2006,
943
],
[
2136,
1081
]
],
"content": "an"
},
{
"bbox": [
[
2211,
938
],
[
2395,
1078
]
],
"content": "East"
},
{
"bbox": [
[
2437,
937
],
[
2503,
1074
]
],
"content": "or"
}
],
"bbox": [
[
76,
937
],
[
2503,
1113
]
]
},
{
"word_boxes": [
{
"bbox": [
[
86,
1133
],
[
366,
1269
]
],
"content": "West"
},
{
"bbox": [
[
418,
1123
],
[
1063,
1265
]
],
"content": "designation"
},
{
"bbox": [
[
1122,
1118
],
[
1442,
1255
]
],
"content": "refers"
},
{
"bbox": [
[
1493,
1116
],
[
1607,
1250
]
],
"content": "to"
},
{
"bbox": [
[
1661,
1113
],
[
1805,
1248
]
],
"content": "its"
},
{
"bbox": [
[
1862,
1106
],
[
2294,
1245
]
],
"content": "position"
},
{
"bbox": [
[
2339,
1103
],
[
2510,
1238
]
],
"content": "rela-"
}
],
"bbox": [
[
86,
1103
],
[
2510,
1269
]
]
},
{
"word_boxes": [
{
"bbox": [
[
85,
1294
],
[
294,
1429
]
],
"content": "tive"
},
{
"bbox": [
[
346,
1292
],
[
455,
1426
]
],
"content": "to"
},
{
"bbox": [
[
517,
1286
],
[
866,
1423
]
],
"content": "Yonge."
},
{
"bbox": [
[
928,
1280
],
[
1287,
1418
]
],
"content": "Unlike"
},
{
"bbox": [
[
1343,
1276
],
[
1606,
1412
]
],
"content": "New"
},
{
"bbox": [
[
1657,
1271
],
[
1954,
1407
]
],
"content": "York,"
},
{
"bbox": [
[
2004,
1266
],
[
2280,
1402
]
],
"content": "there"
},
{
"bbox": [
[
2321,
1264
],
[
2390,
1398
]
],
"content": "is"
},
{
"bbox": [
[
2429,
1263
],
[
2517,
1396
]
],
"content": "no"
}
],
"bbox": [
[
85,
1263
],
[
2517,
1429
]
]
},
{
"word_boxes": [
{
"bbox": [
[
92,
1447
],
[
699,
1578
]
],
"content": "distinction"
},
{
"bbox": [
[
793,
1441
],
[
1251,
1570
]
],
"content": "between"
},
{
"bbox": [
[
1338,
1437
],
[
1526,
1563
]
],
"content": "the"
},
{
"bbox": [
[
1614,
1430
],
[
2193,
1560
]
],
"content": "directions"
},
{
"bbox": [
[
2266,
1428
],
[
2353,
1552
]
],
"content": "of"
},
{
"bbox": [
[
2419,
1426
],
[
2522,
1551
]
],
"content": "av"
}
],
"bbox": [
[
92,
1426
],
[
2522,
1578
]
]
},
{
"word_boxes": [
{
"bbox": [
[
97,
1618
],
[
426,
1758
]
],
"content": "enues"
},
{
"bbox": [
[
474,
1615
],
[
675,
1753
]
],
"content": "and"
},
{
"bbox": [
[
721,
1609
],
[
1114,
1750
]
],
"content": "streets:"
},
{
"bbox": [
[
1162,
1602
],
[
1635,
1744
]
],
"content": "Spadina"
},
{
"bbox": [
[
1692,
1598
],
[
1902,
1737
]
],
"content": "Ave"
},
{
"bbox": [
[
1979,
1594
],
[
2201,
1733
]
],
"content": "runs"
},
{
"bbox": [
[
2242,
1590
],
[
2531,
1729
]
],
"content": "north-"
}
],
"bbox": [
[
97,
1590
],
[
2531,
1758
]
]
},
{
"word_boxes": [
{
"bbox": [
[
101,
1774
],
[
452,
1914
]
],
"content": "south,"
},
{
"bbox": [
[
541,
1771
],
[
719,
1909
]
],
"content": "but"
},
{
"bbox": [
[
810,
1764
],
[
1305,
1906
]
],
"content": "Danforth"
},
{
"bbox": [
[
1395,
1761
],
[
1612,
1900
]
],
"content": "Ave"
},
{
"bbox": [
[
1729,
1757
],
[
1976,
1896
]
],
"content": "runs"
},
{
"bbox": [
[
2057,
1750
],
[
2535,
1892
]
],
"content": "east-west"
}
],
"bbox": [
[
101,
1750
],
[
2535,
1914
]
]
},
{
"word_boxes": [
{
"bbox": [
[
114,
1927
],
[
507,
2071
]
],
"content": "There's"
},
{
"bbox": [
[
570,
1925
],
[
775,
2069
]
],
"content": "also"
},
{
"bbox": [
[
835,
1925
],
[
888,
2068
]
],
"content": "a"
},
{
"bbox": [
[
949,
1923
],
[
1263,
2067
]
],
"content": "street"
},
{
"bbox": [
[
1329,
1921
],
[
1662,
2066
]
],
"content": "called"
},
{
"bbox": [
[
1742,
1919
],
[
2171,
2064
]
],
"content": "Avenue"
},
{
"bbox": [
[
2235,
1918
],
[
2389,
2061
]
],
"content": "Rd."
},
{
"bbox": [
[
2435,
1917
],
[
2539,
2061
]
],
"content": "Go"
}
],
"bbox": [
[
114,
1917
],
[
2538,
2071
]
]
},
{
"word_boxes": [
{
"bbox": [
[
131,
2084
],
[
485,
2218
]
],
"content": "figure!"
}
],
"bbox": [
[
131,
2084
],
[
485,
2218
]
]
}
]
}
}
OCR Output of the image above using the DOX API
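The structure of this response (pages mapping to lines, each line holding word_boxes with a bbox and its content) is easy to post-process. Here is a small sketch that rebuilds the plain text line by line from such a response, assuming it was saved as ocr_result.json:

import json

# Rebuild plain text from the OCR response shown above.
with open("ocr_result.json") as f:
    ocr = json.load(f)

for page_no, lines in ocr["results"].items():
    print(f"--- page {page_no} ---")
    for line in lines:
        # Sort the words left to right by the x coordinate of their top-left corner.
        words = sorted(line["word_boxes"], key=lambda w: w["bbox"][0][0])
        print(" ".join(w["content"] for w in words))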
DocReader
DocReader is a data-driven approach and was trained only on images and key-value string pairs; no bounding boxes are needed. Because this is exactly the information that human-based extraction has produced over the past years, DocReader has a huge amount of training data available. DocReader takes an image and an extraction key (for example “invoice number”) as input; the algorithm can also search for multiple extraction keys at the same time. The model is not trained from scratch but is initialized with the weights of Chargrid (the Chargrid algorithm is explained further down). The training dataset consists of 1.5 million scanned single-page invoices and the human-based extractions from these documents. DocReader can also infer the currency from either a symbol found on the document or the address. It also derives information from the document template and layout as well as the interplay between fields.
DocReader consists of three modules: the encoder, the attention layer, and the decoder.
Encoder:
The encoder of DocReader works exactly like the first part of the encoder of ChargridOCR: a feed-forward neural net that consists of several convolutional blocks. The blocks use stride-2 convolutions that together decrease the resolution by a factor of 8; the resulting feature map is the memory of the network (source). The input of the encoder is a black-and-white version of the document image. To make use of the spatial information in the document, DocReader then adds coordinate positions, encoded as one-hot vectors, to the memory as shown in Figure 3.
Figure 3: Network Structure of DocReader’s Encoder and Spatial Aware Memory.
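To make the idea of a spatially aware memory more concrete, here is a small PyTorch sketch (an illustration, not SAP's implementation): a few stride-2 convolutional blocks downsample the page by a factor of 8, and coordinate information is concatenated to the resulting feature map. The paper describes one-hot coordinate encodings; the sketch simplifies this to normalized scalar coordinates, and all layer sizes are invented.

import torch
import torch.nn as nn

class DocReaderStyleEncoder(nn.Module):
    """Sketch: stride-2 conv blocks (downsampling by 8 overall) followed by
    concatenation of coordinate encodings to form the spatially aware memory."""

    def __init__(self, channels=(32, 64, 128)):
        super().__init__()
        blocks, in_ch = [], 1  # grayscale input image
        for out_ch in channels:
            blocks += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            ]
            in_ch = out_ch
        self.backbone = nn.Sequential(*blocks)

    def forward(self, image):                      # image: (B, 1, H, W)
        feats = self.backbone(image)               # (B, C, H/8, W/8)
        b, c, h, w = feats.shape
        # Simplified positional encoding: normalized row/column indices per grid cell.
        ys = torch.linspace(0, 1, h, device=feats.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(0, 1, w, device=feats.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return torch.cat([feats, ys, xs], dim=1)   # spatially aware memory

memory = DocReaderStyleEncoder()(torch.rand(1, 1, 256, 256))
print(memory.shape)  # torch.Size([1, 130, 32, 32])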
Decoder:
The decoder (Figure 4) of DocReader consists of a recurrent neural network, to be exact an LSTM (long short-term memory network), coupled with an attention layer. The attention layer is an augmented and conditioned version of the sum-attention layer. The decoder receives the spatially augmented memory and the one-hot encoded key (for example “invoice number”) as input; multiple keys can be extracted at the same time. The attention layer takes the location information encoded in the spatially aware memory, the previous character, and the previous attention state as input and outputs the new attention area. Using this context together with its previous state, the decoder then outputs a character.
Figure 4: Network Architecture of DocReader’s Decoder with Attention Layer.
Figure 5 shows how the attention area of the attention layer is mapped back onto the input document.
Figure 5: Attention weights projected back onto input document.
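A minimal sketch of one decoding step (again an illustration, not SAP's implementation): an attention distribution over the spatial memory is computed from the memory, the one-hot key, and the previous decoder state; the attended context together with the previous character feeds an LSTM cell that emits the next character plus a new attention map, which could be projected back onto the document as in Figure 5. The previous attention state is omitted for brevity, and all dimensions are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCharDecoder(nn.Module):
    """Sketch of an attention-conditioned LSTM character decoder."""

    def __init__(self, mem_dim=130, key_dim=8, vocab=60, hidden=256):
        super().__init__()
        self.char_emb = nn.Embedding(vocab, 32)
        self.score = nn.Conv2d(mem_dim + key_dim + hidden, 1, kernel_size=1)
        self.cell = nn.LSTMCell(mem_dim + 32, hidden)
        self.out = nn.Linear(hidden, vocab)

    def step(self, memory, key, prev_char, state):
        # memory: (B, C, H, W); key: (B, key_dim) one-hot; prev_char: (B,) token ids
        b, c, h, w = memory.shape
        hx, cx = state
        # Broadcast key and previous hidden state over the grid and score each cell.
        cond = torch.cat([key, hx], dim=1).view(b, -1, 1, 1).expand(-1, -1, h, w)
        attn = F.softmax(self.score(torch.cat([memory, cond], dim=1)).view(b, -1), dim=1)
        context = (memory.view(b, c, -1) * attn.unsqueeze(1)).sum(dim=2)  # (B, C)
        hx, cx = self.cell(torch.cat([context, self.char_emb(prev_char)], dim=1), (hx, cx))
        return self.out(hx), (hx, cx), attn.view(b, h, w)

decoder = AttentionCharDecoder()
mem = torch.rand(1, 130, 32, 32)                                  # encoder memory
key = F.one_hot(torch.tensor([3]), num_classes=8).float()         # e.g. "invoice number"
state = (torch.zeros(1, 256), torch.zeros(1, 256))
logits, state, attn_map = decoder.step(mem, key, torch.tensor([0]), state)
print(logits.shape, attn_map.shape)  # torch.Size([1, 60]) torch.Size([1, 32, 32])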
Chargrid
Chargrid uses the output of SAP’s internal OCR algorithm as input. It then creates a representation of the input document by treating each character as one channel (resulting in around 50-60 channels for letters, numbers, and symbols), in contrast to standard CNNs, which usually operate on color channels (RGB). This way, Chargrid can easily reduce the size of the document representation and also include font size in its predictions.
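To make the character-as-channel idea tangible, here is a small sketch (not SAP's code) that rasterizes OCR word boxes, in the same bbox/content format as the OCR output shown earlier, into a one-hot character grid. The character set, downscaling factor, and page dimensions are placeholders.

import numpy as np

# Placeholder character set: roughly 50-60 channels for letters, digits, and symbols.
CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789.,:;-/()'&%$ "
CHAR_TO_IDX = {ch: i for i, ch in enumerate(CHARSET)}

def build_chargrid(word_boxes, page_w, page_h, scale=8):
    """Rasterize OCR word boxes into a one-hot (H, W, num_chars) chargrid.
    Each word box is split evenly across its characters -- a simplification;
    per-character boxes would be used if the OCR engine provides them."""
    h, w = page_h // scale, page_w // scale
    grid = np.zeros((h, w, len(CHARSET)), dtype=np.uint8)
    for box in word_boxes:
        (x0, y0), (x1, y1) = box["bbox"]
        text = box["content"].lower()
        x0, y0, x1, y1 = (v // scale for v in (x0, y0, x1, y1))
        char_w = max((x1 - x0) // max(len(text), 1), 1)
        for i, ch in enumerate(text):
            idx = CHAR_TO_IDX.get(ch)
            if idx is None:
                continue
            grid[y0:y1, x0 + i * char_w : x0 + (i + 1) * char_w, idx] = 1
    return grid

# Example using one word box from the OCR output shown above.
words = [{"bbox": [[398, 99], [894, 260]], "content": "Sights"}]
print(build_chargrid(words, page_w=2560, page_h=2240).shape)  # (280, 320, 49)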
This representation of the document is used as input to a fully convolutional neural network that performs semantic segmentation on the chargrid and predicts a class label for each character-pixel of the document. The network is strictly feed-forward with one encoder and two decoder stages. The encoder has five blocks, each with 3×3 convolutions. Because there might be multiple instances of the same class, Chargrid has two decoders: one for class label prediction (segmentation) and one that borrows bounding box prediction from object detection, which is used for line items (Figure 6).
Figure 6: Network Architecture of Chargrid.
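A compact PyTorch sketch of this encoder/two-decoder layout (an illustration with invented channel sizes, not SAP's implementation): one shared fully convolutional encoder, a decoder head that predicts a class label per character-pixel, and a second head that regresses bounding boxes for line items.

import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, stride=1):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class ChargridStyleNet(nn.Module):
    """Sketch: shared encoder plus segmentation and box-regression decoder heads."""

    def __init__(self, num_chars=49, num_classes=5, num_anchors=4):
        super().__init__()
        self.encoder = nn.Sequential(
            conv_block(num_chars, 64, stride=2),
            conv_block(64, 128, stride=2),
            conv_block(128, 256, stride=2),
        )
        def head(out_ch):
            return nn.Sequential(
                nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
                conv_block(256, 64),
                nn.Conv2d(64, out_ch, kernel_size=1),
            )
        self.seg_head = head(num_classes)      # class label per character-pixel
        self.box_head = head(num_anchors * 4)  # bounding box regression for line items

    def forward(self, chargrid):               # chargrid: (B, num_chars, H, W)
        feats = self.encoder(chargrid)
        return self.seg_head(feats), self.box_head(feats)

seg, boxes = ChargridStyleNet()(torch.rand(1, 49, 256, 256))
print(seg.shape, boxes.shape)  # torch.Size([1, 5, 256, 256]) torch.Size([1, 16, 256, 256])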
Step 5:
Finally, DOX performs data enrichment. Based on certain information found in the document, DOX can enrich the result by mapping your metadata to the extracted fields.
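As a trivial illustration of what such a mapping can look like on the consuming side, the sketch below matches an extracted tax ID against a hypothetical supplier master-data list; the field names and records are made up.

# Hypothetical example: enrich an extraction result with your own master data
# by matching the extracted tax ID against a supplier list.
suppliers = [
    {"supplierId": "S-1001", "name": "ACME GmbH", "taxId": "DE123456789"},
    {"supplierId": "S-1002", "name": "Globex Ltd", "taxId": "GB987654321"},
]

def enrich(extraction, suppliers):
    fields = {f["name"]: f.get("value") for f in extraction.get("headerFields", [])}
    match = next((s for s in suppliers if s["taxId"] == fields.get("taxId")), None)
    if match:
        fields["supplierId"] = match["supplierId"]  # the enriched field
    return fields

example = {"headerFields": [{"name": "taxId", "value": "DE123456789"},
                            {"name": "grossAmount", "value": 119.0}]}
print(enrich(example, suppliers))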
Retraining – How does retraining work?
Technically, DOX is not a retrainable service yet. But if DOX does not work on your kind of document and your documents always follow the same structure, you can use the DOX UI to create a template. To create a template, you name the fields to be extracted and select them on your document with bounding boxes, as shown in this video by my colleague Antonio Maradiaga on ID cards.
Use Cases – think outside the box!
By using templates, you can basically apply DOX to any type of structured document. You can use it for HR documents, for receipts, to compare documents, or, like my colleague in this blog post, for ID cards and passports. DOX can also be super useful for creating tags or a text representation of documents, for example to later search through your archived documents. Think of all the analyses you can run on the extracted information! You can also use DOX to simply extract all the text from a document and then run your custom analyses on the extracted plain text. Let me know in the comments which other use cases you can think of!
Useful Links and Sources:
[1] ChargridOCR
[2] Chargrid
[3] DocReader
SAP Community – Document Information Extraction
SAP Help Page – Document Information Extraction
SAP Tutorials – Document Information Extraction