CopyPastor

Detecting plagiarism made easy.

Score: 1; Reported for: Exact paragraph match

Possible Plagiarism

Reposted on 2024-11-02
by JayashankarGS

Original - Posted on 2024-06-21
by JayashankarGS



            

You can use the `imageAction` [configuration](https://learn.microsoft.com/en-us/azure/search/cognitive-search-concept-image-scenarios#configure-indexers-for-image-processing) to extract the page number and content from each page.
With this, you get a field called `pageNumber`, and each page's content becomes a separate document in the secondary index.
Below are the sample definitions.
**Primary index** fields.
![enter image description here](https://i.imgur.com/2BFL9WF.png)
**Skillset definition** - using the OCR skill, text is extracted from each page and projected into the secondary index.
```json
{
  "name": "skillset1",
  "description": "",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
      "name": "#1",
      "description": "",
      "context": "/document/normalized_images/*",
      "inputs": [
        { "name": "image", "source": "/document/normalized_images/*", "inputs": [] }
      ],
      "outputs": [
        { "name": "text", "targetName": "text" }
      ],
      "defaultLanguageCode": "en",
      "detectOrientation": true,
      "lineEnding": "Space"
    }
  ],
  "@odata.etag": "\"0x8DCFB1DC59EC9ED\"",
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "desidx",
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/normalized_images/*",
        "mappings": [
          { "name": "text", "source": "/document/normalized_images/*/text" },
          { "name": "pageNumber", "source": "/document/normalized_images/*/pageNumber" },
          { "name": "metadata_storage_path", "source": "/document/metadata_storage_name" }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  }
}
```
When you set the image action to `generateNormalizedImagePerPage`, each page's data is available in the `/document/normalized_images/*` context.
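If you prefer to create the skillset from code rather than the portal, the definition above can be pushed with a plain REST call. A minimal sketch using only the standard library; the service endpoint, admin key, and `2024-07-01` API version are assumptions you should replace with your own:

```python
import json
import urllib.request

def upload_skillset(endpoint: str, api_key: str, skillset: dict) -> None:
    """PUT the skillset definition to the Azure AI Search REST API."""
    url = f"{endpoint}/skillsets/{skillset['name']}?api-version=2024-07-01"
    req = urllib.request.Request(
        url,
        data=json.dumps(skillset).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="PUT",
    )
    urllib.request.urlopen(req)  # raises urllib.error.HTTPError on failure

# Condensed form of the skillset above: OCR per page, projected to the secondary index.
skillset = {
    "name": "skillset1",
    "skills": [{
        "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
        "context": "/document/normalized_images/*",
        "inputs": [{"name": "image", "source": "/document/normalized_images/*"}],
        "outputs": [{"name": "text", "targetName": "text"}],
    }],
    "indexProjections": {
        "selectors": [{
            "targetIndexName": "desidx",
            "parentKeyFieldName": "parent_id",
            "sourceContext": "/document/normalized_images/*",
            "mappings": [
                {"name": "text", "source": "/document/normalized_images/*/text"},
                {"name": "pageNumber", "source": "/document/normalized_images/*/pageNumber"},
            ],
        }],
        "parameters": {"projectionMode": "skipIndexingParentDocuments"},
    },
}

# upload_skillset("https://<your-service>.search.windows.net", "<admin-key>", skillset)
```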
**Indexer**
```json
{
  "@odata.context": "https://jgsai.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"0x8DCFB1DD13443EF\"",
  "name": "azureblob-indexer",
  "description": "",
  "dataSourceName": "ds",
  "skillsetName": "skillset1",
  "targetIndexName": "srcidx",
  "disabled": null,
  "schedule": null,
  "parameters": {
    "batchSize": null,
    "maxFailedItems": 0,
    "maxFailedItemsPerBatch": 0,
    "base64EncodeKeys": null,
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default",
      "imageAction": "generateNormalizedImagePerPage"
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "metadata_storage_path",
      "mappingFunction": { "name": "base64Encode", "parameters": null }
    }
  ],
  "outputFieldMappings": [],
  "cache": null,
  "encryptionKey": null
}
```
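After changing the configuration, the indexer has to be run again before page documents appear in the secondary index. A sketch of building the REST URLs for the run and status operations; the `2024-07-01` API version is an assumption, so adjust it to your service:

```python
import urllib.request

API_VERSION = "2024-07-01"  # assumed API version; change to match your service

def indexer_url(endpoint: str, name: str, action: str = "") -> str:
    """Build the REST URL for an indexer, optionally with an action ('run', 'status', 'reset')."""
    suffix = f"/{action}" if action else ""
    return f"{endpoint}/indexers/{name}{suffix}?api-version={API_VERSION}"

def run_indexer(endpoint: str, api_key: str, name: str) -> None:
    """POST to the run action so the indexer reprocesses blobs with the new imageAction."""
    req = urllib.request.Request(
        indexer_url(endpoint, name, "run"),
        data=b"",
        headers={"api-key": api_key},
        method="POST",
    )
    urllib.request.urlopen(req)  # raises urllib.error.HTTPError on failure

# run_indexer("https://<your-service>.search.windows.net", "<admin-key>", "azureblob-indexer")
```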
**Secondary index** fields
![enter image description here](https://i.imgur.com/UjEbype.png)

Output:
![enter image description here](https://i.imgur.com/1FWwsyq.png)
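Because each page lands as its own document in the secondary index, fetching one page is a plain filtered query. A minimal filter-builder sketch, assuming the `parent_id` and `pageNumber` fields from the secondary index above are marked filterable:

```python
def page_filter(parent_id: str, page_number: int) -> str:
    """Build an OData filter selecting one page's document for a given parent.

    Assumes the 'parent_id' and 'pageNumber' fields defined in the secondary
    index above, both marked filterable.
    """
    escaped = parent_id.replace("'", "''")  # single quotes are doubled in OData string literals
    return f"parent_id eq '{escaped}' and pageNumber eq {page_number}"

# With the azure-search-documents SDK (hypothetical endpoint/credential):
# client = SearchClient(endpoint, "desidx", credential)
# results = client.search(search_text="*", filter=page_filter(doc_id, 3))
```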
More properties of the [image action](https://learn.microsoft.com/en-us/azure/search/cognitive-search-concept-image-scenarios#about-normalized-images) are documented here.
You do not need to do document extraction yourself; by default the indexer handles it and places the content and metadata in the `/document` context.
To get the language code, reference the field `/document/metadata_language`.
You can pass this as an input to a conditional skill, then run OCR for further processing.
Alter your conditional skill as below.
```json
{
  "@odata.type": "#Microsoft.Skills.Util.ConditionalSkill",
  "name": "Language Check",
  "description": "Check if language code is 'Unknown'",
  "context": "/document",
  "inputs": [
    { "name": "condition", "source": "= $(/document/metadata_language) == '(Unknown)'" },
    { "name": "whenTrue", "source": "/document/normalized_images/*" },
    { "name": "whenFalse", "source": "= null" }
  ],
  "outputs": [
    { "name": "output", "targetName": "imagesForOcr" }
  ]
}
```
For `whenTrue`, you can also use `/document/content` instead of the normalized images.
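To see what the skill evaluates to, its routing logic can be mirrored in plain Python (a sketch only; the field names follow the skill definition above):

```python
def images_for_ocr(document):
    """Mirror of the conditional skill: route normalized images to OCR only
    when the detected language is '(Unknown)'; otherwise emit null so the
    downstream OCR skill is skipped for that document."""
    if document.get("metadata_language") == "(Unknown)":
        return document.get("normalized_images", [])
    return None
```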
**Note**: Make sure the indexer configuration extracts content and metadata. Below is the indexer definition.
```json
{
  "@odata.context": "https://jgsaisearch.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"0x8DC91AC57C625BF\"",
  "name": "azureblob-indexer",
  "description": "",
  "dataSourceName": "ds",
  "skillsetName": "skillset1718943474465",
  "targetIndexName": "azureblob-index",
  "disabled": null,
  "schedule": null,
  "parameters": {
    "batchSize": null,
    "maxFailedItems": 0,
    "maxFailedItemsPerBatch": 0,
    "base64EncodeKeys": null,
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default"
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "metadata_storage_path",
      "mappingFunction": { "name": "base64Encode", "parameters": null }
    }
  ],
  "outputFieldMappings": [
    {
      "sourceFieldName": "/document/myLanguageCode",
      "targetFieldName": "lang_code"
    }
  ],
  "cache": null,
  "encryptionKey": null
}
```

Output:
![enter image description here](https://i.imgur.com/8zE9xvV.png)

        