You can use the `imageAction` [configuration](https://learn.microsoft.com/en-us/azure/search/cognitive-search-concept-image-scenarios#configure-indexers-for-image-processing) to extract the page number and content from each page.
With this approach you get a field called `pageNumber`, and the content of each page is projected as a separate document into a secondary index.
Below are the sample definitions.
**Primary index** fields.
![enter image description here](https://i.imgur.com/2BFL9WF.png)
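In case the screenshot is hard to read, here is a minimal sketch of the primary index fields. The exact names come from your own index; `content` and a base64-encoded `metadata_storage_path` key are the standard blob-indexer defaults, so treat this as an illustration rather than the exact definition shown above:

```json
{
  "name": "srcidx",
  "fields": [
    { "name": "metadata_storage_path", "type": "Edm.String", "key": true },
    { "name": "metadata_storage_name", "type": "Edm.String", "filterable": true },
    { "name": "content", "type": "Edm.String", "searchable": true }
  ]
}
```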
**Skillset definition** - the OCR skill extracts text from each page, and an index projection writes each page into the secondary index.
```json
{
  "name": "skillset1",
  "description": "",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
      "name": "#1",
      "description": "",
      "context": "/document/normalized_images/*",
      "inputs": [
        {
          "name": "image",
          "source": "/document/normalized_images/*",
          "inputs": []
        }
      ],
      "outputs": [
        {
          "name": "text",
          "targetName": "text"
        }
      ],
      "defaultLanguageCode": "en",
      "detectOrientation": true,
      "lineEnding": "Space"
    }
  ],
  "@odata.etag": "\"0x8DCFB1DC59EC9ED\"",
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "desidx",
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/normalized_images/*",
        "mappings": [
          {
            "name": "text",
            "source": "/document/normalized_images/*/text"
          },
          {
            "name": "pageNumber",
            "source": "/document/normalized_images/*/pageNumber"
          },
          {
            "name": "metadata_storage_path",
            "source": "/document/metadata_storage_name"
          }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  }
}
```
When you set the image action to `generateNormalizedImagePerPage`, each page's data is available under the `/document/normalized_images/*` context.
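Conceptually, the enrichment tree then looks something like this (a hand-written illustration of the structure, not actual service output; the `text` property is what the OCR skill adds to each image node):

```json
{
  "document": {
    "content": "...",
    "metadata_storage_name": "sample.pdf",
    "normalized_images": [
      { "data": "<base64 image of page 1>", "pageNumber": 1, "text": "<OCR text of page 1>" },
      { "data": "<base64 image of page 2>", "pageNumber": 2, "text": "<OCR text of page 2>" }
    ]
  }
}
```

This is why the index projection's `sourceContext` of `/document/normalized_images/*` produces one secondary-index document per page.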
**Indexer**
```json
{
  "@odata.context": "https://jgsai.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"0x8DCFB1DD13443EF\"",
  "name": "azureblob-indexer",
  "description": "",
  "dataSourceName": "ds",
  "skillsetName": "skillset1",
  "targetIndexName": "srcidx",
  "disabled": null,
  "schedule": null,
  "parameters": {
    "batchSize": null,
    "maxFailedItems": 0,
    "maxFailedItemsPerBatch": 0,
    "base64EncodeKeys": null,
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default",
      "imageAction": "generateNormalizedImagePerPage"
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "metadata_storage_path",
      "mappingFunction": {
        "name": "base64Encode",
        "parameters": null
      }
    }
  ],
  "outputFieldMappings": [],
  "cache": null,
  "encryptionKey": null
}
```
**Secondary index** fields
![enter image description here](https://i.imgur.com/UjEbype.png)
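A minimal sketch of what the secondary index fields could look like, matching the projection mappings above. The field types are assumptions (for example `Edm.Int32` for `pageNumber`); when using index projections, the key values are generated for you, and the `keyword` analyzer keeps those generated keys intact:

```json
{
  "name": "desidx",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true, "analyzer": "keyword" },
    { "name": "parent_id", "type": "Edm.String", "filterable": true },
    { "name": "pageNumber", "type": "Edm.Int32", "filterable": true, "sortable": true },
    { "name": "text", "type": "Edm.String", "searchable": true },
    { "name": "metadata_storage_path", "type": "Edm.String", "filterable": true }
  ]
}
```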
Output:
![enter image description here](https://i.imgur.com/1FWwsyq.png)
More properties of the [image action](https://learn.microsoft.com/en-us/azure/search/cognitive-search-concept-image-scenarios#about-normalized-images) are documented here.
You do not need to run a separate document-extraction step; by default the indexer extracts the content and metadata and exposes them under the `/document` context.
To get the language code, use the field path
`/document/metadata_language`.
You can pass this as an input to a conditional skill and run OCR only where it is needed.
Alter your conditional skill as below.
```json
{
  "@odata.type": "#Microsoft.Skills.Util.ConditionalSkill",
  "name": "Language Check",
  "description": "Check if language code is 'Unknown'",
  "context": "/document",
  "inputs": [
    {
      "name": "condition",
      "source": "= $(/document/metadata_language) == '(Unknown)'"
    },
    {
      "name": "whenTrue",
      "source": "/document/normalized_images/*"
    },
    {
      "name": "whenFalse",
      "source": "= null"
    }
  ],
  "outputs": [
    {
      "name": "output",
      "targetName": "imagesForOcr"
    }
  ]
}
```
For `whenTrue` you can also use `/document/content` as the source.
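The `imagesForOcr` output can then feed a downstream OCR skill, so OCR runs only for documents whose language is unknown. A sketch of that wiring (the skill name and `targetName` here are illustrative):

```json
{
  "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
  "name": "OCR fallback",
  "context": "/document/imagesForOcr/*",
  "inputs": [
    { "name": "image", "source": "/document/imagesForOcr/*" }
  ],
  "outputs": [
    { "name": "text", "targetName": "ocrText" }
  ]
}
```

For documents whose language is known, `imagesForOcr` is null and this skill is skipped.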
**Note**: Make sure the indexer configuration is set to extract content and metadata (`"dataToExtract": "contentAndMetadata"`).
Below is the indexer definition.
```json
{
  "@odata.context": "https://jgsaisearch.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"0x8DC91AC57C625BF\"",
  "name": "azureblob-indexer",
  "description": "",
  "dataSourceName": "ds",
  "skillsetName": "skillset1718943474465",
  "targetIndexName": "azureblob-index",
  "disabled": null,
  "schedule": null,
  "parameters": {
    "batchSize": null,
    "maxFailedItems": 0,
    "maxFailedItemsPerBatch": 0,
    "base64EncodeKeys": null,
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default"
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "metadata_storage_path",
      "mappingFunction": {
        "name": "base64Encode",
        "parameters": null
      }
    }
  ],
  "outputFieldMappings": [
    {
      "sourceFieldName": "/document/myLanguageCode",
      "targetFieldName": "lang_code"
    }
  ],
  "cache": null,
  "encryptionKey": null
}
```
Output:
![enter image description here](https://i.imgur.com/8zE9xvV.png)