You have to specify the `source_dir` parameter. Within your script, you can then import the modules as you normally would.
> **source_dir** (str or PipelineVariable) – Path (absolute, relative or an
> S3 URI) to a directory with any other training source code
> dependencies aside from the entry point file (default: None). If
> source_dir is an S3 URI, it must point to a tar.gz file. Structure
> within this directory is preserved when training on Amazon SageMaker.
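If you want to pass `source_dir` as an S3 URI, the directory has to be packaged as a tar.gz first. A minimal sketch, in which the directory name, file contents, and bucket path are all placeholders:

```shell
# Create a toy source directory (placeholder names) and package it
# in the tar.gz layout SageMaker expects for an S3 source_dir.
mkdir -p my_source_dir
printf 'sagemaker\n' > my_source_dir/requirements.txt
printf 'print("preprocessing")\n' > my_source_dir/preprocess.py

# -C makes tar archive the files relative to my_source_dir,
# so they sit at the root of the archive as SageMaker requires.
tar -czf sourcedir.tar.gz -C my_source_dir preprocess.py requirements.txt

# Then upload it, e.g.:
# aws s3 cp sourcedir.tar.gz s3://your-bucket/your-prefix/sourcedir.tar.gz
```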
See the general [Processing documentation][1] (you have to use [FrameworkProcessor][2], not the framework-specific processors like SKLearnProcessor).
P.S.: This answer is similar to the one for the question "[How to install additional packages in sagemaker pipeline][3]".
Within the specified folder, there must be the script (in your case `preprocess.py`), any other files/modules that may be needed, and optionally a `requirements.txt` file.
The folder structure will then be:

```
BASE_DIR/
|─ helper_functions/
|  |─ your_utils.py
|─ requirements.txt
|─ preprocess.py
```
Within your `preprocess.py`, you can then import them simply with:
```python
from helper_functions.your_utils import your_class, your_func
```
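For illustration, `helper_functions/your_utils.py` could look like the sketch below; the names `your_func` and `your_class` are just the placeholders from the structure above, not anything SageMaker-specific:

```python
# helper_functions/your_utils.py -- hypothetical helper module matching
# the folder structure above; contents are purely illustrative.

def your_func(values):
    """Return the sum of a list of numeric values."""
    return sum(values)


class your_class:
    """Minimal example class holding a prefix used by preprocess.py."""

    def __init__(self, prefix):
        self.prefix = prefix

    def label(self, name):
        """Join the stored prefix with a name, e.g. 'train' + 'data'."""
        return f"{self.prefix}-{name}"
```

Because `source_dir` preserves the directory structure inside the container, this import works on SageMaker exactly as it does locally.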
------
So, your code becomes:
```python
from sagemaker.processing import FrameworkProcessor, ProcessingInput, ProcessingOutput
from sagemaker.sklearn import SKLearn
from sagemaker.workflow.steps import ProcessingStep

BASE_DIR = your_script_dir_path

sklearn_processor = FrameworkProcessor(
    estimator_cls=SKLearn,
    framework_version=framework_version,
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    base_job_name=base_job_name,
    sagemaker_session=pipeline_session,
    role=role,
)

step_args = sklearn_processor.run(
    inputs=[your_inputs],
    outputs=[your_outputs],
    code="preprocess.py",
    source_dir=BASE_DIR,
    arguments=[your_arguments],
)

step_process = ProcessingStep(
    name="ProcessingName",
    step_args=step_args,
)
```
It's good practice to keep a separate folder for each step, so the steps don't overlap.
[1]: https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.FrameworkProcessor.run
[2]: https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.FrameworkProcessor
[3]: https://stackoverflow.com/questions/74550808/how-to-install-additional-packages-in-sagemaker-pipeline/74551264#74551264
For each job within the pipeline, you should have a separate requirements file, so that you install only the packages you need in each step and have full control over them.
To do this, you need to use the `source_dir` parameter:
> **source_dir** (str or PipelineVariable) – Path (absolute, relative or an
> S3 URI) to a directory with any other training source code
> dependencies aside from the entry point file (default: None). If
> source_dir is an S3 URI, it must point to a tar.gz file. Structure
> within this directory is preserved when training on Amazon SageMaker.
See the general [Processing documentation][1] (you have to use [FrameworkProcessor][2]).
Within the specified folder, there must be the script (in your case `preprocess.py`), any other files/modules that may be needed, and the `requirements.txt` file.
The folder structure will then be:

```
BASE_DIR/
|- requirements.txt
|- preprocess.py
```
It is a standard requirements file, nothing special, and it is installed automatically when the instance starts, with no extra instructions needed.
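For example, a minimal `requirements.txt` for a preprocessing step might be (the package names here are only examples, not requirements of SageMaker itself):

```
pandas
scikit-learn
```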
------
So, your code becomes:
```python
from sagemaker.processing import FrameworkProcessor, ProcessingInput, ProcessingOutput
from sagemaker.sklearn import SKLearn
from sagemaker.workflow.steps import ProcessingStep

sklearn_processor = FrameworkProcessor(
    estimator_cls=SKLearn,
    framework_version='0.23-1',
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    base_job_name=f"{base_job_prefix}/job-name",
    sagemaker_session=pipeline_session,
    role=role,
)

step_args = sklearn_processor.run(
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    code="preprocess.py",
    source_dir=BASE_DIR,
    arguments=["--input-data", input_data],
)

step_process = ProcessingStep(
    name="PreprocessSidData",
    step_args=step_args,
)
```
Note that I changed both the `code` parameter and the `source_dir`. It's good practice to keep a separate folder for each step, each with its own `requirements.txt`, so the steps don't overlap.
[1]: https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.FrameworkProcessor.run
[2]: https://sagemaker.readthedocs.io/en/stable/api/training/processing.html#sagemaker.processing.FrameworkProcessor