A bit late, but for those who come after me...

**There is a serverless solution using AWS Glue!** (I nearly died figuring this out)
No need for EC2, and no need to dodge Lambda's memory limits.
**This solution has two parts:**
1. A Lambda function, triggered by S3 when a ZIP file is uploaded, that starts a Glue job run and passes the S3 object key to Glue as an argument (the trigger wiring is sketched just below).
2. A Glue job that unzips the files (in memory!) and uploads the contents back to S3.
See my code below which unzips the ZIP file and places the contents back into the same bucket (configurable).
Please upvote if helpful :)
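For completeness, here is a minimal sketch of wiring the S3 trigger with boto3. The bucket name and Lambda ARN are placeholders you would replace with your own values; you can of course set up the same trigger through the console or CloudFormation instead.

```python
import boto3

# Placeholders - substitute your own bucket and Lambda function ARN
BUCKET = 'your-upload-bucket'
LAMBDA_ARN = 'arn:aws:lambda:us-east-1:123456789012:function:StartUnzipJob'

lambda_client = boto3.client('lambda')
s3 = boto3.client('s3')

# Allow S3 to invoke the Lambda function
lambda_client.add_permission(
    FunctionName=LAMBDA_ARN,
    StatementId='s3-invoke-unzip',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn='arn:aws:s3:::' + BUCKET,
)

# Fire the function only when a .zip object is created in the bucket
s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': LAMBDA_ARN,
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {'Key': {'FilterRules': [{'Name': 'suffix', 'Value': '.zip'}]}},
        }]
    },
)
```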
**Lambda Script (python3) that calls a Glue Job called YourGlueJob**
```python
import boto3
import urllib.parse

glue = boto3.client('glue')

def lambda_handler(event, context):
    # Pull the bucket and (URL-decoded) object key out of the S3 event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    print(key)
    try:
        # Kick off the Glue job, passing the bucket and key as job arguments
        newJobRun = glue.start_job_run(
            JobName='YourGlueJob',
            Arguments={
                '--bucket': bucket,
                '--key': key,
            }
        )
        print("Successfully created unzip job")
        return key
    except Exception as e:
        print(e)
        print('Error starting unzip job for ' + key)
        raise e
```
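If you want to smoke-test the handler before wiring up the trigger, you can call it with a hand-built event that mimics the S3 notification shape. Only the fields the handler actually reads are included, and the bucket and key here are made-up values (note this will start a real Glue job run if your credentials allow it):

```python
# Minimal fake S3 event - just the fields lambda_handler reads
test_event = {
    'Records': [{
        's3': {
            'bucket': {'name': 'your-upload-bucket'},
            'object': {'key': 'uploads/archive.zip'},
        }
    }]
}

print(lambda_handler(test_event, None))  # prints the decoded key and starts the Glue job
```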
**AWS Glue Job Script to unzip the files**
```python
import sys
import io
import zipfile

import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'bucket', 'key'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

s3 = boto3.client('s3')
s3r = boto3.resource('s3')

bucket = args['bucket']
key = args['key']

# Pull the whole ZIP object into memory
obj = s3r.Object(bucket_name=bucket, key=key)
buffer = io.BytesIO(obj.get()['Body'].read())

z = zipfile.ZipFile(buffer)
file_list = z.namelist()

# Upload each member back to S3, using the ZIP's key as a prefix for the new object key
for filename in file_list:
    print(filename)
    member = z.open(filename)
    s3.upload_fileobj(io.BytesIO(member.read()), bucket, key + filename)
    member.close()

print(file_list)
job.commit()
```
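The Glue job itself (`YourGlueJob` in the Lambda above) can be created in the console, or with a one-off boto3 call like the sketch below. The role ARN, script location, and worker settings are placeholders and guesses on my part; a small job like this does not need much capacity.

```python
import boto3

glue = boto3.client('glue')

# Placeholders - substitute your own IAM role and the S3 path of the job script
glue.create_job(
    Name='YourGlueJob',
    Role='arn:aws:iam::123456789012:role/YourGlueServiceRole',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://your-scripts-bucket/unzip_job.py',
        'PythonVersion': '3',
    },
    GlueVersion='3.0',
    WorkerType='G.1X',
    NumberOfWorkers=2,
    # Defaults for the arguments the Lambda overrides on each run
    DefaultArguments={
        '--bucket': 'your-upload-bucket',
        '--key': '',
    },
)
```

Whatever role you use needs permission to read the uploaded ZIP and write the extracted objects back to the bucket, in addition to the usual Glue service permissions.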