A bit late, but for those who come after me...

**There is a serverless solution using AWS Glue!** (I nearly died figuring this out)
No need for EC2, and no need to dodge Lambda's memory limits.
**This solution has two parts:**
1. A Lambda function, triggered by S3 when a ZIP file is uploaded, that starts a Glue job run and passes the S3 object key to Glue as an argument (the trigger wiring is sketched just below).
2. A Glue job that unzips the files (in memory!) and uploads the contents back to S3.
See my code below which unzips the ZIP file and places the contents back into the same bucket (configurable).
Please upvote if helpful :)
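For completeness, here is a minimal sketch of wiring the S3 trigger with boto3. The bucket name and Lambda ARN are placeholders you would replace with your own values; you can of course set up the same trigger through the console or CloudFormation instead.

```python
import boto3

# Placeholders - substitute your own bucket and Lambda function ARN
BUCKET = 'your-upload-bucket'
LAMBDA_ARN = 'arn:aws:lambda:us-east-1:123456789012:function:StartUnzipJob'

lambda_client = boto3.client('lambda')
s3 = boto3.client('s3')

# Allow S3 to invoke the Lambda function
lambda_client.add_permission(
    FunctionName=LAMBDA_ARN,
    StatementId='s3-invoke-unzip',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn='arn:aws:s3:::' + BUCKET,
)

# Fire the function only when a .zip object is created in the bucket
s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': LAMBDA_ARN,
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {'Key': {'FilterRules': [{'Name': 'suffix', 'Value': '.zip'}]}},
        }]
    },
)
```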
**Lambda Script (python3) that calls a Glue Job called YourGlueJob**
```python
import boto3
import urllib.parse

glue = boto3.client('glue')

def lambda_handler(event, context):
    # Pull the bucket and (URL-decoded) object key out of the S3 event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    print(key)
    try:
        # Kick off the Glue job, passing the bucket and key as job arguments
        newJobRun = glue.start_job_run(
            JobName='YourGlueJob',
            Arguments={
                '--bucket': bucket,
                '--key': key,
            }
        )
        print("Successfully created unzip job")
        return key
    except Exception as e:
        print(e)
        print('Error starting unzip job for ' + key)
        raise e
```
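If you want to smoke-test the handler before wiring up the trigger, you can call it with a hand-built event that mimics the S3 notification shape. Only the fields the handler actually reads are included, and the bucket and key here are made-up values (note this will start a real Glue job run if your credentials allow it):

```python
# Minimal fake S3 event - just the fields lambda_handler reads
test_event = {
    'Records': [{
        's3': {
            'bucket': {'name': 'your-upload-bucket'},
            'object': {'key': 'uploads/archive.zip'},
        }
    }]
}

print(lambda_handler(test_event, None))  # prints the decoded key and starts the Glue job
```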
**AWS Glue Job Script to unzip the files**
```python
import sys
import io
import zipfile

import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'bucket', 'key'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

s3 = boto3.client('s3')
s3r = boto3.resource('s3')

bucket = args['bucket']
key = args['key']

# Pull the whole ZIP object into memory
obj = s3r.Object(bucket_name=bucket, key=key)
buffer = io.BytesIO(obj.get()['Body'].read())

z = zipfile.ZipFile(buffer)
file_list = z.namelist()

# Upload each member back to S3, using the ZIP's key as a prefix for the new object key
for filename in file_list:
    print(filename)
    member = z.open(filename)
    s3.upload_fileobj(io.BytesIO(member.read()), bucket, key + filename)
    member.close()

print(file_list)
job.commit()
```
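The Glue job itself (`YourGlueJob` in the Lambda above) can be created in the console, or with a one-off boto3 call like the sketch below. The role ARN, script location, and worker settings are placeholders and guesses on my part; a small job like this does not need much capacity.

```python
import boto3

glue = boto3.client('glue')

# Placeholders - substitute your own IAM role and the S3 path of the job script
glue.create_job(
    Name='YourGlueJob',
    Role='arn:aws:iam::123456789012:role/YourGlueServiceRole',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://your-scripts-bucket/unzip_job.py',
        'PythonVersion': '3',
    },
    GlueVersion='3.0',
    WorkerType='G.1X',
    NumberOfWorkers=2,
    # Defaults for the arguments the Lambda overrides on each run
    DefaultArguments={
        '--bucket': 'your-upload-bucket',
        '--key': '',
    },
)
```

Whatever role you use needs permission to read the uploaded ZIP and write the extracted objects back to the bucket, in addition to the usual Glue service permissions.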