Is there a way to do streaming decompression of single-file zip archives?

I currently have arbitrarily large zipped archives (single file per archive) in s3. I would like to be able to process the files by iterating over them without having to actually download the files to disk or into memory.

A simple example:

import boto

def count_newlines(bucket_name, key_name):
    conn = boto.connect_s3()
    b = conn.get_bucket(bucket_name)
    # key is a .zip file
    key = b.get_key(key_name)
    count = 0
    for chunk in key:
        # How should decompression happen here?
        count += decompress(chunk).count('\n')
    return count

This answer demonstrates a method of doing the same thing with gzip'd files. Unfortunately, I haven't been able to get the same technique to work using the zipfile module, as it seems to require random access to the entire file being unzipped.
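For context, the gzip approach works because gzip is a pure stream format, whereas a zip archive stores its central directory at the end of the file, which is why zipfile wants a seekable file object. The streaming-gunzip idea being referred to looks roughly like the sketch below, assuming the key yields raw byte chunks as in the example above (the function name and details are illustrative, not taken from the linked answer):

import boto
import zlib

def count_newlines_gzip(bucket_name, key_name):
    conn = boto.connect_s3()
    key = conn.get_bucket(bucket_name).get_key(key_name)
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    count = 0
    for chunk in key:
        # Decompress each chunk as it streams in; the full object is
        # never written to disk or held in memory at once.
        count += decompressor.decompress(chunk).count('\n')
    return count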

Solution

You can use https://pypi.python.org/pypi/tubing; it even has built-in S3 source support using boto3.

from tubing.ext import s3
from tubing import pipes, sinks

output = s3.S3Source(bucket, key) \
    | pipes.Gunzip() \
    | pipes.Split(on=b'\n') \
    | sinks.Objects()
print len(output)

If you didn't want to store the entire output in the returned sink, you could make your own sink that just counts. The implementation would look like this:

class CountWriter(object):
    def __init__(self):
        self.count = 0

    def write(self, chunk):
        self.count += len(chunk)

Counter = sinks.MakeSink(CountWriter)
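
A usage sketch, wiring the counting sink into the same pipeline in place of sinks.Objects(). How the final count is read back from the resulting sink object (result.count below) is an assumption about tubing's API, so check its documentation:

# Same pipeline as above, but ending in the counting sink.
result = s3.S3Source(bucket, key) \
    | pipes.Gunzip() \
    | pipes.Split(on=b'\n') \
    | Counter()
# Assumption: the sink exposes its writer's state as .count
print result.count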
