Splitting a very large Parquet file into smaller files (Python)

go_flush · published 2023-05-29 20:45:03

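A very large Parquet file can exhaust memory when its data is loaded in one go. The usual fix is to read the file in fixed-size byte chunks, but that does not work here: Parquet is a structured format, so an arbitrary slice of the file cannot be parsed on its own. A Python generator that yields one batch of rows at a time avoids the problem. The script is meant to support these output types: xls, parquet, csv, json, dict.
The code is as follows: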

# -*- coding: utf-8 -*-
# @Author : Gary

import pandas as pd
import pyarrow.parquet as pq
import os



def read_large_parquet(parquet_path: str, size: int = 65536):
    """Yield the file as DataFrames of at most `size` rows, one batch in memory at a time."""
    parquet_file = pq.ParquetFile(parquet_path)
    for batch in parquet_file.iter_batches(batch_size=size):
        batch_df = batch.to_pandas()
        yield batch_df


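def split_small_data(df: pd.DataFrame, data_type: str, file_id: int = 0, result: str = "out"):
    """
    Write one chunk to disk in the requested format.

    :param df: the chunk to write
    :param data_type: xls, parquet, csv, json, dict
    :param file_id: sequence number used as the output file name
    :param result: output directory
    :return: the chunk as a dict when data_type is "dict", otherwise None
    """
    data_type = data_type.lower()
    if data_type == "dict":
        # to_dict() takes no file path, so return the dict instead of writing a file.
        return df.to_dict()
    # Map each format to (DataFrame method, file extension). pandas has no to_xls();
    # Excel output goes through to_excel(). pandas 2.x no longer writes the legacy
    # .xls format, so .xlsx is used here (requires openpyxl or xlsxwriter).
    writers = {
        "xls": ("to_excel", "xlsx"),
        "parquet": ("to_parquet", "parquet"),
        "csv": ("to_csv", "csv"),
        "json": ("to_json", "json"),
    }
    if data_type not in writers:
        print(f"Please enter the correct file format, error file_data --> {data_type}")
        return None
    method_name, ext = writers[data_type]
    os.makedirs(result, exist_ok=True)
    file_name = os.path.join(result, "%04d.%s" % (file_id, ext))
    getattr(df, method_name)(file_name)
    print(f">>>>>{file_name} Generated<<<<<")
    return None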


def run():
    parquet_path = "data/filtered_large.parquet"
    num = 0
    result = "res"
    n = 1  # chunk size multiplier: each output file holds up to 65536 * n rows
    for df in read_large_parquet(parquet_path, 65536 * n):
        split_small_data(df, "parquet", num, result)
        num += 1


if __name__ == '__main__':
    run()
