Python File Operations


韩未零


韩未零 · Published 2024-10-04 15:31:19

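What is a file?

A file is a complete collection of information stored on an external medium (such as a hard disk or a USB drive). That information can be text, graphics, images, video, music, or even a virus program.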

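Two important file types

Text file. A text file is directly human-readable: opening it in a text editor such as Notepad shows its contents.
Binary file. A binary file stores data in its binary-encoded form; BMP is one example. Because the content is binary-encoded, opening it in Notepad shows garbled characters, whereas an image viewer can decode a BMP.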

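Advantages and disadvantages of text files and binary files

String types in Python 3

bytes

Conversion: bytes -> str: decode('utf8')

str

Conversion: str -> bytes: encode('utf8')

Note: encode may use any encoding that can represent the text, but decode must use the same encoding that produced the bytes.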
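The round trip between str and bytes can be sketched as follows. The sample string is made up for illustration, and the mismatched decode uses ASCII only to show what happens when the codecs disagree.

```python
# Hypothetical sample text containing non-ASCII characters
text = "文件 file"

data = text.encode("utf-8")      # str -> bytes
restored = data.decode("utf-8")  # bytes -> str, same codec as encode

# Decoding with a codec other than the one used to encode fails here,
# because the UTF-8 bytes are outside the ASCII range.
try:
    wrong = data.decode("ascii")
except UnicodeDecodeError as exc:
    wrong = None
    print("decode failed:", exc.reason)

print(type(data).__name__, type(restored).__name__)
```

Some mismatched codecs do not raise at all and silently return the wrong characters, which is why the encoding should be recorded or agreed on rather than guessed.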

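Basic concepts of files

The file buffering mechanism

Read operations: the program does not read from the disk directly. A data stream is opened first, the file's data is copied from disk into a buffer, and the program then reads what it needs from that buffer.

Write operations: data is not written to disk immediately. It is written to the buffer first, and only when the buffer is full or the file is closed is the data written to disk.

The file buffer

A storage area that the system sets aside in memory for the file being processed. It serves as a temporary staging area for data exchanged while the file is read or written.

Advantages of buffering

A file buffer reduces how often the external device is accessed, cuts down the number of data transfers between memory and the device, bridges the speed gap between internal and external storage, and so makes reading and writing more efficient.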

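Basic file operations

The process of accessing a file

Open the file
Read the file (load the information into memory)
Write to the file
Close the file (save the file and release the memory)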
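The steps above can be sketched as follows. The file name demo.txt and the temporary directory are assumptions made so the example does not touch real data; the file is written first so that there is something to read back.

```python
import os
import tempfile

# Hypothetical scratch file inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

f = open(path, "w", encoding="utf-8")  # open for writing
f.write("hello\n")                     # write
f.close()                              # close: flushes the buffer to disk

f = open(path, "r", encoding="utf-8")  # open for reading
content = f.read()                     # read the contents into memory
f.close()                              # close: releases the file handle

print(repr(content))
```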

Opening a file (open)

help(open)

Help on built-in function open in module io:

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)
    Open file and return a stream.  Raise OSError upon failure.
    
    file is either a text or byte string giving the name (and the path
    if the file isn't in the current working directory) of the file to
    be opened or an integer file descriptor of the file to be
    wrapped. (If a file descriptor is given, it is closed when the
    returned I/O object is closed, unless closefd is set to False.)
    
    mode is an optional string that specifies the mode in which the file
    is opened. It defaults to 'r' which means open for reading in text
    mode.  Other common values are 'w' for writing (truncating the file if
    it already exists), 'x' for creating and writing to a new file, and
    'a' for appending (which on some Unix systems, means that all writes
    append to the end of the file regardless of the current seek position).
    In text mode, if encoding is not specified the encoding used is platform
    dependent: locale.getpreferredencoding(False) is called to get the
    current locale encoding. (For reading and writing raw bytes use binary
    mode and leave encoding unspecified.) The available modes are:
    
    ========= ===============================================================
    Character Meaning
    --------- ---------------------------------------------------------------
    'r'       open for reading (default)
    'w'       open for writing, truncating the file first
    'x'       create a new file and open it for writing
    'a'       open for writing, appending to the end of the file if it exists
    'b'       binary mode
    't'       text mode (default)
    '+'       open a disk file for updating (reading and writing)
    'U'       universal newline mode (deprecated)
    ========= ===============================================================
    
    The default mode is 'rt' (open for reading text). For binary random
    access, the mode 'w+b' opens and truncates the file to 0 bytes, while
    'r+b' opens the file without truncation. The 'x' mode implies 'w' and
    raises an `FileExistsError` if the file already exists.
    
    Python distinguishes between files opened in binary and text modes,
    even when the underlying operating system doesn't. Files opened in
    binary mode (appending 'b' to the mode argument) return contents as
    bytes objects without any decoding. In text mode (the default, or when
    't' is appended to the mode argument), the contents of the file are
    returned as strings, the bytes having been first decoded using a
    platform-dependent encoding or using the specified encoding if given.
    
    'U' mode is deprecated and will raise an exception in future versions
    of Python.  It has no effect in Python 3.  Use newline to control
    universal newlines mode.
    
    buffering is an optional integer used to set the buffering policy.
    Pass 0 to switch buffering off (only allowed in binary mode), 1 to select
    line buffering (only usable in text mode), and an integer > 1 to indicate
    the size of a fixed-size chunk buffer.  When no buffering argument is
    given, the default buffering policy works as follows:
    
    * Binary files are buffered in fixed-size chunks; the size of the buffer
      is chosen using a heuristic trying to determine the underlying device's
      "block size" and falling back on `io.DEFAULT_BUFFER_SIZE`.
      On many systems, the buffer will typically be 4096 or 8192 bytes long.
    
    * "Interactive" text files (files for which isatty() returns True)
      use line buffering.  Other text files use the policy described above
      for binary files.
    
    encoding is the name of the encoding used to decode or encode the
    file. This should only be used in text mode. The default encoding is
    platform dependent, but any encoding supported by Python can be
    passed.  See the codecs module for the list of supported encodings.
    
    errors is an optional string that specifies how encoding errors are to
    be handled---this argument should not be used in binary mode. Pass
    'strict' to raise a ValueError exception if there is an encoding error
    (the default of None has the same effect), or pass 'ignore' to ignore
    errors. (Note that ignoring encoding errors can lead to data loss.)
    See the documentation for codecs.register or run 'help(codecs.Codec)'
    for a list of the permitted encoding error strings.
    
    newline controls how universal newlines works (it only applies to text
    mode). It can be None, '', '\n', '\r', and '\r\n'.  It works as
    follows:
    
    * On input, if newline is None, universal newlines mode is
      enabled. Lines in the input can end in '\n', '\r', or '\r\n', and
      these are translated into '\n' before being returned to the
      caller. If it is '', universal newline mode is enabled, but line
      endings are returned to the caller untranslated. If it has any of
      the other legal values, input lines are only terminated by the given
      string, and the line ending is returned to the caller untranslated.
    
    * On output, if newline is None, any '\n' characters written are
      translated to the system default line separator, os.linesep. If
      newline is '' or '\n', no translation takes place. If newline is any
      of the other legal values, any '\n' characters written are translated
      to the given string.
    
    If closefd is False, the underlying file descriptor will be kept open
    when the file is closed. This does not work when a file name is given
    and must be True in that case.
    
    A custom opener can be used by passing a callable as *opener*. The
    underlying file descriptor for the file object is then obtained by
    calling *opener* with (*file*, *flags*). *opener* must return an open
    file descriptor (passing os.open as *opener* results in functionality
    similar to passing None).
    
    open() returns a file object whose type depends on the mode, and
    through which the standard file operations such as reading and writing
    are performed. When open() is used to open a file in a text mode ('w',
    'r', 'wt', 'rt', etc.), it returns a TextIOWrapper. When used to open
    a file in a binary mode, the returned class varies: in read binary
    mode, it returns a BufferedReader; in write binary and append binary
    modes, it returns a BufferedWriter, and in read/write mode, it returns
    a BufferedRandom.
    
    It is also possible to use a string or bytearray as a file for both
    reading and writing. For strings StringIO can be used like a file
    opened in a text mode, and for bytes a BytesIO can be used like a file
    opened in a binary mode.

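file: name of the file to open (str)
mode: how the file is opened (str), in text or binary form
encoding: the file's encoding (str)
errors: how encoding errors are handled (str), 'ignore' or 'strict' (the default)
buffering: buffering policy (int)

Opening a file named example.txt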
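The original example for this step is not present in the text, so the following is only a minimal sketch. It assumes example.txt lives in a temporary directory and creates it first so that the open call has something to find.

```python
import io
import os
import tempfile

# Assumption: example.txt is created here only so the sketch can run
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w", encoding="utf-8") as setup:
    setup.write("first line\n")

# Open for reading in text mode with an explicit encoding
f = open(path, mode="r", encoding="utf-8")
print(type(f).__name__)   # text mode returns a TextIOWrapper
print(f.mode, f.encoding)
first = f.readline()
f.close()
```

Passing encoding explicitly avoids depending on the platform default described in the help text.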

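File encoding

encoding: the file's encoding (str)

The default value of encoding is None, in which case the encoding actually used depends on the platform.

encoding is the name of the encoding used to decode or encode the file. This should only
be used in text mode. The default encoding is platform dependent.

Why encoding is needed

To a computer, all information is binary, made up of 0s and 1s.

People cannot carry out every computer operation using raw binary alone.

Character encoding solves the communication problem between people and computers.

Common encodings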
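The list of encodings that belongs here is missing from the text. As a hedged illustration only, the sketch below encodes one character with three codecs chosen for this example (UTF-8, GBK, ASCII) to show that the same character maps to different bytes, or to none at all.

```python
ch = "中"  # sample character chosen for this illustration

utf8_bytes = ch.encode("utf-8")  # three bytes in UTF-8
gbk_bytes = ch.encode("gbk")     # two bytes in GBK

# ASCII covers only 128 characters, so this character cannot be encoded
try:
    ch.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(utf8_bytes, gbk_bytes, ascii_ok)
```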

mode

(Choose exactly one of r, w, x, a; choose one of t, b; + is optional.)
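A few mode combinations, sketched on a scratch file. The file name modes.txt and the temporary directory are assumptions for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.txt")  # hypothetical file

with open(path, "x", encoding="utf-8") as f:   # 'x': create, fail if it exists
    f.write("one\n")

with open(path, "a", encoding="utf-8") as f:   # 'a': append to the end
    f.write("two\n")

with open(path, "r+", encoding="utf-8") as f:  # 'r+': read and write, no truncation
    before = f.read()

with open(path, "w", encoding="utf-8") as f:   # 'w': truncate, then write
    f.write("three\n")

with open(path, "rb") as f:                    # 'rb': binary mode returns bytes
    raw = f.read()

# 'x' refuses to overwrite an existing file
try:
    open(path, "x")
    exists_error = False
except FileExistsError:
    exists_error = True

print(repr(before), raw, exists_error)
```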

Closing a file

Using the with statement

f.close()
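The two ways of closing a file can be compared in a short sketch. The scratch file name is an assumption; the point is that with closes the file automatically, even if an exception is raised, whereas f.close() has to be called explicitly, typically in a finally clause.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "close_demo.txt")  # hypothetical file

# Manual close: try/finally ensures close() runs even if write() fails
f = open(path, "w", encoding="utf-8")
try:
    f.write("closed manually\n")
finally:
    f.close()
closed_manually = f.closed

# with statement: the file is closed when the block exits
with open(path, "a", encoding="utf-8") as g:
    g.write("closed by with\n")
closed_by_with = g.closed

print(closed_manually, closed_by_with)
```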

Reading file contents

Use a database driver library (such as sqlite3 or mysql-connector-python) to interact with the corresponding database.

import sqlite3

# Connect to the SQLite database (assumes a database named example.db
# that already contains a users table)
conn = sqlite3.connect('example.db')

# Create a cursor object
cursor = conn.cursor()

# Execute the SQL query
cursor.execute("SELECT * FROM users")

# Fetch all rows
rows = cursor.fetchall()

# Print each row
for row in rows:
    print(row)

# Close the connection
conn.close()

readlines is one of Python's file-reading methods. It reads all the lines of the file at once and stores each line as a string in a list.

with open('file.txt', 'r') as file:
    lines = file.readlines()

# lines is now a list containing every line of text
print(lines)
# Output:
# ['Hello, this is line 1.\n', 'This is line 2.\n', 'And this is line 3.\n']

# Access a specific line
print(lines[0].strip())  # Output: Hello, this is line 1.

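readline is one of Python's file-reading methods. Each call reads one line from the file and returns it as a string, including its trailing newline.

with open('file.txt', 'r') as file:
    line1 = file.readline()
    line2 = file.readline()
    line3 = file.readline()

# Each line keeps its '\n', so suppress print's own newline
print(line1, end='')  # Output: Hello, this is line 1.
print(line2, end='')  # Output: This is line 2.
print(line3, end='')  # Output: And this is line 3.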

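readlines and readline are two different methods for reading files in Python, and there are some important differences between them:

readlines reads all the lines of the file at once and returns a list containing every line.
readline reads the file line by line, returning one line per call, which suits large files and reduces memory use.
Both methods keep the trailing newline on each line, so it has to be stripped manually (for example with strip() or rstrip('\n')).

Return type: readlines returns a list of all the lines in the file, where each element is one line of text as a string.
When to use it: suitable for files with multiple lines of text when the whole file can be loaded into memory at once, that is, when the file is small enough to fit in memory.

Return type: readline returns only one line of the file as a string per call. Calling it again returns the next line. Once the end of the file is reached, it returns the empty string ''.
When to use it: suitable for processing large files line by line, which keeps memory use low. Because it reads only one line at a time, the file can be processed in a loop without loading all of it into memory.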

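with open('file.txt', 'r') as file:
    line = file.readline()
    while line != '':
        print(line.strip())  # strip surrounding whitespace, including the newline
        line = file.readline()

Which method to use depends on the size of the file and on how it needs to be processed. If the file is small enough to fit in memory, use readlines; if the file is large and can be processed one line at a time, use readline.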
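A third option, not covered above, is to iterate over the file object directly; it yields one line at a time much like repeated readline calls. The sketch below assumes a scratch copy of file.txt with the three sample lines used earlier.

```python
import os
import tempfile

# Assumption: recreate the sample file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Hello, this is line 1.\nThis is line 2.\nAnd this is line 3.\n")

stripped = []
with open(path, "r", encoding="utf-8") as f:
    for line in f:                          # one line per iteration
        stripped.append(line.rstrip("\n"))  # drop only the trailing newline

print(stripped)
```

This avoids the explicit end-of-file check against '' that the readline loop needs.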

Writing to a file

f.write('sth')
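A slightly fuller sketch of writing, assuming a scratch file in a temporary directory. In text mode write returns the number of characters written, and neither write nor writelines adds newlines on its own.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "write_demo.txt")  # hypothetical file

with open(path, "w", encoding="utf-8") as f:
    count = f.write("sth")        # returns the number of characters written
    f.write("\n")                 # newlines must be written explicitly
    f.writelines(["a\n", "b\n"])  # writes each string as-is, no separators added

with open(path, "r", encoding="utf-8") as f:
    content = f.read()

print(count, repr(content))
```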

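Writing to a database

Use a database driver library (such as sqlite3 or mysql-connector-python) to interact with the corresponding database.

import sqlite3

# Connect to the SQLite database (assumes a database named example.db
# that already contains a users table)
conn = sqlite3.connect('example.db')

# Create a cursor object
cursor = conn.cursor()

# Execute the SQL INSERT statement
cursor.execute("INSERT INTO users (name, age, occupation) VALUES (?, ?, ?)", ('jack', 20, 'Math'))

# Commit the change
conn.commit()

# Close the connection
conn.close()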
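Both database snippets assume that example.db already contains a users table. A self-contained variant is sketched below; the in-memory database and the column types are assumptions, with only the column names taken from the INSERT statement above.

```python
import sqlite3

# ":memory:" creates a throwaway database, so nothing is left on disk
conn = sqlite3.connect(":memory:")
try:
    # Assumed schema: column names from the INSERT above, types guessed
    conn.execute("CREATE TABLE users (name TEXT, age INTEGER, occupation TEXT)")
    conn.execute(
        "INSERT INTO users (name, age, occupation) VALUES (?, ?, ?)",
        ("jack", 20, "Math"),
    )
    conn.commit()
    rows = conn.execute("SELECT * FROM users").fetchall()
finally:
    conn.close()  # close the connection even if a statement fails

print(rows)
```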

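Why data is not written to disk in real time

A hard disk is a slow device; frequent reads and writes increase the load on the disk and create a bottleneck.

When data is written to disk

f.flush()
f.close()
The buffering setting (default: io.DEFAULT_BUFFER_SIZE)

0 => written immediately (binary mode only)
1 => line buffering (text mode only): flushed on \n
Any other number n => a buffer of size n, for example 2*4096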
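The effect of the buffer can be observed by checking the file size on disk before and after flush. This is a sketch using scratch files; the exact moment buffered data reaches the disk is an implementation detail, so the size before flush is usually, but not necessarily, 0.

```python
import os
import tempfile

tmp = tempfile.mkdtemp()
buffered_path = os.path.join(tmp, "buffered.txt")  # hypothetical files
raw_path = os.path.join(tmp, "unbuffered.bin")
text_path = os.path.join(tmp, "text.txt")

# Buffered text file: the small write normally stays in memory until flush()
f = open(buffered_path, "w", encoding="utf-8", buffering=2 * 4096)
f.write("buffered")
size_before_flush = os.path.getsize(buffered_path)  # usually 0
f.flush()
size_after_flush = os.path.getsize(buffered_path)
f.close()

# buffering=0 (binary mode only): each write goes straight to the OS
raw = open(raw_path, "wb", buffering=0)
raw.write(b"now")
size_unbuffered = os.path.getsize(raw_path)
raw.close()

# buffering=0 is rejected in text mode
try:
    open(text_path, "w", buffering=0)
    text_unbuffered_allowed = True
except ValueError:
    text_unbuffered_allowed = False

print(size_before_flush, size_after_flush, size_unbuffered)
```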

Other file-object methods
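The text does not list the methods meant here. As an illustration only, the sketch below exercises a few methods that file objects do provide (tell, seek, truncate, readable, writable, seekable) on a scratch binary file.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "methods.bin")  # hypothetical file

with open(path, "w+b") as f:  # binary read/write, truncating first
    f.write(b"0123456789")
    end_pos = f.tell()        # current position: just past the last byte
    f.seek(3)                 # jump to byte offset 3
    chunk = f.read(4)         # read 4 bytes starting there
    f.truncate(5)             # cut the file down to 5 bytes
    flags = (f.readable(), f.writable(), f.seekable())

final_size = os.path.getsize(path)
print(end_pos, chunk, flags, final_size)
```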

The encoding Python uses
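The details for this heading are missing from the text. As a hedged sketch, the standard library exposes the relevant settings: the default str encoding, the locale encoding that the open help text above says is used when encoding is omitted, and the encoding used for file names. The variable names are chosen for this example.

```python
import locale
import sys

# Default encoding for str.encode() / bytes.decode() in Python 3
str_default = sys.getdefaultencoding()

# Locale encoding named in the open() help text; platform dependent
open_default = locale.getpreferredencoding(False)

# Encoding used to convert file names to and from bytes
fs_encoding = sys.getfilesystemencoding()

print(str_default, open_default, fs_encoding)
```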

Detecting a file's encoding: the chardet module

Install: pip install chardet

Detect:
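The detection example itself is missing from the text. A minimal sketch, assuming chardet is installed: chardet.detect takes bytes and returns a dict that includes the guessed encoding and a confidence value. The sample bytes are made up for this example, and the guess is statistical, so it can be wrong, especially on short input.

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:
    chardet = None  # lets the sketch run even without the package

# Made-up sample: GBK-encoded bytes, as if read from a file opened with 'rb'
sample = ("Python 文件操作：文本文件与二进制文件。" * 5).encode("gbk")

if chardet is not None:
    result = chardet.detect(sample)  # dict with 'encoding' and 'confidence'
    print(result["encoding"], result["confidence"])
else:
    result = None
    print("chardet is not installed")
```

For a real file, the bytes would come from open(path, 'rb').read(), and the detected name could then be passed as the encoding argument of open.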
