In Spark versions earlier than 3.2.0, when a Python UDF crashes at runtime with a segmentation fault, the Spark executor's error log is vague and shows only a single line:

python worker exited unexpectedly (crashed)

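A segmentation fault kills the process with a signal and a core dump, so ordinary language-level try/catch exception handling never sees it, which makes troubleshooting very hard. This was fixed in Spark 3.2; for details see the issue: [SPARK-36062] Try to capture faulthanlder when a Python worker crashes. - ASF JIRA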
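Note that in Spark 3.2 and later this capture is, as far as I know, opt-in: the Spark configuration reference lists `spark.python.worker.faulthandler.enabled` with a default of `false`. Below is a minimal sketch of turning it on from PySpark. The helper name `build_session` and the app name are made up for illustration, and the function assumes PySpark 3.2+ is installed.

```python
# Setting documented in the Spark configuration reference (since 3.2.0).
FAULTHANDLER_CONF = {"spark.python.worker.faulthandler.enabled": "true"}


def build_session(app_name="faulthandler-demo"):
    """Hypothetical helper: create a SparkSession with faulthandler capture on."""
    # Imported lazily so this module can be loaded without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in FAULTHANDLER_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

The same setting can also be passed on the command line with `--conf spark.python.worker.faulthandler.enabled=true`.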

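On Spark versions below 3.2.0, this feature can be backported. I tried to cherry-pick the change onto Spark 3.0.1 but ran into many conflicts, so to be safe I merged it by hand. With the patch applied, when a Python process crashes again, the executor error log shown above turns into the much more detailed log below: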

23/02/28 18:44:06 ERROR Executor: Exception in task 2.0 in stage 4.0 (TID 8)
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed): Fatal Python error: Segmentation fault

Current thread 0x00007f247aafe740 (most recent call first):
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 922 in create_module
  File "<frozen importlib._bootstrap>", line 571 in module_from_spec
  File "<frozen importlib._bootstrap>", line 658 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 684 in _load
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/imp.py", line 343 in load_dynamic
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/imp.py", line 243 in load_module
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24 in swig_import_helper
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1023 in _handle_fromlist
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 49 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/tensorflow/__init__.py", line 24 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 5 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/backend/load_backend.py", line 90 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/backend/__init__.py", line 1 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1023 in _handle_fromlist
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/utils/conv_utils.py", line 9 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1023 in _handle_fromlist
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/utils/__init__.py", line 6 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1023 in _handle_fromlist
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/keras/__init__.py", line 3 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 941 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/data1/emr/yarn/local/usercache/test/appcache/application_1653035898918_2410947/container_e32_1653035898918_2410947_01_000003/f0becb3851ea1c4ce2647d4b63c4e2a7/lib/python3.6/site-packages/cloudpickle/cloudpickle.py", line 908 in subimport

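This makes troubleshooting much easier: the trace clearly shows which dependency caused the crash. Here the process died while loading TensorFlow's native extension module (pywrap_tensorflow_internal), pulled in by importing keras.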

The segmentation fault can be captured thanks to the faulthandler module, which was added in Python 3.3. Its functions dump Python tracebacks explicitly, after a fault, after a timeout, or on a user signal.
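As a small illustration of the "explicit" mode, the sketch below asks faulthandler to write the current traceback into a temporary file and reads it back. The function name `capture_traceback` is made up for this example; `faulthandler.dump_traceback` is the standard-library call. The other modes are `faulthandler.enable()`, `faulthandler.dump_traceback_later()` and `faulthandler.register()`.

```python
import faulthandler
import tempfile


def capture_traceback():
    """Dump the current Python traceback via faulthandler and return it as text."""
    # faulthandler writes straight to the file descriptor, so it needs a real
    # file (something with fileno()), not an in-memory io.StringIO.
    with tempfile.TemporaryFile(mode="w+") as f:
        faulthandler.dump_traceback(file=f, all_threads=True)
        f.seek(0)
        return f.read()


if __name__ == "__main__":
    print(capture_traceback())
```

The output has the same "most recent call first" layout as the crash log above, because the same module produces both.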

A small example from the official Python documentation:

$ python3 -c "import ctypes; ctypes.string_at(0)"
Segmentation fault

$ python3 -q -X faulthandler
>>> import ctypes
>>> ctypes.string_at(0)
Fatal Python error: Segmentation fault

Current thread 0x00007fb899f39700 (most recent call first):
  File "/home/python/cpython/Lib/ctypes/__init__.py", line 486 in string_at
  File "<stdin>", line 1 in <module>
Segmentation fault
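To connect this back to Spark: a parent process can recover the dump if the worker enables faulthandler with a log file named after its own pid, and the parent reads that file after the worker dies. The sketch below is a simplified, standalone illustration of that idea, not the actual Spark patch. `CHILD` and `run_and_collect` are hypothetical names, and the deliberate crash reuses the `ctypes.string_at(0)` trick from the example above.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# Child script: enable faulthandler with a per-process log file, then crash.
CHILD = textwrap.dedent("""
    import ctypes, faulthandler, os, sys
    log_path = os.path.join(sys.argv[1], str(os.getpid()))
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file)
    ctypes.string_at(0)  # deliberately read from a NULL pointer
""")


def run_and_collect(log_dir):
    """Run the crashing child and return (exit code, faulthandler dump)."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD, log_dir])
    proc.wait()
    # The child named its log after its own pid, which the parent also knows.
    log_path = os.path.join(log_dir, str(proc.pid))
    with open(log_path) as f:
        return proc.returncode, f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as log_dir:
        code, dump = run_and_collect(log_dir)
        print("child exit code:", code)
        print(dump)
```

Writing to a file instead of stderr keeps the dump separate from whatever else the worker prints, so the parent can attach it to its own error message.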

For more details, see: faulthandler: Dump the Python traceback (Python 3.11.2 documentation)
