PySpark Feature Engineering: PCA
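PCA: Principal Component Analysis
class pyspark.ml.feature.PCA(k=None, inputCol=None, outputCol=None)
Principal component analysis takes a set of original indicators that are correlated with one another (say, P indicators) and recombines them into a new set of mutually uncorrelated composite indicators that replace the originals.
PCA trains a model that projects vectors onto the lower-dimensional space spanned by the first k principal components.
model.explainedVariance: returns a vector holding the proportion of variance explained by each principal component.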
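To make the phrase "mutually uncorrelated composite indicators" concrete, here is a minimal NumPy sketch of the idea. It is not Spark's implementation; the toy data, the random seed, and all variable names are assumptions made up for illustration. It centers the data, eigendecomposes the covariance matrix, and then checks that the new coordinates have (numerically) zero covariance with one another.

```python
import numpy as np

# Hypothetical toy data: 200 samples of 3 indicators, two of them strongly correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(200, 1)),
               2.0 * base + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])

# Center each column, then compute the sample covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# The eigenvectors of the covariance matrix are the principal directions.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]      # largest variance first
components = eigvecs[:, order]

# Scores: the data expressed in the new coordinates.
scores = Xc @ components
score_cov = np.cov(scores, rowvar=False)
print(np.round(score_cov, 6))          # off-diagonal entries are ~0
```

The diagonal of score_cov holds the variance captured by each component, sorted from largest to smallest; keeping only the first k columns of scores is the dimensionality reduction step.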
01. Create the data
from pyspark.sql import SparkSession
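# Spark config keys are case-sensitive: the key is "spark.driver.host".
# The address is the author's machine; change or drop it for your own setup.
spark = SparkSession.builder.config("spark.driver.host","192.168.1.4")\
    .config("spark.ui.showConsoleProgress","false")\
    .appName("PCA").master("local[*]").getOrCreate()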
#%%
from pyspark.ml.linalg import Vectors
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
(Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data,["features"])
df.show()
Output:
+--------------------+
|            features|
+--------------------+
| (5,[1,3],[1.0,7.0])|
|[2.0,0.0,3.0,4.0,...|
|[4.0,0.0,0.0,6.0,...|
+--------------------+
02. Inspect the rows in detail
df.head(3)
Output:
[Row(features=SparseVector(5, {1: 1.0, 3: 7.0})),
Row(features=DenseVector([2.0, 0.0, 3.0, 4.0, 5.0])),
Row(features=DenseVector([4.0, 0.0, 0.0, 6.0, 7.0]))]
03. Inspect the schema; note that sparse and dense vectors coexist in the same column
df.printSchema()
Output:
root
|-- features: vector (nullable = true)
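Both kinds of vector fit in one column because a sparse vector is just a compact description of the same values. As a rough NumPy illustration (sparse_to_dense is a hypothetical helper written for this post, not part of PySpark), the sparse row (5,[1,3],[1.0,7.0]) from step 01 expands like this:

```python
import numpy as np

def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse description into a dense array."""
    dense = np.zeros(size)
    dense[list(indices)] = values      # every position not listed stays 0.0
    return dense

# The first row of the DataFrame: length 5, non-zeros at positions 1 and 3.
row = sparse_to_dense(5, [1, 3], [1.0, 7.0])
print(row)                             # [0. 1. 0. 7. 0.]
```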
04. Apply PCA
from pyspark.ml.feature import PCA
pca = PCA(k=2,inputCol="features",outputCol="res")
model = pca.fit(df)
model.transform(df).show()
Output:
+--------------------+--------------------+
|            features|                 res|
+--------------------+--------------------+
| (5,[1,3],[1.0,7.0])|[1.64857282308838...|
|[2.0,0.0,3.0,4.0,...|[-4.6451043317815...|
|[4.0,0.0,0.0,6.0,...|[-6.4288805356764...|
+--------------------+--------------------+
05. Inspect the transformed rows in detail
model.transform(df).head(3)
Output:
[Row(features=SparseVector(5, {1: 1.0, 3: 7.0}), res=DenseVector([1.6486, -4.0133])),
Row(features=DenseVector([2.0, 0.0, 3.0, 4.0, 5.0]), res=DenseVector([-4.6451, -1.1168])),
Row(features=DenseVector([4.0, 0.0, 0.0, 6.0, 7.0]), res=DenseVector([-6.4289, -5.338]))]
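To see where these numbers come from, the projection can be reproduced by hand with NumPy. This is only a sketch and not Spark's code path. It assumes that the principal directions come from the sample covariance matrix and that the raw (uncentered) rows are projected onto them, which is what the values above are consistent with.

```python
import numpy as np

# The three rows from step 01, written out densely.
X = np.array([[0.0, 1.0, 0.0, 7.0, 0.0],
              [2.0, 0.0, 3.0, 4.0, 5.0],
              [4.0, 0.0, 0.0, 6.0, 7.0]])

# Sample covariance of the five columns (np.cov centers the data internally).
cov = np.cov(X, rowvar=False)

# Keep the eigenvectors that belong to the k = 2 largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(cov)
top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]

# Assumption: the raw rows, not the centered ones, are projected.
projected = X @ top2
print(np.round(projected, 4))
```

The sign of an eigenvector is arbitrary, so a whole column of projected may come out negated relative to the res column above; the absolute values should agree.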
06. Explained variance vector
model.explainedVariance
Output:
DenseVector([0.7944, 0.2056])
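These proportions can be cross-checked outside Spark. The sketch below is a NumPy illustration under the usual PCA definition, not Spark's implementation: it takes the singular values of the centered data, squares them to get quantities proportional to the variance along each principal direction, and normalizes them so they sum to 1.

```python
import numpy as np

# The three rows from step 01, written out densely.
X = np.array([[0.0, 1.0, 0.0, 7.0, 0.0],
              [2.0, 0.0, 3.0, 4.0, 5.0],
              [4.0, 0.0, 0.0, 6.0, 7.0]])

# Center each column so the singular values describe variance, not the mean.
Xc = X - X.mean(axis=0)

# Squared singular values are proportional to the variance along each direction.
singular_values = np.linalg.svd(Xc, compute_uv=False)
variances = singular_values ** 2

# Normalize to proportions and keep the first k = 2.
ratios = variances / variances.sum()
print(np.round(ratios[:2], 4))
```

The two proportions add up to 1 here because three data points, once centered, span at most a two-dimensional plane, so k=2 already captures all of the variance in this tiny example.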
