Spark Data Preprocessing 3

3.1 Descriptive Statistics

import pyspark.sql.types as typ

# Define the schema explicitly so Spark does not have to infer the types
fields = [
    ('custID', typ.IntegerType()),
    ('gender', typ.IntegerType()),
    ('state', typ.IntegerType()),
    ('cardholder', typ.IntegerType()),
    ('balance', typ.IntegerType()),
    ('numTrans', typ.IntegerType()),
    ('numIntTrans', typ.IntegerType()),
    ('creditLine', typ.IntegerType()),
    ('fraudRisk', typ.IntegerType()),
]

schema = typ.StructType([
    typ.StructField(e[0], e[1], True) for e in fields
])

fraud_df = spark.read.csv('/pydata/ccFraud.gz', header=True, schema=schema, sep=',')
fraud_df.printSchema()
root
 |-- custID: integer (nullable = true)
 |-- gender: integer (nullable = true)
 |-- state: integer (nullable = true)
 |-- cardholder: integer (nullable = true)
 |-- balance: integer (nullable = true)
 |-- numTrans: integer (nullable = true)
 |-- numIntTrans: integer (nullable = true)
 |-- creditLine: integer (nullable = true)
 |-- fraudRisk: integer (nullable = true)
fraud_df.groupBy('gender').count().show()
+------+-------+
|gender|  count|
+------+-------+
|     1|6178231|
|     2|3821769|
+------+-------+

⚠️ The gender classes are imbalanced here (roughly 62% vs. 38%); in a real production setting this deserves attention, since class imbalance can bias a model. A quick way to quantify it follows.
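A minimal sketch of that check; the `fraction` column name is just illustrative:

import pyspark.sql.functions as fn

# Share of each gender class in the full dataset
total = fraud_df.count()
fraud_df.groupBy('gender').count() \
    .withColumn('fraction', fn.col('count') / total) \
    .show()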

.describe()

The .describe() method belongs to DataFrames and is not available on RDDs; you can also restrict it to selected columns:

numerical = ['balance', 'numTrans', 'numIntTrans']
desc = fraud_df.describe(numerical)
desc.show()
+-------+-----------------+------------------+-----------------+
|summary|          balance|          numTrans|      numIntTrans|
+-------+-----------------+------------------+-----------------+
|  count|         10000000|          10000000|         10000000|
|   mean|     4109.9199193|        28.9351871|        4.0471899|
| stddev|3996.847309737077|26.553781024522852|8.602970115863767|
|    min|                0|                 0|                0|
|    max|            41485|               100|               60|
+-------+-----------------+------------------+-----------------+
  • All three features are positively skewed: the maximum values are many times larger than the mean.
  • The coefficient of variation (the ratio of the standard deviation to the mean) is very high (close to or greater than 1), meaning the observations are widely spread; a quick check follows this list.
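A minimal sketch of that check, computing stddev/mean for each column in a single pass (the `_cv` alias suffix is just illustrative):

import pyspark.sql.functions as fn

# Coefficient of variation = stddev / mean, one expression per column
fraud_df.agg(*[
    (fn.stddev(c) / fn.mean(c)).alias(c + '_cv') for c in numerical
]).show()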

How do we check the skewness? Here we check it for the 'balance' feature:

fraud_df.agg({'balance':'skewness'}).show()
+------------------+
| skewness(balance)|
+------------------+
|1.1818315552995033|
+------------------+
fraud_df.agg({'balance':'stddev'}).show()
+-----------------+
|  stddev(balance)|
+-----------------+
|3996.847309737077|
+-----------------+

The aggregation functions available this way include (a combined example follows the list):

  • avg()
  • count()
  • countDistinct()
  • first()
  • kurtosis()
  • max()
  • mean()
  • skewness()
  • stddev()
  • stddev_pop()
  • stddev_samp()
  • sum()
  • sumDistinct()
  • var_pop()
  • var_samp()
  • variance()
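
Several of these can be combined in a single .agg(...) call via pyspark.sql.functions, so only one pass over the data is needed; a sketch:

import pyspark.sql.functions as fn

fraud_df.agg(
    fn.kurtosis('balance').alias('kurtosis'),
    fn.variance('balance').alias('variance'),
    fn.countDistinct('state').alias('distinct_states')
).show()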

3.2 Correlations

A model should include only features that are highly correlated with the target, so checking feature correlations is important. In addition, features that are highly correlated with each other (i.e., collinear) can make a model behave unpredictably or add unnecessary complexity.

The .corr(...) method supports only the Pearson correlation coefficient, and it can compute only pairwise correlations:

fraud_df.corr('balance', 'numTrans')
0.00044523140172659576

To build a correlation matrix, you can use the following script:

numerical = ['balance', 'numTrans', 'numIntTrans']

corr = []

for i in range(0, len(numerical)):
    temp = [None] * i
    for j in range(i, len(numerical)):
        temp.append(fraud_df.corr(numerical[i], numerical[j]))
    corr.append(temp)

# temp = [None] * i pads the lower-left triangle of the matrix with None
import pandas as pd

corrpd = pd.DataFrame(corr)
corrpd.columns = corrpd.index = numerical
corrpd

                balance  numTrans  numIntTrans
balance             1.0  0.000445     0.000271
numTrans            NaN  1.000000    -0.000281
numIntTrans         NaN       NaN     1.000000
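
The off-diagonal correlations are essentially zero, so none of these features needs to be dropped for collinearity. Note that the loop above calls .corr(...) once per pair, each a separate pass over the data; on Spark 2.2+ the full matrix can be computed in one pass with the ML statistics package. A sketch:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

# Pack the numeric columns into one vector column, then compute the
# whole Pearson correlation matrix at once
assembler = VectorAssembler(inputCols=numerical, outputCol='features')
vec_df = assembler.transform(fraud_df).select('features')
print(Correlation.corr(vec_df, 'features').head()[0].toArray())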

3.3 Visualization

3.3.1 Histograms

When the dataset is very large, aggregate the data on the workers first, so that only the bucket boundaries and counts are returned to the driver:

hists = fraud_df.select('balance').rdd.flatMap(lambda x: x).histogram(20)
from matplotlib import pyplot as plt
%matplotlib inline

data = {
    'bins': hists[0][:-1],
    'freq': hists[1]
}
plt.bar(data['bins'], data['freq'], width=2000)
plt.title('Histogram of \'balance\'')
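
If a sample is acceptable, an alternative sketch is to sample first (fractions as in Section 3.3.2) and let matplotlib bin the data locally:

# Assumes the ~2,000-row sample fits comfortably in driver memory
sample = fraud_df.sampleBy('gender', {1: 0.0002, 2: 0.0002}) \
    .select('balance') \
    .toPandas()
plt.hist(sample['balance'], bins=20)
plt.title('Histogram of \'balance\' (sampled)')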

3.3.2 Scatter Plots

PySpark does not provide any server-side visualization modules, and trying to plot billions of observations at once would be impractical anyway. In this example we therefore draw a stratified sample of 0.02% of the fraud dataset, about 2,000 observations.

Use the DataFrame's .sampleBy(...) method to draw the stratified random sample:

numerical = ['balance', 'numTrans', 'numIntTrans']

# Sample 0.02% from each gender stratum (~2,000 rows in total)
data_sample = fraud_df.sampleBy('gender', {1: 0.0002, 2: 0.0002}).select(numerical)
data_sample.describe().show()
+-------+------------------+------------------+-----------------+
|summary|           balance|          numTrans|      numIntTrans|
+-------+------------------+------------------+-----------------+
|  count|              2032|              2032|             2032|
|   mean| 3962.410433070866|28.546751968503937|3.989665354330709|
| stddev|3876.8504670568145|25.611704520599815|8.235480127222067|
|    min|                 0|                 0|                0|
|    max|             28000|               100|               60|
+-------+------------------+------------------+-----------------+
# Pull the three numeric columns back to the driver as a list of lists
data = data_sample.rdd.map(lambda x: [x[0], x[1], x[2]]).collect()
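
Since the sample is tiny, .toPandas() is an equivalent way to bring it to the driver (a sketch; data_pd is an illustrative name):

# Collect the sample as a pandas DataFrame, then convert to a list of lists
data_pd = data_sample.toPandas()
data = data_pd.values.tolist()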
from pyecharts import Scatter3D

range_color = [
    '#313695', '#4575b4', '#74add1', '#abd9e9', '#e0f3f8', '#ffffbf',
    '#fee090', '#fdae61', '#f46d43', '#d73027', '#a50026'
]
scatter3D = Scatter3D("3D scatter plot example", width=1200, height=800)
scatter3D.add(
    "3D",
    data,
    is_visualmap=True,
    visual_range_color=range_color,
    grid3d_opacity=0.5,
    is_grid3d_rotate=True,
    xaxis3d_name='balance',
    yaxis3d_name='numTrans',
    zaxis3d_name='numIntTrans'
)
scatter3D.render()
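
The Scatter3D import above matches the legacy pyecharts 0.5.x API; newer pyecharts releases lay out their modules differently. If pyecharts is unavailable, a matplotlib 3D scatter is a workable fallback (a minimal sketch):

from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, registers the '3d' projection
from matplotlib import pyplot as plt

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(
    [d[0] for d in data],  # balance
    [d[1] for d in data],  # numTrans
    [d[2] for d in data],  # numIntTrans
    alpha=0.5
)
ax.set_xlabel('balance')
ax.set_ylabel('numTrans')
ax.set_zlabel('numIntTrans')
plt.show()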