Spark Operators
Transformation
mapValues
Operates on the V of each K-V pair, leaving the key untouched.
e.g., add 1 to each value:
val rdd = sc.parallelize(List(("x",1),("ys",3),("ab",5),("ab",6)))
rdd.mapValues(_+1).foreach(println)
flatMapValues
Operates on the V of each K-V pair: maps each value to a collection, then flattens it into one pair per element, i.e. (k, v1), (k, v2), ...
val rdd = sc.parallelize(Array(("spark","1,2,3"),("hadoop","4,5,6"),("flink","7,8")))
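The original stops after building the RDD; a minimal sketch of the call itself, splitting each value on commas (assumes a spark-shell session where sc is in scope):
rdd.flatMapValues(_.split(",")).foreach(println)
// emits (spark,1), (spark,2), (spark,3), (hadoop,4), ... (print order may vary)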
keys
Returns an RDD of all keys in the K-V pairs.
// Spark source (PairRDDFunctions): def keys: RDD[K] = self.map(_._1)
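A quick usage sketch on made-up data:
sc.parallelize(List(("a",1),("b",2),("c",3))).keys.foreach(println)  // prints a, b, c (order may vary)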
values
Returns an RDD of the value from each tuple in the RDD.
// Spark source (PairRDDFunctions): def values: RDD[V] = self.map(_._2)
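And the matching sketch for values:
sc.parallelize(List(("a",1),("b",2),("c",3))).values.foreach(println)  // prints 1, 2, 3 (order may vary)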
keyBy
Turns a single-value RDD into (K, V) pairs: keyBy(f) emits (f(x), x) for each element x.
val rdd = sc.parallelize(List("a","ab","ccdd","dde"))
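To finish the example, keying each string by its length (the key function is an assumption; the original does not give one):
rdd.keyBy(_.length).foreach(println)
// emits (1,a), (2,ab), (4,ccdd), (3,dde) (print order may vary)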
reduceByKey
Signature: reduceByKey(func: (V, V) => V): RDD[(K, V)]
Triggers a shuffle, so the job is split into two stages.
Under the hood it runs a map-side combiner (pre-aggregation), and only the pre-aggregated data goes through the shuffle.
val rdd = sc.parallelize(List("ab,ac","ac,bb,aa","bb"))
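One way to finish this as a word count over the comma-separated strings (a sketch, not from the original):
rdd.flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _).foreach(println)
// (ab,1), (ac,2), (bb,2), (aa,1) (print order may vary)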
groupByKey
Returns RDD[(K, Iterable[V])].
Triggers a shuffle, but does no map-side combining (no combiner).
val rdd = sc.parallelize(List("ab,ac","ac,bb,aa","bb"))
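Grouping the same words as in the reduceByKey example (again a sketch to complete the snippet):
rdd.flatMap(_.split(",")).map((_, 1)).groupByKey().foreach(println)
// e.g. (ac,CompactBuffer(1, 1)), (bb,CompactBuffer(1, 1)), (ab,CompactBuffer(1)), (aa,CompactBuffer(1)) (order may vary)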
groupBy
Returns: RDD[(K, Iterable[T])]
Works on single-value RDDs; the grouping rule is a user-defined function (see the sketch after the snippet below).
val rdd = sc.parallelize(1 to 10)
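For instance, grouping by parity (the rule is an assumption; the original does not say which it used):
rdd.groupBy(_ % 2).foreach(println)
// (0,CompactBuffer(2, 4, 6, 8, 10)) and (1,CompactBuffer(1, 3, 5, 7, 9))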
Internally it uses map to turn each single value into a (k, v) pair, then calls groupByKey to do the grouping.
val rdd = sc.parallelize(List("ab","ac","ac","bb","aa","bb"))
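A sketch of that equivalence on the RDD above; both lines yield the same groups:
rdd.groupBy(w => w).foreach(println)
rdd.map(w => (w, w)).groupByKey().foreach(println)  // roughly what groupBy does internally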
sortBy
Sorts the RDD; ascending by default (the second parameter, ascending, defaults to true).
val rdd = sc.parallelize(List(("jim",20),("tom",40),("abby",10)))
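Sorting by the tuple's second field both ways (a sketch; collect() is used so the printed order reflects the sort):
rdd.sortBy(_._2).collect().foreach(println)                    // (abby,10), (jim,20), (tom,40)
rdd.sortBy(_._2, ascending = false).collect().foreach(println) // (tom,40), (jim,20), (abby,10)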
sortByKey
Only for K-V RDDs; sorts by the key K.
val rdd = sc.parallelize(List(("jim",20),("tom",40),("abby",10)))
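Completing the snippet (a sketch; the string keys sort lexicographically here):
rdd.sortByKey().collect().foreach(println)       // (abby,10), (jim,20), (tom,40)
rdd.sortByKey(false).collect().foreach(println)  // (tom,40), (jim,20), (abby,10)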