IT技术分享

Common RDD APIs: A Roundup

2018-04-14

countByKey

Very similar to count, but for an RDD of two-element tuples it counts elements separately for each distinct key. In other words, it counts how many times each key occurs.

Definition

def countByKey(): Map[K, Long]

Example

val rdd = sc.parallelize(List((3, "3"), (3, "33"), (5, "55"), (3, "333")), 2)
rdd.countByKey
res: scala.collection.Map[Int,Long] = Map(3 -> 3, 5 -> 1)
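The same per-key counting semantics can be sketched on plain Scala collections, without a SparkContext, as a minimal illustration:

```scala
// Sketch of countByKey semantics on a plain List (no Spark needed):
// group the pairs by key, then count the elements in each group.
val pairs = List((3, "3"), (3, "33"), (5, "55"), (3, "333"))
val counts: Map[Int, Long] =
  pairs.groupBy(_._1).map { case (k, vs) => k -> vs.size.toLong }
// counts == Map(3 -> 3L, 5 -> 1L), matching the RDD result above
```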

keyBy

Takes each original element as the value and attaches a key to it, computed by applying a function to the element. sortBy uses keyBy under the hood.

Definition

def keyBy[K](f: T => K): RDD[(K, T)]

Example

val rdd = sc.parallelize(List("dog", "lion", "rat", "elephant"), 3)
rdd.keyBy(_.length).collect  // equivalent to rdd.map(x => (x.length, x))
res: Array[(Int, String)] = Array((3,dog), (4,lion), (3,rat), (8,elephant))
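The equivalence noted in the comment above can be demonstrated on a plain Scala List, since keyBy(f) is just map(x => (f(x), x)):

```scala
// keyBy(f) sketched as map(x => (f(x), x)) on a plain List:
val words = List("dog", "lion", "rat", "elephant")
val keyed: List[(Int, String)] = words.map(w => (w.length, w))
// keyed == List((3,"dog"), (4,"lion"), (3,"rat"), (8,"elephant"))
```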

foldByKey

Much like reduceByKey, except that it takes an initial (zero) value. Note that the zero value is applied once per partition, so it should be a neutral element for the fold function (e.g. "" for string concatenation, 0 for addition).

Definition

def foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)]
def foldByKey(zeroValue: V, numPartitions: Int)(func: (V, V) => V): RDD[(K, V)]
def foldByKey(zeroValue: V, partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)]

Example

val rdd = sc.parallelize(List("dog", "lion", "cat", "eagle"), 2)
rdd.keyBy(_.length).foldByKey("")(_ + _).collect
res: Array[(Int, String)] = Array((4,lion), (3,dogcat), (5,eagle))
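The per-key fold can be sketched on plain Scala collections: group the pairs by key, then fold each key's values starting from the zero value. (This simplifies Spark's behavior, where the zero value is applied once per partition rather than once per key.)

```scala
// Sketch of foldByKey("")(_ + _) semantics, mirroring the example above:
// keyBy(_.length) pairs each word with its length, then values with the
// same key are folded together starting from the empty string.
val pairs = List((3, "dog"), (4, "lion"), (3, "cat"), (5, "eagle"))
val folded: Map[Int, String] =
  pairs.groupBy(_._1).map { case (k, kvs) =>
    k -> kvs.map(_._2).foldLeft("")(_ + _)
  }
// folded == Map(3 -> "dogcat", 4 -> "lion", 5 -> "eagle")
```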
Spark
Copyright:

IT技术分享

Permalink:

https://idunso.com/archives/2644/ (please credit the source and include this link when republishing)