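countByKey is very similar to count, but for an RDD of two-element tuples it counts the elements separately for each distinct key. In other words, it counts by key: the result tells you how many times each key occurs.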
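As a rough illustration of the semantics only, here is a minimal sketch that models countByKey with plain Scala collections. It is not Spark's implementation; the object name CountByKeySketch and the Seq-based signature are assumptions made for the example.

```scala
// Plain-Scala model of what countByKey computes (hypothetical helper, not Spark code).
object CountByKeySketch {
  // Group the pairs by their key and count the members of each group.
  // The values themselves are ignored; only the keys matter.
  def countByKey[K, V](pairs: Seq[(K, V)]): Map[K, Long] =
    pairs.groupBy(_._1).map { case (k, group) => k -> group.size.toLong }

  def main(args: Array[String]): Unit = {
    val pairs = List((3, "3"), (3, "33"), (5, "55"), (3, "333"))
    println(countByKey(pairs))
  }
}
```

In Spark itself, countByKey is an action: the resulting Map is returned to the driver, so it is best suited to data with a modest number of distinct keys.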
Definition
def countByKey(): Map[K, Long]
Example
val rdd = sc.parallelize(List((3, "3"), (3, "33"), (5, "55"), (3, "333")), 2)
rdd.countByKey
res: scala.collection.Map[Int,Long] = Map(3 -> 3, 5 -> 1)

keyBy keeps each original element as the value and applies a function to it to produce the key. sortBy uses keyBy under the hood.
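To make the relationship between keyBy and sortBy concrete, the sketch below models both on plain Scala collections. It is an approximation of the idea (attach a key, sort on the key, drop the key), not Spark's source; the object name KeyBySketch and the Seq-based signatures are assumptions.

```scala
// Plain-Scala model of keyBy and of a sortBy built on top of it (hypothetical helpers).
object KeyBySketch {
  // keyBy: pair every element with the key computed from it.
  def keyBy[T, K](xs: Seq[T])(f: T => K): Seq[(K, T)] =
    xs.map(x => (f(x), x))

  // sortBy modelled as: keyBy, then sort on the key, then discard the key.
  def sortBy[T, K: Ordering](xs: Seq[T])(f: T => K): Seq[T] =
    keyBy(xs)(f).sortBy(_._1).map(_._2)

  def main(args: Array[String]): Unit = {
    val words = List("dog", "lion", "rat", "elephant")
    println(keyBy(words)(_.length))
    println(sortBy(words)(_.length))
  }
}
```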
Definition
def keyBy[K](f: T => K): RDD[(K, T)]
Example
val rdd = sc.parallelize(List("dog", "lion", "rat", "elephant"), 3)
rdd.keyBy(_.length).collect // equivalent to rdd.map(x => (x.length, x))
res: Array[(Int, String)] = Array((3,dog), (4,lion), (3,rat), (8,elephant))

foldByKey is much like reduceByKey, except that it also takes an initial value (zeroValue).
Definition
def foldByKey(zeroValue: V)(func: (V, V) => V): RDD[(K, V)]
def foldByKey(zeroValue: V, numPartitions: Int)(func: (V, V) => V): RDD[(K, V)]
def foldByKey(zeroValue: V, partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)]
Example
val rdd = sc.parallelize(List("dog", "lion", "cat", "eagle"), 2)
rdd.keyBy(_.length).foldByKey("")(_ + _).collect
res: Array[(Int, String)] = Array((4,lion), (3,dogcat), (5,eagle))
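
One detail worth seeing in code is where the initial value is applied. The sketch below models foldByKey on plain Scala collections, assuming the zero value is folded in once per key within each partition and that the per-partition results are then merged with the same function. The object name FoldByKeySketch and the nested Seq standing in for partitions are assumptions; this is not Spark's implementation.

```scala
// Plain-Scala model of foldByKey over explicit partitions (hypothetical helper).
object FoldByKeySketch {
  def foldByKey[K, V](partitions: Seq[Seq[(K, V)]], zeroValue: V)(func: (V, V) => V): Map[K, V] = {
    // Step 1: inside each partition, fold each key's values starting from zeroValue.
    val perPartition: Seq[Map[K, V]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
        acc.updated(k, func(acc.getOrElse(k, zeroValue), v))
      }
    }
    // Step 2: merge the per-partition results with the same function.
    // zeroValue is not applied again at this stage.
    perPartition.flatMap(_.toSeq).foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(func(_, v)).getOrElse(v))
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumed layout of the two partitions: ("dog", "lion") and ("cat", "eagle"), keyed by length.
    val parts = List(List((3, "dog"), (4, "lion")), List((3, "cat"), (5, "eagle")))
    println(foldByKey(parts, "")(_ + _))
    println(foldByKey(parts, "x")(_ + _))
  }
}
```

With the empty string as the initial value, the model reproduces the result above. With "x" instead, key 3 becomes "xdogxcat" in this model, because "x" is folded in once in each partition that contains the key. For that reason the initial value should normally be neutral for the function, such as "" for string concatenation or 0 for addition.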