Operators that cause a shuffle in Spark

1. Repartition-type operations: e.g. repartition, repartitionAndSortWithinPartitions, coalesce
2. byKey-type operations: e.g. reduceByKey, groupByKey, sortByKey, countByKey, combineByKey, aggregateByKey, foldByKey
3. Join-type operations: e.g. join, cogroup

  1. Repartitioning: generally shuffles, because the data in all existing partitions has to be redistributed evenly across the whole cluster and then placed into the new, specified number of downstream partitions.
  2. byKey operations: to aggregate by key, all records with the same key, wherever they sit in the cluster, must end up on the same node for processing.
  3. Join operations: to join two RDDs, records with the same join key must be shuffled to the same node, where the matching records from the two RDDs are combined as a per-key Cartesian product (see the sketch after this list).
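
A minimal, self-contained sketch showing one operator from each of the three shuffle-inducing categories. The object name, sample data, and partition counts are my own illustrations, not from the original post:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: one shuffle-inducing operator from each category.
object ShuffleDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("shuffle-demo").getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b"), 4)

    // 1. Repartitioning: all records are redistributed into 8 new partitions.
    val repartitioned = words.repartition(8)

    // 2. byKey: equal keys must meet on one node before they can be aggregated.
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // 3. Join: records sharing a key are shuffled to the same node, then paired up.
    val lookup = sc.parallelize(Seq(("a", 10), ("b", 20)))
    val joined = counts.join(lookup) // RDD[(String, (Int, Int))]

    joined.collect().foreach(println)
    spark.stop()
  }
}
```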

Further reading:

  • https://blog.csdn.net/weixin_41624046/article/details/88065581
  • Spark operators: RDD key-value transformations – groupByKey, reduceByKey, reduceByKeyLocally
  • Spark operations: aggregate and aggregateByKey examples
  • Spark operators: RDD key-value transformations – combineByKey, foldByKey

-----

  • Deduplication (sketch after the list)
    1. def distinct() — see "How Spark's distinct deduplication works" (distinct causes a shuffle)

    2. def distinct(numPartitions: Int)
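
A short sketch of why distinct shuffles: in the Spark source it is built on reduceByKey, so equal elements must be brought to the same partition. This assumes an existing SparkContext `sc`; the data is illustrative:

```scala
val nums = sc.parallelize(Seq(1, 2, 2, 3, 3, 3), 4)

val unique  = nums.distinct()   // shuffles to co-locate equal elements
val unique8 = nums.distinct(8)  // same, but into 8 result partitions

// Roughly what distinct does internally: a reduceByKey on (element, null) pairs.
val manual = nums.map(x => (x, null)).reduceByKey((x, _) => x).map(_._1)
```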

  • Aggregation (sketch after the list)
    1. def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]

    2. def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]

    3. def groupBy[K](f: T => K, p: Partitioner): RDD[(K, Iterable[T])]

    4. def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]

    5. def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]

    6. def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]

    7. def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]

    8. def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]

    9. def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner: Partitioner, mapSideCombine: Boolean = true, serializer: Serializer = null): RDD[(K, C)]
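
A sketch contrasting these aggregation operators; it assumes an existing SparkContext `sc`, and the keys and values are illustrative:

```scala
val scores = sc.parallelize(Seq(("a", 3), ("b", 5), ("a", 7), ("b", 1)))

// reduceByKey: map-side combine, then shuffle partial sums — usually preferred.
val sums = scores.reduceByKey(_ + _)

// groupByKey: shuffles every record; only groups, no combining.
val grouped = scores.groupByKey()

// aggregateByKey: a zero value plus separate in-partition (seqOp) and
// cross-partition (combOp) functions; here: (sum, count) per key.
val sumCount = scores.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),
  (a, b)   => (a._1 + b._1, a._2 + b._2)
)

// combineByKey: the general form the operators above are built on.
val sumCount2 = scores.combineByKey(
  (v: Int) => (v, 1),                                          // createCombiner
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),       // mergeValue
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners
)

val avg = sumCount.mapValues { case (s, c) => s.toDouble / c }
```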

  • Sorting (sketch after the list)
    1. def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length): RDD[(K, V)]

    2. def sortBy[K](f: (T) => K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]
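
A sketch of both sort variants; Spark sorts by range-partitioning the data, which is itself a shuffle (assumes `sc` exists; data is illustrative):

```scala
val pairs = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))

val ascByKey  = pairs.sortByKey()                                      // ascending by key
val descByKey = pairs.sortByKey(ascending = false, numPartitions = 2)

// sortBy derives a key from each record, then sorts whole records by it.
val byValue = pairs.sortBy(_._2)
```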

 

  • Repartitioning (sketch after the list)
  1. def coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty)

  2. def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null)
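
A sketch of the difference between the two: repartition always shuffles, while coalesce shuffles only when asked to. In the Spark source, repartition(n) is simply coalesce(n, shuffle = true). Assumes `sc` exists; partition counts are illustrative:

```scala
val data = sc.parallelize(1 to 100, 8)

// Narrow dependency: merges partitions locally, no shuffle.
// In this mode the partition count can only decrease.
val fewer = data.coalesce(2)

// With shuffle = true the data is redistributed evenly,
// so the partition count can also grow.
val more = data.coalesce(16, shuffle = true)

// Equivalent to coalesce(16, shuffle = true).
val rebalanced = data.repartition(16)
```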

 

  • Set and table-style operations (sketch after the list)
  1. def intersection(other: RDD[T]): RDD[T]

  2. def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]

  3. def intersection(other: RDD[T], numPartitions: Int): RDD[T]

  4. def subtract(other: RDD[T], numPartitions: Int): RDD[T]

  5. def subtract(other: RDD[T], p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]

  6. def subtractByKey[W: ClassTag](other: RDD[(K, W)]): RDD[(K, V)]

  7. def subtractByKey[W: ClassTag](other: RDD[(K, W)], numPartitions: Int): RDD[(K, V)]

  8. def subtractByKey[W: ClassTag](other: RDD[(K, W)], p: Partitioner): RDD[(K, V)]

  9. def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]

  10. def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

  11. def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]

  12. def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
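
A sketch covering the set and join operators above (assumes `sc` exists; data is illustrative):

```scala
val left  = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val right = sc.parallelize(Seq(("a", "x"), ("b", "y")))

val inner = left.join(right)          // ("a",(1,"x")), ("b",(2,"y"))
val outer = left.leftOuterJoin(right) // keeps ("c",(3,None)) via Option[W]
val onlyLeftKeys = left.subtractByKey(right) // ("c",3)

val xs = sc.parallelize(Seq(1, 2, 3, 4))
val ys = sc.parallelize(Seq(3, 4, 5))
val common = xs.intersection(ys) // 3, 4 — shuffles both RDDs
val diff   = xs.subtract(ys)     // 1, 2
```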

Source: https://kuncle.github.io/blog/spark/Spark%E7%9A%84shuffle%E7%AE%97%E5%AD%90

 
