RDD用法与实例(十四):closure和accumulators的区别和实例

RDD用法与实例(十四):closure和accumulators的区别和实例

1.RDD的特性:

1.persistent
2.lazy transformation

2.Cluster mode集群模式

Only one master/worker can run on the same machine, but a machine can be both a master and a worker

3.where to run

Most run on drivers
transformations run on executors
actions - executors and drivers

example :
Example: Let’s say you want to combine two RDDs: a, b. You remember that rdd.collect() returns a list, and in Python you can combine two lists with +A naïve implementation would be:

a = RDDa.collect() driver
b = RDDb.collect() driver
RDDc = sc.parallelize(a+b) executor

Where does this code run?
In the first line, all distributed data for a and b is sent to driver. What if a and/or b is very large? Driver could run out of memory. Also, it takes a long time to send the data to the driver.In the third line, all data is sent from driver to executors.

The correct way:

RDDc = RDDa.union(RDDb)

This runs completely at executors.

4.Closure – transformation

Any global variables used by those executors and are copies

This closure is serialized and sent to each executor from the driver when an action is invoked.

5.Accumulators – action

only driver read its value

closure和accumulator的对比

odd = sc.accumulator(0)
even = 0
def count(element):
global even
if element % 2 == 0:
even += 1
else: odd.add(1)
sc.parallelize([1, 6, 7, 8, 3, 4, 4, 2]).foreach(count) print odd,even

输出结果为3,0 因为even根本没有真正执行过,都是executor自己玩自己的。
RDD用法与实例(十四):closure和accumulators的区别和实例
不适合在一些transformation类型的操作中使用,很容易有问题。

上一篇:[LeetCode] 1609. Even Odd Tree


下一篇:20.11.13 leetcode328