PySpark series 2: Installing PySpark on Linux


1. Installing Java and Scala

1.1 Installing Java

Since my environment runs CDH 6.3.1 and the JDK is already installed, this step is skipped here.

[root@hp1 ~]# javac -version
javac 1.8.0_181

1.2 Installing Scala

1.2.1 Installation

Commands:

Official download page: https://www.scala-lang.org/download/
 
wget https://downloads.lightbend.com/scala/2.13.1/scala-2.13.1.tgz
tar -zxvf scala-2.13.1.tgz
mv scala-2.13.1 scala

Session log:

[root@hp1 local]# cd /home/
[root@hp1 home]# ls
backup  cloudera-host-monitor.bak3  cloudera-service-monitor.moved  csv  hdfs  shell
[root@hp1 home]# mkdir software
[root@hp1 home]# cd software/
[root@hp1 software]# ls
[root@hp1 software]# 
[root@hp1 software]# pwd
/home/software
[root@hp1 software]# 
[root@hp1 software]# wget https://downloads.lightbend.com/scala/2.13.1/scala-2.13.1.tgz
--2021-04-08 10:37:47--  https://downloads.lightbend.com/scala/2.13.1/scala-2.13.1.tgz
Resolving downloads.lightbend.com (downloads.lightbend.com)... 13.35.121.34, 13.35.121.81, 13.35.121.50, ...
Connecting to downloads.lightbend.com (downloads.lightbend.com)|13.35.121.34|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19685743 (19M) [application/octet-stream]
Saving to: “scala-2.13.1.tgz”

100%[================================================================================================================================================================>] 19,685,743  9.92MB/s   in 1.9s   

2021-04-08 10:37:50 (9.92 MB/s) - “scala-2.13.1.tgz” saved [19685743/19685743]

[root@hp1 software]# tar -zxvf scala-2.13.1.tgz
scala-2.13.1/
scala-2.13.1/lib/
scala-2.13.1/lib/scala-compiler.jar
scala-2.13.1/lib/scalap-2.13.1.jar
scala-2.13.1/lib/scala-reflect.jar
scala-2.13.1/lib/jansi-1.12.jar
scala-2.13.1/lib/jline-2.14.6.jar
scala-2.13.1/lib/scala-library.jar
scala-2.13.1/doc/
scala-2.13.1/doc/licenses/
scala-2.13.1/doc/licenses/mit_jquery.txt
scala-2.13.1/doc/licenses/bsd_scalacheck.txt
scala-2.13.1/doc/licenses/bsd_asm.txt
scala-2.13.1/doc/licenses/apache_jansi.txt
scala-2.13.1/doc/licenses/bsd_jline.txt
scala-2.13.1/doc/LICENSE.md
scala-2.13.1/doc/License.rtf
scala-2.13.1/doc/README
scala-2.13.1/doc/tools/
scala-2.13.1/doc/tools/scaladoc.html
scala-2.13.1/doc/tools/scalap.html
scala-2.13.1/doc/tools/css/
scala-2.13.1/doc/tools/css/style.css
scala-2.13.1/doc/tools/scala.html
scala-2.13.1/doc/tools/index.html
scala-2.13.1/doc/tools/images/
scala-2.13.1/doc/tools/images/scala_logo.png
scala-2.13.1/doc/tools/images/external.gif
scala-2.13.1/doc/tools/scalac.html
scala-2.13.1/doc/tools/fsc.html
scala-2.13.1/bin/
scala-2.13.1/bin/fsc
scala-2.13.1/bin/scalap.bat
scala-2.13.1/bin/scala
scala-2.13.1/bin/scaladoc.bat
scala-2.13.1/bin/fsc.bat
scala-2.13.1/bin/scala.bat
scala-2.13.1/bin/scaladoc
scala-2.13.1/bin/scalap
scala-2.13.1/bin/scalac
scala-2.13.1/bin/scalac.bat
scala-2.13.1/LICENSE
scala-2.13.1/man/
scala-2.13.1/man/man1/
scala-2.13.1/man/man1/scalac.1
scala-2.13.1/man/man1/scala.1
scala-2.13.1/man/man1/scaladoc.1
scala-2.13.1/man/man1/fsc.1
scala-2.13.1/man/man1/scalap.1
scala-2.13.1/NOTICE
[root@hp1 software]# 
[root@hp1 software]# 
[root@hp1 software]# 
[root@hp1 software]# mv scala-2.13.1 scala

1.2.2 Configuration

vim /etc/profile
 
export SCALA_HOME=/home/software/scala
export PATH=$SCALA_HOME/bin:$PATH
 
source /etc/profile

1.2.3 Launch

[root@hp1 software]# scala
Welcome to Scala 2.13.1 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181).
Type in expressions for evaluation. Or try :help.

scala> 

scala> 

2. Installing Apache Spark

Since my environment runs CDH 6.3.1, Spark is already installed, so this step is skipped as well.
Even better, pyspark ships with it, so there is nothing extra to install.

Locate the pyspark binary:

[root@hp1 ~]# which pyspark
/usr/bin/pyspark
[root@hp1 ~]# 

Running pyspark throws an error:

[root@hp1 software]# pyspark      
Python 2.7.5 (default, Apr  2 2020, 13:16:51) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/04/08 11:03:36 ERROR spark.SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Required executor memory (1024), overhead (384 MB), and PySpark memory (0 MB) is above the max threshold (1042 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
        at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:346)
        at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:180)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:60)
        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:186)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:511)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:238)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
21/04/08 11:03:36 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
21/04/08 11:03:36 WARN metrics.MetricsSystem: Stopping a MetricsSystem that is not running
21/04/08 11:03:36 WARN spark.SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor).  This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:238)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.GatewayConnection.run(GatewayConnection.java:238)
java.lang.Thread.run(Thread.java:748)
21/04/08 11:03:36 ERROR spark.SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Required executor memory (1024), overhead (384 MB), and PySpark memory (0 MB) is above the max threshold (1042 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
        at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:346)
        at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:180)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:60)
        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:186)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:511)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:238)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
21/04/08 11:03:36 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
21/04/08 11:03:36 WARN metrics.MetricsSystem: Stopping a MetricsSystem that is not running
/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/pyspark/shell.py:45: UserWarning: Failed to initialize Spark session.
  warnings.warn("Failed to initialize Spark session.")
Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/pyspark/shell.py", line 41, in <module>
    spark = SparkSession._create_shell_session()
  File "/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/pyspark/sql/session.py", line 594, in _create_shell_session
    return SparkSession.builder.getOrCreate()
  File "/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/pyspark/sql/session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/pyspark/context.py", line 354, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/pyspark/context.py", line 123, in __init__
    conf, jsc, profiler_cls)
  File "/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/pyspark/context.py", line 185, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/pyspark/context.py", line 293, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1525, in __call__
    answer, self._gateway_client, None, self._fqn)
  File "/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalArgumentException: Required executor memory (1024), overhead (384 MB), and PySpark memory (0 MB) is above the max threshold (1042 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
        at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:346)
        at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:180)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:60)
        at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:186)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:511)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:238)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)

[root@hp1 software]# 
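Before fixing anything, it helps to see why the numbers in the error do not add up: Spark asks YARN for the executor memory plus a per-executor overhead of max(384 MB, 10% of the executor memory), and that total must stay below yarn.scheduler.maximum-allocation-mb. A quick sketch of that check, using the values from the log above (the variable names below are only for illustration):

# Rough reconstruction of the check behind the error above (values taken from the log).
executor_memory_mb = 1024                                # spark.executor.memory default (1 GB)
overhead_mb = max(384, int(0.10 * executor_memory_mb))   # per-executor memory overhead
required_mb = executor_memory_mb + overhead_mb           # 1024 + 384 = 1408 MB per container
yarn_max_mb = 1042                                       # yarn.scheduler.maximum-allocation-mb on this cluster
print("%d MB needed vs %d MB allowed -> rejected: %s"
      % (required_mb, yarn_max_mb, required_mb > yarn_max_mb))   # 1408 > 1042 -> rejected: True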

A solution turned up online: since this test machine has limited memory, adjust the YARN memory settings as follows:

yarn.app.mapreduce.am.resource.mb = 2g
yarn.nodemanager.resource.memory-mb = 4g
yarn.scheduler.maximum-allocation-mb = 2g

After the changes, restart the affected services for the configuration to take effect, then test again:

[root@hp1 software]# pyspark
Python 2.7.5 (default, Apr  2 2020, 13:16:51) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/04/08 11:08:07 WARN lineage.LineageWriter: Lineage directory /var/log/spark/lineage doesn't exist or is not writable. Lineage for this application will be disabled.
21/04/08 11:08:07 WARN lineage.LineageWriter: Lineage directory /var/log/spark/lineage doesn't exist or is not writable. Lineage for this application will be disabled.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.3.1
      /_/

Using Python version 2.7.5 (default, Apr  2 2020 13:16:51)
SparkSession available as 'spark'.
>>> 
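If raising the YARN limits is not an option, the request can also be shrunk from the Spark side instead. This is only a sketch, not from the original post: spark.executor.memory and spark.executor.memoryOverhead are standard Spark 2.x properties, but the values here are illustrative and must still sum to less than the YARN maximum (512 + 384 = 896 MB < 1042 MB).

# Hypothetical alternative: ask YARN for smaller containers instead of raising the cluster limits.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName('small_containers_demo')
        .set('spark.executor.memory', '512m')          # default is 1g; 1g + 384 MB overhead exceeded the 1042 MB cap
        .set('spark.executor.memoryOverhead', '384'))  # overhead in MB (Spark 2.3+ property name)
sc = SparkContext(conf=conf)
print(sc.getConf().get('spark.executor.memory'))       # confirm the setting took effect
sc.stop()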

3. A PySpark example

Let's write a word-count program with pyspark.
Code: wordcount.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import time
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # Set environment variables so pyspark can find the CDH Java/Hadoop/Spark installs
    os.environ['JAVA_HOME'] = '/usr/java/jdk1.8.0_181'                                              # Java installation
    os.environ['HADOOP_HOME'] = '/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop'  # Hadoop installation directory
    os.environ['SPARK_HOME'] = '/opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/spark'    # Spark installation directory

    spark_conf = SparkConf()\
        .setAppName('Python_Spark_WordCount')\
        .setMaster('local[2]')
    # Run in local mode with 2 worker threads to process the data

    sc = SparkContext(conf=spark_conf)  # Create the SparkContext, used to load the data and run the job
    # Set the log level. Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
    sc.setLogLevel('WARN')
    #  <SparkContext master=local[2] appName=Python_Spark_WordCount>
    print (sc)

    """
    创建RDD,读取要分析的数据:
        -1. 方式一:从本地集合(列表、元组、字典)进行并行化创建
        -2. 方式二:从外部文件系统读取数据(HDFS、LocalFS)
    """
    # 第一种方式:从集合并行创建RDD
    def local_rdd(spark_context):
        datas = ['hadoop spark', 'spark hive spark spark', 'spark hadoop python hive', ' ']
        return spark_context.parallelize(datas)     # Create RDD

    # 第二种方式:从本地文件系统中读取
    def hdfs_rdd(spark_context):
        return spark_context.textFile("/user/rdedu/wc.data")  # 从文件中读取数据

    # rdd = local_rdd(sc)   # Option 1
    rdd = hdfs_rdd(sc)      # Option 2
    print rdd.count()
    print rdd.first()

    # ============================= Word count =============================
    word_count_rdd = rdd\
        .filter(lambda line: len(line.strip()) != 0)\
        .flatMap(lambda line: line.strip().split(" "))\
        .map(lambda word: (word, 1))\
        .reduceByKey(lambda a, b: a + b)          # Sum the counts for identical keys

    for word, count in word_count_rdd.collect():  # collect() returns the RDD contents as a list
        print word, ', ', count

    print "===================================="

    # Sort by count in descending order
    sort_rdd = word_count_rdd\
        .map(lambda (word, count): (count, word))\
        .sortByKey(ascending=False)
    print sort_rdd.collect()

    # top(num, key=None): the num elements with the largest key
    print word_count_rdd.top(3, key=lambda (word, count): count)

    # takeOrdered(num, key=None): the num elements with the smallest key
    print word_count_rdd.takeOrdered(3, key=lambda (word, count): count)

    # Sleep for a while so the Spark web UI (port 4040) can still be inspected while the job is alive
    time.sleep(100)

    # Stop the SparkContext
    sc.stop()

Test run:

[root@hp1 software]# spark-submit wordcount.py 
21/04/08 14:13:13 INFO spark.SparkContext: Running Spark version 2.4.0-cdh6.3.1
21/04/08 14:13:13 INFO logging.DriverLogger: Added a local log appender at: /tmp/spark-11001905-0583-4ed6-a7f5-787ae6d9565c/__driver_logs__/driver.log
21/04/08 14:13:13 INFO spark.SparkContext: Submitted application: Python_Spark_WordCount
21/04/08 14:13:13 INFO spark.SecurityManager: Changing view acls to: root
21/04/08 14:13:13 INFO spark.SecurityManager: Changing modify acls to: root
21/04/08 14:13:13 INFO spark.SecurityManager: Changing view acls groups to: 
21/04/08 14:13:13 INFO spark.SecurityManager: Changing modify acls groups to: 
21/04/08 14:13:13 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
21/04/08 14:13:13 INFO util.Utils: Successfully started service 'sparkDriver' on port 42666.
21/04/08 14:13:13 INFO spark.SparkEnv: Registering MapOutputTracker
21/04/08 14:13:13 INFO spark.SparkEnv: Registering BlockManagerMaster
21/04/08 14:13:13 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/04/08 14:13:13 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/04/08 14:13:13 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-190fe2cf-05dc-415b-ba34-168b596ddbfd
21/04/08 14:13:13 INFO memory.MemoryStore: MemoryStore started with capacity 366.3 MB
21/04/08 14:13:13 INFO spark.SparkEnv: Registering OutputCommitCoordinator
21/04/08 14:13:13 INFO util.log: Logging initialized @1925ms
21/04/08 14:13:13 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: 2018-09-05T05:11:46+08:00, git hash: 3ce520221d0240229c862b122d2b06c12a625732
21/04/08 14:13:13 INFO server.Server: Started @2002ms
21/04/08 14:13:13 INFO server.AbstractConnector: Started ServerConnector@575fe4eb{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
21/04/08 14:13:13 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3cf93f90{/jobs,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3f6c4275{/jobs/json,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@40aaa4fa{/jobs/job,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6c969bc1{/jobs/job/json,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@313403cc{/stages,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@242cec64{/stages/json,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@51a6582b{/stages/stage,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5b01516d{/stages/stage/json,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@38853d26{/stages/pool,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5c4278a{/stages/pool/json,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@73315507{/storage,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@22eae835{/storage/json,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@33178d43{/storage/rdd,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@17cb14dc{/storage/rdd/json,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6eabf24f{/environment,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1b7514ef{/environment/json,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1407ff57{/executors,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5ba564ff{/executors/json,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6aadcc4e{/executors/threadDump,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@59cc6f98{/executors/threadDump/json,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@721adeb1{/static,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3e08dd87{/,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@29849cf7{/api,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3c51f401{/jobs/job/kill,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@252c4495{/stages/stage/kill,null,AVAILABLE,@Spark}
21/04/08 14:13:13 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://hp1:4040
21/04/08 14:13:13 INFO executor.Executor: Starting executor ID driver on host localhost
21/04/08 14:13:13 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33083.
21/04/08 14:13:13 INFO netty.NettyBlockTransferService: Server created on hp1:33083
21/04/08 14:13:13 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/04/08 14:13:13 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, hp1, 33083, None)
21/04/08 14:13:13 INFO storage.BlockManagerMasterEndpoint: Registering block manager hp1:33083 with 366.3 MB RAM, BlockManagerId(driver, hp1, 33083, None)
21/04/08 14:13:13 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, hp1, 33083, None)
21/04/08 14:13:13 INFO storage.BlockManager: external shuffle service port = 7337
21/04/08 14:13:13 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, hp1, 33083, None)
21/04/08 14:13:14 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5fba4081{/metrics/json,null,AVAILABLE,@Spark}
21/04/08 14:13:14 INFO scheduler.EventLoggingListener: Logging events to hdfs://nameservice1/user/spark/applicationHistory/local-1617862393786
21/04/08 14:13:14 WARN lineage.LineageWriter: Lineage directory /var/log/spark/lineage doesn't exist or is not writable. Lineage for this application will be disabled.
21/04/08 14:13:14 INFO util.Utils: Extension com.cloudera.spark.lineage.NavigatorAppListener not being initialized.
21/04/08 14:13:14 INFO logging.DriverLogger$DfsAsyncWriter: Started driver log file sync to: /user/spark/driverLogs/local-1617862393786_driver.log
<SparkContext master=local[2] appName=Python_Spark_WordCount>
1
'hadoop spark', 'spark hive spark spark', 'spark hadoop python hive', ' '
python ,  1
'spark ,  2
spark ,  1
hive ,  1
' ,  2
hive', ,  1
'hadoop ,  1
spark', ,  2
hadoop ,  1
====================================
[(2, u"'spark"), (2, u"'"), (2, u"spark',"), (1, u'python'), (1, u'spark'), (1, u'hive'), (1, u"hive',"), (1, u"'hadoop"), (1, u'hadoop')]
[(u"'spark", 2), (u"'", 2), (u"spark',", 2)]
[(u'python', 1), (u'spark', 1), (u'hive', 1)]




[root@hp1 software]# 
[root@hp1 software]# 
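Two notes on the output above. First, the stray quote characters in the counted words come from the contents of wc.data itself (rdd.first() shows the raw line), not from the code. Second, wordcount.py targets the cluster's Python 2.7 interpreter (bare print statements and tuple-unpacking lambdas, both removed in Python 3). As a rough sketch only, the core pipeline would look like this on Python 3; the same HDFS path is simply reused from above:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Minimal Python 3 sketch of the same word count (assumes the same HDFS file as above).
from operator import add
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    sc = SparkContext(conf=SparkConf().setAppName('Python_Spark_WordCount').setMaster('local[2]'))
    sc.setLogLevel('WARN')

    rdd = sc.textFile("/user/rdedu/wc.data")
    word_count_rdd = (rdd.filter(lambda line: len(line.strip()) != 0)
                         .flatMap(lambda line: line.strip().split(" "))
                         .map(lambda word: (word, 1))
                         .reduceByKey(add))

    # Tuple-unpacking lambdas are gone in Python 3, so index into the (word, count) tuple instead.
    for word, count in word_count_rdd.collect():
        print(word, ',', count)
    print(word_count_rdd.top(3, key=lambda wc: wc[1]))

    sc.stop()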

References:

1.https://blog.csdn.net/u013227399/article/details/102897606
2.https://www.cnblogs.com/erlou96/p/12933548.html
