Flink Source Code Reading (1): A Brief Analysis of the Per-job Mode of Flink on Yarn

2023-01-25 09:55:56

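I. Preface

Personally, I feel the posts you should least miss when learning Flink are the Flink community's blog series; they will not disappoint. Strongly recommended: https://ververica.cn/developers-resources/.

This is my first attempt at writing a source-code-reading article, and I will try to tie the principles to the implementation flow in the source. A few points are still unclear to me. Covering everything in one post would stretch over too long a period, but I am also afraid of forgetting things later, so I am writing down what I have now and will fill in the rest after reading more of the source. To readers who come across the incomplete version: my apologies!

If anything in this post is inaccurate, please point it out in the comments.

This source code series is based on Flink 1.9.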

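II. How Per-job Job Submission Works

The overall flow for submitting a job in Flink on Yarn mode is shown in the figure below (the figure comes from the Flink community; see Ref [1] for the link).

Figure 1. Flink runtime architecture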

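2.1. Overview of the Runtime Architecture

Flink uses the classic master-slave model: the AM (ApplicationMaster) in the figure is the master, and the TaskManagers are the slaves.

The Dispatcher in the AM receives jobs submitted by the client and starts the corresponding JobManager; the JobManager accepts the job, assigns its tasks, and manages the TaskManagers; the ResourceManager is mainly responsible for requesting and allocating resources.

One point to note: Flink has its own ResourceManager and TaskManager. Even in on-Yarn mode, Flink keeps its own resource management layer. Although some components share names with Yarn's, Yarn here is only a resource provider; in standalone mode, the resource provider would be physical or virtual machines instead.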

2.2. Overall Flow of Job Submission in the Per-job Mode of Flink on Yarn:

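1) Running the Flink program acts as the client: it optimizes the code into a JobGraph and asks the ApplicationsManager in Yarn's ResourceManager for resources to start the AM (ApplicationMaster). The AM runs on one of Yarn's NodeManager nodes.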

2) Once the AM is up, it starts the Dispatcher and the ResourceManager. The Dispatcher starts the JobManager, and the ResourceManager starts the SlotManager, which manages and allocates slots.

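3) The JobManager asks the ResourceManager (RM) for resources to run the job. At first no TaskManager has been started, so the RM requests resources from Yarn and, once they are granted, starts a TaskManager in them. The slots of the new TaskManager register with the SlotManager, which assigns them to the tasks that need resources, that is, it offers them to the JobManager; the JobManager then submits the tasks to the corresponding slots for execution. The biggest difference between the session mode and the Per-job mode of Flink on Yarn is that in session mode, by the time a job is submitted, the RM has already requested a fixed amount of resources from Yarn and its TaskManagers are already running.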
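The request/registration handshake in step 3 can be sketched with a toy model. This is only an illustration under simplifying assumptions: ToySlotManager and all of its methods are hypothetical names, not Flink classes, and the real SlotManager works asynchronously over RPC.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy model of the slot request flow. All names are hypothetical, not Flink classes. */
final class SlotFlowSketch {

    /** Stands in for the SlotManager that lives inside Flink's ResourceManager. */
    static final class ToySlotManager {
        private final int slotsPerTaskManager;
        private final Queue<String> pendingTasks = new ArrayDeque<>(); // tasks waiting for a slot
        private final Queue<String> freeSlots = new ArrayDeque<>();    // registered, unassigned slots
        private final List<String> assignments = new ArrayList<>();    // "task -> slot" records
        private int slotsOnOrder = 0;        // slots expected from containers already requested
        private int containersRequested = 0; // containers requested from the provider (Yarn)

        ToySlotManager(int slotsPerTaskManager) {
            this.slotsPerTaskManager = slotsPerTaskManager;
        }

        /** JobManager side: ask for one slot for the given task. */
        void requestSlot(String task) {
            String slot = freeSlots.poll();
            if (slot != null) {
                assignments.add(task + " -> " + slot);
                return;
            }
            pendingTasks.add(task);
            // Only ask the provider for another container when the slots
            // already on order cannot cover the pending requests.
            if (pendingTasks.size() > slotsOnOrder) {
                containersRequested++;
                slotsOnOrder += slotsPerTaskManager;
            }
        }

        /** A freshly started TaskManager registers its slots. */
        void registerTaskManager(String taskManagerId) {
            slotsOnOrder = Math.max(0, slotsOnOrder - slotsPerTaskManager);
            for (int i = 0; i < slotsPerTaskManager; i++) {
                String slot = taskManagerId + "/slot-" + i;
                String task = pendingTasks.poll();
                if (task != null) {
                    assignments.add(task + " -> " + slot); // serve a waiting task first
                } else {
                    freeSlots.add(slot);                   // otherwise keep the slot free
                }
            }
        }

        int containersRequested() { return containersRequested; }

        List<String> assignments() { return assignments; }
    }

    public static void main(String[] args) {
        ToySlotManager slotManager = new ToySlotManager(2);
        slotManager.requestSlot("t1");
        slotManager.requestSlot("t2");
        slotManager.requestSlot("t3");
        slotManager.registerTaskManager("tm1");
        slotManager.registerTaskManager("tm2");
        slotManager.requestSlot("t4");
        System.out.println(slotManager.containersRequested() + " containers, " + slotManager.assignments());
    }
}
```

In this model, three requests with two slots per TaskManager lead to two container requests, and a later fourth request is served from the slot that was left free.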

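The detailed resource allocation process is shown in the figure below:

Figure 2. Slot management, from Ref [1]

For a more detailed walkthrough, I strongly recommend Ref [2], written by a Flink expert at Alibaba; much of the later source code analysis in this blog is also based on it.

III. Brief Source Code Analysis

The command used to submit the job: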

./flink run -m yarn-cluster ./flinkExample.jar

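1. Analysis of the Client-side Job Submission Phase

The entry class of the flink script is org.apache.flink.client.cli.CliFrontend.

1) The main() method of CliFrontend loads Flink's configuration and some global settings and then, based on the command-line argument run, calls run()->runProgram()->deployJobCluster(). Note that, per the condition in the code below, the deployJobCluster() branch is taken only when no cluster ID is given and the job is submitted in detached mode. The code is as follows: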

private <T> void runProgram(

            CustomCommandLine<T> customCommandLine,

            CommandLine commandLine,

            RunOptions runOptions,

            PackagedProgram program) throws ProgramInvocationException, FlinkException {

        final ClusterDescriptor<T> clusterDescriptor = customCommandLine.createClusterDescriptor(commandLine);

        try {

            final T clusterId = customCommandLine.getClusterId(commandLine);

            final ClusterClient<T> client;

            // directly deploy the job if the cluster is started in job mode and detached

            if (clusterId == null && runOptions.getDetachedMode()) {

                int parallelism = runOptions.getParallelism() == -1 ? defaultParallelism : runOptions.getParallelism();

                //Build the JobGraph

                final JobGraph jobGraph = PackagedProgramUtils.createJobGraph(program, configuration, parallelism);

                final ClusterSpecification clusterSpecification = customCommandLine.getClusterSpecification(commandLine);
                //Submit the job to Yarn

                client = clusterDescriptor.deployJobCluster(

                    clusterSpecification,

                    jobGraph,

                    runOptions.getDetachedMode());

                logAndSysout("Job has been submitted with JobID " + jobGraph.getJobID());

                ......................

            } else{........}
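The branch condition and the parallelism fallback in the excerpt above can be isolated into a small stand-alone sketch. SubmitModeSketch and its members are hypothetical names used only for illustration; the elided else branch is deliberately left opaque here.

```java
/** Stand-alone sketch of the decision made at the top of runProgram(); names are hypothetical. */
final class SubmitModeSketch {

    enum Mode {
        DEPLOY_JOB_CLUSTER, // the per-job branch shown in the excerpt
        OTHER_BRANCH        // the elided else branch, not modeled here
    }

    /** Mirrors the condition: clusterId == null && runOptions.getDetachedMode(). */
    static Mode chooseMode(String clusterId, boolean detached) {
        return (clusterId == null && detached) ? Mode.DEPLOY_JOB_CLUSTER : Mode.OTHER_BRANCH;
    }

    /** Mirrors the fallback: -1 means "not set on the command line". */
    static int effectiveParallelism(int requested, int defaultParallelism) {
        return requested == -1 ? defaultParallelism : requested;
    }

    public static void main(String[] args) {
        System.out.println(chooseMode(null, true));
        System.out.println(chooseMode(null, false));
        System.out.println(effectiveParallelism(-1, 4));
    }
}
```

Reading the condition this way makes it easier to see which command-line combinations actually reach deployJobCluster().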

2) Submission calls deployJobCluster() in YarnClusterDescriptor -> deployInternal() in AbstractYarnClusterDescriptor. This method blocks until the ApplicationMaster/JobManager has been deployed on Yarn. Its most important call is startAppMaster(); the code is as follows:

 protected ClusterClient<ApplicationId>     deployInternal(

             ClusterSpecification clusterSpecification,

             String applicationName,

             String yarnClusterEntrypoint,

             @Nullable JobGraph jobGraph,

             boolean detached) throws Exception {

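         //1. Verify that the cluster is reachable

         //2. Check whether security authentication is enabled for the user

         //3. Check that the configuration and vcores satisfy what the Flink cluster requests

         //4. Check that the specified queue exists

         //5. Check that memory is sufficient for the Flink JobManager and TaskManager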

         //....................................

         //Entry

         ApplicationReport report = startAppMaster(

                 flinkConfiguration,

                 applicationName,

                 yarnClusterEntrypoint,

                 jobGraph,

                 yarnClient,

                 yarnApplication,

                 validClusterSpecification);

         //6. Get the Flink cluster's port and address information

         //..........................................

     }

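3) AbstractYarnClusterDescriptor.startAppMaster() mainly uploads the configuration and related files to distributed storage such as HDFS and submits the application to Yarn. The annotated source is as follows: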

 public ApplicationReport startAppMaster(

             Configuration configuration,

             String applicationName,

             String yarnClusterEntrypoint,

             JobGraph jobGraph,

             YarnClient yarnClient,

             YarnClientApplication yarnApplication,

             ClusterSpecification clusterSpecification) throws Exception {

         // .......................

         //1. Upload logback.xml and log4j.properties from the conf directory

         //2. Upload the jars under the directories given by the FLINK_PLUGINS_DIR and FLINK_LIB_DIR environment variables

         addEnvironmentFoldersToShipFiles(systemShipFiles);

         //...........

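         //3. Configure high availability for the application by setting the number of AM attempts (default 1)

         //4. Upload ship files and user jars

         //5. Set slots and heap memory for the TaskManager

         //6. Upload flink-conf.yaml

         //7. Serialize the JobGraph and upload it

         //8. Check login credentials

         //.................

         //Build the launch context for the AM container, including the Java command that starts it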

         final ContainerLaunchContext amContainer = setupApplicationMasterContainer(

                 yarnClusterEntrypoint,

                 hasLogback,

                 hasLog4j,

                 hasKrb5,

                 clusterSpecification.getMasterMemoryMB());

         //9. Set up the environment parameters, classpath, and environment variables for launching the AM

         //..........................

         final String customApplicationName = customName != null ? customName : applicationName;

         //10. Set the application name, the application type, and the ContainerLaunchContext of the submitted application

         appContext.setApplicationName(customApplicationName);

         appContext.setApplicationType(applicationType != null ? applicationType : "Apache Flink");

         appContext.setAMContainerSpec(amContainer);

         appContext.setResource(capability);

         if (yarnQueue != null) {

             appContext.setQueue(yarnQueue);

         }

         setApplicationNodeLabel(appContext);

         setApplicationTags(appContext);

         //11. If deployment fails, delete yarnFilesDir

         // add a hook to clean up in case deployment fails

         Thread deploymentFailureHook = new DeploymentFailureHook(yarnClient, yarnApplication, yarnFilesDir);

         Runtime.getRuntime().addShutdownHook(deploymentFailureHook);

         LOG.info("Submitting application master " + appId);

         //Entry

         yarnClient.submitApplication(appContext);

         LOG.info("Waiting for the  cluster to be allocated");

         final long startTime = System.currentTimeMillis();

         ApplicationReport report;

         YarnApplicationState lastAppState = YarnApplicationState.NEW;

         //12. Block until the application is RUNNING

         loop: while (true) {

             //...................

             //Fetch the application report through YarnClient every 250 ms

             Thread.sleep(250);

         }

         //...........................

         //13. Deployment succeeded: remove the shutdown hook

         // since deployment was successful, remove the hook

         ShutdownHookUtil.removeShutdownHook(deploymentFailureHook, getClass().getSimpleName(), LOG);

         return report;

     }
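Steps 11 to 13 above (register a cleanup hook, poll until the application is running, then remove the hook) follow a common pattern that can be sketched without any Yarn dependency. Everything below is a hypothetical stand-in: the AppState enum is a local simplification of Yarn's application states, the report source is a plain Supplier instead of YarnClient, and treating FINISHED like RUNNING is an assumption of this sketch, since that part of the loop is elided in the excerpt.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

/** Sketch of "poll until running, clean up only on failure"; all names are hypothetical. */
final class AppStatePollingSketch {

    /** Local simplification of the application states reported by the cluster. */
    enum AppState { NEW, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED }

    /** Polls the report source at a fixed interval until the application is up or has failed. */
    static AppState waitUntilRunning(Supplier<AppState> reportSource, long pollIntervalMs) {
        AppState last = AppState.NEW;
        while (true) {
            AppState current = reportSource.get();
            switch (current) {
                case FAILED:
                case KILLED:
                    throw new IllegalStateException("Deployment failed, application state: " + current);
                case RUNNING:
                case FINISHED: // assumption of this sketch: a finished application also ends the wait
                    return current;
                default:
                    if (current != last) {
                        System.out.println("Deploying cluster, current state " + current);
                    }
            }
            last = current;
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for the application", e);
            }
        }
    }

    /** Registers a cleanup hook before waiting and removes it once deployment succeeded. */
    static AppState deploy(Supplier<AppState> reportSource, Runnable cleanup, long pollIntervalMs) {
        Thread failureHook = new Thread(cleanup, "deployment-failure-hook");
        Runtime.getRuntime().addShutdownHook(failureHook);
        AppState state = waitUntilRunning(reportSource, pollIntervalMs);
        // Reached only on success; on failure the hook stays registered and runs at JVM exit.
        Runtime.getRuntime().removeShutdownHook(failureHook);
        return state;
    }

    public static void main(String[] args) {
        Iterator<AppState> reports =
                Arrays.asList(AppState.ACCEPTED, AppState.ACCEPTED, AppState.RUNNING).iterator();
        System.out.println(deploy(reports::next, () -> System.out.println("cleaning up"), 250L));
    }
}
```

The hook is removed only after the wait returns normally, which matches the intent of the "add a hook to clean up in case deployment fails" comment in the excerpt.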

4) The entry for application submission is YarnClientImpl.submitApplication(), whose key step is the call to ApplicationClientProtocolPBClientImpl.submitApplication(). The code is as follows:

 public SubmitApplicationResponse submitApplication(SubmitApplicationRequest request) throws YarnException, IOException {

 //Extract the protobuf message

         SubmitApplicationRequestProto requestProto = ((SubmitApplicationRequestPBImpl)request).getProto();

         try {

 //Send the message to the server and wrap the returned result in a response

             return new SubmitApplicationResponsePBImpl(this.proxy.submitApplication((RpcController)null, requestProto));

         } catch (ServiceException var4) {

             RPCUtil.unwrapAndThrowException(var4);

             return null;

         }

     }

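The message then reaches the server over RPC, where it is handled by ApplicationClientProtocolPBServiceImpl.submitApplication(). That method rebuilds the request and passes it through ClientRMService.submitApplication() to Yarn's RMAppManager, which submits the application (Yarn's allocation process will be covered in a dedicated series of posts later).

This completes the client-side flow: the application request has been submitted to Yarn's ResourceManager. The next part focuses on how the Flink cluster starts.

2. Analysis of the Flink Cluster Startup Flow

1) ClientRMService.submitApplication() first checks whether the application has already been submitted (by applicationID), whether the Yarn queue is empty, and so on, and then hands the request to RMAppManager (the component inside the YARN RM that manages application lifecycles). On success it logs Application with id {applicationId.getId()} submitted by user {user}. The details are as follows: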

 public SubmitApplicationResponse submitApplication(

             SubmitApplicationRequest request) throws YarnException {

         ApplicationSubmissionContext submissionContext = request

                 .getApplicationSubmissionContext();

         ApplicationId applicationId = submissionContext.getApplicationId();

         // ApplicationSubmissionContext needs to be validated for safety - only

         // those fields that are independent of the RM's configuration will be

         // checked here, those that are dependent on RM configuration are validated

         // in RMAppManager.

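         //Only fields independent of the RM's configuration are validated here; RM-dependent ones are validated in RMAppManager

         //1. Check whether the application has already been submitted

         //2. Check whether the queue is null; if so, use the default queue (default)

         //3. Check whether an application name is set; if not, use the default (N/A)

         //4. Check whether an application type is set; if not, use the default (YARN); if set and longer than the limit (20), it is truncated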

         //.............................

         try {

             // call RMAppManager to submit application directly

             //Submit the application directly

             rmAppManager.submitApplication(submissionContext,

                     System.currentTimeMillis(), user);

             //Submission succeeded

             LOG.info("Application with id " + applicationId.getId() +

                     " submitted by user " + user);

             RMAuditLogger.logSuccess(user, AuditConstants.SUBMIT_APP_REQUEST,

                     "ClientRMService", applicationId);

         } catch (YarnException e) {

             //On failure an exception is thrown

         }

         //..................

     }
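The four checks listed in the comments above amount to a duplicate check plus a few defaulting rules. A minimal sketch, using the default values named in those comments (default, N/A, YARN, and a type length limit of 20); the class and method names are hypothetical and do not appear in Yarn:

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch of the submission-context checks described above; names are hypothetical. */
final class SubmissionDefaultsSketch {

    static final String DEFAULT_QUEUE = "default";
    static final String DEFAULT_NAME = "N/A";
    static final String DEFAULT_TYPE = "YARN";
    static final int MAX_TYPE_LENGTH = 20;

    private final Set<String> submittedIds = new HashSet<>();

    /** Check 1: reject an application ID that was already submitted. */
    void registerOrReject(String applicationId) {
        if (!submittedIds.add(applicationId)) {
            throw new IllegalStateException("Application " + applicationId + " already submitted");
        }
    }

    /** Check 2: fall back to the default queue. */
    static String queueOrDefault(String queue) {
        return queue == null ? DEFAULT_QUEUE : queue;
    }

    /** Check 3: fall back to the default application name. */
    static String nameOrDefault(String name) {
        return name == null ? DEFAULT_NAME : name;
    }

    /** Check 4: fall back to the default type, or truncate an overly long one. */
    static String normalizeType(String type) {
        if (type == null) {
            return DEFAULT_TYPE;
        }
        return type.length() > MAX_TYPE_LENGTH ? type.substring(0, MAX_TYPE_LENGTH) : type;
    }

    public static void main(String[] args) {
        System.out.println(queueOrDefault(null));
        System.out.println(nameOrDefault(null));
        System.out.println(normalizeType("Apache Flink"));
    }
}
```

The type "Apache Flink" that the Flink client sets is shorter than the limit, so it passes through unchanged.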

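2) RMAppManager.submitApplication() mainly creates the RMApp and requests an AM container from the ResourceScheduler. Everything from here until the AM container is started on a NodeManager is done by Yarn itself. The details are not analyzed here and will be covered later; only the entry point is given. The code is as follows: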

 protected void submitApplication(

             ApplicationSubmissionContext submissionContext, long submitTime,

             String user) throws YarnException {

         ApplicationId applicationId = submissionContext.getApplicationId();

         //1. Create the RMApp; an exception is thrown if an application with the same applicationId already exists

         RMAppImpl application =

                 createAndPopulateNewRMApp(submissionContext, submitTime, user);

         ApplicationId appId = submissionContext.getApplicationId();

         //The security mode is either simple or kerberos, set in the configuration file

         //Kerberos enabled

         if (UserGroupInformation.isSecurityEnabled()) {

             //..................

         } else {

             //simple mode

             // Dispatcher is not yet started at this time, so these START events

             // enqueued should be guaranteed to be first processed when dispatcher

             // gets started.

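             //2. Fire a START event for the RMApp through the RM dispatcher; whether this directly submits to the ResourceScheduler (the pluggable scheduler) is still an open question for the author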

             this.rmContext.getDispatcher().getEventHandler()

                     .handle(new RMAppEvent(applicationId, RMAppEventType.START));

         }

     }
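The English comment in the excerpt states that START events enqueued before the dispatcher is started are guaranteed to be processed first once it starts. That behavior follows naturally from a queue drained by a single worker thread, as the following sketch shows. ToyAsyncDispatcher is a hypothetical stand-in written for this post, not Yarn's dispatcher class, and events are plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Sketch of an event dispatcher that accepts events before it is started; names are hypothetical. */
final class EventDispatcherSketch {

    static final class ToyAsyncDispatcher {
        private static final String STOP = "__STOP__"; // reserved marker that ends the worker loop
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        private final Consumer<String> handler;
        private Thread worker;

        ToyAsyncDispatcher(Consumer<String> handler) {
            this.handler = handler;
        }

        /** Only enqueues; nothing is processed until start() has been called. */
        void handle(String event) {
            queue.add(event);
        }

        /** Starts the single worker thread that drains the queue in FIFO order. */
        void start() {
            worker = new Thread(() -> {
                try {
                    while (true) {
                        String event = queue.take();
                        if (STOP.equals(event)) {
                            return;
                        }
                        handler.accept(event);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "toy-dispatcher");
            worker.start();
        }

        /** Lets the worker finish everything queued so far, then waits for it to exit. */
        void stopAndWait() {
            queue.add(STOP);
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Enqueues two events before start and one after, and returns the handling order. */
    static List<String> demo() {
        List<String> handled = new ArrayList<>();
        ToyAsyncDispatcher dispatcher = new ToyAsyncDispatcher(handled::add);
        dispatcher.handle("app_1:START"); // dispatcher not started yet
        dispatcher.handle("app_2:START");
        dispatcher.start();
        dispatcher.handle("app_3:START");
        dispatcher.stopAndWait();
        return handled;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because there is one queue and one consumer, the events queued before start() are always handled ahead of anything enqueued afterwards.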

3) In Per-job mode, the entry point that the AM container loads and runs is the main() method of YarnJobClusterEntrypoint. The annotated source is as follows:

 public static void main(String[] args) {

         // startup checks and logging

         //1. Log environment information such as the user, environment variables, Java version, and JVM arguments

         EnvironmentInformation.logEnvironmentInfo(LOG, YarnJobClusterEntrypoint.class.getSimpleName(), args);

         //2. Register handlers for the various signals: they are recorded in the log

         SignalHandler.register(LOG);

         //3. Register the JVM shutdown safeguard hook: it prevents JVM exit from being blocked by other shutdown hooks

         JvmShutdownSafeguard.installAsShutdownHook(LOG);

         Map<String, String> env = System.getenv();

         final String workingDirectory = env.get(ApplicationConstants.Environment.PWD.key());

         Preconditions.checkArgument(

                 workingDirectory != null,

                 "Working directory variable (%s) not set",

                 ApplicationConstants.Environment.PWD.key());

         try {

             //4. Log information about the user running on Yarn

             YarnEntrypointUtils.logYarnEnvironmentInformation(env, LOG);

         } catch (IOException e) {

             LOG.warn("Could not log YARN environment information.", e);

         }

         //5. Load the Flink configuration

         Configuration configuration = YarnEntrypointUtils.loadConfiguration(workingDirectory, env, LOG);

         YarnJobClusterEntrypoint yarnJobClusterEntrypoint = new YarnJobClusterEntrypoint(

                 configuration,

                 workingDirectory);

         //6. Entry: create and start the internal services

         ClusterEntrypoint.runClusterEntrypoint(yarnJobClusterEntrypoint);

     }

4) The subsequent call chain is runClusterEntrypoint()->startCluster()->runCluster() in ClusterEntrypoint. It is fairly simple, so the focus here is on runCluster():

 //#ClusterEntrypoint.java

     private void runCluster(Configuration configuration) throws Exception {

         synchronized (lock) {

             initializeServices(configuration);

             // write host information into configuration

             configuration.setString(JobManagerOptions.ADDRESS, commonRpcService.getAddress());

             configuration.setInteger(JobManagerOptions.PORT, commonRpcService.getPort());

             //1. Create the factory for the Dispatcher and ResourceManager components; this involves re-creating the JobGraph from a local file

             final DispatcherResourceManagerComponentFactory<?> dispatcherResourceManagerComponentFactory = createDispatcherResourceManagerComponentFactory(configuration);

             //2. Entry: create and start the cluster components, passing in the RpcService, HAService, BlobServer, HeartbeatServices, MetricRegistry, ExecutionGraphStore, etc.

             clusterComponent = dispatcherResourceManagerComponentFactory.create(

                     configuration,

                     commonRpcService,

                     haServices,

                     blobServer,

                     heartbeatServices,

                     metricRegistry,

                     archivedExecutionGraphStore,

                     new RpcMetricQueryServiceRetriever(metricRegistry.getMetricQueryServiceRpcService()),

                     this);

             //............

         }

     }

5) The create() method starts many Flink components. Those most relevant to job submission are the Dispatcher and the ResourceManager. The code is as follows:

 public DispatcherResourceManagerComponent<T> create(

             Configuration configuration,

             RpcService rpcService,

             HighAvailabilityServices highAvailabilityServices,

             BlobServer blobServer,

             HeartbeatServices heartbeatServices,

             MetricRegistry metricRegistry,

             ArchivedExecutionGraphStore archivedExecutionGraphStore,

             MetricQueryServiceRetriever metricQueryServiceRetriever,

             FatalErrorHandler fatalErrorHandler) throws Exception {

         LeaderRetrievalService dispatcherLeaderRetrievalService = null;

         LeaderRetrievalService resourceManagerRetrievalService = null;

         WebMonitorEndpoint<U> webMonitorEndpoint = null;

         ResourceManager<?> resourceManager = null;

         JobManagerMetricGroup jobManagerMetricGroup = null;

         T dispatcher = null;

         try {

             dispatcherLeaderRetrievalService = highAvailabilityServices.getDispatcherLeaderRetriever();

             resourceManagerRetrievalService = highAvailabilityServices.getResourceManagerLeaderRetriever();

             final LeaderGatewayRetriever<DispatcherGateway> dispatcherGatewayRetriever = new RpcGatewayRetriever<>(

                     rpcService,

                     DispatcherGateway.class,

                     DispatcherId::fromUuid,

                     10,

                     Time.milliseconds(50L));

             final LeaderGatewayRetriever<ResourceManagerGateway> resourceManagerGatewayRetriever = new RpcGatewayRetriever<>(

                     rpcService,

                     ResourceManagerGateway.class,

                     ResourceManagerId::fromUuid,

                     10,

                     Time.milliseconds(50L));

             final ExecutorService executor = WebMonitorEndpoint.createExecutorService(

                     configuration.getInteger(RestOptions.SERVER_NUM_THREADS),

                     configuration.getInteger(RestOptions.SERVER_THREAD_PRIORITY),

                     "DispatcherRestEndpoint");

             final long updateInterval = configuration.getLong(MetricOptions.METRIC_FETCHER_UPDATE_INTERVAL);

             final MetricFetcher metricFetcher = updateInterval == 0

                     ? VoidMetricFetcher.INSTANCE

                     : MetricFetcherImpl.fromConfiguration(

                     configuration,

                     metricQueryServiceRetriever,

                     dispatcherGatewayRetriever,

                     executor);

             webMonitorEndpoint = restEndpointFactory.createRestEndpoint(

                     configuration,

                     dispatcherGatewayRetriever,

                     resourceManagerGatewayRetriever,

                     blobServer,

                     executor,

                     metricFetcher,

                     highAvailabilityServices.getWebMonitorLeaderElectionService(),

                     fatalErrorHandler);

             log.debug("Starting Dispatcher REST endpoint.");

             webMonitorEndpoint.start();

             final String hostname = getHostname(rpcService);

             jobManagerMetricGroup = MetricUtils.instantiateJobManagerMetricGroup(

                     metricRegistry,

                     hostname,

                     ConfigurationUtils.getSystemResourceMetricsProbingInterval(configuration));

             //1. Returns a new YarnResourceManager

             /*Call chain: AbstractDispatcherResourceManagerComponentFactory

                         ->ActiveResourceManagerFactory

                         ->YarnResourceManagerFactory

              */

             ResourceManager<?> resourceManager1 = resourceManagerFactory.createResourceManager(

                     configuration,

                     ResourceID.generate(),

                     rpcService,

                     highAvailabilityServices,

                     heartbeatServices,

                     metricRegistry,

                     fatalErrorHandler,

                     new ClusterInformation(hostname, blobServer.getPort()),

                     webMonitorEndpoint.getRestBaseUrl(),

                     jobManagerMetricGroup);

             resourceManager = resourceManager1;

             final HistoryServerArchivist historyServerArchivist = HistoryServerArchivist.createHistoryServerArchivist(configuration, webMonitorEndpoint);

             //2. The JobGraph instance is obtained here by deserialization; returns a new MiniDispatcher

             dispatcher = dispatcherFactory.createDispatcher(

                     configuration,

                     rpcService,

                     highAvailabilityServices,

                     resourceManagerGatewayRetriever,

                     blobServer,

                     heartbeatServices,

                     jobManagerMetricGroup,

                     metricRegistry.getMetricQueryServiceGatewayRpcAddress(),

                     archivedExecutionGraphStore,

                     fatalErrorHandler,

                     historyServerArchivist);

             log.debug("Starting ResourceManager.");

             //Start the ResourceManager; this goes through the following stages:

             //leader election->(in ResourceManager.java)

             // ->grantLeadership(...)

             // ->tryAcceptLeadership(...)

             // ->the SlotManager is started

             resourceManager.start();

             resourceManagerRetrievalService.start(resourceManagerGatewayRetriever);

             log.debug("Starting Dispatcher.");

             //Start the Dispatcher; this goes through the following stages:

             //leader election->(in Dispatcher.java)grantLeadership->tryAcceptLeadershipAndRunJobs

             // ->createJobManagerRunner->startJobManagerRunner->jobManagerRunner.start()

             //

             //->(in JobManagerRunner.java)start()->leaderElectionService.start(...)

             //->grantLeadership(...)->verifyJobSchedulingStatusAndStartJobManager(...)

             //->startJobMaster(leaderSessionId); the startJobMaster here is presumably what starts the JobManager

             //

             //->(in JobManagerRunner.java)jobMasterService.start(...)

             //->(JobMaster.java)startJobExecution(...)

             // ->{startJobMasterServices(): this method starts the slotPool->resourceManagerLeaderRetriever.start(...)}

             //->startJobExecution(...)->

             dispatcher.start();

             dispatcherLeaderRetrievalService.start(dispatcherGatewayRetriever);

             return createDispatcherResourceManagerComponent(

                     dispatcher,

                     resourceManager,

                     dispatcherLeaderRetrievalService,

                     resourceManagerRetrievalService,

                     webMonitorEndpoint,

                     jobManagerMetricGroup);

         } catch (Exception exception) {

             // clean up all started components

             //On failure, the components already started are cleaned up

             //..............

         }

     }
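The startup chain annotated in the comments above is driven by leader-election callbacks: a component starts, registers with an election service, and continues its startup only inside grantLeadership(). The sketch below models that shape with a synchronous election service that grants leadership immediately to its single contender. All Toy* names are hypothetical, and the real callbacks run asynchronously, so this only illustrates the order of the calls, not Flink's threading.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Sketch of a callback-driven startup chain; all Toy* names are hypothetical. */
final class LeaderStartupSketch {

    /** A component that continues its startup only once it has been granted leadership. */
    interface ToyLeaderContender {
        void grantLeadership(UUID leaderSessionId);
    }

    /** Election with a single candidate: leadership is granted as soon as the service starts. */
    static final class ToyStandaloneElection {
        void start(ToyLeaderContender contender) {
            contender.grantLeadership(new UUID(0L, 0L));
        }
    }

    static final class ToyJobManagerRunner implements ToyLeaderContender {
        private final List<String> trace;

        ToyJobManagerRunner(List<String> trace) {
            this.trace = trace;
        }

        void start(ToyStandaloneElection election) {
            trace.add("JobManagerRunner.start");
            election.start(this);
        }

        @Override
        public void grantLeadership(UUID leaderSessionId) {
            trace.add("JobManagerRunner.grantLeadership");
            trace.add("JobMaster.startJobExecution"); // where the job would begin scheduling
        }
    }

    static final class ToyDispatcher implements ToyLeaderContender {
        private final List<String> trace;
        private final ToyStandaloneElection election;

        ToyDispatcher(List<String> trace, ToyStandaloneElection election) {
            this.trace = trace;
            this.election = election;
        }

        void start() {
            trace.add("Dispatcher.start");
            election.start(this);
        }

        @Override
        public void grantLeadership(UUID leaderSessionId) {
            trace.add("Dispatcher.grantLeadership");
            // Becoming leader is what triggers creating and starting the runner for the job.
            new ToyJobManagerRunner(trace).start(election);
        }
    }

    /** Runs the whole chain once and returns the order in which the steps happened. */
    static List<String> runStartup() {
        List<String> trace = new ArrayList<>();
        new ToyDispatcher(trace, new ToyStandaloneElection()).start();
        return trace;
    }

    public static void main(String[] args) {
        runStartup().forEach(System.out::println);
    }
}
```

The point of the design is that no component starts running jobs just because start() was called; the work happens in the leadership callback, so the same code path serves both single-master and high-availability setups.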

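6) After this, the slotPool in the JobManager requests resources from the SlotManager, which in turn requests them from Yarn's ResourceManager. Once the resources are granted, TaskManagers are started and their slot information is registered with the SlotManager and the slotPool. The details are not covered here and are left for a later analysis.

IV. Summary

This post still has many gaps. Filling them requires reading more of the source and understanding the design and architecture better. Later I will also verify the analysis against the detailed logs of a job submission in Per-job mode.

If anything here is unclear, I strongly recommend the posts listed below; if anything is wrong, you are very welcome to point it out in the comments. Thank you!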

Ref

[1]https://files.alicdn.com/tpsservice/7bb8f513c765b97ab65401a1b78c8cb8.pdf

[2]https://yq.aliyun.com/articles/719262?spm=a2c4e.11153940.0.0.3ea9469ei7H3Wx#

[3]https://www.jianshu.com/p/52da8b2e4ccc
