TaskScheduler Explained, with a Source Code Walkthrough


TaskScheduler is responsible for submitting tasks and for requesting that the cluster manager schedule them.

  1. Submit tasks.
  2. Request the cluster manager to schedule tasks. Spark supports three cluster managers: the standalone cluster manager, Hadoop YARN, and Apache Mesos; the cnblogs article "Spark集群管理器介绍" gives an overview.

1 Creating the TaskScheduler: createTaskScheduler

The TaskScheduler is created by SparkContext.createTaskScheduler, shown below. The method matches the master URL against the supported deployment modes; every mode instantiates the same two roles (a TaskSchedulerImpl and a SchedulerBackend), but while the TaskSchedulerImpl is identical across modes, the SchedulerBackend implementation differs per mode.

/**
 * Create a task scheduler based on a given master URL.
 * Return a 2-tuple of the scheduler backend and the task scheduler.
 */
private def createTaskScheduler(
    sc: SparkContext,
    master: String,
    deployMode: String): (SchedulerBackend, TaskScheduler) = {
  import SparkMasterRegex._

  // When running locally, don't try to re-execute tasks on failure.
  val MAX_LOCAL_TASK_FAILURES = 1

  master match {
    case "local" =>
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalSchedulerBackend(sc.getConf, scheduler, 1)
      scheduler.initialize(backend)
      (backend, scheduler)

    case LOCAL_N_REGEX(threads) =>
      def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
      // local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
      val threadCount = if (threads == "*") localCpuCount else threads.toInt
      if (threadCount <= 0) {
        throw new SparkException(s"Asked to run locally with $threadCount threads")
      }
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalSchedulerBackend(sc.getConf, scheduler, threadCount)
      scheduler.initialize(backend)
      (backend, scheduler)

    case LOCAL_N_FAILURES_REGEX(threads, maxFailures) =>
      def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
      // local[*, M] means the number of cores on the computer with M failures
      // local[N, M] means exactly N threads with M failures
      val threadCount = if (threads == "*") localCpuCount else threads.toInt
      val scheduler = new TaskSchedulerImpl(sc, maxFailures.toInt, isLocal = true)
      val backend = new LocalSchedulerBackend(sc.getConf, scheduler, threadCount)
      scheduler.initialize(backend)
      (backend, scheduler)

    case SPARK_REGEX(sparkUrl) =>
      val scheduler = new TaskSchedulerImpl(sc)
      val masterUrls = sparkUrl.split(",").map("spark://" + _)
      val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      (backend, scheduler)

    case LOCAL_CLUSTER_REGEX(numSlaves, coresPerSlave, memoryPerSlave) =>
      // Check to make sure memory requested <= memoryPerSlave. Otherwise Spark will just hang.
      val memoryPerSlaveInt = memoryPerSlave.toInt
      if (sc.executorMemory > memoryPerSlaveInt) {
        throw new SparkException(
          "Asked to launch cluster with %d MB RAM / worker but requested %d MB/worker".format(
            memoryPerSlaveInt, sc.executorMemory))
      }
      val scheduler = new TaskSchedulerImpl(sc)
      val localCluster = new LocalSparkCluster(
        numSlaves.toInt, coresPerSlave.toInt, memoryPerSlaveInt, sc.conf)
      val masterUrls = localCluster.start()
      val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      backend.shutdownCallback = (backend: StandaloneSchedulerBackend) => {
        localCluster.stop()
      }
      (backend, scheduler)

    case masterUrl =>
      val cm = getClusterManager(masterUrl) match {
        case Some(clusterMgr) => clusterMgr
        case None => throw new SparkException("Could not parse Master URL: '" + master + "'")
      }
      try {
        val scheduler = cm.createTaskScheduler(sc, masterUrl)
        val backend = cm.createSchedulerBackend(sc, masterUrl, scheduler)
        cm.initialize(scheduler, backend)
        (backend, scheduler)
      } catch {
        case se: SparkException => throw se
        case NonFatal(e) =>
          throw new SparkException("External scheduler cannot be instantiated", e)
      }
  }
}
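To make the pattern matching above concrete, here is a small, self-contained sketch of how different master URLs fall into the cases. The regex patterns only approximate those in Spark's private SparkMasterRegex object (they are assumptions for this example, not copied from the source), so treat it as an illustration rather than library code:

object MasterUrlSketch {
  // Approximations of the patterns in SparkContext's SparkMasterRegex (assumed, not verbatim).
  val LOCAL_N_REGEX = """local\[([0-9]+|\*)\]""".r
  val LOCAL_N_FAILURES_REGEX = """local\[([0-9]+|\*)\s*,\s*([0-9]+)\]""".r
  val LOCAL_CLUSTER_REGEX = """local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*\]""".r
  val SPARK_REGEX = """spark://(.*)""".r

  def describe(master: String): String = master match {
    case "local" => "local mode, 1 thread"
    case LOCAL_N_REGEX(threads) => s"local mode, $threads threads"
    case LOCAL_N_FAILURES_REGEX(threads, maxFailures) =>
      s"local mode, $threads threads, up to $maxFailures task failures"
    case LOCAL_CLUSTER_REGEX(n, cores, mem) => s"local-cluster: $n workers x $cores cores x $mem MB"
    case SPARK_REGEX(url) => s"standalone cluster at spark://$url"
    case other => s"delegated to an external cluster manager: $other"
  }

  def main(args: Array[String]): Unit = {
    Seq("local", "local[4]", "local[*, 2]", "spark://host1:7077,host2:7077",
      "local-cluster[2, 1, 1024]", "yarn").foreach(m => println(s"$m -> ${describe(m)}"))
  }
}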

2 The TaskScheduler implementation class: TaskSchedulerImpl

The declaration of TaskSchedulerImpl looks like this:

private[spark] class TaskSchedulerImpl(
    val sc: SparkContext,
    val maxTaskFailures: Int,
    isLocal: Boolean = false)
  extends TaskScheduler with Logging {
  ...
}

Construction of a TaskSchedulerImpl proceeds as follows:

  1. Read configuration from SparkConf, including the number of CPUs allocated per task and the scheduling mode (there are two modes, FAIR and FIFO; the default is FIFO and it can be changed via the spark.scheduler.mode property). The relevant fields are shown below, followed by a short sketch of how these settings are read:
  val conf = sc.conf
// How often to check for speculative tasks
val SPECULATION_INTERVAL_MS = conf.getTimeAsMs("spark.speculation.interval", "100ms")
// Duplicate copies of a task will only be launched if the original copy has been running for
// at least this amount of time. This is to avoid the overhead of launching speculative copies
// of tasks that are very short.
val MIN_TIME_TO_SPECULATION = 100
private val speculationScheduler =
  ThreadUtils.newDaemonSingleThreadScheduledExecutor("task-scheduler-speculation")
// Threshold above which we warn user initial TaskSet may be starved
val STARVATION_TIMEOUT_MS = conf.getTimeAsMs("spark.starvation.timeout", "15s")
// CPUs to request per task
val CPUS_PER_TASK = conf.getInt("spark.task.cpus", 1)
// default scheduler is FIFO
private val schedulingModeConf = conf.get(SCHEDULER_MODE_PROPERTY, SchedulingMode.FIFO.toString)
private lazy val barrierSyncTimeout = conf.get(config.BARRIER_SYNC_TIMEOUT)
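As a quick illustration of how these fields pick up their values, here is a minimal sketch using the public SparkConf API; the property values chosen are made up for the example:

import org.apache.spark.SparkConf

object SchedulerConfSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical settings, for illustration only.
    val conf = new SparkConf()
      .set("spark.task.cpus", "2")
      .set("spark.scheduler.mode", "FAIR")

    // Mirrors what TaskSchedulerImpl reads at construction time.
    val cpusPerTask = conf.getInt("spark.task.cpus", 1)                                  // 2
    val schedulingMode = conf.get("spark.scheduler.mode", "FIFO")                        // "FAIR"
    val speculationIntervalMs = conf.getTimeAsMs("spark.speculation.interval", "100ms")  // 100 (default)

    println(s"cpusPerTask=$cpusPerTask, mode=$schedulingMode, speculationIntervalMs=$speculationIntervalMs")
  }
}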
  2. Create a TaskResultGetter, which uses a thread pool to process the Task results sent back by the Executors running on the Workers.
  // Source: excerpt from TaskSchedulerImpl
// This is a var so that we can reset it for testing purposes.
private[spark] var taskResultGetter = new TaskResultGetter(sc.env, this)

The thread pool is created by Executors.newFixedThreadPool with 4 threads by default; the thread names start with task-result-getter, and the thread factory defaults to Executors.defaultThreadFactory. Each point is explained below against the source excerpts that follow:

  • 4 threads by default: the THREADS constant in TaskResultGetter gets its value, 4, from the property "spark.resultGetter.threads". THREADS is passed into ThreadUtils.newDaemonFixedThreadPool, from there into Executors.newFixedThreadPool, and so on down until it reaches the ThreadPoolExecutor constructor as corePoolSize, where this.corePoolSize = corePoolSize; fixes the pool's default thread count at 4.
  • Thread names start with task-result-getter: the comment on newDaemonFixedThreadPool spells this out: "Thread names are formatted as prefix-ID, where ID is a unique, sequentially assigned integer." The prefix is the string "task-result-getter" that TaskResultGetter passes in via ThreadUtils.newDaemonFixedThreadPool(THREADS, "task-result-getter"), and the ID is a unique, sequentially assigned integer.
  • The thread factory defaults to Executors.defaultThreadFactory: this one is not obvious at first glance, but it can be traced. TaskResultGetter defines getTaskResultExecutor = ThreadUtils.newDaemonFixedThreadPool(THREADS, "task-result-getter"); newDaemonFixedThreadPool obtains its thread factory from namedThreadFactory(prefix), which builds new ThreadFactoryBuilder().setDaemon(true).setNameFormat(prefix + "-%d").build(). Inside Guava's ThreadFactoryBuilder.build(ThreadFactoryBuilder), the backing factory is builder.backingThreadFactory when one has been set and Executors.defaultThreadFactory() otherwise, and it is that backing factory whose newThread(runnable) actually creates the threads. Since Spark never sets a backing factory, the threads are ultimately created by Executors.DefaultThreadFactory, i.e. Executors.defaultThreadFactory.
// Source: excerpt from TaskResultGetter
/**
 * Runs a thread pool that deserializes and remotely fetches (if necessary) task results.
 */
private[spark] class TaskResultGetter(sparkEnv: SparkEnv, scheduler: TaskSchedulerImpl)
  extends Logging {

  // 4 threads are created by default.
  private val THREADS = sparkEnv.conf.getInt("spark.resultGetter.threads", 4)

  // Exposed for testing.
  protected val getTaskResultExecutor: ExecutorService =
    ThreadUtils.newDaemonFixedThreadPool(THREADS, "task-result-getter")
  ...
}
// Source: newDaemonFixedThreadPool, from ThreadUtils.scala
/**
 * Wrapper over newFixedThreadPool. Thread names are formatted as prefix-ID, where ID is a
 * unique, sequentially assigned integer.
 */
def newDaemonFixedThreadPool(nThreads: Int, prefix: String): ThreadPoolExecutor = {
  val threadFactory = namedThreadFactory(prefix)
  Executors.newFixedThreadPool(nThreads, threadFactory).asInstanceOf[ThreadPoolExecutor]
}
// Source: the ThreadPoolExecutor constructor (java.util.concurrent)
/**
* Creates a new {@code ThreadPoolExecutor} with the given initial
* parameters.
*
* @param corePoolSize the number of threads to keep in the pool, even
*        if they are idle, unless {@code allowCoreThreadTimeOut} is set
* @param maximumPoolSize the maximum number of threads to allow in the
*        pool
* @param keepAliveTime when the number of threads is greater than
*        the core, this is the maximum time that excess idle threads
*        will wait for new tasks before terminating.
* @param unit the time unit for the {@code keepAliveTime} argument
* @param workQueue the queue to use for holding tasks before they are
*        executed.  This queue will hold only the {@code Runnable}
*        tasks submitted by the {@code execute} method.
* @param threadFactory the factory to use when the executor
*        creates a new thread
* @param handler the handler to use when execution is blocked
*        because the thread bounds and queue capacities are reached
* @throws IllegalArgumentException if one of the following holds:<br>
*         {@code corePoolSize < 0}<br>
*         {@code keepAliveTime < 0}<br>
*         {@code maximumPoolSize <= 0}<br>
*         {@code maximumPoolSize < corePoolSize}
* @throws NullPointerException if {@code workQueue}
*         or {@code threadFactory} or {@code handler} is null
*/
public ThreadPoolExecutor(int corePoolSize,
                          int maximumPoolSize,
                          long keepAliveTime,
                          TimeUnit unit,
                          BlockingQueue<Runnable> workQueue,
                          ThreadFactory threadFactory,
                          RejectedExecutionHandler handler) {
    if (corePoolSize < 0 ||
        maximumPoolSize <= 0 ||
        maximumPoolSize < corePoolSize ||
        keepAliveTime < 0)
        throw new IllegalArgumentException();
    if (workQueue == null || threadFactory == null || handler == null)
        throw new NullPointerException();
    this.acc = System.getSecurityManager() == null ?
        null :
        AccessController.getContext();
    this.corePoolSize = corePoolSize;
    this.maximumPoolSize = maximumPoolSize;
    this.workQueue = workQueue;
    this.keepAliveTime = unit.toNanos(keepAliveTime);
    this.threadFactory = threadFactory;
    this.handler = handler;
}
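To observe the naming and daemon behavior in isolation, the following sketch reproduces what namedThreadFactory does, using Guava's ThreadFactoryBuilder directly (Guava is assumed to be on the classpath, as it is for Spark itself; apart from the "task-result-getter" prefix, everything here is illustrative):

import java.util.concurrent.{Executors, ThreadPoolExecutor, TimeUnit}
import com.google.common.util.concurrent.ThreadFactoryBuilder

object TaskResultGetterThreadsSketch {
  def main(args: Array[String]): Unit = {
    // Equivalent of ThreadUtils.namedThreadFactory("task-result-getter"):
    // daemon threads named "task-result-getter-0", "task-result-getter-1", ...
    val threadFactory = new ThreadFactoryBuilder()
      .setDaemon(true)
      .setNameFormat("task-result-getter-%d")
      .build()

    // Equivalent of ThreadUtils.newDaemonFixedThreadPool(4, "task-result-getter").
    val pool = Executors.newFixedThreadPool(4, threadFactory).asInstanceOf[ThreadPoolExecutor]

    pool.execute(new Runnable {
      override def run(): Unit =
        println(s"running on ${Thread.currentThread().getName}, daemon=${Thread.currentThread().isDaemon}")
    })

    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
  }
}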

Spark has two task scheduling modes, FAIR (fair scheduling) and FIFO (first in, first out), and in either mode the actual dispatch of tasks ultimately goes through a concrete implementation of the SchedulerBackend interface. Exactly how scheduling happens inside the SchedulerBackend is something the author still needs to study further (the articles found so far are too brief) and will be filled in later; a decent reference on this topic is the cnblogs post "Spark理论学习笔记(一)" by 一笑之奈何.
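Although the backend internals are deferred, switching between the two modes is purely configuration. A minimal sketch, assuming a local master (the application name and pool name are made up for the example):

import org.apache.spark.{SparkConf, SparkContext}

object FairSchedulingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("fair-scheduling-sketch")  // hypothetical app name
      .set("spark.scheduler.mode", "FAIR")   // default is FIFO
    val sc = new SparkContext(conf)

    // Jobs submitted from this thread go into the named pool
    // (pools can be defined in a fairscheduler.xml file; "reports" is made up here).
    sc.setLocalProperty("spark.scheduler.pool", "reports")

    println(sc.parallelize(1 to 100).sum())
    sc.stop()
  }
}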

3 Starting the TaskScheduler

The TaskScheduler is started from SparkContext.scala:

// Source: from SparkContext.scala
// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
_taskScheduler.start()

_taskScheduler is an instance of TaskSchedulerImpl, the implementation class of TaskScheduler, and TaskSchedulerImpl overrides TaskScheduler's start() method, so _taskScheduler.start() actually runs TaskSchedulerImpl.start(). As the source below shows, when the TaskScheduler starts it in turn calls the backend's start() method.

// Source: from TaskSchedulerImpl.scala (abridged)
override def start() {
  backend.start()
}
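For completeness, the excerpt above is abridged: when not running locally and spark.speculation is enabled, the real TaskSchedulerImpl.start() also arms the periodic speculation check on the speculationScheduler created during construction. Roughly (paraphrased, not a verbatim copy; details vary by Spark version):

// Paraphrased sketch of the rest of TaskSchedulerImpl.start(), assuming TimeUnit is imported.
override def start() {
  backend.start()

  if (!isLocal && conf.getBoolean("spark.speculation", false)) {
    logInfo("Starting speculative execution thread")
    speculationScheduler.scheduleWithFixedDelay(new Runnable {
      override def run(): Unit = Utils.tryOrStopSparkContext(sc) {
        checkSpeculatableTasks()
      }
    }, SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
  }
}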

**So what does the backend do once it has started?** See section 4.

4 Submitting tasks through the TaskScheduler

In the source below, the call backend.reviveOffers() is what actually gets tasks launched. Calling the taskScheduler's submitTasks method eventually calls the backend's reviveOffers method, so what looks like the taskScheduler scheduling tasks is really the schedulerBackend doing the scheduling; a sketch of the flow follows the excerpt.

// Source: from TaskSchedulerImpl.scala (abridged)
override def submitTasks(taskSet: TaskSet) {
  ...
}
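The elided body deserves a rough outline. Paraphrased (not verbatim; helper names can differ across Spark versions), submitTasks wraps the TaskSet in a TaskSetManager, registers it with the FIFO/FAIR schedulable pool, and only then pokes the backend:

// Paraphrased sketch of TaskSchedulerImpl.submitTasks, not a verbatim copy.
override def submitTasks(taskSet: TaskSet) {
  val tasks = taskSet.tasks
  logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
  this.synchronized {
    // One TaskSetManager per TaskSet; it tracks per-task state, locality and retries.
    val manager = createTaskSetManager(taskSet, maxTaskFailures)
    // Register the manager with the root pool built by the FIFO or FAIR SchedulableBuilder.
    schedulableBuilder.addTaskSetManager(manager, manager.properties)
    // (Starvation-warning timer and other bookkeeping omitted here.)
  }
  // Ask the SchedulerBackend for resource offers so the new tasks can be launched.
  backend.reviveOffers()
}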

5 SchedulerBackend

The local deployment modes from section 1 plug in a LocalSchedulerBackend; its start() method is shown below, together with the SchedulerBackend trait that every backend implements.

/**
 * Used when running a local version of Spark where the executor, backend, and master all run in
 * the same JVM. It sits behind a [[TaskSchedulerImpl]] and handles launching tasks on a single
 * Executor (created by the [[LocalSchedulerBackend]]) running locally.
 */
private[spark] class LocalSchedulerBackend(
    conf: SparkConf,
    scheduler: TaskSchedulerImpl,
    val totalCores: Int)
  extends SchedulerBackend with ExecutorBackend with Logging {
  ...
  override def start() {
    val rpcEnv = SparkEnv.get.rpcEnv
    val executorEndpoint = new LocalEndpoint(rpcEnv, userClassPath, scheduler, this, totalCores)
    localEndpoint = rpcEnv.setupEndpoint("LocalSchedulerBackendEndpoint", executorEndpoint)
    listenerBus.post(SparkListenerExecutorAdded(
      System.currentTimeMillis,
      executorEndpoint.localExecutorId,
      new ExecutorInfo(executorEndpoint.localExecutorHostname, totalCores, Map.empty)))
    launcherBackend.setAppId(appId)
    launcherBackend.setState(SparkAppHandle.State.RUNNING)
  }
  ...
}
/**
 * A backend interface for scheduling systems that allows plugging in different ones under
 * TaskSchedulerImpl. We assume a Mesos-like model where the application gets resource offers as
 * machines become available and can launch tasks on them.
 */
private[spark] trait SchedulerBackend {
  private val appId = "spark-application-" + System.currentTimeMillis

  def start(): Unit
  def stop(): Unit
  def reviveOffers(): Unit
  def defaultParallelism(): Int
  ...
}
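To close the loop with section 4: in the local backend, reviveOffers() simply sends a message to the LocalEndpoint registered in start() above, and the endpoint turns the scheduler's decisions into launched tasks. The following is a condensed paraphrase of that path from LocalSchedulerBackend.scala (signatures and fields are simplified and vary by Spark version):

// Paraphrased sketch, condensed from LocalSchedulerBackend.scala; not a verbatim copy.
private case object ReviveOffers

private[spark] class LocalEndpoint(
    override val rpcEnv: RpcEnv,
    userClassPath: Seq[URL],
    scheduler: TaskSchedulerImpl,
    executorBackend: LocalSchedulerBackend,
    private val totalCores: Int)
  extends ThreadSafeRpcEndpoint with Logging {

  private var freeCores = totalCores

  override def receive: PartialFunction[Any, Unit] = {
    case ReviveOffers => reviveOffers()
    // StatusUpdate / KillTask handling omitted.
  }

  def reviveOffers(): Unit = {
    // Offer this single local executor's free cores to the TaskSchedulerImpl ...
    val offers = IndexedSeq(new WorkerOffer(localExecutorId, localExecutorHostname, freeCores))
    // ... and launch whatever task descriptions the scheduler hands back.
    for (task <- scheduler.resourceOffers(offers).flatten) {
      freeCores -= scheduler.CPUS_PER_TASK
      executor.launchTask(executorBackend, task)
    }
  }
}

// In LocalSchedulerBackend itself:
override def reviveOffers(): Unit = {
  localEndpoint.send(ReviveOffers)
}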
