Paper Reading


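Workloads fall into two main classes, production and non-production, which usually correspond to online tasks (Service) and offline tasks (Batch Job). The gap between the resources allocated and the resources actually used can be put to good use; this shows up in the notes on resource reclaim and resource over-commit.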

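A Job is a Service or a Batch Job; a task is one instance of that Service or Batch Job, and an instance usually corresponds to one container. Most properties are the same across instances, such as resource requirements, the machine filters used for scheduling, and the fault-tolerance policy; a few are unique, such as the instance's index within the Service or Batch Job.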

It achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation

The machines in a cell belong to a single cluster…

A Borg job’s properties include its name, owner, and the number of tasks it has

This whole process is called resource reclamation. The estimate is called the task’s reservation, and is computed by the Borgmaster every few seconds, using fine-grained usage (resource-consumption) information captured by the Borglet. The initial reservation is set equal to the resource request (the limit); after 300 s, to allow for startup transients, it decays slowly towards the actual usage plus a safety margin. The reservation is rapidly increased if the usage exceeds it…
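To make the reservation rule quoted above concrete, here is a minimal sketch of one update step. The function name `next_reservation`, the relative `margin`, the `decay` factor, and the cap at the limit are assumptions made for illustration; the paper does not give the actual estimator.

```python
def next_reservation(current, limit, usage, age_s, *, startup_s,
                     margin=0.1, decay=0.05):
    """One update step of a task's reservation for a single resource.

    current   : the reservation computed in the previous step
    limit     : the resource request (the limit)
    usage     : the most recent measured usage
    age_s     : seconds since the task started
    startup_s : length of the startup window (hypothetical parameter)
    margin    : safety margin as a fraction of usage (assumed form)
    decay     : fraction of the gap closed per step (assumed form)
    """
    if age_s < startup_s:
        # During startup the reservation simply equals the limit.
        return limit
    # Usage plus safety margin, capped at the limit (the cap is an assumption).
    target = min(limit, usage * (1.0 + margin))
    if usage > current:
        # Usage exceeds the reservation: raise it immediately.
        return max(current, target)
    # Otherwise move slowly towards usage plus margin.
    return current + decay * (target - current)
```

Calling this every few seconds with fresh usage numbers gives the behaviour described in the quote: flat at the limit during startup, a slow decay afterwards, and a fast jump when usage spikes.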

An alloc set is like a job: it is a group of allocs that reserve resources on multiple machines…

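This part is really about fine-grained control of resources on a single machine: how to keep task survival rates as high as possible while reducing the performance interference that isolation cannot fully prevent, and how to estimate and fine-tune the resources that will be needed based on what the service, batch job, and best-effort tasks on the machine currently consume. For example, if a service's demand for a compressible resource such as CPU rises, some low-priority tasks can be throttled for a while, say a few minutes, instead of being killed outright, because it may be only a brief traffic spike. Likewise, for compressible resources a task may moderately exceed its limit. This part is quite fine-grained and complex.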

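I think the feasibility check here is best treated as a filter, with scoring treated as the actual bin-packing algorithm: the feasibility check does not need much fine-grained tuning, whereas the scoring and packing step is where different algorithms can be tried out and tuned.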

Service discovery: when a task instance of a Service or Batch Job is created, the task's unique identifier is registered together with its IP and port.

To enable this, Borg creates a stable “Borg name service” (BNS) name for each task that includes the cell name, job name, and task number…

The scheduling algorithm has two parts: feasibility checking, to find machines on which the task could run, and scoring, which picks one of the feasible machines…
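The two-part structure in the quote (filter first, then score) can be sketched roughly as below. The `Machine` and `Task` classes, their fields, and the `leftover_cpu` scorer are hypothetical, chosen only to show the shape of the algorithm; the scoring function is passed in so that different packing policies can be swapped without touching the filter.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    free_cpu: float
    free_mem: float
    attrs: dict = field(default_factory=dict)


@dataclass
class Task:
    cpu: float
    mem: float
    constraints: dict = field(default_factory=dict)


def feasible(task, machine):
    """Feasibility check: constraints match and the resources fit."""
    meets = all(machine.attrs.get(k) == v for k, v in task.constraints.items())
    return meets and machine.free_cpu >= task.cpu and machine.free_mem >= task.mem


def schedule(task, machines, score):
    """Filter to feasible machines, then pick the highest-scoring one."""
    candidates = [m for m in machines if feasible(task, m)]
    if not candidates:
        return None  # the task stays pending
    return max(candidates, key=lambda m: score(task, m))


def leftover_cpu(task, machine):
    """Example scorer (an assumption): prefer the tightest CPU fit."""
    return -(machine.free_cpu - task.cpu)
```

For example, `schedule(Task(8, 16, {"arch": "x86"}), machines, leftover_cpu)` returns the x86 machine with the least CPU left over after placement, or `None` when nothing fits.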

Does Google's public cloud also use Borg to manage its virtual machines?

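During scheduling, a high-priority Job can see the resources held by lower-priority Jobs. When the task is actually placed, it may need to preempt resources from lower-priority Jobs, and the low-priority tasks that get killed are rescheduled.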

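On top of the clusters in multiple data centers, one can usually build a higher-level management platform that handles some relatively simple cross-data-center scheduling policies, for example k8s federation.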

Even so, occasional low-level resource interference (e.g., memory bandwidth or L3 cache pollution) still happens…

A second split is between compressible resources (e.g., CPU cycles, disk I/O bandwidth) that are rate-based and can be reclaimed from a task by decreasing its quality of service without killing it; and non-compressible resources (e.g., memory, disk space) which generally cannot be reclaimed without killing the task. If a machine runs out of non-compressible resources, the Borglet immediately terminates tasks, from lowest to highest priority, until the remaining reservations can be met. If the machine runs out of compressible resources, the Borglet throttles usage (favoring LS tasks) so that short load spikes can be handled without killing any tasks. If things do not improve, Borgmaster will remove one or more tasks from the machine…
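The two reactions described above can be sketched as follows. The `Task` fields and both function names are hypothetical, and the throttling policy (latency-sensitive tasks served first, the rest scaled down proportionally) is only one plausible reading of "favoring LS tasks", not the Borglet's actual behaviour.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int
    demand: float              # reservation or current demand for one resource
    latency_sensitive: bool = False


def tasks_to_kill(tasks, capacity):
    """Non-compressible resource: choose victims from lowest to highest
    priority until the remaining demand fits within capacity."""
    remaining = sum(t.demand for t in tasks)
    victims = []
    for t in sorted(tasks, key=lambda t: t.priority):
        if remaining <= capacity:
            break
        victims.append(t.name)
        remaining -= t.demand
    return victims


def throttle(tasks, capacity):
    """Compressible resource: nobody is killed. LS tasks get their demand
    first; the others share what is left in proportion to their demand."""
    grants = {}
    left = capacity
    for t in (t for t in tasks if t.latency_sensitive):
        grants[t.name] = min(t.demand, left)
        left -= grants[t.name]
    others = [t for t in tasks if not t.latency_sensitive]
    want = sum(t.demand for t in others)
    scale = min(1.0, left / want) if want > 0 else 0.0
    for t in others:
        grants[t.name] = t.demand * scale
    return grants
```

The asymmetry is the point: running out of memory costs some task its life, whereas running out of CPU only costs the lower-priority tasks some speed for a while.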

If the machine selected by the scoring phase doesn’t have enough available resources to fit the new task, Borg preempts (kills) lower-priority tasks, from lowest to highest priority, until it does. We add the preempted tasks to the scheduler’s pending queue, rather than migrate or hibernate them…
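A rough sketch of the preemption step in the quote, for a single resource dimension. The tuple layout `(name, priority, amount)`, the function name, and the explicit `pending` queue are assumptions for illustration.

```python
from collections import deque


def preempt_to_fit(needed, free, running, new_priority, pending):
    """Evict lower-priority tasks, lowest priority first, until `needed` fits.

    running : list of (name, priority, amount) tuples on the chosen machine
    pending : the scheduler's pending queue; evicted tasks are appended to it
    Returns the list of evicted task names, or None if the task cannot fit
    even after evicting every lower-priority task.
    """
    victims = []
    for name, priority, amount in sorted(running, key=lambda r: r[1]):
        if free >= needed:
            break
        if priority >= new_priority:
            break  # only strictly lower-priority tasks may be evicted
        victims.append(name)
        free += amount
    if free < needed:
        return None
    # Evicted tasks are re-queued for scheduling, not migrated or hibernated.
    pending.extend(victims)
    return victims


pending = deque()
running = [("a", 100, 2.0), ("b", 0, 1.0), ("c", 200, 4.0)]
evicted = preempt_to_fit(3.0, 0.5, running, new_priority=150, pending=pending)
```

In this example `b` is evicted first, then `a`, at which point the new task fits; `c` has higher priority than the new task and is never considered.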

Quota-checking is part of admission control, not scheduling: jobs with insufficient quota are immediately rejected upon submission…
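As a small illustration of quota as an admission check rather than a scheduling step, the sketch below rejects a job at submission time when any requested resource exceeds the remaining quota. The dictionary layout and the function name `admit` are hypothetical; real quota also has a priority and a time period, which are left out here.

```python
def admit(request, quota_remaining):
    """Return True if every requested resource fits in the remaining quota.

    request         : e.g. {"cpu": 8, "mem_gib": 32} for the whole job
    quota_remaining : what the submitting user or product line has left
    """
    return all(amount <= quota_remaining.get(resource, 0)
               for resource, amount in request.items())


quota = {"cpu": 100, "mem_gib": 256}
accepted = admit({"cpu": 8, "mem_gib": 32}, quota)
rejected = admit({"cpu": 8, "mem_gib": 512}, quota)
```

A rejected job never reaches the scheduler, which is why quota is a different thing from the per-task limit.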

The three most important points: the bin-packing and scheduling algorithm; resource preemption, reclaim, and over-commit; and resource isolation.

task’s constraints and also have enough “available” resources – which includes resources assigned to lower-priority tasks that can be evicted…

Borg defines non-overlapping priority bands for different uses, including (in decreasing-priority order): monitoring, production, batch, and best effort (also known as testing or free)…

for non-prod tasks, it uses the reservations of existing tasks so the new tasks can be scheduled into reclaimed resources

Container group: usually corresponds to one Service or Batch Job, and the number of containers in the group usually corresponds to the number of tasks.

VMs and security sandboxing techniques are used to run external software by Google’s AppEngine (GAE) [38] and Google Compute Engine (GCE). We run each hosted VM in a KVM process [54] that runs as a Borg task…

we classify higher-priority Borg jobs as “production” (prod) ones, and the rest as “non-production” (non-prod). Most long-running server jobs are prod; most batch jobs are non-prod… The discrepancies between allocation and usage will prove important in…

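There are two basic approaches to bin packing: worst fit, which looks first for the machines with the most free resources, and best fit, which tries to fill up one machine first.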

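Priority and preemption: four broad priority bands are defined here, Monitoring, Production, Batch, and Best-effort, and each band can contain finer-grained priorities. A higher-priority Job can preempt a lower-priority Job, but Jobs at production level (Monitoring, Production) are not allowed to preempt one another.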
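The preemption rule in this note can be written down as a small predicate. The `Band` enum values and the idea of comparing a `(band, priority)` pair are assumptions for illustration; the rule that production-level Jobs never preempt one another follows the note above.

```python
from enum import IntEnum


class Band(IntEnum):
    # Higher value means higher priority; the numbers themselves are arbitrary.
    BEST_EFFORT = 0
    BATCH = 1
    PRODUCTION = 2
    MONITORING = 3


# Bands treated as production level in the note above.
PROD_LEVEL = {Band.PRODUCTION, Band.MONITORING}


def can_preempt(band_a, prio_a, band_b, prio_b):
    """Can a task in (band_a, prio_a) preempt a task in (band_b, prio_b)?"""
    if band_a in PROD_LEVEL and band_b in PROD_LEVEL:
        # Production-level tasks never preempt one another.
        return False
    # Otherwise compare band first, then the finer priority inside the band.
    return (band_a, prio_a) > (band_b, prio_b)
```

So a Production task can evict a Batch task, and a Batch task can evict a lower-priority Batch task, but a Monitoring task cannot evict a Production one.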

we sometimes call this “worst fit”. The opposite end of the spectrum is “best fit”, which tries to fill machines as tightly as possible…
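The two ends of the spectrum differ only in the sign of the score. The sketch below uses a single resource dimension and hypothetical names (`pick`, `worst_fit`, `best_fit`); it only illustrates the two extremes.

```python
def worst_fit(left_after):
    """Prefer the machine with the most room left after placement."""
    return left_after


def best_fit(left_after):
    """Prefer the machine that ends up most tightly packed."""
    return -left_after


def pick(free_by_machine, demand, policy):
    """Return the name of the feasible machine with the best policy score."""
    left = {name: free - demand
            for name, free in free_by_machine.items() if free >= demand}
    if not left:
        return None
    return max(left, key=lambda name: policy(left[name]))


free = {"m1": 8.0, "m2": 3.0, "m3": 1.0}
spread = pick(free, 2.0, worst_fit)   # the emptiest machine wins
packed = pick(free, 2.0, best_fit)    # the tightest feasible machine wins
```

Worst fit spreads load and leaves headroom for spikes at the cost of fragmentation; best fit keeps whole machines free for large tasks but leaves little slack on the machines it fills.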

A higher-priority Job can preempt the resources of a lower-priority Job, but Jobs within the prod band cannot preempt one another.

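This is similar to a k8s pod: an alloc corresponds to one container and usually to one task instance, but a single alloc (one container) can also run several task instances. Those instances share resources and may live in the same resource namespace.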

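Resource reclaim is an important way to raise utilization when online and offline workloads are co-located, and the reclaimed resources can be used by best-effort Jobs. How exactly is fast recovery of the temporarily borrowed resources guaranteed?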

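Quota here means the resource budget purchased by each product line, not the resource ceiling (limit) assigned to a Service or Batch Job. When resources are allocated, quota is used to cap how much each user may request.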

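At scheduling time, non-prod Jobs can see resources that have been reclaimed. In other words, a machine's available amount is its capacity minus the limit requested by each task, plus the amount reclaimed from each task, where the amount that can be reclaimed from a task is computed as: limit - (the task's actual usage over a recent period + safety margin). Prod Jobs, by contrast, must be scheduled against limits and cannot use reclaimed resources. Besides the basic preemption of low-priority Jobs by high-priority Jobs, another important technique for raising utilization is resource over-commit: the gap between what prod Jobs reserve and what they actually use can run lower-priority and best-effort tasks. Online Jobs cannot be preempted. Offline Jobs are not preempted as long as the machine still has enough resources after all online Jobs are satisfied. Best-effort Jobs actually see more resources than an offline Job does: they can be scheduled onto a machine whose resources are already 100% reserved and run on that machine's reclaimed resources. Once prod needs those resources back, the best-effort Job is killed, so best-effort availability is low. Even so, this approach greatly improves machine utilization.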
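The arithmetic in this note is easy to check with a small sketch. The function names, the tuple layout `(limit, usage, margin)`, and the absolute (rather than relative) safety margin are assumptions; the formula itself is the one given above.

```python
def reclaimable(limit, usage, margin):
    """Amount that can be reclaimed from one task: limit - (usage + margin)."""
    return max(0.0, limit - (usage + margin))


def available(capacity, tasks, prod):
    """Resources a new task can see on a machine.

    tasks : list of (limit, usage, margin) tuples for the running tasks
    prod  : True for a prod task (limits only), False for non-prod
            (limits plus whatever has been reclaimed)
    """
    by_limit = capacity - sum(limit for limit, _, _ in tasks)
    if prod:
        return by_limit
    return by_limit + sum(reclaimable(*t) for t in tasks)


# A 16-core machine whose capacity is fully reserved by two tasks.
tasks = [(8.0, 3.0, 1.0), (8.0, 2.0, 1.0)]
for_prod = available(16.0, tasks, prod=True)       # nothing left by limits
for_non_prod = available(16.0, tasks, prod=False)  # 4 + 5 cores reclaimed
```

This is the situation described in the note: a machine that looks 100% reserved to a prod task still offers 9 cores to a non-prod or best-effort task, with the understanding that those cores can be taken back.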

A Borg alloc (short for allocation) is a reserved set of resources on a machine in which one or more tasks can be run…

Even with cgroups resource isolation, tasks can still interfere with one another.

Copyright notice: this is an original article by the blog author and may not be reproduced without the author's permission. https://blog.csdn.net/feilengcui008/article/details/68942106