Analysis of balance_dirty_pages_ratelimited

news/2024/10/8 14:00:17/Source: https://www.cnblogs.com/linhaostudy/p/18402841

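  • nr_dirtied_pause: the dirty-page threshold of the current task;
  • dirty_exceeded: set when this bdi's dirty page count exceeds its threshold and, in addition, either the global dirty page count exceeds the global threshold or the bdi uses strictlimit (dirty_exceeded = (bdi_dirty > bdi_thresh) && ((nr_dirty > dirty_thresh) || strictlimit););
  • bdp_ratelimits: a per-CPU variable counting the pages dirtied on the current CPU;
  • ratelimit_pages: the per-CPU dirty-page threshold.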
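
The dirty_exceeded condition can be read as a plain predicate. The following is a minimal userspace sketch, not kernel code: the function name and the flat parameter list are assumptions made for illustration, and only the boolean expression itself comes from the text above.

```c
#include <stdbool.h>

/*
 * Hypothetical helper mirroring the expression quoted above.
 * All arguments are page counts except strictlimit.
 */
static bool dirty_exceeded(unsigned long bdi_dirty, unsigned long bdi_thresh,
			   unsigned long nr_dirty, unsigned long dirty_thresh,
			   bool strictlimit)
{
	/* The bdi must be over its own limit in every case. */
	return (bdi_dirty > bdi_thresh) &&
	       ((nr_dirty > dirty_thresh) || strictlimit);
}
```

Note that a bdi over its own threshold is not enough on its own unless strictlimit is set; without it the global count must be over the global threshold as well.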

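balance_dirty_pages is called when any of the following holds:
1: The number of pages dirtied by the current task reaches ratelimit (if dirty_exceeded is 0, ratelimit is current->nr_dirtied_pause; if dirty_exceeded is 1, ratelimit is capped at 32KB);

2: The number of pages dirtied on the current CPU reaches the threshold ratelimit_pages;

3: The current task's dirtied pages plus the dirtied pages left behind by exited threads reach the threshold.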
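
The three conditions can be sketched as one userspace function. This is only an illustration under stated assumptions: struct task_sketch, should_balance and PAGE_SHIFT_SKETCH are hypothetical names, 4KB pages are assumed, and the per-CPU counters are passed in as plain pointers instead of real per-CPU variables.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two task fields involved. */
struct task_sketch {
	int nr_dirtied;        /* pages dirtied since the last balance */
	int nr_dirtied_pause;  /* per-task threshold, in pages */
};

#define PAGE_SHIFT_SKETCH 12   /* assumption: 4KB pages */

/*
 * Returns true when balance_dirty_pages() would be called.
 * cpu_count models this CPU's bdp_ratelimits counter and leaks
 * models its dirty_throttle_leaks counter.
 */
static bool should_balance(struct task_sketch *t, bool dirty_exceeded,
			   int *cpu_count, int cpu_limit, int *leaks)
{
	int ratelimit = t->nr_dirtied_pause;
	int cap = 32 >> (PAGE_SHIFT_SKETCH - 10);   /* 8 pages = 32KB */

	if (dirty_exceeded && cap < ratelimit)
		ratelimit = cap;

	if (t->nr_dirtied >= ratelimit) {           /* condition 1 */
		*cpu_count = 0;
	} else if (*cpu_count >= cpu_limit) {       /* condition 2 */
		*cpu_count = 0;
		ratelimit = 0;
	}

	/* condition 3: charge pages left behind by exited tasks */
	if (*leaks > 0 && t->nr_dirtied < ratelimit) {
		int room = ratelimit - t->nr_dirtied;
		int take = *leaks < room ? *leaks : room;

		*leaks -= take;
		t->nr_dirtied += take;
	}
	return t->nr_dirtied >= ratelimit;
}
```

For example, a task with nr_dirtied = 10 and nr_dirtied_pause = 32 does not balance on its own, but it does once dirty_exceeded lowers the limit to 8 pages, or once 22 or more leaked pages are charged to it.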

void balance_dirty_pages_ratelimited(struct address_space *mapping)
{
	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
	int ratelimit;
	int *p;

	if (!bdi_cap_account_dirty(bdi))
		return;

	ratelimit = current->nr_dirtied_pause;	/* threshold: the initial value of 32 pages means 128KB */
	if (bdi->dirty_exceeded)	/* when set, lower the balancing threshold so dirty pages get written back sooner */
		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));	/* cap the threshold at 32KB (initial value 128KB) */

	preempt_disable();
	/*
	 * This prevents one CPU to accumulate too many dirtied pages without
	 * calling into balance_dirty_pages(), which can happen when there are
	 * 1000+ tasks, all of them start dirtying pages at exactly the same
	 * time, hence all honoured too large initial task->nr_dirtied_pause.
	 */
	/* i.e. balance whenever either the current task or the current CPU is over its threshold */
	p = this_cpu_ptr(&bdp_ratelimits);	/* dirtied-page count of the current CPU */
	if (unlikely(current->nr_dirtied >= ratelimit))
		/* the task is over its threshold, so balancing will run below; restart the per-CPU count */
		*p = 0;
	else if (unlikely(*p >= ratelimit_pages)) {	/* defaults to 32 pages */
		/*
		 * The task is under its threshold but the CPU is over the
		 * per-CPU threshold: set ratelimit to 0 so balancing is
		 * guaranteed to run, and restart the per-CPU count.
		 */
		*p = 0;
		ratelimit = 0;
	}
	/*
	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
	 * short-lived tasks (eg. gcc invocations in a kernel build) escaping
	 * the dirty throttling and livelock other long-run dirtiers.
	 */
	p = this_cpu_ptr(&dirty_throttle_leaks);	/* pages left by exited threads are charged here */
	if (*p > 0 && current->nr_dirtied < ratelimit) {
		unsigned long nr_pages_dirtied;
		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
		*p -= nr_pages_dirtied;
		current->nr_dirtied += nr_pages_dirtied;
	}
	preempt_enable();

	if (unlikely(current->nr_dirtied >= ratelimit))	/* the current task is over its threshold */
		balance_dirty_pages(mapping, current->nr_dirtied);
}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited);

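Normally dirty pages are cleaned by periodic writeback and background writeback, which costs the dirtying task no time. But once dirty > dirty_freerun_ceiling(thresh, bg_thresh), i.e. the dirty page count exceeds the midpoint (thresh + bg_thresh) / 2 of the throttling threshold and the background threshold, the current task has to sleep for a while so that the flusher thread can do its work.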
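
A small userspace sketch of this check, assuming all values are page counts. The names freerun_ceiling and must_throttle are hypothetical; only the (thresh + bg_thresh) / 2 midpoint comes from the kernel comment quoted in the code further down.

```c
#include <stdbool.h>

/* Midpoint of the throttling threshold and the background threshold. */
static unsigned long freerun_ceiling(unsigned long thresh,
				     unsigned long bg_thresh)
{
	return (thresh + bg_thresh) / 2;
}

/* True when the dirtying task is expected to be paused. */
static bool must_throttle(unsigned long dirty, unsigned long thresh,
			  unsigned long bg_thresh)
{
	return dirty > freerun_ceiling(thresh, bg_thresh);
}
```

With thresh = 2000 and bg_thresh = 1000 pages the ceiling is 1500, so a task sees no pause at 1500 dirty pages and becomes a throttling candidate at 1501.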

When dirty <= dirty_freerun_ceiling(thresh, bg_thresh), nr_dirtied_pause is still adjusted dynamically so that later balancing works better. The adjustment policy is:

static unsigned long dirty_poll_interval(unsigned long dirty,
					 unsigned long thresh)
{
	if (thresh > dirty)
		return 1UL << (ilog2(thresh - dirty) >> 1);

	return 1;	/* dirty count has reached the threshold: poll after every single page */
}
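
To try the formula outside the kernel, here is a minimal userspace sketch. ilog2_sketch is a hypothetical stand-in for the kernel's ilog2() (floor of log2), and poll_interval repeats the logic above under that assumption.

```c
/* floor(log2(n)) for n > 0; userspace stand-in for the kernel's ilog2(). */
static int ilog2_sketch(unsigned long n)
{
	int log = 0;

	while (n >>= 1)
		log++;
	return log;
}

/* Pages a task may dirty before it polls balance_dirty_pages() again. */
static unsigned long poll_interval(unsigned long dirty, unsigned long thresh)
{
	if (thresh > dirty)
		return 1UL << (ilog2_sketch(thresh - dirty) >> 1);
	return 1;
}
```

Halving the log2 of the gap is a cheap approximation of the square root that always lands on a power of two: a gap of 256 pages gives 16, while a gap of 512 pages also gives 16 rather than the exact 22.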

As for why it is done this way, see the following explanation:
/*
Ideally if we know there are N dirtiers, it’s safe to let each task
poll at (thresh-dirty)/N without exceeding the dirty limit.

However we neither know the current N, nor is sure whether it will
rush high at next second. So sqrt is used to tolerate larger N on
increased (thresh-dirty) gap:

irb> 0.upto(10) { |i| mb=2**i; pages=mb<<(20-12); printf "%4d\t%4d\n", mb, Math.sqrt(pages)}

   1      16
   2      22
   4      32
   8      45
  16      64
  32      90
  64     128
 128     181
 256     256
 512     362
1024     512

The above table means, given 1MB (or 1GB) gap and the dd tasks polling
balance_dirty_pages() on every 16 (or 512) pages, the dirty limit
won’t be exceeded as long as there are less than 16 (or 512) concurrent
dd’s.

Note that dirty_poll_interval() will mainly be used when (dirty < freerun).
When the dirty pages are floating in range [freerun, limit],
“[PATCH 14/18] writeback: control dirty pause time” will independently
adjust tsk->nr_dirtied_pause to get suitable pause time.

So the sqrt naturally leads to less overheads and more N tolerance for
large memory servers, which have large (thresh-freerun) gaps.

*/
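
The table and the "N dirtiers" argument can be checked with a few lines of C. This is a sketch under assumptions: 4KB pages as in the irb one-liner, and hypothetical helper names isqrt, gap_pages and gap_is_safe. The reasoning it encodes is that n tasks each dirtying at most sqrt(gap) pages between polls add at most n * sqrt(gap) pages, which stays within the gap as long as n does not exceed sqrt(gap).

```c
/* Integer square root by simple iteration; slow but fine for a sketch. */
static unsigned long isqrt(unsigned long n)
{
	unsigned long r = 0;

	while ((r + 1) * (r + 1) <= n)
		r++;
	return r;
}

/* Pages in a gap of mb megabytes, assuming 4KB pages. */
static unsigned long gap_pages(unsigned long mb)
{
	return mb << (20 - 12);
}

/* Nonzero if n tasks polling every sqrt(gap) pages cannot overshoot the gap. */
static int gap_is_safe(unsigned long mb, unsigned long n)
{
	unsigned long gap = gap_pages(mb);

	return n * isqrt(gap) <= gap;
}
```

For a 1MB gap (256 pages) the interval is 16 pages, so 16 concurrent tasks fit exactly and 17 could overshoot.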

void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
{
	/* dirtyable memory is not all of system memory: it is free pages + reclaimable (file) pages */
	const unsigned long available_memory = global_dirtyable_memory();
	unsigned long background;
	unsigned long dirty;
	struct task_struct *tsk;

	if (vm_dirty_bytes)
		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
	else
		dirty = (vm_dirty_ratio * available_memory) / 100;

	if (dirty_background_bytes)
		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
	else
		background = (dirty_background_ratio * available_memory) / 100;

	if (background >= dirty)
		background = dirty / 2;
	tsk = current;
	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
		/* tasks with PF_LESS_THROTTLE set, and real-time tasks, get both thresholds raised by 1/4 */
		background += background / 4;
		dirty += dirty / 4;
	}
	*pbackground = background;
	*pdirty = dirty;
	trace_global_dirty_state(background, dirty);
}

static unsigned long global_dirtyable_memory(void)
{
	unsigned long x;

	/* dirtyable memory is not all of system memory: it is free pages + file pages */
	x = global_page_state(NR_FREE_PAGES);
	x -= min(x, dirty_balance_reserve);

	x += global_page_state(NR_INACTIVE_FILE);
	x += global_page_state(NR_ACTIVE_FILE);

	if (!vm_highmem_is_dirtyable)
		x -= highmem_dirtyable_memory(x);

	return x + 1;	/* Ensure that we never return 0 */
}
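
A worked example of the ratio path may help. The sketch below is userspace C with hypothetical names (struct dirty_limits, limits_from_ratios); it covers only the percentage-based branch and leaves out vm_dirty_bytes and dirty_background_bytes. The ratios used in the note after it are example values, not a statement about any particular system's settings.

```c
struct dirty_limits {
	unsigned long background;  /* background writeback threshold, pages */
	unsigned long dirty;       /* throttling threshold, pages */
};

/*
 * Compute both thresholds from percentages of dirtyable memory.
 * less_throttle models the PF_LESS_THROTTLE / real-time task case.
 */
static struct dirty_limits limits_from_ratios(unsigned long available_pages,
					      unsigned long dirty_ratio,
					      unsigned long background_ratio,
					      int less_throttle)
{
	struct dirty_limits l;

	l.dirty = dirty_ratio * available_pages / 100;
	l.background = background_ratio * available_pages / 100;

	/* The background threshold must stay below the throttling one. */
	if (l.background >= l.dirty)
		l.background = l.dirty / 2;

	if (less_throttle) {
		l.background += l.background / 4;
		l.dirty += l.dirty / 4;
	}
	return l;
}
```

With 1,000,000 dirtyable pages and ratios of 20 and 10, the thresholds come out as 200,000 and 100,000 pages; for a less-throttled task they rise to 250,000 and 125,000.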

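1: If the dirty page count (reclaimable pages plus pages under writeback) is at or below the average of the background threshold and the throttling threshold, the task is not throttled this time; otherwise background writeback is started if it is not already running, and the task is paused.
2: If the number of reclaimable dirty pages is greater than the background writeback threshold, background writeback is triggered.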

static void balance_dirty_pages(struct address_space *mapping,
				unsigned long pages_dirtied)
{
	unsigned long nr_reclaimable;	/* = file_dirty + unstable_nfs */
	unsigned long nr_dirty;	/* = file_dirty + writeback + unstable_nfs */
	unsigned long background_thresh;
	unsigned long dirty_thresh;
	long period;
	long pause;
	long max_pause;
	long min_pause;
	int nr_dirtied_pause;
	bool dirty_exceeded = false;
	unsigned long task_ratelimit;
	unsigned long dirty_ratelimit;
	unsigned long pos_ratio;
	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
	bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT;	/* enforce this bdi's own threshold */
	unsigned long start_time = jiffies;

	for (;;) {
		unsigned long now = jiffies;
		unsigned long uninitialized_var(bdi_thresh);
		unsigned long thresh;
		unsigned long uninitialized_var(bdi_dirty);
		unsigned long dirty;
		unsigned long bg_thresh;

		/*
		 * Unstable writes are a feature of certain networked
		 * filesystems (i.e. NFS) in which data may have been
		 * written to the server's write cache, but has not yet
		 * been flushed to permanent storage.
		 */
		/* global dirty file pages + unstable NFS pages */
		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
					global_page_state(NR_UNSTABLE_NFS);
		/* all global dirty file pages, including those under writeback */
		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);

		global_dirty_limits(&background_thresh, &dirty_thresh);	/* fetch the two thresholds */

		if (unlikely(strictlimit)) {	/* per-bdi accounting */
			bdi_dirty_limits(bdi, dirty_thresh, background_thresh,
					 &bdi_dirty, &bdi_thresh, &bg_thresh);

			dirty = bdi_dirty;
			thresh = bdi_thresh;
		} else {	/* global accounting */
			dirty = nr_dirty;	/* all global dirty file pages, including those under writeback */
			thresh = dirty_thresh;
			bg_thresh = background_thresh;
		}

		/*
		 * Throttle it only when the background writeback cannot
		 * catch-up. This avoids (excessively) small writeouts
		 * when the bdi limits are ramping up in case of !strictlimit.
		 *
		 * In strictlimit case make decision based on the bdi counters
		 * and limits. Small writeouts when the bdi limits are ramping
		 * up are the price we consciously pay for strictlimit-ing.
		 */
		/*
		 * At or below (thresh + bg_thresh) / 2 this task's time is not
		 * taken; above it, background writeback is not keeping up and
		 * this task has to be throttled.
		 */
		if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh)) {	/* no throttling */
			current->dirty_paused_when = now;
			current->nr_dirtied = 0;	/* reset the task's dirtied-page count */
			current->nr_dirtied_pause =
				dirty_poll_interval(dirty, thresh);	/* recompute the task's threshold */
			break;
		}

		if (unlikely(!writeback_in_progress(bdi)))	/* wake the flusher thread that does the actual writeback */
			bdi_start_background_writeback(bdi);

		if (!strictlimit)
			bdi_dirty_limits(bdi, dirty_thresh, background_thresh,
					 &bdi_dirty, &bdi_thresh, NULL);

		/*
		 * With strictlimit, the bdi being over its own threshold is enough.
		 * Otherwise the bdi must be over its threshold and the global
		 * dirty count must be over the global threshold as well.
		 */
		dirty_exceeded = (bdi_dirty > bdi_thresh) &&
				 ((nr_dirty > dirty_thresh) || strictlimit);
		if (dirty_exceeded && !bdi->dirty_exceeded)
			bdi->dirty_exceeded = 1;	/* over the threshold: balancing is triggered sooner from now on */

		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
				     nr_dirty, bdi_thresh, bdi_dirty,
				     start_time);

		dirty_ratelimit = bdi->dirty_ratelimit;
		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
					       background_thresh, nr_dirty,
					       bdi_thresh, bdi_dirty);
		task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
							RATELIMIT_CALC_SHIFT;
		max_pause = bdi_max_pause(bdi, bdi_dirty);
		min_pause = bdi_min_pause(bdi, max_pause,
					  task_ratelimit, dirty_ratelimit,
					  &nr_dirtied_pause);

		if (unlikely(task_ratelimit == 0)) {
			period = max_pause;
			pause = max_pause;
			goto pause;
		}
		period = HZ * pages_dirtied / task_ratelimit;
		pause = period;
		if (current->dirty_paused_when)
			pause -= now - current->dirty_paused_when;
		/*
		 * For less than 1s think time (ext3/4 may block the dirtier
		 * for up to 800ms from time to time on 1-HDD; so does xfs,
		 * however at much less frequency), try to compensate it in
		 * future periods by updating the virtual time; otherwise just
		 * do a reset, as it may be a light dirtier.
		 */
		if (pause < min_pause) {
			trace_balance_dirty_pages(bdi,
						  dirty_thresh,
						  background_thresh,
						  nr_dirty,
						  bdi_thresh,
						  bdi_dirty,
						  dirty_ratelimit,
						  task_ratelimit,
						  pages_dirtied,
						  period,
						  min(pause, 0L),
						  start_time);
			if (pause < -HZ) {
				current->dirty_paused_when = now;
				current->nr_dirtied = 0;
			} else if (period) {
				current->dirty_paused_when += period;
				current->nr_dirtied = 0;
			} else if (current->nr_dirtied_pause <= pages_dirtied)
				current->nr_dirtied_pause += pages_dirtied;
			break;
		}
		if (unlikely(pause > max_pause)) {
			/* for occasional dropped task_ratelimit */
			now += min(pause - max_pause, max_pause);
			pause = max_pause;
		}

pause:
		trace_balance_dirty_pages(bdi,
					  dirty_thresh,
					  background_thresh,
					  nr_dirty,
					  bdi_thresh,
					  bdi_dirty,
					  dirty_ratelimit,
					  task_ratelimit,
					  pages_dirtied,
					  period,
					  pause,
					  start_time);
		__set_current_state(TASK_KILLABLE);
		io_schedule_timeout(pause);	/* the task may be scheduled out here, for at most 200ms */

		current->dirty_paused_when = now + pause;
		current->nr_dirtied = 0;
		current->nr_dirtied_pause = nr_dirtied_pause;

		/*
		 * This is typically equal to (nr_dirty < dirty_thresh) and can
		 * also keep "1000+ dd on a slow USB stick" under control.
		 */
		if (task_ratelimit)
			break;

		/*
		 * In the case of an unresponding NFS server and the NFS dirty
		 * pages exceeds dirty_thresh, give the other good bdi's a pipe
		 * to go through, so that tasks on them still remain responsive.
		 *
		 * In theory 1 page is enough to keep the comsumer-producer
		 * pipe going: the flusher cleans 1 page => the task dirties 1
		 * more page. However bdi_dirty has accounting errors.  So use
		 * the larger and more IO friendly bdi_stat_error.
		 */
		if (bdi_dirty <= bdi_stat_error(bdi))
			break;

		if (fatal_signal_pending(current))
			break;
	}

	if (!dirty_exceeded && bdi->dirty_exceeded)	/* no longer over the threshold: clear the flag */
		bdi->dirty_exceeded = 0;

	if (writeback_in_progress(bdi))	/* writeback is already running: nothing more to do */
		return;

	/*
	 * In laptop mode, we wait until hitting the higher threshold before
	 * starting background writeout, and then write out all the way down
	 * to the lower threshold.  So slow writers cause minimal disk activity.
	 *
	 * In normal mode, we start background writeout at the lower
	 * background_thresh, to keep the amount of dirty memory low.
	 */
	/* laptop (power-saving) mode: as explained above, do not start background writeback here */
	if (laptop_mode)
		return;

	if (nr_reclaimable > background_thresh)	/* reclaimable pages exceed background_thresh: kick off asynchronous writeback */
		bdi_start_background_writeback(bdi);
}
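
The pause arithmetic in the loop can be isolated into a small userspace sketch. Several things here are assumptions for illustration only: HZ_SKETCH = 1000, a fixed-point shift of 10 bits standing in for RATELIMIT_CALC_SHIFT, and the hypothetical function names. The min_pause branch and the bandwidth estimation are left out entirely.

```c
#define HZ_SKETCH 1000                  /* assumption: 1000 jiffies per second */
#define RATELIMIT_CALC_SHIFT_SKETCH 10  /* assumption: pos_ratio has 10 fractional bits */

/* Scale the bdi-wide rate (pages/s) by the fixed-point position ratio. */
static unsigned long task_ratelimit(unsigned long dirty_ratelimit,
				    unsigned long pos_ratio)
{
	return (unsigned long)(((unsigned long long)dirty_ratelimit * pos_ratio)
			       >> RATELIMIT_CALC_SHIFT_SKETCH);
}

/*
 * Pause length in jiffies for a task that dirtied pages_dirtied pages.
 * think_time is the time already elapsed since the last pause ended
 * (now - dirty_paused_when in the kernel code above).
 */
static long pause_jiffies(unsigned long pages_dirtied, unsigned long rate,
			  long think_time, long max_pause)
{
	long period, pause;

	if (rate == 0)          /* no rate allowed at all: sleep the maximum */
		return max_pause;

	period = (long)(HZ_SKETCH * pages_dirtied / rate);
	pause = period - think_time;
	if (pause > max_pause)
		pause = max_pause;
	return pause;
}
```

Under these assumptions a task allowed 1000 pages per second that has just dirtied 32 pages owes a 32-jiffy period; if 10 jiffies have already passed since its last pause it sleeps for 22, and any result above max_pause is clamped.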
