Analysis of balance_dirty_pages_ratelimited

news/2025/4/19 3:56:37 / Article source: https://www.cnblogs.com/linhaostudy/p/18402841

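  • nr_dirtied_pause: the dirty-page threshold of the current task;
  • dirty_exceeded: set when this bdi's dirty page count exceeds the bdi threshold and, in addition, either the global dirty page count exceeds the global threshold or the bdi uses strictlimit; (dirty_exceeded = (bdi_dirty > bdi_thresh) && ((nr_dirty > dirty_thresh) || strictlimit);)
  • bdp_ratelimits: a per-CPU variable counting the pages dirtied on the current CPU;
  • ratelimit_pages: the per-CPU dirty-page threshold.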

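balance_dirty_pages is called when any of the following holds:
1: The number of pages dirtied by the current task reaches ratelimit. (If dirty_exceeded is 0, ratelimit is current->nr_dirtied_pause; if dirty_exceeded is 1, ratelimit is capped at 32KB, which is 8 pages with 4KB pages.)

2: The number of pages dirtied on the current CPU reaches the threshold ratelimit_pages. In this case ratelimit is set to 0, so the call always happens;

3: The number of pages dirtied by the current task, plus the dirtied pages left behind by exited threads, reaches ratelimit.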

void balance_dirty_pages_ratelimited(struct address_space *mapping)
{
    struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
    int ratelimit;
    int *p;

    if (!bdi_cap_account_dirty(bdi))
        return;

    ratelimit = current->nr_dirtied_pause;  /* threshold: the initial value of 32 pages means 128KB */
    if (bdi->dirty_exceeded)                /* if set, lower the trigger threshold so dirty pages are balanced sooner */
        ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));  /* cap the threshold at 32KB (initial value 128KB) */

    preempt_disable();
    /*
     * This prevents one CPU to accumulate too many dirtied pages without
     * calling into balance_dirty_pages(), which can happen when there are
     * 1000+ tasks, all of them start dirtying pages at exactly the same
     * time, hence all honoured too large initial task->nr_dirtied_pause.
     */
    /* i.e. balancing is triggered when either the current task or the current CPU exceeds its threshold */
    p = this_cpu_ptr(&bdp_ratelimits);  /* dirtied-page counter of the current CPU */
    if (unlikely(current->nr_dirtied >= ratelimit))
        /* the task reached its threshold, so the balance call below is certain; restart the per-CPU count */
        *p = 0;
    else if (unlikely(*p >= ratelimit_pages)) {  /* default is 32 pages */
        /*
         * The task is below its threshold but the CPU reached the per-CPU
         * threshold: set ratelimit to 0 so the balance call below is
         * certain, and restart the per-CPU count.
         */
        *p = 0;
        ratelimit = 0;
    }
    /*
     * Pick up the dirtied pages by the exited tasks. This avoids lots of
     * short-lived tasks (eg. gcc invocations in a kernel build) escaping
     * the dirty throttling and livelock other long-run dirtiers.
     */
    p = this_cpu_ptr(&dirty_throttle_leaks);  /* pages left behind by exited threads are accounted here */
    if (*p > 0 && current->nr_dirtied < ratelimit) {
        unsigned long nr_pages_dirtied;

        nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
        *p -= nr_pages_dirtied;
        current->nr_dirtied += nr_pages_dirtied;
    }
    preempt_enable();

    if (unlikely(current->nr_dirtied >= ratelimit))  /* the current task reached its threshold */
        balance_dirty_pages(mapping, current->nr_dirtied);
}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
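The three trigger conditions can be tried out in userspace. The following is only a sketch under stated assumptions: struct gate_state, gate_should_balance and the constants are hypothetical names invented for illustration, the per-CPU counters are modelled as plain struct fields, ratelimit_pages is fixed at the 32-page default mentioned in the listing, and a 4KB page size is assumed.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct gate_state {
    int nr_dirtied;          /* pages dirtied by the task since the last balance */
    int nr_dirtied_pause;    /* per-task threshold */
    int cpu_dirtied;         /* stands in for the per-CPU bdp_ratelimits counter */
    int cpu_leaks;           /* stands in for the per-CPU dirty_throttle_leaks counter */
    bool bdi_dirty_exceeded; /* stands in for bdi->dirty_exceeded */
};

enum {
    RATELIMIT_PAGES_MODEL = 32, /* assumed default of ratelimit_pages */
    PAGE_SHIFT_MODEL = 12       /* assumed 4KB pages */
};

static int min_int(int a, int b)
{
    return a < b ? a : b;
}

/* Returns true when the model would call balance_dirty_pages(). */
bool gate_should_balance(struct gate_state *s)
{
    int ratelimit = s->nr_dirtied_pause;

    /* Condition 1, tightened: cap the threshold at 32KB once the bdi is over its limit. */
    if (s->bdi_dirty_exceeded)
        ratelimit = min_int(ratelimit, 32 >> (PAGE_SHIFT_MODEL - 10));

    if (s->nr_dirtied >= ratelimit) {
        s->cpu_dirtied = 0;
    } else if (s->cpu_dirtied >= RATELIMIT_PAGES_MODEL) {
        /* Condition 2: the CPU counter forces the call by zeroing ratelimit. */
        s->cpu_dirtied = 0;
        ratelimit = 0;
    }

    /* Condition 3: charge pages leaked by exited tasks to the current task. */
    if (s->cpu_leaks > 0 && s->nr_dirtied < ratelimit) {
        int take = min_int(s->cpu_leaks, ratelimit - s->nr_dirtied);

        s->cpu_leaks -= take;
        s->nr_dirtied += take;
    }

    return s->nr_dirtied >= ratelimit;
}
```

With nr_dirtied = 10 and nr_dirtied_pause = 32, the model stays quiet until one of the three conditions is met: setting bdi_dirty_exceeded lowers the threshold to 8 pages, a per-CPU count of 32 zeroes the threshold, and 30 leaked pages top the task up to exactly 32.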

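Normally, periodic writeback and background writeback do the work, and the current task spends no time on it. But when dirty > dirty_freerun_ceiling(thresh, bg_thresh), that is, when the dirty page count exceeds the midpoint (thresh + bg_thresh) / 2 of the throttling threshold and the background writeback threshold, the current task has to sleep for a while so that the writeback threads can make progress.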
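As a small worked illustration of the freerun check, the sketch below models it in userspace. The function names freerun_ceiling and must_throttle are hypothetical; the midpoint formula is the one quoted in the listing's own comment, (thresh + bg_thresh) / 2.

```c
#include <stdbool.h>

/* Midpoint of the two thresholds; models dirty_freerun_ceiling(). */
unsigned long freerun_ceiling(unsigned long thresh, unsigned long bg_thresh)
{
    return (thresh + bg_thresh) / 2;
}

/* True when the dirtying task would have to be paused. */
bool must_throttle(unsigned long dirty, unsigned long thresh,
                   unsigned long bg_thresh)
{
    return dirty > freerun_ceiling(thresh, bg_thresh);
}
```

For example, with thresh = 200000 pages and bg_thresh = 100000 pages the ceiling is 150000 pages: a task that sees 150000 dirty pages still runs freely, while one that sees 150001 is throttled.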

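When dirty <= dirty_freerun_ceiling(thresh, bg_thresh), the task is not throttled, but nr_dirtied_pause is still adjusted dynamically so that the task polls at a suitable interval. The adjustment policy is: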

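static unsigned long dirty_poll_interval(unsigned long dirty,
                                         unsigned long thresh)
{
    if (thresh > dirty)  /* below the threshold: about sqrt(thresh - dirty), rounded down to a power of two */
        return 1UL << (ilog2(thresh - dirty) >> 1);

    return 1;  /* dirty pages at or above the threshold: balance after every single page */
}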
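To see what the shift expression evaluates to, here is a userspace sketch of the same formula. ilog2_model is a hypothetical stand-in for the kernel's ilog2() (floor of log base 2), and poll_interval is likewise an illustrative name.

```c
/* floor(log2(x)) for x > 0; userspace stand-in for the kernel's ilog2(). */
static unsigned int ilog2_model(unsigned long x)
{
    unsigned int r = 0;

    while (x >>= 1)
        r++;
    return r;
}

/* Same arithmetic as dirty_poll_interval() in the listing above. */
unsigned long poll_interval(unsigned long dirty, unsigned long thresh)
{
    if (thresh > dirty)
        return 1UL << (ilog2_model(thresh - dirty) >> 1);

    return 1;
}
```

Halving the exponent makes the result a power-of-two approximation of the square root of the gap: a gap of 256 pages gives 16, a gap of 262144 pages gives 512, and a gap of 512 pages also gives 16 because the result is rounded down.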

As for why it is done this way, see the following explanation:
/*
Ideally if we know there are N dirtiers, it’s safe to let each task
poll at (thresh-dirty)/N without exceeding the dirty limit.

However we neither know the current N, nor is sure whether it will
rush high at next second. So sqrt is used to tolerate larger N on
increased (thresh-dirty) gap:

irb> 0.upto(10) { |i| mb=2**i; pages=mb<<(20-12); printf "%4d\t%4d\n", mb, Math.sqrt(pages)}

   1      16
   2      22
   4      32
   8      45
  16      64
  32      90
  64     128
 128     181
 256     256
 512     362
1024     512

The above table means, given 1MB (or 1GB) gap and the dd tasks polling
balance_dirty_pages() on every 16 (or 512) pages, the dirty limit
won’t be exceeded as long as there are less than 16 (or 512) concurrent
dd’s.

Note that dirty_poll_interval() will mainly be used when (dirty < freerun).
When the dirty pages are floating in range [freerun, limit],
“[PATCH 14/18] writeback: control dirty pause time” will independently
adjust tsk->nr_dirtied_pause to get suitable pause time.

So the sqrt naturally leads to less overheads and more N tolerance for
large memory servers, which have large (thresh-freerun) gaps.

*/
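The argument in the quoted comment can be checked numerically. The sketch below assumes the worst case is N tasks each dirtying a full polling interval of sqrt(gap) pages before any of them polls again; isqrt_ul, worst_case_overshoot and stays_within_gap are hypothetical helper names, and the gap is given in pages.

```c
/* Integer square root: the largest r with r * r <= n. */
unsigned long isqrt_ul(unsigned long n)
{
    unsigned long r = 0;

    while ((r + 1) * (r + 1) <= n)
        r++;
    return r;
}

/* Pages dirtied if every task uses up one full polling interval. */
unsigned long worst_case_overshoot(unsigned long gap_pages,
                                   unsigned long n_tasks)
{
    return n_tasks * isqrt_ul(gap_pages);
}

/* Nonzero when the worst case still fits inside the (thresh - dirty) gap. */
int stays_within_gap(unsigned long gap_pages, unsigned long n_tasks)
{
    return worst_case_overshoot(gap_pages, n_tasks) <= gap_pages;
}
```

With a 1MB gap (256 pages of 4KB) the interval is 16 pages, so 16 tasks dirty at most 256 pages and stay within the gap, while 17 tasks can overshoot it. With a 1GB gap (262144 pages) the interval is 512 pages and the same bound holds for 512 tasks.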

void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
{
    /* dirtyable memory is not all system memory, but free pages + reclaimable pages (file pages) */
    const unsigned long available_memory = global_dirtyable_memory();
    unsigned long background;
    unsigned long dirty;
    struct task_struct *tsk;

    if (vm_dirty_bytes)
        dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
    else
        dirty = (vm_dirty_ratio * available_memory) / 100;

    if (dirty_background_bytes)
        background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
    else
        background = (dirty_background_ratio * available_memory) / 100;

    if (background >= dirty)
        background = dirty / 2;
    tsk = current;
    if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
        /* tasks with PF_LESS_THROTTLE set, and real-time tasks, get both thresholds raised by 1/4 */
        background += background / 4;
        dirty += dirty / 4;
    }
    *pbackground = background;
    *pdirty = dirty;
    trace_global_dirty_state(background, dirty);
}

static unsigned long global_dirtyable_memory(void)
{
    unsigned long x;

    /* dirtyable memory is not all system memory, but free pages + file pages */
    x = global_page_state(NR_FREE_PAGES);
    x -= min(x, dirty_balance_reserve);

    x += global_page_state(NR_INACTIVE_FILE);
    x += global_page_state(NR_ACTIVE_FILE);

    if (!vm_highmem_is_dirtyable)
        x -= highmem_dirtyable_memory(x);

    return x + 1;  /* Ensure that we never return 0 */
}
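A worked example makes the threshold arithmetic concrete. This is a sketch of the ratio-based path only (the vm_dirty_bytes and dirty_background_bytes branches are left out); struct dirty_limits, compute_limits and the less_throttle flag are hypothetical names, and the input numbers in the note below are made up for illustration.

```c
#include <stdbool.h>

/* Hypothetical result type: both thresholds, in pages. */
struct dirty_limits {
    unsigned long background;
    unsigned long dirty;
};

/*
 * Ratio-only model of global_dirty_limits(). less_throttle stands for
 * "PF_LESS_THROTTLE is set or the task is a real-time task".
 */
struct dirty_limits compute_limits(unsigned long available_pages,
                                   unsigned long dirty_ratio,
                                   unsigned long background_ratio,
                                   bool less_throttle)
{
    struct dirty_limits l;

    l.dirty = dirty_ratio * available_pages / 100;
    l.background = background_ratio * available_pages / 100;

    /* The background threshold must stay below the dirty threshold. */
    if (l.background >= l.dirty)
        l.background = l.dirty / 2;

    /* Privileged tasks get both thresholds raised by one quarter. */
    if (less_throttle) {
        l.background += l.background / 4;
        l.dirty += l.dirty / 4;
    }
    return l;
}
```

With 1000000 dirtyable pages, a dirty ratio of 20 and a background ratio of 10, the thresholds come out as 200000 and 100000 pages; for a less-throttled task they become 250000 and 125000. If the ratios are misconfigured as 10 and 20, the background threshold is clamped to half of the dirty threshold, 50000 pages.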

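1: If the number of reclaimable pages plus pages under writeback is at most the mean of the background threshold and the throttling threshold, the task is not throttled this time and the loop exits; otherwise background writeback is started (if it is not already running) and the task is paused.
2: On exit, if the number of reclaimable dirty pages is greater than the background writeback threshold, background writeback is triggered (unless writeback is already in progress or laptop mode is enabled);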

static void balance_dirty_pages(struct address_space *mapping,
                                unsigned long pages_dirtied)
{
    unsigned long nr_reclaimable;  /* = file_dirty + unstable_nfs */
    unsigned long nr_dirty;        /* = file_dirty + writeback + unstable_nfs */
    unsigned long background_thresh;
    unsigned long dirty_thresh;
    long period;
    long pause;
    long max_pause;
    long min_pause;
    int nr_dirtied_pause;
    bool dirty_exceeded = false;
    unsigned long task_ratelimit;
    unsigned long dirty_ratelimit;
    unsigned long pos_ratio;
    struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
    bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT;  /* throttle against this bdi's own limit */
    unsigned long start_time = jiffies;

    for (;;) {
        unsigned long now = jiffies;
        unsigned long uninitialized_var(bdi_thresh);
        unsigned long thresh;
        unsigned long uninitialized_var(bdi_dirty);
        unsigned long dirty;
        unsigned long bg_thresh;

        /*
         * Unstable writes are a feature of certain networked
         * filesystems (i.e. NFS) in which data may have been
         * written to the server's write cache, but has not yet
         * been flushed to permanent storage.
         */
        /* global: dirty file pages + unstable NFS pages */
        nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
                         global_page_state(NR_UNSTABLE_NFS);
        /* global: all dirty file pages, including those under writeback */
        nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);

        global_dirty_limits(&background_thresh, &dirty_thresh);  /* fetch the two thresholds */

        if (unlikely(strictlimit)) {  /* per-bdi accounting */
            bdi_dirty_limits(bdi, dirty_thresh, background_thresh,
                             &bdi_dirty, &bdi_thresh, &bg_thresh);

            dirty = bdi_dirty;
            thresh = bdi_thresh;
        } else {                      /* global accounting */
            dirty = nr_dirty;         /* global: all dirty file pages, including those under writeback */
            thresh = dirty_thresh;
            bg_thresh = background_thresh;
        }

        /*
         * Throttle it only when the background writeback cannot
         * catch-up. This avoids (excessively) small writeouts
         * when the bdi limits are ramping up in case of !strictlimit.
         *
         * In strictlimit case make decision based on the bdi counters
         * and limits. Small writeouts when the bdi limits are ramping
         * up are the price we consciously pay for strictlimit-ing.
         */
        /*
         * At or below the midpoint of the throttling and background
         * thresholds, the task's time is not consumed; above it, background
         * writeback is not keeping up and the task has to pay with its own time.
         */
        if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh)) {  /* (thresh + bg_thresh) / 2: no throttling */
            current->dirty_paused_when = now;
            current->nr_dirtied = 0;  /* reset the task's dirtied-page count */
            current->nr_dirtied_pause =
                dirty_poll_interval(dirty, thresh);  /* recompute the task's dirty-page threshold */
            break;
        }

        if (unlikely(!writeback_in_progress(bdi)))  /* wake up the actual writeback thread */
            bdi_start_background_writeback(bdi);

        if (!strictlimit)
            bdi_dirty_limits(bdi, dirty_thresh, background_thresh,
                             &bdi_dirty, &bdi_thresh, NULL);

        /*
         * With strictlimit, exceeding the bdi threshold alone is enough.
         * Otherwise the bdi must exceed its threshold and the global dirty
         * count must also exceed the global threshold.
         */
        dirty_exceeded = (bdi_dirty > bdi_thresh) &&
                         ((nr_dirty > dirty_thresh) || strictlimit);
        if (dirty_exceeded && !bdi->dirty_exceeded)
            bdi->dirty_exceeded = 1;  /* over the limit: later calls are rate-limited more tightly */

        bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
                             nr_dirty, bdi_thresh, bdi_dirty,
                             start_time);

        dirty_ratelimit = bdi->dirty_ratelimit;
        pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
                                       background_thresh, nr_dirty,
                                       bdi_thresh, bdi_dirty);
        task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
                         RATELIMIT_CALC_SHIFT;
        max_pause = bdi_max_pause(bdi, bdi_dirty);
        min_pause = bdi_min_pause(bdi, max_pause,
                                  task_ratelimit, dirty_ratelimit,
                                  &nr_dirtied_pause);

        if (unlikely(task_ratelimit == 0)) {
            period = max_pause;
            pause = max_pause;
            goto pause;
        }
        period = HZ * pages_dirtied / task_ratelimit;
        pause = period;
        if (current->dirty_paused_when)
            pause -= now - current->dirty_paused_when;
        /*
         * For less than 1s think time (ext3/4 may block the dirtier
         * for up to 800ms from time to time on 1-HDD; so does xfs,
         * however at much less frequency), try to compensate it in
         * future periods by updating the virtual time; otherwise just
         * do a reset, as it may be a light dirtier.
         */
        if (pause < min_pause) {
            trace_balance_dirty_pages(bdi,
                                      dirty_thresh,
                                      background_thresh,
                                      nr_dirty,
                                      bdi_thresh,
                                      bdi_dirty,
                                      dirty_ratelimit,
                                      task_ratelimit,
                                      pages_dirtied,
                                      period,
                                      min(pause, 0L),
                                      start_time);
            if (pause < -HZ) {
                current->dirty_paused_when = now;
                current->nr_dirtied = 0;
            } else if (period) {
                current->dirty_paused_when += period;
                current->nr_dirtied = 0;
            } else if (current->nr_dirtied_pause <= pages_dirtied)
                current->nr_dirtied_pause += pages_dirtied;
            break;
        }
        if (unlikely(pause > max_pause)) {
            /* for occasional dropped task_ratelimit */
            now += min(pause - max_pause, max_pause);
            pause = max_pause;
        }

pause:
        trace_balance_dirty_pages(bdi,
                                  dirty_thresh,
                                  background_thresh,
                                  nr_dirty,
                                  bdi_thresh,
                                  bdi_dirty,
                                  dirty_ratelimit,
                                  task_ratelimit,
                                  pages_dirtied,
                                  period,
                                  pause,
                                  start_time);
        __set_current_state(TASK_KILLABLE);
        io_schedule_timeout(pause);  /* the task may be scheduled out here, for at most 200ms */

        current->dirty_paused_when = now + pause;
        current->nr_dirtied = 0;
        current->nr_dirtied_pause = nr_dirtied_pause;

        /*
         * This is typically equal to (nr_dirty < dirty_thresh) and can
         * also keep "1000+ dd on a slow USB stick" under control.
         */
        if (task_ratelimit)
            break;

        /*
         * In the case of an unresponding NFS server and the NFS dirty
         * pages exceeds dirty_thresh, give the other good bdi's a pipe
         * to go through, so that tasks on them still remain responsive.
         *
         * In theory 1 page is enough to keep the consumer-producer
         * pipe going: the flusher cleans 1 page => the task dirties 1
         * more page. However bdi_dirty has accounting errors.  So use
         * the larger and more IO friendly bdi_stat_error.
         */
        if (bdi_dirty <= bdi_stat_error(bdi))
            break;

        if (fatal_signal_pending(current))
            break;
    }

    if (!dirty_exceeded && bdi->dirty_exceeded)  /* no longer over the limit: clear the flag */
        bdi->dirty_exceeded = 0;

    if (writeback_in_progress(bdi))  /* writeback is already running: nothing more to do */
        return;

    /*
     * In laptop mode, we wait until hitting the higher threshold before
     * starting background writeout, and then write out all the way down
     * to the lower threshold.  So slow writers cause minimal disk activity.
     *
     * In normal mode, we start background writeout at the lower
     * background_thresh, to keep the amount of dirty memory low.
     */
    /* laptop (power-saving) mode: background writeback is not started here, as explained above */
    if (laptop_mode)
        return;

    if (nr_reclaimable > background_thresh)  /* reclaimable pages exceed background_thresh: kick asynchronous writeback */
        bdi_start_background_writeback(bdi);
}
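The pause arithmetic in the loop can be followed with concrete numbers. The sketch below is a simplified userspace model: compute_pause and HZ_MODEL are hypothetical names, HZ is assumed to be 1000 so that one jiffy is 1ms, and the min_pause branch (where the kernel skips the sleep and compensates in later periods) is deliberately left out, so a small or negative result here simply means "no sleep needed".

```c
enum { HZ_MODEL = 1000 }; /* assumed tick rate: 1 jiffy = 1ms */

/*
 * Simplified model of the pause computation in balance_dirty_pages().
 * task_ratelimit is in pages per second; times are in jiffies.
 */
long compute_pause(unsigned long pages_dirtied,
                   unsigned long task_ratelimit,
                   unsigned long now,
                   unsigned long paused_when,
                   long max_pause)
{
    long period;
    long pause;

    /* A zero rate means the task must wait as long as allowed. */
    if (task_ratelimit == 0)
        return max_pause;

    /* Time the task "owes" for the pages it dirtied at the allowed rate. */
    period = (long)(HZ_MODEL * pages_dirtied / task_ratelimit);
    pause = period;

    /* Subtract the think time already spent since the last pause ended. */
    if (paused_when)
        pause -= (long)(now - paused_when);

    if (pause > max_pause)
        pause = max_pause;
    return pause;
}
```

For a task that dirtied 32 pages at an allowed rate of 1000 pages per second, the period is 32 jiffies; if 10 jiffies have already passed since its last pause ended, it sleeps for the remaining 22. If 100 jiffies have passed, the result is negative and no sleep is needed.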
