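Thermal Subsystem Overview

The thermal subsystem is the kernel's thermal management framework: a software temperature-control solution that works with the temperature sensors inside an IC to keep the chip temperature in check and the system stable. It is mostly used to manage the main heat-producing blocks inside the IC, such as the CPU and GPU.

A sensor reports the device temperature to the thermal subsystem. Based on the temperature of the object being controlled, the subsystem decides whether to trigger the corresponding cooling action, such as capping the maximum CPU frequency or limiting the number of online CPU cores, and thereby cools the device.

Key concepts in the thermal subsystem:

  1. thermal zone
    Represents one thermal management region. It can be thought of as a virtual temperature sensor, and it only becomes useful once a physical sensor is associated with it. A thermal zone can be associated with at most one sensor, but that sensor may be a combination of several hardware sensors.

  2. trip point
    A trigger point, maintained by the thermal zone. Each thermal zone can maintain multiple trip points. A trip point carries the following information:

    • temp: the trigger temperature. The trip point is triggered when the temperature reaches this value.
    • type: the trip point type. Following PC cooling conventions, there are four types: passive, active, hot, and critical.

  3. cooling device
    The driver that actually applies cooling measures to the system; it is the executor of thermal control. A cooling device maintains a cooling level (state); in general, a higher state means a stronger cooling demand. The cooling device acts according to the requested level.
    A cooling device only acts on the state it is given. Computing that state is the job of the thermal governor.

  4. thermal governor
    The thermal control policy: once the trigger temperature is exceeded, it decides how to compute a suitable cooling state for each cooling device.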
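
To make the relationship between a thermal zone and its trip points concrete, here is a minimal user-space sketch. The types and the function name (trip_point, zone, count_triggered_trips) are hypothetical and chosen for illustration; they are not the kernel's structures, which are covered later.

```c
#include <assert.h>
#include <stddef.h>

/* Trip point types, in the order listed in the text. */
enum trip_type { TRIP_PASSIVE, TRIP_ACTIVE, TRIP_HOT, TRIP_CRITICAL };

/* A trip point: a type plus the temperature that triggers it. */
struct trip_point {
    enum trip_type type;
    int temp_mc;                    /* trigger temperature, millidegrees Celsius */
};

/* A toy thermal zone: one temperature reading and several trip points. */
struct zone {
    int temp_mc;                    /* last temperature reported by the sensor */
    const struct trip_point *trips;
    size_t num_trips;
};

/* Count the trip points whose trigger temperature has been reached. */
size_t count_triggered_trips(const struct zone *z)
{
    size_t i, n = 0;

    assert(z != NULL);
    for (i = 0; i < z->num_trips; i++)
        if (z->temp_mc >= z->trips[i].temp_mc)
            n++;
    return n;
}
```

In this toy model a zone at 75000 with a passive trip at 70000 and a critical trip at 96000 has exactly one triggered trip point; a governor would then pick the cooling state for the devices bound to it.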

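Thermal Software Framework

  • thermal zone device: the device that obtains the temperature. One thermal zone device represents one thermal zone and is usually backed by a hardware sensor.
  • cooling device: the device that controls the temperature. Cooling devices are either active or passive: a fan is an active cooling device, whereas the CPU and GPU cool down by lowering their frequency and are passive cooling devices.
  • thermal governor: the thermal control policy, which computes the cooling state of the cooling devices.

Framework Initialization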

//drivers/thermal/thermal_core.c
thermal_init
    thermal_register_governors                // register the governors (1)
        for_each_governor_table(governor)
            thermal_register_governor         // add every available governor to the global list thermal_governor_list and initialize def_governor
    class_register(&thermal_class);           // register /sys/class/thermal
    genetlink_init                            // register Generic Netlink (not studied here)
    of_parse_thermal_zones                    // parse the device tree
    register_pm_notifier(&thermal_pm_nb);     // register the PM notifier

(1) When the governors are registered, the governor table has already been fixed at build time. It is delimited by __governor_thermal_table and __governor_thermal_table_end, as shown below.

//System.map

xxxxxxxxxxxxxxxx T __governor_thermal_table
xxxxxxxxxxxxxxxx T __irqchip_acpi_probe_table_end
xxxxxxxxxxxxxxxx t __thermal_table_entry_thermal_gov_fair_share
xxxxxxxxxxxxxxxx T __timer_acpi_probe_table
xxxxxxxxxxxxxxxx T __timer_acpi_probe_table_end
xxxxxxxxxxxxxxxx t __thermal_table_entry_thermal_gov_bang_bang
xxxxxxxxxxxxxxxx t __thermal_table_entry_thermal_gov_step_wise
xxxxxxxxxxxxxxxx t __thermal_table_entry_thermal_gov_user_space
xxxxxxxxxxxxxxxx t __thermal_table_entry_thermal_gov_power_allocator
xxxxxxxxxxxxxxxx T __earlycon_table
xxxxxxxxxxxxxxxx T __governor_thermal_table_end

So the governors are collected from a dedicated section. Each governor is in fact declared with the following macros:

//drivers/thermal/thermal_core.h
#define THERMAL_TABLE_ENTRY(table, name) \
    static typeof(name) *__thermal_table_entry_##name \
    __used __section(__##table##_thermal_table) = &name

#define THERMAL_GOVERNOR_DECLARE(name) THERMAL_TABLE_ENTRY(governor, name)

The default governor is selected by the kernel configuration, for example CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE.
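
The section-based table can be tried out in user space. The sketch below is not the kernel code: it assumes a GNU-style ELF toolchain (GCC or Clang with GNU ld or lld), which defines __start_<section> and __stop_<section> symbols for any section whose name is a valid C identifier. The section name governor_table, the macro GOVERNOR_DECLARE, and the functions find_governor and governor_count are hypothetical names for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct governor {
    const char *name;
};

/* Place a pointer to the governor in the ELF section "governor_table". */
#define GOVERNOR_DECLARE(gov) \
    static const struct governor *table_entry_##gov \
    __attribute__((used, section("governor_table"))) = &gov

/* Section bounds provided by the linker (assumption: GNU ld or lld on ELF). */
extern const struct governor *__start_governor_table[];
extern const struct governor *__stop_governor_table[];

static const struct governor gov_step_wise = { "step_wise" };
static const struct governor gov_user_space = { "user_space" };
GOVERNOR_DECLARE(gov_step_wise);
GOVERNOR_DECLARE(gov_user_space);

/* Walk the table and return the governor with the given name, or NULL. */
const struct governor *find_governor(const char *name)
{
    const struct governor **p;

    assert(name != NULL);
    for (p = __start_governor_table; p < __stop_governor_table; p++)
        if (strcmp((*p)->name, name) == 0)
            return *p;
    return NULL;
}

/* Number of entries the linker collected into the table. */
size_t governor_count(void)
{
    return (size_t)(__stop_governor_table - __start_governor_table);
}
```

The point of the pattern is that adding a governor needs no change to a central list: declaring it with the macro is enough for it to appear in the table at link time.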

thermal zone device

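Where the Device Comes From

Take a typical PC as an example. The ACPI tables supplied by the firmware usually contain a "_TZ" namespace, like this: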

Scope (_TZ)
{
    ThermalZone (THM0)
    {
        Method (_TMP, 0, NotSerialized) // _TMP: Temperature
        {
            Local0 = \_SB.THSE
            Local1 = CCNT (Local0)
            Local3 = \_SB.SRAM
            If ((Local3 == 0x0005000180150040))
            {
                Return (C2K (Local1))
            }
            Else
            {
                Return (C2K ((Local1 - 0x0A)))
            }
        }

        Method (_CRT, 0, NotSerialized) // _CRT: Critical Temperature
        {
            Return (C2K (0x60))
        }

        Method (_PSL, 0, Serialized) // _PSL: Passive List
        {
            Return (Package (0x04)
            {
                \_SB.C000,
                \_SB.C001,
                \_SB.C002,
                \_SB.C003
            })
        }

        Method (_PSV, 0, NotSerialized) // _PSV: Passive Temperature
        {
            Return (C2K (0x46))
        }

        Name (_TC1, Zero) // _TC1: Thermal Constant 1
        Name (_TC2, 0x32) // _TC2: Thermal Constant 2
        Name (_TZP, 0x012C) // _TZP: Thermal Zone Polling
        Name (_TSP, 0x012C) // _TSP: Thermal Sampling Period
    }

    Method (CCNT, 1, NotSerialized)
    {
        Local0 = ((Arg0 & 0xFFFF) * 0x02DB)
        Local1 = (Local0 / 0x4000)
        Local2 = (Local1 - 0x0111)
        Return (Local2)
    }

    Method (C2K, 1, NotSerialized)
    {
        Local0 = ((Arg0 * 0x0A) + 0x0AAC)
        If ((Local0 <= 0x0AAC))
        {
            Local0 = 0x0BB8
        }

        If ((Local0 > 0x0FAC))
        {
            Local0 = 0x0BB8
        }

        Return (Local0)
    }
}
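  1. _TMP: read the temperature
    Implemented by reading the register that belongs to the sensor.
  2. _CRT: critical trip point
    The system shuts down when this temperature is reached.
  3. _PSL: passive cooling device list
    The cooling devices used when passive cooling is applied.
  4. _PSV: passive cooling trip point
    The configured policy is applied when this temperature is reached.
  5. _TC1, _TC2
    Parameters used to compute the temperature trend during passive cooling.
  6. _TZP: temperature polling period
    While the temperature is below the passive trip point, the device temperature is polled at this interval. The unit is tenths of a second; 0 means polling is not needed (the hardware can raise asynchronous notifications).
  7. _TSP: passive cooling polling period
    Once the temperature exceeds the passive trip point, the device temperature is polled at this interval. The unit is tenths of a second.

For details see the ACPI Specification, chapter 11, Thermal Management.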
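
The numbers in the listing are easier to read once converted. ACPI thermal objects report temperatures in tenths of a kelvin, and the C2K method above builds such values from degrees Celsius. The following sketch ports C2K to C and converts the results back. The function names are hypothetical, and the conversion deliberately reuses the 2732 offset from the ASL rather than the exact 273.15 K.

```c
#include <assert.h>

/* Port of the ASL method C2K: degrees Celsius to tenths of a kelvin.
 * Out-of-range results are replaced by 0x0BB8 (3000, about 27 C). */
int c2k(int celsius)
{
    int dk = celsius * 10 + 0x0AAC;     /* 0x0AAC = 2732 */

    if (dk <= 0x0AAC || dk > 0x0FAC)    /* 0x0FAC = 4012 */
        dk = 0x0BB8;
    return dk;
}

/* Tenths of a kelvin to millidegrees Celsius, using the same 2732 offset. */
int decikelvin_to_millicelsius(int dk)
{
    return (dk - 2732) * 100;
}

/* _TZP and _TSP are given in tenths of a second; convert to milliseconds. */
int tenths_of_second_to_ms(int tenths)
{
    assert(tenths >= 0);
    return tenths * 100;
}
```

With these helpers, _CRT = C2K(0x60) corresponds to 96 C, _PSV = C2K(0x46) to 70 C, and _TZP = _TSP = 0x012C to a 30-second polling interval.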

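At boot the kernel parses the ACPI DSDT, and ThermalZone (THM0) under the "_TZ" namespace is registered on acpi_bus_type with the ID ACPI_THERMAL_HID.

Parsing path: acpi_init->acpi_scan_init->acpi_bus_scan->acpi_walk_namespace

Registering the thermal zone device

The flow is as follows: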

//drivers/acpi/thermal.c

static struct thermal_zone_device_ops acpi_thermal_zone_ops = {
    .bind = acpi_thermal_bind_cooling_device,
    .unbind = acpi_thermal_unbind_cooling_device,
    .get_temp = thermal_get_temp,
    .get_mode = thermal_get_mode,
    .set_mode = thermal_set_mode,
    .get_trip_type = thermal_get_trip_type,
    .get_trip_temp = thermal_get_trip_temp,
    .get_crit_temp = thermal_get_crit_temp,
    .get_trend = thermal_get_trend,
    .notify = thermal_notify,
};

static const struct acpi_device_id thermal_device_ids[] = {
    {ACPI_THERMAL_HID, 0},
    {"", 0},
};

static struct acpi_driver acpi_thermal_driver = {
    .name = "thermal",
    .class = ACPI_THERMAL_CLASS,
    .ids = thermal_device_ids,
    .ops = {
        .add = acpi_thermal_add,
        .remove = acpi_thermal_remove,
        .notify = acpi_thermal_notify,
    },
    .drv.pm = &acpi_thermal_pm,
};

acpi_thermal_init
    acpi_bus_register_driver(&acpi_thermal_driver);    // driver registration (1)
        acpi_thermal_add
            acpi_thermal_register_thermal_zone(tz);
                thermal_zone_device_register("acpitz", trips, 0, tz,
                                             &acpi_thermal_zone_ops, NULL,
                                             tz->trips.passive.tsp*100,
                                             tz->polling_frequency*100);

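(1) acpi_bus_register_driver registers the driver on the acpi_bus_type bus. Through its ids it eventually matches the ThermalZone (THM0) device registered earlier.

The following sections cover the structures and APIs related to the thermal zone device in the thermal subsystem.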

struct thermal_zone_device

//include/linux/thermal.h
/**
* struct thermal_zone_device - structure for a thermal zone
* @id: unique id number for each thermal zone
* @type: the thermal zone device type
* @device: &struct device for this thermal zone
* @trip_temp_attrs: attributes for trip points for sysfs: trip temperature
* @trip_type_attrs: attributes for trip points for sysfs: trip type
* @trip_hyst_attrs: attributes for trip points for sysfs: trip hysteresis
* @devdata: private pointer for device private data
* @trips: number of trip points the thermal zone supports
* @trips_disabled; bitmap for disabled trips
* @passive_delay: number of milliseconds to wait between polls when
* performing passive cooling.
* @polling_delay: number of milliseconds to wait between polls when
* checking whether trip points have been crossed (0 for
* interrupt driven systems)
* @temperature: current temperature. This is only for core code,
* drivers should use thermal_zone_get_temp() to get the
* current temperature
* @last_temperature: previous temperature read
* @emul_temperature: emulated temperature when using CONFIG_THERMAL_EMULATION
* @passive: 1 if you've crossed a passive trip point, 0 otherwise.
* @prev_low_trip: the low current temperature if you've crossed a passive
trip point.
* @prev_high_trip: the above current temperature if you've crossed a
passive trip point.
* @forced_passive: If > 0, temperature at which to switch on all ACPI
* processor cooling devices. Currently only used by the
* step-wise governor.
* @need_update: if equals 1, thermal_zone_device_update needs to be invoked.
* @ops: operations this &thermal_zone_device supports
* @tzp: thermal zone parameters
* @governor: pointer to the governor for this thermal zone
* @governor_data: private pointer for governor data
* @thermal_instances: list of &struct thermal_instance of this thermal zone
* @ida: &struct ida to generate unique id for this zone's cooling
* devices
* @lock: lock to protect thermal_instances list
* @node: node in thermal_tz_list (in thermal_core.c)
* @poll_queue: delayed work for polling
* @notify_event: Last notification event
*/
struct thermal_zone_device {
    int id;
    char type[THERMAL_NAME_LENGTH];
    struct device device;
    struct attribute_group trips_attribute_group;
    struct thermal_attr *trip_temp_attrs;
    struct thermal_attr *trip_type_attrs;
    struct thermal_attr *trip_hyst_attrs;
    void *devdata;
    int trips;                  // number of trip points
    unsigned long trips_disabled; /* bitmap for disabled trips */
    int passive_delay;          // polling period while passive cooling is active (ms)
    int polling_delay;          // polling period while no trip temperature is exceeded (ms)
    int temperature;            // current temperature
    int last_temperature;       // previous temperature
    int emul_temperature;       // emulated temperature; with CONFIG_THERMAL_EMULATION the value can be set directly through sysfs
    int passive;                // whether passive cooling is currently active (important)
    int prev_low_trip;          //
    int prev_high_trip;         // (1)
    unsigned int forced_passive;
    atomic_t need_update;       // flags whether temperature monitoring has been started
    struct thermal_zone_device_ops *ops;    // callbacks
    struct thermal_zone_params *tzp;        // parameters passed in by the caller at registration
    struct thermal_governor *governor;      // the governor in use
    void *governor_data;
    struct list_head thermal_instances;     // structures that link to cooling devices (2)
    struct ida ida;
    struct mutex lock;
    struct list_head node;
    struct delayed_work poll_queue;
    enum thermal_notify_event notify_event;
};

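(1) prev_low_trip and prev_high_trip are used to change the trip temperatures dynamically; see thermal_zone_set_trips in drivers/thermal/thermal_helpers.c. This requires the ops->set_trips and ops->get_trip_hyst callbacks and is not covered in this article.

See ACPI Specification: 11.1.2 Dynamically Changing Cooling Temperature Trip Points

(2) The thermal subsystem uses the following structure to link a thermal zone device with a cooling device: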

struct thermal_instance {
    int id;
    char name[THERMAL_NAME_LENGTH];
    struct thermal_zone_device *tz;
    struct thermal_cooling_device *cdev;
    int trip;
    bool initialized;
    unsigned long upper; /* Highest cooling state for this trip point */
    unsigned long lower; /* Lowest cooling state for this trip point */
    unsigned long target; /* expected cooling state */
    char attr_name[THERMAL_NAME_LENGTH];
    struct device_attribute attr;
    char weight_attr_name[THERMAL_NAME_LENGTH];
    struct device_attribute weight_attr;
    struct list_head tz_node; /* node in tz->thermal_instances */
    struct list_head cdev_node; /* node in cdev->thermal_instances */
    unsigned int weight; /* The weight of the cooling device */
};

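The tz_node member is the node linked into thermal_zone_device->thermal_instances; the "Binding" section below covers this.

thermal_zone_device_register

This section walks through the registration flow of thermal_zone_device_register. Thermal control has to monitor the device temperature in a loop; that monitoring is started at registration time and then repeats.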

//drivers/thermal/thermal_core.c
thermal_zone_device_register
    dev_set_name(&tz->device, "thermal_zone%d", tz->id);
    device_register(&tz->device);
    thermal_set_governor                            // select the governor
    list_add_tail(&tz->node, &thermal_tz_list);     // add to the global list
    bind_tz(tz);                                    // bind cooling devices to this zone; see the "Binding" section
        tz->ops->bind(tz, pos);
    INIT_DELAYED_WORK(&tz->poll_queue, thermal_zone_device_check);
    thermal_zone_device_reset(tz);
        tz->passive = 0;                            // important: records whether the passive trip temperature has been exceeded
        thermal_zone_device_init
    thermal_zone_device_update(tz, THERMAL_EVENT_UNSPECIFIED);  // called once when the device is registered
        update_temperature(tz);                     // refresh the temperature
        thermal_zone_set_trips(tz);                 // update trip points dynamically; not supported in this example
        handle_thermal_trip                         // act according to the trip type; runs once for every trip point in the zone
            if (type == THERMAL_TRIP_CRITICAL || type == THERMAL_TRIP_HOT)
                handle_critical_trips(tz, trip, type);  // for THERMAL_TRIP_CRITICAL, exceeding the trip temperature shuts the system down
            else
                handle_non_critical_trips(tz, trip);    // run the governor logic; see the "governor" section
            monitor_thermal_zone                    // schedule the next temperature check
                if (tz->passive)                    // passive_delay is used for polling only once passive cooling is in effect
                    thermal_zone_device_set_polling(tz, tz->passive_delay);
                else if (tz->polling_delay)         // without passive cooling, poll with polling_delay
                    thermal_zone_device_set_polling(tz, tz->polling_delay);
                else
                    thermal_zone_device_set_polling(tz, 0);    // (1)

(1) thermal_zone_device_set_polling is implemented as follows:

//drivers/thermal/thermal_core.c
INIT_DELAYED_WORK(&tz->poll_queue, thermal_zone_device_check);

static void thermal_zone_device_check(struct work_struct *work)
    thermal_zone_device_update(tz, THERMAL_EVENT_UNSPECIFIED);


static void thermal_zone_device_set_polling(struct thermal_zone_device *tz,
                                            int delay)
{
    if (delay > 1000)
        mod_delayed_work(system_freezable_power_efficient_wq,
                         &tz->poll_queue,
                         round_jiffies(msecs_to_jiffies(delay)));
    else if (delay)
        mod_delayed_work(system_freezable_power_efficient_wq,
                         &tz->poll_queue,
                         msecs_to_jiffies(delay));
    else
        cancel_delayed_work(&tz->poll_queue);
}

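So the delayed work ends up calling thermal_zone_device_update again, which is how the device temperature is polled. When delay == 0, polling is cancelled, meaning the sensor reports asynchronously through interrupts.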
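
The delay selection in monitor_thermal_zone and the rounding rule in thermal_zone_device_set_polling can be condensed into two small pure functions. This is a user-space sketch with hypothetical function names; it models only the decision logic, not the workqueue handling.

```c
#include <assert.h>

/* Delay in milliseconds for the next poll, or 0 to stop polling.
 * Mirrors the if / else if / else chain in monitor_thermal_zone. */
int next_poll_delay_ms(int passive, int passive_delay_ms, int polling_delay_ms)
{
    assert(passive_delay_ms >= 0 && polling_delay_ms >= 0);
    if (passive)
        return passive_delay_ms;    /* passive cooling in effect */
    if (polling_delay_ms)
        return polling_delay_ms;    /* normal polling */
    return 0;                       /* interrupt driven: no polling */
}

/* Delays above one second go through round_jiffies(), which aligns the
 * timer to a whole second; shorter delays are used as they are. */
int uses_rounded_timer(int delay_ms)
{
    return delay_ms > 1000;
}
```

For the ACPI zone in this article both delays are 30000 ms, so the timer is rounded in either mode.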

cooling device

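Where the Device Comes From

As one would expect, passive cooling of the CPU means lowering the CPU frequency, so the corresponding cooling device is the CPU itself.

During kernel initialization, architecture code usually has a topology_init function that calls register_cpu to register each CPU on the cpu_subsys bus. At that point no ID is available for matching; the ID is assigned later, while the ACPI tables are parsed.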

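//drivers/base/cpu.c
register_cpu
    memset(&cpu->dev, 0x00, sizeof(struct device));    // dev is zeroed and not filled in by this function; the fwnode member is needed to link it to the matching acpi_device
    device_register(&cpu->dev);
    per_cpu(cpu_sys_devices, num) = &cpu->dev;         // key step: the device is stored in the global cpu_sys_devices so that other modules can find this CPU later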

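topology_init is registered with subsys_initcall(topology_init) and usually runs at the very start of initcall level 4.

Next, the callback that assigns the ID is registered: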

//drivers/acpi/acpi_processor.c

static const struct acpi_device_id processor_device_ids[] = {
    { ACPI_PROCESSOR_OBJECT_HID, },
    { ACPI_PROCESSOR_DEVICE_HID, },
    { }
};

static struct acpi_scan_handler processor_handler = {
    .ids = processor_device_ids,
    .attach = acpi_processor_add,
#ifdef CONFIG_ACPI_HOTPLUG_CPU
    .detach = acpi_processor_remove,
#endif
    .hotplug = {
        .enabled = true,
    },
};

acpi_processor_add
    dev = get_cpu_device(pr->id);    // fetch the dev that register_cpu stored in the global cpu_sys_devices
    acpi_bind_one(dev, device);      // bind dev to the acpi_device (sets struct device->fwnode)

acpi_processor_init
    acpi_scan_add_handler_with_hotplug(&processor_handler, "processor");
        acpi_scan_add_handler(handler);    // add the handler to the global list acpi_scan_handlers_list

As shown, acpi_processor_init ends up adding a handler to the global list acpi_scan_handlers_list, matching the IDs ACPI_PROCESSOR_OBJECT_HID and ACPI_PROCESSOR_DEVICE_HID.

Next, the ACPI table parsing:

//drivers/acpi/scan.c
acpi_scan_init
    acpi_processor_init                  // described above: registers the processor handler
    acpi_bus_scan(ACPI_ROOT_OBJECT);
        acpi_walk_namespace              // walk the ACPI tables and add devices to acpi_bus_type; the returned device is the root node
        acpi_bus_attach(device);         // recursively processes all child devices

            acpi_scan_attach_handler     // look up acpi_scan_handlers_list by ids; here the match is ACPI_PROCESSOR_OBJECT_HID
                handler->attach(device, devid);    // the attach callback is invoked here

            list_for_each_entry(child, &device->children, node)
                acpi_bus_attach(child);

Registering the cooling device

The cooling device registration flow is as follows:

//drivers/acpi/processor_driver.c
const struct thermal_cooling_device_ops processor_cooling_ops = {
    .get_max_state = processor_get_max_state,
    .get_cur_state = processor_get_cur_state,
    .set_cur_state = processor_set_cur_state,
};

static const struct acpi_device_id processor_device_ids[] = {
    {ACPI_PROCESSOR_OBJECT_HID, 0},
    {ACPI_PROCESSOR_DEVICE_HID, 0},
    {"", 0},
};

static struct device_driver acpi_processor_driver = {
    .name = "processor",
    .bus = &cpu_subsys,
    .acpi_match_table = processor_device_ids,
    .probe = acpi_processor_start,
    .remove = acpi_processor_stop,
};

acpi_processor_driver_init
    driver_register(&acpi_processor_driver);    // register on the cpu_subsys bus and match devices by ids (1)
        acpi_processor_start
            __acpi_processor_start(device);
                acpi_pss_perf_init(pr, device);
                    pr->cdev = thermal_cooling_device_register("Processor", device, &processor_cooling_ops);
                        __thermal_cooling_device_register(NULL, type, devdata, ops);
                            list_add(&cdev->node, &thermal_cdev_list);    // add to the global list
                            bind_cdev(cdev);                              // bind to thermal zone devices (2)
                            thermal_zone_device_update                    // monitor the device temperature (3)

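(1) One cooling device is registered per CPU core.
(2) Binding to thermal zone devices
In this example acpi_processor_driver_init runs before acpi_thermal_init (check System.map), so no thermal zone device exists yet (and no bind callback has been provided either).
(3) Registering a cooling device also kicks off temperature monitoring, exactly as when a thermal zone device is registered, and it runs only once (tracked by thermal_zone_device->need_update).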
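
The run-once behaviour tracked by need_update is a compare-and-swap pattern: whichever registration path flips the flag from 1 to 0 performs the update, and every later caller sees 0 and skips it. The sketch below uses C11 atomics as a stand-in for the kernel's atomic_t; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct zone {
    atomic_int need_update;     /* 1 until the first update has been claimed */
    int updates;                /* how many times the update actually ran */
};

void zone_init(struct zone *z)
{
    assert(z);
    atomic_init(&z->need_update, 1);
    z->updates = 0;
}

/* Returns true only for the caller that flips need_update from 1 to 0. */
bool zone_claim_first_update(struct zone *z)
{
    int expected = 1;

    return atomic_compare_exchange_strong(&z->need_update, &expected, 0);
}

/* Stand-in for a registration path that wants the initial update. */
void zone_registration_path(struct zone *z)
{
    if (zone_claim_first_update(z))
        z->updates++;           /* the real code would update the zone here */
}
```

Calling zone_registration_path twice on the same zone leaves updates at 1, which is the property the note above describes.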

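Binding

The association between a cooling device and a thermal zone device is recorded in struct thermal_instance. A cooling device is ultimately bound to a specific trip point, which determines which cooling device acts when that trip point is triggered. A trip point is only meaningful once a cooling device is bound to it.

During initialization, the thermal zone device binds its cooling devices by calling bind_tz: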

//drivers/thermal/thermal_core.c
bind_tz(struct thermal_zone_device *tz)
    /* If there is ops->bind, try to use ops->bind */
    if (tz->ops->bind)
        list_for_each_entry(pos, &thermal_cdev_list, node)    // walk all cooling devices
            tz->ops->bind(tz, pos);

The callback is implemented as follows:

/* abridged; the full definition appears earlier */
struct thermal_instance {
    int id;
    char name[THERMAL_NAME_LENGTH];
    struct thermal_zone_device *tz;         // the associated sensor (thermal zone)
    struct thermal_cooling_device *cdev;    // the associated cooling device
    unsigned long upper; /* Highest cooling state for this trip point */
    unsigned long lower; /* Lowest cooling state for this trip point */
    unsigned long target; /* expected cooling state */
    struct list_head tz_node; /* node in tz->thermal_instances */
    struct list_head cdev_node; /* node in cdev->thermal_instances */
    unsigned int weight; /* The weight of the cooling device */
};

//drivers/acpi/thermal.c
acpi_thermal_bind_cooling_device
    acpi_thermal_cooling_device_cb(thermal, cdev, true);
        int trip = -1;
        if (tz->trips.critical.flags.valid)    // (1)
            trip++;
        if (tz->trips.hot.flags.valid)
            trip++;
        if (tz->trips.passive.flags.valid) {
            trip++;
            for (i = 0; i < tz->trips.passive.devices.count; i++) {    // walk the "_PSL" list
                handle = tz->trips.passive.devices.handles[i];         // these lines matter (2)
                status = acpi_bus_get_device(handle, &dev);
                if (ACPI_FAILURE(status) || dev != device)
                    continue;

                thermal_zone_bind_cooling_device(thermal, trip, cdev, THERMAL_NO_LIMIT, THERMAL_NO_LIMIT, THERMAL_WEIGHT_DEFAULT);
                    // inside this function, dev is the newly allocated thermal_instance
                    dev->target = THERMAL_NO_TARGET;                           // the current cooling state
                    list_add_tail(&dev->tz_node, &tz->thermal_instances);      // (3)
                    list_add_tail(&dev->cdev_node, &cdev->thermal_instances);
            }
        }

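(1) When the trip points from the ACPI tables are parsed and stored, they are ordered critical, hot, passive, active; the trip argument is the index of the trip point in that order.
(2) The passive.devices list is filled in while "_PSL" is parsed and holds the passive cooling devices of this zone. During binding, each device taken from this list is compared with the cooling device being bound. So although one cooling device is registered per core, whether it is actually used depends on "_PSL" in the ACPI tables.
(3) The bind function finally adds the instance to the lists in both the thermal zone device and the cooling device.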
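
Note (1) can be checked with a few lines of C. The sketch below computes the index of the passive trip point from the validity flags, following the critical, hot, passive order. The struct and function names are hypothetical; only the counting logic is taken from the flow above.

```c
#include <assert.h>
#include <stdbool.h>

/* Which of the ACPI trip points the firmware provides. */
struct acpi_trip_flags {
    bool critical_valid;    /* _CRT present */
    bool hot_valid;         /* _HOT present */
    bool passive_valid;     /* _PSV present */
};

/* Index of the passive trip point, or -1 if the zone has none. */
int passive_trip_index(const struct acpi_trip_flags *f)
{
    int trip = -1;

    assert(f);
    if (f->critical_valid)
        trip++;
    if (f->hot_valid)
        trip++;
    if (!f->passive_valid)
        return -1;
    return trip + 1;
}
```

The THM0 zone shown earlier defines _CRT and _PSV but no _HOT, so its passive trip point has index 1 and its critical trip point index 0.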

governor

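The governor structure is fairly simple; its main content is the callback that computes the cooling state. This section uses step_wise as the example.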

//drivers/thermal/step_wise.c
static struct thermal_governor thermal_gov_step_wise = {
    .name = "step_wise",
    .throttle = step_wise_throttle,
};

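As described earlier, registering a thermal zone device or a cooling device starts polling the device temperature, but how the temperature is handled was not covered in depth. This section fills in the details; the flow is outlined below.

When the trip type is neither hot nor critical, the governor's throttle callback is invoked: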

//drivers/thermal/step_wise.c
step_wise_throttle
    thermal_zone_trip_update(tz, trip);
        tz->ops->get_trip_temp(tz, trip, &trip_temp);    // read the trip temperature
        trend = get_tz_trend(tz, trip);                  // get the temperature trend via the ops->get_trend callback (1)
        if (tz->temperature >= trip_temp)
            throttle = true;                             // set once the temperature reaches the trip temperature (2)
        list_for_each_entry(instance, &tz->thermal_instances, tz_node) {    // walk all cooling devices
            if (instance->trip != trip)                  // cooling devices are bound to trip points, so filter by trip
                continue;

            old_target = instance->target;               // still THERMAL_NO_TARGET if the trip temperature was never exceeded
            instance->target = get_target_state(instance, trend, throttle);    // compute the next cooling state (3)
            if (instance->initialized && old_target == instance->target)       // target unchanged: move on to the next instance
                continue;

            if (old_target == THERMAL_NO_TARGET &&
                instance->target != THERMAL_NO_TARGET)   // passive cooling starts: set tz->passive so that polling uses passive_delay
                update_passive_instance(tz, trip_type, 1);

            else if (old_target != THERMAL_NO_TARGET &&
                     instance->target == THERMAL_NO_TARGET)    // already at the lowest state and below the trip temperature: turn passive cooling off
                update_passive_instance(tz, trip_type, -1);
            instance->initialized = true;
            instance->cdev->updated = false; /* cdev needs update */
        }
    list_for_each_entry(instance, &tz->thermal_instances, tz_node)
        thermal_cdev_update(instance->cdev);             // the cooling state has been computed; now apply it
            cdev->ops->set_cur_state(cdev, target)       // calls processor_set_cur_state to update the state
                freq_qos_update_request                  // cap the frequency
            thermal_cooling_device_stats_update(cdev, target);    // record statistics; the stats directory under the cooling device records the time spent in each state

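(1) The temperature trend, for example rising (THERMAL_TREND_RAISING), dropping (THERMAL_TREND_DROPPING), or stable (THERMAL_TREND_STABLE), feeds into the cooling state calculation.
(2) throttle indicates whether the trip temperature has been exceeded; it also feeds into the calculation.
(3) The cooling state is computed from trend and throttle:

  • throttle set and trend rising: move to the next higher cooling state;
  • throttle set and trend dropping: leave the cooling state unchanged;
  • throttle cleared and trend rising: leave the cooling state unchanged;
  • throttle cleared and trend dropping: move to the next lower cooling state.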
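
The four rules translate almost directly into code. The following is a simplified sketch, not the kernel's get_target_state: it clamps the result to the [lower, upper] range of the instance and leaves out the THERMAL_NO_TARGET handling that deactivates an instance. The enum and function names are hypothetical.

```c
#include <assert.h>

enum trend { TREND_STABLE, TREND_RAISING, TREND_DROPPING };

/* Next cooling state under the four step-wise rules listed above.
 * cur is the current state; lower and upper bound the allowed range. */
unsigned long next_state(unsigned long cur, unsigned long lower,
                         unsigned long upper, enum trend trend, int throttle)
{
    assert(lower <= upper && cur >= lower && cur <= upper);

    if (throttle && trend == TREND_RAISING && cur < upper)
        return cur + 1;         /* above the trip and heating up: step up */
    if (!throttle && trend == TREND_DROPPING && cur > lower)
        return cur - 1;         /* below the trip and cooling down: step down */
    return cur;                 /* every other combination: no change */
}
```

Because the state moves by at most one step per evaluation, how quickly cooling ramps up depends on the polling interval, which is why passive_delay matters once throttling has started.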

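In this example the CPU frequency is ultimately adjusted through the cooling device's set_cur_state callback. The following shows briefly how the state is maintained: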

//drivers/acpi/processor_thermal.c

#define CPUFREQ_THERMAL_MIN_STEP 0
#define CPUFREQ_THERMAL_MAX_STEP 3

/*
 * There exists four states according to
 * cpufreq_thermal_reduction_pctg. 0, 1, 2, 3
 */
static DEFINE_PER_CPU(unsigned int, cpufreq_thermal_reduction_pctg);
#define reduction_pctg(cpu) \
    per_cpu(cpufreq_thermal_reduction_pctg, phys_package_first_cpu(cpu))


const struct thermal_cooling_device_ops processor_cooling_ops = {
    .get_max_state = processor_get_max_state,
    .get_cur_state = processor_get_cur_state,
    .set_cur_state = processor_set_cur_state,
};

cpufreq_get_cur_state(unsigned int cpu)
    return reduction_pctg(cpu);


cpufreq_set_cur_state(unsigned int cpu, int state)
    reduction_pctg(cpu) = state;
    max_freq = (policy->cpuinfo.max_freq * (100 - reduction_pctg(i) * 20)) / 100;
    freq_qos_update_request(&pr->thermal_req, max_freq);

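As shown, the per-CPU variable cpufreq_thermal_reduction_pctg holds the cooling state, with a maximum of 3. When the cooling state is set, the CPU frequency is lowered accordingly (each state removes 20% of cpuinfo.max_freq), and the cap is enforced through the frequency QoS mechanism.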
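
The frequency cap is plain integer arithmetic and can be reproduced outside the kernel. The sketch below applies the formula from cpufreq_set_cur_state; the function name and the example frequency of 2000000 kHz are assumptions for illustration.

```c
#include <assert.h>

#define THERMAL_MAX_STEP 3      /* same value as CPUFREQ_THERMAL_MAX_STEP */

/* Frequency cap in kHz for a cooling state: each state removes 20% of the
 * hardware maximum, following max_freq * (100 - state * 20) / 100. */
unsigned int capped_max_freq_khz(unsigned int max_freq_khz, unsigned int state)
{
    assert(state <= THERMAL_MAX_STEP);
    return (unsigned int)((unsigned long long)max_freq_khz
                          * (100 - state * 20) / 100);
}
```

For a CPU with a 2 GHz maximum, states 0 to 3 give caps of 2000000, 1600000, 1200000, and 800000 kHz, so the deepest cooling state still leaves 40% of the maximum frequency.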

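For dynamic frequency scaling, see the separate article "Linux cpufreq framework".

Debugging

The emul_temperature member was mentioned in the description of struct thermal_zone_device. With CONFIG_THERMAL_EMULATION enabled, writing to the emul_temp node, for example echo xxx > /sys/class/thermal/thermal_zone0/emul_temp, changes the temperature of that thermal zone, which is handy for debugging (the temperature is updated immediately).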
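
The same write can be done from a small C helper, which is convenient in a test program that steps the temperature up and down. The sketch below takes an already opened stream so that it can be exercised against any file; write_emul_temp is a hypothetical name. The emul_temp node expects the value in millidegrees Celsius, and writing 0 turns emulation off again.

```c
#include <assert.h>
#include <stdio.h>

/* Write an emulated temperature, in millidegrees Celsius, to a stream that
 * the caller has opened on the zone's emul_temp node.
 * Returns 0 on success and -1 on failure. */
int write_emul_temp(FILE *f, int millicelsius)
{
    assert(f);
    if (fprintf(f, "%d\n", millicelsius) < 0)
        return -1;
    return fflush(f) == 0 ? 0 : -1;
}
```

On a real system the stream would come from fopen("/sys/class/thermal/thermal_zone0/emul_temp", "w"), run as root; passing 75000 emulates 75 C, which is above the 70 C passive trip point of the example zone.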