2016-01-21

即時系統加入排程器

目標

修改 linux kernel，在 kernel/sched.} 相關 scheduler 部分增加 Weighted Round-Robin scheduler、Shortest Job First scheduler 以及 Rate-monotonic scheduler。最後藉由 Project 1 的練習撰寫測試程序驗證 scheduler 是否符合預期結果。

修改檔案列表

.
├── README.md
├── arch
│   └── x86
│       ├── include
│       │   └── asm
│       │       ├── unistd_32.h
│       │       └── unistd_64.h
│       └── kernel
│           └── syscall_table_32.S
├── include
│   └── linux
│       ├── sched.h
│       └── syscalls.h
└── kernel
    ├── sched.c
    ├── sched_fair.c
    ├── sched_rt.c
    └── sched_weighted_rr.c

特別注意到，若編譯環境為 x86_64 則需要額外修改 arch/x86/include/asm/unistd_64.h 複製 arch/x86/include/asm/unistd_32.h 增加的 syscall。同理，增加自己定義的 syscall 時，需要修改 arch/x86 下的那三個檔案 (如果發現編譯 kernel 時，出現 WARNING: syscall NOT IMPLEMENTION. 大多數都是這個造成)。

在 include/linux/sched.h 中，增加 task 的成員變數，提供接下來針對 task 操作需要的參數，特別的是在 struct sched_rt_entity; 中，相較於一般資料結構的寫法，linux 採用倒過來將節點訊息放在資料內部中，利用 #define container_of(ptr, type, member) 取得物件訊息，這種寫法方便解決 task 在數個 scheduler 中轉換。

因為 syscall 的方式設定 task weighted，增加一個全區變數 int weighted_rr_time_slice;，每一次 syscall 去設定這一個全區變數值，然後 task 建立時，經過 sched_fork，直接利用被修改全區變數作為這一個 task 的 weight。若要特別針對 task 設定 weighted round-robin，在建立前都要呼叫 syscall 設定全區變數。

struct sched_rt_entity {
    ...
    //+ RTS Proj2: weighted_rr
    struct list_head weighted_rr_list_item;
    ...
};
...
struct task_struct {
    ...
    //+ RTS Proj2: weighted_rr
    unsigned int task_time_slice;
    //+ RTS Proj2: weighted_rr_prio
    unsigned int weighted_time_slice;
    //+ RTS Proj2: bonus RMS
    int period;
    ...
};
...
//+ RTS Proj2: weighted_rr
extern int weighted_rr_time_slice;

Part 1. Weighted Round-Robin Scheduler

enqueue_task_weighted_rr()
函數將會給予 task 和 rq，將 task 根據 rq 的資料結構重新將 task 放入。若按照 list 結構，直接模擬 queue 即可，由於 linux 提供 list 本身為雙向串列，直接加入尾端只需要 $\mathcal{O}(1)$ 時間。並且運行 rq->weighted_rr.nr_running++; 增加 rq 中的計數，隨後用來在回報該 scheduler 中有沒有存在 task 需要運行，方便在 $\mathcal{O}(1)$ 時間內完成，由於 Weighted round-robin 只需要確認 list 是否為空，那麼也可以利用 linux 提供的 list_empty() 確認。
dequeue_task_weighted_rr()
當 task 完成任務時，都要呼叫 update_curr_weighted_rr 進行統計運行多久，並且更新與 task 相關的排程權重。接著將 task 從 rq 串列中移除，並且更新 scheduler 的 task 計數 (rq->weighted_rr.nr_running--;)。時間複雜度 $\mathcal{O}(1)$
requeue_task_weighted_rr()
將一個 task 出列，在 Weighted round-robin 只需要直接對將 task.weighted_rr_list_item 移到串列最尾端，由於採用雙向串列，直接跟 scheduler 取得 list head (linux list 的 head 是不存放任何資訊) 直接加入尾端即可。時間複雜度 $\mathcal{O}(1)$
yield_task_weighted_rr()
直接運行 \lstinline{requeue} 即可。
pick_next_task_weighted_rr()
當最上層分配一段時間給 scheduler 運行，運行時會調用這個函數，並拿取要執行的 task，但並不用將其移除串列，並執行 next->se.exec_start = rq->clock; 紀錄 task 的開始運行時間，再呼叫 void update_curr_weighted_rr(struct rq *)。若不更新時間，則計算的相對運行時間會錯誤。
task_tick_weighted_rr()
當 scheduler 運行時，每隔一段固定的時間會呼叫此函數。若程序執行量超過 scheduler 的設定，則需要更新串列，特別注意到 if (p->task_time_slice && --p->task_time_slice) 在原本預設是 unsigned int 型態，處理不當會造成 overflow，另一種情況會是一開始 quantum 設定為 0 所導致。
需根據不同的 task 要補充不同的量 p->weighted_time_slice，不只要讓這支程式重進進入串列，同時需要呼叫 set_tsk_need_resched(p) (參考 kernel/sched_rt.c 代碼)，藉以重新呼叫 pick_next_task_weighted_rr(struct rq*)，否則這支程序會運行到結束。

Part 2. Shortest Job First Scheduler

enqueue_task_weighted_rr()
與 Weighted round-robin 類似，利用 list 做插入排序，可以在最慘 $\mathcal{O}(n)$ 時間內維護一個優先權由高至低的 list。需要 task 時，直接將串列的首元素移除。
dequeue_task_weighted_rr()
類同 Weighted round-robin。
requeue_task_weighted_rr()
在最短工作優先排程中，由於 task 優先權會變動，不方便確定執行過一陣子的 task 要移動到哪裡，最簡單的實作方式採用 dequeue 後再一次 enqueue。時間複雜度 $\mathcal{O}(n)$
yield_task_weighted_rr()
直接運行 \lstinline{requeue} 即可。
pick_next_task_weighted_rr()
類同 Weighted round-robin。
task_tick_weighted_rr()
在最短工作優先排程中若按照 Weighted round-robin 的寫法且不補充 quantum，則會變成 non-preemptive SJF。相反地，若要實作 preemptive SJF，需要在 check_preempt_curr_weighted_rr(struct rq*, struct task_struct*, int) 檢查執行的程序是不是串列的首元素，意即是不是最高優先權，如果不是最高優先權，則呼叫 resched_task() 即可完成 (詳見 Bonus RMS 的寫法)。

Bonus RMS

額外增加 syscall，修改三個檔案 unistd_32.h, unistd_64.h, syscall_table_32.S。增加 Rate-monotonic scheduler 的額外週期 (period) 和 syscall 函數主題，修改兩個檔案 sched.h, sched.c。

程式碼細節 `sched_weighted_rr.c`

enqueue_task_weighted_rr
使用與最短工作優先排程相同寫法，但利用週期 (period) 進行由優先權高到低排序。
check_preempt_curr_weighted_rr
週期程式可能會搶之前正在運行程式，若發現運行程式的優先權低於要進入排程器的程式優先權，呼叫 resched_task() 重新排程。
其餘類同 SJF。

測試程序 `test_rms.c`

參考 Project 1 的寫法，linux 針對每一個 thread 有各自計時的計時器，當 thread 被 context-switch 時，計時器也會隨之停止。

利用 main thread 產生週期性程式，pthread_create 並不會馬上將 thread 立即執行，迫使 main thread 的方式使用 sleep(1)，以秒為單位產生週期性程式。由於要考慮週期程式的執行時間，又能在輸出緩衝區觀察順序關係，利用 thread 的計時器，盡可能接近 1ms 才進行一次印出，但這也會造成幾毫秒的誤差，對於一支程序一秒大約印出 1000 個字元，測試程式中，將連續相同字元計數，若相同字元取用四捨五入到秒才進行印出，計時誤差部分就不予輸出。

Github

https://github.com/morris821028/hw-realtime-system

初體驗 qemu

若不想在 Windows 上灌 Virtual Box 進行 kernel 修改，那麼在 Linux 上面灌 qemu，在上面測試，假設只要測試 kernel 那麼 qemu 在第一次編譯以及在部分檔案重新編譯都只需要兩三分鐘。若 Windows 上用 Virtual Box 灌 Linux，又在 Linux 上面灌 qemu，容易遭遇到記憶體不夠的窘境。

Ubuntu Environment

Ubuntu 環境需要安裝下列插件。

1 2	$ sudo apt-get install qemu-system $ sudo apt-get install libncurses5-dev

Buildroot

用 busybox 建造一直失敗，轉向學長推薦的 buildroot，用別人打包好的環境替 qemu 建造 file system。

site

1
2
3

$ wget https://buildroot.uclibc.org/downloads/buildroot-2015.11.1.tar.bz2
$ bzip2 -d FileName.bz2
$ tar xvf FileName.tar

Compile Config

Buildroot 提供很多參數給予選擇

1	$ make menuconfig

這次作業設定，由於需要改寫 kernel 版本為 2.6.32.* 左右，因此相關的 gcc 編譯器等設定如下。同時，根據機器架構設定，助教預設在 visual box 上面，通常為 i386，而我們用實驗室 server，編 x86 64。

Target options  --->
    Target Architecture (x86_64)  --->
        x86_64
Toolchain  --->
    C library (glibc)  --->
        glibc
    Kernel Headers (Manually specified Linux version)  --->
        custom 2.6.x
        2.6.32.6
System configuration  --->
    [*] Enable root login with password
    [*] remount root filesystem read-write during boot 
Filesystem images  ---> 
    [*] ext2/3/4 root filesystem
    [*] tar the root filesystem

Compile

make -j8 用 8 個執行緒編譯，如果機器有很多個 core，那麼可以用 make -j20 在非常短的時間內完成。

接著在 buildroot

1 2	$ make -j8 $ cp output/image/rootfs.ext* WORKSPACE_PATH

Linux kernel config

1 2	$ make defconfig $ make menuconfig

[_] Enable loadable module support
File systems  ---> 
    [*] Second extended fs support
        [*] ext2
Device Drivers  --->
    Generic Driver Options  ---> 
        [*] Create a kernel maintained /dev tmpfs (EXPERIMENTAL) 
        [*] Automount devtmpfs at /dev

create arch/x86/boot/bzImage

1	$ make -j20

Test Flow

move test execute file

mnt.sh

1
2
3

sudo mount rootfs.ext2 mnt
sudo cp test_weighted_rr/test_weighted_rr mnt/root/
sudo umount mnt

run.sh

1	qemu-system-x86_64 -M pc -kernel arch/x86/boot/bzImage -hda rootfs-dirty.ext4 -netdev user,id=network0 -device e1000,netdev=network0 -nographic -append "root=/dev/sda console=ttyS0"

How to enlarge the rootfs.ext4

http://stackoverflow.com/questions/3495854/extend-the-size-of-already-existing-ext2-image

1
2
3

$ dd if=/dev/zero of=foo.img count=0 bs=1M seek=2000   # assuming target size is 2000M
# e2fsck -f foo.img
# resize2fs foo.img

Screen Workspace

commond tutorial

執行

$ screen

ctrl+a c
開啟新的視窗，並同時切換到這個新的視窗
ctrl+a tab
切換視窗
ctrl+a |
切割垂直視窗，並產生新的視窗。
ctrl+a k
關閉此視窗

之所以用 Screen 指令，主要是因為遠端到工作站上操作 qemu，若程序掛掉，很容易遭遇到要重新連線的麻煩事。

Speical Thanks

蕭光宏蕭大帥
Lab 332 Parallel & Distributed Laboratory

Virtual Box

第一次運行

$ cd /usr/src
$ sudo wget http://newslab.csie.ntu.edu.tw/course/rts2015/projects/linux-2.6.32.60.tar.gz
$ sudo tar -zxvf linux-2.6.32.60.tar.gz
$ cd linux-2.6.32.60
$ sudo make mrproper
$ sudo make menuconfig

1 2	$ scp morris1028@140.112.30.245:~/RTS/linux-2.6.32.60/arch/x86/include/asm/unistd_64.h arch/x86/include/asm/ $ scp morris1028@140.112.30.245:~/RTS/linux-2.6.32.60/kernel/sched_weighted_rr.c kernel/

$ sudo make bzImage
$ sudo make modules
$ sudo make modules_install
$ sudo make install

$ sudo vim /etc/default/grub
#Add "#" to comment the following 2 lines
#GRUB_HIDDEN_TIMEOUT=10
#GRUB_HIDDEN_TIMEOUT_QUIET=true
$ sudoupdate-grub2
$ sudoshutdown -r now

Test flow

$ cd /usr/src/linux-2.6.32.60
$ sudo make bzImage
$ sudo make modules
$ sudo make modules_install
$ sudo make install

Debug by `printk`

在另一個 terminal 開啟以下代碼，並且等待執行測試 scheduler 得執行檔在 kernel 運行時所打印出的 printk 訊息。

1	$ sudo more /proc/kmsg

即時系統使用排程器

目標

安裝 linux kernel 2.6.32.68、練習 pthread 設定 scheduler 相關函數以利後續修改 kernel source code。

設定 scheduler 問題

設定 thread 可以在哪些 core 運行，由於要測試執行順序,強制讓 pthread 都在 CPU 0 運行。
編譯時，增加 #define _GNU_SOURCE 在 #include <pthread> 之前，確定 cpu_set_t 型別存在。

cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(0, &mask);
sched_setaffinity(0, sizeof(mask), &mask);

使用 sched_setscheduler() 時，需查閱 int policy 相關規定去設定 struct sched_param 的 priority 數值。例如 SCHED_FIFO 的 priority 介於 1 到 99 之間,若 priority 錯誤，sched_setscheduler() 會發生錯誤，錯誤訊息可以藉由 printf("Message␣%s\n", mid, strerror(errno)); 查閱。

1
2
3

sched_setaffinity(0, sizeof(mask), &mask);
              int sched_setscheduler(pid_t pid, int policy,
const struct sched_param *param);

如果 pid = 0，相當於 caller 的 thread 會被設置對應的 policy 和 sched_param。在 policy 設定規範中，有兩種 real time scheduler SCHED_FIFO 和 SCHED_RR。當 thread 設置 real time scheduler 時，需要用 root 權限來執行程式，否則會設置失敗。相反地，若使用 non-real time scheduler 則不需要 root 權限，如 SCHED_BATCH。

struct sched_param param;
printf("Thread %d sched_setscheduler()\n", mid);
param.sched_priority = sched_get_priority_max(SCHED_FIFO);
s = sched_setscheduler(0, SCHED_FIFO, &param);
printf("Thread %d sched after %s\n", mid, strerror(errno));

三種方法設定 pthread

方法一

各自的 pthread_create() 後，再呼叫 sched_setscheduler() 進行設定。這種寫法在本次實驗有順序問題，在 create 之後，無法保證執行 sched_setscheduler() 的順序，無法依靠 main thread 中執行 pthread_create() 的順序決定。幸運地，仍可以使用

1
2
3

pthread_barrier_t barrier; 
pthread_barrier_wait(&barrier);
pthread_barrier_init(&barrier, NULL, MAX_THREAD);

確保每一個 thread 都設置好各自的 sched_setscheduler() 後統一運行，運行順序取決 priority。

方法二

在 main thread 中，呼叫 pthread_attr_setinheritsched(attr, PTHREAD_INHERIT_SCHED); 接下來 create 出來的 pthread 都將繼承 main thread 的 scheduler 設定。這種寫法很方便，如果需要各自設置 priority 會不方便，倒不如直接用第三種寫法。

方法三

在 main thread 中，手動設置每一個 pthread 的 scheduler。

pthread_attr_t *attr = new pthread_attr_t;
struct sched_param *param = new struct sched_param;
// increasing order
param->sched_priority = sched_get_priority_max(SCHED_FIFO) - i;
// set scheduler
pthread_attr_setinheritsched(attr, PTHREAD_EXPLICIT_SCHED);
pthread_attr_setschedpolicy(attr, SCHED_FIFO);
pthread_attr_setschedparam(attr, param);
// create thread
pthread_create(&tid[i], attr, running_print, create_arg(i+1, scheduler))

計時設置

為確定每一個 thread 按照設定的 scheduler 運行，方式採用 print() 進行，但有可能輸出之間間隔距離過短，導致順序容易在輸出上發生無法預期的問題，適時地加入一秒間隔完成。

如何準確地停止一秒？不能呼叫 sleep() 因為這類的 system call 會產生 interrupt，此時 thread 將會降低 priority 或者進入隊列尾端。佔據 CPU time 有以下三種

busy work

只要停留的夠久即可，下面是一種簡單的方法

1
2
3

// busy 1 second
double a = 0;
for (int i = 0; i < 10000000; i++)	a += 0.1f;

系統時間

利用 critical section 抓系統時間 int gettimeofday (struct timeval * tv, struct timezone * tz);，這寫法會共用同一份時間，因此需要使用 mutex 完成。仍使用一個迴圈來判定時間是否超過，實際運行時間會超過一秒。

static double my_clock(void) {
    struct timeval t;
    gettimeofday(&t, NULL);
    return 1e-6 * t.tv_usec + t.tv_sec;
}
pthread_mutex_lock(&outputlock);
{
    double sttime = my_clock();
    while (1) {
        if (my_clock() - sttime > 1)
            break;
    }
}
pthread_mutex_unlock(&outputlock);

執行緒運行時間

事實上 Linux 有提供每一個 thread 各自的 clock，用來計算 thread 的執行時間。使用 clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t)，在舊版本需在編譯參數中加入 -lrt 才能使用。

static double my_clock(void) {
    struct timespec t;
    assert(clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t) == 0);
    return 1e-9 * t.tv_nsec + t.tv_sec;
}

輸出 console 問題

在 SCHED_FIFO 這類的 real time scheduler 情況下，thread 輸出無法即時顯示在 terminal 上。

從 http://man7.org/linux/man-pages/man7/sched.7.html 可以見到 real-time scheduler 安排工作的比例，由於 print() 之後的顯示處理並不是 real-time process，當 real-time process 完全用盡 95\% 的 CPU time 時，顯示處理可能沒辦法在一秒內完成，可能要用好幾秒中的 5% 來完成。這也就是造成一開始會停頓一陣子，隨後才一次輸出數行結果。(註：並不是每一種裝置都會造成這問題，跟 CPU 時脈有關。)

The value in this file specifies how much of the period time can be used by all real-time and deadline scheduled processes on the system. The value in this file can range from -1 to INT_MAX-1. Specifying -1 makes the runtime the same as the period; that is, no CPU time is set aside for non-real-time processes (which was the Linux behavior before kernel 2.6.25). The default value in this file is 950,000 (0.95 seconds), meaning that 5% of the CPU time is reserved for processes that don’t run under a real-time or deadline scheduling policy.

解決這類的問題，可以從 kernel 設定著手，降低 real-time 最大佔有量，讓 print() 處理能即刻印出。這不是好的解決辦法，但可用於實驗。

// default
$ sudo /sbin/sysctl -w kernel.sched_rt_runtime_us=950000
// modify
$ sudo /sbin/sysctl -w kernel.sched_rt_runtime_us=500000

又或者採用拉大間隔，讓每一個 thread 都停止數十秒以上。

特別感謝

蕭光宏學長

Github

https://github.com/morris821028/hw-realtime-system

目標

修改檔案列表

Part 1. Weighted Round-Robin Scheduler

Part 2. Shortest Job First Scheduler

Bonus RMS

程式碼細節 sched_weighted_rr.c

測試程序 test_rms.c

Github

Ubuntu Environment

Buildroot

Compile Config

Compile

Linux kernel config

Test Flow

move test execute file

How to enlarge the rootfs.ext4

Screen Workspace

Speical Thanks

Virtual Box

第一次運行

Test flow

Debug by printk

目標

設定 scheduler 問題

三種方法設定 pthread

方法一

方法二

方法三

計時設置

busy work

系統時間

執行緒運行時間

輸出 console 問題

特別感謝

Github

程式碼細節 `sched_weighted_rr.c`

測試程序 `test_rms.c`

Debug by `printk`