Fair Share Schedulers: give every process lottery tickets; more tickets mean higher priority.
Guarantee that each process obtains a certain percentage of CPU time
Does not optimise for turnaround or response time
TIP
if you need a mechanism to represent a proportion of ownership, try using tickets
Every so often hold a lottery to decide which job runs next. Higher priority jobs should be given more chances to win (more tickets)
Note
random algs often avoid strange edge cases
random algs are usually lightweight, and maintain little to no state
random algs are useful when very fast decisions are required, so long as generating a random number is fast
Ticket Mechanisms
ticket currency: allow a user with a set of tickets to allocate tickets among their own jobs in whatever currency they like
User A -> 500 (A’s currency) to A1 -> 50 (global currency)
User A -> 500 (A’s currency) to A2 -> 50 (global currency)
User B -> 10 (B’s currency) to B1 -> 100 (global currency)
The system converts local currencies to global and holds its lottery among the global ones (50+50+100=200)
ticket transfer: a process can transfer its tickets temporarily. useful for client-server cases: the client asks the server to do some task and transfers its tickets (it can’t do work in that time anyway)
ticket inflation: if a process needs more cpu time it temporarily boosts its ticket value (only sensible among mutually trusting processes)
Implementation
```c
/* example implementation */
node_t* dispatcher(node_t *head) {
    int counter = 0;
    int winning_ticket = rand() % TOTAL_TICKETS;
    node_t *current = head;
    while (current) {
        counter += current->tickets;
        if (counter > winning_ticket) {
            break; /* found winner */
        }
        current = current->next;
    }
    /* current won: schedule it */
    return current;
}
```
good process list data structure (sorted perhaps?)
total number of tickets
simple ^_^
TIP
knowing which data structure to choose is a hallmark of good engineering. carefully consider its access patterns and frequency of usage
WARNING
we have not discussed how to assign tickets i.e. how to assign priority to a process. this is an open problem
Lottery scheduling can be shown to be very fair over longer job run times, since the randomness averages out across more lotteries
Stride Scheduling
a deterministic approach to fair-share scheduling
basically run the process that has made the least progress so far. progress increments are the inverse of the ticket count
stride: n/tickets, calculated for each process (where n is some large constant)
pass value: progress tracker (incremented by the stride every time the process is run)
```c
curr = remove_min(queue); /* client with min pass value */
schedule(curr);
curr->pass = curr->pass + curr->stride;
insert(queue, curr);
```
NOTE
lottery scheduling has a nice property in that there is no global state. therefore no extra memory, and it makes adding new processes much easier (this is good). if a new process were added to a stride scheduler with its pass value = 0, it would monopolise the cpu (this is bad)
OSX and Windows use some forms of MLFQs3 (unix too if i’m not mistaken4) but Linux sets those aside and chooses to prioritise fairness by trying to model an “ideal, precise multi-tasking CPU”. On an “ideal CPU” two processes would simultaneously consume exactly 50% of the power and run at equal speed.
Linux offers user policies (SCHED_OTHER, SCHED_BATCH, SCHED_IDLE) and real-time policies (SCHED_FIFO, SCHED_RR; these make use of MLFQs). Each thread has a policy associated with it.
The user policies use slightly different versions of the Completely Fair Scheduler and have an associated priority of 0 (real-time policies have priorities from 1-99); instead they make use of nice values to mimic priority
Since a real CPU can only run one task at a time, Linux introduces the concept of vruntime, which tracks the time a task has run so far in nanoseconds
Basic Operation
divide the cpu fairly and evenly among all competing processes, using a counting-based technique called virtual runtime
as each process runs it accumulates vruntime. Each process’s vruntime increases at the same rate, in proportion with real time. When a scheduling decision occurs, CFS picks the process with the lowest vruntime to run next.
Pre-empting
We have discussed how the scheduler chooses which task to run, but when do we return control to the OS? How will the scheduler know when to context switch? How do we define a time quantum?
CFS maintains various control parameters
sched_latency: used to determine how long a process should run before considering a switch. basically creates a dynamic time slice, usually computed as sched_latency/#processes where sched_latency = 48ms. Now what if #processes → ∞? that’s too small a time slice
min_granularity: minimum time slice that can be assigned, typically min_granularity = 6ms
(i don’t know how outdated this data is exactly)
NOTE
time slices only determine when context switches happen; the scheduler still relies on separate periodic timer interrupts from the hardware to precisely update vruntime
EEVDF ( Earliest Eligible Virtual Deadline First)
As of linux 6.6 the kernel developers have begun migrating from CFS to EEVDF5, which is supposedly a slight improvement over CFS. It operates very similarly but introduces the concepts of lag and virtual deadlines
anyway that’s the reason i can’t show the exact values of sched_latency and min_granularity. Be content with what you can see instead
```
# essentially the ideal value of sched_latency/nr_processes
# i.e. the ideal/min time slice
❯ sudo cat /sys/kernel/debug/sched/base_slice_ns
2800000 # 2.8ms
```
(there’s also a guy that hates both of these and wants to use the Brain Fuck Scheduler)6
Weighting (Niceness)
CFS enables control over process priority. nice levels range from −20 to +19, with a default of 0. Each nice value is mapped to a weight that allows us to compute the effective time slice of a process
(i am not sure how true this is, my understanding was that the time slice was simply #processessched_latency though i suppose if they all had the same niceness it would be the same thing)
weight of −20 = 88761, weight of +19 = 15 ⇒ ratio of roughly 5900
thus vruntime is also updated as
vruntime_i = vruntime_i + (weight_0 / weight_i) × runtime_i
Weights
the map mentioned above probably exists for implementation-speed reasons; there is an easier way to remember the weight mappings.
Linux treats 1024 as the base weight (nice 0), and a nice value of n corresponds to a weight of 1024/(1.25)^n
Notice that positive nice values lead to larger vruntimes and the opposite for negative ones
Using Red-Black Trees
Unlike other systems, the CFS run queue is maintained as a Red Black Tree7, which is a balanced binary search tree (like the AVL tree8)
```c
/* for example */
struct t_red_black_node {
    enum { red, black } colour;
    task_struct *process;
    struct t_red_black_node *left, *right, *parent;
};
```
The key for each node is the vruntime, and each time the scheduler has to make a decision it dequeues the minimum-vruntime task, which is the leftmost node (it maintains a special pointer to that location). Once the task has run for its time slice, its vruntime is updated and it is enqueued back on the tree
Here you notice that a higher nice value leads to a higher vruntime which pushes it further back in the RB tree and a lower one keeps it closer to the front
Dealing with I/O and Sleeping Processes
if a process sleeps for a long time, then when it wakes up it will monopolise the CPU :((
instead of letting it keep its same vruntime CFS will instead assign it the vruntime of the minimum process in the RB tree (which only contains running processes) preventing starvation
however, frequently sleeping processes often do not get their fair share of the CPU