OS

Lottery Scheduling

Fair-share schedulers: give every process lottery tickets; more tickets means higher priority.

  • Guarantee that each process obtains a certain percentage of CPU time
  • Does not optimise for turnaround or response time

TIP

if you need a mechanism to represent a proportion of ownership, consider using tickets

Every so often hold a lottery to decide which job runs next. Higher priority jobs should be given more chances to win (more tickets)

Note

  1. random algorithms often avoid strange edge cases
  2. random algorithms are usually lightweight and maintain little to no state
  3. random algorithms are useful when very fast decisions are required, so long as generating a random number is fast

Ticket Mechanisms

  1. ticket currency: allow a user with a set of tickets to allocate tickets among their own jobs in whatever currency they like

    User A  -> 500 (A’s currency) to A1 ->  50 (global currency)
            -> 500 (A’s currency) to A2 ->  50 (global currency)
    User B  ->  10 (B’s currency) to B1 -> 100 (global currency)

    The system converts local currencies to global and holds its lottery among the global ones (50+50+100=200)

  2. ticket transfer: a process can temporarily transfer its tickets to another process. useful for client/server cases: the client asks the server to do some task and hands over its tickets (the client can’t make progress in that time anyway)

  3. ticket inflation: if a process needs more cpu time it temporarily boosts its own ticket count. this only makes sense among processes that trust one another, otherwise a greedy process could monopolise the machine

Implementation

/* example implementation */
#include <stdlib.h>

typedef struct node {
  int tickets;            /* this process's ticket count */
  struct node *next;
} node_t;

int TOTAL_TICKETS;        /* sum of tickets across the whole list */

node_t* dispatcher(node_t *head) {
  int counter = 0;
  int winning_ticket = rand() % TOTAL_TICKETS;

  node_t *current = head;
  while (current) {
    counter += current->tickets;
    if (counter > winning_ticket) {
      break; /* found winner */
    }
    current = current->next;
  }

  /* current won: schedule it */
  return current;
}

For a good implementation you need to choose:

  1. a good RNG (difficult)
  2. a good process-list data structure (sorted, perhaps?)
  3. the total number of tickets

simple ^_^

TIP

knowing which data structure to choose is a hallmark of good engineering. carefully consider its access patterns and frequency of usage

WARNING

we have not discussed how to assign tickets i.e. how to assign priority to a process. this is an open problem

Lottery scheduling can be shown to become fairer as job run times grow: the longer two equally-ticketed jobs run, the closer their finishing times get to each other

Stride Scheduling

a deterministic approach to fair-share scheduling

basically: always run the process that has made the least progress so far, where progress advances in inverse proportion to ticket count

  • stride: calculated for each process as stride = K / tickets (where K is any large constant, e.g. 10,000)
  • pass value: progress tracker, incremented by the stride every time the process runs
curr = remove_min(queue); /* client with min pass value */
schedule(curr);
curr->pass = curr->pass + curr->stride;
insert(queue, curr);

NOTE

lottery scheduling has a nice property in that there is no global state. therefore no extra memory, and it makes adding new processes so much easier (this is good). if a new process were added in a stride scheduler with its pass value = 0, it would monopolise the cpu (this is bad)

an example (tickets A = 100, B = 50, C = 250, so with K = 10,000 the strides are 100, 200 and 40):

    Pass(A)       Pass(B)       Pass(C)       Who Runs?
    (stride=100)  (stride=200)  (stride=40)
    0             0             0             A
    100           0             0             B
    100           200           0             C
    100           200           40            C
    100           200           80            C
    100           200           120           A
    200           200           120           C
    200           200           160           C
    200           200           200           ...

The Linux Completely Fair Scheduler (CFS) [1][2]

OSX and Windows use some forms of MLFQ [3] (Unix too, if i’m not mistaken [4]) but Linux sets those aside and chooses to prioritise fairness by trying to model an “ideal, precise multi-tasking CPU”. On an “ideal CPU”, two processes would each simultaneously consume exactly 50% of its power and run at equal speed.

Scheduling Policies [1][2]

Linux offers user policies (SCHED_OTHER, SCHED_BATCH, SCHED_IDLE) and real-time policies (SCHED_FIFO, SCHED_RR; these make use of MLFQs). Each thread has a policy associated with it.

The user policies use slightly different versions of the Completely Fair Scheduler and have an associated priority of 0 (real-time policies have priorities from 1-99); instead they make use of nice values to mimic priority

Since real hardware can only run a limited number of tasks at once, Linux introduces the concept of vruntime, which tracks the time a task has run so far in nanoseconds

Basic Operation

fairly divide the cpu evenly among all competing processes, using a counting-based technique called virtual runtime

as each process runs it accumulates vruntime. at equal niceness, each process’s vruntime increases at the same rate, in proportion with real time. When a scheduling decision occurs, CFS picks the process with the lowest vruntime to run next.

Pre-empting

We have discussed how the scheduler chooses which task to run, but when do we return control to the OS? How will the scheduler know when to context switch? How do we define a time quantum?

CFS maintains various control parameters:

  1. sched_latency: used to determine how long a process should run before considering a switch. basically creates a dynamic time slice, usually something like time_slice = sched_latency / n where n is the number of runnable processes (a commonly quoted example value is sched_latency = 48ms). Now what if n is huge? that’s too small a time slice
  2. min_granularity: the minimum time slice that can be assigned; CFS never hands out a slice smaller than this (typically quoted as min_granularity = 6ms)

(i don’t know how outdated this data is exactly)

NOTE

time slices only govern when context switches happen; the scheduler still has separate periodic timer interrupts from the hardware to precisely update vruntime

EEVDF (Earliest Eligible Virtual Deadline First)

As of Linux 6.6 the kernel has begun migrating from CFS to EEVDF [5], which is supposedly a slight improvement over CFS. It operates very similarly but introduces the concepts of lag and virtual deadlines

anyway that’s the reason i can’t show the exact values of sched_latency and min_granularity. Be content with what you can see instead

# essentially the ideal value of sched_latency/nr_processes
# i.e. the ideal/min time slice
 sudo cat /sys/kernel/debug/sched/base_slice_ns
2800000 # 2.8ms

(there’s also a guy who hates both of these and wants to use the Brain Fuck Scheduler) [6]

Weighting (Niceness)

CFS enables control over process priority through nice levels, ranging from -20 to +19, with a default of 0 (lower means higher priority). Each niceness is mapped to a weight that allows us to compute the effective time slice of a process:

time_slice_k = (weight_k / sum of all weights) * sched_latency

(i am not sure how true this is, my understanding was that the time slice was simply sched_latency / n, though i suppose if they all had the same niceness it would be the same thing)

weight of nice -20 = 88761, weight of nice +19 = 15: a ratio of roughly 5900

thus vruntime is also updated in a weighted fashion:

vruntime_i = vruntime_i + (weight_0 / weight_i) * runtime_i

where weight_0 is the nice-0 weight and runtime_i is the real time the process just ran

Weights

the map mentioned above probably exists for faster implementation reasons; there is an easier way to remember the weight mappings. Linux treats 1024 (nice 0) as the base weight, and each nice value n corresponds to a weight of roughly 1024 / (1.25^n)

>>> 1024 / (1.25**(-20))
88817.84197001252
>>> 1024 / (1.25**(19))
14.757395258967641

giving the values mentioned above

Notice that positive nice values lead to larger vruntimes and the opposite for negative ones

Using Red-Black Trees

Unlike other systems, the CFS run queue is maintained as a Red-Black Tree [7], which is a balanced binary search tree (like the AVL tree [8])

/* for example */
struct t_red_black_node {
  enum { red, black } colour;
  struct task_struct *process;  /* the task this node schedules */
  struct t_red_black_node *left, *right, *parent;
};

The key for each node is the vruntime, and each time the scheduler has to make a decision it dequeues the task with the minimum vruntime, which is the leftmost node (it maintains a special pointer to that location). Once the task has run for its time slice, its vruntime is updated and it is enqueued back into the tree

Here you notice that a higher nice value leads to a higher vruntime which pushes it further back in the RB tree and a lower one keeps it closer to the front

Dealing with I/O and Sleeping Processes

if a process sleeps for a long time, its vruntime falls far behind everyone else’s, so when it wakes up it would monopolise the CPU :((

instead of letting it keep its same vruntime CFS will instead assign it the vruntime of the minimum process in the RB tree (which only contains running processes) preventing starvation

however, processes that sleep frequently often do not get their fair share of the CPU

Footnotes

  1. https://docs.kernel.org/scheduler/sched-design-CFS.html

  2. https://opensource.com/article/19/2/fair-scheduling-linux nice blog

  3. https://www.youtube.com/watch?v=GWF2Xu725SM funny vid i stole her meme

  4. https://www.usenix.org/system/files/conference/atc18/atc18-bouron.pdf battle of the schedulers (bsd vs linux)

  5. https://lwn.net/Articles/925371/ eevdf

  6. https://www.cs.unm.edu/~eschulte/data/bfs-v-cfs_groves-knockel-schulte.pdf CFS vs BFS

  7. https://www.eecs.umich.edu/courses/eecs380/ALG/red_black.html and https://www.cs.usfca.edu/~galles/visualization/RedBlack.html red-black tree visualiser

  8. https://stackoverflow.com/questions/11224830/inside-the-linux-2-6-completely-fair-scheduler “why not avl???”