Understanding job priority
Job priority, which is displayed in column 4 of qstats (which takes it
from column 2 of qstat -ext), is the key figure for job allocation: it
decides which job should be run next.
A job must also pass some further checks before it is run:
Are there enough free cpus for a job of this size to run?
Are the free nodes the right type of node for the job? e.g. some nodes
on Maxwell have larger memory and InfiniBand, which the job might
require.
Is the job the right type to use the free cpus? We keep some cpus
free for t and u jobs so that quick test jobs can be run.
Before a job even has to pass these tests, it needs to have the highest
priority, which, as we will see, is derived by a rather convoluted route.
priority = nurg * 0.5 + ntckts * 0.5
nurg = normalized(urg)
nurg(i) = urg(i)/max(urg)
ntckts = normalized(tckts)
ntckts(i) = tckts(i)/10000
tckts = the number of tickets allocated to a job.
The divisor for ntckts is simply the total number of tickets allocated
at any one time amongst the waiting jobs.
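As a rough illustration of the formulas above, the following Python
sketch computes the priority of each waiting job from its urgency and
ticket count. The function name, its parameters and the three example
jobs are invented for illustration and are not part of Grid Engine; the
0.5 weights and the 10000 ticket total are the values quoted above.

```python
def priorities(urgencies, tickets, total_tickets=10000,
               w_urg=0.5, w_tckts=0.5):
    """Return one priority per waiting job (hypothetical helper).

    urgencies and tickets are parallel lists, one entry per waiting job.
    """
    if not urgencies:
        return []
    max_urg = max(urgencies)
    result = []
    for urg, tckts in zip(urgencies, tickets):
        # nurg(i) = urg(i) / max(urg)
        nurg = urg / max_urg if max_urg else 0.0
        # ntckts(i) = tckts(i) / total tickets shared among waiting jobs
        ntckts = tckts / total_tickets
        result.append(w_urg * nurg + w_tckts * ntckts)
    return result


# Three made-up waiting jobs: urgency values and ticket counts.
example = priorities([3200, 1600, 40200], [5000, 3000, 2000])
```

In this made-up case the third job comes out on top despite holding the
fewest tickets, because its urgency is far larger than the others.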
There is very little specific information on the allocation of tickets
but I will give a general description at the end of the document.
I will start with the easier half of the equation, which is the
urgency.
Urgency
urg = rrcontr + wtcontr
The values of the components can all be viewed with qstat -urg. Further
information on the values can be found in the man page for sge_priority.
rrcontr = The urgency value contribution that reflects the urgency that
is related to the job's overall resource requirement.
rrcontr = (number of slots requested * 100) + (40000 if going to a t queue) + (80000 if going to a u queue)
wtcontr = The urgency value contribution that reflects the urgency
related to the job's waiting time.
wtcontr = nint( seconds since job submission * 0.278 )
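The two contributions can be sketched in Python as below. This is only
an illustration of the formulas above, not scheduler code: the function
and constant names are invented, the queue type is passed as a single
letter, and an s queue is assumed to add nothing beyond the slot
contribution because no extra value is listed for it.

```python
import math

SLOT_URGENCY = 100                               # per requested slot
QUEUE_BONUS = {"s": 0, "t": 40000, "u": 80000}   # "s": 0 is an assumption
WAIT_WEIGHT = 0.278                              # per second of waiting


def rrcontr(slots, queue_type):
    """Resource requirement contribution to the urgency."""
    return slots * SLOT_URGENCY + QUEUE_BONUS[queue_type]


def wtcontr(wait_seconds):
    """Waiting time contribution, rounded to the nearest integer."""
    # floor(x + 0.5) rounds halves upwards for non-negative x, which is
    # how nint is taken here; Python's round() rounds halves to even.
    return math.floor(wait_seconds * WAIT_WEIGHT + 0.5)


def urgency(slots, queue_type, wait_seconds):
    """urg = rrcontr + wtcontr"""
    return rrcontr(slots, queue_type) + wtcontr(wait_seconds)


new_t2_job = urgency(2, "t", 0)          # 2 slots, t queue, just submitted
waiting_s64_job = urgency(64, "s", 3600)  # 64 slots, s queue, 1 hour wait
```

Note that 0.278 is roughly 1000/3600, so an hour of waiting adds about
1000 to the urgency, the same as ten extra slots.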
The result is that the larger the job, the higher its urgency, and the
longer a job waits, the higher its urgency. Jobs going to a u queue
have a higher urgency than jobs going to a t queue, and likewise a job
going to a t queue has a higher urgency than a job going to an s queue.
A 16 cpu job that has waited 1.6 hours will have roughly the same
urgency as a 32 cpu job that has just been submitted, if they go to the
same type of queue. A job submitted to the s64 queue will have to wait
about 34 hours before its urgency exceeds that of a job just submitted
to the t2 queue, and about 74 hours before it exceeds that of a job
just submitted to the u2 queue.
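The s64 figures can be checked with a few lines of Python. This sketch
compares urgency only (it ignores the ticket half of the priority and
the rounding in wtcontr), and it assumes the s64, t2 and u2 queues take
64, 2 and 2 slots respectively; the function name is invented.

```python
def hours_to_overtake(rr_waiting, rr_new, wait_weight=0.278):
    """Hours a job must wait for its urgency to match a fresh job's.

    rr_waiting and rr_new are the rrcontr values of the waiting job and
    of the newly submitted job.
    """
    gap = rr_new - rr_waiting
    if gap <= 0:
        return 0.0          # already at least as urgent
    return gap / wait_weight / 3600


s64 = 64 * 100              # s queue: slots only
t2 = 2 * 100 + 40000        # t queue adds 40000
u2 = 2 * 100 + 80000        # u queue adds 80000

hours_vs_t2 = hours_to_overtake(s64, t2)   # just under 34 hours
hours_vs_u2 = hours_to_overtake(s64, u2)   # just under 74 hours
```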
Tickets
There are many ways you can use tickets in Grid Engine, but we
exclusively use the share tree policy. What little information
exists on this can be seen in the man page on share_tree. The
share tree defines the long-term resource entitlements of users
and projects. There are many ways you can look at resource usage,
but the simplest for our sort of setup is the amount of time a user's
program is occupying a cpu.
The starting point for a user's entitlement is that all users have an
equal right to the resources. Within any project each user should get
an equal amount of time, and each project should get an amount of cpu
time proportional to the number of people in the project. Maxwell is a
little different from past HPCF machines in that it was bought mainly
with grant money from a small number of groups. To allow for this, these
projects have an increased entitlement. (A similar thing happens on
Franklin, where projects that have contributed grant money can have a
certain amount of priority usage.)
Every few minutes the scheduler looks at all the jobs that are queued
up to be run and divides 10,000 tickets amongst them, based not only
on the user's and project's entitlement but also on the past usage of
both the user and the project. This accumulated past usage (cpu hours)
is decayed with time. The past usage has a half-life of 14 days, so
usage from 14 days ago carries half the weight it had when it was
accrued. The queuing system tries to allocate tickets so that, over the
long term, each user's and project's usage is in proportion to their
entitlement.
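The effect of the 14 day decay can be pictured with a small Python
sketch. It simply applies exponential decay with a 14 day half-life to
a block of cpu hours; the function name is invented, and how the
scheduler itself keeps track of usage is not described here.

```python
def decayed_usage(cpu_hours, age_days, half_life_days=14.0):
    """Weight that past usage still carries after age_days (sketch)."""
    return cpu_hours * 0.5 ** (age_days / half_life_days)


after_two_weeks = decayed_usage(100.0, 14)    # half the original weight
after_four_weeks = decayed_usage(100.0, 28)   # a quarter
```

So a heavy burst of usage counts against a user or project less and
less as the weeks pass, rather than being forgotten all at once.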
This, I am afraid, is the most detailed description of the ticket
allocation I can offer at this time.