Understanding job priority

Job priority, shown in column 4 of qstats (which takes it from column 2 of qstat -ext), is the key figure for job allocation: it decides the order in which waiting jobs are considered to run next.

There are further checks that a job must pass before it is run.
Are there enough free cpus for a job of this size to run?
Are the free nodes the right type of node for the job? E.g. some nodes on Maxwell have larger memory and infiniband, which the job might require.
Is the job the right type to use the free cpus? We keep some cpus free for t and u jobs so that quick test jobs can be run.

Even before a job has to pass these tests it needs to get the highest priority, which, as we will see, is derived by a very convoluted route.

priority  = nurg * 0.5 + ntckts * 0.5

nurg      = normalized(urg)
nurg(i)   = urg(i)/max(urg)

ntckts    = normalized(tckts)
ntckts(i) = tckts(i)/10000

tckts = the number of tickets allocated to a job.
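
As a rough illustration, the formulas above can be sketched in Python. This follows the equations exactly as written here, not the scheduler's actual source; the job names and their urgency and ticket values are made up for the example.

```python
TOTAL_TICKETS = 10000  # divisor used in the ntckts formula above


def priority(urg, max_urg, tckts, total_tickets=TOTAL_TICKETS):
    """priority = nurg * 0.5 + ntckts * 0.5, as defined above."""
    nurg = urg / max_urg if max_urg else 0.0   # nurg(i) = urg(i) / max(urg)
    ntckts = tckts / total_tickets             # ntckts(i) = tckts(i) / 10000
    return nurg * 0.5 + ntckts * 0.5


# Hypothetical waiting jobs: name -> (urgency, tickets)
jobs = {"a": (3200, 2500), "b": (40200, 500), "c": (6400, 7000)}

max_urg = max(urg for urg, _ in jobs.values())
priorities = {name: priority(urg, max_urg, tckts)
              for name, (urg, tckts) in jobs.items()}

# Jobs in the order the scheduler would consider them, highest priority first
ranked = sorted(priorities, key=priorities.get, reverse=True)
print(ranked)
```

Because both halves are normalized to the range 0 to 1 and weighted equally, a job with few tickets can still come first if its urgency is far above the others, and vice versa.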

The divisor for ntckts is simply the total number of tickets allocated at any one time amongst the waiting jobs.
There is very little specific information on the allocation of tickets but I will give a general description at the end of the document. I will start with the easier half of the equation which is the urgency.

Urgency

urg     =  rrcontr + wtcontr

The values of the components can all be viewed with qstat -urg; further information on the values can be found in the man page for sge_priority.

rrcontr = The urgency value contribution that reflects the urgency that is related to the job's overall resource requirement.

rrcontr = (number of slots requested * 100) + (40000 if going to a t queue) + (80000 if going to a u queue)

wtcontr = The urgency value contribution that reflects the urgency related to the job's waiting time.

wtcontr = nint(seconds since job submission * 0.278)

The result is that the larger the job, the higher its urgency, and the longer a job waits, the higher its urgency. Jobs going to a u queue have a higher urgency than jobs going to a t queue, and likewise a job going to a t queue has a higher urgency than a job going to an s queue. A 16 cpu job that has waited about 1.6 hours will have roughly the same urgency as a 32 cpu job that has just been submitted, if they go to the same type of queue. A job submitted to the s64 queue will have to wait 34 hours before it has a higher urgency than a job just submitted to the t2 queue, and 74 hours before it has a higher urgency than a job just submitted to the u2 queue.
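
The urgency formulas can be checked with a short Python sketch. It assumes that s64, t2 and u2 mean a 64-slot job in an s queue and 2-slot jobs in t and u queues; the function and constant names are invented for this example. Python's round() stands in for nint() and differs from it only at exact halves.

```python
QUEUE_BONUS = {"s": 0, "t": 40000, "u": 80000}  # extra urgency per queue type
WAIT_WEIGHT = 0.278                             # urgency gained per second waited


def urgency(slots, queue_type, wait_seconds=0):
    """urg = rrcontr + wtcontr, following the formulas above."""
    rrcontr = slots * 100 + QUEUE_BONUS[queue_type]
    wtcontr = round(wait_seconds * WAIT_WEIGHT)
    return rrcontr + wtcontr


def hours_to_overtake(waiting, fresh):
    """Hours a waiting job needs to match the urgency of a just-submitted job.

    Both arguments are (slots, queue_type) tuples.
    """
    gap = urgency(*fresh) - urgency(*waiting)
    return max(gap, 0) / WAIT_WEIGHT / 3600


s64_vs_t2 = hours_to_overtake((64, "s"), (2, "t"))
s64_vs_u2 = hours_to_overtake((64, "s"), (2, "u"))
print(round(s64_vs_t2), round(s64_vs_u2))  # prints: 34 74
```

The waiting-time weight of 0.278 per second is roughly 1000 urgency points per hour, which is why the 40000 and 80000 queue bonuses translate into waits of tens of hours.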

Tickets

There are many ways to use tickets in grid engine, but we exclusively use the sharetree policy. What little information exists on this can be found in the man page for share_tree. The share tree defines the long-term resource entitlements of users and projects. There are many ways to measure resource usage, but the simplest for our sort of setup is the amount of time a user's program occupies a cpu.

The starting point for a user's entitlement is that all users have an equal right to the resources. Within any project each user should get an equal amount of time, and each project should get an amount of cpu time proportional to the number of people in the project. Maxwell is a little different from past HPCF machines in that it was bought mainly with grant money from a small number of groups. To allow for this, these projects have an increased entitlement. (A similar thing happens on Franklin, where projects that have contributed grant money can have a certain amount of priority usage.)

Every few minutes the scheduler looks at all the jobs that are queued up to be run and divides 10,000 tickets amongst them, based not only on the user's and project's entitlement but also on the past usage of both the user and the project. This accumulated past usage (cpu hours) is decayed with time. The half-life of the past usage is 14 days, so usage has half the weighting after 14 days that it had at the start. The queuing system tries to allocate tickets so that over the long term each user's and project's usage is in proportion to their allocation.
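
The effect of the 14 day decay can be illustrated with a small Python sketch. It assumes a simple exponential decay in which weighting halves every 14 days, as described above; how the scheduler applies the decay internally and turns usage into tickets is not covered. The function names and the usage history are invented for the example.

```python
HALF_LIFE_DAYS = 14  # past usage counts half as much after this many days


def decayed_usage(cpu_hours, age_days, half_life_days=HALF_LIFE_DAYS):
    """Weight of some past cpu usage after it has aged by age_days."""
    return cpu_hours * 0.5 ** (age_days / half_life_days)


def effective_usage(records):
    """Sum of decayed usage over (cpu_hours, age_days) records."""
    return sum(decayed_usage(hours, age) for hours, age in records)


# Hypothetical history: 100 cpu hours used today, 14 days ago and 28 days ago
history = [(100.0, 0), (100.0, 14), (100.0, 28)]
total = effective_usage(history)
print(total)  # 100 + 50 + 25 = 175.0
```

A user who ran heavily a month ago therefore carries only a quarter of that usage into the current ticket calculation, so their share of tickets recovers over time.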

This, I am afraid, is the most detailed description of the ticket allocation I can offer at this time.