Estimated Tasks "Time To Complete" Are Very Wrong
BOINC uses part of the task headers that store the estimated fpops (floating point operations recorded in the <rsc_fpops_est> value) to compute an estimated run time with reference to a host (client) benchmark. If the benchmark is a true and correct reflection of the computational power that is actually applied to the science on hand, then fpops divided by benchmark gives a correct Time To Complete (TTC).
What happens if the task's header fpops estimates are wrong or the host benchmark was miscalculated?
- If the fpops estimate is too high or the benchmark too low, the client scheduler overestimates the TTC.
- If the fpops estimate is too low or the benchmark too high, the client scheduler underestimates the TTC.
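The relationship between fpops, benchmark and TTC can be sketched in a few lines of Python. This is only an illustration of the division described above. The function name and all numbers are made up for the example and are not taken from BOINC itself.

```python
def estimated_ttc_seconds(rsc_fpops_est: float, benchmark_flops: float) -> float:
    """Estimated run time: estimated operations divided by operations per second."""
    if benchmark_flops <= 0:
        raise ValueError("benchmark must be positive")
    return rsc_fpops_est / benchmark_flops


# Hypothetical task and host: 3.0e13 operations on a 2.0e9 flops benchmark.
true_fpops = 3.0e13
true_flops = 2.0e9

baseline = estimated_ttc_seconds(true_fpops, true_flops)
fpops_too_high = estimated_ttc_seconds(2 * true_fpops, true_flops)
benchmark_too_low = estimated_ttc_seconds(true_fpops, true_flops / 2)
benchmark_too_high = estimated_ttc_seconds(true_fpops, 2 * true_flops)

for label, seconds in [
    ("baseline", baseline),
    ("fpops too high", fpops_too_high),
    ("benchmark too low", benchmark_too_low),
    ("benchmark too high", benchmark_too_high),
]:
    print(f"{label:>20}: {seconds / 3600:.2f} h")
```

Doubling the fpops estimate or halving the benchmark both double the displayed TTC, while doubling the benchmark halves it.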
As a consequence, BOINC learns from these variations and stores the deviation in a value called the (Result) Duration Correction Factor (rDCF).
The next time a job arrives, it uses the fpops estimate in the task header and applies the rDCF to provide an adjusted TTC.
The rDCF starts out at a value of 1.000000. Once a task completes and its actual run time is known, the value changes as follows:
- If jobs take longer, the rDCF is aggressively increased.
- If jobs take less time, the rDCF is slowly reduced.
The reason for this asymmetry is that the rDCF is also used to estimate the amount of work requested from the servers. The logic is as follows:
- If recent work has taken much longer than estimated and the rDCF were not raised quickly, too much work would be buffered by the Cache/Additional buffer function, since new work is always assumed to deviate the same way as completed work. The risk is that deadlines are missed by the time the last received work gets its turn.
- If the work takes much less time than estimated, too little work is buffered, but no harm is done.
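The asymmetric adjustment described above can be sketched as follows. This is a minimal illustration, not the client's actual code: the rule of jumping straight up to the observed ratio and the 0.1 weight for downward moves are assumptions chosen to show "fast up, slow down", and the function names are hypothetical.

```python
def adjusted_ttc(raw_estimate_s: float, rdcf: float) -> float:
    """TTC shown to the user: the raw fpops-based estimate scaled by the rDCF."""
    return raw_estimate_s * rdcf


def update_rdcf(rdcf: float, raw_estimate_s: float, actual_s: float,
                down_weight: float = 0.1) -> float:
    """Return the new rDCF after a task finishes.

    Assumed rule for illustration: raise at once to the observed ratio,
    but move only a fraction (down_weight) of the way when lowering.
    """
    ratio = actual_s / raw_estimate_s
    if ratio > rdcf:
        return ratio  # aggressive increase
    return (1 - down_weight) * rdcf + down_weight * ratio  # slow decrease


rdcf = 1.0
# A task estimated at 10,000 s actually takes 50,000 s: the factor jumps to 5.
rdcf_after_long = update_rdcf(rdcf, 10_000, 50_000)
# The next task runs exactly as estimated: the factor only eases down a little.
rdcf_after_normal = update_rdcf(rdcf_after_long, 10_000, 10_000)

print(rdcf_after_long, round(rdcf_after_normal, 3))
print(adjusted_ttc(10_000, rdcf_after_normal))
```

With these assumed weights, one long-running task inflates every later estimate fivefold, and it takes many normal tasks to bring the factor back towards 1.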
As we have experienced a few times, work is sometimes split into the wrong sizes through no one's fault (sizing is very hard for non-deterministic calculations). Either the tasks run much longer and need a multiple of the estimated computations, or they are heavily overestimated and take half the estimated run time or less.
What happens then is that, whilst WCG maintains a running average of fpops for each project (used in the headers of new work), BOINC was never geared to maintain an rDCF per sub-project. It keeps just one for all of WCG. So, if one sub-project goes haywire on its estimates, the rDCF starts to affect the estimated TTC for all the other projects as well. Thus, if FAAH runs 5x longer than the fpops in the task header suggest, the following HPF2 job with a similar fpops estimate is assumed to do the same and gets the inflated TTC.
On August 4, 2008 knreed announced that, in order to mitigate this effect, future batches of work from projects that WCG knows to produce substantially variable run times will first get a limited sample distribution. They will be sent to known and reliable clients. Based on the actual result data, either the fpops estimate in the headers is adjusted or the batch is further cut to size, in order to keep the overall average run time within a target range, e.g. 7 or 8 hours.
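The FAAH/HPF2 spill-over can be shown with a small sketch. The 8-hour raw estimates are hypothetical, and the upward jump is simplified to "take the larger of the old factor and the observed ratio", which is an assumption made for illustration.

```python
# One shared correction factor for all WCG sub-projects (the situation described above).
shared_rdcf = 1.0

# Hypothetical raw (fpops / benchmark) estimates, in seconds, for two sub-projects.
raw_estimate_s = {"FAAH": 8 * 3600.0, "HPF2": 8 * 3600.0}

# A FAAH task runs 5x longer than its header suggested.
faah_actual_s = 5 * raw_estimate_s["FAAH"]
observed_ratio = faah_actual_s / raw_estimate_s["FAAH"]

# Simplified upward adjustment: the shared factor jumps to the observed ratio.
shared_rdcf = max(shared_rdcf, observed_ratio)

# The next HPF2 task inherits the inflated factor although HPF2 itself was well sized.
hpf2_displayed_ttc_s = raw_estimate_s["HPF2"] * shared_rdcf
print(f"HPF2 displayed TTC: {hpf2_displayed_ttc_s / 3600:.0f} h")
```

The HPF2 task is displayed at 40 hours instead of 8, purely because of the FAAH deviation.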
We are going to modify our processes going forward (starting today) so that we send out a limited number of workunits for each batch as soon as the batch is ready to be loaded and sent to members. This work will be sent to the reliable hosts so that we can get information about the behavior of that work as soon as possible. This process will limit the impact to the member community as we should be able to identify surprises like this before we send out tens of thousands of 'surprises'.
Future developments will allow sizing of work according to computational power, so that a weaker machine will have approximately the same run time as a power-cruncher. The result would be something which in the extreme is similar to the RICE project, where tasks run 8 hours no matter what computer.
knreed explains further on July 31, 2008
Yes - there are actually a lot of advantages to doing this. We have been working with David Anderson and BOINC to get this capability added. David has done a lot of work on this already and the folks at Superlink@Technion! are the first BOINC project to put the new code into production. We will be updating our servers to utilize the new code later this year.
Once we have the code, the server will assess the 'effective power' of the computer requesting work and try to send it work that won't take it more than a day or so. Effective power is the raw power of the computer * the amount of time that BOINC is allowed to run work on the computer.
Once we have tested this and feel good about it, we will modify how we create workunits so that there is a lot of variation in the size and computers will be able to get the appropriate size of work. This will reduce load on our servers as we will be able to send bigger workunits to those powerful always on computers and it will improve our ability to effectively use those computers that are less powerful and are only on infrequently (and thus have a hard time completing work currently).
So it is a definite advantage to do this and we are anxious to get this in place.
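The "effective power" idea from the quote above can be sketched as follows. Only the formula (raw power times the share of time BOINC is allowed to run) comes from the quote. The function names, the list of workunit sizes and the one-day limit as a hard cut are assumptions for illustration, not the server's actual selection logic.

```python
from typing import Optional, Sequence

SECONDS_PER_DAY = 86_400


def effective_power(raw_flops: float, on_fraction: float) -> float:
    """Raw power multiplied by the fraction of time BOINC may run work."""
    if raw_flops <= 0 or not 0.0 < on_fraction <= 1.0:
        raise ValueError("raw_flops must be > 0 and on_fraction in (0, 1]")
    return raw_flops * on_fraction


def largest_fitting_size(sizes_fpops: Sequence[float], raw_flops: float,
                         on_fraction: float, max_days: float = 1.0) -> Optional[float]:
    """Pick the biggest workunit the host can finish within max_days of wall-clock time."""
    power = effective_power(raw_flops, on_fraction)
    fitting = [s for s in sizes_fpops if s / power <= max_days * SECONDS_PER_DAY]
    return max(fitting) if fitting else None


# Hypothetical workunit sizes in floating point operations.
sizes = [2.0e13, 8.0e13, 3.2e14]

# A powerful, always-on host versus a weak host that is on a quarter of the time.
big_host_choice = largest_fitting_size(sizes, raw_flops=4.0e9, on_fraction=1.0)
small_host_choice = largest_fitting_size(sizes, raw_flops=1.0e9, on_fraction=0.25)
print(big_host_choice, small_host_choice)
```

In this sketch the always-on host is offered the largest workunit and the infrequently used host the smallest, so both finish in roughly the same wall-clock time.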
- The client benchmark is re-evaluated once every 5 days (120 hours wallclock).
- Though excessively long tasks have a "Safety" that allows them to run extra long, all have a cut-off factor of between 6 and 10 times the original estimated computations (fpops) needed to complete. This prevents them from running ad infinitum, particularly on clients that are not being attended. The factor is determined by dividing the task's <rsc_fpops_bound> value by its <rsc_fpops_est> value. For instance, the sample below results in a cut-off factor of 10, leading to an abort with "Exceeded CPU time limit" if computations are performed beyond this point.
- As of BOINC server version 700 and client version 7.xx, a project can lock the host rDCF with the <dont_use_dcf> instruction. When this is done, the value is set to 1.000000. The benefit is that, when a project has highly variable run times and/or multiple sub-projects, the rDCF does not careen up and down based on what is currently being processed, which would lead to over- or under-caching on the client side. Instead, the server tracks the recent project average run times for each science and sends that mean run time with the tasks to the hosts. This results in stable work buffering, with projected run times moving only slowly up or down. World Community Grid has opted to activate this function.
Sample of individual job controls as sent by servers to client showing original fpops and the 10 fold time out value:
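As an illustration of how the factor is derived, here is a minimal Python sketch. The fragment and its numbers are made up for the example and are not actual values sent by the servers. Only the two tag names come from the text above, and wrapping them in a <workunit> element is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical job-control fragment; the values are invented for illustration.
SAMPLE = """
<workunit>
    <rsc_fpops_est>3.0e13</rsc_fpops_est>
    <rsc_fpops_bound>3.0e14</rsc_fpops_bound>
</workunit>
"""


def cutoff_factor(workunit_xml: str) -> float:
    """Divide the fpops bound by the fpops estimate to get the cut-off factor."""
    root = ET.fromstring(workunit_xml.strip())
    est = float(root.findtext("rsc_fpops_est"))
    bound = float(root.findtext("rsc_fpops_bound"))
    if est <= 0:
        raise ValueError("rsc_fpops_est must be positive")
    return bound / est


factor = cutoff_factor(SAMPLE)
print(f"cut-off factor: {factor:g}")
```

With these invented values the bound is ten times the estimate, so a task would be aborted once it had performed ten times the estimated computations.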