
The following scenario helps explain: you have two radars. One is a wide-angle, general-purpose radar, which sees a missile at some coordinates, travelling at some speed.

The other radar is for target acquisition; it has a very narrow spread ("range gate") and must determine the precise location of the missile given an initial set of coordinates, a velocity, and a timestamp.

In other words, the system is actually aiming at targets differently depending on how long it has been up, tracking so far in front (or behind, I don't recall the details) that it can't acquire the target.



You are describing a problem where two clocks drift relative to each other, not a problem where one clock drifts away from the actual time. The article, on the other hand, gives the impression that the failure occurred because the system failed to measure its uptime exactly, not because clocks in different systems or system components drifted relative to each other.


That is not the impression I got from the article. No logic in the system cares how long it has been up, at least not directly. What matters is drift from its time reference, which is a function of uptime.

Various modules in a complex system like this each have their own clock, which I will refer to generically as a real-time binary counter (RTBC), which the module uses as its event time reference. The RTBC starts at 0 when the module comes up. At some point shortly after coming up the module will check in with its controller, which will send a time-of-day (TOD) message. The module links the TOD message to a particular RTBC tick to create its time reference. At this point the time is free to start drifting relative to the actual wall clock time, until the system is power cycled again.
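A minimal sketch of that anchoring scheme (the class, the method names, and the 0.1 s tick period are all my invention for illustration, not details from the article):

```python
class Module:
    """Hypothetical module clock: a real-time binary counter (RTBC)
    anchored to a time-of-day (TOD) message from the controller."""

    TICK_SECONDS = 0.1  # assumed tick period

    def __init__(self):
        self.rtbc = 0            # RTBC starts at 0 when the module comes up
        self.anchor_tick = None
        self.anchor_tod = None

    def tick(self):
        self.rtbc += 1           # driven by the module's local oscillator

    def receive_tod(self, tod_seconds):
        # Link the TOD message to the current RTBC tick; this pair is the
        # module's time reference until the next power cycle.
        self.anchor_tick = self.rtbc
        self.anchor_tod = tod_seconds

    def now(self):
        # Wall-clock estimate; drifts as the local oscillator drifts.
        return self.anchor_tod + (self.rtbc - self.anchor_tick) * self.TICK_SECONDS
```

The key design point is that after `receive_tod`, the module never hears about wall-clock time again: every later timestamp is derived from its own counter, so any rate error in the oscillator accumulates until the next power cycle.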


That is exactly what I said - different clocks drifting relative to each other. It is completely irrelevant that their one tenth of a second was not exactly one tenth of a second; what matters is that different clocks in the system had different ideas of what one tenth of a second is.


You're applying a principle too broadly. Although the laws of physics don't change under a linear rescaling of time, the system here tracks only a limited set of variables, such as missile velocity, and any effect that is non-linear in velocity is going to hurt you -- for example, aerodynamic behaviour in air depends non-linearly on velocity (through the Reynolds number), and the velocity itself varies with time. Sure, if you rescaled the whole universe this would be compensated by the change in temperature and pressure that a faster time reference would observe, but the system is not simulating the universe, just a limited set of variables.

Also, for obvious reasons of consistency and precision it would be better to keep a standard reference regardless.


I do not think of the problem as scaling the time by a factor - although that is the correct description - but as adding a constant offset. I think this is justified because the small drift is not significant during the relatively brief period in which a target approaches. The offset builds up over time, but only in the parts of the system that did not receive the improved algorithm, and therefore these different parts disagree more and more on what the current time is.


> It is completely irrelevant that their one tenth of a second was not exactly one tenth of a second.

It's very relevant when the module that is off is trying to make telemetry calculations based on target Doppler velocity, which is given in real, SI-standard seconds. There is no clock involved in that. Diverging module clocks amplify the problem.

Also, the ultimate reference is the true definition of a second. All modules are expected to use it, as it is what synchronizes them. It is the clock, and at some level a clock with a faulty definition of the second will be drifting relative to another clock. Your distinction is irrelevant as far as real-time systems are concerned.


You are making a lot of assumptions about how time might be used, but I will ignore that because I have no clue whether that is what really happens.

Let me repeat my point clearly. All clocks will drift away from the actual time. The physics involved and the measurements made do not depend on the current time - they work the same at 14:07 as they do at 23:51, and they will therefore also work the same when the clock of the system has drifted away from the actual time and believes it is 12:34 while it is really 12:35. All that matters is that all parts of the system agree on what the current time is, and that the clock does not drift at such a high rate that measurements and calculations made during a brief period become invalid, i.e. the clock should not report that it took two seconds for the incoming missile to travel one kilometer when it actually took one second.
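The two-seconds-for-one-kilometer case works out like this (illustrative numbers only): a clock running at the wrong rate scales every measured interval, so every computed velocity is scaled by the inverse factor.

```python
distance_m = 1000.0   # incoming missile travels one kilometer
true_dt_s = 1.0       # actual elapsed time
bad_dt_s = 2.0        # what a badly drifting clock reports

true_velocity = distance_m / true_dt_s  # 1000.0 m/s, the real velocity
bad_velocity = distance_m / bad_dt_s    # 500.0 m/s, half the real value
```

As long as every part of the system shares the same (slightly wrong) rate, the error is common to all calculations; it is only when the rate error is large, or when parts disagree, that the numbers become unusable.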

And the article gave the impression - at least to me - that the failure was caused because the system believed it had been up for 100 hours while it had actually been up 340 milliseconds longer, due to an imperfect representation of one tenth of a second. This makes no sense and is not what caused the failure. The failure was caused - as detailed in the other linked article - because one part of the system believed it had been up for 100 hours while another part performed more precise time conversions and knew that it had been up for 100 hours and 340 milliseconds, and this time difference between two parts of the system caused the failure.

For example, one part of the system may have decided that the missile should be launched at 12:00:00.000, and the part responsible for the launch did so according to its own clock, but because of the time difference this happened at 12:00:00.340 according to the clock of the part that made the decision.


My interpretation of the article:

Time is kept as an integer, stepped ten times per second. Such an integer can be represented exactly, so it probably uses the same 24-bit register. For 100 hours this integer would be 3,600,000, which fits into 24 bits with some room to spare. (But it would cap the system's maximum uptime at about 466 hours.)
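The arithmetic behind those numbers, as a quick check (assuming the 10 Hz tick and the 24-bit counter described above):

```python
TICK_HZ = 10
ticks_100h = 100 * 3600 * TICK_HZ     # ticks accumulated in 100 hours
fits = ticks_100h < 2**24             # True: 3,600,000 fits in 24 bits
max_uptime_h = 2**24 / TICK_HZ / 3600 # ≈ 466.0 hours of representable uptime
```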

The wide-arc radar notes location, velocity, and time from the clock above. This output data is still good enough for pinpointing the next position to within about 170 meters (the distance the Scud travels in one 0.1-second step of the clock). The precision radar system had probably accounted for this, and had a wide enough beam to handle this case.

Now, when deciding where to point the next precision beam, the radar multiplies the stored time value (exactly 3,600,000) by 0.1 (which is not represented exactly, but is instead about 0.000000095 less than 0.1) and uses the computed value in further calculations. This floating-point value is now 0.34 seconds less than expected. The precision radar, even though it uses the same clock as above, has an incorrect representation of when the last wide-arc radar update took place, and this propagates to the prediction of where the Scud will be next (which is now off by 0.34 * 1676 ≈ 570 meters). Thus, when it points the beam at where it believes the Scud will be, the Scud is outside the precision radar's cone.
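That error can be reproduced with exact rational arithmetic. I am assuming, per the commonly cited analyses of the incident, that 0.1 was chopped to 23 fractional bits of its binary expansion (a 24-bit register), and using the ~1676 m/s Scud velocity implied above:

```python
from fractions import Fraction

# 0.1 with its non-terminating binary expansion chopped to 23 fractional bits:
BAD_TENTH = Fraction(838860, 2**23)   # = 0.0999999046..., short by ~9.5e-8
TICKS_100H = 100 * 3600 * 10          # 3,600,000 tenth-second ticks

drift_s = float((Fraction(1, 10) - BAD_TENTH) * TICKS_100H)
miss_m = drift_s * 1676               # assumed Scud velocity in m/s

print(round(drift_s, 3))  # 0.343 seconds of accumulated error
print(round(miss_m))      # 575 meters, outside the precision beam
```

The ~570 m figure in the text comes from rounding the drift to 0.34 s before multiplying; the exact product is a bit over 575 m.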

Note that both the wide-arc and precision-beam systems have exact knowledge of the current system time at the point of their respective operations. What fails is the precision beam's calculation of what the wide arc's time reference actually meant.

The ironic part of the article probably refers to some computation using a delta: if both time references ("then" and "now") carry the error, the delta will be small and possibly insignificant. However, if "now" is replaced with a more accurate representation of the clock above, only "then" has the big error, and the delta will be just as far off as the incorrect value above.
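A sketch of that cancellation, using the same chopped 0.1 as in the analyses of the incident (the tick counts are invented for illustration):

```python
from fractions import Fraction

BAD_TENTH = Fraction(838860, 2**23)   # 0.1 chopped to 23 fractional bits
THEN = 3_600_000                      # tick count at "then" (100 h of uptime)
NOW = THEN + 50                       # 5 true seconds later

# Both references converted with the same flawed constant: errors cancel.
delta_consistent = float((NOW - THEN) * BAD_TENTH)
# "now" converted with a corrected (exact) 0.1, "then" left uncorrected:
delta_mixed = float(NOW * Fraction(1, 10) - THEN * BAD_TENTH)

print(round(delta_consistent, 6))  # 4.999995 -- off by only ~5 microseconds
print(round(delta_mixed, 3))       # 5.343 -- inherits the full 0.343 s error
```

This is why a partial fix can be worse than no fix: patching the conversion in one place while older stored timestamps keep the flawed constant stops the errors from cancelling.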

The exact error propagation depends on the order the calculations are performed in, and there's a whole field (numerical analysis) dedicated to controlling these errors. As developers we gladly ignore the problem even when we shouldn't.


> You are describing a problem where two clocks drift relative to each other, not a problem where one clock drifts away from the actual time

The latter is simply a specific case of the former. The second clock is the one measuring 'actual' time, and the drift is relative to it.


Of course, but the point was that drift relative to actual time does not cause problems as long as all clocks in the system drift at the same rate, while the failure was caused by different clocks in the system drifting relative to each other, and therefore at different rates relative to actual time. That is why I treated actual time as a special clock.



