• Andrew..Peterson (8/1/2014)


    The more you are prepared, the less you need it."

    Andrew, the last line in your post reminded me of something from years ago. For jobs that failed, we used to use checkpoint/restart features: we kept a record of the job's state, and when the job was restarted it resumed execution from that state (the checkpoint). Using this technique, we saved a tremendous amount of time in those old Big Iron days.

    Also, in jobs that failed to complete and just ran in loops, the last checkpoint may itself have been the one that caused the loop, wait, or other anomaly. If I remember rightly, and it has been a number of years, we could determine the state of the last correctly completed function or record. Once the data error or logic fault that caused the problem was fixed, we could then fool the checkpoint into restarting just after that last successful step.
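For anyone who has not seen the technique, here is a minimal Python sketch of the idea, not how the mainframe facilities actually worked. It assumes a batch job that processes records in order and stores the index of the last successfully completed record in a small JSON file. The names `load_checkpoint`, `save_checkpoint`, `run_job`, and the `last_done` field are all made up for illustration.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the last record that completed, or -1 if none."""
    if not os.path.exists(path):
        return -1
    with open(path) as f:
        return json.load(f)["last_done"]


def save_checkpoint(path, index):
    """Record that every record up to and including `index` is done."""
    # Write to a temporary file and rename it into place, so a crash
    # in the middle of the write cannot leave a half-written checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"last_done": index}, f)
    os.replace(tmp, path)


def run_job(records, process, path):
    """Process records in order, resuming just after the last checkpoint.

    Returns the index the run started from.
    """
    start = load_checkpoint(path) + 1
    for i in range(start, len(records)):
        process(records[i])       # may raise on a bad record
        save_checkpoint(path, i)  # only reached if the record succeeded
    return start
```

If `process` fails on a record, the checkpoint still points at the last record that completed. After the bad data or logic is fixed, calling `run_job` again picks up at the failed record instead of starting over. Editing the `last_done` value by hand is the rough equivalent of fooling the checkpoint into restarting from a chosen point.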

    After reading this series of posts, I wonder whether the old technique of checkpoint/restart has been lost or forgotten.

    Not all gray hairs are dinosaurs!