Scheduled job running at ~10% speed after IPL/backups

jgoslin

Member
Hello all! We've come up empty-handed so I'm reaching out to the masses for advice. I'm a sysadmin turned BA so forgive me if I initially leave out pertinent information.

We have a fairly hefty custom job that runs extremely slowly at the end of the month, after we run backups and IPL the system. It normally processes about 3800 records per minute, but it drops to about 320 per minute when this happens.

Fact 1: The job in question joins F1201, F1217, F1731 and F17311 to pull asset information. It finds these asset numbers on sales orders in F4211. It performs a simple calculation to estimate when these assets will next be needed, and it writes the results to a custom table.
Fact 2: This job normally processes about 3800 records per minute and completes in 3 hours.
Fact 3: When this job is processing slowly, it only processes about 320 records per minute. This is the source of our despair.
Fact 4: This job runs every day at 12:05am and no users are in the system at this time.
Fact 5: The number of records processed is effectively the same every day. (It increases by about 20 a day at most but we're currently at 518,000.)
Fact 6: We do not have debug logging turned on for this job. When I turn it on, though, it severely impacts performance, slowing the job to roughly the rate we see when it runs slowly in production.
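
To put Facts 2, 3 and 5 side by side, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes the job runs at a constant rate for the whole run, and the variable names are made up for illustration.

```python
# Rough runtime estimate from the figures quoted above.
# Assumption: the processing rate is constant for the entire run.

RECORDS = 518_000      # approximate record count (Fact 5)
NORMAL_RATE = 3_800    # records per minute on a good day (Fact 2)
SLOW_RATE = 320        # records per minute on a bad day (Fact 3)


def runtime_hours(records: int, rate_per_minute: float) -> float:
    """Estimated runtime in hours at a constant records-per-minute rate."""
    return records / rate_per_minute / 60


normal_hours = runtime_hours(RECORDS, NORMAL_RATE)
slow_hours = runtime_hours(RECORDS, SLOW_RATE)
slowdown = NORMAL_RATE / SLOW_RATE

print(f"normal: {normal_hours:.1f} h, slow: {slow_hours:.1f} h, slowdown: {slowdown:.1f}x")
```

By this estimate a normal run takes roughly 2.3 hours of pure record processing (a little under the 3 hours we observe, presumably because of start-up and other overhead), while a slow run would need about 27 hours. That is longer than the 24 hours between scheduled runs, which matches what we see when the job never finishes.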

CNC says that no other jobs are running at the same time as this one, there are no table locks, no zombie kernels, and they also say that there are no access paths being rebuilt during/after the IPL.

The only thing we've been able to link it to is the IPL, and even that link isn't consistent. Last week we IPL'd production and didn't see the job run slowly until a couple of days later. Today in our prototype environment the job ran normally twice, they did the IPL, and now the job runs slowly every time I kick it off. Something is bottlenecking or throttling this job and we still can't find the root cause. We've gone down all sorts of rabbit holes involving creating new indices and modifying the SQL statements it generates, but the point I keep coming back to is that this job runs fine every single day, with the same records and the same code, until they do the IPL and backups on the last weekend of the month. The following Monday all hell breaks loose because the job is so slow that it never completes.

All hints, pointers, and messages of encouragement are appreciated. Thank you!


System: DB2 on flash, iSeries, JDE tools 9.1.5, 5 processors allocated to production
 
Does it only happen on the first run after an IPL?

I would look at your SQL Plan Cache both before and after the IPL. I would also look at your MTIs (Maintained Temporary Indexes) that are built over the tables in question. You may want to make some of them permanent. These two things account for a large proportion of similar problems I have seen.
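
As a starting point for the MTI side, here is a minimal sketch that only builds the text of a query against the index advice catalog, `QSYS2.SYSIXADV`, for the tables named in the original post. Treat the details as assumptions: the schema name `PRODDTA` is a placeholder for your own data library, the column names are as I recall them and should be checked against your release, and actually running the statement needs a DB2 for i connection that is not shown here.

```python
# Sketch only: this builds SQL text, it does not connect to anything.
# Assumptions: "PRODDTA" is a hypothetical schema name, and the SYSIXADV
# column names below should be verified on your own system.

TABLES = ["F1201", "F1217", "F1731", "F17311", "F4211"]


def index_advice_sql(schema: str, tables: list[str]) -> str:
    """Return a query listing index advice and MTI activity for the tables."""
    in_list = ", ".join(f"'{t}'" for t in tables)
    return (
        "SELECT TABLE_NAME, KEY_COLUMNS_ADVISED, TIMES_ADVISED,\n"
        "       MTI_CREATED, MTI_USED, LAST_ADVISED\n"
        "FROM QSYS2.SYSIXADV\n"
        f"WHERE TABLE_SCHEMA = '{schema}'\n"
        f"  AND TABLE_NAME IN ({in_list})\n"
        "ORDER BY TIMES_ADVISED DESC"
    )


print(index_advice_sql("PRODDTA", TABLES))
```

Saving the result of a query like this shortly before an IPL and again after the first slow run gives you something concrete to compare. Key columns that show heavy MTI use before the IPL are the usual candidates for permanent indexes.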

If I'm speaking Greek, go ahead and PM me. My main business is JDE performance tuning on the IBM i.

Tom
 
I have run this job 5 times since the last IPL and it's running slow every time. CNC has turned on performance monitoring in this environment to collect information about the MTIs. Are there specific screenshots or logs I need to ask them for?
 
If your business view contains all four tables, are any of the joins outer joins? I've sometimes refactored a UBE to use a subsection join to get the detail table data, especially when the detail table was originally the right-hand table in a left outer join. Also, you can grab a SQL statement on the fly without going into debug mode. DB2 has a pretty good Visual Explain facility that can help point out issues. Run the captured SQL statement through it both before and after the IPL to see whether the optimizer uses a significantly different plan, similar to what Tom suggested.
 
CNC is telling me that Visual Explain is a tool only for UNIX, Linux, and Windows, but we do have a left outer join in the view this job uses.

I was running this job repeatedly in PY and after the last IPL it ran slow 5 times in a row across two days. Yesterday they did a build in PY and now the job is back to running at a normal speed. Very weird.
 
If you've run the job multiple times, it sounds more like a caching/tuning issue to me. You will want to look at the memory pools and at what is running at the same time. You can find the MTIs in the Index Advisor.
 
Visual Explain runs on Windows and Linux/UNIX, but it pulls its data from the IBM i. Why does CNC think this is an issue?
 
Well, I found out that our Visual Explain "doesn't work". It throws an error every time they try to launch it, and they say it has never worked. Aside from that, we're testing whether adding indexes helps after an IPL. Indexes seem to have helped another job that started blowing up, so we're testing them on this job before and after an IPL. Something still doesn't feel right to me, but I'll post an update when I have more concrete evidence.
 