There are 2 different testing tools - Autopilot and Virtual Autopilot (although they are pretty much the same tool really).
Autopilot is used on a FAT Client to test the functionality of JD Edwards products. Remember that Autopilot runs underneath the physical GUI - it sends data and commands directly to the API level. This means that if a GUI object has a bug, you won't find it through Autopilot. The other important factor to consider is that Autopilot is designed to run on top of a single copy of OneWorld. It does not run on Citrix, nor does it run on top of Java. Therefore it cannot realistically stress test the application (unless 10 users is stress testing to you!). Autopilot does, however, have the ability to package up information about an application error for forwarding to JDE Support. I hear this is very cool (though I haven't used it myself yet).
Virtual Autopilot is the product that was developed from Autopilot for the JD Edwards benchmarking labs. It takes an Autopilot script and allows the user to run many copies of the script from a workstation, emulating a certain number of users. None of the GUI is utilised at all - VAP hits the API level directly. This allows you to stress test OneWorld's functionality (mostly at the API and data structure level) - but not how users operate (Citrix/Java thin clients).
Hence the note stating it's not worth the paper it's written on? It all depends on what you want the tool to do. Autopilot is great for customers wanting to set up scripts, VAP is great for stressing the processes, but Macroscheduler on top of a terminal server is probably best for QA of the product itself, because Macroscheduler sits above the GUI (just an opinion of course).
The much-improved Xe tool has to be used in context with documentation (eg, business cases, modelling, etc). Otherwise it loses its relevance and becomes a maintenance nightmare.
When the user changes how they do things, the business cases/models & the script must all be updated or they become worthless. The way to do this is to propose and test the changes with the scripting tool so that the script EXACTLY matches what the user should do. And then implement what the script does and the user must do it that way -- not an easy process.
Remember the tool was developed by JDE to test using a known data set (the JDE demo data) and for specific modular functionality. This is easy for them but hard for real-world users, who don't have a known data set, where things frequently change, and where the test needs to span several modules.
Interesting comment, rekblad, on Autopilot being developed around demo data rather than real-world data. I'm not sure what you mean, however - could you expand? I'd certainly like to know your experiences of Autopilot with your own data, and whether there were any restrictions on Autopilot because the data was different. I'd always hoped that Autopilot would work directly with the foundation rather than at the data layer - but thinking about it, I can understand where some issues would arise.
Another person that we both know told me that the Autopilot scripts are fine
except that they are tied to the data and the application flow. If the data
changes, one may have to do a lot of work to make the scripts work again.
If the application flow changes, one may have to do a lot of work to make
the scripts work again. In his opinion, it is a toss up whether it is
better to try to fix the scripts or to start over. In the case of small
changes, fixing the scripts is okay but for anything like a release change
or a change from JDE demo data to real customer data, rewriting would
probably be better and may be the only choice. In the case of a small
application mod, you might be able to fix a pre-existing script. If you
didn't have a script, or if your test data changes, or if the app change
isn't small, then the scripts may not apply - in other words, you may have
to create new scripts.
If someone is taking the position that JDE ships scripts that can be used by
all customers with only minor changes, that position is in conflict with
actual experience. The shipped scripts run against demo data and imply the
demo setup. A customer should not test app changes or CRP setup with JDE
demo data, they should test with their CRP data or a subset (or copy) of
prod. That might require rewriting the JDE scripts (see above). If the
customer has any application mods, that may require rewriting the scripts.
Rewriting all the scripts is not a small effort - especially for someone who
doesn't have experience building up a big test. You may recall that during
benchmark tests we have worked on, we always threw out non-critical
functionality and restricted the number of cases to the absolute minimum
needed to demonstrate our point. That was the only way to make it possible
to perform the test. If a customer tries to test all their ordinary cases,
it would require a large effort and cost. We have both worked on tests that
cost in excess of $250,000 just to demonstrate a single capacity point. I
think you know what I mean.
I think I understand where this was going. My understanding was the tool was independent of the data - which of course is correct - but the supplied JD Edwards scripts obviously work only with the JDE demo data. I would therefore not expect to modify scripts if I were wanting a reliable testing mechanism - I would instead be creating new scripts from the start.
One retort (!) I do have for you, Richard, is that the benchmarking scripts were actually very complex compared to customer requirements. We stress tested apps like Sales Order Entry with many, many sales orders (hundreds of thousands) that needed to be extremely randomized. A customer may feel comfortable testing their custom Sales Order Entry application with a dozen orders or even fewer! In theory, customers should feel comfortable that JDE has benchmarked large transaction loads, since these fundamentally demonstrate the scalability and reliability of the product.
If the scripts exist to validate that all the possible "ordering
combinations" (not configurator) are valid and if there are a lot of
combinations, then the scripts, and the corresponding configuration
information in the database, may be very complex. I mean that the script
must deliberately exercise every path and, if path number 217 fails, we have
to know that it was path 217 that failed and what path 217 does (how 217 is
different from 216 and 218) so that we can fix the corresponding
configuration information stored in the database.
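The "exercise every path" idea above can be sketched in a few lines of Python. Everything here is a hypothetical placeholder - the option names and the `validate` stub stand in for whatever configuration dimensions and application checks a real test would drive - but the shape of the problem is the same: number every combination so that, if path 217 fails, you know it was 217 and what 217 does.

```python
import itertools

# Hypothetical configuration dimensions; a real test would use the actual
# setup combinations stored in the database.
order_types = ["SO", "SD"]          # placeholder order types
pricing = ["standard", "advanced"]  # placeholder pricing schemes
routes = ["truck", "rail"]          # placeholder shipping routes

def validate(combo):
    # Placeholder check; a real test would run the combination through
    # the application and compare the result against expectations.
    return True

failures = []
for path_no, combo in enumerate(
        itertools.product(order_types, pricing, routes), start=1):
    if not validate(combo):
        # Record WHICH numbered path failed and what it exercises, so the
        # corresponding configuration data can be fixed.
        failures.append((path_no, combo))

print(len(list(itertools.product(order_types, pricing, routes))))  # 8 paths
print(failures)  # [] when every path passes
```

With only three binary-ish dimensions this is 8 paths; with realistic numbers of order types, pricing rules, and routes the combinations (and the matching database setup) explode, which is exactly the complexity being described.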
If the purpose of the scripts is to validate capacity or scalability within
a given hardware and network structure, randomized orders of about the right
size would work fine. The randomized orders may drive very complex
processing such as advanced pricing and preference profile (a function of
what the customer is willing to set up or port for the test) with a pretty
simple user interface. In this case, 2 order types with randomly selected
customers and items may be quite sufficient. Compared to the first goal,
this takes a lot less set-up effort and the scripts are much simpler.
I believe that Autopilot is touted as a way to verify or prove a CRP setup.
A script can be executed to verify that today's changes to the CRP database
configuration don't break yesterday's configuration and setup - in other
words, when you are done you can run all the scripts to prove that
everything works correctly. That requires deliberate design, execution,
management, and monitoring of the scripts, the database, and the results. I
am not comfortable enough with Autopilot to know if I can run hundreds or
thousands of cases through it while maintaining adequate controls of the
test input, processing, and outputs. I am not comfortable enough with
Autopilot to know if I can generate my test scripts based on a model or
prototype. If the scripts can only be hand-created, and if there is no
pass/fail or duration monitoring and no result tracking, then what use is the tool?
Jon, you and I built some doozy scripts and databases and ran some really
large tests - lots of machines and simulated users. I think that, for the
most part, our databases were large but pretty simple. Some real customers
have very complex database configurations (many possible kinds of sales
orders, different pricing rules, different accounting setups, transportation
routing, etc) that we never had the time to duplicate (by the way, other
customers don't have such complex setups). If the database setups are
simple like ours were, simple scripts and simple test management tools will
suffice. If the world is very complex and the requirement is to prove that
all possible combinations work, then the number of scripts, the underlying
database configuration and content, and the test management becomes
difficult. I am not trying to say that our testing was not representative
only that it represented the customers with simple requirements very well
and the customers with very complex requirements not as well.
It may be that you and I have a different view of this - you think that the
sales orders were complex and I think that they were pretty simple. We
should resolve this using Guinness and arm-wrestling; perhaps we could even
have ... a curry! In fact, I insist that we are disagreeing - where is
there good draft Guinness in Denver? I'm in town every weekend ... pick one.
There's not really any good Guinness in Denver, you know - I'm always ready to be proven wrong on this matter (hee hee)
One of my last roles for WWAT was to counter, internally for the benchmarking group, exactly one of the arguments you made about scalability testing: I made a decision to eradicate any possibility of bad benchmarking at WWAT by ensuring that all tests and benchmarks both ran the OneWorld code and were honest representations of a true enterprise environment.
Hence the introduction of Macro-scheduler testing that sat above the GUI code and hammered OneWorld as if it were a real user - hitting the same Dr Watsons and memory issues as a real user would - and because this was happening in a lab environment, Development were forced to help out as much as possible. Believe me, everyone knew of the instability issues of B733 base. B7331 was the first Macro-scheduler run, and in the lab we discovered hundreds of issues before the release was distributed.
You are correct, however. The 18 standard scripts that JD Edwards used were extremely "simple" - and most important of all, were extremely disjointed. They had no representation of a real customer and after 200,000 sales orders had been entered, nothing was done with them in the application. The 18 scripts did touch on the majority of the GUI however - hence they were a great test of the toolset and foundation.
However, Richard, after you left (and just before I left) - I decided to create a high-watermark benchmark that would once and for all prove JD Edwards scalability - and would actually use the application as if it were running at a customer site.
The idea struck me after the successful running of the Fortune 1 benchmark. In that, we pulled in 8 RS6000 Application Servers and ran UBEs on them to process a large number of configurator sales orders against a huge 12-processor RS6000 Database Server. We pulled in a large number of application specialists to set up the applications themselves, generated huge amounts of randomized data, and ran the data through the RS6000s using OneWorld's UBEs.
When I started the reports on this benchmark, I realized that what we were actually doing was running huge numbers of kernels in parallel against specific sets of data - and each kernel was pumping business function calls. Now, when OneWorld is running in true physical 3-tier, the same thing happens with interactive users. A user will create business function calls from their Client to the Application Server, which in turn will process the data autonomously. The Client itself does very little of the business end of the work - the kernels on the Application Servers process the real data.
Hence I worked out that if we can start scaling up the number of kernels on the Application Server against the Database Server, the number of Client connections that generate the App Server calls is really not as important... one just needs to know how many users a single Terminal Server or Web Server can support, then work out how many TRANSACTIONS a Database/Application Server configuration can support. By moving toward a transactional model, one can truly understand how to architect OneWorld for a very large customer.
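As a rough illustration of that transactional sizing model, the two questions (users per front-end server, transactions per application/database configuration) decompose into simple ceiling arithmetic. All the numbers below are hypothetical placeholders, not measured JDE figures:

```python
# Sketch of the transactional sizing model: size the front end by users
# per server, size the back end by transactions per hour. The capacity
# figures in the example call are made up for illustration.

def servers_needed(target_tx_per_hour: int, tx_per_app_server: int) -> int:
    """Application servers needed to sustain a target transaction rate."""
    return -(-target_tx_per_hour // tx_per_app_server)  # ceiling division

def terminal_servers_needed(concurrent_users: int, users_per_ts: int) -> int:
    """Terminal/Web servers needed to host the client sessions."""
    return -(-concurrent_users // users_per_ts)  # ceiling division

# Example: 100,000 tx/hour and 500 concurrent users, assuming (purely for
# illustration) 25,000 tx/hour per app server and 50 users per terminal
# server.
print(servers_needed(100_000, 25_000))   # 4 app servers
print(terminal_servers_needed(500, 50))  # 10 terminal servers
```

The point of the model is that the two halves are sized independently: the client count only determines the front-end tier, while the transaction rate determines the application/database tier.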
It was important to ensure that this kind of benchmark was not only useful for prospective customers to gauge OneWorld's scalability but also independent of any specific industry vertical. Too many times we had heard "but that's not my business" - even though it was unimportant. Tony and I decided, therefore, to ensure that what we benchmarked not only used some VERY advanced functionality in JD Edwards, but also that the fictitious company we created was not representative of any customer that JD Edwards could possibly sell to. In fact, we made the business model so ludicrous that it was OBVIOUSLY not any customer's "business" and instead they would be forced to look at the detail of the benchmark, not the high-level view.
The benchmark was codenamed "Moonshot" - and the fictitious company was named "Apollo Inc". The benchmark was centered around Distribution - and we created a model that had so many transactions that it would be extremely difficult for any hardware provider to meet the goal.
The business model was the following...
1. Apollo Inc has 2 warehouses - each with 30,000 locations (10 aisles, 6 bins/aisle, 500 locations/bin)
2. Apollo sells ~5,000 different items (moondisks) - to ~50,000 distributors from around the US
3. Apollo has numerous web servers taking the orders and entering the orders into EDI format
4. In 8 hours, 200,000 orders with 84 lines apiece (some 16.8 million order lines) need to be processed from Order to Cash. All orders come from a distributor (hence the large # of lines)
5. No two order lines are the same (completely randomized on object #, customer # and Qty)
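As an illustration of point 5, fully randomized order generation of this kind can be sketched as below. Only the item and distributor counts come from the Apollo Inc model above; the field names and quantity range are assumptions:

```python
import random

def random_order_line(rng: random.Random) -> dict:
    """One order line, randomized on item, customer and quantity."""
    return {
        "item": rng.randrange(1, 5_001),       # ~5,000 items in the model
        "customer": rng.randrange(1, 50_001),  # ~50,000 distributors
        "qty": rng.randint(1, 100),            # arbitrary placeholder range
    }

def random_order(rng: random.Random, lines: int = 84) -> list[dict]:
    """An order of 84 lines, as in the Moonshot model."""
    return [random_order_line(rng) for _ in range(lines)]

rng = random.Random(42)  # seeded, so a failing run can be replayed exactly
order = random_order(rng)
print(len(order))  # 84
```

Seeding the generator matters for a benchmark like this: the data must be random across the run, but a failed batch has to be reproducible for troubleshooting.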
The Business Process flow used the following modules:
1. Sales Order Entry
2. Advanced Warehouse Management (picking)
3. Advanced Transportation (shipping)
4. Sales Update to GL
The Sales Order Entry process was set up to either use Advanced Pricing (discounts based on quantities) or NOT (depending on the user). Both ON and OFF were tested.
Advanced Warehouse Management would pick from random warehouse locations using either FIXED or RANDOM logic (depending on which of the 2 warehouses was used).
Advanced Transportation assigned a carrier based upon location - weight was calculated and shipment costs were assigned as a backorder.
Invoices were then generated to reflect new freight costs and eventually updated against AR.
One of the most important, fundamental differences between Moonshot and the other benchmarking practices is that a mistake in any part of the process would reflect throughout the entire procedure. We also purposely introduced a small number of invalid orders to verify that the procedure was 100% accurate.
The goal of the project was to create a benchmark environment that could eventually be scaled (based on known technology) linearly to 2,000,000 order lines per hour. The proof of concept scaled from no more than 40,000 order lines per hour (Adv Pricing off) and 25,000 order lines per hour (Adv Pricing on) with 4 processors, to 100,000 order lines per hour with 16 processors - though with later benchmarks (performed at HP) this scaled to 485,000 lines per hour last November (http://biz.yahoo.com/prnews/001121/co_jd_edwa.html) with 48 processors. Our extrapolation was that 300 processors would be required for 2 million transactions per hour (still accurate) and that a 24-processor RS6000 would handle the database load.
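For what it's worth, those figures can be sanity-checked with a little arithmetic. A perfectly straight-line extrapolation from the 48-processor HP result would suggest fewer processors than the 300 quoted; the quoted estimate is simply more conservative, which seems sensible given how much per-processor throughput varied across the measured points (and across hardware):

```python
# Per-processor throughput at each measured point, then a straight-line
# extrapolation from the 48-processor result to the 2M lines/hour target.
# Figures are the ones quoted in the post above.

points = {4: 40_000, 16: 100_000, 48: 485_000}  # processors -> lines/hour

for cpus, lines_per_hour in points.items():
    print(cpus, round(lines_per_hour / cpus))  # per-processor throughput

target = 2_000_000
per_cpu_at_48 = points[48] / 48          # ~10,104 lines/hour per processor
linear_cpus = round(target / per_cpu_at_48)
print(linear_cpus)  # ~198 processors if scaling stayed perfectly linear
```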
These results are pretty much exactly as customers see the performance of OneWorld. It is interesting to note (and you'd be proud, Richard) that since B7332 it is extremely difficult to see the limitations on Next Numbering anymore. Fascinating, since we used to hit limitations at 50 users once upon a time!
Another important aspect of this benchmark is that not only are the transactions real (and thousands of pages of PDFs are produced), but the entire documentation of Moonshot was designed to be open so that any competitor could follow suit and try to beat the benchmarks. Why? Because the initial data is EDI based, and therefore ERP independent! All another vendor has to do is enter the financial model of the company by performing a CRP (all pretty simple!)
Lastly, from what I understand, a platform partner is seriously considering a project to reach the 2 million mark. This would blow any other benchmark clear out of the water - as if it were necessary. 500,000 order lines per hour is pretty scalable - I think that Amazon MAY do 2 million order lines a WEEK over 24-hour peak periods!
Well, that's my little 2c. I agree with you wholeheartedly, Richard, that ERP vendors should try to benchmark based on real customers - and I tried to achieve that with the high-watermark benchmark "Moonshot". I hope that the legacy of these benchmarks continues - but my fear is that the industry is changing, and with the introduction of Microsoft to the industry, benchmarks will become less and less real and more and more marketable.
It's all getting interesting, as one partner would put it.
I agree about the Guinness in Denver but sometimes mediocre Guinness is better
than a cup of warm ... beer.
When I started OneWorld testing, we hit next number issues with 2 users.
We've talked about this before and, in my opinion, the discussion is not
appropriate for this venue.
I agree that your moonshot test was well conceived. I didn't see the
execution and I don't think that I have your design notes so I will step
back from boundless praise but I know you and I know Tony so I am pretty
confident that I would trust your results.
The correct way to test applications for scalability is the way you did it.
A complete business process, representative database size and diversity,
hardware overbuilt so that you can find genuine limits, throw in some trash
to find out how the systems (hardware and software) handle it, good
instrumentation and reliable drivers.
It was a good test. I did miss seeing the integrity testing for the data and the
troubleshooting guide that reflected the changes as the workload scaled up -
I wish I had been there! Did you ever crash any of the server boxes
while testing? That is the best!
Interestingly enough - as far as I recall, we hardly had any actual issues with OneWorld. Apart from the standard Application setup issues in the proof of concept (mostly with bad data) - there were very few technical issues.
Challenges that appeared were :
1. How to monitor progress without affecting the database
2. How to start all processes smoothly without colliding at the beginning (gradual startup)
3. How to transfer gigabytes of data around in reasonable timeframes
4. Database tuning
Of course the biggest challenge was the network (as usual!) - how to cram gigabits of database requests into a single Database Server. We eventually ended up with dedicated 4x100Mb cards - 6 front-end 4-way servers had dedicated links to the Database Server, meaning the Database Server had 24 network slots. All was fed through a mega-Cisco box. I think I was recommending 16Gbit Ethernet cards for the big test in the database machine...
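The back-of-envelope arithmetic on that layout is worth spelling out (these are just the numbers from the description above, multiplied through):

```python
# Network layout described above: six front-end 4-way servers, each with
# a dedicated 4 x 100 Mbit link to the database server.

front_ends = 6
links_per_server = 4
mbit_per_link = 100

total_links = front_ends * links_per_server
aggregate_mbit = total_links * mbit_per_link

print(total_links)     # 24 dedicated links, matching the 24 network slots
print(aggregate_mbit)  # 2400 Mbit/s aggregate into the database server
```

So the database server was absorbing up to ~2.4 Gbit/s of aggregate request traffic across 24 physical links, which explains both the slot count and the appetite for much fatter NICs in the bigger test.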