How to Create a Mini Cluster with GeneXproServer
One of the hot topics in the data mining field is distributed computing and parallel processing of large volumes of data. Since GeneXproTools is a desktop application with a very high level of interactivity, it is not possible to distribute the processing to remote servers and efficiently maintain the same level of interactivity. Distributed systems usually have poorer interactivity, comparable to querying a remote database server: you run the query and wait for the result without an inkling of how long the process will take.
Since we knew that several customers wanted an automated way of processing runs, we introduced GeneXproServer. GeneXproServer is an add-on to GeneXproTools that creates runs and processes them according to a script called a job. The job is defined in XML and can include commands to change settings, load data from databases and files, run external programs before and after a run, and test the generated models. Each job can create as many runs as you want. A very simple example of a job would be:
<job filename="BreastCancer.gep" path="c:\jobs\job6" feedback="1000">
<run id="1" stopcondition="generations" value="1000"/>
<run id="2" stopcondition="generations" value="1000"/>
</job>
In addition to the job you also need to create a seed run. This is a regular GeneXproTools run ready to be processed. The GeneXproServer documentation explains in full how to create job files and operate GeneXproServer.
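When a job needs many runs, writing the XML by hand gets tedious. The following is a minimal sketch, not part of GeneXproServer, that generates a job like the one above. It uses only the element and attribute names visible in the example (job, run, filename, path, feedback, id, stopcondition, value); the function name build_job is hypothetical, and any other settings a real job may need should be taken from the GeneXproServer documentation.

```python
import xml.etree.ElementTree as ET


def build_job(seed_filename, job_path, feedback, generations_per_run):
    """Return job XML with one <run> element per entry in generations_per_run."""
    job = ET.Element("job", {
        "filename": seed_filename,
        "path": job_path,
        "feedback": str(feedback),
    })
    # Run ids start at 1, as in the example job.
    for run_id, generations in enumerate(generations_per_run, start=1):
        ET.SubElement(job, "run", {
            "id": str(run_id),
            "stopcondition": "generations",
            "value": str(generations),
        })
    return ET.tostring(job, encoding="unicode")


if __name__ == "__main__":
    # Reproduces the structure of the two-run example job.
    print(build_job("BreastCancer.gep", r"c:\jobs\job6", 1000, [1000, 1000]))
```

The output is a single line of XML; it may need reformatting or extra elements before GeneXproServer accepts it.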
The first step is to install GeneXproServer on each server of the cluster. The GeneXproServer installation file contains a second installation file that allows you to install GeneXproServer on a computer that does not have GeneXproTools: the Standalone Installer. First you need to install GeneXproServer on a computer that already has GeneXproTools installed and then run the file GeneXproServerStandAloneSetup.exe on each of the servers. This file can be found at C:\Program Files\GeneXproServer 1.0\ for x86 computers or at C:\Program Files (x86)\GeneXproServer 1.0\ for x64 computers.
The setup file can be installed manually or in silent mode using the following flags:
The first one shows a progress indicator whereas the second is totally silent.
GeneXproServer ships with two user interfaces: a lightweight Windows interface and a command line interface (CLI). The command line program is called GeneXproServerConsole and on a default installation can be found at C:\Program Files\GeneXproServer 1.0\GeneXproServerConsole.exe for x86 computers or at C:\Program Files (x86)\GeneXproServer 1.0\GeneXproServerConsole.exe for x64 computers. The Windows interface is useful to create and test job files, but this tutorial only covers the command line interface.
The second step, which is optional, is to add the location of GeneXproServerConsole to the computer’s path. At this point each server should be ready to run GeneXproServer manually.
Remote processing primer
There are several commercial cluster management products with advanced functionality on the market and, if you are planning to run a large cluster with dozens of servers, you should definitely invest in one. But these frameworks can be costly and have a steep learning curve, which may be overkill for small installations. That is why we present an essentially free alternative using VBScript and Windows Management Instrumentation (WMI). Both technologies are installed by default on Windows XP, Windows Server 2003, Windows Vista and Windows Server 2008, making it possible to create temporary clusters out of any desktop machine available.
WMI also works with Windows 2000 with Service Pack 2 or later but it is not available on older operating systems.
To try these technologies you will need to have all the computers’ security and networking properly set up, and you will need administrative privileges on all the computers of your cluster. These steps are not covered in this tutorial, which assumes that you already have a Windows network in place and a properly configured domain.
Finally, you should establish a consistent folder structure across the cluster. To use the script below you will need to create a folder named “jobs” at the root of the C drive of each server and share that folder. You will also need to adjust the permissions of the share to let your user account write to that location. You should also ensure that there is enough disk space available, counting one run file per run in the job plus a reasonable amount free.
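The disk space rule of thumb can be turned into a quick check. The sketch below is only an illustration and rests on assumptions: it treats every run file as roughly the size of the seed run (real run files may grow as models are stored), the 500 MB headroom is an arbitrary placeholder, and the function names are hypothetical. It is meant to be run on the server itself against its local jobs folder.

```python
import shutil


def required_bytes(seed_size_bytes, run_count, headroom_bytes):
    """One run file per run in the job, plus free headroom."""
    return seed_size_bytes * run_count + headroom_bytes


def has_room(folder, seed_size_bytes, run_count,
             headroom_bytes=500 * 1024 * 1024):
    """True if the drive holding `folder` has enough free space for the job."""
    free = shutil.disk_usage(folder).free
    return free >= required_bytes(seed_size_bytes, run_count, headroom_bytes)


if __name__ == "__main__":
    # Assumed numbers: a 2 MB seed run cloned into 2 runs, checked against
    # the current folder (on a server this would be c:\jobs).
    print(has_room(".", 2 * 1024 * 1024, 2))
```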
The “cluster script”
The script that runs the cluster (GeneXproServerCluster.vbs) is very simple but not very resilient. It is a VBScript file that runs on all the operating systems stated before, and it takes care of creating the job folder on each computer of your cluster, copying the job definition file and the seed run across, and kicking off the process.
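To make the sequence of steps concrete, here is a dry-run sketch in Python rather than VBScript. It is not the downloadable script and executes nothing; it only lists what would happen on each server. The server names, the file name job.xml, the "job" plus number folder naming (modelled on the c:\jobs\job6 path of the example job) and both function names are assumptions. How GeneXproServerConsole is invoked is not shown in this article, so the launch step records only the local path of the job definition file.

```python
from pathlib import PureWindowsPath


def job_folder(server, job_number):
    """UNC path of the job folder inside the shared "jobs" folder."""
    return PureWindowsPath(rf"\\{server}\jobs") / f"job{job_number}"


def deployment_plan(servers, job_number, job_file, seed_file):
    """List the steps for each server, in order, without executing anything."""
    steps = []
    for server in servers:
        remote = job_folder(server, job_number)
        # The same folder as seen from the server's own C drive.
        local = PureWindowsPath(r"c:\jobs") / f"job{job_number}"
        steps.append(("mkdir", server, str(remote)))
        for name in (job_file, seed_file):
            steps.append(("copy", name, str(remote / name)))
        steps.append(("launch", server, str(local / job_file)))
    return steps


if __name__ == "__main__":
    for step in deployment_plan(["pc1", "pc2"], 6, "job.xml", "BreastCancer.gep"):
        print(step)
```

Keeping the plan separate from its execution makes it easy to review the folder and file names for one server before touching the whole cluster.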
Before you start you need to edit the top of the script to replace the following information:
jobNumber. Each job that runs across the cluster should have a different number. If you don’t change the number between runs the script will ask for permission to delete the previous runs’ files.
jobFile. This is the XML file with the job instructions. If you reuse the file in the download you don’t need to change this parameter.
jobSeed. This is the GeneXproTools run file that you want to clone and run across the cluster.
clusterArchitecture. This setting can take one of three values: x86, x64 or mixed. If all the servers in your cluster have 32-bit operating systems then choose x86; if all of them have 64-bit operating systems then choose x64; and if it is a mixed environment with both x86 and x64 operating systems then choose mixed. In this last case you will also need to add the location of GeneXproServer to the path of each of the servers. The default is x86.
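The effect of clusterArchitecture can be pictured as a choice of executable path. The sketch below is an assumption about how such a setting might be resolved, not an excerpt from GeneXproServerCluster.vbs: it uses the default installation folders given earlier for x86 and x64, and for mixed it returns the bare executable name, which is why the path of each server must then contain the GeneXproServer folder. The function and constant names are hypothetical.

```python
CONSOLE_EXE = "GeneXproServerConsole.exe"

# Default installation folders, as listed earlier in this tutorial.
INSTALL_DIRS = {
    "x86": r"C:\Program Files\GeneXproServer 1.0",
    "x64": r"C:\Program Files (x86)\GeneXproServer 1.0",
}


def console_command(cluster_architecture="x86"):
    """Return the console executable to launch for the given setting."""
    if cluster_architecture == "mixed":
        # No single folder fits every server, so rely on each server's path.
        return CONSOLE_EXE
    if cluster_architecture not in INSTALL_DIRS:
        raise ValueError("expected x86, x64 or mixed, got %r" % cluster_architecture)
    return INSTALL_DIRS[cluster_architecture] + "\\" + CONSOLE_EXE


if __name__ == "__main__":
    for arch in ("x86", "x64", "mixed"):
        print(arch, "->", console_command(arch))
```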
Running the cluster
With all the installation steps out of the way, it is time to join the various parts into a running cluster. You need to create a job file and one seed run that will be part of the cluster job. The folder samples contains example job files, but for this tutorial I will use the job above, which creates two runs of 1,000 generations each.
Copy the script file, the job definition file and the seed run to a folder in your computer and double click the GeneXproServerCluster.vbs file. If all the setup is correct then you will see the following output (with different computer names, of course):
Note that even though the screen shows one server being processed after another, the script does not wait for GeneXproServer to finish the run, in essence kicking off parallel runs a few moments apart.
Most of the problems with this setup are security or firewall related. It is a good idea to try one server at a time and, if the script does not launch GeneXproServer, to look up these articles published by Microsoft to help troubleshoot WMI problems:
“WMI isn’t working”
This one has a long, unrelated name but covers changing the firewall configuration of Windows XP SP2 computers to let WMI work its magic.
Windows Vista and Windows Server 2008 also have specific instructions on how to allow WMI connections.
If you get the error:
Error creating a run in servername
The remote server machine does not exist or is unavailable
Start by double-checking that the server name is correct in the script. If it is, then try disabling the firewall or changing its configuration according to the articles above. These articles should cover most problems, but if you get stuck then email us at email@example.com and we will do our best to get you going.
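Before changing firewall settings, it can help to find out whether the server is reachable at all. Remote WMI connections go through DCOM, which starts by contacting the RPC endpoint mapper on TCP port 135, so a plain connection test against that port is a reasonable first probe. The sketch below is a hypothetical helper, not part of the download. A successful connection does not prove that WMI will work, since further ports and permissions are involved, but a failure points to a wrong server name, a machine that is switched off, or a firewall.

```python
import socket


def can_connect(host, port=135, timeout=3.0):
    """True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers name resolution failures, refusals and timeouts.
        return False


if __name__ == "__main__":
    # "servername" is a placeholder for one of the cluster machines.
    print(can_connect("servername"))
```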
Zip file containing the script, a job definition file and a sample run.
May 29, 2008