Monitoring And Analyzing Distributed Cluster Performance And Statistics Of ATLAS Job Flow
Grid3 is a Grid facility used by many High Energy Physics experiments to enable physicists to process data-intensive and CPU-intensive jobs more effectively and efficiently. Grid3 is notable for several reasons: more than 25 participating sites across the U.S. and Korea collectively provide more than 2000 CPUs; its resources are used by seven scientific applications, including three high energy physics simulations and four data analyses in high energy physics, biochemistry, astrophysics, and astronomy; more than 100 individuals are currently registered with access to the Grid; and it has achieved a peak throughput of 500-900 concurrently running jobs with a completion efficiency of approximately 75%. Because each application and organization utilizing Grids has different measures of efficiency and different parameters to consider for scheduling, such as the number of successfully completed jobs, turnaround time, and the number of idle processors, scheduling on any Grid still needs to be tailored to individual cases. The ATLAS experiment is a High Energy Physics experiment that utilizes the services of Grid3, which is now migrating to the Open Science Grid (OSG). This thesis provides monitoring and analysis of performance and statistical data from the individual distributed clusters that combine to form the ATLAS Grid; these data will ultimately be used to make scheduling decisions on this Grid. The system developed in this thesis uses a layered architecture so that anticipated future developments or changes to the existing Grid infrastructure can utilize this work with minimal or no changes. The starting point of the system is the existing scheduling that is performed manually for ATLAS job flow. We have provided additional functionality based on the requirements of the High Energy Physics ATLAS team of physicists at UTA.
The system developed in this thesis has successfully monitored and analyzed distributed cluster performance at three sites and is awaiting access to monitoring data from three additional sites.