Rough plan for summer project.

Phases 1a and 1b can be done simultaneously. It seems like the rest need to be done sequentially.


Phase 1a:

Summary

Integrate Tycoon with PlanetLab node manager. See ticket for a superset of what we need.

Current progress:

  • wrote pseudo-code for virtualization module; code incorporates current AdminAPI for authentication, but only pseudo-code for slice management and tracking.
  • wrote code for slice tracking, now running on PL node grouse.hpl.hp.com. It polls Slicestat every second and stores cumulative counters for each slice in a separate file.
  • finally upgraded PL nodes with python2.4
  • have current node manager API and Resource Pool interface from PLC (Steve Muir) in order to finish above code.
  • have a static resource pool of "200" (probed) from PLC under the "tycoon_root_pool" (see link for example on how this works).
  • Sean has agreed to help get us traces from [www.opendht.org OpenDHT]
  • Bookkeeping code apparently will be permanent since the needed information will not be exported from Slicestat.
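
The slice-tracking bookkeeping mentioned above (poll every second, keep cumulative counters for each slice in a separate file) could look roughly like the following. This is a minimal Python 3 sketch, not the code running on grouse.hpl.hp.com: `fetch_sample`, the metric names, and the JSON file layout are all assumptions made for illustration, and no real Slicestat output format is parsed here.

```python
import json
import os
import tempfile
import time


def update_counters(counter_dir, sample):
    """Add one poll's per-slice usage to the cumulative counter files.

    `sample` maps slice name -> {metric name: amount used since the last poll}.
    Each slice gets its own JSON file holding running totals (assumed layout).
    """
    os.makedirs(counter_dir, exist_ok=True)
    for slice_name, usage in sample.items():
        path = os.path.join(counter_dir, slice_name + ".json")
        totals = {}
        if os.path.exists(path):
            with open(path) as f:
                totals = json.load(f)
        for metric, amount in usage.items():
            totals[metric] = totals.get(metric, 0) + amount
        # Write to a temporary file and rename it into place, so a crash
        # in the middle of a write does not leave a half-written counter file.
        fd, tmp_path = tempfile.mkstemp(dir=counter_dir)
        with os.fdopen(fd, "w") as f:
            json.dump(totals, f)
        os.replace(tmp_path, path)


def poll_forever(fetch_sample, counter_dir, interval=1.0):
    """Call `fetch_sample` (a hypothetical Slicestat reader) every `interval` seconds."""
    while True:
        update_counters(counter_dir, fetch_sample())
        time.sleep(interval)
```

Keeping one file per slice means a reader interested in a single slice never has to parse the others; the cost is one small write per slice per second.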

Current problems:

  • PLC (Larry and Vivek) have agreed to export the needed monitoring information from Slicestat so that we don't have to do the redundant (and less accurate) bookkeeping.

  • need to write account/integration for PL->Tycoon node users.
  • to integrate Tycoon into PlanetLab, we have two solutions: TPS (Tycoon as a PlanetLab service) and TPF (Tycoon and PlanetLab federation). TPS is supposedly ready now (9/13/06); TPF is not and will require work from both us and PLC.

Current proposal/plan:

  • integrate some subset of dedicated Tycoon nodes with PlanetLab (this has become TPF).

  • Integrate Tycoon with PlanetLab (PlanetLabNodeManager): we now have guaranteed 10% CPU and 2MBs BW. Need to rewrite PlanetLabNodeManager.py.
  • Create command-line and web GUI (XML-RPC server, synchronous). SLS essentially falls out of using Best Response.
  • deploy by Nov. 15.
  • should redo CPU benchmark on new "guaranteed 10% CPU share".

Description: what we want is for PlanetLab users to start bidding for resources. The best way to do this is to offer a resource that they value and that has little overhead to use. Can we install some number of nodes as PlanetLab machines and have Tycoon manage all of the resources on them? We would need to turn off the bit that allows any slice to add the node (except infrastructure ones like Co*). This way, when a PlanetLab user logs in, they can only add these machines if they bid (we'll export this interface to them and set up their accounts seamlessly, as Kevin and I had discussed), and the PlanetLab SLS (i.e. SWORD and CoMon) will automatically be in place to discover these nodes.

We'll need:

  • an ok from PLC to do this (i.e. add the nodes and also be allowed to control them): did not get.
  • an ok from HP to use their nodes this way: did not get.
  • to make sure we can control which slices can add the nodes.

Other random (unresolved) questions:

  • is this CPU boost enough? I've run some initial test scripts, and I can see upwards of a 6x increase in CPU share (from CPU% on SliceStat). However, the script is completely CPU bound, so this is clearly an upper bound. The impact will be smaller if 1) other people get a CPU boost, 2) other people start using the slice (i.e. the boost will mean less), and 3) a job uses I/O. So far there has been one post on the PL mailing list asking about this (i.e. that Sirius doesn't help as much as advertised), so I e-mailed asking for the specifics of their experiment.
  • Need interpretation of resource pool share numbers from Steve.
  • Can we test whether or not turning off the "Add-node to slice" bit does anything?
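
One rough way to reason about the "is this CPU boost enough?" question is an Amdahl's-law-style estimate: only the CPU-bound part of a job speeds up when its CPU share grows. The Python 3 sketch below is an assumed back-of-the-envelope model, not a measurement; `cpu_fraction` and `share_boost` are hypothetical parameter names, and the model ignores contention from other slices that also get a boost.

```python
def effective_speedup(cpu_fraction, share_boost):
    """Estimate end-to-end speedup when a job's CPU share grows by `share_boost`.

    cpu_fraction: fraction of the job's original run time spent CPU bound (0..1).
    share_boost:  multiplicative increase in CPU share (e.g. 6.0 for "6x").
    Assumption: non-CPU time (I/O, network waits) is unaffected by the boost.
    """
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    if share_boost <= 0:
        raise ValueError("share_boost must be positive")
    return 1.0 / ((1.0 - cpu_fraction) + cpu_fraction / share_boost)


if __name__ == "__main__":
    # A fully CPU-bound script sees the whole boost; an I/O-heavy job sees less.
    for fraction in (1.0, 0.75, 0.5, 0.25):
        print(fraction, round(effective_speedup(fraction, 6.0), 2))
```

Under this model the 6x figure is recovered only when the job is entirely CPU bound, which matches the observation that the test script gives an upper bound.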

Phase 1b:

Summary

Simulation of benefit of Tycoon to a service in a PlanetLab-like environment. Primary candidate is OpenDHT.

Current progress:

  • in the process of receiving trace data
  • set up and am running OpenDHT and a client live on the Tycoon cluster to generate graphs.
  • understand rough sketch of how to use Best Response bidding agent.
  • baseline and propshare graphs have uncovered an unexplainable bug, probably in Xen.
  • have paper draft/outline that describes experiments in greater detail.

Current plan:

  • normalize data (i.e. set up proper environment between old and new)
  • design bidding agent

Description: The simulation study will take on the following theme. The aforementioned Worlds paper described some of the hacks that had to be done to improve the usability of a service like OpenDHT. I'd argue that a market can help make this evolve naturally. For example, forcing users to express preferences, and then constraining those who don't, will theoretically allow prioritized access for users of OpenDHT. This means that people who find high latencies (i.e. on the order of seconds) unacceptable can try to work around them. That example was from the perspective of the DHT clients. From the perspective of OpenDHT itself, it can do some smart bidding (i.e. bid high on loaded nodes, and on at least one node in each replica group) to try to ensure a reasonable level of performance. This second example is what we want to show, since the measurements wouldn't rely on particular utility functions.

So, here is how we could proceed. In a simulation environment, show how the new OpenDHT algorithm performs compared to the old OpenDHT (this should reproduce the findings in the paper, as a sanity check of the simulation environment). Then, use old OpenDHT + market to see if we can achieve at least the same level of performance under whatever assumptions we had to make about currency and background traffic. The background traffic can probably be modeled similarly to the work by Oppenheimer in Usenix '06.
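
The "smart bidding" idea for OpenDHT (bid high on loaded nodes, and on at least one node in each replica group) could start from something like the Python 3 sketch below. Everything here is an assumption for illustration: the function name, the `(replica_group, load)` input shape, the fixed low/high bid levels, and the choice of the busiest member when a group has no loaded node. It is not the Best Response agent.

```python
def choose_bids(nodes, load_threshold, low_bid, high_bid):
    """Pick a bid per node.

    nodes: dict mapping node name -> (replica_group, load).
    Nodes at or above `load_threshold` get `high_bid`; all others get `low_bid`.
    Any replica group left without a high bid then has its busiest member
    raised to `high_bid` (a hypothetical tie-breaking choice).
    """
    bids = {}
    groups = {}
    for name, (group, load) in nodes.items():
        bids[name] = high_bid if load >= load_threshold else low_bid
        groups.setdefault(group, []).append(name)

    for members in groups.values():
        if not any(bids[m] == high_bid for m in members):
            busiest = max(members, key=lambda m: nodes[m][1])
            bids[busiest] = high_bid
    return bids


if __name__ == "__main__":
    example = {
        "a": ("g1", 0.9), "b": ("g1", 0.2),
        "c": ("g2", 0.3), "d": ("g2", 0.1),
    }
    print(choose_bids(example, load_threshold=0.8, low_bid=1.0, high_bid=5.0))
```

A real agent would also have to respect a budget and react to prices; this sketch only captures the placement rule from the description.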

Note: it's not clear that there will in fact be benefit. It is important to distinguish the variation of load across different nodes in the DHT from the variation of load across time (on all the nodes). Markets can potentially provide benefit in either case. The description above concentrates on the former.


Phase 2:

Summary

Begin the experiment, which involves simulation-based (Phase 1b) and deployment-based data. The point of the experiment is to compare standard proportional-share scheduling versus the current Tycoon market scheduling from the standpoint of a service like DHT. If DHT can use something like Tycoon, we want to understand how much better we can do. We can control the Tycoon nodes that have been integrated with PlanetLab.
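
To make the comparison concrete, the two allocation rules can be contrasted in a few lines of Python 3. This is a simplified sketch under stated assumptions: "standard proportional-share" is modeled as an equal split among active slices, and the market case is modeled as shares proportional to bids. The function names and the example bids are hypothetical, and the real schedulers involve more than this.

```python
def equal_share(slices):
    """Baseline: every active slice gets the same fraction of the node."""
    slices = list(slices)
    if not slices:
        raise ValueError("need at least one slice")
    return {name: 1.0 / len(slices) for name in slices}


def bid_proportional_share(bids):
    """Market case: each slice's fraction is its bid over the sum of all bids."""
    total = sum(bids.values())
    if total <= 0:
        raise ValueError("total bid must be positive")
    return {name: bid / total for name, bid in bids.items()}


if __name__ == "__main__":
    bids = {"dht": 3.0, "s1": 1.0, "s2": 1.0, "s3": 1.0}
    print(equal_share(bids))             # DHT gets the same share as everyone else
    print(bid_proportional_share(bids))  # DHT's share grows with its bid
```

The experiment then asks how much the larger, bid-driven share on selected nodes improves what a DHT client actually observes.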

Steps needed to be taken include:

  • simulation-based experiments will require an application trace from something like OpenDHT. Need to talk to Sean Rhea to obtain this.
  • deployment-based data will involve getting a trace of usage from current Tycoon Grid users and whichever PLab users we can convince to use it.

Note: We have to understand how OpenDHT uses existing load data to do its own load balancing for its clients. It will be important to distinguish the "load signals" that market prices can provide from the other benefits (prioritization) that having market prices provides.


Phase 3: (optional)

Summary

Experiments involving multiple simultaneous instantiations of DHT.