Tuesday, May 19, 2009

Service dynamics: the lazy man's way

The Problem

There is no doubt that the hardest topic in OSGi is how to deal with service dynamics. In this article I will give you the complete epic story of my suffering and enlightenment on the subject. I will start with the basic nature of the problem and then present two different ways to solve it. There are two key factors that make service dynamics fiendishly hard to get right.

Concurrency

Before I go further I feel obliged to explain one basic and somewhat startling fact: the OSGi container in practice does not run threads of its own! It is merely a "dead" threadsafe object structure on the heap. The "main" thread is used to set up this structure and start the initial set of bundles. Then it goes to sleep - its only function is to prevent the JVM from shutting down due to lack of living threads. The "main" thread is typically awakened only at the end of the container shutdown sequence, when all other threads are supposed to be dead. It is used to perform some final cleanup before it also dies and lets the JVM exit.

This means that all useful work must be done by threads started from bundles during that initial startup sequence. I call these "active bundles". Usually the majority of bundles are "passive bundles". These don't start threads from their BundleActivator.start(). Instead they set up the imports of some service objects, which are then composed into new service objects, which are finally exported. After the start() call returns the bundle just sits there and waits for a thread to call its exported services.

As elegant and lightweight as all this might be, it also means that the OSGi container does not enforce any threading model - it steps aside and lets the bundles sort it all out between themselves. The container object structure acts as a "passive bundle" (a bundle with ID 0 in fact), getting animated only when a thread from an "active bundle" calls in to perform some interaction with another bundle or with the container itself. Because at any time a random number of threads can call into the OSGi core, the container implementers have their work cut out for them. You as an application coder are also not exempt from the suffering.
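
To make the passive pattern concrete, here is a minimal sketch of such a bundle. The Translator and Greeter types are made up purely for illustration: the activator imports one service, composes it into a new object and exports the result, all inside the single start() call, and starts no threads of its own.

import org.osgi.framework.BundleActivator;
import org.osgi.framework.BundleContext;
import org.osgi.framework.ServiceRegistration;

/* Hypothetical service interfaces used in the sketches throughout this article. */
interface Translator { String translate(String phrase); }
interface Greeter { void greet(String who); }

/* A composite service: our export built on top of an imported service. */
class TranslatingGreeter implements Greeter {
  private final Translator translator;
  TranslatingGreeter(Translator translator) { this.translator = translator; }
  public void greet(String who) {
    System.out.println(translator.translate("Hello") + ", " + who);
  }
}

/* A passive bundle: import -> compose -> export, starting no threads of its own. */
public class GreeterActivator implements BundleActivator {
  private ServiceRegistration registration;

  public void start(BundleContext bc) {
    /* Import: a naive lookup that assumes the Translator is already registered and
       never goes away - fixing exactly this assumption is what the rest of the
       article is about. */
    Translator translator =
        (Translator) bc.getService(bc.getServiceReference(Translator.class.getName()));

    /* Compose and export: from now on threads of other bundles may call the Greeter. */
    registration = bc.registerService(
        Greeter.class.getName(), new TranslatingGreeter(translator), null);
  }

  public void stop(BundleContext bc) {
    registration.unregister();
  }
}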

The concurrency factor is then this: at all times an OSGi application is subjected simultaneously to two independent control flows. These are the "business logic flow" and the "dynamic management flow". The first one represents the useful work done by the app and has nothing to do with OSGi. Here you choose the design (thread pools, work queues etc.) and code your bundles to follow the rules. The second control flow however is entirely out of your hands. It takes place whenever some management application you don't know about plays with the lifecycle of the bundles (this includes installing and uninstalling). Often there is more than one such application - each with its own threading rules, just like your app. Some examples include provisioning over HTTP, management and monitoring over JMX, even a telnet console interface. Each of these can reach through the OSGi core, call BundleActivator.stop() on a bundle you depend on, and cause the withdrawal of a service you require. When this happens you must be ready to cooperate in the release of the service object. Why this odd arrangement is necessary is explained by the second factor I mentioned.

Direct service references

The second factor has to do with the way objects are exchanged between bundles. Here again OSGi is non-intrusive and lightweight: an importing bundle holds a direct reference to the object owned by the exporting bundle. The chief benefit of this design is that OSGi does not introduce method call overhead between bundles - calling a service is just as fast as calling a privately owned object. The downside is that the importing bundle must cooperate with the exporting bundle to properly release the service object. If an importer retains a reference to the dead service multiple harmful effects take place:

  • Random effects from calls to the half-released service object.
    Because the service object is no longer backed by a bundle, calling it can yield anything from wrong results, to random runtime exceptions, to some flavor of IllegalStateException the exporter has chosen to use for marking invalid services.
  • Memory leaks because of ClassLoader retention.
    The ClassLoader of the exporter bundle will remain in memory even if the bundle is uninstalled. Obviously each object on the heap must have a concrete implementing class, which in this case is provided by the dead bundle. This leak will happen even if the importer sees the service object through an interface loaded from a third library bundle.

All this means that the importer must track the availability of the service at all times and release all references to the service object when it detects it is going down. Conversely when the service goes back online it must be picked up and propagated to the points where it is used.

The deadly combination

Now let's examine in detail how Concurrency and Direct Service References play together when a service is released. Because we have two execution flows (concurrency) which access the same object reference (direct references), we must synchronize carefully. To aid you in this matter OSGi notifies importers about service state changes in the same thread that executes the service unregistration (i.e. synchronously). In other words the management control flow passes directly through the ServiceListener of the importer. This allows the management flow and the business flow to meet inside the importer bundle. Such rendezvous points are critical because the importing bundle can use a private lock to prevent race conditions over the service object reference. If the management flow obtains the lock first (by entering the ServiceListener) it will block the business flow and flush clean any references to the dying service. After the cleanup the business flow will usually resume with a RuntimeException notifying it that the service is gone. Conversely, if the business flow obtains the lock first it will block out the management flow and complete the current call to the dying service. In this case we count on the service exporter to first unregister the service and only then release its resources. If this sequence is followed the service will be fully usable during the last call the business flow makes before it is blocked out by the management flow. Notice that from the point of view of the importer, service dynamics are all about crashing safely.
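
Here is a minimal sketch of this rendezvous, reusing the hypothetical Greeter interface from the earlier sketch; a complete, reusable version of the pattern appears later in this article as ServiceHolder. Both flows must take the same private lock - here the importer object itself - before touching the service field.

import org.osgi.framework.BundleContext;
import org.osgi.framework.ServiceEvent;
import org.osgi.framework.ServiceListener;

/* Importer-side rendezvous: both control flows synchronize on "this". */
class GreeterImporter implements ServiceListener {
  private final BundleContext bc;
  private Greeter greeter;   // the direct reference to the imported service

  GreeterImporter(BundleContext bc) { this.bc = bc; }

  /* Business control flow. */
  synchronized void greetSafely(String who) {
    if (greeter == null) {
      /* Crash safely: the service is gone and we have already dropped it. */
      throw new RuntimeException("Greeter service is not available");
    }
    greeter.greet(who);   // completes before the management flow can enter
  }

  /* Management control flow: delivered synchronously during unregistration. */
  public synchronized void serviceChanged(ServiceEvent e) {
    /* (Listener registration, the initial lookup and the REGISTERED case are omitted.) */
    if (e.getType() == ServiceEvent.UNREGISTERING && greeter != null) {
      bc.ungetService(e.getServiceReference());
      greeter = null;   // flush the reference to the dying service
    }
  }
}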

What if service events were delivered asynchronously? Well then the management flow would place an event on some queue and destroy the service without waiting for the clients to release it. Until the importers were notified by some event delivery thread they would be able to call the service while it is being destroyed. To prevent this from happening the exporter would have the additional responsibility to mark the service object as invalid so it can reject clients by tossing exceptions at them. Now we have code to check service validity and throw exceptions in both the exporter and the importer. Also this would likely require all methods of the service object to be synchronized by a common lock. Such a lock would be a coarse-granularity lock because it is accessed by all importing code. As such it distorts the concurrency design of the application more than the multiple finer-granularity locks used by the individual importers.

Even under the current synchronous event dispatch it is sometimes useful to place invalidation code in your services. This adds additional safety against badly coded importers. For example, if a set of bundles forms an independent reusable component you can place additional safety code in the services intended for external use, while keeping the services used between the constituent bundles simple.
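
A hedged sketch of such exporter-side invalidation, again using the hypothetical Greeter interface: the exporting bundle flips a flag right after it unregisters the service, and every service method checks that flag before doing any work.

/* Exporter-side safety net: reject callers once the service has been withdrawn. */
class SafeGreeter implements Greeter {
  private volatile boolean valid = true;

  /* Called by the exporting bundle immediately after ServiceRegistration.unregister(). */
  void invalidate() {
    valid = false;
  }

  public void greet(String who) {
    if (!valid) {
      throw new IllegalStateException("Greeter service has been unregistered");
    }
    System.out.println("Hello, " + who);
  }
}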

The solution

So far I went to great pains to describe..err..the pain of service dynamics. Now that you are hurting let us discuss the remedy. For now I have exhausted the subject of the correct importer behavior. To recap: the importer must track the service and guarantee atomic service object swaps. No matter what other policies we invent we simply must follow this rule to be safe. Now let's add to this a service export policy. The sum of an import and an export policy should form a comprehensive doctrine about dealing with service dynamics. I will explore two export policies with their corresponding doctrines.

Eager

This school of thought shoots for safe service calls. Its motto is "To export a service is to announce it is ready for use". Consider what this means for services that are composed of imported objects. These objects are called "required services". A service can also be "optional" - e.g. logging. Under the eager motto, when a required service goes down the export is no longer usable, so it must also be withdrawn from the OSGi service registry. This goes the other way too - when the required service comes back online the composite service must be registered once more. This results in cascades of service registrations and unregistrations as chains of dependent services come together and fall apart. Implementing this dynamic behavior varies from hard to exceptionally hard. The problem is that the imports and the exports have to come together into common tracking objects with the proper synchronization. Quite often this dynamic dependency management is further compounded by the need to track events from non-service sources (for example tracking a dynamic configuration, waiting for it to become valid).
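
To give a flavor of the required boilerplate, here is a hedged sketch of eager behavior built on the standard ServiceTracker utility, reusing the hypothetical Translator/Greeter/TranslatingGreeter types from the earlier sketch: the composite Greeter is registered only while its required Translator import is present and is withdrawn as soon as it goes. The tracker's open() and close() would be called from BundleActivator.start() and stop().

import org.osgi.framework.BundleContext;
import org.osgi.framework.ServiceReference;
import org.osgi.framework.ServiceRegistration;
import org.osgi.util.tracker.ServiceTracker;

/* Eager export: the composite service "flickers" together with its required import. */
class EagerGreeterExporter extends ServiceTracker {
  private ServiceRegistration registration;   // non-null only while we are exported

  EagerGreeterExporter(BundleContext bc) {
    super(bc, Translator.class.getName(), null);
  }

  public synchronized Object addingService(ServiceReference ref) {
    Translator translator = (Translator) context.getService(ref);
    if (registration == null) {
      /* The required import appeared: compose our service and announce it. */
      registration = context.registerService(
          Greeter.class.getName(), new TranslatingGreeter(translator), null);
    }
    return translator;
  }

  public synchronized void removedService(ServiceReference ref, Object service) {
    if (registration != null) {
      /* The required import is going away: withdraw our export, cascading the
         unregistrations further up the dependency chain. (A real implementation
         would also fail over to another Translator if one is still around.) */
      registration.unregister();
      registration = null;
    }
    context.ungetService(ref);
  }
}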

Let us suppose we manage to write all of this boilerplate for each of our bundles. Now imagine how a thread races through the OSGi container when it executes the business control flow (e.g. useful work). It will start its path from some active bundle that drives the particular application. As soon as it calls a service object it will leave its "home" bundle and enter the bundle that exports the service. If that service is in turn implemented via other services the thread will hop into the bundle of each one, and so on. For the original service call to succeed each and every hop must also succeed. It turns out we are trying to achieve a kind of transactional behavior - a call to a chain (or more generally a tree) of services either fully succeeds or cannot be made in the first place because the root service is not registered. Under such strong guarantees the active bundle knows ahead of time (or eagerly) that a certain activity can't be performed and can take alternative actions. E.g. rather than react to an error it directly performs the respective error handling. I suppose by writing the complicated import-export boilerplate we avoid writing some exception handling code and don't need to worry about cleanup after a partially successful service call.

Unfortunately this idea of safe service dynamics is completely utopian. The eager model simply can't work, and this is the main point I want to hammer in hard with this blog. Imagine some management control flow kicks in and stops a bundle two hops removed from the current position of the business control flow. Since the business flow has not yet entered the stopped bundle it will not be able to block the management flow from taking down its services. As a result our thread will run into a wall. Obviously no amount of "local" synchronization by each individual bundle along the path will guarantee the integrity of the entire path. What is needed is a third party - a transaction manager of sorts - to lock the entire path before the business flow starts traversing it. Since such a manager does not currently exist we can conclude that service flickering can't prevent errors caused by disappearing services.

This raises the question of whether there is some other benefit to justify the complexity caused by service flickering. We could argue that although we can't guarantee that a service call will succeed, at least service flickering can tell us the precise moment after which a service call is guaranteed to fail. This allows us to perform various auxiliary reactions right after a required service goes down. For example, if a bundle draws buttons in your IDE and a direct or transitive dependency goes away, it can pop a dialog or hide the buttons from the toolbar. Without the cascading destruction of the service chain the buttons will be right there on the toolbar and the user will get exceptions every time he clicks. I say this return does not even approach our huge boilerplate investment. Remember that this only works if every bundle along the dependency chain behaves eagerly - we have a lot of boilerplate to write. This becomes even more ridiculous if you consider the additional complications. Why should we blow the horn loudly during a routine bundle update that lasts 2 seconds? Maybe we should just "flicker" the buttons on the toolbar and postpone the dialog until the failure persists for more than 10 seconds. Should this period be configurable? Also, who should react - only active bundles or every bundle along the service chain? Since we don't want to get flooded by dialogs (or other reactions in general) we must introduce some application-wide policy (read "crosscutting concern"). In short we have paid a lot to get back a dubious benefit, and as a side effect have introduced a brand new crosscutting concern in our otherwise modular application.

Lazy

This approach defines the service export as "To export a service is to declare an entry point into the bundle". Since the export is merely a declaration it does not require any dynamic flickering. We simply accept that calling a service can result in an exception because of a missing direct or transitive service dependency. I call this model "lazy" because here we do not learn about a missing service unless we try to call it. If the service is not there we simply deal with the error. The complete dynamics doctrine then becomes:

  • Explicit registration is used only during bundle startup.
    Generally a bundle should follow this sequence in BundleActivator.start():
    1. Organize the import tracking (as described below).
    2. Build the bundle internal structure. Store its roots in non-final fields in the activator.
    3. If this is an active bundle, start its threads.
    4. If there are objects to export, register them now and store their ServiceRegistrations in non-final fields in the activator.
    Upon completion of this sequence the bundle is started and hooked to the service registry. Its internal structure is spared from garbage collection because it is referenced from within the activator, and the activator in turn is referenced from the OSGi container. Now the management control flow can leave the activator and go about its business. If the bundle has started some threads to execute the business flow they can continue doing their work after the activator is no longer being executed. (A combined start/stop sketch follows this list.)
  • Importers fail fast
    Every imported service must be tracked, and all code that uses the service must be synchronized with the code that swaps the service object in and out. When an attempt is made to call a missing service a RuntimeException is thrown. This exception is typically called ServiceUnavailableException (or SUE).
  • Service errors are handled like regular RuntimeExceptions (faults)
    Upon a SUE you do the same stuff you should do with most exceptions: propagate it to a top-level catch block (fault barrier), do cleanup as the stack unrolls or from the catch block, and finally complete the crashed activity in some other rational way. In detail:
    1. If the service is optional just catch, log and proceed.
      If the service is not critical for the job at hand there's no need to crash. The SUE must be caught on the spot (i.e. we convert a fault to a contingency) and logged. Whether a service is optional depends on the concrete application. We can even imagine partially optional services where only some of the method calls are wrapped in a try/catch for the SUE while others lead to more comprehensive crashes.
    2. If the service is required and you are a passive bundle, clean up your own resources and let the exception propagate.
      Passive bundles don't drive business logic and therefore don't own the activities that call into them. As such they have no right to decide how these activities are completed and should let the exception propagate to the owning active bundle. They still must clean up any internal resources associated with the service call in a try/finally block. Because good coding style requires such cleanup to be implemented anyway, it turns out that for passive bundles lazy service dynamics cost nothing.
    3. If the service is required and you are an active bundle, declare the current activity as crashed, log, clean up, try contingency actions.
      If you are the bundle that drives the crashed activity it's your responsibility to complete it one way or another. Good design requires that you wrap an exception barrier around the business logic code to absorb crashes. If there is need of resource cleanup you do it as usual. Then you do whatever the application logic dictates: display an error dialog to the user, send an error response to the client, etc.
  • Explicit service unregistration is used only during bundle shutdown
    All bundles should execute the following sequence in BundleActivator.stop():
    1. Bring down all exported services with calls to the respective ServiceRegistration.unregister()
      In this way we make sure no business control flow will call a service and wreak havoc with our shutdown sequence. Also we don't cause trouble to our importers by exposing partially destroyed services.
    2. If you are an active bundle, perform a graceful shutdown of any threads you drive.
    3. Clean up any non-heap resources you own.
      Close server sockets, files, release UI widgets etc.
    4. Release your heap.
      This is done by explicitly nulling any fields contained in your BundleActivator. After stop() completes the bundle should only consume memory for its BundleActivator instance and its ClassLoader. These are both managed directly by the OSGi runtime. The bundle will mushroom again into a runtime structure on the heap if some management control flow reaches through the OSGi core to call BundleActivator.start() once more (e.g. a user clicks on "start bundle" in his JMX console).
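
Here is a hedged sketch that puts both sequences together for a lazy passive bundle, reusing the hypothetical Translator/Greeter types from the earlier sketches. ServiceHolder is the import-tracking helper presented later in this article, and HolderBackedGreeter stands for any implementation that calls the holder under synchronization as described in the text.

import org.osgi.framework.BundleActivator;
import org.osgi.framework.BundleContext;
import org.osgi.framework.ServiceRegistration;

public class LazyGreeterActivator implements BundleActivator {
  /* Non-final fields so that stop() can null them and release the heap. */
  private ServiceHolder<Translator> translatorHolder;
  private Greeter greeter;
  private ServiceRegistration registration;

  public void start(BundleContext bc) {
    /* 1. Organize the import tracking. */
    translatorHolder = new ServiceHolder<Translator>(Translator.class, bc);
    translatorHolder.open();

    /* 2. Build the internal structure around the tracked import.
       (HolderBackedGreeter is hypothetical: it calls translatorHolder.get()
       inside a synchronized (translatorHolder) block.) */
    greeter = new HolderBackedGreeter(translatorHolder);

    /* 3. A passive bundle: no threads to start. */

    /* 4. Export last, once we are fully assembled. */
    registration = bc.registerService(Greeter.class.getName(), greeter, null);
  }

  public void stop(BundleContext bc) {
    /* 1. Withdraw the export first so no business flow enters during teardown. */
    registration.unregister();

    /* 2. No threads to shut down. 3. Release non-heap resources (the tracked import). */
    translatorHolder.close();

    /* 4. Release the heap: only the activator instance and its ClassLoader remain. */
    registration = null;
    greeter = null;
    translatorHolder = null;
  }
}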

The beauty of the lazy doctrine is that we manage to almost completely fold the hard problem of service dynamics into the much easier problem of dealing with exceptions properly. It turns out dynamics are not so horrible; they mostly force us to have a consistent error handling and cleanup policy - something any Java app worth its salt should have anyway.

There is a substantial wrinkle in this smooth picture - the service import code is still hard to write and is quite disruptive to the business logic code. You have to sprinkle synchronizations all over the place to prevent the management control flow and the business control flow from competing for service object references. This issue is addressed by...

Service proxies

There is an infinite number of ways to achieve correct lazy importing behavior. In practice however, mostly variants of the pattern I am about to present lend themselves to the limited understanding of the human brain. This pattern is so compelling that very early in OSGi history a utility called ServiceTracker was introduced to capture it. I have used and coded this enough times (sick of it really) that I was able to emit these ~100 lines practically in one go, and there is a good chance you can paste them into your IDE and go import some services:

import org.osgi.framework.BundleContext;
import org.osgi.framework.Constants;
import org.osgi.framework.InvalidSyntaxException;
import org.osgi.framework.ServiceEvent;
import org.osgi.framework.ServiceListener;
import org.osgi.framework.ServiceReference;

class ServiceHolder<S> implements ServiceListener {
  private final BundleContext bc;
  private final Class<S> type;
  private ServiceReference ref;
  private S service;

  /**
   * Called from BundleActivator.start().
   *
   * (management control flow)
   */
  public ServiceHolder(Class<S> type, BundleContext bc) {
    this.type = type;
    this.bc = bc;
  }

  /**
   * Called by the app when it needs the service. The rest of the code in this
   * class supports this method.
   *
   * (application control flow)
   */
  public synchronized S get() {
    /* Fail fast if the service ain't here */
    if (service == null) {
      throw new RuntimeException("Service " + type + " is not available");
    }
    return service;
  }

  /**
   * Called from BundleActivator.start().
   *
   * (management control flow)
   */
  public synchronized void open() {
    /*
     * First hook our synchronized listener to the service registry. Now we
     * are able to block other management control flows in case they try to
     * change the service status while we initialize.
     */
    try {
      bc.addServiceListener(this, "(" + Constants.OBJECTCLASS + "=" + type.getName() + ")");
    } catch (InvalidSyntaxException e) {
      throw new RuntimeException("Unexpected", e);
    }

    init(bc.getServiceReference(type.getName()));
  }

  /**
   * Called from BundleActivator.stop().
   *
   * (management control flow)
   */
  public synchronized void close() {
    /* Unhook us so the cleanup is not messed up by service events. */
    bc.removeServiceListener(this);

    if (ref != null) {
      bc.ungetService(ref);
      ref = null;
      service = null;
    }
  }

  /**
   * Called by the container when services of type S come and go.
   *
   * (management control flow)
   */
  public synchronized void serviceChanged(ServiceEvent e) {
    ServiceReference ref = e.getServiceReference();

    switch (e.getType()) {
    case ServiceEvent.REGISTERED:
      /* Do we need a service? */
      if (service == null) {
        init(ref);
      }
      break;

    case ServiceEvent.UNREGISTERING:
      /* Is this the service we hold? */
      if (this.ref == ref) {
        /* Release the dying service before looking for a replacement */
        bc.ungetService(this.ref);
        this.ref = null;
        this.service = null;
        /* Switch to an alternative if possible */
        init(bc.getServiceReference(type.getName()));
      }
      break;
    }
  }

  @SuppressWarnings("unchecked")
  private void init(ServiceReference ref) {
    if (ref != null) {
      this.ref = ref;
      this.service = (S) bc.getService(ref);
    }
  }
}

There! Now this looks like a real programmer article. Let's imagine we want to import the following wicked cool service.

interface Hello {
  void greet(String who);
}

In the BundleActivator.start() we must set up a ServiceHolder.

private ServiceHolder<Hello> helloHolder;

void start(BundleContext bc) {
  helloHolder = new ServiceHolder<Hello>(Hello.class, bc);
  helloHolder.open();
}

We then propagate the holder inside our bundle to all the places where the service is needed. At each site where a non-dynamic app would call the service directly, we instead place the following code:

synchronized (helloHolder) {
  helloHolder.get().greet("Todor");
}

The synchronized wrapper is required even with this one-liner because it is the only way to make sure the service object won't become invalid right after the get() call returns and just before the greet() call begins. Needless to say this is painful and ugly. But it's the only way to be correct. Or is it?

If you squint you will see that we have actually coded the guts of a thread-safe proxy. Let's complete the proxy by wrapping our holder in the original service interface:

class HelloProxy implements Hello {
  private final ServiceHolder<Hello> delegate;

  public HelloProxy(ServiceHolder<Hello> delegate) {
    this.delegate = delegate;
  }

  public void greet(String who) {
    synchronized (delegate) {
      delegate.get().greet(who);
    }
  }
}

Now we can create the HelloProxy in the activator and use it everywhere through Hello-typed references as if it were the original service. Except now we can store the "service" in final fields and pass it to constructors. Combine this with the rest of the Lazy doctrine and we get a clean separation between the dynamics handling boilerplate (locked in proxies and the activator) and the business logic code. Also the business code now looks just like a regular non-dynamic Java program. Cool! Except coding such proxies can get very tedious in the real world, where we use many services with a lot more than one method. Fortunately such proxy generation is quite easy to code as a library or, even better, an active Service Layer Runtime bundle.
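
As a hedged illustration of how such a library could work, the JDK's java.lang.reflect.Proxy is already enough to generate the synchronized wrapper for any service interface. The createProxy helper below is my own sketch, not part of any framework.

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

class ServiceProxies {
  /* Generates the equivalent of the hand-written HelloProxy for any service interface. */
  @SuppressWarnings("unchecked")
  static <S> S createProxy(final Class<S> type, final ServiceHolder<S> holder) {
    return (S) Proxy.newProxyInstance(
        type.getClassLoader(),
        new Class[] { type },
        new InvocationHandler() {
          public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
            /* The same rendezvous as before: hold the lock across the whole call. */
            synchronized (holder) {
              try {
                return method.invoke(holder.get(), args);
              } catch (InvocationTargetException e) {
                throw e.getCause();   // unwrap exceptions thrown by the service itself
              }
            }
          }
        });
  }
}

With such a helper the activator can replace the hand-written proxy with a one-liner:

Hello hello = ServiceProxies.createProxy(Hello.class, helloHolder);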

Before I explain how to sort out this last issue I must make an important observation: the eager and lazy models are not mutually exclusive. As the code above illustrates, in the core of every lazy bundle runs tracking and reaction code similar to the code that would support an eager bundle. The lazy bundle wraps this tracking core with a stable layer of proxies that shield the application code (and its control flow) from all the movement happening below. Still, if you really need it, usually you can plug code into the lower layer and have a hybrid eager (pre-proxy)/lazy (post-proxy) bundle. For example the eager part can do decorations or even complete transformations to the services before they are wrapped in proxies and passed to the lazy part. So if we exclude the dynamic service flickering, the lazy model is really a natural evolution of the eager model to a higher level of abstraction.

Service Layer Runtimes

Since OSGi 4.0 it has been possible to implement a special type of bundle that can drive the service interactions of other bundles. I call these Service Layer Runtimes (or SLR) because they hide the raw OSGi service layer from the bundles they manage. Although SLRs come in all shapes and sizes they inevitably include a dependency injection component. This is because DI is a natural match for services, which typically enter the bundle from a single point like the activator and need to be propagated all over the bundle internals. Doing this manually is tedious (for example it might require you to build chains of setters called "fire brigades", or worse - use statics). Delegating this task to DI is a huge relief.

Peaberry

I will start with my personal favorite. It is pure Java and is developed as an extension to the sexy Guice framework, which means it is lightweight, powerful and XML-free. Peaberry steers the user towards the lazy model, and in fact I came up with the idea when thinking about how to use the framework most effectively. Using it feels largely like using pure Guice. All you need to do to get a service proxy is to bind the interface of the service to a special provider implemented by Peaberry:

bind(Hello.class).toProvider(service(Hello.class).single());

The proxies are then generated on the fly using ASM. From there normal Guice takes over and injects them as it would any other object. Code written in this way looks a lot like plain old Java SE, with dynamic proxies practically indistinguishable from local objects. Peaberry has many more features, including ways to filter services, import all services of a given type as an Iterable, hook code into the dynamic tracking layer below, and decorate services before they are wrapped in proxies. Finally, Peaberry is service registry agnostic and allows you to seamlessly mix services from different sources - for example objects from the Eclipse registry can be mixed transparently with OSGi services.
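
For orientation, such a binding typically sits in an ordinary Guice module, and the generated proxy is then injected like any other dependency. The sketch below is hedged: it assumes the static service() builder comes from Peaberry's org.ops4j.peaberry.Peaberry utility class, GreetingTask is a made-up example class, and the injector bootstrap itself lives in the small hand-written activator mentioned below.

import static org.ops4j.peaberry.Peaberry.service;  // assumed location of the builder

import com.google.inject.AbstractModule;
import com.google.inject.Inject;

class HelloModule extends AbstractModule {
  protected void configure() {
    /* The dynamic service proxy is bound like any other Guice dependency. */
    bind(Hello.class).toProvider(service(Hello.class).single());
  }
}

/* Business code receives the proxy through ordinary constructor injection. */
class GreetingTask {
  private final Hello hello;

  @Inject
  GreetingTask(Hello hello) {
    this.hello = hello;   // looks like plain Java SE; the dynamics hide behind the proxy
  }

  void run() {
    hello.greet("Todor");   // may still throw if the backing service is missing
  }
}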

Alas, Peaberry is not yet perfect. One area where it lags behind the other SLRs is dynamic configuration. The user has to use the ConfigurationAdmin service to change a configuration, or expose classes as ManagedService to receive dynamic configuration. The other area is the lack of an extender - the user still has to code a minimalistic BundleActivator to set up the Guice Injector. The good news is that Peaberry is currently under active development and these gaps are sure to be plugged soon.

Spring Dynamic Modules

A decent choice, which also supports the lazy model. Again the service proxies are generated transparently for the user. Spring DM relies on the Spring component model to do dependency injection. Although it seems to provide more features than Peaberry, it feels much more heavyweight to use.

Declarative Services

This is the only SLR standardized by OSGi. It tries to solve the dynamics problem with traditional Java means. It is high level in that it has a component model. It is low level in that it exposes the components to more of the service dynamics (no proxies). A dependency can be defined as either "dynamic" or "static".

For dynamic dependencies a component must provide a pair of bind()/unbind() callbacks for each service dependency. OSGi DS will do the tracking and call the respective callback. The component then takes over and performs all the swapping and synchronization on its own. In this case OSGi DS saves the developer only the tracking code.
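
As a hedged sketch (the component and method names are illustrative, and the corresponding OSGI-INF component XML that points DS at these callbacks via its bind/unbind attributes and policy="dynamic" is omitted), such a component looks roughly like this:

/* A DS component with a dynamic dependency on the Hello service. */
public class GreetingComponent {
  private Hello hello;

  /* Management control flow: DS calls this when a Hello service appears. */
  protected synchronized void bindHello(Hello hello) {
    this.hello = hello;
  }

  /* Management control flow: DS calls this when the bound Hello service goes away. */
  protected synchronized void unbindHello(Hello hello) {
    if (this.hello == hello) {
      this.hello = null;
    }
  }

  /* Business control flow: the same fail-fast rule as with hand-rolled tracking. */
  public synchronized void doWork() {
    if (hello == null) {
      throw new RuntimeException("Hello service is not available");
    }
    hello.greet("Todor");
  }
}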

By default a dependency is "static" and the component only needs to provide bind() methods. Now the component does not have to worry about synchronization or release of the service object. Instead OSGi DS will re-create the entire component whenever the dependency changes. Alas, this is the only way to make sure the old service object is released.

OSGi DS follows the eager service export policy: if a component is exposed as a service and some of its required dependencies go away, the component is unregistered. The consequence is also the cascading deactivation of dependent components. As we saw, this lifecycle scheme cannot prevent exceptions from failing transitive dependencies. The user code must still have proper error handling in place.

OSGi DS also supports the non-Dependency-Injection "lookup" style, where your components receive a ComponentContext from which to pull out services.

iPojo

Architecturally, iPojo seems to be "OSGi DS but done right". The OSGi DS heritage is all over the place. Here as well we deploy components as bundles, with each component having its dependencies managed by a central bundle. As with OSGi DS the components can be exposed as services to the OSGi registry. However, apart from this iPojo departs from OSGi DS in very important ways. Most importantly it is highly modular, allowing a component to specify a different pluggable handler for each dependency. For this reason it is hard to place iPojo firmly in either the Eager or the Lazy bucket, as there are handlers that implement the service proxying behavior. Also the cascading deactivation of components is configurable. The handler magic is added via bytecode weaving, which, if you can get over the extra build step, pays off when deploying on resource constrained devices.

iPojo integrates the OSGi ConfigAdmin beautifully by establishing a one-to-one relationship between a component and its configuration. If you create a new configuration copy this leads to the creation of a new component, and vice versa.

All in all iPojo is an interesting proposition developed solely with the OSGi environment in mind. It is definitely worth spending the time to explore. I would recommend iPojo over Spring DM as it feels simpler, cleaner and is more performant.

Update

Since this came out I have been working on refining it. There have been some important revelations worth checking out.