Guest
Welcome login
Share/Save
Share/Save
TIBCOmmunity > Blogs > ActiveSpaces > Tags
Home   Members Communities

ActiveSpaces

1 Posts tagged with the active-active tag
3

We have been very busy working over this summer on ActiveSpaces, so busy in fact that a new post is way overdue on this blog. Fortunately for you readers, being stuck in some random coach seat for well over 12 hours without net access (but with a spare battery ) means that I finally got around to write down some of my thoughts on Elasticity, fault-tolerance and maximizing infrastructure. Unfortunately there's lots to write about those subjects and I had more than one of those long flights to go through, so my apologies if this post is a bit lengthy.

 

Premise:

There is currently a lot of talk about 'cloud computing' and/or 'elastic computing', but what defines exactly elastic computing, and what does it do for you, why would you need it? And how does it all relate to ActiveSpaces?

 

Let me answer the most practical question first: why do you need your computing infrastructure to be elastic? The most important answer is very simple: to save money by optimizing your resource utilization, because 'idling' or 'stand-by' hardware and software (licenses) cost your enterprise money without being productive. Other reasons include the ability to maintain your SLAs in the face of extremely variable load conditions and of hardware and software faults.

 

One of the design goals of AS is to provide an infrastructure to be used for the creation and operation of 'Elastic' applications and services. And by 'elastic' I mean more than just 'distributed': and yes, there IS a distinction, which I will now try to explain. In order to understand the difference between those two terms let's first define them:

 

Distribution is the ability to divide a task into smaller sub-tasks that can be executed in parallel. It is what allows a distributed application or service to 'scale out': write the code once, and run it on as many machines as needed in order to achieve a target Service Level Agreement given a specific request load.

 

Elasticity, as it's name implies, when applied to an application or a service, describes the ability to to easily and seamlessly 'scale up' (add more resources in order to increase the overall processing power when needed) and 'scale down' (being able to 'switch off' some of those resources when the load on the overall system goes down and the system is over-provisioned). Obviously in order to be elastic, an application or service has to be distributed, but the reverse is not necessarily true!

 

The distinction between these two terms hinges on the keywords 'easily' and 'seamlessly': a distributed application or service is not elastic if it has to be brought down and reconfigured in order to take advantage of a new host or instance of itself, or in order to decrease the number of instances.

 

To take a practical example, compute clouds are elastic: they let you to add and remove machines, their associated OS and file-system images by simply invoking a web service. This means that clouds provide the elasticity of the hardware and of the platform, and can even fire up another instance of your application. However if your existing (or remaining) application or service instances have to be notified or configured in any way to be told to make use of (or stop using) this new instance, you do not have true elasticity of the software and therefore of your service.

 

So, why do you need to make your applications and services elastic? There are plenty of reasons, but as stated earlier for me the most important one is the following: the load generated by your business is almost never constant, it varies over time, and many studies have shown that in the increasingly digitized world we live in there is an ever increasing gap between your average load and you peak load. There are also plenty of material documenting the increasing impact a degradation of the performance of the service a business is providing has on the bottom line: customers are increasingly volatile, and the competition never more than a click away. Not meeting your SLAs to your customer means in some cases means they are entitled to a discount and will impact your revenue, and in some industries not meeting SLAs can even result in fines being imposed. There is even measurements by Google and Amazon suggesting that increases measured in milliseconds in the serving of a web page resulted in lower number of page views.

 

The traditional approach to this problem has been to size and provision the infrastructure for the peaks not for the average load: this is obviously an expensive proposition as it means that some (in some cases, most) of the infrastructure will most of the time just sit there idling, waisting electricity and costing money when the load is not at its peak.

 

The development of virtualization technologies and now of the cloud computing (and in the future the use of technologies like Complex Event Processing to anticipate and pro-actively prepare for the changes in the load) paradigm have been becoming increasingly popular because they allow the enterprises than know how to leverage them to save a lot of money by easily sizing up or down ('elasticizing') your hardware infrastructure as you need it optimize. However hardware and os virtualization only provide part of the overall solution: a service is composed of more than just the platform is runs on, the software that implements the service also need to be elastic in order to 'close the loop' and automate the re-sizing of the complete stack and to be able to react to expected or unexpected variations of the load.

 

ActiveSpace Elasticity:

As mentioned at the start of this post, ActiveSpaces is an in-memory data and messaging grid that allows application programmers to create elastic applications and services extremely easily. This is due not only to ActiveSpace's API and to the fact that the Space-based architecture is a natural match to distributed systems, but also to the fact that ActiveSpaces itself is elastic: nodes/applications/machines contributing storage and computing resources can be added and removed from the Metaspace (the cluster of nodes working together) on the fly, without requiring any change to a configuration file (as there are exactly zero configuration files!) or without needing to run a specific deployment manager application and without any service interruption.

 

Start with a single node, scale up by simply starting another instance of your application on any machine on the network, and back down according to your load and SLAs by simply stopping one of those instances, there is no need to do anything other than start and stop processes or new instances of your application on any node on the network.

 

How is this possible? Because ActiveSpaces is built using a true peer-to-peer distributed architecture. In order to really understand what this means allow me to explain a bit more how it works and how it compares to other kind of architectures typically used by other data-grids or distributed caches.

 

First let's define the problem that needs to be solved: since a Space is used to store data, in order to scale a Space over multiple machines a distribution mechanism has to be used to distribute the storage and management (what we call the 'seeding') of basic data elements (in our case, tuples (i.e. 'database rows')) as evenly as possible over the set of cluster members (that we call 'seeders'), this is also sometimes referred to as 'partitioning'. At the same time in order to be enterprise production quality the distribution mechanism has to be able to provide resilience to node failure (i.e. 'fault-tolerance' or the ability to survive the sudden loss of a machine without incurring the loss of any data stored in the Space).

 

Distribution:

ActiveSpace uses a proprietary group membership protocol and peer-to-peer 'distribution algorithm' that allows each member of a Space to quickly (and independently) find out which member of the Space (i.e. which 'seeder') is responsible for the storing of a particular tuple and to send the request directly to the appropriate seeder.

 

Compare this to more traditional techniques where the mapping of specific data item to specific cluster member nodes (the 'partitioning' of the data) is handled by a centralized server (i.e. a single member node of the cluster that takes over the role mapping lookup server role), requests to read or write data have to include either a 'lookup' operation with the server (and therefore a network round trip), or be redirected by that server node (which will sooner or later become a bottleneck to the performance and scalability of the overall system).

Another disadvantage of the techniques and algorithms used by other data-grid systems is that they typically require configuration files to be created listing the addresses of all the nodes potentially participating in the cluster, or the potential maximum and minimum number of partitions to be used. Those files not only have to be distributed to all the nodes, but they have to be kept in sync for the cluster to work properly (as an aside, I find it slightly ironic that some of the existing data-grid products that thrive to provide coherence of the data they store rely on the system administrator to provide coherency of their configuration files in order to work well). Another down-side being that a lot of those systems require you to bring the whole system down, update and re-distribute some configuration files before some changes (sometimes as simple as adding another node, or adjusting the number of partitions) can take effect.

 

Fault-tolerance:

Even more important than the ability to deal with peaks in the load without performance degradation, is the need to make the services fault-tolerant. A problem that has traditionally been solved by making the infrastructure redundant and deploying (and buying) everything twice in order to have an idling backup ready to take over in case the primary fails, what is called an 'active-passive' fault-tolerance architecture.

 

ActiveSpace Tuplespaces can be made tolerant to the sudden catastrophic failure of one of the nodes (seeders) in the cluster by specifying a 'degree of replication' for that Space: a degree of replication of one means that any single node can fail at any time without an data being lost, a degree of replication of two means that up to two nodes can fail at exactly the same time, and so on. ActiveSpaces uses the same peer-to-peer 'distribution algorithm' in order to provide fault-tolerance of the service it offers: the algorithm is not only able to determine which cluster member (seeder) is responsible for the original storage of the tuple, but at the same time it determines which node is responsible for the first degree of replication, which node is responsible for the second degree, etc…

 

Besides speed and scalability this algorithm has a couple of very important advantages:

- 'active-active' fault-tolerance: there are no standby 'cold', or even 'warm' nodes that do nothing but wait for a 'primary node' to fail.

- The replication itself is distributed: what is stored (seeded) by one node is uniformly and evenly replicated over all of the other nodes in the cluster. This not only makes degree of replication higher than one possible, it also means that after a sudden catastrophic node failure, the distribution of the tuples remains balanced.

 

Compare this to more traditional techniques where nodes responsible for the storage of data are deployed in fault-tolerant pairs, one node being the 'primary' node doing all the work for the partition of the data that has been assigned to it, and the other node being a standby 'backup' node that does nothing but use electricity and rack space until the primary node fails and in the worst case requiring the deployment of a usually very specific type of shared file system between the two nodes (worst-case because besides the extra complexity and price of a shared file-system the secondary node needs to read all of the data from that shared file system to come up providing for less than optimal fail-over time). Another disadvantage of the primary/secondary fault-tolerance architecture is that it can not provide a degree of replication higher than two: if both primary and secondary servers fail, not only is data being lost, but the given that some of those systems can not deal with all of the nodes assigned to a particular partition being down at the same time, the whole system will actually go down.

 

Conclusion:

ActiveSpaces provides true elasticity to your distributed applications and services. It is able to do this because it is thoroughly elastic itself: its true peer-to-peer architecture allows it to scale up and down on the fly without requiring any reconfiguration or user intervention other than starting another instance of an ActiveSpace-enabled process on the network. This peer-to-peer architecture and algorithms also means that fault-tolerance is 'active-active' and that none of the nodes participating in the nodes are idling unused as 'passive' backup servers, thereby optimizing the usage of your existing infrastructure, and saving money.

3 Comments Permalink