BPMS in production environment

I couldn’t find a better topic than “BPMS (Activos 6.1) in production” to conclude the whole series of a designing system with BPMS, sharing hints for developers and testing whole solution.
Although  ActiveVOS is certainly a cool product there is, as usual, a space for future improvements. The production environment is something special and as something special, it should be treated. If the production environment is down there is simply no business. Empowering the business is the main objective of BPMS, isn’t it? So technology should be ready to cope with that kind of situations. To cut a long story short. Every feature which supports maintainability, reliability, security and sustainability in day-to-day life is highly appreciated.
During the development life-cycle, it can be hard (especially in the early stage of the development) to foresee how the system will be maintained, what the standard procedures look like, etc. The goal is to mitigate the probability of a process, human or a technical error as low as possible taking into consideration an ease of problem detection as well.
The following pieces of functionality were found as highly desirable. Some of them are possible to avoid or at least lower the impact during the design time. For the rest of them, some developers’ effort need to be taken into consideration.
  • Different modes of Console – there are no distinct modes neither for development nor production environment. This comes in handy when you need to grant an access to operations for their day-to-day routine and you don’t wanna let them modify all server settings. For example, you just wanna restrict the permission to deploy new processes, start and stop services.
  • Reliable fall over – maybe this question is more on the side of infrastructure. As BPMS fully lives in a DB typical solution consists of cloning a production DB to a backup DB instance. In case of a failure, this instance is started.  If some kind of inconsistency gets into the DB during the crash of the main instance then it is immediately replicated to a backup instance. Does it make sense to start a backup instance?
  • Lack of data archive procedures – the solution itself doesn’t offer any procedure how to archive completed processes. Because of legal restrictions specific to the business domain, you are working in you cannot simply delete completed processes.  As your DB grow in size the response time of BPMS grows as well. You can easily get into trouble with time-out policy. Data growth 200GB per month is feasible. You cannot simply work this problem out by using some advanced features of the underlying DB like partitioning because you wanna have processes which logically belongs together in one archive. You will be struggling to find out such partitioning criteria which could be used in practice and fulfils mentioned requirement.
  • Process upgrade – one of the killer features,  process migration of already running processes to an upgraded version works only in case of small changes of the process. Moreover what if your process consumes an external WS which lives completely on its own? What if someone enhances that service and modify that interface? Yea, versioning of the interfaces comes to attention. Having process upgrade feature without versioned interfaces is almost nonsense or at least need a special attention while releasing. Even with versioned interfaces, it is not applicable in all situations, eg. sending new data field which presence in the system is not guaranteed.  In large companies, this feature is a must. Otherwise, it is hard to manage and coordinate all the releases of all connected application.
  • Consider product roadmap – actually, this item belongs to project planning phase where we make decisions about what technology to use. In some environment like banking, insurance etc. there can be legal requirements to have all products from production environment supported by a vendor. If the vendor’s release strategy is a new major version every half a year and support scope is current major version plus two major back then this could pose a problem for a product maintenance team during the product lifecycle. Migration of all non terminated processes may not be a trivial thing and as such this represents an extra cost.

BPMS Lesson Learned – Developer’s experience

In the design phase, we focused overall architecture of solution using BPMS, in this part, we will focus on developer’s interaction with BPMS system, design tools, etc. It describes subjective experiences with ActiveVOS designer version 6.1. At the time of writing this article, ActiveVOS v. 9  was announced. I believe that a lot of stuff described in this article were enhanced and the product has made a great step ahead. Here are some key points which were recognized as a limitation.
  • Mitigate your expectation regarding BPMS designer IDE. Designer is Eclipse-based IDE so for those who are familiar with Eclipse shouldn’t be a problem to start work with. Just expect that not all features are fully elaborated like code completion is completely missing. Xpath, XQuery error highlighting, code formatting (even tab ident is completely missing). Message transformation can be really painful from that point of view. It is good that these problems were at least partially addressed in future releases.
  • Team collaboration is a bit difficult. Not because of missing integration to a version control systems like SVN, CVS, etc. But just simply because of generated artefacts like deployment descriptors (pdd files), ant scripts they all contain absolute paths to files. What simply doesn’t work on different PC. Fortunately, this problem is easy to avoid by replacing this absolute path by relative ones.
  • Be prepared that some product feature is not reliable or doesn’t work. As nobody is perfect even ActiveVOS BPMS is not an exception. Just name those which we have to cope with:
    • eventing – On Weblogic application server running in the cluster this feature was not reliable.
    • instanceof – Xml processor used by ActiveVOS  ( Saxon library )  doesn’t support keyword instanceof used for element identification in the inheritance hierarchy.
    • time-out on asynchronous callback “receive” were no reliable – Once it time-out after 5 minutes (required 3 min), next time 1 hour, …

BPMS Lesson Learned – Design

This blog contribution tries to summarize experiences gained when designing BPMS solution using ActiveVOS v. 6.1. and highlight key design decesions. It describes the consequences of these decisions in the context of a bigger picture (impact on production, maintainability, day-to-day routines, …).  As the overall architecture is a set of design decisions which depends on a business context and objectives you are trying to achieve there is no simple copy-paste solution applicable in every situation. Let’s make long story short and have a look at those key areas:
  • BPMS is not a web flow framework! It can be tempting to use a WS-HT interface for each interaction with a client/user. Beside the fact that you have to initiate somehow the interaction with the client ( create a human task) you are not aware of, it has a negative impact on DB space consumption. Every human task has several underlying processes consuming space  depending on a log level and persistence settings but moreover this doesn’t hold any business relevant info. In newer versions this is treated in a better way.
  • Think of your error handling strategy. Used implementation offers feature “suspend on fail” so in case of an error the process is put in suspended state and waiting for a manual interaction. The operator can replay the process. Be very careful with this feature. Overuse of this pattern leads to high database space consumption as a precondition of this feature is a full process logging and persistence enabled. More over what the operator can do with the failing process? In a “properly tested” system  this can happen due to the two main reasons: either data missing or technical reason like service is unavailable. In the first case do you really think that the operator will know your client’s  insurance number somewhere in the system? Certainly no. What about auditing in such a system, that’s another really important question. Fail over to human task seems like a reasonable solution. In the second case, implementation offers a feature called “retry policy”. Taking advantage of this feature you can achieve short-time outage immunity of the system. 
  • Structure your processes to smaller blocks. Although I can understand why people tend to design long processes, experiences prove the opposite. With one long process which realizes the whole business you’ll gain  readability but you’ll lose re-usability, maintainability  and more importantly scalability of the system. Not all the processes and sub-processes have the same level of importance at least from the technical perspective and realized business as well. You can use different levels of process/sub-process logging, persistence and polices such as retry policy. All this has a positive impact on lowering the database  consumption ratio, improving stability and robustness of the system. One important thing which we shouldn’t leave aside is spatial scalability. Every sub-process is just another web service so in case we need to improve throughput of the system we are free to setup a new instance and deploy those processes there. We are absolutely free from the point of infrastructure like load balancing, clustering, …  The only thing we need to keep in mind is that created human tasks are running within “BPEL engine” (no different component). The current version of implementation wasn’t able to read human tasks from more than one “BPEL engine” .
  • Sort your data. In every business domain there are some legal requirements such as auditing and archiving those information for a number of years. Naive approach I’ve met can be: “BPMS holds all the info”. That certainly does. But how the data are structured? Are all the data stored in BPMS relevant to business hence worth of archiving? BPMS is definitively keeping a lot of system information not relevant to business in his own internal structure. E.g. incoming messages as xml objects in variables. That means to get a specific info you would need to find that variable out and within xml object locate that specific piece of information. Moreover this information doesn’t have a time scope guaranteed. If the process doesn’t have full logging enabled then just the latest status is kept and no track of changes. The best approach is to solve this auditing requirements in advance. Store all your audit info into a separate DB schema in a well-defined readable structure. You save a significant amount of consumed DB space and you don’t have to create a “data mining” solution from the BPMS schema. No need to talk about expenses for additional disk capacity.
  • Reason over every feature used in the human task. Human tasks offer a lot of features which support interaction with users e.g. email, task escalation, task notification, etc. Especially pay attention to the task notification feature. Internal implementation is equal to a human task which simply doesn’t show up in the inbox as a new task but only as a notification. It was measured that one human task consumes roughly 1MB. Overuse of this feature can have a big impact on disk space consumed by the DB. The most dangerous thing about notifications is that underlaying system processes are in the running state until the user confirm delivery. Hence it is difficult to delete them from the console. Also associate the notification to the human task feature is missing so there is no other way to cope with the problem then a manual cancellation of the internal system process. This process can be really time-consuming when your system generates three thousand messages over the night.
  • Use “proxy back-end call pattern”. Every back-end system has his own domain model and message structure. It is really tedious and time-consuming to build messages on the BPMS side where the only tools available are Xpath, XQuery, etc. Proven by experience the better and more efficient approach is calling a “proxy method” which is responsible for building the message up, sending the message out and processing the result passed back to BPMS.