CopyPastor

This is not a good question for Stackoverflow, because any answer will be a opinion based and many approaches may work and depend on more details in your specific use case. So don't be disappointed if the question get's closed at some point.
That being said I am not too shy to offer my opinion or at least some thoughts about your described situation.
1. I believe your question about how to set up teams and how to set up the architecture are very tightly linked because architecture will with no doubt effectively follow the organizational structure. So I will give further thoughts about the organizational setup first. 2. An estimation of the total manpower required for each of the businesses should give you an idea of how many teams you need. Trying to keep team size small (say 2-8 people) will help reducing the communication overhead. So if you think this is the size for a whole business then there is no need to further split responsibility. 3. Responsibility is the most important keyword. You have to avoid any situation where a common service/library is used but has multiple or no owners. There should always be exactly one organizational owner. Thus, when organizations recognize the overlap of functionality in separate areas it is a common practice to establish a team that will be responsible and provide this functionality to others. This could be in the form of shared libraries or actually deployed services. In both cases it is important that the communication is formalized by correctly versioning their work and leaving it to the consuming groups which versions to use, when to upgrade, putting in requests for new features, etc. This approach will decouple the teams that use this common functionality. 4. In your problem description the core of the problem is the business logic and it's complexity / overlap. So I would argue that the most important role is the product management. They have to be very good (and at least a bit technical) and sort this exact mess into reusable pieces and things that are specific to only a single business. If you have a whole team of product managers they need to communicate very well and build this picture together. What is most important here is a good communication about the *vision* for the future and not just immediate requirements. Only then can the architecture and teams be set up in the best possible way. 5. No matter how careful the initial setup - Changes WILL happen. Whatever you think in the beginning to be the best solution will change at some point in the future. In order to prepare for this I always recommend to go with the simplest approaches - Even if it means some code duplication or other imperfections. As software architects we tend to love the beauty of perfection, but that is rarely the most effective approach in the real world. 6. It is common sense to make a simple shared service/library that can be made to fit multiple use cases by adding some configurability. Up to certain degree of complexity that is a useful approach, but you have to be sensible to the consumers of that library/service and it should be easy to reuse at any point. It is not black and white about when to a functionality becomes too big / complex and has to be split into multiple pieces to be maintainable, but looking at it with the eyes of the maintainer and the eyes of consumer will make a determination easier. In the case of configurable service libraries you could also have separate deployments with different configurations, so using commonly developed components, but deploying different endpoints for each use case. If you use technologies that produce only a small deployment overhead (for example golang containers that are only a few mbs), then the large number of deployed services is not a drawback but a strength because they can be upgraded / versioned independently and it is even easy to run multiple versions in parallel. 7. Infrastructure and service deployment may or may not follow the architecture of the services. As a general rule I would recommend to look for the simplest approach, which often means providing common infrastructure that is shared among services and deployment configurations are where the distinction between services starts. For example a common cluster / streaming / gateways / databases. Exceptions to that could be very special needs for single services, like a hardware encryption key store or GPU servers for machine learning, etc. This would be the approach for any reasonable sized system. (Of course if you are going scale to very large sizes it also a very feasible approach to have complete stacks / clusters for specific services.) 8. Persistency design is most crucial. Where it is relatively easy to evolve a business logic, reorganize it, etc., it is rather difficult to evolve your historic data. Often you have a choice to do a design in one of two ways: 1. Smart algorithms, dumb data. 2. Smart data, dumb algorithms.
(Smart referring to more elaborate / reflecting more of the business requirements) The second approach is usually harder initially, but in my experience will result in better results when a certain complexity threshold is reached.
So these are just a few things that came to my head when reading your question. I apologize that they cannot answer your detailed question about how to slice the billing API, but maybe you have a few additional considerations at hand.

This is a classic question I was asked during an interview recently How to call multiple web services and still preserve some kind of error handling in the middle of the task. Today, in high performance computing, we avoid two phase commits. I read a paper many years ago about what was called the "Starbuck model" for transactions: Think about the process of ordering, paying, preparing and receiving the coffee you order at Starbuck... I oversimplify things but a two phase commit model would suggest that the whole process would be a single wrapping transaction for all the steps involved until you receive your coffee. However, with this model, all employees would wait and stop working until you get your coffee. You see the picture ?
Instead, the "Starbuck model" is more productive by following the "best effort" model and compensating for errors in the process. First, they make sure that you pay! Then, there are message queues with your order attached to the cup. If something goes wrong in the process, like you did not get your coffee, it is not what you ordered, etc, we enter into the compensation process and we make sure you get what you want or refund you, This is the most efficient model for increased productivity.
Sometimes, starbuck is wasting a coffee but the overall process is efficient. There are other tricks to think when you build your web services like designing them in a way that they can be called any number of times and still provide the same end result. So, my recommendation is:
+ Don't be too fine when defining your web services (I am not convinced about the micro-service hype happening these days: too many risks of going too far);
+ Async increases performance so prefer being async, send notifications by email whenever possible.
+ Build more intelligent services to make them "recallable" any number of times, processing with an uid or taskid that will follow the order bottom-top until the end, validating business rules in each step;
+ Use message queues (JMS or others) and divert to error handling processors that will apply operations to "rollback" by applying opposite operations, by the way, working with async order will require some sort of queue to validate the current state of the process, so consider that;
+ In last resort, (since it may not happen often), put it in a queue for manual processing of errors.
Let's go back with the initial problem that was posted. Create an account and create a wallet and make sure everything was done.
Let's say a web service is called to orchestrate the whole operation.
Pseudo code of the web service would look like this:
1. Call Account creation microservice, pass it some information and a some unique task id 1.1 Account creation microservice will first check if that account was already created. A task id is associated with the account's record. The microservice detects that the account does not exist so it creates it and stores the task id. NOTE: this service can be called 2000 times, it will always perform the same result. The service answers with a "receipt that contains minimal information to perform an undo operation if required".
2. Call Wallet creation, giving it the account ID and task id. Let's say a condition is not valid and the wallet creation cannot be performed. The call returns with an error but nothing was created.
3. The orchestrator is informed of the error. It knows it needs to abort the Account creation but it will not do it itself. It will ask the wallet service to do it by passing its "minimal undo receipt" received at the end of step 1.
4. The Account service reads the undo receipt and knows how to undo the operation; the undo receipt may even include information about another microservice it could have called itself to do part of the job. In this situation, the undo receipt could contain the Account ID and possibly some extra information required to perform the opposite operation. In our case, to simplify things, let's say is simply delete the account using its account id.
5. Now, let's say the web service never received the success or failure (in this case) that the Account creation's undo was performed. It will simply call the Account's undo service again. And this service should normaly never fail because its goal is for the account to no longer exist. So it checks if it exists and sees nothing can be done to undo it. So it returns that the operation is a success.
6. The web service returns to the user that the account could not be created.
This is a synchronous example. We could have managed it in a different way and put the case into a message queue targeted to the help desk if we don't want the system to completly recover the error". I've seen this being performed in a company where not enough hooks could be provided to the back end system to correct situations. The help desk received messages containing what was performed successfully and had enough information to fix things just like our undo receipt could be used for in a fully automated way.
I have performed a search and the microsoft web site has a pattern description for this approach. It is called the compensating transaction pattern:
[Compensating transaction pattern][1]

[1]: https://docs.microsoft.com/en-us/azure/architecture/patterns/compensating-transaction

CopyPastor

Possible Plagiarism

Original Post