MobileFirst Analytics - Planning for Production
MobileFirst Analytics – Pre-production
Part of the value of MobileFirst Platform (MFP) Foundation is the simple, integrated, out-of-the-box analytics reporting and console you deploy to get a near-live view of your infrastructure performance, client logs, and many other events and statistics.
This article does not intend to give you the talking points or espouse the benefits of analytics. I hope the value of analytics and in-production client log collection is well understood! The goal here is to give you the proper knowledge to confidently go into production with a robust, ready, and scalable analytics cluster capable of handling the heaviest punishment.
MFP Analytics Server Internals
The underlying data store in the analytics server is ElasticSearch. More than just the data store, ElasticSearch also manages a horizontally scalable cluster of ElasticSearch nodes. Ultimately, then, the analytics server is really just a single ElasticSearch node. We entrust all of the cluster management to ElasticSearch.
ElasticSearch gives deployers the opportunity to set some fine-grained configuration parameters. One example is the ability to control all twelve of ElasticSearch’s thread pools. All told, there are perhaps over a hundred configuration parameters you can tune. Don’t do it! Unless you really know what you’re doing, and have specifically identified the need, resist the temptation to turn the knobs and flip the switches.
There are a few configuration parameters that MFP Analytics Server requires you to consider before going into production. We’ll get into that a bit later.
I strongly encourage you to get familiar with ElasticSearch. Its documentation will be your primary source for understanding and managing the cluster of MFP Analytics servers.
It’s Not a Data Warehouse!
Relational database technology has been around since roughly the early 1970s. ElasticSearch is not a relational database. It is fundamentally different, and even encourages data duplication to improve query efficiency. ElasticSearch uses Apache Lucene, which describes itself as a “high-performance, full-featured text search engine.” ElasticSearch describes itself as “distributed, scalable, highly available, with real-time search and analytics capabilities.” ElasticSearch promotes its use for “real-time data” and “real-time analytics”. Also, “with ElasticSearch, all data is immediately made available for search and analytics.”
Let’s dissect that last claim. Can you apply the same claim to a relational database? In the general case, no. To achieve near real-time data access to all of your data, you need to distribute load and keep the data in memory. It’s not quite accurate to call ElasticSearch a NoSQL data store. It’s much more than that!
I strongly encourage you to read the Wikipedia entry on “NoSQL”.
If it’s Not a Data Warehouse, Why Do I Need a Disk?
I’ve heard people say they did not give any disk space to the MFP Analytics Server installation because “we were told it was not a database.” Well, it’s not a relational database, but it certainly has data storage requirements! How would you get a persistent data store without a disk? Please, please give your installation some disk space.
So How Do I Confidently Go Into Production, Anyway?
First, test! Seriously, you do have a test cluster, right? And you’ll be simulating the stress that cluster will experience in production, right? That’s the advice given by the capacity planning section of the ElasticSearch Definitive Guide (to which I will be referring a lot and which you are strongly encouraged to read).
There are several basic constraints you must be concerned with when planning for capacity.
- rate of indexing (do you have a busy infrastructure?)
- size of documents (are your debug log messages huge?)
- rate of querying (how often are you querying the data?)
- desired query performance (can you tolerate 5-second wait times on the console?)
- how long to keep the data (one week or many months?)
To properly plan for capacity, you’ll need to know the answers (or at least an approximation) to each of the above points. Do you process millions of network requests per second, or only a few sync requests per day? Are you keeping all of your client and server logs at high verbosity? Do you have many people refreshing the analytics web console throughout the day and night? Are you willing to accept poor query performance, and therefore poor analytics web console load times? How long do you want to keep the data?
Because ElasticSearch and the analytics server intend to be real-time, all data is held in fielddata in memory. Quoting the ElasticSearch guide: “[Fielddata] is an in-memory data structure that must be accessible for fast execution, and it is expensive to build. If you have to reload data for every request, performance is going to be awful.”
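To make those questions concrete, here is a minimal Python sketch that turns an indexing rate, a document size, and a retention period into a rough document count and raw data volume. The `Workload` class, its field names, and the sample numbers are all hypothetical, and the byte figure is raw document size only: it ignores index overhead and replicas, so treat it as a lower bound rather than a sizing answer.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Workload:
    """Hypothetical capacity-planning inputs; every value is your own estimate."""

    docs_per_second: float  # average rate of indexing
    avg_doc_bytes: int      # average size of one stored document
    retention_days: int     # how long documents are kept before purging

    def retained_docs(self) -> int:
        """Approximate number of documents held once retention is reached."""
        return int(self.docs_per_second * SECONDS_PER_DAY * self.retention_days)

    def retained_bytes(self) -> int:
        """Raw document bytes only; excludes index overhead and replicas."""
        return self.retained_docs() * self.avg_doc_bytes


if __name__ == "__main__":
    # Sample values are made up for illustration.
    workload = Workload(docs_per_second=50, avg_doc_bytes=1_000, retention_days=30)
    print(f"documents retained: {workload.retained_docs():,}")
    print(f"raw bytes retained: {workload.retained_bytes():,}")
```

Plugging in your own measured rates from a test cluster is the point; the arithmetic is only as good as the estimates you feed it.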
That should help us draw a strong conclusion about our hardware requirements. We need enough memory to hold all of the fielddata. Which brings us to…
I shan’t repeat what the ElasticSearch guide already says about hardware, so, go forth and read, my son!
How Does MFP Analytics Contribute to Scaling Requirements?
We store many document types in the ElasticSearch data store, but only a few can experience rapid accumulation. Some of that accumulation is out of your control, such as the recording of all network requests coming into the MobileFirst Operations Server. Some of the accumulation is fully under your control, such as the verbosity at which the client and server loggers are configured, whether or not you have instrumented your applications to report custom analytics, and how extensive that instrumentation is.
The document types that contribute most heavily to rapid accumulation are:
- ServerNetworkTransactions – all network requests received by MFP Operations Server
- CustomData – data reported by you if you’ve instrumented your client applications
- Devices (bounded to a fixed number)
- MfpAppLogs – raw debug logs from client applications, if you’ve set a high verbosity and are explicitly sending them to the server
- MobileUsers (bounded to a fixed number)
- ServerLogs – raw debug logs from the MFP Operations Server, if you’ve set a high verbosity and are explicitly sending them to the server
You can turn any of these on or off. Turning off some of these reported types reduces the value of the analytics, and reduces your ability to make informed decisions about your applications and infrastructure. Storing all types but reducing their longevity restores that value. It is your job to decide what data is most valuable to you, and to set the capacity and scaling requirements accordingly!
Ok, So What Performance Should I Expect?
In our testing, we found that 10 million ServerNetworkTransactions documents in one shard on one physical node answered a complex query in about 3.7 seconds. 50 million ServerNetworkTransactions documents answered in about 10 seconds. If you consider the analytics web console “broken” when it takes longer than 10 seconds to load, then 50 million ServerNetworkTransactions per node is your scaling factor.
Hmmm… instead of writing a big convoluted paragraph describing all the various permutations of shards and nodes, perhaps this type of calculation would fit neatly into some kind of computer thing, like a high tech abacus.
As it turns out, the analytics team produced a capacity planning spreadsheet for exactly that purpose.
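The spreadsheet is the tool to use. As a back-of-the-envelope version of the single-shard figure quoted above (50 million ServerNetworkTransactions documents per node for a roughly 10-second query), a sketch might look like the following. The function name and the idea of scaling linearly by node count are my simplifying assumptions; the test results only give two data points, and the sketch ignores shard and replica settings entirely.

```python
# Scaling factor taken from the test result quoted in this post:
# about 50 million ServerNetworkTransactions documents per node
# for a complex query answering in about 10 seconds.
DOCS_PER_NODE_AT_10S = 50_000_000


def nodes_needed(total_docs: int, docs_per_node: int = DOCS_PER_NODE_AT_10S) -> int:
    """Rough node count, assuming load spreads evenly across nodes (an assumption)."""
    if total_docs < 0:
        raise ValueError("total_docs must not be negative")
    if docs_per_node <= 0:
        raise ValueError("docs_per_node must be positive")
    # Integer ceiling division; always plan for at least one node.
    return max(1, -(-total_docs // docs_per_node))


if __name__ == "__main__":
    for docs in (10_000_000, 50_000_000, 130_000_000):
        print(f"{docs:>12,} documents -> {nodes_needed(docs)} node(s)")
```

If your tolerance for console load time is stricter than 10 seconds, pass a smaller `docs_per_node` that you have measured on your own test cluster.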
One other thing to note, also from our testing, is that setting two primary shards, two replica shards, and two physical nodes reduced that 3.7 second query time to under one second.
Don’t assume you can just keep bumping that primary shard setting and get more performance! Shards are not free; they take system resources to manage. There is a sweet spot, and you have to find it.
You really, really should read the whole Designing for Scale section.
Can’t a Guy Just Get a Quick Start Guide?
Ok, fine. I’ll give you some requirements. This is the part I promised earlier!
- Run production servers on Linux (preferably not in a VM, because of the VM’s unpredictable handling of virtual memory).
- Turn off swap (see ES configuration documentation)
- If you cannot turn swap off entirely, configure swappiness (see ES configuration documentation)
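- Give half of the physical memory to the JVM for heap space (for example, -Xms8G -Xmx8G on a machine with 16 GB of RAM, or use the ES_HEAP_SIZE environment variable; see ES configuration documentation)
- Do not give more than 32 GB of RAM per node to the JVM, no matter how much physical memory the machine has.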
- Increase allowed open file descriptors to 32k or 64k (see ES configuration documentation)
- Increase the virtual memory mmap counts (see ES configuration documentation)
- Set bootstrap.mlockall to false (see ES configuration documentation)
- Use a JDK from the ES support matrix, or IBM J9 at the Java 8 SR1 or Java 7.1 SR3 level
- Use the capacity calculator spreadsheet and set the shards configuration accordingly, or with a bit of a buffer.
- Use the capacity calculator spreadsheet and allocate (or at least plan for!) physical nodes to handle the capacity you require.
- Use a 64-bit JVM. The heap size for a 32-bit JVM is extremely limited (http://javarevisited.blogspot.com/2013/04/what-is-maximum-heap-size-for-32-bit-64-JVM-Java-memory.html). That makes it an unsuitable environment in which to run the Analytics console, because ElasticSearch relies heavily on a substantial heap size.
- Then, revisit this list before production and don’t skip any of it! Especially the minimum_master_nodes setting in the Important Configuration Changes section.
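Two of the items above are simple arithmetic, so here is a hedged Python sketch of them: the heap rule (half of physical memory, never more than 32G) and the quorum formula the ElasticSearch guide gives for minimum_master_nodes (master-eligible nodes divided by two, plus one). The function names are hypothetical and this is a planning aid only; it does not read or write any ElasticSearch configuration.

```python
MAX_HEAP_GIB = 32  # upper bound per node, per the list above


def jvm_heap_flags(physical_gib: int) -> str:
    """Return -Xms/-Xmx flags for half of physical memory, capped at 32G."""
    if physical_gib < 2:
        raise ValueError("need at least 2 GiB of physical memory")
    heap_gib = min(physical_gib // 2, MAX_HEAP_GIB)
    # Same value for -Xms and -Xmx so the heap does not resize at run time.
    return f"-Xms{heap_gib}G -Xmx{heap_gib}G"


def minimum_master_nodes(master_eligible_nodes: int) -> int:
    """Quorum of master-eligible nodes: (n / 2) + 1, using integer division."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


if __name__ == "__main__":
    print(jvm_heap_flags(16))       # prints "-Xms8G -Xmx8G"
    print(jvm_heap_flags(128))      # prints "-Xms32G -Xmx32G"
    print(minimum_master_nodes(3))  # prints 2
```

Note that with only two master-eligible nodes the quorum is also two, so losing either node leaves the cluster unable to elect a master; that is one reason to plan node counts with the spreadsheet rather than by feel.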
I hope this post gives you a better feel for the fundamentals of the MFP Analytics Server technology, scaling techniques, and production planning. I welcome your feedback, and look forward to your comments!