• SaaS “Log” Management

  • My Thought

  • Architecture

  • Case Study

  • References

  • Usage Scenario

    Where and how can we add value to users?

  • Ingestion and Collectors

    • simple upload for quick try out
    • one for all sources
    • relay/forward
    • sending target
    • central management (ex. tagging)
  • Advanced Query Functions

  • End

  • Log Management System in General

    I think “log” should be a generalized term, including at least:

    1. states: cpu, memory, disk, network, performance metrics … etc.
    2. configs: system tunable, application settings … etc.
    3. logs: system, application logs … etc.
    4. events: patch applied, application new version deployed, external referral or promotion/campaign (causing traffic surge), security breach … etc.

    Together, these represent the “status” of the target systems so that users can understand whether their systems are healthy or not, and if not, where the problem is and how to resolve it.

    From the system’s perspective, there’s really no difference among these. They are all just “text” data that needs to be parsed/stored/indexed/aggregated and then fetched/searched/charted/analyzed.

  • Ingestion and Collectors

    In general, there are two different approaches:

    1. Centrally managed collectors: Splunk and SumoLogic are examples.
    2. Decentralized collectors: Loggly, Papertrail, Flume, Fluentd, Logstash, etc.

    Centrally managed collectors are usually proprietary, like SumoLogic’s (MII as well). SumoLogic’s central collector (source) management can change a remote collector’s config: target file location(s), filters, tags, etc. The upside is, well, centralized management: you manage all collectors in one place (on the UI). Don’t get it wrong, you still need to install a collector on each host; only the configuration is done on the server.

    The 2nd approach configures each collector separately on the host where it is installed. These vendors usually don’t implement their own collector but instead rely on commonly used collectors like rsyslog/Flume/Fluentd/Logstash. All they need to do is support one common endpoint (syslog is the most common protocol) or create their own plugin for those collectors.

    Flume/Fluentd/Logstash all have more features (a strong plugin ecosystem) than SumoLogic’s collector, though the convenience of managing sets of collectors centrally is a real advantage. I’d argue that the ideal design is to be able to use those common collectors while still being able to centrally manage them (to some extent). I don’t think these two approaches are mutually exclusive.

  • Query Language

    There are generally two types of log query implementation: Lucene-based, and Splunk/SumoLogic-style extensions.

    The difference is that a Lucene-based query only has, well, Lucene search functions. That basically means you can only search on fields present in the Lucene index, and generally there’s no aggregation or analytics support in the query itself. This category includes Solr, ElasticSearch, and Loggly (among others). You do get limited aggregation and also faceting, which all 3 products mentioned above support, but not directly through the query; mostly aggregation and faceting are done through UI operations. You search for something first and then aggregate on the result set, or use facets on the side, which indirectly suggest possible filters to apply next. This category generally requires UI operations in order to do meaningful aggregation or analytics, so there’s a “multi-step” feel to completing a task.

    Splunk and Sumo Logic are in the second category. The “query” language supports aggregation and much more, directly. You can still use the UI to do things, but in the end they get translated into a “query”. There’s also a “pipe”, so you can apply one operation after another, which is very convenient and powerful. Another strong point is that you can create new, “virtual” fields on the fly. For example, the browser agent field in Apache access logs contains multiple pieces of information. With this you can “extract” just the OS part without actually having an OS field in the index (which, from Lucene’s perspective, would need a schema change and a re-index).

    From the indexing perspective, the 1st feels more static while the 2nd feels more dynamic. Using the 1st, you have to re-process and re-index if you need a new field. Using the 2nd, you just add or change the query. But of course, the 2nd suffers some runtime overhead, which is the trade-off. We could also argue that the 2nd requires more advanced knowledge of the query language.

    I do not believe these are mutually exclusive. It would be great to combine the best of both. In fact, Sumo Logic does mention this and plans to add finer control for specifying static fields.

  • Alerts

    All products I’ve tried use saved searches for creating alerts. When creating an alert, you specify the criteria for matching records with a query (a saved search), as well as how often it should be checked. The highest frequency for running alert searches is once per minute (among all products I’ve tried), probably for performance reasons. Usually there’s an additional control on how many matches within a specific time window should be treated as an alert; for example, only more than 5 matches (of the alert’s saved-search query) in the past 60 minutes would trigger an alert.

    SumoLogic does support real-time alerts based on saved searches (“human real time” as it is explained, in the range of 8-12 seconds), though in real-time mode it can only check up to the latest 15-minute time window. Also, some search language operators are not allowed in real-time mode.

    Using saved searches for alerts provides the most flexibility, as whatever you can do with search is available to alerting. From an implementation perspective, you don’t need to create separate logic for alert evaluation, which is nice. However, performance is a problem if the alert checking runs too often.

  • Pipeline

    One question that needs to be answered is:

    How to dynamically add fields into the index? (Assuming we are going to use a Lucene-based solution, Solr or QRadar.)

    Dynamically changing and reloading the schema definition does seem feasible (without a restart, of course). This needs further investigation depending on what we use.

    The other one, re-indexing, is easily done with Kafka. The ultimate source of truth should be on HDFS, which means we’ll need to first “normalize” input data from Kafka and persist it into HDFS, then pour it back into Kafka under another topic. When replaying, we simply read from HDFS and pour the data back into Kafka again. This could be done in Storm, with separate bolts for Kafka and HDFS persistence (a rough sketch follows).
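
    To make this concrete, here is a rough Java/Storm sketch of that wiring (my own illustration, assuming Storm 1.x APIs; the spout and the two sink bolts are stubs standing in for storm-kafka and storm-hdfs components, and all names are made up):

    import java.util.Map;
    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;
    import org.apache.storm.utils.Utils;

    public class IngestPipelineSketch {

      // Stand-in for a Kafka spout reading the raw topic; replaying from HDFS would
      // simply mean pouring those files back into the same topic, nothing downstream changes.
      public static class FakeRawLogSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
          this.collector = collector;
        }
        public void nextTuple() {
          Utils.sleep(1000);
          collector.emit(new Values("127.0.0.1 - - [26/Dec/2014] \"GET / HTTP/1.1\" 200 1234"));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
          declarer.declare(new Fields("line"));
        }
      }

      // "Normalizes" a raw line; here it just trims it and attaches a processing timestamp.
      public static class NormalizeBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
          collector.emit(new Values(System.currentTimeMillis(),
              input.getStringByField("line").trim()));
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
          declarer.declare(new Fields("ts", "record"));
        }
      }

      // Stand-in for HDFS persistence of the normalized master dataset (storm-hdfs in reality).
      public static class HdfsPersistBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
          // append input.getStringByField("record") to the HDFS master dataset
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
      }

      // Stand-in for a producer bolt writing the normalized stream to a second Kafka topic.
      public static class KafkaRepublishBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
          // produce input.getStringByField("record") to a "normalized" topic
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
      }

      public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("raw", new FakeRawLogSpout(), 1);
        builder.setBolt("normalize", new NormalizeBolt(), 4).shuffleGrouping("raw");
        // Separate bolts for HDFS persistence and for pouring back into Kafka.
        builder.setBolt("hdfs", new HdfsPersistBolt(), 2).shuffleGrouping("normalize");
        builder.setBolt("kafka", new KafkaRepublishBolt(), 2).shuffleGrouping("normalize");
        new LocalCluster().submitTopology("ingest-sketch", new Config(), builder.createTopology());
      }
    }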

  • Dashboard

    • First, Banana is really slow. If Kibana is the same, I’m not going with it.
    • Should support Kibana json layout, or at least allow conversion.
  • Security

  • When you have diversity of event sources, how do you search?

    Obviously, we search for a problem or the root cause of a problem. But upon a “hit”, there should be a way to pull in all related info. One possible dimension is time. Another might be the call trace.

  • QRadar

  • Solr

  • Kafka

    What is the best way to receive data into Kafka over the internet?

  • Storm

  • Scenarios Tested

    Apache access log

    Analytics on the simple access log from Apache. Of course, these scenarios are meant to be put on a dashboard.

    1. unique visitor counts per day (ip/date/agent combined)
      • possibly excluding spiders
    2. Top Requested URL sorted by hits
    3. Top Requested Static Files sorted by hits
    4. Top 404 Not Found URLs sorted by hits
    5. Top Hosts sorted by hits
    6. Top Operating Systems sorted by unique visitors
    7. Top Browsers sorted by unique visitors
    8. Top Requested Referrers sorted by hits
    9. Top HTTP Status Codes sorted by hits

    Unique visitors require the ability to “combine” fields. OS or browser requires a virtual field derived from an existing field.
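
    A tiny Java sketch (illustrative only, not any product’s implementation) of what those two capabilities amount to: combine date/ip/agent into a “unique visitor” key, and derive an OS value from the agent string.

    import java.util.Map;

    public class DerivedFields {
      // date + ip + agent identifies a unique visitor for that day
      static String uniqueVisitorKey(Map<String, String> event) {
        return event.get("date") + "|" + event.get("src_ip") + "|" + event.get("agent");
      }

      // a crude, illustrative OS classifier over the user-agent string
      static String os(Map<String, String> event) {
        String agent = event.getOrDefault("agent", "");
        if (agent.contains("Windows")) return "Windows";
        if (agent.contains("Mac OS X")) return "Mac OS X";
        if (agent.contains("Linux")) return "Linux";
        return "Other";
      }
    }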

  • Loggly

    • 750+ billion events logged to date
    • Bursts of 100k+ events per second over the span of hours
    • Events vary in size from 100 bytes to several megabytes
    • End-to-end processing in 10 seconds (from ingestion to searchable on web)
  • Sumo Logic

  • Papertrail

  • Solr/Banana

  • Splunk

  • New Relic

  • What is needed in a log management system

    • Search (for something), better yet, search API
    • Chart (for spotting trend)
    • Realtime (fast enough, no lag)
    • Flexible ingestion API (or standard API)
    • High fault tolerance (no log lost allowed)
  • Best Practice in Log Management

    • Use existing logging infrastructure
      • Real time syslog
      • Application log file watching
    • Store log externally
    • Log messages in machine parsable format
      • JSON for structured information
      • Key-value pairs
  • Lambda Architecture

    • Batch layer
    • Serving layer
    • Speed layer

    Data goes to both batch and speed layer. Batch layer contains the master dataset, serving layer generates query views, and speed layer provides query views of the most recent data which batch/serving layer has not yet caught up.

    The batch layer runs functions over the master dataset to precompute intermediate data for your queries, which are stored in batch views in the serving layer. The speed layer compensates for the high latency of the batch layer by providing low latency updates using data that has yet to be precomputed into a batch view. Queries are then satisfied by processing data from the serving layer views and the speed layer views and merging the results.
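
    A minimal Java sketch of that last sentence (my own illustration, assuming per-key counts as the view format): a query is answered by merging the precomputed batch view with the speed layer’s view of the most recent data.

    import java.util.HashMap;
    import java.util.Map;

    public class LambdaQuery {
      // e.g. page-view counts per URL: the batch view covers the master dataset,
      // the speed view covers only data not yet absorbed by a batch run
      static Map<String, Long> answer(Map<String, Long> batchView, Map<String, Long> speedView) {
        Map<String, Long> result = new HashMap<>(batchView);
        speedView.forEach((key, count) -> result.merge(key, count, Long::sum));
        return result;
      }
    }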

  • HBase

    A key/value store providing random and realtime read/write access.

  • Hive

    The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

    It does not provide random access (read/write).

    Think of it as a higher-level abstraction for using MapReduce with a SQL-like method.

    Hive is useful only if the data is structured.

  • Dashboard References

  • Sudden Surge of Page Rendering Time

  • Apache Access Log Analytics

  • Maximo Log Analytics

    • sessionize / transaction

      * | transaction host cookie
      
      event=1 host=a
      event=2 host=a cookie=b
      event=3 cookie=b

      Group (cross-source) events together (transitive or not) for further aggregation.
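
      A small Java sketch of the “transitive” case (my own illustration, not any vendor’s algorithm): events 1 and 2 share host=a, events 2 and 3 share cookie=b, so a union-find over the shared field values puts all three into one transaction.

      import java.util.*;

      public class TransactionGrouping {
        static Map<String, String> parent = new HashMap<>();

        static String find(String x) {
          parent.putIfAbsent(x, x);
          if (!parent.get(x).equals(x)) parent.put(x, find(parent.get(x)));
          return parent.get(x);
        }

        static void union(String a, String b) { parent.put(find(a), find(b)); }

        public static void main(String[] args) {
          List<Map<String, String>> events = List.of(
              Map.of("event", "1", "host", "a"),
              Map.of("event", "2", "host", "a", "cookie", "b"),
              Map.of("event", "3", "cookie", "b"));

          // link every key value appearing on the same event (host=a with cookie=b on event 2)
          for (Map<String, String> e : events) {
            String first = null;
            for (String field : List.of("host", "cookie")) {
              if (!e.containsKey(field)) continue;
              String key = field + "=" + e.get(field);
              if (first == null) first = key; else union(first, key);
            }
          }

          // group events by the representative of any one of their keys
          Map<String, List<String>> groups = new HashMap<>();
          for (Map<String, String> e : events) {
            String key = e.containsKey("host") ? "host=" + e.get("host") : "cookie=" + e.get("cookie");
            groups.computeIfAbsent(find(key), k -> new ArrayList<>()).add(e.get("event"));
          }
          System.out.println(groups); // one group containing events 1, 2, 3
        }
      }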

    • Maximo logs:

      SessionId -> ActionId/time -> SqlId/time
      • the n slowest actions taking more than t sec.
      • For these actions, the slowest queries.
        • Queries longer than t sec
        • Queries taking > p% time
    • timeslice / timechart

      • freely time bucketed aggregation
        • daily, hourly … etc
        • by time or by buckets
      • trend analysis
        • user session count trend
        • memory usage (total/available)
    • Result Manipulation

      • numeric, like bytes -> GB
      • string, like concat, substitute
    • Conditional

      • if (<cond>, <true>, <false>) as <field>
      • apply classification on-the-fly
        • ex. classify response time range
      • derived field
        • ex. Browser/OS info out of agent field
    • User expectation

      There are 3 main goals:

      1. Dashboard: it shows what users care most.
      2. Alerts: it lets users know when and what they need to know.
      3. Diagnosis: it helps users identify the root cause of alerts and problems.

      I want to dig into 2 and 3 first. They are more likely the areas where a system can differentiate itself from others.

    • Conceptually, there are 2 steps in using such a system to resolve problems: 1) identify the problems, 2) resolve the problems.

      Of course, users can manually search for problems (based on their experience). But a superior system would “magically” identify potential problems for users via alerts. That’s the selling point. We’ll talk about that point later. Let’s assume a potential problem has been identified, and we have the exact timestamp of when it happened and what the problem is.

      • Pull out all other kinds of data at around the same time. Everything should be plotted on a time series chart.
        • Automatically identify and highlight those with abnormal patterns, be it a spike, a valley, or any level out of the normal range.
        • Allow easy in/exclusion of data for further analysis.
        • Allow switching among data series easily, as well as navigating a single data series back and forth in time.
        • Allow free trend overlapping.
      • Allow annotation on data.
      • A secondary kind of alert could also be useful during this process, not necessarily the normal alerts used for identifying anomalies. For example, database rows read per transaction exceeding some value might not be worth sending out a notification, but it could be a hint in this manual analysis and worth pointing out.
      • Once identified, the correlation should be learned.
    • Now back to anomaly detection. In addition to user alerts, there is also automatic detection:

      • Anomaly detection: basically, detect any change in a stable level. CPU level changes, sudden surges in web page traffic, increases in the number of Java exceptions, etc.
      • dynamic baseline comparison
      • log pattern changes: new/unknown log entries appearing, log frequency changes, etc.
    • Then there are situations where users would like to manually search to identify problems. We could do something to assist in this process:

      • faceting, not just normal count-based facet, but also:
        • multiple selection
        • multiple level (pivot)
        • numeric range facets for special field (cpu for example)
      • auto identify new fields, allow option to “promote” to add new fields
        • facet on these “virtual” fields
      • time line also shows a log frequency trend line
        • Typically the main time series chart shows the volume of event counts over time. It would be nice to allow “dragging” a facet onto the chart to plot that facet over time. This needs back-end support for sure; I wonder how the default chart is queried, as it should be close to what we need.
        • different line for different log level
        • different line for different log pattern, could categorize log entry (exception, system state, etc)
      • similarly, the 2nd ruleset can be used here. think of it as “best practice” suggestion.
    • Arguments against Custom Collector and Central Management

      Even though central management has its upside, there are some concerns:

      1. Many users probably already have some collector installed and running (or are familiar with one). It’s better not to ask them to install a new kind; instead, we should support those collectors directly.
      2. Many enterprise users have automated configuration management systems like Puppet/Chef/Ansible. For these users, sticking to those is more reasonable, both for security concerns (letting some components in their environment be managed by an outsider) and for familiarity (they know how to manage their environment with the tools of their choice).

      I think these are more likely concerns for large shops.

    • Design

      • For quick self-starters, provide one-time on-page uploading.
      • On the server receiving end, support syslog (this covers rsyslog/Flume/Fluentd/Logstash, among others) and http/s (this covers sending from application clients).
        • Provide config generation support in the UI for major collectors. Obviously, the documentation needs to be great. Whenever possible, a one-line shell script config should be provided. When that is not feasible (e.g. firewalled), allow a simple config file download.
        • Provide a lean, pre-packaged collector (just for example, say Fluentd) for users who don’t already have/use any collector.
      • Provide an SDK for direct sending from applications, at least for Java and Javascript (Python, Ruby, and Go would be next; generally the more languages the better). See the sketch after this list.
      • Parsing/processing does not occur on the collector side, but on the server side.
        • It’s better done after Kafka so that it’s possible to re-process without resending, to correct human error.
        • Auto-detection of source data, whenever possible.
        • Users should be able to tag source(s) or source groups, manageable from the UI. Tags could act as hints for (easier or more efficient) later processing.
        • This layer also supports filtering on the server side (also manageable from the UI).

      BTW, we also need to consider receiving the metrics protocols used by the most common time-series metrics systems, e.g. Graphite/StatsD. We want more than logs.
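
      As a rough idea of the sending SDK mentioned above (a sketch only; the endpoint URL, token scheme, and error handling are assumptions, not a real API), a minimal Java client could POST JSON events over HTTPS:

      import java.net.URI;
      import java.net.http.HttpClient;
      import java.net.http.HttpRequest;
      import java.net.http.HttpResponse;

      public class LogClient {
        private final HttpClient http = HttpClient.newHttpClient();
        private final String endpoint; // e.g. https://ingest.example.com/bulk/<token>

        public LogClient(String endpoint) { this.endpoint = endpoint; }

        // sends one JSON event; a real SDK would batch, retry with backoff,
        // and buffer locally so the application is never blocked
        public void send(String jsonEvent) throws Exception {
          HttpRequest req = HttpRequest.newBuilder(URI.create(endpoint))
              .header("Content-Type", "application/json")
              .POST(HttpRequest.BodyPublishers.ofString(jsonEvent))
              .build();
          HttpResponse<Void> resp = http.send(req, HttpResponse.BodyHandlers.discarding());
          if (resp.statusCode() >= 300) {
            throw new IllegalStateException("ingest failed: " + resp.statusCode());
          }
        }
      }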

    • Further Thought

      This is not an elegant design; it’s too “hacky”. There is too much state to maintain, and it can become inconsistent between the client and server sides. It might be worthwhile to just enhance an open source collector to allow server-controlled configuration. Users can simply add our plugin (a trivial thing for Logstash and Fluentd). The critical piece is to see whether the existing plugin framework allows configuration reloading.

    • Limitation of Lucene based query language

      Simply looking for something is not a problem. The problem is correlating different records in the result set (that is, aggregation and analytics).

      Take the experimental scenarios on the Apache access log. Loggly (mainly Lucene search only) can only “partially” achieve 6 out of the 9 scenarios, mostly because it can only aggregate and sort on one dimension. All 6 scenarios achievable by Loggly are of the form “top xxx sorted by yyy…”. That translates to searching by “xxx” then ordering by “yyy” (ordering is done through the UI, not the search language). Even for those 6, the result cannot include fields other than “xxx” (so it is only useful for charting purposes, like a pie chart); the UI operation that would be needed for more is not yet available on Loggly.

      The other 3 scenarios all use the so-called “unique visitors”, defined by the combination of 3 fields: date, IP, and agent. Here, Lucene/Solr does not support virtual or derived fields, so unless we change the schema to add 3 additional fields and re-index, we cannot search on “unique visitors”.

      Similarly, 2 of the 3 scenarios also use “partial” field info from the “agent” field, which contains both OS and browser info. If we want to aggregate on OS, we cannot, unless we add a new field containing only the OS part of the original “agent” field.

      Since this is not provided by the Lucene search engine, I think it should be the upper layer’s responsibility, either Solr’s or the application’s. The Splunk/SumoLogic query language is one way to support that (on top of the search result set), while it is quite possible to achieve the same thing through UI operations. It is just that Loggly has not yet reached that level.

      I would say using a query language is much more flexible and elegant from an implementation perspective. Combined with the “pipe” or stream concept, it is very powerful.
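
      To illustrate what the pipe buys (purely my own sketch in plain Java streams, not any vendor’s engine): each stage consumes the previous stage’s result set, the way a search / extract / where / count-by chain would. Here it counts unique visitors per OS from already-parsed events (sorting for the top N is omitted).

      import java.util.List;
      import java.util.Map;
      import java.util.stream.Collectors;

      public class PipeSketch {
        static Map<String, Long> uniqueVisitorsPerOs(List<Map<String, String>> events) {
          return events.stream()
              .filter(e -> e.containsKey("agent"))                     // where agent exists
              .collect(Collectors.groupingBy(                          // count by derived OS ...
                  e -> e.get("agent").contains("Windows") ? "Windows"
                     : e.get("agent").contains("Linux") ? "Linux" : "Other",
                  Collectors.mapping(                                  // ... of unique visitors
                      e -> e.get("date") + "|" + e.get("src_ip") + "|" + e.get("agent"),
                      Collectors.collectingAndThen(Collectors.toSet(), s -> (long) s.size()))));
        }
      }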

    • Design

      • Use Lucene/Solr search language as the base.
      • On top of that, provide aggregation and pipe.
        • Everything can be expressed with a query. Charts, tables, etc.
        • Map the query to a Storm topology. The search result of the very first part would feed the input to the spout. From there, we have every possibility.
        • This also serves alerts.
    • QRadar

      From my understanding, QRadar does it the reverse way. Instead of running alerts periodically, it runs all alert checks on each event received. Since events must go through the CRE before being persisted into Ariel, there must be some latency added to the ingestion process.

      On the other hand, running saved searches does not impact ingestion. The only contention is whether the query (read) impacts writes to the underlying data store, and by how much.

      If using search-based alerts, we can’t afford to do it QRadar’s way, that is, running all alert queries on every single event received (essentially, QRadar is not using search to implement alerting).

      However, QRadar’s alerting does allow a few interesting and powerful concepts which cannot be covered by the search-based approach (at least not directly):

      1. Rules can depend on other rules, that is, they are recursive.
        • when an event matches any|all of the following rules
        • when these rules match at least this many times in this many minutes after these rules match
    • Design

      Search based is the way to go:

      • It should be more performant (resource-wise) than all-rules-per-event evaluation.
      • Search is implemented as a Storm topology, which is also a perfect fit for alerts. Stateful stream processing can handle running-total or count-based rule evaluation.
      • Recursive alerts could be provided by the Storm topology as well, with branching (possibly to another topology).
      • Periodic checking is supported by bolt system events (tick tuples); see the sketch below. Real-time checking is, well, native Storm.
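
      The sketch below shows the periodic-checking idea with Storm’s tick tuples (my assumption of the mechanism; the threshold, interval, and field names are made up): the bolt keeps a running match count and evaluates the alert condition once per tick.

      import java.util.Map;
      import org.apache.storm.Config;
      import org.apache.storm.Constants;
      import org.apache.storm.topology.BasicOutputCollector;
      import org.apache.storm.topology.OutputFieldsDeclarer;
      import org.apache.storm.topology.base.BaseBasicBolt;
      import org.apache.storm.tuple.Fields;
      import org.apache.storm.tuple.Tuple;
      import org.apache.storm.tuple.Values;

      public class AlertWindowBolt extends BaseBasicBolt {
        private final long threshold;
        private long matchesInWindow = 0;

        public AlertWindowBolt(long threshold) { this.threshold = threshold; }

        // ask Storm to send this bolt a tick tuple every 60 seconds (the check interval)
        @Override
        public Map<String, Object> getComponentConfiguration() {
          Config conf = new Config();
          conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 60);
          return conf;
        }

        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
          boolean isTick = Constants.SYSTEM_COMPONENT_ID.equals(input.getSourceComponent())
              && Constants.SYSTEM_TICK_STREAM_ID.equals(input.getSourceStreamId());
          if (isTick) {
            if (matchesInWindow > threshold) {
              collector.emit(new Values("saved-search-x", matchesInWindow)); // notify downstream
            }
            matchesInWindow = 0; // start the next window
          } else {
            matchesInWindow++;   // a record matched by the upstream saved-search filter bolt
          }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
          declarer.declare(new Fields("alert", "count"));
        }
      }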

      In addition to user-defined alerts, there is another category: anomaly detection alerts. Users should be able to turn these alerts on/off, and possibly even tweak the algorithms used for their needs, but this category of alerts should be turned on by default.

    • Kafka

      The first component and entry point of our pipeline should be Kafka, for a few reasons:

      1. Decouple the ingestion and the processing.
        • This provides a desired property for the system: no matter what happens to the pipeline processing, collection never gets interrupted. Users’ sending never times out because the pipeline is full.
        • One of the requirements of a log/event management system is that no log/event is lost; otherwise we might lose a critical, even if tiny, piece of data. Kafka persists to disk, so it provides the necessary guarantee.
        • We can easily “replay” the incoming data stream when needed. This simplifies the system in many ways, for example when there’s a need to re-index (for Lucene) because new fields were added. Another example would be correcting human error in the pipeline.
      2. Decouple internal systems.
        • All systems, be it APM, SCALA, QRadar, Analytics, UI, or even “Storage”, communicate through Kafka. No unnecessary inter-dependency among systems. For example, the UI could subscribe to “alert” queues while all systems publish to those queues.
        • It also serves as a kind of buffer/queue so a system does not get blocked by others.
      3. Serve as the single source of truth.
        • All systems get the same input data stream, in the same order. There’s no synchronization needed between the systems receiving the input.
        • Hadoop and Storm could both subscribe to the same queues for receiving input data.

      Note that Kafka is not actually the entry point of the whole system, as it currently does not have security. There is another light layer in front of Kafka responsible for receiving from collectors and publishing to Kafka queues. This light layer could simply be Logstash or Fluentd behind DNS and/or a load balancer.
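
      The hand-off from that light receiving layer into Kafka is just a plain producer. A minimal Java sketch (broker address, topic name, and keying scheme are illustrative assumptions):

      import java.util.Properties;
      import org.apache.kafka.clients.producer.KafkaProducer;
      import org.apache.kafka.clients.producer.ProducerRecord;

      public class IngestPublisher {
        private final KafkaProducer<String, String> producer;

        public IngestPublisher(String brokers) {
          Properties props = new Properties();
          props.put("bootstrap.servers", brokers);
          props.put("acks", "all"); // don't lose logs: wait for all in-sync replicas
          props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
          props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
          producer = new KafkaProducer<>(props);
        }

        // key by source so a given source's events stay ordered within a partition
        public void publish(String sourceId, String rawEvent) {
          producer.send(new ProducerRecord<>("raw-logs", sourceId, rawEvent));
        }
      }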

    • HDFS

      This is where the source of truth is stored. It is immutable and normalized. Each atomic piece of data should:

      • be as raw as possible, to allow maximum possibility
      • have a uid, for desired idempotent operations
      • be timestamped

      Enforceable and Evolvable Schema

      Thrift or Protocol Buffer
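
      The three properties above, written out as a minimal immutable Java type for illustration only (the real wire schema would be defined in Thrift or Protocol Buffers so it can be enforced and evolved):

      import java.util.UUID;

      public final class AtomicEvent {
        private final String id;      // uid, so replays/reprocessing stay idempotent
        private final long timestamp; // event time, epoch millis
        private final String raw;     // the payload kept as raw as possible

        public AtomicEvent(long timestamp, String raw) {
          this.id = UUID.randomUUID().toString();
          this.timestamp = timestamp;
          this.raw = raw;
        }

        public String id() { return id; }
        public long timestamp() { return timestamp; }
        public String raw() { return raw; }
      }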

    • Storm

    • My Questions

      • pipeline fault tolerant? Kafka?
      • load balance among nodes
      • plug new computation into the pipeline
    • Scaling

      Data nodes

      EP could have additional data nodes which shard the data as well as spread the search workload. Adding data nodes is dynamic.

      ECS

      Adding an ECS, on the other hand, is not really dynamic (in the distribution sense). Data stays where it was processed, which means users need to manually distribute the sending among multiple ECS instances.

    • FTS - Free-Text Search

      On the UI, it is a 2nd search mode: “Quick Filter”.

      Use Lucene library for free-form search capabilities inside raw Ariel payload data

    • Anomaly Detection

    • Solr v.s. Lucene

      Solr adds functionality like

      • XML/HTTP and JSON APIs
      • Hit highlighting
      • Faceted Search and Filtering (Lucene has basic facet search already since 4.x, see this)
      • Geospatial Search
      • Fast Incremental Updates and Index Replication
      • Caching
      • Replication
      • Web administration interface etc
      • Copy Fields
      • Dynamic Fields
      • Schemaless Mode
    • Limitation

      • No automatic resharding or dynamic adding of replicas

      • Fields cannot be updated/added easily or efficiently

      • No joining of fields across different documents; impossible across multiple servers
    • Facet

      The most basic form of faceting shows only a breakdown of unique values within a field shared across many documents (a list of categories, for example), but Solr also provides many advanced faceting capabilities. These include faceting based upon:

      • the resulting values of a function,
      • the ranges of values,
      • or even arbitrary queries.

      Solr also allows for hierarchical and multidimensional faceting and multiselect faceting, which allows returning facet counts for documents that may have already been filtered out of a search result.
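
      A small SolrJ sketch of the basic field-faceting case (assuming a SolrJ 6+ client; the collection and field names are illustrative): facet the 404 hits by host.

      import org.apache.solr.client.solrj.SolrQuery;
      import org.apache.solr.client.solrj.impl.HttpSolrClient;
      import org.apache.solr.client.solrj.response.FacetField;
      import org.apache.solr.client.solrj.response.QueryResponse;

      public class FacetExample {
        public static void main(String[] args) throws Exception {
          HttpSolrClient solr =
              new HttpSolrClient.Builder("http://localhost:8983/solr/logs").build();

          SolrQuery query = new SolrQuery("status_code:404"); // the search part
          query.setRows(0);                                   // only the facet counts are needed
          query.setFacet(true);
          query.addFacetField("host");                        // breakdown of unique host values
          query.setFacetMinCount(1);

          QueryResponse response = solr.query(query);
          for (FacetField.Count c : response.getFacetField("host").getValues()) {
            System.out.println(c.getName() + ": " + c.getCount());
          }
          solr.close();
        }
      }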

    • Clustering and Scalability

    • I found that having a single source of stream data is great, in that we could have multiple, many consumers: APM, Alerting, QRadar, Analytics …

      But then, each would need to be a stream processing system. However, how do we keep “history” available for processing purposes?

      If we keep the canonical/atomic data in Kafka as the source stream, each consumer is then equal in that they all process on their own. If there is any failure, they just replay from Kafka. There’s no need for other systems to read from Hadoop, for example.

      Plus, any derived stream would also be available if needed by some systems.

      We still need a kind of snapshot/restore facility so we don’t have to keep complete log in Kafka.

    • Storm processing output is better written back to Kafka instead of directly to data stores (say Cassandra). This decouples Storm from the underlying data store.

    • Tuning

      http://www.slideshare.net/Hadoop_Summit/analyzing-12-million-network-packets-per-second-in-realtime (p26)

      • JBOD better performance compared to RAID
      • num.io.threads: same as disk
      • num.network.threads: based on sum of consumer/producer/replication-factor

      Kafka Spout

      • parallelism == #partitions
      • fetchSizeBytes
      • bufferSizeBytes
    • First generation

      (2011)

      • Near real-time searchable logs
      • Thousands of users
        • 10-100k events/sec
        • Typically, when user systems are busy, they send even more logs (Challenges)
        • Log traffic has distinct bursts, can last for hours
      • EC2 (only)
      • Use Solr cloud
        • Tens of thousands of shards (with special ‘sleep shard’ logic)
      • ZeroMQ
    • Second generation

      (2013/09)

      • Always accept log data - never drag the source down because of some problems in ingestion part
      • Never drop log data - some users are willing to trade latency for not losing any data
      • True elasticity - truly and easily scale horizontally and vertically
      • AWS deployment
      • Kafka, Storm, ElasticSearch
    • Storm Removed

      (2014/08)

      See What We Learned About Scaling with Apache Storm: Pushing the Performance Envelope

      • Guaranteed delivery feature causes big performance hit
        • 2.5x hit
        • Potential workaround: batch logs and ack a set of logs at a time. But this is hacky, not consistent with Storm’s concept, and not trivial to do.
        • (now) Trident can be used for this purpose though.
      • Custom approach: 100K+ events per second
    • Try

      scenarios done: 2/3/4/5 (no table form, just chart), 8/9

      Provisioning takes about 1 min.

      Built on Angular; page loads are very slow … It could be the network, but it doesn’t feel fast. In fact it feels very slow, for example when faceting.

    • LogReduce

      Automatically apply the Summarize operator to the current results (non-aggregate).

      The Summarize operator’s algorithm uses fuzzy logic to cluster messages together based on string and pattern similarity. You can use the summarize operator to quickly assess activity patterns for things like a range of devices or traffic on a website. Focus the summarize algorithm on an area of interest by defining that area in the keyword expression.

      Reduces logs into patterns, and works across multiple sources. By identifying changes in log patterns, say a new pattern emerging, it is able to do anomaly detection.

    • Transaction

      allows cross-event tracing of transactions based on a common field, with states defined:

      ... | transaction on sessionid
      with "Starting session *" as init,
      with "Initiating countdown *" thresh=2 as countdown_start,
      with "Countdown reached *" thresh=2 as countdown_done,
      with "Launch *" as launch
      results by transactions showing max(_messagetime),
      sum("1") for init as initcount

      also able to draw flow chart directly:

      _sourceCategory=[sourceCategoryName]
      | parse "] * - " as sessionID
      | parse "- * accepting request." as module
      | transaction on SessionID with states Frontend, QueryProcessor, Dashboard, DB1, DB2, DB3 in module
      results by flow
      | count by fromstate, tostate

    • Scheduled view

      Like a materialized search: pre-aggregated results.

      Index Partitions

      You can create a partition specific to a search sub-range, for performance.

    • Try

      all 9 scenarios can be done.

      It is very much like Splunk, which does all the processing at search time. This typically causes long durations, especially for aggregation-type operations. Loggly-style queries usually come back in less than 5 sec.

      It does support field extraction at collection time. Define the parse operator as a rule and it will be applied automatically. This combines the best of both worlds.

      provisioning not fast …

      collector setup is not flexible. requires installation of its collector.

      now I know why IBM is looking for it. enterprise…

    • Try

      More of a developer-centric mindset. It doesn’t have a rich feature set; much of its functionality is basic “tail”-like operation.

      On the analytics side, it does not have any ‘native’ functions. It allows daily archiving to S3 and then provides some scripts/instructions for loading the archives into different tools/environments for self-service analytics. Ex. Hive or Redshift.

      One interesting tool allows tailing the centralized logs locally.

      A Ruby shop.

    • Try

      Tried local Banana/Solr. To be honest, it’s not fast. Any action on the page takes maybe at least 1 second. (This may be due to local Solr slowness, not sure.)

      Also, sometimes the page just “locks”.

      What I like about Banana (Kibana I assume):

      1. The whole page is controlled by a single json file (not just the widgets, but also the active filters, the “values”), and better yet, you get to load the json from anywhere live (including a gist).
      2. The json can be saved to solr as well.
      3. Purely in-browser.
    • Searches and Reports

      TPAE UI session > 150 
      
      sourcetype=tpae eventtype="tpae_total_users_and_jvm_ui_sessions" OR eventtype="tpae_total_users" OR eventtype="tpae_jvm_ui_sessions"|timechart span=15m max(ui_sessions) as sessions by host|untable _time host sessions|search sessions > 150|timechart first(sessions) by host
      
      TPAE UI session tracks 
      
      correlation_id=* ElapsedTime session_id=*|table session_id correlation_id _time user_id elapsed_time app event|sort session_id correlation_id
      
      TPAE memory available 
      
      sourcetype=tpae eventtype="tpae_memory_status"|stats sparkline(avg(eval(memory_available/1024/1024))) as memory_available, avg(eval(memory_available/1024/1024)) as avg, min(eval(memory_available/1024/1024)) as min, max(eval(memory_available/1024/1024)) as max by host
      
      TPAE slow query by app, object 
      
      sourcetype=tpae eventtype="tpae_slow_sql"|stats count, avg(sql_time) as avg_exec_time, max(sql_time) as max_exec_time, min(sql_time) as min_exec_time by app, object|sort 20 - count, - avg_exec_time 
      
      TPAE stacked memory available and used 
      
      sourcetype=tpae eventtype="tpae_memory_status"|timechart span=15m avg(eval(memory_available/1024/1024)) as avg_available, avg(eval((memory_total - memory_available)/1024/1024)) as avg_used
      
      TPAE top mbo count 
      
      sourcetype=tpae eventtype="tpae_mbo"|stats sparkline(avg(mbo_count)), avg(mbo_set_count) as set_count, avg(mbo_count) as count by mbo|sort 20 - set_count
      
      TPAE ui sessions' correlations 
      
      correlation_id="*" ElapsedTime|stats first(elapsed_time) as time first(user_id) as user first(app) as app first(session_id) as session_id by correlation_id|stats count(correlation_id) as count first(time) as time first(user) as user first(app) as app by session_id|sort 10 - time
      
      TPAE users and session 
      
      sourcetype=tpae eventtype="tpae_total_users_and_jvm_ui_sessions" OR eventtype="tpae_total_users" OR eventtype="tpae_jvm_ui_sessions" | timechart span=15m first(users) as total_users, avg(ui_sessions) as per_jvm_sessions
    • It’s bad for end users to be the first to notice this.

    • Simple Rule

      
      if resp_time > 2 sec
        alert!
      end

      That is easy.

    • Dynamic Baseline

      This is better.

    • Break Detection

      We should detect this for users.

    • Anomaly Detection

      Seasonal patterns, longer period underlying trend.

    • Simple rules, advanced rules with states, really complex rules with lots of calculations …

    • Notification

      In time, even better before the incidents.

    • True Detective

      • Metrics
      • Logs
      • Events
      • Configurations

      We need complete information.

    • With the identified incident’s time:

      • Pull out all kinds of data nearby:
        • All plotted on time series charts
        • Easy navigation, ex/inclusion, comparison, annotation
    • Faceting

      • visual clue at a glance
      • click any to add filter(s)
    • Pivot Faceting

      multi-level facet across separate fields

      • exception / java_class
      • url_path / resp_time
    • Time Series Faceting Chart

      Normal with message volume over time:

      Drag/drop a facet onto it becomes:

    • Learning

      • Once resolution identified, the causality is learned
        • for future alerts as hints
        • for refined alerts
      • Guidance and suggestion (common sense and learning)
        • Rules for analysis assistance
    • (from goaccess)

    • A few scenarios require more than just search with simple aggregation/sort.

      1. Daily unique visitors (ip-date-agent)
      2. Top operating systems sorted by unique visitors

      (Others are mostly achieved using group-by.)

    • Daily unique visitors:

      * | parse using public/apache/access 
        | timeslice 1d 
        | sum(size), count by src_ip, _timeslice, agent 
        | sum(_sum) as size, count by _timeslice
      
      Time                      size  count
      12-26-2014 00:00:00 62,277,020 15,070
      12-27-2014 00:00:00    447,010    100

      (SumoLogic syntax)

    • Top operating systems sorted by unique visitors:

      * | extract "(?<src_ip>\S+?) 
          \S+ \S+ \S+ \S+ \"[A-Z]+ \S+ 
          HTTP/[\d\.]+\" \S+ \S+ \S+ 
          \"(?<agent>[^\"]+?)\"" 
        | if (agent matches "*Win*", "Win", "0") as OS 
        | if (agent matches "*Linux*", "Linux", OS) as OS 
        | where OS != "0" 
        | timeslice 1d 
        | count by OS, src_ip, _timeslice, agent 
        | top 10 OS

      (SumoLogic syntax)

      • UI session > 150 - hosts having excessive UI sessions and timesliced
       | timechart span=15m 
         max(ui_sessions) as sessions by host
       | untable _time host sessions
       | search sessions > 150
       | timechart first(sessions) by host
      • memory available - host stats with sparkline, min, max, avg in table format
       | stats 
         sparkline(avg(eval(memory_available/1024/1024))) as memory_available, 
         avg(eval(memory_available/1024/1024)) as avg, 
         min(eval(memory_available/1024/1024)) as min, 
         max(eval(memory_available/1024/1024)) as max 
         by host
      • stacked memory available and used - timesliced avg memory used vs total
       | timechart span=15m 
         avg(eval(memory_available/1024/1024)) as avg_available, 
         avg(eval((memory_total - memory_available)/1024/1024)) as avg_used
      • slow query by app, object - sorted query stats (avg,min,max) grouped by (app,object)
       | stats 
         count, 
         avg(sql_time) as avg_exec_time, 
         max(sql_time) as max_exec_time, 
         min(sql_time) as min_exec_time 
         by app, object
       | sort 20 - count, - avg_exec_time
      • top mbo count - top N mbo stats with sparkline and avg
       | stats 
         sparkline(avg(mbo_count)), 
         avg(mbo_set_count) as set_count, 
         avg(mbo_count) as count 
         by mbo
       | sort 20 - set_count
      • users and session - timesliced total user and session counts
       | timechart span=15m 
         first(users) as total_users, 
         avg(ui_sessions) as per_jvm_sessions
      • ui sessions’ correlations - group by action first then group by session, time rolled up
       | stats 
         first(elapsed_time) as time 
         first(user_id) as user 
         first(app) as app 
         first(session_id) as session_id 
         by correlation_id
       | stats 
         count(correlation_id) as count 
         sum(time) as time 
         first(user) as user 
         first(app) as app 
         by session_id
       | sort 10 - time
    • Looking at the architecture, QRadar is composed of one Console and multiple managed hosts (in a distributed way). It is not clear (to me) how the management of this distributed system is done.

      • How to add a new managed host (or remove one)? Is it dynamic? There’s actually one more kind, the data node, which is a per-managed-host scaling extension. You can freely add managed hosts and data nodes. Data stored in a managed host stays there forever, as there’s no automatic balancing among managed hosts. However, the data nodes of a managed host do have rebalancing. Data nodes essentially shard the data as well as balance the search workload.
      • How does load balancing among managed hosts work? Is it static for a specific data source? Currently there is no balancing among managed hosts.
        • somewhere I read that a load balancer is supported: “syslog based data sources (or anything ‘pushed’ to qradar), we now process those log sources on ANY event processor”
        • does that mean it doesn’t matter which node an event is stored on, since the query is distributed to all nodes? or is there some mechanism to ‘forward’ an event to the right place?
      • How does Tomcat/UI scale? Are multiple Console nodes allowed? Single UI currently, multiple not allowed.
      • For Ariel, I’ll assume Ariel is not a distributed database, so no sharding, etc. In that case, data collected at a host is stored in the same place and queried by the same host, correct? Correct, static.
      • Multi-tenant support?
    • Source input uses in-memory queues for buffering. I understand events are dropped once queues are full (though that is not really desirable). What about fault tolerance? If a host is down, are all events lost?

      What is the typical (or expected) “turn-around” time between when an event is sent to QRadar and when it is searchable on the UI?

    • The Ariel schema seems to be static (“normalized fields”, as they are called). If users want to parse their own log source, they need to create a Log Source Extension (LSX). It seems only those normalized fields are usable in AQL.

      • Schema-less support? Is adding/changing fields allowed? If not, say my log has something like “… elapsed_time=103 …”; would it be possible to query by “elapsed_time” (as a field)?
      • Dynamic (or virtual) fields?
    • How does free-text search’s Lucene index building fit into the pipeline?

      It is only used for full-text search, along with the Ariel query. Full-text search filters first and then feeds into the Ariel search.

      • Does it run in parallel to the pipeline? on all nodes?
        • Is free-text search distributed?
      • Exactly which content/field is being indexed?
        • Does it allow specifying custom field (in Lucene index)? or is it only full-text search on the single entity “payload“?
      • Is it still based on Lucene 3.x?
        • No faceting (even simple faceting) is a major problem.
    • The pipeline …

      • Is the whole pipeline synchronous or asynchronous?
        • seems asynchronous as there are queues in some components (parser, CRE, forward, etc …)
        • then what is the internal queues used by the pipeline? scalability and fault-tolerance characteristics? (seems in-memory queues)
        • How to scale the pipeline?
    • Rules are defined with pre-defined tests, specifically for the security domain. Some example pre-defined tests:

      • when the local network is one of the following networks
      • when the destination network is one of the following networks
      • when an event matches any|all of the following rules
        • dependency/relationship among rules
      • when the event(s) have not been detected by one or more of these log sources for this many seconds
        • timed window
      • when these rules match at least this many times in this many minutes after these rules match
        • counter combined with timed window

      So …

      • How to add new test? You cannot.
      • What is event streaming used for? For alerts?
    • Event Correlation Service (ECS) - event/flow pipeline

      • Event Collector
        • Receive events (Protocol sources)
        • Parse and normalize events (DSMs)
        • Coalesce events
        • Optionally forward events
        • Log Source Traffic Analysis & Autodetection
      • Event Processor
        • Correlate events (Custom Rules Engine (CRE))
        • Store events (in our proprietary Ariel database)
        • Forward events matching rules within the CRE to our Magistrate component
      • Magistrate (MPC) (Console only)
        • Associate matching events with offenses
    • Ariel and AQL

      • Time-series DB
      • Proxy (console only) is responsible for issuing distributed search (and caching)

      It is specifically for events/flows (at least originally). Not sure how suitable it is for arbitrary data. It comes down to whether it supports a dynamic schema. Say ingestion produces a new field; can this field be created in the Ariel schema?

      Of course, there are only 2 “tables” (events and flows), so naturally AQL does not have joins.

      The point of a custom query language (like Splunk/Sumo Logic’s) over the Lucene/Solr query language is to be able to conveniently “stream” operations, manipulate field(s), aggregate, etc.

    • Distributed PostgreSQL

      Master in console, slaves (read-only) in all managed hosts

      • Much of QRadar’s configuration and reference data comes from a local postgres database.
      • Proprietary database replication scheme - periodically packages up changes from “interesting” tables on the console and serves them up to MHs that will periodically hit a webservice in tomcat to get the latest DB deltas.
    • Hostcontext

      node process manager

      Runs on each appliance in a deployment

      • Runs the “ProcessManager” component that is responsible for starting, stopping and verifying status for each component within the deployment.
      • Is responsible for the packaging (console) and the download/apply (MH) of our DB replication bundles.
      • Responsible for requesting, downloading, unpacking, and notifying other components within an appliance of updated configuration files.
      • Responsible for monitoring postgresql transactions and restarting any process that exceeds the pre-determined time limit. This portion is referred to as “TxSentry”.
      • Responsible for disk maintenance routines for disk cleanup.
    • Accumulator

      pre-populate by-minute, by-hour, and by-day counts of data for efficient graphing

    • Asset Profiler

      owns and is responsible for persisting our asset and identity models into the DB

    • Vis

      responsible for communicating with third party vulnerability scanners (through our plugins), scheduling scans and processing results

    • QFlow

      receiving flows from a number of sources, creating flow records, and periodically sending these records to ECS for further processing. sources could be:

      • Raw packets from a network interface
      • Raw packets from a high-speed packet capture device (DAG or Napatech)
      • JFlow (Juniper), SFlow, Packeteer, Netflow / IPFIX (Cisco)
    • Reporting Executor

      Responsible for the scheduling and execution of the “report runner”. The report runner generates the reports that the executor requests.

    • Terms

      • Flow: The record of network communication between two end points. A flow is based on the usual 5-tuple: source IP, source port, destination IP, destination port, protocol.
        • A flow is different from an event, in that flows (for the most part) will have a start and end time, or, a life of multiple seconds.
      • Offense: A container used to represent the flows and events that caused an alert within QRadar. Example:
        A CRE rule exists with the following tests:
        -Create an offense when there have been at least 10 login failures followed by a login success from a single source to a single destination within a 10 minute window.
        This rule could be used to try and find people guessing (and succeeding) at passwords. In this case, an offense would get created that contained all the login failures and the login success. It would appear under the “offenses” tab in the UI and would appear on the dashboard.
      • Reference set: A set of reference data within QRadar that is able to be populated by the CRE and able to be consumed and tested against in the CRE. This is a case where the response to a CRE rule could be “add the source IP to this reference set” and another rule could be “if the destination IP matches an IP in this reference set….”
    • Design

      Data comes from the Ariel payload and is fed to FTS. Essentially it creates one index file per time slot (default is 8h) and keeps N slots (configurable). Search can be parallelized across multiple slots (then aggregated). Indexing must be done and committed to the index file before a searcher instance can be created.

      Hence the last, most recent slot is special: it uses Lucene’s NRT search and has the limitation that the indexer and searcher must be in the same Java process (for performance).

      Solr utilizes the update log and hence can read uncommitted changes from the log. Benefit? This real-time get might only be limited to get-by-unique-key.

    • Concern

      Why not use Solr, as it sits right next to Lucene in Apache?

      • Scaling/sharding
      • Faceting
      • Searching API

      Solr is pretty mature now, so why build from scratch? Any special features?

      • Facet Types

        • Field faceting - specify field(s) to facet on while searching
        • Query faceting - use arbitrary queries as facets; each query result would be one facet term
        • Range faceting - kind of a special case of query faceting for numeric and date fields
        • Multi-select faceting - tag/exclude allows ignoring the filter on an indicated facet, so that while filtering on a facet term, the counts of that facet’s other terms are still preserved. Useful for UIs allowing multi-selection (without the UI layer having to remember those counts)
        • Pivot faceting - (or nested faceting) facet across multiple dimensions (ex. how many 4- and 5-star restaurants exist in the top 3 cities in the top 3 states); currently only allowed on field faceting. Pivot faceting also has practical performance constraints.
      • Ideas

        • Comparison between different filtering results
          • While experimenting filtering on different facets, it would be great to have a way to quickly compare with previous faceting result(s), if users desire.
        • Multi-selection of facets, Hierarchy facets selection
      • Lessons Learned

        • Event ingestion too tightly coupled to indexing
          • Manual re-indexing for temporary SOLR issues
        • Multiple Indexing strategies needed
          • 4 orders of magnitude difference between our high volume users and low volume users (10 eps vs. 100000+ eps)
          • Too much system overhead for low volume users
          • Difficult to support changing indexing strategies for a customer

        That’s how Kafka and Storm entered.

      • ElasticSearch

        • Add indices with REST API
        • Indices and Nodes have attributes
          • Indices automatically moved to best Nodes
        • Indices can be sharded
        • Supports bulk insertion
        • Plugins for monitoring cluster
      • Solr

        • The collection API in Solr was brand new and very primitive, while ES had native, robust, battle-tested index management.
        • Both had sensible default behavior for shard allocation to cluster nodes, but the routing framework in ES was significantly more powerful and stable than the collection API.
        • Solr consistently indexed at about 18k eps, while ES started at about the same rate and then gradually slowed down to around 12k eps (though ES was not yet using the newer Lucene version)
        • Neither allowed for dynamic changes to the number of shards in an index after creation.
      • Kafka is a perfect match for real-time log events

        • High throughput
        • Consumer tracks their own location
        • Use multiple brokers
          • Multiple brokers per region
          • Availability zone separation

        Storm

        • “Pull”
          • Provisioned for average load, not peak - because Kafka decouples Storm from the ingestion point, you don’t always need to keep up with the peak workload
          • Worker nodes can be scaled dynamically
      • Ingestion Architecture

        • Multiple collectors at first layer (multiple zone)
          • C++ multi-threaded
          • Boost ASIO
          • Each can handle 250k+/sec
        • Collectors to Kafka brokers
        • Then Storm picks from Kafka; after processing, results are sent back to Kafka
          • Storm topology: first (multiple) classification bolts (identifying customers) and summary statistics bolts in parallel, then a rate determination bolt (monitoring volume limits), last (multiple) analysis bolts and archiving bolts in parallel (to S3), finally emitting to Kafka (stage 2)
          • Snapshot (stage 2) the last day of Kafka events to S3
          • Also send to S3 for the customer’s own usage
        • ElasticSearch picks from Kafka
          • problematic events are deferred to Kafka for later re-processing
      • Pre-production system

        • Parallel to production, picking events from the same Kafka topics (just one of the partitions, as it does not need the full data set)
      • AWS deployment

        • Collector needs high compute power, disk is not important
        • Kafka is memory optimized, needs disk buffer caching
      • False starts

        • AWS ELB for load balancing between collectors
          • No port forwarding 514 (syslog)
          • No UDP forwarding
          • Burst hit its ELB limit
        • Ended up using Route 53 DNS round robin
        • Originally used Cassandra both as the persistence store and as a queue
          • Cassandra not designed for queue usage
          • Added complexity, as multi-tenancy requires handling data bursts. Collectors still needed to be able to buffer to disk.
      • Big Wins

        • Easy to scale Storm resources
        • Route 53 DNS support with latency-based resolution
        • Pre-production staging system
      • Dynamic Fields

        At a glance, you can see a summary of the different kinds of exceptions currently occurring and the frequency of these exceptions on desktop vs mobile or across different browsers (a kind of sub-grouping). It’s this instant and continually customized overview that is new and different. It allows you to get a much fuller view of your logs and to more quickly zero in on what’s important.

      • Ingestion syslog utility

        PatternDB to parse log messages and generate name-value pairs from them. These can then be forwarded to Loggly using the JSON output.

        See https://www.loggly.com/blog/sending-json-format-logs-syslog-ng/

        Should you send us a log format we don’t currently recognize, we’ll want you to tell us about it.

      • Ingestion and Source setup

        Provides a quick upload method (curl) to try first. Very simple and straightforward (though it would be even simpler to upload on the page, I suppose).

        Auto detected apache access log as the type.

        Does seem to be syslog based (on the receiving end). In fact, all vendors do support a syslog endpoint.
        Even the file monitoring instructions use the syslog configuration method. Even log4j is instructed to be configured to send to the local syslog daemon first.

        From client side, you can also easily send logs via http/https, in plain text or json format.

        Uses tokens for security/identity. (https://logs-01.loggly.com/bulk/95179f81-503b-4d70-a2ed-728955261cc0/tag/file_upload)

        Ingested data takes some time to show up on the search page (~20s).

        Source group: logical grouping on host/app/tag (AND). Used for things like production/staging.

        There doesn’t seem to be a way to re-parse existing data dynamically from the UI. That is, to modify the parsing method (like adding a new field as in Splunk), you’d have to re-send the data (after changing the collector).

        json is automatically parsed and fields are automatically extracted. There are a few supported log formats, but other than those, you have to parse the data yourself and send it in as json.

        In general, only has syslog and http/https endpoints.

        One interesting thing: wikified strings are considered while indexing. That is, S3Stats.processStatEvents is indexed as “S 3 Stats process Stat Events”.

      • Search syntax is similar/identical to Lucene/Solr with minor extensions, like field:>40, exists:field:value, and range queries field:[xxx TO yyy]

        Faceting is on the left side. It’s not very fast (the slow feel might be my network, though). 2-level faceting is possible, and clicking creates a filter. Multiple filters are allowed, and you can easily remove an active filter.

      • UI

        I like the Google Finance-like timeline operation.

        Save the current ‘search’ easily with the “star” button.

        Alerts are based on saved searches and run periodically (the quickest is every minute). It can only alert if the saved search condition matches more than some number of times in some past period, like over 5 times in the past 5 minutes. In addition to email, you only have HTTP callback and PagerDuty choices.

        Charting/trending/grouping is done in the UI and split into multiple steps: first select the chart type, then group accordingly. There’s also a grid view where you can freely compose your own table view.

        Currently only timeline/pie/bar/single-value. Charting should really include grid/table form.

        The ‘dimension’ in charting is fixed. That is, you cannot combine multiple fields. For example, unique visitors in Apache access logs require a combination of IP/Date/Agent. This is more fundamental to the faceting, which only allows faceting on an existing field, with no composition.

        Also, there’s no control over the time unit. A saved widget uses the current time range as-is (which is not easy to use, btw). Say I want to have a trend line on a per-day unit.

        I like the tabbed interface, able to preserve a diagnosis session (most likely the time range filter).

        When charting/trending, you can’t do roll-ups, like grouping by one field and then rolling up some other columns (of course, it can only show the single field used as the group-by). For example, top hosts also showing total bytes requested for each host.

        • Ingestion

          Requires a custom collector. Once the collector is installed, it is managed via the web UI. Sources can have different static tags for search purposes. Filters can also be defined on sources (include/exclude, hash, mask).

          Currently, sources can be:

          • Local File Source - files on the same collector host.
          • Remote File Source - ssh / unc
          • Syslog Source
          • Windows Events Source
          • Script Source - pretty limited; must be on the same host as the collector. Only bat/vb/bash/python/perl.

          There’s another “hosted” option which provides HTTP/S and S3 endpoints.

          The collector takes jobs from the server, seemingly scheduled every 2 minutes. Other than that, ingestion speed is fine compared to Loggly (can’t really tell exactly how slow or fast it is due to the lag).

          The parsing is done during search time, just like Splunk.

        • It has pipes in the search language, just like Splunk.

          There’s a summarize operator which clusters messages based on pattern similarity. Ex.

          error or fail* | summarize | parse "message: *" as msg | summarize field=msg

          More examples:

          count _size | top 10 _size by _count

          Parsing an Apache access log:

          * | parse using public/apache/access | where status_code matches "4*"
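
          A natural extension, still in SumoLogic syntax and assuming the same parsed status_code field, would be to aggregate those 4xx responses:

          * | parse using public/apache/access | where status_code matches "4*" | count by status_code | sort by _count
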
        • UI

          Actually more mature than Loggly.

          I like the ‘Library’ overlay that shows common/stock searches categorized by application.

          When I tried to create a widget out of the unique-visitors result, it failed with an unsupported-operator error.

          Dashboard: for some reason, I can’t get it to show data successfully.

        • Alert

          Also via saved search. Even though the UI says it can be real-time, I think it is still running periodically.

          Alert actions can be email, script, or save-to-index.

        • Ingestion

          Basically syslogd. It also provides a small remote_syslog daemon.

          The least feature-rich tried so far. Essentially positioned as a centralized “tail” service. Minimal search functionality (basic boolean operators and full text). Also seems to only accept syslog messages.

          Allows log filters that remove unwanted logs during ingestion.

        • Alert

          Scheduled saved searches. Supports many webhooks for notification.

        • Dashboard

          The whole dashboard is row-based, with each row containing multiple “panels”. You can specify the height of each row in pixels and the width of panels in x/12 (like Bootstrap). There are about 17 panel types to choose from, more like pre-defined panels, e.g. timepicker, query, facet, filtering, etc. Think of them as widgets.

          • bettermap: Map
          • column: container panel allowing vertical arrangement of multiple panels within a row.
          • facet:
          • filtering: shows current active filters
          • fulltextsearch
          • heatmap: for showing pivot facet
          • histogram: A bucketed time series chart of the current query, including all applied time and non-time filters. When used in count mode, uses Solr’s facet.range query parameters. In values mode, it plots the value of a specific field over time, and allows the user to group field values by a second field.
          • hits: shows total events count
          • map
          • query: search bar
          • rangeFacet
          • scatterplot: choose 2 variables
          • table: A paginated table of records matching your query (including any filters that may have been applied). Click on a row to expand it and review all of the fields associated with that document. Provides the capability to export your result set to CSV, XML or JSON for further processing using other systems.
          • term: Displays the results of a Solr facet as a pie chart, bar chart, or a table. Newly added functionality displays min/max/mean/sum of a stats field, faceted by the Solr facet field, again as a pie chart, bar chart or a table.
          • text
          • ticker: stock ticker style
          • timepicker

          Each panel defines its own config section.
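
          Roughly, a saved dashboard is just a JSON document along these lines (key names are approximate, from memory, not the exact Banana schema):

          {
            "title": "example dashboard",
            "rows": [
              { "height": "150px",
                "panels": [
                  { "type": "timepicker", "span": 3 },
                  { "type": "histogram",  "span": 9, "time_field": "timestamp" }
                ]
              }
            ]
          }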

          Some panels apply globally to all panels on the page (like timepicker and query).

          “Display help message when Inspect” allows html, markdown, and text modes, with the content specified in the panel config. For example, bettermap can show the active query on inspect.

          One pitfall is that some faceting only works on numeric fields, which is kind of inconvenient.

        • The Database

          New Relic Insights is powered by a highly distributed cloud-hosted event database with an innovative architecture that does not require indexing.

          Data is distributed across cluster nodes so each query can leverage the full power of the cluster to return billions of metrics in seconds.

          The New Relic Insights database does not require indexing. This eliminates the need to configure the database in advance based on the types of queries expected. Query speed is fast regardless of what attributes you query.

        • Presenting Data

          New Relic Insights uses a SQL-flavored query language, NRQL (New Relic Query Language), for querying the event database.
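
          A typical NRQL query looks something like the following (the event type and attribute are illustrative, not pulled from a real account):

          SELECT count(*) FROM PageView FACET userAgentOS SINCE 1 day ago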

        • The amount of data we collect every day is staggering. Initially all data is captured at full resolution for each metric. Over time we reperiodize the data, going from minute-by-minute to hourly and then finally to daily averages. For our professional accounts, we store daily metric data indefinitely, so customers can see how they’ve improved over the long haul.

          Our data strategy is optimized for reading, since our core application is constantly needing to access metric data by time series. It’s easier to pay a penalty on write to keep the data optimized for faster reads, to ensure our customers can quickly access their performance data any time of the day.

          Creates a database table per account per hour to hold metric data. This table strategy is optimized for reads vs. writes

          Having so many tables with this amount of data in them makes schema migrations impossible. Instead, “template” tables are used from which new timeslice tables are created. New tables use the new definition while old tables are eventually purged from the system. The application code needs to be aware that multiple table definitions may be active at one time.
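
          In MySQL terms the template approach might look like the sketch below (table names are hypothetical):

          -- new hourly timeslice tables are stamped out from the current template
          CREATE TABLE metrics_account42_2015011012 LIKE metrics_template_v7;
          -- tables created from older templates keep their old definition until
          -- they are purged, so application code must cope with multiple layouts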

        • Data collection handled by 9 sharded MySQL servers.

          The “collector” service digests app metrics and persists them in the right MySQL shard.

          These back-end metrics are persisted in the customer’s New Relic agent for up to one minute, from where they are then sent to the New Relic data collection service.

          Real User Monitoring javascript snippet sends front-end performance data to the “beacon” service for every single page view.

        • Challenges

          • Data purging: Summarization of metrics and purging granular metric data is an expensive and nearly continuous process
          • Determining what metrics can be pre-aggregated
          • Large accounts: Some customers have many applications, while others have a staggering number of servers
          • Load balancing shards: Big accounts, small accounts, high-utilization accounts
        • Plugin architecture

          An In-Depth Look at the New Relic Platform

          Plugins are made up of two parts:

          • An agent that observes the system being monitored and reports metrics about that system up to the New Relic data center
          • A matching set of dashboards that display those metrics
        • Alerting

          Alerting: AppDynamics computes your response time thresholds by itself and might take some time to learn your system. New Relic relies on custom thresholds defined by you for its Apdex index.
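
          For reference, the Apdex score mentioned here is the standard ratio over a chosen response-time threshold T (satisfied: under T, tolerating: under 4T):

          Apdex = (satisfied + tolerating / 2) / total samples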

        • Event Pipeline

          Receiving, processing (normalize & coalesce), running the data against rules, and then storing it to disk.

        • Event Collector

          There are two separate paths for events and flows within the event collector. It is responsible for normalizing and parsing the received events.

          There are multiple in-memory “input” queues (for different types of source) that house incoming events. Queues are bounded by “event counts” rather than memory usage, and as such, are limited to set numbers to ensure we don’t consume all available memory. Once those queues fill, any new events that come in are dropped.
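
          Not QRadar code, just a minimal sketch of that bounded-by-count, drop-when-full behavior:

          from queue import Queue, Full

          input_queue = Queue(maxsize=100_000)   # capped by event count, not memory

          def enqueue(event):
              try:
                  input_queue.put_nowait(event)
              except Full:
                  pass  # queue is full: the new event is simply dropped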

        • Event Processor

          The Event Processor houses the Custom Rule Engine (CRE) and the storage filters for Flows and Events.

        • Magistrate

          It is responsible for creating offenses and tabulating all the information associated with those offenses.

        • What’s good about Storm?

          • Spouts and bolts fit the network approach, where logs can move from bolt to bolt sequentially or be consumed by several bolts in parallel
          • Guaranteed data processing
          • Dynamic deployment
        • ElasticSearch index management

          • Indices are time-series data
            • Separated by customer
            • Represent slices of time
            • Higher volume index will have shorter time slice
          • Multi-tier architecture for efficient indexing
            • Multiple indexing tiers mapped to different AWS instance types
        • Use the right tech for the job. The main New Relic web application has always been a Rails app. The data collection tier was originally written in Ruby, but was eventually ported over to Java. The primary driver for this change was performance. This tier currently supports over 180k requests per minute and responds in around 2.5 milliseconds with plenty of headroom to go.

          • Events Source
            • Push: syslog (tcp/udp), SNMP
            • Pull: files (sftp/ftp/scp), jdbc, WMI (windows domain controller service), https for applications such as cisco ips, and checkpoint where qradar uses a custom application (leapipe2syslog) to connect, query and send the data back into qradar
          • Normalization

            Extracting properties: event id, source ip, source port, dest ip, dest port, protocol, username, pre-nat ip/port, post-nat ip/port, ipv6 source/dest ip, source/dest mac, machine name.

          • Coalesce

            If events repeat, then the events coalesce based on a multipart key, which uses source ip, dest ip, dest port, username & event id. Once 4 events come in that match the same key, the 4th event is coalesced with any additional events that match that same key, for the next 10 seconds. Once 10 seconds goes by, the counter is reset and coalescing for that “key” restarts.
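
            A rough sketch of that coalescing rule as described (purely illustrative, not QRadar internals):

            import time

            WINDOW = 10          # seconds of coalescing once the 4th match arrives
            state = {}           # key -> {"count": matches seen, "until": deadline}

            def coalesce_key(e):
                return (e["src_ip"], e["dst_ip"], e["dst_port"], e["username"], e["event_id"])

            def ingest(event, now=None):
                """True = store the event itself, False = fold it into the 4th event."""
                now = time.time() if now is None else now
                s = state.setdefault(coalesce_key(event), {"count": 0, "until": None})
                if s["until"] is not None:
                    if now < s["until"]:
                        return False                  # inside the 10s window: coalesce
                    s["count"], s["until"] = 0, None  # window expired: reset and restart
                s["count"] += 1
                if s["count"] == 4:                   # the 4th match opens the 10s window
                    s["until"] = now + WINDOW
                return True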

          • Traffic Analysis

            autodetect log sources

          • Offsite Target

            Forward processed, parsed events to another QRadar deployment. This is typically used in “Disaster Recovery” (DR) deployments, where customers want to have a second console/installation that has a backup copy of production data.

          • Selective Forwarding

            configured in the Admin tab under the item “Routing Rules”. Options in the routing rules allow a user to filter/drop/modify events that are going through the pipeline.

            This happens AFTER licensing, occurring between the EC and EP processes, so events that are, say, marked to be dropped STILL count against your EPS license.

            The options under “routing options” include:

            • forward: to another destination than the EP
            • drop: completely drop the event from the pipeline, so that it is NOT stored to disk, forwarded, etc
            • bypass correlation: this puts a tag on the event itself (and is written to disk w/ the ariel record) to NOT process the event in the CRE. Not even the false positive rule would match the event.
          • Sources
            • Local EC
            • Local CRE (events created by rules)
            • External EC
            • External EP
              • Processing by Magistrate to create or contribute to an offense; or
              • Centralized processing on the CRE because the event has been deferred by the Managed Host; or
              • Both of the above.
          • Custom Rule Engine

            The CRE runs on each event processor and monitors data as it comes through the pipeline. There are multiple types of rules, which check either properties of single events, or functions which can track multiple events over a period of time or some other “counter”-based parameter.

            The CRE is an asynchronous filter which performs the processing of events and flows against all the rules currently active in the system. Some of the activities include:

            • Marking events with full or partial matches on tested rules; and
            • Identifying events with tags for Magistrate or Central Processing; and
            • Delegating responses to rules (SNMP, Reference Set, Notifications … etc); and
            • Emitting new CRE events when specified by a given rule.
          • Event Streaming

            The event streaming code bypasses normal ariel searching and uses a connection from the “ecs” service directly to tomcat.

          • This is the heart of our correlation abilities. The UI component allows a rich set of real-language based tests such as “and when the Source IP is one of ….” or “and when the payload matches the following regular expression”. Customers configure these rules to target specific incidents that they may be interested in. The output of the CRE is typically an offense, an e-mail, a system notification, an entry into a reference set, to forward the event on somewhere else, etc…

          • Rules are defined with preset tests, specifically for the security domain. Rules have multiple variables for users to fill in. A rule can have multiple tests, combined with an and/or relationship.

            One kind of test could be when a specific set of “rules” matched …

            Rule action can be:

            • change event severity, etc
            • annotate event
            • drop event

            Rule response can be:

            • create a new event
            • email
            • syslog
            • forward
            • notify
            • trigger scan

            There can also be a limiter: no more than N responses per X (time) per unit (rule/source/etc.).
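
            A minimal sketch of such a limiter (at most N responses per X seconds for each unit), just to illustrate the idea:

            import time
            from collections import defaultdict, deque

            class ResponseLimiter:
                def __init__(self, n, seconds):
                    self.n, self.seconds = n, seconds
                    self.recent = defaultdict(deque)   # unit -> timestamps of responses

                def allow(self, unit, now=None):
                    now = time.time() if now is None else now
                    q = self.recent[unit]
                    while q and now - q[0] > self.seconds:
                        q.popleft()                    # forget responses outside the window
                    if len(q) < self.n:
                        q.append(now)
                        return True                    # respond (email, syslog, notify, ...)
                    return False                       # over N per X for this rule/source/etc.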

          • Global Rule

            For rules that have counters or functions and are marked as “test globally”, any partial matches are sent up to the EP on the console for counting.

            For example, if your rule says look for windows events and 5 login failures from the same ip and same username, and it’s marked test globally, each login failure for windows events is sent to the EP on the console from the EP that sees them. If there were 3 failures on one EP, then 2 on another, the central EP will get notification of all 5 and fire the rule.

            When you have a rule that’s marked to test globally, it only really triggers if it’s a rule with a function. A rule that only tests a single event, even for multiple properties, is automatically NOT tested globally, to avoid unnecessary load on the event pipeline.

          • For example if you had a rule looking for an event with user = x and source ip = y and event type is “login failed”, that will automatically NOT run a global test, because it’s only testing 3 properties of a single event. However, if you have a rule marked test globally, and you have a function looking for a count of events over a period of time, that’s called a partial match. When you have partial matches across multiple event processors, this is why we need to have the concept of “global rules”.

            What happens then is that the event processor with the partial match will send that event “reference” to the central event processor on the console. Then, if another EP in another location also finds an event that matches, it also sends it to the console for global testing, and we can then do correlation across multiple event processors.
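
            A toy sketch of that cross-EP counting (the 3 + 2 login-failure example), purely illustrative:

            from collections import defaultdict

            THRESHOLD = 5                       # e.g. 5 login failures, same ip + username
            partial_counts = defaultdict(int)   # (rule_id, ip, username) -> count

            def fire_rule(rule_id, key):
                print("global rule", rule_id, "fired for", key)   # create offense / CRE event

            def on_partial_match(rule_id, ip, username):
                """Runs on the console EP whenever a managed EP reports a partial match."""
                key = (rule_id, ip, username)
                partial_counts[key] += 1
                if partial_counts[key] >= THRESHOLD:   # 3 from one EP + 2 from another
                    partial_counts[key] = 0
                    fire_rule(rule_id, key)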

            Global tests are currently turned off by default, since the load on the central EP can be substantial if a lot of rules are tested globally.

            We must store duplicates in GCC mode for magistrate to be able to properly handle pending offenses and partial matches of rules along the way. Essentially what happens is:

            1. Local CRE on an EP runs through the list of tests. If the event/flow matches a rule tagged as global, that event/flow gets sent to the console (central CRE) for processing.
            2. Local CRE on the EP continues to run through the rest of the tests and stores the results locally, since it was the EP that received the event.
            3. Central CRE on the console receives the event and runs it through the list of global rules and tags the event with any matches it finds. This gets stored with the duplicate flag. The custom rules matched/partially matched will only be the ones from the global rules.

            Searches WILL show both events because they are both there.

            Magistrate can now create offenses and run pending offense queries to backfill offenses that had lead-up global events.

          {"cards":[{"_id":"54aa489123db65801d8eaa67","treeId":"5483ac8e1268987d4019293f","seq":1450532,"position":1.5,"parentId":null,"content":"# SaaS \"Log\" Management"},{"_id":"4ea834399524482ffa0000fb","treeId":"5483ac8e1268987d4019293f","seq":1456227,"position":0.5,"parentId":"54aa489123db65801d8eaa67","content":"## Usage Scenario\n\nWhere and how can we add value to users?"},{"_id":"4ea6ddce0721ceef8e0000f9","treeId":"5483ac8e1268987d4019293f","seq":1464116,"position":1,"parentId":"4ea834399524482ffa0000fb","content":"### Sudden Surge of Page Rendering Time"},{"_id":"4ebf336b0c63c51333000110","treeId":"5483ac8e1268987d4019293f","seq":1464117,"position":0.5,"parentId":"4ea6ddce0721ceef8e0000f9","content":"![](http://blog.eukhost.com/wp-content/uploads/2013/06/Make-Website-load-faster.png)\n\nIt's bad for end users to be the first to notice this."},{"_id":"4ecd6303d6f68dd000000117","treeId":"5483ac8e1268987d4019293f","seq":1473100,"position":0.5625,"parentId":"4ea6ddce0721ceef8e0000f9","content":"### Simple Rule\n\n```\n\nif resp_time > 2 sec\n alert!\nend\n\n```\n\nThat is easy."},{"_id":"4ebf56d40c63c51333000112","treeId":"5483ac8e1268987d4019293f","seq":1491195,"position":0.59375,"parentId":"4ea6ddce0721ceef8e0000f9","content":"### Dynamic Baseline\n\n![](http://cdn2.content.compendiumblog.com/uploads/user/458939f4-fe08-4dbc-b271-efca0f5a2682/1ce5308b-c6c1-4dde-9593-741c6cebc514/Image/b42bc3f824e94aff1441c938cb6da198/laney_p__chart_of_build_failures_by_stage.jpg)\n\nThis is better."},{"_id":"54aa489123db65801d8eaa69","treeId":"5483ac8e1268987d4019293f","seq":1491139,"position":0.625,"parentId":"4ea6ddce0721ceef8e0000f9","content":"### Break Detection\n\n![](https://s3.amazonaws.com/uploads.hipchat.com/65647/457333/SxM6DIpuCAUq21G/level_change.png)\n\nWe should detect this for users."},{"_id":"4ef1fd5fde838b3c0600011b","treeId":"5483ac8e1268987d4019293f","seq":1491242,"position":1.3125,"parentId":"4ea6ddce0721ceef8e0000f9","content":"### Anomaly Detection\n\n![](https://g.twimg.com/blog/blog/image/figure_localglobal_anomalies.png)\n\nSeasonal patterns, longer period underlying trend."},{"_id":"4ea913e19524482ffa000102","treeId":"5483ac8e1268987d4019293f","seq":1491251,"position":2,"parentId":"4ea6ddce0721ceef8e0000f9","content":"* Data pattern change, frequency change, new/unknown entries appear, etc\n* Dynamic baseline\n* [Breakout detection](https://github.com/twitter/BreakoutDetection)\n* [Anomaly detection](https://github.com/twitter/AnomalyDetection)\n\nSimple rules, advanced rules with states, really complex rules with lots of calculations ..."},{"_id":"4ea770770721ceef8e0000fc","treeId":"5483ac8e1268987d4019293f","seq":1464099,"position":3,"parentId":"4ea6ddce0721ceef8e0000f9","content":"### Notification\n\n![](http://coxy.me.uk/work/aaf/images/main06_l.jpg)\n\nIn time, even better before the incidents."},{"_id":"54aa489123db65801d8eaa6b","treeId":"5483ac8e1268987d4019293f","seq":1464101,"position":4,"parentId":"4ea6ddce0721ceef8e0000f9","content":"### True Detective\n\n![](https://s3.amazonaws.com/uploads.hipchat.com/65647/457333/hUiyReIeuZ5kvdO/sample%20trend%20comparison.png)"},{"_id":"4eb0b6b00c63c513330000ff","treeId":"5483ac8e1268987d4019293f","seq":1464102,"position":5,"parentId":"4ea6ddce0721ceef8e0000f9","content":"* Metrics\n* Logs\n* Events\n* Configurations\n\nWe need complete information."},{"_id":"4ea77670c0a6b40c0a0000f9","treeId":"5483ac8e1268987d4019293f","seq":1491276,"position":6,"parentId":"4ea6ddce0721ceef8e0000f9","content":"With the identified 
incident's time:\n* Pull out all kinds of data nearby:\n * All plotted on time series charts\n * Easy navigation, ex/inclusion, comparison, annotation\n"},{"_id":"4eb14e610c63c51333000102","treeId":"5483ac8e1268987d4019293f","seq":1464104,"position":7,"parentId":"4ea6ddce0721ceef8e0000f9","content":"### Faceting\n![](http://old.searchhub.org//wp-content/uploads/2012/08/CNET_faceted_search.jpg) ![](http://lucene.apache.org/solr/assets/images/quickstart-range-facet.png)\n\n* visual clue at a glance\n* click any to add filter(s)"},{"_id":"4eb16c010c63c51333000103","treeId":"5483ac8e1268987d4019293f","seq":1464105,"position":8,"parentId":"4ea6ddce0721ceef8e0000f9","content":"### Pivot Faceting\n\nmulti-level facet across separate fields\n\n![](http://tm.durusau.net/wp-content/uploads/2014/02/hierarchy-screen-216x300.png) ![](https://s3.amazonaws.com/uploads.hipchat.com/65647/457333/8mjtHKwDt6PNt3P/pivot%20faceting.png)\n\n* exception / java_class\n* url_path / resp_time"},{"_id":"4eb3819a0c63c51333000106","treeId":"5483ac8e1268987d4019293f","seq":1473277,"position":9,"parentId":"4ea6ddce0721ceef8e0000f9","content":"### Time Series Faceting Chart\n\nNormal with message volume over time:\n\n![](https://s3.amazonaws.com/uploads.hipchat.com/65647/457333/T4wVH75fTb60u3R/time%20series%20chart.png)\n\nDrag/drop a facet onto it becomes:\n\n![](https://s3.amazonaws.com/uploads.hipchat.com/65647/457333/WMpSjVodLRwk88E/time-series-facets.png)\n"},{"_id":"4eb10a430c63c51333000101","treeId":"5483ac8e1268987d4019293f","seq":1473307,"position":10,"parentId":"4ea6ddce0721ceef8e0000f9","content":"Learning\n\n* Once resolution identified, the causality is learned\n * for future alerts as hints\n * for refined alerts\n* Guidance and suggestion (common sense and learning)\n * Rules for analysis assistance"},{"_id":"4ea778eec0a6b40c0a0000fa","treeId":"5483ac8e1268987d4019293f","seq":1464112,"position":3.75,"parentId":"4ea834399524482ffa0000fb","content":"### Apache Access Log Analytics"},{"_id":"4ebf31fa0c63c5133300010e","treeId":"5483ac8e1268987d4019293f","seq":1464144,"position":0.5,"parentId":"4ea778eec0a6b40c0a0000fa","content":"![](https://s3.amazonaws.com/uploads.hipchat.com/65647/457333/zCFt9CH5Xc820LB/goaccess%20report.png)\n\n(from [goaccess](http://goaccess.io))"},{"_id":"4ea94cd79524482ffa000103","treeId":"5483ac8e1268987d4019293f","seq":1491280,"position":1,"parentId":"4ea778eec0a6b40c0a0000fa","content":"A few scenarios requires more than just search with simple aggregation/sort.\n\n1. Daily unique visitors (ip-date-agent)\n2. Top operating systems sorted by unique visitors\n\n(Others are mostly achieved using group-by.)"},{"_id":"4eb1bde70c63c51333000104","treeId":"5483ac8e1268987d4019293f","seq":1491281,"position":2,"parentId":"4ea778eec0a6b40c0a0000fa","content":"Daily unique visitors:\n\n```\n* | parse using public/apache/access \n | timeslice 1d \n | sum(size), count by src_ip, _timeslice, agent \n | sum(_sum) as size, count by _timeslice\n\nTime size count\n12-26-2014 00:00:00 62,277,020 15,070\n12-27-2014 00:00:00 447,010 100\n```\n\n(SumoLogic syntax)"},{"_id":"4eb1beed0c63c51333000105","treeId":"5483ac8e1268987d4019293f","seq":1491282,"position":3,"parentId":"4ea778eec0a6b40c0a0000fa","content":"Top operating systems sorted by unique visitors:\n\n```\n* | extract \"(?<src_ip>\\S+?) 
\n \\S+ \\S+ \\S+ \\S+ \\\"[A-Z]+ \\S+ \n HTTP/[\\d\\.]+\\\" \\S+ \\S+ \\S+ \n \\\"(?<agent>[^\\\"]+?)\\\"\" \n | if (agent matches \"*Win*\", \"Win\", \"0\") as OS \n | if (agent matches \"*Linux*\", \"Linux\", OS) as OS \n | where OS != \"0\" \n | timeslice 1d \n | count by OS, src_ip, _timeslice, agent \n | top 10 OS\n```\n\n(SumoLogic syntax)"},{"_id":"4eda2f0dde220cd2c3000118","treeId":"5483ac8e1268987d4019293f","seq":1480398,"position":5,"parentId":"4ea834399524482ffa0000fb","content":"### Maximo Log Analytics"},{"_id":"4eda3141de220cd2c3000119","treeId":"5483ac8e1268987d4019293f","seq":1491293,"position":1,"parentId":"4eda2f0dde220cd2c3000118","content":"* UI session > 150 - hosts having excessive UI sessions and timesliced\n\n```\n | timechart span=15m \n max(ui_sessions) as sessions by host\n | untable _time host sessions\n | search sessions > 150\n | timechart first(sessions) by host\n```"},{"_id":"4ef25136de838b3c0600011d","treeId":"5483ac8e1268987d4019293f","seq":1491299,"position":1.5,"parentId":"4eda2f0dde220cd2c3000118","content":"* memory available - host stats with sparkline, min, max, avg in table format\n\n```\n | stats \n sparkline(avg(eval(memory_available/1024/1024))) as memory_available, \n avg(eval(memory_available/1024/1024)) as avg, \n min(eval(memory_available/1024/1024)) as min, \n max(eval(memory_available/1024/1024)) as max \n by host\n```\n"},{"_id":"4ef256e2de838b3c0600011e","treeId":"5483ac8e1268987d4019293f","seq":1491300,"position":1.75,"parentId":"4eda2f0dde220cd2c3000118","content":"* stacked memory available and used - timesliced avg memory used vs total\n\n```\n | timechart span=15m \n avg(eval(memory_available/1024/1024)) as avg_available, \n avg(eval((memory_total - memory_available)/1024/1024)) as avg_used\n```"},{"_id":"4ef25714de838b3c0600011f","treeId":"5483ac8e1268987d4019293f","seq":1491301,"position":1.875,"parentId":"4eda2f0dde220cd2c3000118","content":"* slow query by app, object - sorted query stats (avg,min,max) grouped by (app,object)\n\n```\n | stats \n count, \n avg(sql_time) as avg_exec_time, \n max(sql_time) as max_exec_time, \n min(sql_time) as min_exec_time \n by app, object\n | sort 20 - count, - avg_exec_time \n```"},{"_id":"4eda4411de220cd2c300011a","treeId":"5483ac8e1268987d4019293f","seq":1491307,"position":2,"parentId":"4eda2f0dde220cd2c3000118","content":"* top mbo count - top N mbo stats with sparkline and avg\n\n```\n | stats \n sparkline(avg(mbo_count)), \n avg(mbo_set_count) as set_count, \n avg(mbo_count) as count \n by mbo\n | sort 20 - set_count\n```"},{"_id":"4ef25e06de838b3c06000120","treeId":"5483ac8e1268987d4019293f","seq":1491308,"position":3,"parentId":"4eda2f0dde220cd2c3000118","content":"* users and session - timesliced total user and session counts\n\n```\n | timechart span=15m \n first(users) as total_users, \n avg(ui_sessions) as per_jvm_sessions\n```"},{"_id":"4ef25e1cde838b3c06000121","treeId":"5483ac8e1268987d4019293f","seq":1491310,"position":4,"parentId":"4eda2f0dde220cd2c3000118","content":"* ui sessions' correlations - group by action first then group by session, time rolled up\n\n```\n | stats \n first(elapsed_time) as time \n first(user_id) as user \n first(app) as app \n first(session_id) as session_id \n by correlation_id\n | stats \n count(correlation_id) as count \n sum(time) as time \n first(user) as user \n first(app) as app \n by session_id\n | sort 10 - 
time\n```"},{"_id":"4eb38a6a0c63c51333000107","treeId":"5483ac8e1268987d4019293f","seq":1473349,"position":0.625,"parentId":"54aa489123db65801d8eaa67","content":"## Ingestion and Collectors\n\n* simple upload for quick try out\n* one for all sources\n* relay/forward\n* sending target\n* central management (ex. tagging)"},{"_id":"4eb38c470c63c51333000109","treeId":"5483ac8e1268987d4019293f","seq":1491311,"position":0.6875,"parentId":"54aa489123db65801d8eaa67","content":"## Advanced Query Functions"},{"_id":"4ebff0270c63c51333000115","treeId":"5483ac8e1268987d4019293f","seq":1464776,"position":0.25,"parentId":"4eb38c470c63c51333000109","content":"### sessionize / transaction\n\n![](https://service.sumologic.com/help/Resources/Images/Sessionize_layout_574x155.png)\n\n```\n* | transaction host cookie\n\nevent=1 host=a\nevent=2 host=a cookie=b\nevent=3 cookie=b\n```\n\nGroup (cross-source) events together (transitive or not) for further aggregation."},{"_id":"4ec062250c63c51333000117","treeId":"5483ac8e1268987d4019293f","seq":1491322,"position":0.375,"parentId":"4eb38c470c63c51333000109","content":"Maximo logs:\n\n```\nSessionId -> ActionId/time -> SqlId/time\n```\n\n* `n` lowest actions taking more than `t` sec.\n* For these actions, the slowest queries.\n * Queries longer than `t` sec\n * Queries taking > `p`% time\n"},{"_id":"4ebff0bd0c63c51333000116","treeId":"5483ac8e1268987d4019293f","seq":1473270,"position":0.5,"parentId":"4eb38c470c63c51333000109","content":"### timeslice / timechart\n\n* freely time bucketed aggregation\n * daily, hourly ... etc\n * by time or by buckets\n* trend analysis\n * user session count trend\n * memory usage (total/available)"},{"_id":"4ec0a4280c63c5133300011b","treeId":"5483ac8e1268987d4019293f","seq":1464781,"position":3,"parentId":"4eb38c470c63c51333000109","content":"### Result Manipulation\n\n* numeric, like bytes -> GB\n* string, like concat, substitute\n"},{"_id":"4ec08c860c63c5133300011a","treeId":"5483ac8e1268987d4019293f","seq":1464796,"position":4,"parentId":"4eb38c470c63c51333000109","content":"### Conditional\n\n* `if (<cond>, <true>, <false>) as <field>`\n* apply classification on-the-fly\n * ex. classify response time range\n* derived filed\n * ex. Browser/OS info out of agent field\n"},{"_id":"4ea908359524482ffa000101","treeId":"5483ac8e1268987d4019293f","seq":1450545,"position":0.75,"parentId":"54aa489123db65801d8eaa67","content":"## End"},{"_id":"4d8e3088b194a8ecc1000065","treeId":"5483ac8e1268987d4019293f","seq":1431559,"position":2,"parentId":null,"content":"# My Thought"},{"_id":"4e7f04eac4b24131b70000ed","treeId":"5483ac8e1268987d4019293f","seq":1432784,"position":0.5,"parentId":"4d8e3088b194a8ecc1000065","content":"## Log Management System in General\n\nI think \"*log*\" should be a generalized term, including at least:\n\n1. states: cpu, memory, disk, network, performance metrics ... etc.\n2. configs: system tunable, application settings ... etc.\n3. logs: system, application logs ... etc.\n4. events: patch applied, application new version deployed, external referral or promotion/campaign (causing traffic surge), security breach ... etc.\n\nTogether, these represent the \"status\" of the target systems so that users can understand whether their systems are healthy or not, and if not, where the problem is and how to resolve it.\n\nFrom the system's perspective, there's really no difference among these. 
They are all just \"text\" data, needed to be parsed/stored/indexed/aggregated and then fetched/searched/charted/analyzed."},{"_id":"4e824755c4b24131b70000ee","treeId":"5483ac8e1268987d4019293f","seq":1432790,"position":1,"parentId":"4e7f04eac4b24131b70000ed","content":"**User expectation**\n\nThere are 3 main goals:\n\n1. Dashboard: it shows what users care most.\n2. Alerts: it lets users know when and what they need to know.\n3. Diagnosis: it helps users identify the root cause of alerts and problems.\n\nI want to dig into 2 and 3 first. They are more likely the areas a system can differentiate from others."},{"_id":"4e8247c2c4b24131b70000ef","treeId":"5483ac8e1268987d4019293f","seq":1432882,"position":2,"parentId":"4e7f04eac4b24131b70000ed","content":"Conceptually, there are 2 steps in using such a system to resolve problems: 1) identify the problems, 2) resolve the problems.\n\nOf course, users can manually search for problems (based on his experience). But a superior system would \"magically\" identify potential problems to users via alerts. That's the selling point. We'll talk about that point later. Lets assume a potential problem has been identified, and we have the exact timestamp of when and what the problem is.\n\n* Pull out all other kinds of data at around the same time. Everything should be plotted on a time series chart.\n * Automatically Identify and highlight those with abnormal pattern, be it spike, valley, or any level out of normal range.\n * Allow easy in/exclusion of data for further analysis.\n * Allow switching among data series easily, as well as navigating a single data series back and forth in time.\n * Allow free trend overlapping.\n* Allow annotation on data.\n* A secondary alert could also be useful during this process, not necessarily the normal alerts for identifying anomaly. For example, database rows read per transaction over some value might not worth sending out notification, but it could be a hint in this manual analysis and worth pointing out.\n* Once identified, the correlation should be learned. "},{"_id":"4e826995c4b24131b70000f0","treeId":"5483ac8e1268987d4019293f","seq":1437172,"position":3,"parentId":"4e7f04eac4b24131b70000ed","content":"Now back to the abnormal detection. In addition to user alerts, there are also automatic detection:\n\n* [Anomaly detection](https://github.com/twitter/BreakoutDetection), basically detect any change of stable level. cpu level change, web page traffic sudden surge, no. of Java exception increases, etc.\n* dynamic baseline comparison\n* log pattern change. new/unknown log entries appear, log frequency change, etc.\n\n"},{"_id":"4e826b49c4b24131b70000f1","treeId":"5483ac8e1268987d4019293f","seq":1448386,"position":4,"parentId":"4e7f04eac4b24131b70000ed","content":"Then there's situation users would like to manually search to identify problems. We could do something to assist in this process:\n\n* faceting, not just normal count-based facet, but also:\n * multiple selection\n * multiple level (pivot)\n * numeric range facets for special field (cpu for example)\n* auto identify new fields, allow option to \"promote\" to add new fields\n * facet on these \"virtual\" fields\n* time line also shows log frequncy trend line\n * while typically the main time series chart shows the volume of event counts over time. It would be nice to allow \"dragging\" a facet onto the chart to draw faceting over time. need back-end support for sure. 
I doubt how the default chart was queried, should be what we need.\n * different line for different log level\n * different line for different log pattern, could categorize log entry (exception, system state, etc)\n* similarly, the 2nd ruleset can be used here. think of it as \"best practice\" suggestion.\n"},{"_id":"4e63ce89daab58e280000111","treeId":"5483ac8e1268987d4019293f","seq":1419550,"position":1,"parentId":"4d8e3088b194a8ecc1000065","content":"## Ingestion and Collectors\n\nIn general, there are two different approaches:\n\n1. Centrally managed collectors: Splunk and SumoLogic are examples.\n2. Decentralized collectors: Loggly, Papertrail, Flume, Fluentd, Logstash, etc.\n\nCentrally managed collectors are usually proprietary, like SumoLogic's (MII as well). SumoLogic's central collector (source) management is able to change remote collectors' config: target file(s) location, filters, tags, etc. The up side is, well, centralized management. You manage all collectors in one place (on UI). Don't get it wrong, you still need to install collector on each host, just the configuration is done on the server.\n\nThe 2nd approach configures individual collector separately on the host it is installed. These vendors usually don't implement their own collector but instead relying on commonly seen collectors like rsyslog/Flume/Fluentd/Logstash. All they need to do is support one common endpoint (syslog is the most common protocol) or create their own plugin for those collectors.\n\nFlume/Fluentd/Logstash all have more features (strong plugin eco-system) than SumoLogic's. Though the convenience of managing sets of collectors centrally is really an advantage. I'd argue that the ideal design is to be able to use those common collectors while still able to centrally manage them (to some extent). I don't think these two approaches are mutually exclusive."},{"_id":"4e65db92daab58e28000011b","treeId":"5483ac8e1268987d4019293f","seq":1419653,"position":0.5,"parentId":"4e63ce89daab58e280000111","content":"**Arguments against Custom Collector and Central Management**\n\nEven though central management has its up side, there might be some concern:\n\n1. Many users probably already have some collector installed and running (or are familiar with). It's better not asking them to install a new kind. Instead, we support those directly.\n2. Many enterprise users have automated configuration management system like Puppet/Chef/Ansible. For these users, stick to those is more reasonable, both for security concern (let some components in their environment managed by outsider) and familiarity (they know and are familiar with how to management their environment with tools of their choice).\n\nI think these are more likely the concerns for large shop."},{"_id":"4e64e3b2daab58e280000118","treeId":"5483ac8e1268987d4019293f","seq":1419736,"position":1,"parentId":"4e63ce89daab58e280000111","content":"### Design\n\n* For quick self starter, provide on-page one-time uploading.\n* On server receiving end, support syslog (this covers rsyslog/Flume/Fluentd/Logstash, among others) and http/s (this covers app client side sending)\n * Provide config generation support on UI for major collectors. Obviously, documentation need to be great. Whenever possible, one-line shell script config should be provided. When not feasible (like firewalled), allow simple config file download.\n * Provide a leanest and pre-packaged collector (just for ex. 
say Fluentd) for users who don't already have/use any collector.\n* Provide SDK for direct sending from application, at least for Java, Javascript (Python, Ruby, and Go would be next, generally the more languages the better).\n* Parsing/processing does not occur on collector side, but on server side.\n * It's better done after Kafka so that it's possible to re-processing without resending to correct human error.\n * Auto detection of source data, whenever possible.\n * Users should be able to tag source(s) or source groups, manageable from UI. Tags could act as hints for (easier or more efficient) later processing. \n * This layer also supports filtering on the server side (also manageable from UI).\n\nBTW, *also need to consider receiving metrics protocol used by most commonly used time-series metrics system.* Ex. Graphite/StatsD. We want more than logs."},{"_id":"4e66ec6ddaab58e28000011d","treeId":"5483ac8e1268987d4019293f","seq":1431140,"position":2,"parentId":"4e63ce89daab58e280000111","content":"### Further Thought\n\nThis is not an elegant design, too \"hacky\". There are too many \"state\" to maintain, inconsistent between client and server side. It might be worthwhile to just enhance an open source collector to allow server controlled configuration. Users can simply add our plugin (trivial thing to Logstash and Fluentd). The critical piece is to see if existent plugin framework allows configuration reloading."},{"_id":"4e63cf9cdaab58e280000112","treeId":"5483ac8e1268987d4019293f","seq":1420171,"position":2,"parentId":"4d8e3088b194a8ecc1000065","content":"## Query Language\n\nThere are generally 2 types of log query implementation: Lucene based and Splunk/SumoLogic style extension.\n\nThe difference is Lucene based query only has, well, Lucene search functions. That basically means you can only search based on fields present in the Lucene index. Generally there's no aggregation or analytics support (in the query). This category includes Solr, ElasticSearch, Loggly (among others) ... You do have limited aggregation and also faceting, which are supported by all 3 products mentioned above, but not directly using the query. Mostly aggregation and faceting are through UI operation. You search something first then aggregating on the resultset, or faceting on the side which indirectly suggest possible filters to use next. This category generally requires UI operation in order to do meaningful aggregation or analytics. There's a feel of \"multi-step\" in order to complete a task.\n\nSplunk and Sumo Logic are in the second category. The \"query\" language supports aggregation and much more directly. You can still use UI to do things but in the end they got translated to a \"query\". Also there's \"pipe\" so you get to apply one after another operation which is very convenient and powerful. Another strong point is you can create new, \"virtual\" fields on the fly. For example, browser agent field in Apache access logs contains multiple information. With this you can \"extract\" the OS part only without actually having an OS field in the index (which would need schema change and re-index from Lucene's perspective).\n\nFrom the indexing perspective, the 1st feels more static while the 2nd more dynamic. Using the 1st, you have to re-index and re-processing if you need a new field. Using the 2nd, just add/change the query. But of course, the 2nd must suffer from some runtime overhead, which is the trade off. 
We could also argue that the 2nd require the more advanced knowledge of the query language.\n\nI do not believe these are mutually exclusive. It would be great to combine the good of both. In fact, Sumo Logic does mention it and plan to add finer control on specifying static field.\n"},{"_id":"4e674122daab58e28000011f","treeId":"5483ac8e1268987d4019293f","seq":1420893,"position":0.5,"parentId":"4e63cf9cdaab58e280000112","content":"**Limitation of Lucene based query language**\n\nSimply looking for something is not a problem. The problem is with correlating different records in the result set (that is, aggregation and analytics).\n\nTake the experiment scenarios on Apache access log. Loggly (mainly Lucene search only) can only \"partially\" achieve 6 out of 9 scenarios mostly because it can only aggregate and sort on one dimension. All 6 achievable scenarios by Loggly are something like \"top xxx sorted by yyy...\". That translates to search by \"xxx\" then order by \"yyy\" (order by is through the UI, not search language). Even with 6, the result cannot include fields other than \"xxx\" (so it's only for charting purpose, like pie chart). The needed UI operation is not yet available on Loggly.\n\nThe other 3 scenarios are all using so called \"unique visitors\", which is defined by the combination of 3 fields: date, ip, and agent. Here, Lucene/Solr does not support virtual or derived field hence unless we change the schema to add 3 additional fields and re-index we cannot search on \"unique visitors\".\n\nSimilarly, 2 of the 3 scenarios also use \"partial\" field info from field \"agent\" which contains both OS and browser info. If we want to aggregate on OS, we cannot unless we add a new field to contain only the OS part of the original \"agent\" field.\n\nI think since this is not provided by Lucene search engine, it should be the upper layer's responsibility, either Solr or the application. Splunk/SumoLogic query language is one way to support that (on top of search result set), while it is quite possible to achieve the same thing from UI operation. It is just that Loggly has not yet reached that level.\n\nI would say using query language is much more flexible and elegant from implementation perspective. Combined with \"pipe\" or stream concept, it is very powerful."},{"_id":"4e673bdddaab58e28000011e","treeId":"5483ac8e1268987d4019293f","seq":1430689,"position":1,"parentId":"4e63cf9cdaab58e280000112","content":"### Design\n\n* Use Lucene/Solr search language as the base.\n* On top of that, provide aggregation and pipe.\n * Everything can be expressed with a query. Charts, tables, etc.\n * Map the query to Storm topology. The very first part search result would feed the input to the spout. From there, we have all possibility.\n * This also serves alert."},{"_id":"4e6ed50d276263fc430000e5","treeId":"5483ac8e1268987d4019293f","seq":1430550,"position":2.25,"parentId":"4d8e3088b194a8ecc1000065","content":"## Alerts\n\nAll products I've tried using saved search for creating alerts. When creating an alert, you specify the criteria of matching records with a query as a saved search, as well as how often this should be checked. The highest frequency of running alert searches is 1 minutes (of all products I've tried), probably out of performance consideration. Usually there's an additional control on how many matches within a specific time window would be treated as an alert. 
For example, only more than 5 matches (of the alert's saved search query) in the past 60 minutes would trigger an alert.\n\nSumoLogic does support real time alert based on saved search (\"*human real time*\" as it is explained, in the range of 8-12 seconds), though in real time mode it can only check up to the latest 15 minutes time window. Also some search language operator is not allowed in real time mode.\n\nUsing saved search for alert provides the most flexibility, as whatever you can do with search is available to alerting. From implementation perspective, you don't need to create a separate logic for alert evaluation which is nice. However, performance is a problem if the alert checking runs too often."},{"_id":"4e7c3a38c4b24131b70000ea","treeId":"5483ac8e1268987d4019293f","seq":1430534,"position":0.5,"parentId":"4e6ed50d276263fc430000e5","content":"**QRadar**\n\nFrom my understanding, QRadar does it in the reverse way. Instead of running alerts periodically, it runs all alert checking on each event received. Since events must must go through CRE before persisted into Ariel, there must be some latency added to ingestion process.\n\nOn the other hand, running saved search does not impact ingestion. The only contention is if the query (read) does impact the write to the underlying data store, and how much.\n\nIf using search based alerts, we can't afford to do it in QRadar's way, that is, running all alert queries on every single event received (essentially, QRadar is not using search to implement alert).\n\nHowever, QRadar's alerting does allow a few interesting and powerful concepts, which cannot be covered by search based approach (at least not directly):\n\n1. Rule can depend on other rule, that is, recursive.\n * when an event matches **any|all** of **the following rules**\n * when **these rules** match at least **this many** times in **this many minutes** after **these rules** match\n"},{"_id":"4e7c3522c4b24131b70000e9","treeId":"5483ac8e1268987d4019293f","seq":1431167,"position":1,"parentId":"4e6ed50d276263fc430000e5","content":"### Design\n\nSearch based is the way to go:\n\n* It should be more performant (resource-wise) than all-rule-per-event evaluation.\n* Search is implemented by Storm topology, which also is a perfect fit for alerts. Stateful stream processing can make running total or count based rule evaluation.\n* Recursive alerts could be provided by Storm topology as well, with branching (possible to another topology).\n* Periodic checking is supported by bolt system events. Real time checking is, well, native Storm.\n\nIn addition to user defined alerts, there is another category of anomaly detection alerts. Though users should be able to turn on/off these alerts, even possibly tweak the algorithm used for their needs. This category of alerts should by default turned on.\n\n"},{"_id":"4e729882c4b24131b70000e6","treeId":"5483ac8e1268987d4019293f","seq":1431243,"position":2.375,"parentId":"4d8e3088b194a8ecc1000065","content":"## Pipeline\n\nOne question needs to be answered is:\n\n> How to dynamically add fields into index? Assuming we are going to use Lucene based solution (Solr or QRadar).\n\nDynamically change and reload schema definition does seem feasible (without restart of course). This needs further investigation depends on what we use.\n\nThe other one, re-indexing, is easily done with Kafka. 
The ultimate source of truth should be on HDFS, which means we'll need to first \"normalize\" input data from Kafka and persist into HDFS, then pour back to Kafka in another topic. When replaying, we simply read from HDFS and re-pouring back to Kafka. This could be done in Storm, with separate bolts for Kafka and HDFS persistence."},{"_id":"4e73377cc4b24131b70000e7","treeId":"5483ac8e1268987d4019293f","seq":1429520,"position":1,"parentId":"4e729882c4b24131b70000e6","content":"### Kafka\n\nThe first and the entry point of our pipeline should be Kafka. For a few reasons:\n\n1. Decouple the ingestion and the processing.\n * This provides a desired property for the system that no matter what happens to the pipeline processing, the collection never got interrupted. Users sending never gets timed out because of the pipeline is full.\n * One of the requirement of a log/event management system is that no log/event is lost, otherwise we might loose critical, even just tiny, piece of data. Kafka persists to disk to it provides the necessary guarantee.\n * We can easily \"replay\" incoming data stream when needed. This simplifies the system in many way, for example when there's a need of re-indexing (for Lucene) because of new fields added. Another example would be to correct human error in the pipeline.\n2. Decouple internal systems.\n * All systems, be it APM, SCALA, QRadar, Analytics, UI, or even \"Storage\", communicate through Kafka. No unnecessary inter dependency among systems. For example, UI could subscribe to \"alert\" queues while all systems publish to those queues.\n * It also servers as a kind of buffer/queue so a system does not get blocked by others.\n3. Serve as the single source of the truth\n * All systems get the same data input stream, in the same order. There's no synchronization needed between the systems receiving the input.\n * Hadoop and Storm could both subscribe to the same queues for receiving input data.\n\nNote that Kafka is not actually the entry point of the whole system as it currently does not have security. There is another light layer in front of Kafka responsible for receiving from collectors and publishing to Kafka queues. This light layer could simply be Logstash or Fluentd behind DNS and/or load balancer."},{"_id":"4e7df5b2c4b24131b70000eb","treeId":"5483ac8e1268987d4019293f","seq":1431406,"position":1.5,"parentId":"4e729882c4b24131b70000e6","content":"### HDFS\n\nThis is where the source of truth stored. It is immutable and normalized. Each atomic piece of data should:\n\n* be as raw as possible, to allow maximum possibility\n* has an uid, for desired idempotent operation\n* timestamped\n\n**Enforceable and Evolvable Schema**\n\n[Thrift](http://thrift.apache.org) or [Protocol Buffer](https://developers.google.com/protocol-buffers/)"},{"_id":"4e73390ac4b24131b70000e8","treeId":"5483ac8e1268987d4019293f","seq":1431161,"position":2,"parentId":"4e729882c4b24131b70000e6","content":"### Storm"},{"_id":"4e63d1addaab58e280000114","treeId":"5483ac8e1268987d4019293f","seq":1448379,"position":2.75,"parentId":"4d8e3088b194a8ecc1000065","content":"## Dashboard\n\n* First, [Banana](https://github.com/LucidWorks/banana) is really slow. 
If Kibana is the same, I'm not going with it.\n* Should support Kibana json layout, or at least allow conversion.\n\n"},{"_id":"4e7ec733c4b24131b70000ec","treeId":"5483ac8e1268987d4019293f","seq":1431423,"position":2.875,"parentId":"4d8e3088b194a8ecc1000065","content":"## Security"},{"_id":"4e63d129daab58e280000113","treeId":"5483ac8e1268987d4019293f","seq":1417677,"position":3,"parentId":"4d8e3088b194a8ecc1000065","content":"## When you have diversity of event sources, how do you search?\n\nObviously, we search for a problem or root cause of a problem. But upon a \"hit\", here should be a way to pull in all related info. One possible dimension is time. Another might be call trace."},{"_id":"4d8e3280b194a8ecc1000066","treeId":"5483ac8e1268987d4019293f","seq":1515280,"position":3,"parentId":null,"content":"# Architecture\n"},{"_id":"4f20955d5f664ac137000121","treeId":"5483ac8e1268987d4019293f","seq":1515279,"position":0.265625,"parentId":"4d8e3280b194a8ecc1000066","content":"![](https://www.lucidchart.com/publicSegments/view/54b657ef-6ddc-492a-9cad-03760a00de91/image.png)"},{"_id":"4d8da07fb194a8ecc1000063","treeId":"5483ac8e1268987d4019293f","seq":1426377,"position":0.53125,"parentId":"4d8e3280b194a8ecc1000066","content":"## QRadar"},{"_id":"4e49c5c43c570178d40000ce","treeId":"5483ac8e1268987d4019293f","seq":1498837,"position":0.5,"parentId":"4d8da07fb194a8ecc1000063","content":"### My Questions\n\n* pipeline fault tolerant? Kafka?\n* load balance among nodes\n* plug new computation into the pipeline"},{"_id":"4e4a146e3c570178d40000d0","treeId":"5483ac8e1268987d4019293f","seq":1463460,"position":1,"parentId":"4e49c5c43c570178d40000ce","content":"Looking at the architecture, QRadar is composed of one Console and multiple managed hosts (in a distributed way). It is not clear (to me) how the management of this distributed system is done.\n\n* How to add a new managed host (or remove one)? Is it dynamic? ***There's actually one more kind, data node, which is per managed host scaling extension. You can freely add managed and data nodes. Data stored in a managed node stays there forever, as there's no automatic balancing among managed hosts. However, data nodes of a managed host do have rebalancing. Data nodes essentially shard data as well as balance the search workload.***\n* How does the load balance among managed hosts? Is it static for a specific data source? ***Currently there is no balancing among managed hosts.***\n * somewhere I read load balancer is supported, \"*syslog based data sources (or anything \"pushed\" to qradar), we now process those log sources on ANY event processor*\")\n * does that mean it doesn't matter for an event to be stored on which node as the query is distributed to all nodes? or is there's some mechanism to 'forward' an event to the right place?\n* How does Tomcat/UI scale? Are multiple Console nodes allowed? ***Single UI currently, multiple not allowed.***\n* For Ariel, I'll assume Ariel is not distributed database. So no sharding etc. In that case, data collected at a host is stored at the same place and queried by the same host, correct? ***Correct, static.***\n* Multi-tenant support?"},{"_id":"4e4a14d93c570178d40000d1","treeId":"5483ac8e1268987d4019293f","seq":1399474,"position":2,"parentId":"4e49c5c43c570178d40000ce","content":"Source input uses in-memory queues for buffering. Understand events are dropped once queues are full (though it is not really desirable). What about fault-tolerant? 
If a host is down, are all events are lost?\n\nWhat is the typical (or expected) \"turn-around\" time after an event is sent to Qradar and it is searchable on the UI? "},{"_id":"4e4a6b983c570178d40000d3","treeId":"5483ac8e1268987d4019293f","seq":1399486,"position":2.5,"parentId":"4e49c5c43c570178d40000ce","content":"Ariel schema seems to be static (normalized fields as they are called). If users want to parse their own log source, they need to create Log Source Extension (LSX). It seems only those normalized fields are usable in AQL.\n\n* Schema-less support? Is adding/changing fields allowed? If not, say in my log I have something like \"... elapsed_time=103 ...\", would it be possible to query by \"elapsed_time\" (as a field)?\n* Dynamic (or virtual) fields?"},{"_id":"4e4aade13c570178d40000d5","treeId":"5483ac8e1268987d4019293f","seq":1463461,"position":2.625,"parentId":"4e49c5c43c570178d40000ce","content":"How is free-text search's Lucene index building fit into the pipeline?\n\n***It is only used for full text search, along with the Ariel query. Full text search filter first and then feed to Ariel search.***\n\n* Does it run in parallel to the pipeline? on all nodes?\n * Is free-text search distributed?\n* Exactly which content/field is being indexed?\n * Does it allow specifying custom field (in Lucene index)? or is it only full-text search on the single entity \"*payload*\"?\n* Is it still based on Lucene 3.x?\n * No faceting (even simple faceting) is a major problem."},{"_id":"4e4a777f3c570178d40000d4","treeId":"5483ac8e1268987d4019293f","seq":1399509,"position":2.75,"parentId":"4e49c5c43c570178d40000ce","content":"The pipeline ...\n\n* Is the whole pipeline synchronous or asynchronous?\n * seems asynchronous as there are queues in some components (parser, CRE, forward, etc ...)\n * then what is the internal queues used by the pipeline? scalability and fault-tolerance characteristics? (seems in-memory queues)\n * How to scale the pipeline?"},{"_id":"4e4aaf1a3c570178d40000d6","treeId":"5483ac8e1268987d4019293f","seq":1463462,"position":4,"parentId":"4e49c5c43c570178d40000ce","content":"Rules are defined with pre-defined tests, specifically for security domain. Some example pre-defined test:\n\n* *when the local network is **one of the following networks***\n* *when the **destination** network is **one of the following networks***\n* *when an event matches **any|all** of the following **rules***\n * dependency/relationship among rules\n* *when the event(s) have not been detected by one or more of **these log sources** for **this many** seconds*\n * timed window\n* *when **these rules** match at least **this many** times in **this many minutes** after **these rules** match*\n * counter combined with timed window\n\nSo ...\n\n* How to add new test? ***You cannot.***\n* Where is event streaming used for? 
for alerts?"},{"_id":"4d9ee778baecc671e6000080","treeId":"5483ac8e1268987d4019293f","seq":1426424,"position":1,"parentId":"4d8da07fb194a8ecc1000063","content":"### [High Level Architecture](https://q1wiki.canlab.ibm.com/display/apollo701/High+Level+QRadar+Architecture#)\n\n![](https://s3.amazonaws.com/uploads.hipchat.com/65647/457333/BxSiiPNdu9aXIZ7/QRadar%20Architecture.png)"},{"_id":"4e2ffc90de41460aab0000b8","treeId":"5483ac8e1268987d4019293f","seq":1387580,"position":0.25,"parentId":"4d9ee778baecc671e6000080","content":"#### Event Correlation Service (ECS) - even/flow pipeline\n * Event Collector\n * Receive events (Protocol sources)\n * Parse and normalize events (DSMs)\n * Coalesce events\n * Optionally forward events\n * Log Source Traffic Analysis & Autodetection\n * Event Processor\n * Correlate events (Custom Rules Engine (CRE))\n * Store events (in our proprietary Ariel database)\n * Forward events matching rules within the CRE to our Magistrate component\n * Magistrate (MPC) (Console only)\n * Associate matching events with offenses"},{"_id":"4e300937dd8ca0c1f40000be","treeId":"5483ac8e1268987d4019293f","seq":1426426,"position":0.75,"parentId":"4e2ffc90de41460aab0000b8","content":"##### Event Pipeline\n\nReceiving, processing (normalize & coalesce), running it against Rules and then storing the data to disk,\n\n![](https://s3.amazonaws.com/uploads.hipchat.com/65647/457333/tvuFcF3YSRe13yz/event%20pipeline.png)"},{"_id":"4da166e9baecc671e6000083","treeId":"5483ac8e1268987d4019293f","seq":1387806,"position":0.875,"parentId":"4e2ffc90de41460aab0000b8","content":"##### Event Collector\n\nThere are two separate paths for events and flows within the event collector. It is responsible for normalizing and parsing the received events.\n\nThere are multiple in-memory \"input\" queues (for different types of souce) that house incoming events. queues are based on \"event counts\" rather than memory usage, and as such, are limited to these set numbers to ensure we don't consume all available memory. Once those queues fill, if more new events come in, they are dropped."},{"_id":"4e30fc3d80b34943060000cb","treeId":"5483ac8e1268987d4019293f","seq":1426427,"position":0.5,"parentId":"4da166e9baecc671e6000083","content":"![](https://s3.amazonaws.com/uploads.hipchat.com/65647/457333/BSFHMV0L2rR6KHX/EventSideOfCollector.JPG)"},{"_id":"4d9f64e1baecc671e6000082","treeId":"5483ac8e1268987d4019293f","seq":1387634,"position":1,"parentId":"4da166e9baecc671e6000083","content":"###### Events Source\n\n* Push: syslog (tcp/udp), SNMP\n* Pull: files (sftp/ftp/scp), jdbc, WMI (windows domain controller service), https for applications such as cisco ips, and checkpoint where qradar uses a custom application (leapipe2syslog) to connect, query and send the data back into qradar"},{"_id":"4e30228980b34943060000c2","treeId":"5483ac8e1268987d4019293f","seq":1387635,"position":3,"parentId":"4da166e9baecc671e6000083","content":"###### Normalization\n\nExtracting properties: event id, source ip, source port, dest ip, dest port, protocol, username, pre-nat ip/port, post-nat ip/port, ipv6 source/dest ip, source/dest mac, machine name."},{"_id":"4e301b3edd8ca0c1f40000c0","treeId":"5483ac8e1268987d4019293f","seq":1387648,"position":4,"parentId":"4da166e9baecc671e6000083","content":"###### Coalesce\n\nIf events repeat, then the events coalesce based on a multipart key, which uses source ip, dest ip, dest port, username & event id. 
Once 4 events come in that match the same key, the 4th event is coalesced with any additional events that match that same key, for the next 10 seconds. Once 10 seconds have gone by, the counter is reset and coalescing for that \"key\" restarts."},{"_id":"4e30256c80b34943060000c3","treeId":"5483ac8e1268987d4019293f","seq":1387645,"position":5,"parentId":"4da166e9baecc671e6000083","content":"###### Traffic Analysis\n\nAutodetects log sources."},{"_id":"4e30266680b34943060000c4","treeId":"5483ac8e1268987d4019293f","seq":1387809,"position":6,"parentId":"4da166e9baecc671e6000083","content":"###### Offsite Target\n\nForwards processed, parsed events to another QRadar deployment. This is typically used in \"Disaster Recovery\" (DR) deployments, where customers want to have a second console/installation that has a backup copy of production data."},{"_id":"4e30a11b80b34943060000c7","treeId":"5483ac8e1268987d4019293f","seq":1387829,"position":7,"parentId":"4da166e9baecc671e6000083","content":"###### Selective Forwarding\n\nConfigured in the Admin tab under the item \"Routing Rules\". Options in the routing rules allow a user to filter/drop/modify events that are going through the pipeline. \n\nThis happens AFTER licensing, occurring between the EC and EP processes, so events that, say, are marked to be dropped STILL count against your EPS license.\n\nThe options in the \"routing options\" include:\n- forward: to a destination other than the EP\n- drop: completely drop the event from the pipeline, so that it is NOT stored to disk, forwarded, etc\n- bypass correlation: this puts a tag on the event itself (and is written to disk w/ the ariel record) to NOT process the event in the CRE. Not even the false positive rule would match the event.\n"},{"_id":"4da17767baecc671e6000084","treeId":"5483ac8e1268987d4019293f","seq":1387686,"position":1,"parentId":"4e2ffc90de41460aab0000b8","content":"##### Event Processor\n\nThe Event Processor houses the Custom Rule Engine (CRE) and the storage filters for Flows and Events.\n"},{"_id":"4e4947973c570178d40000cb","treeId":"5483ac8e1268987d4019293f","seq":1399017,"position":0.5,"parentId":"4da17767baecc671e6000084","content":"###### Sources\n\n* Local EC\n* Local CRE (events created by rules)\n* External EC\n* External EP\n * Processing by Magistrate to create or contribute to an offense; or\n * Centralized processing on the CRE because the event has been deferred by the Managed Host; or\n * Both of the above."},{"_id":"4e2ffc4dde41460aab0000b7","treeId":"5483ac8e1268987d4019293f","seq":1399529,"position":1,"parentId":"4da17767baecc671e6000084","content":"###### Custom Rule Engine\n\nThe CRE runs on each event processor and monitors data as it comes through the pipeline. There are multiple types of rules: tests that check properties of single events, and functions that can track multiple events over a period of time or another \"counter\"-based parameter. \n\nThe CRE is an asynchronous filter which performs the processing of events and flows on all the rules currently active in the system. Some of the activities include:\n\n* Marking events with full or partial matches on tested rules; and\n* Identifying events with tags for Magistrate or Central Processing; and\n* Delegating responses to rules (SNMP, Reference Set, Notifications ... 
etc); and\n* Emitting new CRE events when specified by a given rule."},{"_id":"4e4ae3823c570178d40000d7","treeId":"5483ac8e1268987d4019293f","seq":1399535,"position":0.25,"parentId":"4e2ffc4dde41460aab0000b7","content":"This is the heart of our correlation abilities. The UI component allows a rich set of real-language based tests such as \"and when the Source IP is one of ....\" or \"and when the payload matches the following regular expression\". Customers configure these rules to target specific incidents that they may be interested in. The output of the CRE is typically an offense, an e-mail, a system notification, an entry into a reference set, forwarding the event on somewhere else, etc..."},{"_id":"4e30c9a280b34943060000ca","treeId":"5483ac8e1268987d4019293f","seq":1387887,"position":0.5,"parentId":"4e2ffc4dde41460aab0000b7","content":"Rules are defined with a test preset, specifically for the security domain. Rules have multiple variables for users to fill. A rule can have multiple tests, with an or/and relationship.\n\nOne kind of test could be when a specific set of \"rules\" has matched ...\n\nRule action can be:\n* change event severity, etc\n* annotate event\n* drop event\n\nRule response can be:\n* create a new event\n* email\n* syslog\n* forward\n* notify\n* trigger scan\n\nThere could also be a limiter of no more than N per X (time) per unit (rule/source/etc)."},{"_id":"4e30b14d80b34943060000c8","treeId":"5483ac8e1268987d4019293f","seq":1399065,"position":1,"parentId":"4e2ffc4dde41460aab0000b7","content":"**Global Rule**\n\nFor rules that have counters or functions and are marked as \"test globally\", if they have partial matches, those \"partial matches\" are sent up to the EP on the console for counting.\n\nFor example, if your rule says look for windows events and 5 login failures from the same ip and same username, and it's marked test globally, each login failure for windows events is sent to the EP on the console from the EP that sees them. If there were 3 failures on one EP, then 2 on another, the central EP will get notification of all 5 and fire the rule.\n\nWhen you have a rule that's marked to test globally, it only really triggers if it's a rule with a function. A rule that only tests a single event, even for multiple properties, is automatically NOT tested globally to avoid unnecessary load on the event pipeline."},{"_id":"4e497fb63c570178d40000cc","treeId":"5483ac8e1268987d4019293f","seq":1399068,"position":2,"parentId":"4e2ffc4dde41460aab0000b7","content":"For example, if you had a rule looking for an event with user = x and source ip = y and event type is \"login failed\", that will automatically NOT run a global test, because it's only testing 3 properties of a single event. However, if you have a rule marked test globally, and you have a function looking for a count of events over a period of time, that's called a partial match. Partial matches across multiple event processors are why we need the concept of \"global rules\". \n\nWhat happens then is that the event processor with the partial match will send that event \"reference\" to the central event processor on the console. 
Then, if another EP in another location also finds an event that matches, it also sends it to the console for global testing, and we can then do correlation across multiple event processors."},{"_id":"4e4981dd3c570178d40000cd","treeId":"5483ac8e1268987d4019293f","seq":1399069,"position":3,"parentId":"4e2ffc4dde41460aab0000b7","content":"Global tests are turned -off- by default currently, since the load on the central EP can be substantial if a lot of rules are tested globally.\n\nWe must store duplicates in GCC mode for Magistrate to be able to properly handle pending offenses and partial matches of rules along the way. Essentially what happens is:\n\n1. Local CRE on an EP runs through the list of tests. If the event/flow matches a rule tagged as global, that event/flow gets sent to the console (central CRE) for processing.\n2. Local CRE on the EP continues to run through the rest of the tests and stores the results locally, since it was the EP that received the event.\n3. Central CRE on the console receives the event and runs it through the list of global rules and tags the event with any matches it finds. This gets stored with the duplicate flag. The custom rules matched/partially matched will only be the ones from the global rules.\n\nSearches WILL show both events because they are both there. \n\nMagistrate can now create offenses and run pending offense queries to backfill offenses that had lead-up global events."},{"_id":"4e30295b80b34943060000c5","treeId":"5483ac8e1268987d4019293f","seq":1387754,"position":1.5,"parentId":"4da17767baecc671e6000084","content":"###### Event Streaming\n\nThe event streaming code bypasses normal Ariel searching and uses a connection from the \"ecs\" service directly to Tomcat."},{"_id":"4e30bbcf80b34943060000c9","treeId":"5483ac8e1268987d4019293f","seq":1399019,"position":2,"parentId":"4e2ffc90de41460aab0000b8","content":"##### Magistrate\n\nIt is responsible for creating offenses and tabulating all the information associated with those offenses."},{"_id":"4e2ee5260e149fb7930000ea","treeId":"5483ac8e1268987d4019293f","seq":1387586,"position":0.3125,"parentId":"4d9ee778baecc671e6000080","content":"#### Ariel and AQL\n\n * Time-series DB\n * Proxy (console only) is responsible for issuing distributed search (and caching)\n\nIt is specifically for events/flows (at least originally). Not sure how suitable it is for arbitrary data. It comes down to whether it supports dynamic schema. 
Say ingestion has a new field, can this field be created in Ariel schema?\n\nOf course, there are only 2 \"tables\" (events and flows) so naturally AQL does not have join.\n\nThe point of a custom query language (like Splunk/Sumo-Logic) over Lucene/Solr query language is to be able to conveniently \"stream\" operation, manipulate field(s), aggregate, etc.\n"},{"_id":"4e2f059b0e149fb7930000eb","treeId":"5483ac8e1268987d4019293f","seq":1387755,"position":1,"parentId":"4e2ee5260e149fb7930000ea","content":"##### Internals\n\n* https://q1wiki.canlab.ibm.com/display/core/Ariel+Query+Server+internals\n\nQuery server is in Java, queuing system, tasks/subtasks based.\n\n* https://q1wiki.canlab.ibm.com/display/core/Ariel+Proxy+Server+and+aggregations\n\n* https://q1wiki.canlab.ibm.com/display/core/Storage\n\nAppend-only, immutable data files, with time interval based."},{"_id":"4e2ff853de41460aab0000b6","treeId":"5483ac8e1268987d4019293f","seq":1387758,"position":0.421875,"parentId":"4d9ee778baecc671e6000080","content":"#### Distributed PostgreSQL\n\nMaster in console, slaves (read-only) in all managed hosts\n\n * Much of QRadar's configuration and reference data comes from a local postgres database.\n * Proprietary database replication scheme - periodically packages up changes from \"interesting\" tables on the console and serves them up to MHs that will periodically hit a webservice in tomcat to get the latest DB deltas."},{"_id":"4e304e0a80b34943060000c6","treeId":"5483ac8e1268987d4019293f","seq":1387763,"position":0.4296875,"parentId":"4d9ee778baecc671e6000080","content":"#### Hostcontext\n\nnode process manager\n\nRuns on each appliance in a deployment\n* Runs the \"ProcessManager\" component that is responsible for starting, stopping and verifying status for each component within the deployment.\n* Is responsible for the packaging (console) and the download/apply (MH) of our DB replication bundles.\n* Responsible for requesting, downloading, unpacking, and notifying other components within an appliance of updated configuration files.\n* Responsible for monitoring postgresql transactions and restarting any process that exceeds the pre-determined time limit. This portion is referred to as \"TxSentry\".\n* Responsible for disk maintenance routines for disk cleanup."},{"_id":"4e2ffda4de41460aab0000ba","treeId":"5483ac8e1268987d4019293f","seq":1387591,"position":0.4375,"parentId":"4d9ee778baecc671e6000080","content":"#### Accumulator\n\npre-populate by-minute, by-hour, and by-day counts of data for efficient graphing"},{"_id":"4e2ffdeade41460aab0000bb","treeId":"5483ac8e1268987d4019293f","seq":1387592,"position":0.46875,"parentId":"4d9ee778baecc671e6000080","content":"#### Asset Profiler\n\nowns and is responsible for persisting our asset and identity models into the DB"},{"_id":"4e2ffdffde41460aab0000bc","treeId":"5483ac8e1268987d4019293f","seq":1387593,"position":0.484375,"parentId":"4d9ee778baecc671e6000080","content":"#### Vis\n\nresponsible for communicating with third party vulnerability scanners (through our plugins), scheduling scans and processing results"},{"_id":"4e2ffe21de41460aab0000bd","treeId":"5483ac8e1268987d4019293f","seq":1387594,"position":0.4921875,"parentId":"4d9ee778baecc671e6000080","content":"#### QFlow\n\nreceiving flows from a number of sources, creating flow records, and periodically sending these records to ECS for further processing. 
sources could be:\n * Raw packets from a network interface\n * Raw packets from a high-speed packet capture device (DAG or Napatech)\n * JFlow (Juniper), SFlow, Packeteer, Netflow / IPFIX (Cisco)"},{"_id":"4e2ffe3cde41460aab0000be","treeId":"5483ac8e1268987d4019293f","seq":1387595,"position":0.49609375,"parentId":"4d9ee778baecc671e6000080","content":"#### Reporting Executor\n\nResponsible for the scheduling and execution of the \"report runner\". The report runner generates the reports that the executor requests."},{"_id":"4d9f3011baecc671e6000081","treeId":"5483ac8e1268987d4019293f","seq":1398964,"position":2,"parentId":"4d9ee778baecc671e6000080","content":"#### Terms\n\n* Flow: The record of network communication between two end points. A flow is based on the usual 5-tuple: source IP, source port, destination IP, destination port, protocol.\n * A flow is different from an event, in that flows (for the most part) will have a start and end time, or, a life of multiple seconds.\n* Offense: A container used to represent the flows and events that caused an alert within QRadar. Example: \nA CRE rule exists with the following tests: \n-Create an offense when there have been at least 10 login failures followed by a login success from a single source to a single destination within a 10 minute window. \nThis rule could be used to try and find people guessing (and succeeding) at passwords. In this case, an offense would get created that contained all the login failures and the login success. It would appear under the \"offenses\" tab in the UI and would appear on the dashboard. \n* Reference set: A set of reference data within QRadar that is able to be populated by the CRE and able to be consumed and tested against in the CRE. This is a case where the response to a CRE rule could be \"add the source IP to this reference set\" and another rule could be \"if the destination IP matches an IP in this reference set....\" "},{"_id":"4ebd97e10c63c5133300010c","treeId":"5483ac8e1268987d4019293f","seq":1463477,"position":1.25,"parentId":"4d8da07fb194a8ecc1000063","content":"### Scaling\n\n**Data nodes**\n\nEP could have additional data nodes which shard the data as well as spread the search workload. Adding data nodes is dynamic.\n\n**ECS**\n\nAdding ECS on the other hand is not really dynamic (distribution in that sense). Data stays to where it was processed, which means users need to manually distribute the sending among multiple ECS."},{"_id":"4e2e60ec0e149fb7930000e7","treeId":"5483ac8e1268987d4019293f","seq":1426425,"position":1.5,"parentId":"4d8da07fb194a8ecc1000063","content":"### [FTS - Free-Text Search](https://q1wiki.canlab.ibm.com/pages/viewpage.action?pageId=9634554)\n\nOn the UI, it is a 2nd search mode: \"Quick Filter\".\n\nUse Lucene library for free-form search capabilities inside raw Ariel payload data\n\n![](https://s3.amazonaws.com/uploads.hipchat.com/65647/457333/Y6gNMnlPt2QkS6t/LuLayout%282%29.png)\n"},{"_id":"4e2e87d00e149fb7930000e9","treeId":"5483ac8e1268987d4019293f","seq":1387500,"position":0.5,"parentId":"4e2e60ec0e149fb7930000e7","content":"#### Design\n\nData is from Ariel payload and feed to FTS. Essentially create one index file per time slot (default is 8h) and keeps N slots (configurable). Search can be parallel down to multiple slots (then aggregated). 
Indexing must be done and committed to the index file before a searcher instance can be created.\n\nHence the last, most recent slot is special in that it is using Lucene's NRT search and has limitation in that indexer and searcher must be in the same Java process (performance).\n\n**Solr utilizes update log and hence is reading from the log uncommited change. Benefit?** This real-time get might only be limited to get by unique key."},{"_id":"4e2e6b6a0e149fb7930000e8","treeId":"5483ac8e1268987d4019293f","seq":1387499,"position":1,"parentId":"4e2e60ec0e149fb7930000e7","content":"#### Concern\n\nWhy not using Solr, as it is together with Lucene in Apache.\n\n* Scaling/sharding\n* Faceting\n* Searching API\n\nSolr is pretty mature now, why build from scratch? any special feature?\n"},{"_id":"4ebe7a0f0c63c5133300010d","treeId":"5483ac8e1268987d4019293f","seq":1463794,"position":3,"parentId":"4d8da07fb194a8ecc1000063","content":"### Anomaly Detection"},{"_id":"548e38d923f24c1854c0b553","treeId":"5483ac8e1268987d4019293f","seq":1426378,"position":0.796875,"parentId":"4d8e3280b194a8ecc1000066","content":"## Solr"},{"_id":"548e38d923f24c1854c0b554","treeId":"5483ac8e1268987d4019293f","seq":1397441,"position":1,"parentId":"548e38d923f24c1854c0b553","content":"### [Solr v.s. Lucene](http://www.lucenetutorial.com/lucene-vs-solr.html)\n\nSolr adds functionality like\n\n* XML/HTTP and JSON APIs\n* Hit highlighting\n* Faceted Search and Filtering (Lucene has basic facet search already since 4.x, see [this](http://lucene.apache.org/core/4_2_0/facet/org/apache/lucene/facet/doc-files/userguide.html))\n* Geospatial Search\n* Fast Incremental Updates and Index Replication\n* Caching\n* Replication\n* Web administration interface etc\n* [Copy Fields](https://cwiki.apache.org/confluence/display/solr/Copying+Fields)\n* [Dynamic Fields](https://cwiki.apache.org/confluence/display/solr/Dynamic+Fields)\n* [Schemaless Mode](https://cwiki.apache.org/confluence/display/solr/Schemaless+Mode)"},{"_id":"548e38d923f24c1854c0b557","treeId":"5483ac8e1268987d4019293f","seq":1387462,"position":2,"parentId":"548e38d923f24c1854c0b553","content":"### Limitation\n\n* No automatically resharding or dynamic adding replicas\n\n* Fields cannot be updated/added easily/efficiently\n* No join fields across different documents. Impossible across multiple servers"},{"_id":"4d8e00dcb194a8ecc1000064","treeId":"5483ac8e1268987d4019293f","seq":1387463,"position":3,"parentId":"548e38d923f24c1854c0b553","content":"### Facet\n\nThe most basic form of faceting shows only a breakdown of unique values within a field shared across many documents (a list of categories, for example), Solr also provides many advanced faceting capabilities. These include faceting based upon \n\n* the resulting values of a function, \n* the ranges of values, \n* or even arbitrary queries. 
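\n\nAs a rough illustration, field, range, and query faceting map onto classic Solr request parameters roughly as follows (the field names \`status_code\` and \`bytes\` are hypothetical):\n\n\`\`\`\nq=*:*&rows=0&facet=true\n&facet.field=status_code\n&facet.range=bytes&facet.range.start=0&facet.range.end=1000000&facet.range.gap=100000\n&facet.query=bytes:[1000000 TO *]\n\`\`\`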
\n\nSolr also allows for hierarchical and multidimensional faceting and multiselect faceting, which allows returning facet counts for documents that may have already been filtered out of a search result."},{"_id":"4d91d95eb7b8bb0ed60000b0","treeId":"5483ac8e1268987d4019293f","seq":1387502,"position":0.5,"parentId":"4d8e00dcb194a8ecc1000064","content":"#### Facet Types\n\n* Field faceting - specify filed(s) to facet while searching\n* Query faceting - use arbitrary queries as facet, each query result would be one facet term\n* Range faceting - kind of special case of query faceting for numeric and date fields\n* Multi-select faceting - tag/ex allows to ignore filtering on indicated faceting, allowing while filtering on a facet term, the count of other terms of that facet are still preserved. Useful for UIs allowing multi-selection (without remembering those counts by UI layer)\n* Pivot faceting - (or nested facet) facet across multiple dimensions (ex. how many 4- and 5-star restaurants exist in the top 3 cities in the top 3 states), currently on allowed on field faceting - Pivot faceting also has practical performance constraints."},{"_id":"4d91996f3da03dba17000077","treeId":"5483ac8e1268987d4019293f","seq":1387503,"position":1,"parentId":"4d8e00dcb194a8ecc1000064","content":"#### Ideas\n\n* Comparison between different filtering results\n * While experimenting filtering on different facets, it would be great to have a way to quickly compare with previous faceting result(s), if users desire.\n* Multi-selection of facets, Hierarchy facets selection"},{"_id":"4d9273d6b7b8bb0ed60000b1","treeId":"5483ac8e1268987d4019293f","seq":1387464,"position":4,"parentId":"548e38d923f24c1854c0b553","content":"### Clustering and Scalability\n"},{"_id":"4d9bd743baecc671e600007a","treeId":"5483ac8e1268987d4019293f","seq":1339706,"position":1.0625,"parentId":"4d8e3280b194a8ecc1000066","content":"## Kafka\n\nWhat is the best way of receiving into Kafka on the internet?"},{"_id":"4d9bf3c1baecc671e600007b","treeId":"5483ac8e1268987d4019293f","seq":1316103,"position":1,"parentId":"4d9bd743baecc671e600007a","content":"I found having a single source of stream data is great in that we could have multiple, many consumers: APM, Alerting, QRadar, Analytics ...\n\nBut then, each would need to be stream processing system. However, how to keep \"history\" available for the processing purpose?\n\nIf we keep the canonical/atomic data in Kafka as the source stream, each consumer then would be equal in that they process by their own. If any failure, they just replay from Kafka. There's no need to read from Hadoop from other systems for example .\n\nPlus, any derived stream would also be available if needed by some systems.\n\n**We still need a kind of snapshot/restore facility so we don't have to keep complete log in Kafka.**"},{"_id":"4d9c0a77baecc671e600007d","treeId":"5483ac8e1268987d4019293f","seq":1316108,"position":2,"parentId":"4d9bd743baecc671e600007a","content":"Storm processing output is better written back to Kafka instead of directly to data stores (say Cassandra). 
This decouples Storm from the underlying data store."},{"_id":"4dc812bff9b92f9c890000dd","treeId":"5483ac8e1268987d4019293f","seq":1345424,"position":3,"parentId":"4d9bd743baecc671e600007a","content":"### Tuning\n\nhttp://www.slideshare.net/Hadoop_Summit/analyzing-12-million-network-packets-per-second-in-realtime (p26)\n\n* JBOD gives better performance compared to RAID\n* num.io.threads: same as disk\n* num.network.threads: based on sum of consumer/producer/replication-factor\n\n**Kafka Spout**\n\n* parallelism == #partitions\n* fetchSizeBytes\n* bufferSizeBytes"},{"_id":"4db955bbf9b92f9c890000d9","treeId":"5483ac8e1268987d4019293f","seq":1332770,"position":5,"parentId":"4d8e3280b194a8ecc1000066","content":"## Storm\n"},{"_id":"4db9560cf9b92f9c890000da","treeId":"5483ac8e1268987d4019293f","seq":1332771,"position":1,"parentId":"4db955bbf9b92f9c890000d9","content":"### Trident\n\n"},{"_id":"4dc5d85cf9b92f9c890000dc","treeId":"5483ac8e1268987d4019293f","seq":1387498,"position":1,"parentId":"4db9560cf9b92f9c890000da","content":"#### VS Spark Streaming\n\nSpark only has exactly-once semantics and in some cases falls back to at-least-once.\n\nFault tolerance requires an HDFS-backed data source ([see](https://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics)). Putting data onto HDFS first adds some latency.\n\n* http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming\n\nSpark streaming is good for iterative batch processing, mostly machine learning. Scaling is not there yet (for Yahoo), and tuning could be difficult. Streaming latency could be at least 1s, and it has a single point of failure. All streaming inputs are replicated in memory.\n\n* http://www.slideshare.net/ChicagoHUG/yahoo-compares-storm-and-spark"},{"_id":"4d9c5f8bbaecc671e600007e","treeId":"5483ac8e1268987d4019293f","seq":1340408,"position":2,"parentId":"4db955bbf9b92f9c890000d9","content":"How do you implement stateful stream processing in Storm?\n\n[Samza State Management](http://samza.incubator.apache.org/learn/documentation/0.7.0/container/state-management.html)"},{"_id":"4dd06210f9b92f9c890000e0","treeId":"5483ac8e1268987d4019293f","seq":1345433,"position":3,"parentId":"4db955bbf9b92f9c890000d9","content":"### Lessons Learned\n\nhttp://www.slideshare.net/Hadoop_Summit/analyzing-12-million-network-packets-per-second-in-realtime (p41)\n\n* per-tuple performance overhead counts, since we got \"a lot\" of tuples. optimize.\n* optimize individual spout/bolt first, then the topology as a whole.\n* failing to properly handle errors could lead to topology failure and hence lose data (or mess it up) \n"},{"_id":"4d24d2d0353767dea6000143","treeId":"5483ac8e1268987d4019293f","seq":1387457,"position":4,"parentId":null,"content":"# Case Study"},{"_id":"4e1f0c87205c4ee7ab0000a5","treeId":"5483ac8e1268987d4019293f","seq":1431864,"position":0.5,"parentId":"4d24d2d0353767dea6000143","content":"## Scenarios Tested\n\n### Apache access log\n\nAnalytics on the simple access log from Apache. Of course, these scenarios are meant to be put on a dashboard.\n\n1. Unique visitor counts per day (ip/date/agent combined)\n * possibly excluding spiders\n2. Top Requested URLs sorted by hits\n3. Top Requested Static Files sorted by hits\n4. Top 404 Not Found URLs sorted by hits\n5. Top Hosts sorted by hits\n6. Top Operating Systems sorted by unique visitors\n7. Top Browsers sorted by unique visitors\n8. Top Requested Referrers sorted by hits\n9. 
Top HTTP Status Codes sorted by hits\n\nUnique visitor requires the ability to \"combine\" fields. OS or browsers requires virtual field derived from a field.\n"},{"_id":"4d24d334353767dea6000144","treeId":"5483ac8e1268987d4019293f","seq":1227758,"position":1,"parentId":"4d24d2d0353767dea6000143","content":"## Loggly\n\n* +750 billion events logged to date\n* Bursts of 100k+ events per second over the span of hours\n* Events vary in size from 100 bytes to several megabytes\n* End-to-end processing in 10 seconds (form ingestion to searchable on web)\n"},{"_id":"4d24d45c353767dea6000148","treeId":"5483ac8e1268987d4019293f","seq":1225363,"position":1,"parentId":"4d24d334353767dea6000144","content":"### First generation\n\n(2011)\n\n* Near real-time searchable logs\n* Thousands of users\n * 10-100k events/sec\n * Typically, when user systems are busy, they send even more logs (***Challenges***)\n * Log traffic has distinct bursts, can last for hours\n* EC2 (only)\n* Use Solr cloud\n * Tens of thousands of shards (with special 'sleep shard' logic)\n* ZeroMQ\n"},{"_id":"4d24f135353767dea6000149","treeId":"5483ac8e1268987d4019293f","seq":1387497,"position":1,"parentId":"4d24d45c353767dea6000148","content":"#### Lessons Learned\n\n* Event ingestion too tightly coupled to indexing\n * Manual re-indexing for temporary SOLR issues\n* Multiple Indexing strategies needed\n * 4 orders of magnitude difference between our high volume users and low volume users (10 eps vs. 100000+ eps)\n * Too much system overhead for low volume users\n * Difficult to support changing indexing strategies for a customer\n\nThat's how Kafka and Storm entered."},{"_id":"4d24ff1d353767dea600014a","treeId":"5483ac8e1268987d4019293f","seq":1387496,"position":2,"parentId":"4d24d45c353767dea6000148","content":"#### ElasticSearch\n\n* Add indices with REST API\n* Indices and Nodes have attributes\n * Indices automatically moved to best Nodes\n* Indices can be sharded\n* Supports bulk insertion\n* Plugins for monitoring cluster"},{"_id":"4d453f12b194a8ecc1000051","treeId":"5483ac8e1268987d4019293f","seq":1387495,"position":3,"parentId":"4d24d45c353767dea6000148","content":"#### Solr\n\n* The collection API in Solr was branch new and very primitive, while ES had native, robust battle-tested index management.\n* Both had sensible default behavior for shard allocation to cluster nodes, but the routing framework in ES was significantly more powerful and stable than the collection API.\n* Solr consistently indexed at about 18k eps, while ES started at about the same rate then gradually slowed down to around 12k eps (though ES is not yet using newer Lucene version)\n* Neither allowed for dynamic changes to the number of shards in an index after creation.\n\n"},{"_id":"4d250228353767dea600014c","treeId":"5483ac8e1268987d4019293f","seq":1225408,"position":2,"parentId":"4d24d334353767dea6000144","content":"### Second geenration\n\n(2013/09)\n\n* Always accept log data - never drag the source down because of some problems in ingestion part \n* Never drop log data - some user are willing to trade latency for not losing any data\n* True elasticity - truly and easily scale horizontally and vertically\n* AWS deployment\n* Kafka, Storm, ElasticSearch\n"},{"_id":"4d250687353767dea600014e","treeId":"5483ac8e1268987d4019293f","seq":1387494,"position":1,"parentId":"4d250228353767dea600014c","content":"#### Kafka is perfect match for real time log events\n\n* High throughput\n* Consumer tracks their own location\n* Use multiple brokers\n * Multiple 
brokers per region\n * Availability zone separation\n\n#### Storm\n\n* \"Pull\"\n * Provisioned for average load, not peak - because Kafka decouples Storm from the ingestion point and hence you don't need to always keep up with the peak workload\n * Worker nodes can be scaled dynamically"},{"_id":"4d255067353767dea600014f","treeId":"5483ac8e1268987d4019293f","seq":1387493,"position":2,"parentId":"4d250228353767dea600014c","content":"#### Ingestion Architecture\n\n* Multiple collectors at the first layer (multiple zones)\n * C++ multi-threaded\n * Boost ASIO\n * Each can handle +250k/sec\n* Collectors to Kafka brokers\n* Then Storm picks from Kafka; after processing, results are sent back to Kafka\n * Storm topology: first (multiple) classification bolts (identifying customers) and summary statistics bolts in parallel, then a rate determination bolt (monitoring volume limits), last (multiple) analysis bolts and archiving bolts in parallel (to S3), finally emits to Kafka (stage 2)\n * Snapshot (stage 2) last day of Kafka events to S3\n * Also send to S3 for customer's own usage\n* ElasticSearch picks from Kafka\n * defers problematic events to Kafka for later re-processing"},{"_id":"4d25f04e353767dea6000156","treeId":"5483ac8e1268987d4019293f","seq":1223851,"position":0.5,"parentId":"4d255067353767dea600014f","content":"What's good about Storm?\n\n* Spouts and bolts fit the network approach, where logs could move from bolt to bolt sequentially or need to be consumed by several bolts in parallel\n* Guaranteed data processing\n* Dynamic deployment"},{"_id":"4d257bb1353767dea6000150","treeId":"5483ac8e1268987d4019293f","seq":1223671,"position":1,"parentId":"4d255067353767dea600014f","content":"ElasticSearch index management\n\n* Indices are time-series data\n * Separated by customer\n * Represent slices of time\n * Higher volume indices will have shorter time slices\n* Multi-tier architecture for efficient indexing \n * Multiple indexing tiers mapped to different AWS instance types\n"},{"_id":"4d258495353767dea6000151","treeId":"5483ac8e1268987d4019293f","seq":1387492,"position":3,"parentId":"4d250228353767dea600014c","content":"#### Pre-production system\n\n* Parallel to production, picking events from the same Kafka topics (one of the partitions, as it does not need the full data set)\n"},{"_id":"4d2587e3353767dea6000152","treeId":"5483ac8e1268987d4019293f","seq":1387491,"position":4,"parentId":"4d250228353767dea600014c","content":"#### AWS deployment\n\n* Collector needs high compute power, disk is not important\n* Kafka is memory optimized, needs disk buffer caching\n"},{"_id":"4d258c11353767dea6000153","treeId":"5483ac8e1268987d4019293f","seq":1387490,"position":5,"parentId":"4d250228353767dea600014c","content":"#### False starts\n\n* AWS ELB for load balancing between collectors\n * No port forwarding of 514 (syslog)\n * No UDP forwarding\n * Bursts hit the ELB limit\n* Ended up using Route 53 DNS Round Robin\n* Originally used Cassandra both as persistence store and as queue\n * Cassandra not designed for queue usage\n * Added complexity, as multi-tenancy requires handling data bursts. 
Collectors still needed to be able to buffer to disk."},{"_id":"4d259594353767dea6000154","treeId":"5483ac8e1268987d4019293f","seq":1387489,"position":6,"parentId":"4d250228353767dea600014c","content":"#### Big Wins\n\n* Easy to scale Storm resources\n* Route 53 DNS supports with latency resolution\n* Pre-production staging system\n"},{"_id":"4d4547a8b194a8ecc1000052","treeId":"5483ac8e1268987d4019293f","seq":1387488,"position":7,"parentId":"4d250228353767dea600014c","content":"#### Dynamic Fields\n\nAt a glance, you would see a summary of the different kinds of exceptions that are currently occurring and the frequency of these exceptions between desktop vs mobile or across different browsers (kind of sub-grouping). It's this instant and continually customized overview that is new and different. it allows you to get a much fuller view of your logs and to more quickly zero in on what's important.\n"},{"_id":"4d454bffb194a8ecc1000053","treeId":"5483ac8e1268987d4019293f","seq":1387487,"position":8,"parentId":"4d250228353767dea600014c","content":"#### Ingestion syslog utility\n\n[PatternDB](https://github.com/balabit/syslog-ng-patterndb) to parse log messages and generate name-value pairs from them. These can then be forwarded to Loggly using the JSON output.\n\nSee [https://www.loggly.com/blog/sending-json-format-logs-syslog-ng/](https://www.loggly.com/blog/sending-json-format-logs-syslog-ng/)\n\nShould you send us a log format we don't currently recognize, we'll want you to tell us about it."},{"_id":"4d25e2c9353767dea6000155","treeId":"5483ac8e1268987d4019293f","seq":1227273,"position":3,"parentId":"4d24d334353767dea6000144","content":"### Storm Removed\n\n(2014/08)\n\nSee [What We Learned About Scaling with Apache Storm: Pushing the Performance Envelope](https://www.loggly.com/blog/what-we-learned-about-scaling-with-apache-storm/)\n\n* Guaranteed delivery feature causes big performance hit\n * **2.5x hit**\n * Potential workaround: batch logs. Ack a set of logs. But this is hacks and not consistent with Storm's concept. And it's not trivial to do.\n * (now) Trident can be used for this purpose though.\n* Custom approach: 100K+ events per second\n"},{"_id":"4e086341f86c56fb5b000090","treeId":"5483ac8e1268987d4019293f","seq":1379975,"position":4,"parentId":"4d24d334353767dea6000144","content":"### Try\n\nscenarios done: 2/3/4/5 (no table form, just chart), 8/9\n\nProvisioning takes about 1 min.\n\nBuilt on Angular, load page very slow ... Also, it could be network but I don't feel it fast, in fact it feels very slow, like when faceting.\n\n"},{"_id":"4e157545f86c56fb5b00012b","treeId":"5483ac8e1268987d4019293f","seq":1375325,"position":0.5,"parentId":"4e086341f86c56fb5b000090","content":"#### Ingestion and Source setup\n\nProvides a quick upload way (curl) to try first. Very simple and straight forward (though it's even simpler to upload on the page I suppose).\n\nAuto detected apache access log as the type.\n\nDoes seem to be syslog based (receiving end). In fact, all investors do support syslog end point.\nEven file monitoring instruction uses syslog configuration method. Even log4j is instructed to be configured to send to local syslog daemon first.\n\nFrom client side, you can also easily send logs via http/https, in plain text or json format.\n\nUsing tokens for security/identify. (https://logs-01.loggly.com/bulk/95179f81-503b-4d70-a2ed-728955261cc0/tag/file_upload)\n\nIngestion takes some time to show up on search page (~20s).\n\nSource group: logical grouping on host/app/tag (AND). 
Used for like production/staging.\n\nThere doesn't seem to be a way to re-parsing existent data and dynamically on UI. That is, to modify the parsing method (like add a new field like in Splunk), you'd have to re-sending data (after changing the collector).\n\njson are automatically parsed and fields are automatically extracted. There are a few supported log formats, but other than that, you have to parse yourself and send in in json\n\nIn general, only has syslog and http/https endpoints.\n\nOne interesting thing: wikified strings are considered while indexing. That is, S3Stats.processStatEvents is indexed as \"S 3 Stats process Stat Events\"."},{"_id":"4e1585a8f86c56fb5b00012c","treeId":"5483ac8e1268987d4019293f","seq":1375285,"position":0.75,"parentId":"4e086341f86c56fb5b000090","content":"#### Search\n\nSearch syntax similar/same to Lucene/Solr with minor extension. Like field:>40, _exists_:field:value, category range field:[xxx TO yyy]\n\nFaceting is on the left side. It's not very fast (feeling so slow might be my network). 2-level faceting is possible and clicking creates a filter. Multiple filtering is allowed. You can easily remove a filter in active."},{"_id":"4e157492f86c56fb5b00012a","treeId":"5483ac8e1268987d4019293f","seq":1379945,"position":1,"parentId":"4e086341f86c56fb5b000090","content":"#### UI\n\nI like the timeline google finance like operation.\n\nSave current 'search' easily by \"star\" button.\n\nAlert is based on saved search and run periodically (quickest is ran every minute). Can only alert if the saved search condition matched over some number in some past period, like over 5 times in the past 5 minutes. In addition to email, you only have http call back and page duty choices.\n\nCharting/trending/grouping is done on UI and split into multi-step. First select chart type, then grouping accordingly. There's also a grid view that you can compose your own table view freely.\n\nCurrently only timeline/pie/bar/single-value. Charting should really include grid/table form.\n\nThe 'dimension' in charting is fixed. That is, you cannot combine multiple fields. For example, unique visitors in Apache access logs requires combination of IP/Date/Agent. This is more fundamental to the faceting, which only allow faceting on existent field, no composition.\n\nAlso, there's no control over the time unit. Saved widget used current time range as it is (which is not easy to use btw). Say I want to have a trend line on the unit of per day.\n\nI like the tabbed interface, able to preserve a diagnosis session (most likely the time range filter).\n\nWhen charting/trending, can't do roll-up. Like group by then roll up some other columns (of course, it can only show single filed used as the group by). For example, top hosts also showing total bytes requests for each host."},{"_id":"4d24d414353767dea6000147","treeId":"5483ac8e1268987d4019293f","seq":1375337,"position":1.5,"parentId":"4d24d2d0353767dea6000143","content":"## Sumo Logic"},{"_id":"4e221a35205c4ee7ab0000aa","treeId":"5483ac8e1268987d4019293f","seq":1380721,"position":0.5,"parentId":"4d24d414353767dea6000147","content":"### LogReduce\n\nAutomatically apply the Summarize operator to the current results (non-aggregate).\n\nThe Summarize operator's algorithm uses fuzzy logic to cluster messages together based on string and pattern similarity. You can use the summarize operator to quickly assess activity patterns for things like a range of devices or traffic on a website. 
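\n\nFor instance, a minimal query of this shape (the keyword terms here are just placeholders) would be:\n\n\`\`\`\nerror or timeout | summarize\n\`\`\`\n\n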
Focus the summarize algorithm on an area of interest by defining that area in the keyword expression.\n\nReduce logs into patterns, works across multiple sources. By identifying the change of log patterns, say a new pattern emerges, it is able to do anomaly detection.\n"},{"_id":"4e22bea8205c4ee7ab0000ab","treeId":"5483ac8e1268987d4019293f","seq":1387465,"position":0.75,"parentId":"4d24d414353767dea6000147","content":"### Transaction\n\nallows cross event tracing of transactions based on a common field, with states defiend:\n\n```\n... | transaction on sessionid\nwith \"Starting session *\" as init,\nwith \"Initiating countdown *\" thresh=2 as countdown_start,\nwith \"Countdown reached *\" thresh=2 as countdown_done,\nwith \"Launch *\" as launch\nresults by transactions showing max(_messagetime),\nsum(\"1\") for init as initcount\n```\n\nalso able to draw flow chart directly:\n\n```\n_sourceCategory=[sourceCategoryName]\n| parse \"] * - \" as sessionID\n| parse \"- * accepting request.\" as module\n| transaction on SessionID with states Frontend, QueryProcessor, Dashboard, DB1, DB2, DB3 in module\nresults by flow\n| count by fromstate, tostate\n```\n\n![](https://service.au.sumologic.com/help/Resources/Images/Flow_Diagram_simple_example_655x221.png)"},{"_id":"4e22d468205c4ee7ab0000ac","treeId":"5483ac8e1268987d4019293f","seq":1387466,"position":0.875,"parentId":"4d24d414353767dea6000147","content":"### Scheduled view\n\nLike materialized search, pre-aggregated result.\n\n### Indexes Paritions\n\nYou can create a partition specific for a search sub-range. for performance."},{"_id":"4e086b85f86c56fb5b000091","treeId":"5483ac8e1268987d4019293f","seq":1380664,"position":1,"parentId":"4d24d414353767dea6000147","content":"### Try\n\nall 9 scenarios can be done.\n\nit is very much like Splunk, which does all the processing at search phase. This would typically cause long duration especially for aggregation type operation. Loggly style usually come back in less than 5 sec.\n\nIt does support field extraction at collection phase. Define the `parse` operator as a rule and it would be applied automatically. This combines the best of both world.\n\nprovisioning not fast ...\n\ncollector setup is not flexible. requires installation of its collector.\n\nnow I know why IBM is looking for it. enterprise..."},{"_id":"4e1f5588205c4ee7ab0000a6","treeId":"5483ac8e1268987d4019293f","seq":1418246,"position":1,"parentId":"4e086b85f86c56fb5b000091","content":"#### Ingestion\n\nRequires custom collector. Once collector is installed, it is managed via web UI. Sources could have different static tags for search purpose. Filters could also be defined on sources (ex/include, hash, mask).\n\nCurrently sources could be : \n* Local File Source - files on the same collector host.\n* Remote File Source - ssh / unc\n* Syslog Source\n* Windows Events Source\n* Script Source - pretty limited, must be on the same host as the collector. only bat/vb/bash/python/perl\n\nThere's another \"hosted\" option which provides HTTP/S and S3 endpoints.\n\nCollector takes job from server, scheduled seems to be every 2 minutes. 
Other than that, ingestion speed is fine compared to Loggly (can't really tell how slow or fast it is due to the lag).\n\nThe parsing is done during search time, just like Splunk."},{"_id":"4e1ff43c205c4ee7ab0000a8","treeId":"5483ac8e1268987d4019293f","seq":1380097,"position":1.5,"parentId":"4e086b85f86c56fb5b000091","content":"#### Search\n\nIt has pipe in search just like Splunk.\n\nThere's a `summary` operator which cluster messages based on pattern similarites. Ex. \n\n```\nerror or fail* | summarize| parse \"message: *\" as msg | summarize field=msg\n```\n\nmore ...\n\n```\ncount _size | top 10 _size by _count\n```\n\nParsing apache access log:\n\n```\n* | parse using public/apache/access | where status_code matches \"4*\"\n```"},{"_id":"4e1fdab0205c4ee7ab0000a7","treeId":"5483ac8e1268987d4019293f","seq":1380385,"position":2,"parentId":"4e086b85f86c56fb5b000091","content":"#### UI\n\nActually more mature than Loggly.\n\nI like the 'Library' overlayer that shows common/stock search categorized by application.\n\nWhen I try to create a widget out of the unique visitors result, it failed as not supported operator.\n\nDashboard: for some reason, can't get it to show data successfully.\n"},{"_id":"4e20dd4b205c4ee7ab0000a9","treeId":"5483ac8e1268987d4019293f","seq":1380192,"position":3,"parentId":"4e086b85f86c56fb5b000091","content":"#### Alert\n\nAlso via saved search. Even though on UI it says it could be real-time, but I think it is still running periodically.\n\nAlert actions could be email, script, or save to index."},{"_id":"4d24d377353767dea6000145","treeId":"5483ac8e1268987d4019293f","seq":1223062,"position":2,"parentId":"4d24d2d0353767dea6000143","content":"## Papertrail\n"},{"_id":"4e2c288a0e149fb7930000e4","treeId":"5483ac8e1268987d4019293f","seq":1386056,"position":1,"parentId":"4d24d377353767dea6000145","content":"### Try\n\nMore developer centric mindset. It doesn't have rich feature set. Much of its functions are basic \"tail\" like operation.\n\nAnalytics side, it does not have any 'native' function. It allows daily archive to S3 and then provides some scripts/instructions for loading archives into different tools/envs for self analytics work. Ex. Hive or Redshift.\n\nOne interesting [tool](https://github.com/papertrail/papertrail-cli) allows tails locally the centralized logs.\n\nA Ruby shop."},{"_id":"4e2c297f0e149fb7930000e5","treeId":"5483ac8e1268987d4019293f","seq":1386054,"position":1,"parentId":"4e2c288a0e149fb7930000e4","content":"#### Ingestion\n\nBasically syslogd. It also provides a small remote_syslog daemon.\n\nLeast feature reach tried so far. Essentially positioned as a centralized \"tail\" service. Minimum search function (basic boolean operators and full text). Also seems to only allow syslog messages.\n\nAllow log filters to removes logs never wanted while ingestion."},{"_id":"4e2cbdf90e149fb7930000e6","treeId":"5483ac8e1268987d4019293f","seq":1386059,"position":2,"parentId":"4e2c288a0e149fb7930000e4","content":"#### Alert\n\nScheduled saved search. Support many web-hooks as notification."},{"_id":"4e56bc0bdaab58e28000010e","treeId":"5483ac8e1268987d4019293f","seq":1437578,"position":2.5,"parentId":"4d24d2d0353767dea6000143","content":"## Solr/Banana\n\n"},{"_id":"4e56bcaedaab58e28000010f","treeId":"5483ac8e1268987d4019293f","seq":1437692,"position":1,"parentId":"4e56bc0bdaab58e28000010e","content":"### Try\n\nTried local Banana/Solr. To be honest, it's not fast. Any action on the page take maybe at least 1 second. 
(This may be due to local Solr slowness, not sure.)\n\nAlso, sometimes the page just \"locks\".\n\nWhat I like about Banana (Kibana I assume):\n\n1. Whole page is controlled by a single json file (not just widget, but also the active filter, the \"value\"), and better you get to load the json from anywhere lively (include gist).\n2. json can be saved to solr as well\n3. Purely in browser.\n"},{"_id":"4e8c13d0c4b24131b70000f2","treeId":"5483ac8e1268987d4019293f","seq":1437684,"position":1,"parentId":"4e56bcaedaab58e28000010f","content":"#### Dashboard\n\nThe whole dashboard is row based, with each row containing multiple \"panels\". You can specify the height of each row in pixel, width of panels in x/12 (like bootstrap). There are about 17 types of panel for you to choose, more like pre-defined panel for use. Ex. timepicker, query, facet, filtering, etc. Think of widget.\n\n* bettermap: Map\n* column: container panel allow vertical arrangement of multiple panels within a row.\n* facet:\n* filtering: shows current active filters\n* fulltextsearch\n* heatmap: for showing pivot facet\n* histogram: A bucketed time series chart of the current query, including all applied time and non-time filters. When used in count mode, uses Solr’s facet.range query parameters. In values mode, it plots the value of a specific field over time, and allows the user to group field values by a second field.\n* hits: shows total events count\n* map\n* query: search bar\n* rangeFacet\n* scatterplot: choose 2 variables\n* table: A paginated table of records matching your query (including any filters that may have been applied). Click on a row to expand it and review all of the fields associated with that document. Provides the capability to export your result set to CSV, XML or JSON for further processing using other systems.\n* term: Displays the results of a Solr facet as a pie chart, bar chart, or a table. Newly added functionality displays min/max/mean/sum of a stats field, faceted by the Solr facet field, again as a pie chart, bar chart or a table.\n* text\n* ticker: stock ticker style\n* timepicker\n\nEach panel defines its own config section.\n\nSome panels apply globally to the whole page's panel (like timepicker and query).\n\n\"Display help message when Inspect\" allows html, markdown, and text modes, with some content specified in panel config. For example, bettermap allows showing active query when inspect.\n\nOne pitfall is some faceting only can work on numeric fields. 
which is kind of inconvenient."},{"_id":"4ec0823a0c63c51333000118","treeId":"5483ac8e1268987d4019293f","seq":1464753,"position":2.75,"parentId":"4d24d2d0353767dea6000143","content":"## Splunk"},{"_id":"4ec0827d0c63c51333000119","treeId":"5483ac8e1268987d4019293f","seq":1464754,"position":1,"parentId":"4ec0823a0c63c51333000118","content":"Searches and Reports\n\n```\nTPAE UI session > 150 \n\nsourcetype=tpae eventtype=\"tpae_total_users_and_jvm_ui_sessions\" OR eventtype=\"tpae_total_users\" OR eventtype=\"tpae_jvm_ui_sessions\"|timechart span=15m max(ui_sessions) as sessions by host|untable _time host sessions|search sessions > 150|timechart first(sessions) by host\n\nTPAE UI session tracks \n\ncorrelation_id=* ElapsedTime session_id=*|table session_id correlation_id _time user_id elapsed_time app event|sort session_id correlation_id\n\nTPAE memory available \n\nsourcetype=tpae eventtype=\"tpae_memory_status\"|stats sparkline(avg(eval(memory_available/1024/1024))) as memory_available, avg(eval(memory_available/1024/1024)) as avg, min(eval(memory_available/1024/1024)) as min, max(eval(memory_available/1024/1024)) as max by host\n\nTPAE slow query by app, object \n\nsourcetype=tpae eventtype=\"tpae_slow_sql\"|stats count, avg(sql_time) as avg_exec_time, max(sql_time) as max_exec_time, min(sql_time) as min_exec_time by app, object|sort 20 - count, - avg_exec_time \n\nTPAE stacked memory available and used \n\nsourcetype=tpae eventtype=\"tpae_memory_status\"|timechart span=15m avg(eval(memory_available/1024/1024)) as avg_available, avg(eval((memory_total - memory_available)/1024/1024)) as avg_used\n\nTPAE top mbo count \n\nsourcetype=tpae eventtype=\"tpae_mbo\"|stats sparkline(avg(mbo_count)), avg(mbo_set_count) as set_count, avg(mbo_count) as count by mbo|sort 20 - set_count\n\nTPAE ui sessions' correlations \n\ncorrelation_id=\"*\" ElapsedTime|stats first(elapsed_time) as time first(user_id) as user first(app) as app first(session_id) as session_id by correlation_id|stats count(correlation_id) as count first(time) as time first(user) as user first(app) as app by session_id|sort 10 - time\n\nTPAE users and session \n\nsourcetype=tpae eventtype=\"tpae_total_users_and_jvm_ui_sessions\" OR eventtype=\"tpae_total_users\" OR eventtype=\"tpae_jvm_ui_sessions\" | timechart span=15m first(users) as total_users, avg(ui_sessions) as per_jvm_sessions\n```"},{"_id":"4d24d3b2353767dea6000146","treeId":"5483ac8e1268987d4019293f","seq":1261055,"position":3,"parentId":"4d24d2d0353767dea6000143","content":"## New Relic\n"},{"_id":"4d4d7979b194a8ecc1000056","treeId":"5483ac8e1268987d4019293f","seq":1261051,"position":0.5,"parentId":"4d24d3b2353767dea6000146","content":"### New Relic Insights\n\n[New Relic Insights Technology](http://newrelic.com/insights/technology)"},{"_id":"4d4562f8b194a8ecc1000054","treeId":"5483ac8e1268987d4019293f","seq":1387483,"position":1,"parentId":"4d4d7979b194a8ecc1000056","content":"#### The Database\n\nNew Relic Insights is powered by a highly distributed cloud-hosted event database with an innovative architecture that does not require indexing.\n\nData is distributed across cluster nodes so each query can leverage the full power of the cluster to return billions of metrics in seconds.\n\nThe New Relic Insights database does not require indexing. This eliminates the need to configure the database in advance based on the types of queries expected. 
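\n\nFor instance, a hypothetical query in its SQL-flavored query language (the event type and attribute names here are illustrative) can group on an arbitrary attribute with no index defined up front:\n\n\`\`\`\nSELECT count(*) FROM PageView FACET countryCode SINCE 1 day ago\n\`\`\`\n\n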
Query speed is fast regardless of what attributes you query."},{"_id":"4d4569c7b194a8ecc1000055","treeId":"5483ac8e1268987d4019293f","seq":1387481,"position":2,"parentId":"4d4d7979b194a8ecc1000056","content":"#### Presenting Data\n\nNew Relic Insights uses the SQL-flavored query language - NRQL (New Relic Query Language) for querying the events database.\n"},{"_id":"4d4d7b0db194a8ecc1000057","treeId":"5483ac8e1268987d4019293f","seq":1261127,"position":2,"parentId":"4d24d3b2353767dea6000146","content":"### APM as a Service\n\n[How to Build a SaaS App With Twitter-like Throughput on Just 9 Servers](http://www.slideshare.net/newrelic/how-to-build-a-saas-app-with-twitterlike-throughput-on-just-9-servers)\n\n[New Relic Architecture - Collecting 20+ Billion Metrics A Day](http://highscalability.com/blog/2011/7/18/new-relic-architecture-collecting-20-billion-metrics-a-day.html)"},{"_id":"4d4d811bb194a8ecc1000059","treeId":"5483ac8e1268987d4019293f","seq":1261103,"position":2,"parentId":"4d4d7b0db194a8ecc1000057","content":"The amount of data we collect every day is staggering. Initially all data is captured at full resolution for each metric. Over time we reperiodize the data, going from minute-by-minute to hourly and then finally to daily averages. For our professional accounts, we store daily metric data indefinitely, so customers can see how they've improved over the long haul.\n\nOur data strategy is optimized for reading, since our core application is constantly needing to access metric data by time series. It's easier to pay a penalty on write to keep the data optimized for faster reads, to ensure our customers can quickly access their performance data any time of the day. \n\nCreates a database table per account per hour to hold metric data. This table strategy is optimized for reads vs. writes\n\nHaving so many tables with this amount of data in them makes schema migrations impossible. Instead, \"template\" tables are used from which new timeslice tables are created. New tables use the new definition while old tables are eventually purged from the system. The application code needs to be aware that multiple table definitions may be active at one time."},{"_id":"4d4d7c0ab194a8ecc1000058","treeId":"5483ac8e1268987d4019293f","seq":1387475,"position":2.5,"parentId":"4d4d7b0db194a8ecc1000057","content":"Data collection handled by 9 sharded MySQL servers.\n\nThe \"collector\" service digests app metrics and persists them in the right MySQL shard.\n\nThese back-end metrics are persisted in the customer's New Relic agent for up to one minute, where they are then sent back to the New Relic data collection service\n\nReal User Monitoring javascript snippet sends front-end performance data to the \"beacon\" service for every single page view.\n"},{"_id":"4d4d9320b194a8ecc100005b","treeId":"5483ac8e1268987d4019293f","seq":1261115,"position":1,"parentId":"4d4d7c0ab194a8ecc1000058","content":"Use the right tech for the job. The main New Relic web application has always been a Rails app. The data collection tier was originally written in Ruby, but was eventually ported over to Java. The primary driver for this change was performance. This tier currently supports over 180k requests per minute and responds in around 2.5 milliseconds with plenty of headroom to go. 
"},{"_id":"4d4d8e60b194a8ecc100005a","treeId":"5483ac8e1268987d4019293f","seq":1387476,"position":3,"parentId":"4d4d7b0db194a8ecc1000057","content":"#### Challenges\n\n* Data purging: Summarization of metrics and purging granular metric data is an expensive and nearly continuous process\n* Determining what metrics can be pre-aggregated\n* Large accounts: Some customers have many applications, while others have a staggering number of servers\n* Load balancing shards: Big accounts, small accounts, high-utilization accounts"},{"_id":"4d4daf64b194a8ecc100005c","treeId":"5483ac8e1268987d4019293f","seq":1387478,"position":4,"parentId":"4d4d7b0db194a8ecc1000057","content":"#### Plugin architecture\n\n[An In-Depth Look at the New Relic Platform](http://architects.dzone.com/articles/depth-look-new-relic-platform)\n\nPlugins are made up of two parts:\n\n* An agent that observes the system being monitored and reports metrics about that system up to the New Relic data center\n* A matching set of dashboards that display those metrics"},{"_id":"4d4dcda4b194a8ecc100005d","treeId":"5483ac8e1268987d4019293f","seq":1387479,"position":5,"parentId":"4d4d7b0db194a8ecc1000057","content":"#### Alerting\n\nAlerting: AppDynamics computes your response time thresholds by itself and might take some time to learn your system. New Relic relies on custom thresholds defined by you for its Apdex index."},{"_id":"4e63ae82daab58e280000110","treeId":"5483ac8e1268987d4019293f","seq":1417446,"position":5,"parentId":null,"content":"# References"},{"_id":"4d24cfb4353767dea6000142","treeId":"5483ac8e1268987d4019293f","seq":1417457,"position":1,"parentId":"4e63ae82daab58e280000110","content":"## What is needed in a log management system\n\n* Search (for something), better yet, search API\n* Chart (for spotting trend)\n* Realtime (fast enough, no lag)\n* Flexible ingestion API (or standard API)\n* High fault tolerance (no log lost allowed)\n"},{"_id":"4d24c948353767dea6000141","treeId":"5483ac8e1268987d4019293f","seq":1417449,"position":2,"parentId":"4e63ae82daab58e280000110","content":"## Best Practice in Log Management\n\n* Use existing logging infrastructure\n * Real time syslog\n * Application log file watching\n* Store log externally\n* Log messages in machine parsable format\n * JSON for structured information\n * Key-value pairs\n"},{"_id":"4d5dcc11b194a8ecc100005e","treeId":"5483ac8e1268987d4019293f","seq":1417453,"position":4,"parentId":"4e63ae82daab58e280000110","content":"## Lambda Architecture\n\n* Batch layer\n* Serving layer\n* Speed layer\n\nData goes to both batch and speed layer. Batch layer contains the master dataset, serving layer generates query views, and speed layer provides query views of the most recent data which batch/serving layer has not yet caught up.\n\nThe batch layer runs functions over the master dataset to precompute intermediate data for your queries, which are stored in batch views in the serving layer. The speed layer compensates for the high latency of the batch layer by providing low latency updates using data that has yet to be precomputed into a batch view. 
## Lambda Architecture

* Batch layer
* Serving layer
* Speed layer

Data goes to both the batch and speed layers. The batch layer contains the master dataset, the serving layer generates query views, and the speed layer provides query views over the most recent data that the batch/serving layers have not yet caught up with.

The batch layer runs functions over the master dataset to precompute intermediate data for your queries, which is stored in batch views in the serving layer. The speed layer compensates for the high latency of the batch layer by providing low-latency updates using data that has yet to be precomputed into a batch view. Queries are then satisfied by processing data from the serving-layer views and the speed-layer views and merging the results (see the sketch below).
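A minimal sketch of the query-time merge, assuming hourly event counts as the metric: the batch view covers everything up to the last batch run, and the speed view covers the tail since then. The data structures and function names are illustrative, not taken from any specific framework.

```python
# Minimal sketch: merge a precomputed batch view with a realtime speed view.
# Both views map an hour bucket ("YYYY-MM-DDTHH") to an event count.
# The cutoff marks where the last batch run ended; names are illustrative.

def query_counts(batch_view: dict, speed_view: dict, batch_cutoff: str) -> dict:
    """Return hourly counts, preferring the batch view up to the cutoff and
    filling the tail from the speed view."""
    merged = {h: c for h, c in batch_view.items() if h <= batch_cutoff}
    for hour, count in speed_view.items():
        if hour > batch_cutoff:
            merged[hour] = merged.get(hour, 0) + count
    return merged

batch_view = {"2015-01-07T09": 1200, "2015-01-07T10": 950}   # precomputed by batch layer
speed_view = {"2015-01-07T10": 30, "2015-01-07T11": 410}     # may overlap; overlaps are superseded by batch
print(query_counts(batch_view, speed_view, batch_cutoff="2015-01-07T10"))
# -> {'2015-01-07T09': 1200, '2015-01-07T10': 950, '2015-01-07T11': 410}
```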
### Batch Layer

The master dataset must be atomic, immutable, and perpetual.

### Serving Layer

Database requirements:

* Batch writes
* Random reads
* Scalable
* Fault-tolerant

### Speed Layer

https://www.bro.org/brocon2014/brocon2014_grutzmacher_opensoc.pdf (p23)

Algorithms for batch and streaming processing.

## HBase

A key/value store providing random and realtime read/write access.

### Tuning

http://www.slideshare.net/Hadoop_Summit/analyzing-12-million-network-packets-per-second-in-realtime (p29)

### VS Cassandra

* http://bigdatanoob.blogspot.com/2012/11/hbase-vs-cassandra.html

## Hive

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. At the same time, the language allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL.

It does not provide random access (read/write).

**Think of it as a higher-level abstraction for using MapReduce through a SQL-like interface.**

* http://kb.tableausoftware.com/articles/knowledgebase/hadoop-hive-performance

Hive is useful only if the data is structured.

* http://www.quora.com/What-are-the-benefits-and-drawbacks-of-using-MapReduce-over-Hive

## CloudSight

https://github.rtp.raleigh.ibm.com/cloudsight/cloudsight/wikis/home

## Dashboard References

* https://www.thedash.com - frankly, it is slow ... but it raises one good thought on simplicity. **Need to try more.**
* http://blog.trifork.com/2014/05/20/advanced-kibana-dashboard/ - using Kibana dashboard templates or scripted dashboards; we want to be able to customize dashboards and prepare dashboards to be used by others (see the sketch below).
* http://www.algolia.com/
  * http://blog.algolia.com/full-text-search-in-your-database-algolia-versus-elasticsearch/
  * multiple ranking considerations, mixed together.
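As a concrete illustration of the kind of query a Kibana-style dashboard panel issues against Elasticsearch, here is a minimal sketch of a date-histogram aggregation over log events, which is how a "trend" chart gets its buckets. The index pattern, field names, and host are placeholders, and the request targets the classic `_search` aggregation API.

```python
# Minimal sketch: the kind of aggregation a dashboard chart panel runs to plot
# error counts per hour. Index pattern, field names, and host are placeholders.
import json
import requests

ES = "http://localhost:9200"          # hypothetical Elasticsearch endpoint
query = {
    "size": 0,                        # only the aggregation buckets are needed
    "query": {"match": {"level": "ERROR"}},
    "aggs": {
        "errors_per_hour": {
            "date_histogram": {"field": "@timestamp", "interval": "1h"}
        }
    },
}

resp = requests.post(f"{ES}/logstash-*/_search", data=json.dumps(query),
                     headers={"Content-Type": "application/json"})
for bucket in resp.json()["aggregations"]["errors_per_hour"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])
```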